…rectory ingest
Stage 4 of the parallel-add roadmap (VectifyAI#151). `openkb add <dir>` now routes
through a worker-safe prepare / serial commit split: prepare converts into
private .openkb/staging/prepare output without needing the KB mutation lock and
without touching official raw/, wiki/, or .openkb/ state; the serial owner
commits under kb_ingest_lock, resolving the final name and publishing. `--jobs`
(Stage 5) is not included.
Batch coordinator (openkb/cli.py):
- add_directory_serial runs a serial prepare -> commit loop. `add` already holds
kb_ingest_lock for its whole body via @_with_kb_lock, so prepare runs under
that one outer lock; the reaper (first lock acquisition, before this batch's
staging exists) cannot collide with a live prepare. Per-file failure continues
the batch; DirtyRollbackError stops it.
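The failure policy described above can be sketched as follows. This is a minimal illustration, not the code in openkb/cli.py: `add_directory_serial_sketch`, its `prepare` / `commit` callables, and the local `DirtyRollbackError` class are stand-ins invented for the example, and the real coordinator's signature and return value are not shown in this description.

```python
from pathlib import Path
from typing import Callable, Iterable


class DirtyRollbackError(Exception):
    """Stand-in for the project's error of the same name (definition assumed)."""


def add_directory_serial_sketch(
    files: Iterable[Path],
    prepare: Callable[[Path], object],
    commit: Callable[[object], None],
) -> dict[str, list[Path]]:
    """Prepare then commit each file in order, isolating per-file failures."""
    outcome: dict[str, list[Path]] = {"added": [], "failed": []}
    for path in files:
        try:
            prepared = prepare(path)  # prepare phase: private staging only
            commit(prepared)          # commit phase: serial owner, under the lock
        except DirtyRollbackError:
            # This error stops the batch, so let it propagate to the caller.
            raise
        except Exception:
            # Any other failure is recorded and the batch moves on.
            outcome["failed"].append(path)
            continue
        outcome["added"].append(path)
    return outcome
```

Catching `DirtyRollbackError` before the generic handler is what keeps the "continue" rule from swallowing the one error that must stop the batch.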
Prepare (openkb/add_prepare.py, openkb/converter.py):
- convert_document_for_prepare: lock-free conversion into private staging under
a placeholder doc_name (sanitized stem); returns ConvertResult without
registering the hash or resolving the final name.
- prepare_document owns the staging-dir lifecycle (rmtree on interrupt).
prepare/commit are coordinator-internal, called only by add_directory_serial.
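A rough sketch of the staging-dir lifecycle, under stated assumptions: `placeholder_name`, `prepare_document_sketch`, and the `convert` callable are hypothetical names, the sanitizing rule is a guess at what "sanitized stem" might mean, and the use of `tempfile.mkdtemp` for a unique private directory is an illustration rather than what openkb/add_prepare.py necessarily does.

```python
import re
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def placeholder_name(source: Path) -> str:
    """Hypothetical sanitizer: collapse runs of unsafe characters in the stem."""
    cleaned = re.sub(r"[^\w.-]+", "_", source.stem).strip("._")
    return cleaned or "document"


def prepare_document_sketch(
    source: Path,
    staging_root: Path,
    convert: Callable[[Path, Path], None],
) -> Path:
    """Convert `source` into a fresh private staging dir; remove it if interrupted."""
    staging_root.mkdir(parents=True, exist_ok=True)
    staging_dir = Path(
        tempfile.mkdtemp(prefix=placeholder_name(source) + "-", dir=staging_root)
    )
    try:
        convert(source, staging_dir)
    except BaseException:
        # BaseException also covers KeyboardInterrupt, so an interrupted
        # conversion leaves no partial staging behind.
        shutil.rmtree(staging_dir, ignore_errors=True)
        raise
    return staging_dir
```

Nothing here reads or writes the registry or the official directories, which is the property that lets this phase run without the lock.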
Serial commit (openkb/cli.py):
- commit_prepared_document requires kb_ingest_lock held (reentrant acquire).
- The prepared branch of _add_single_file_locked re-validates under the lock
because prepare ran without it: re-decides skip from live registry state,
re-hashes the source, and re-converts when the source changed or prepare had
short-circuited with no artifacts (the stale-prepare contract).
- _retarget_prepared_document_artifacts renames staged raw/source/images from
the placeholder name to the owner-resolved final name.
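The stale-prepare contract can be pictured as a small decision function run under the lock. The sketch below is an assumption-heavy illustration: `PreparedSketch`, `file_hash`, `revalidate`, the SHA-256 choice, the set-of-hashes registry, and the order of the checks are all invented for the example and are not taken from `_add_single_file_locked`.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PreparedSketch:
    """Hypothetical stand-in for what prepare hands to commit."""
    source: Path
    source_hash: str              # hash observed at prepare time
    staging_dir: Optional[Path]   # None when prepare short-circuited


def file_hash(path: Path) -> str:
    """Hash the current bytes of the source (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def revalidate(prepared: PreparedSketch, registered_hashes: set[str]) -> str:
    """Return "skip", "reconvert", or "commit" based on live state."""
    live_hash = file_hash(prepared.source)    # re-hash the source
    if live_hash in registered_hashes:        # re-decide skip from live registry
        return "skip"
    if live_hash != prepared.source_hash:     # source changed since prepare
        return "reconvert"
    if prepared.staging_dir is None:          # prepare produced no artifacts
        return "reconvert"
    return "commit"
```

The point of the sketch is that nothing decided during prepare is trusted at commit time; each decision is recomputed from what is on disk and in the registry once the lock is held.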
Reaper (openkb/locks.py):
- _reap_prepare_staging reclaims orphaned prepare staging at first exclusive
acquisition; skips symlinks, unlinks stray files, and logs INFO on success /
WARNING on failure. No per-prepare marker is needed: directory add holds the
lock across its whole batch, so a live batch's staging is never visible to
another reaper.
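As an illustration of the reaping rules listed above, here is a minimal sketch. The function name, logger name, return value, and the exact log wording are hypothetical; only the behavior (skip symlinks, unlink stray files, remove orphaned directories, INFO on success, WARNING on failure) comes from the description.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("openkb.locks.sketch")  # hypothetical logger name


def reap_prepare_staging_sketch(staging_root: Path) -> list[str]:
    """Remove leftovers under the prepare staging root; return reclaimed names."""
    reclaimed: list[str] = []
    if not staging_root.is_dir():
        return reclaimed
    for entry in sorted(staging_root.iterdir()):
        if entry.is_symlink():
            continue                  # never follow or delete symlinks
        try:
            if entry.is_dir():
                shutil.rmtree(entry)  # orphaned staging directory
            else:
                entry.unlink()        # stray file
        except OSError as exc:
            log.warning("could not reap %s: %s", entry, exc)
        else:
            reclaimed.append(entry.name)
            log.info("reaped orphaned prepare staging %s", entry)
    return reclaimed
```

Checking `is_symlink()` before `is_dir()` matters because `is_dir()` follows links, so a link to a directory outside staging would otherwise be treated as an orphan.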
Tests: prepare writes only private staging and takes no lock; commit resolves the
final name under the owner and requires its lock; the reaper reaps orphans and
skips symlinks; stale skip and source-changed (TOCTOU) re-prepare at commit;
`add <dir>` end-to-end lands every file via prepare/commit; a prepare failure
isolates the file while the batch continues.
Background
#142 made serial `openkb add` crash-safe (journals, touched-path snapshots, staged publish, rollback, recovery drain), and #156 gave those mechanics an explicit owner, `AddMutationPlan` / `run_add_mutation`, so one serial path is responsible for every mutation of official KB state, while anything else may only write disposable staging.

That split is what makes parallel ingest safe to build. But there is no real "prepare" step yet: `add` still converts and commits in a single pass under the lock. Before `--jobs` can run conversions in parallel, prepare has to become a lock-free phase of its own, one that writes only private staging and never touches the registry or official `raw/` / `wiki/`. The serial owner then commits under the lock, resolving the final name and publishing, just as it does today.

This PR adds that prepare/commit split for `openkb add <dir>`. Ingestion stays serial for now (prepare still runs one file at a time under the single held `kb_ingest_lock`), but it no longer needs the lock or touches official state, which is exactly the property Stage 5 will parallelize.

This is the Stage 4 slice from the parallel architecture roadmap: #151.
Summary
No user-visible behavior change: `openkb add <dir>` produces identical results. This only factors conversion into a lock-free prepare + serial commit so Stage 5 can parallelize prepare.
- `convert_document_for_prepare`: lock-free conversion into private `.openkb/staging/prepare/` under a placeholder `doc_name` (sanitized stem); returns `ConvertResult` without registering the hash or resolving the final name
- `prepare_document` owning the staging-dir lifecycle (`rmtree` on interrupt); prepare/commit are coordinator-internal, called only by `add_directory_serial`
- `add_directory_serial`: a serial prepare → commit loop for `openkb add <dir>`, run under the `add` command's existing `@_with_kb_lock`
- `commit_prepared_document`: requires `kb_ingest_lock` held (reentrant acquire), then commits via the prepared branch of `_add_single_file_locked`
- `_reap_prepare_staging`: reclaim orphaned prepare staging at first exclusive-lock acquisition; skip symlinks, unlink stray files, no per-prepare marker needed

Scope
In scope:
- … `openkb add <dir>`
- … `add <dir>`, and per-file failure isolation

Out of scope:
- `openkb add <dir> --jobs N` (Stage 5)
- … `convert_document` directly under the coordinator)

Why this matters
After this PR, the prepare phase is a self-contained, lock-free, worker-safe unit that writes only disposable staging. Stage 5 can then add `--jobs` by running prepares in parallel and feeding `PreparedDocument`s into the unchanged serial commit owner: final-name resolution, registry writes, commit signals, rollback, and cleanup all stay under the single owner established in #156.

Verification
- `UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_add_prepare.py tests/test_add_command.py tests/test_converter.py tests/test_locks.py -q`
- `UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev mypy openkb`
- `git diff --check main...HEAD`