Skip to content

refactor(mutation): split lock-free prepare from serial commit for directory ingest#172

Open
gwokhou wants to merge 2 commits into
VectifyAI:mainfrom
gwokhou:pr/bounded-parallel-directory-add
Open

refactor(mutation): split lock-free prepare from serial commit for directory ingest#172
gwokhou wants to merge 2 commits into
VectifyAI:mainfrom
gwokhou:pr/bounded-parallel-directory-add

Conversation

@gwokhou

@gwokhou gwokhou commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Background

#142 made serial openkb add crash-safe (journals, touched-path snapshots, staged publish, rollback, recovery drain), and #156 gave those mechanics an explicit owner — AddMutationPlan / run_add_mutation — so one serial path is responsible for every mutation of official KB state, while anything else may only write disposable staging.

That split is what makes parallel ingest safe to build. But there is no real "prepare" step yet: add still converts and commits in a single pass under the lock. Before --jobs can run conversions in parallel, prepare has to become a lock-free phase of its own — one that writes only private staging and never touches the registry or official raw//wiki/. The serial owner then commits under the lock, resolving the final name and publishing, just as it does today.

This PR adds that prepare/commit split for openkb add <dir>. Ingestion stays serial for now — prepare still runs one file at a time under the single held kb_ingest_lock — but it no longer needs the lock or touches official state, which is exactly the property Stage 5 will parallelize.

This is the Stage 4 slice from the parallel architecture roadmap: #151.

Summary

No user-visible behavior change: openkb add <dir> produces identical results — this only factors conversion into a lock-free prepare + serial commit so Stage 5 can parallelize prepare.

  • add convert_document_for_prepare: lock-free conversion into private .openkb/staging/prepare/ under a placeholder doc_name (sanitized stem); returns ConvertResult without registering the hash or resolving the final name
  • add prepare_document owning the staging-dir lifecycle (rmtree on interrupt); prepare/commit are coordinator-internal, called only by add_directory_serial
  • add add_directory_serial: a serial prepare → commit loop for openkb add <dir>, run under the add command's existing @_with_kb_lock
  • add commit_prepared_document: requires kb_ingest_lock held (reentrant acquire), then commits via the prepared branch of _add_single_file_locked
  • re-validate under the lock at commit because prepare ran without it: re-decide skip from live registry state, re-hash the source, and re-convert when the source changed or prepare short-circuited with no artifacts (the stale-prepare contract)
  • retarget staged raw/source/images from the placeholder name to the owner-resolved final name before publish
  • add _reap_prepare_staging: reclaim orphaned prepare staging at first exclusive-lock acquisition; skip symlinks, unlink stray files, no per-prepare marker needed

Scope

In scope:

  • existing openkb add <dir>
  • lock-free prepare into private staging
  • serial commit under the existing coordinator owner
  • prepare-staging reaper
  • tests for prepare (private staging, no lock), commit (final-name resolution, lock required), reaper (orphans + symlinks), stale-prepare and source-changed TOCTOU re-prepare, end-to-end add <dir>, and per-file failure isolation

Out of scope:

  • openkb add <dir> --jobs N (Stage 5)
  • worker pools / parallel prepare
  • deterministic parallel commit queues
  • routing single-file add, URL add, or PageIndex Cloud import through the prepare/commit split (they keep using convert_document directly under the coordinator)

Why this matters

After this PR, the prepare phase is a self-contained, lock-free, worker-safe unit that writes only disposable staging. Stage 5 can then add --jobs by running prepares in parallel and feeding PreparedDocuments into the unchanged serial commit owner — final-name resolution, registry writes, commit signals, rollback, and cleanup all stay under the single owner established in #156.

Verification

  • UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_add_prepare.py tests/test_add_command.py tests/test_converter.py tests/test_locks.py -q
  • UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev mypy openkb
  • git diff --check main...HEAD

@gwokhou gwokhou force-pushed the pr/bounded-parallel-directory-add branch from 12c933e to efa80bc Compare July 4, 2026 04:25
@gwokhou gwokhou changed the title feat(mutation): split lock-free prepare from serial commit for directory ingest refactor(mutation): split lock-free prepare from serial commit for directory ingest Jul 4, 2026
…rectory ingest

Stage 4 of the parallel-add roadmap (VectifyAI#151). `openkb add <dir>` now routes
through a worker-safe prepare / serial commit split: prepare converts into
private .openkb/staging/prepare output without the KB mutation lock and without
touching official raw/, wiki/, or .openkb/ state; the serial owner commits under
kb_ingest_lock, resolving the final name and publishing. `--jobs` (Stage 5) is
not included.

Batch coordinator (openkb/cli.py):
- add_directory_serial runs a serial prepare -> commit loop. `add` already holds
  kb_ingest_lock for its whole body via @_with_kb_lock, so prepare runs under
  that one outer lock; the reaper (first lock acquisition, before this batch's
  staging exists) cannot collide with a live prepare. Per-file failure continues
  the batch; DirtyRollbackError stops it.

Prepare (openkb/add_prepare.py, openkb/converter.py):
- convert_document_for_prepare: lock-free conversion into private staging under
  a placeholder doc_name (sanitized stem); returns ConvertResult without
  registering the hash or resolving the final name.
- prepare_document owns the staging-dir lifecycle (rmtree on interrupt).
  prepare/commit are coordinator-internal, called only by add_directory_serial.

Serial commit (openkb/cli.py):
- commit_prepared_document requires kb_ingest_lock held (reentrant acquire).
- The prepared branch of _add_single_file_locked re-validates under the lock
  because prepare ran without it: re-decides skip from live registry state,
  re-hashes the source, and re-converts when the source changed or prepare had
  short-circuited with no artifacts (the stale-prepare contract).
- _retarget_prepared_document_artifacts renames staged raw/source/images from
  the placeholder name to the owner-resolved final name.

Reaper (openkb/locks.py):
- _reap_prepare_staging reclaims orphaned prepare staging at first exclusive
  acquisition; skips symlinks, unlinks stray files, and logs INFO on success /
  WARNING on failure. No per-prepare marker is needed: directory add holds the
  lock across its whole batch, so a live batch's staging is never visible to
  another reaper.

Tests: prepare writes only private staging and takes no lock; commit resolves the
final name under the owner and requires its lock; the reaper reaps orphans and
skips symlinks; stale skip and source-changed (TOCTOU) re-prepare at commit;
`add <dir>` end-to-end lands every file via prepare/commit; a prepare failure
isolates the file while the batch continues.
@gwokhou gwokhou force-pushed the pr/bounded-parallel-directory-add branch from efa80bc to ee22914 Compare July 4, 2026 04:52
Three cross-platform defects in the lock-free prepare path (ee22914),
each with a POSIX-simulated regression test (CI is Linux-only):

- reaper stalled the KB on a locked loose file: the file branch's
  unlink(missing_ok=True) only swallows FileNotFoundError, so a
  PermissionError (AV/indexer on Windows) escaped kb_lock and crashed
  every exclusive-lock command. Now wrapped in try/except OSError.
- read-only staging never self-healed: copy2 preserves a read-only
  source's bit, and rmtree(ignore_errors=True) can't delete read-only
  entries on Windows, so the orphan re-logged 'Could not fully reap' on
  every lock acquisition. rmtree now passes an onerror that clears the
  read-only bit and retries.
- retarget rewrote line endings: write_text without newline= translated
  \n to \r\n on Windows, so a collision-renamed source became CRLF
  while all others stayed LF. Switched to atomic_write_text (LF-preserving).
@gwokhou

gwokhou commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

6969833 — Windows hardening for the lock-free prepare path

Follow-up to ee22914. That commit introduced the prepare/commit split (_reap_prepare_staging, _retarget_prepared_document_artifacts) with three latent Windows-only defects. Landed as a separate commit rather than force-pushing an amend into the already-pushed ee22914.

1. Reaper no longer stalls the whole KB on a locked loose file.
_reap_prepare_staging's file branch used unlink(missing_ok=True), which only swallows FileNotFoundError. On Windows, an AV/indexer holding a loose file in staging/prepare/ makes unlink raise PermissionError, which escaped kb_lock and crashed every exclusive-lock command (add/remove/recompile/chat) until the handle released. The directory branch was already safe (rmtree(ignore_errors=True)); the file branch is now wrapped in try/except OSError to match. (Trigger is narrow — staging normally holds dirs, not loose files — but the consequence was a full KB stall.)

2. Read-only staging now self-heals. shutil.copy2 preserves a read-only source's attribute into staging; on Windows rmtree(ignore_errors=True) can't delete read-only entries, so an orphaned staging dir re-logged "Could not fully reap…" on every lock acquisition and never cleared. rmtree now passes an onerror that clears the read-only bit and retries. POSIX is unaffected (a read-only bit never blocks unlink there).

3. Retarget preserves LF. _retarget_prepared_document_artifacts used Path.write_text without newline=, so on Windows \n\r\n and a collision-renamed source became CRLF while every other source (written via atomic_write_text) stayed LF. Switched to atomic_write_text.

Testing: CI is Linux-only, so each fix has a POSIX-simulated regression test that forces the Windows condition — monkeypatch Path.unlink/os.unlink to raise PermissionError, and Path.write_text to do \n\r\n. All three were verified RED (reproducing the exact bug) before the fix.

Out of scope (follow-up): long file names still silently skip on Windows — _sanitize_stem doesn't truncate, and ee22914's deeper staging/prepare/<idx>-<stem>-<uuid>/ path pushes borderline names over MAX_PATH (260). The root cause predates this PR, so it's left for a separate change with a proper truncation policy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant