feat(cli): import existing PageIndex Cloud indices via `add --from-pageindex-cloud` (closes #88) by KylinMountain · Pull Request #97 · VectifyAI/OpenKB

KylinMountain · 2026-06-13T04:34:09Z

Summary

Implements #88 — import a document that is already indexed in PageIndex Cloud into a local OpenKB knowledge base, with no local PDF and no re-processing:

openkb add --from-pageindex-cloud <DOC_ID>

Fetches structure + OCR'd page content from PageIndex Cloud by doc_id (get_document / get_page_content), bypassing the local convert → raw → page-count → col.add pipeline entirely.
Writes the same wiki artifacts as the local long-doc path (shared _write_long_doc_artifacts) and compiles concepts via compile_long_doc.
Registers a raw-less registry entry typed pageindex_cloud with a synthetic identity key pageindex-cloud:<doc_id> (sha256-keyed; re-import of the same doc-id is idempotent/skipped).
openkb add runs under the exclusive KB lock, so the cloud-import path is serialized like every other ingest.
openkb remove on an imported doc cleans up local artifacts only — the user's cloud corpus is never touched (the existing long_pdf gate already excludes the new type; proven by a regression test).
Dependency pageindex switched to track git+https://github.com/VectifyAI/PageIndex.git@dev for the cloud API surface (lock pins a concrete commit). Follow-up: re-pin to an exact tag once a new PageIndex release is published.

Notes

Builds on the collision-resistant doc_name groundwork (resolve_doc_name, registry path identity), now in main. Rebased onto main after fix: collision-resistant doc_name — same-stem documents no longer overwrite each other #96 merged; base is main.
Known limitation: cloud OCR returns text only, so imported docs have no per-page images (images: []). Concept compilation reads the summary, so output quality is unaffected.

Test Plan

Automated (full suite green — 742 passed on the rebased branch):

uv run --extra dev pytest -q
Targeted: uv run --extra dev pytest tests/test_converter.py tests/test_indexer.py tests/test_add_command.py tests/test_remove.py -v

Manual (requires PAGEINDEX_API_KEY and a real cloud doc-id):

export PAGEINDEX_API_KEY=... then openkb add --from-pageindex-cloud <DOC_ID> → wiki/summaries + wiki/sources written, concepts compiled, doc appears in openkb list.
Re-run the same command → [SKIP] Already imported (idempotent).
openkb add foo.pdf --from-pageindex-cloud X → errors "not both"; openkb add with neither → "Provide a PATH…".
Unset PAGEINDEX_API_KEY then import → clear error, nothing written.
openkb remove <doc> on the imported doc → local artifacts removed, cloud doc still present in PageIndex.

…keys

Resolve conflicts: - cli.py: keep top-level compile_long_doc import (PR) + main's expanded openkb.config import (extra_headers/timeout/litellm resolvers). - indexer.py: keep the PR's _write_long_doc_artifacts helper but thread main's new description= through it (used by both local + cloud paths). - pyproject.toml / uv.lock: take main's pinned deps (incl. pageindex==0.3.0.dev1, portalocker) over the PR's loose/git@dev pins.

…DEX_API_KEY These tests drive the LOCAL indexing path with a fake PDF. When a developer has PAGEINDEX_API_KEY set, index_long_document takes the cloud branch and _get_pdf_page_count tries to open the fake PDF and raises (FileDataError). Add a class-scoped autouse fixture that unsets the key by default; the cloud-path tests in the class re-enable it via monkeypatch.setenv.

+def test_remove_cloud_doc_never_touches_pageindex(tmp_path):
+    """A pageindex_cloud doc removes only local artifacts; the cloud is
+    never contacted even when a pageindex.db happens to exist."""
+    import json


…; recompile/list recognize pageindex_cloud Code-review fixes on the cloud-import path: - indexer: import_cloud_document requested get_page_content(doc_id, "1-<N>") with N=_max_page_index(structure, default=100000). PageIndex's parse_pages caps a single range at 1000 pages, so import HARD-FAILED for any cloud tree without integer page indices (default 100000) or any doc >1000 pages. get_page_content fetches the whole doc and uses the range only as a client-side filter, so fetch fixed 1000-page windows via _fetch_cloud_pages and concatenate. Each window over-covers ("1-1000"), which also fixes last-page truncation when the tree's max index is 0-based/short (verified live: attention paper imports 15 pages, was 13). - cli: recompile misclassified pageindex_cloud docs as SHORT (== "long_pdf" checks), routing them to the short branch that looks for a .md source cloud imports never write (.json). Add _is_long_doc({long_pdf, pageindex_cloud}) and use it in recompile's _classify + dispatch. remove's local-PageIndex cleanup stays long_pdf-only by design. Add pageindex_cloud to _TYPE_DISPLAY_MAP so `openkb list` shows "pageindex", not the raw type string. - pyproject: drop now-dead [tool.hatch.metadata] allow-direct-references (the dep is a pinned version; no direct git/url references remain). - tests: windowing (>1000 + no-index), cloud recompile-as-long, predicate/display.

… on failed import - import_from_pageindex_cloud writes the summary + JSON source (in import_cloud_document) and compiles concept/entity pages BEFORE adding the registry entry, so a compile failure stranded wiki artifacts that `openkb remove` (registry-only resolution) could never reach. On failure, best-effort clean them via _cleanup_failed_cloud_import — reusing remove's by-doc_name wiki helpers (summary / source.json / concept + entity pages / index), touching neither the registry (no entry was added, so a retry stays clean) nor PageIndex (the cloud doc is the user's asset). - add: the cloud-import outcome was discarded, so `openkb add --from-pageindex-cloud` exited 0 even on failure. ctx.exit(1) on "failed" so scripted/CI callers can detect it. - tests: compile-failure orphan cleanup + the failure/success exit codes.

…e tree's max index _fetch_cloud_pages bounded its loop by the tree's max page index (_max_page_index). When the cloud tree under-reports the page count — a real case (e.g. a paper whose tree stops a couple pages short of the references) — the loop exited before fetching a later window and silently DROPPED pages. At the exact 1000 boundary this was an off-by-one (a 1001-page doc whose tree maxes at 1000 lost page 1001); more generally any >1000-page doc with an under-reporting tree was truncated. Drop the max-index bound: request fixed 1000-page windows and stop as soon as a window comes back SHORT (PageIndex page numbers are sequential, so a short window means we've passed the last page). The common ≤1000-page doc stays a single request, a larger doc fetches every page, and an under-reported tree no longer truncates. Remove the now-unused _max_page_index. Tests updated + a full-window-triggers-next-fetch regression.

Resolve conflicts between the concurrent-ingestion/recovery branch and upstream's PageIndex Cloud import feature (VectifyAI#97) + config/compiler/skill changes. cli.py: - imports: keep HEAD's staged-converter helpers (_convert_document_locked, _sanitize_stem, resolve_doc_name); add upstream's import_cloud_document. - add(): keep HEAD's removal of @_with_kb_lock (the concurrent directory batch must release the mutation lock between per-file commits) and adopt upstream's from_pageindex_cloud param. Wrap the cloud-import call in _kb_mutation_lock so it matches every other add mutation path — the auto-merge otherwise left cloud import unlocked after the decorator was dropped, dropping both mutual exclusion and journal draining on acquire. - _add_single_file_locked: remove dead local imports (body now delegates to _prepare_add_file / _commit_prepared_add). test_add_command.py: both sides added disjoint tests at the same site; keep all of them (HEAD recovery/concurrency tests + upstream cloud tests). Full suite: 910 passed.

KylinMountain added 7 commits June 14, 2026 19:21

refactor(converter): extract resolve_doc_name_from_key for synthetic …

e999ea5

…keys

refactor(indexer): extract _write_long_doc_artifacts shared writer

39722bb

feat(indexer): import existing PageIndex Cloud doc by doc_id

fc8db5e

feat(cli): orchestrate raw-less PageIndex Cloud import

188badd

feat(cli): add --from-pageindex-cloud flag to add command

cc62097

test(remove): cloud-imported docs never contact PageIndex

34e7ec0

chore(deps): track VectifyAI/PageIndex@dev for cloud import APIs

127e75e

KylinMountain force-pushed the feat/import-pageindex-cloud branch from f2d427e to 127e75e Compare June 14, 2026 11:23

KylinMountain changed the base branch from fix/doc-name-collision to main June 14, 2026 11:23

gwokhou mentioned this pull request Jun 18, 2026

feat: concurrent add pipeline with crash-safe mutation recovery #104

Closed

KylinMountain added 2 commits June 25, 2026 19:10

github-code-quality Bot found potential problems Jun 25, 2026

View reviewed changes

Comment thread tests/test_remove.py

def test_remove_cloud_doc_never_touches_pageindex(tmp_path):

"""A pageindex_cloud doc removes only local artifacts; the cloud is

never contacted even when a pageindex.db happens to exist."""

import json

KylinMountain added 3 commits June 25, 2026 20:17

KylinMountain merged commit ab14109 into main Jun 25, 2026
1 check passed

KylinMountain deleted the feat/import-pageindex-cloud branch June 25, 2026 23:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(cli): import existing PageIndex Cloud indices via `add --from-pageindex-cloud` (closes #88)#97

feat(cli): import existing PageIndex Cloud indices via `add --from-pageindex-cloud` (closes #88)#97
KylinMountain merged 12 commits into
mainfrom
feat/import-pageindex-cloud

KylinMountain commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

KylinMountain commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Test Plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

KylinMountain commented Jun 13, 2026 •

edited

Loading