Skip to content

feat(cli): import existing PageIndex Cloud indices via add --from-pageindex-cloud (closes #88)#97

Merged
KylinMountain merged 12 commits into
mainfrom
feat/import-pageindex-cloud
Jun 25, 2026
Merged

feat(cli): import existing PageIndex Cloud indices via add --from-pageindex-cloud (closes #88)#97
KylinMountain merged 12 commits into
mainfrom
feat/import-pageindex-cloud

Conversation

@KylinMountain

@KylinMountain KylinMountain commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

Implements #88 — import a document that is already indexed in PageIndex Cloud into a local OpenKB knowledge base, with no local PDF and no re-processing:

openkb add --from-pageindex-cloud <DOC_ID>
  • Fetches structure + OCR'd page content from PageIndex Cloud by doc_id (get_document / get_page_content), bypassing the local convert → raw → page-count → col.add pipeline entirely.
  • Writes the same wiki artifacts as the local long-doc path (shared _write_long_doc_artifacts) and compiles concepts via compile_long_doc.
  • Registers a raw-less registry entry typed pageindex_cloud with a synthetic identity key pageindex-cloud:<doc_id> (sha256-keyed; re-import of the same doc-id is idempotent/skipped).
  • openkb add runs under the exclusive KB lock, so the cloud-import path is serialized like every other ingest.
  • openkb remove on an imported doc cleans up local artifacts only — the user's cloud corpus is never touched (the existing long_pdf gate already excludes the new type; proven by a regression test).
  • Dependency pageindex switched to track git+https://github.com/VectifyAI/PageIndex.git@dev for the cloud API surface (lock pins a concrete commit). Follow-up: re-pin to an exact tag once a new PageIndex release is published.

Notes

Test Plan

Automated (full suite green — 742 passed on the rebased branch):

  • uv run --extra dev pytest -q
  • Targeted: uv run --extra dev pytest tests/test_converter.py tests/test_indexer.py tests/test_add_command.py tests/test_remove.py -v

Manual (requires PAGEINDEX_API_KEY and a real cloud doc-id):

  • export PAGEINDEX_API_KEY=... then openkb add --from-pageindex-cloud <DOC_ID> → wiki/summaries + wiki/sources written, concepts compiled, doc appears in openkb list.
  • Re-run the same command → [SKIP] Already imported (idempotent).
  • openkb add foo.pdf --from-pageindex-cloud X → errors "not both"; openkb add with neither → "Provide a PATH…".
  • Unset PAGEINDEX_API_KEY then import → clear error, nothing written.
  • openkb remove <doc> on the imported doc → local artifacts removed, cloud doc still present in PageIndex.

@KylinMountain KylinMountain force-pushed the feat/import-pageindex-cloud branch from f2d427e to 127e75e Compare June 14, 2026 11:23
@KylinMountain KylinMountain changed the base branch from fix/doc-name-collision to main June 14, 2026 11:23
Resolve conflicts:
- cli.py: keep top-level compile_long_doc import (PR) + main's expanded
  openkb.config import (extra_headers/timeout/litellm resolvers).
- indexer.py: keep the PR's _write_long_doc_artifacts helper but thread
  main's new description= through it (used by both local + cloud paths).
- pyproject.toml / uv.lock: take main's pinned deps (incl. pageindex==0.3.0.dev1,
  portalocker) over the PR's loose/git@dev pins.
…DEX_API_KEY

These tests drive the LOCAL indexing path with a fake PDF. When a developer has
PAGEINDEX_API_KEY set, index_long_document takes the cloud branch and
_get_pdf_page_count tries to open the fake PDF and raises (FileDataError). Add a
class-scoped autouse fixture that unsets the key by default; the cloud-path tests
in the class re-enable it via monkeypatch.setenv.
Comment thread tests/test_remove.py
def test_remove_cloud_doc_never_touches_pageindex(tmp_path):
"""A pageindex_cloud doc removes only local artifacts; the cloud is
never contacted even when a pageindex.db happens to exist."""
import json
…; recompile/list recognize pageindex_cloud

Code-review fixes on the cloud-import path:

- indexer: import_cloud_document requested get_page_content(doc_id, "1-<N>") with
  N=_max_page_index(structure, default=100000). PageIndex's parse_pages caps a
  single range at 1000 pages, so import HARD-FAILED for any cloud tree without
  integer page indices (default 100000) or any doc >1000 pages. get_page_content
  fetches the whole doc and uses the range only as a client-side filter, so fetch
  fixed 1000-page windows via _fetch_cloud_pages and concatenate. Each window
  over-covers ("1-1000"), which also fixes last-page truncation when the tree's
  max index is 0-based/short (verified live: attention paper imports 15 pages,
  was 13).
- cli: recompile misclassified pageindex_cloud docs as SHORT (== "long_pdf"
  checks), routing them to the short branch that looks for a .md source cloud
  imports never write (.json). Add _is_long_doc({long_pdf, pageindex_cloud}) and
  use it in recompile's _classify + dispatch. remove's local-PageIndex cleanup
  stays long_pdf-only by design. Add pageindex_cloud to _TYPE_DISPLAY_MAP so
  `openkb list` shows "pageindex", not the raw type string.
- pyproject: drop now-dead [tool.hatch.metadata] allow-direct-references (the dep
  is a pinned version; no direct git/url references remain).
- tests: windowing (>1000 + no-index), cloud recompile-as-long, predicate/display.
… on failed import

- import_from_pageindex_cloud writes the summary + JSON source (in
  import_cloud_document) and compiles concept/entity pages BEFORE adding the
  registry entry, so a compile failure stranded wiki artifacts that `openkb
  remove` (registry-only resolution) could never reach. On failure, best-effort
  clean them via _cleanup_failed_cloud_import — reusing remove's by-doc_name wiki
  helpers (summary / source.json / concept + entity pages / index), touching
  neither the registry (no entry was added, so a retry stays clean) nor PageIndex
  (the cloud doc is the user's asset).
- add: the cloud-import outcome was discarded, so `openkb add
  --from-pageindex-cloud` exited 0 even on failure. ctx.exit(1) on "failed" so
  scripted/CI callers can detect it.
- tests: compile-failure orphan cleanup + the failure/success exit codes.
…e tree's max index

_fetch_cloud_pages bounded its loop by the tree's max page index
(_max_page_index). When the cloud tree under-reports the page count — a real
case (e.g. a paper whose tree stops a couple pages short of the references) — the
loop exited before fetching a later window and silently DROPPED pages. At the
exact 1000 boundary this was an off-by-one (a 1001-page doc whose tree maxes at
1000 lost page 1001); more generally any >1000-page doc with an under-reporting
tree was truncated.

Drop the max-index bound: request fixed 1000-page windows and stop as soon as a
window comes back SHORT (PageIndex page numbers are sequential, so a short window
means we've passed the last page). The common ≤1000-page doc stays a single
request, a larger doc fetches every page, and an under-reported tree no longer
truncates. Remove the now-unused _max_page_index. Tests updated + a
full-window-triggers-next-fetch regression.
@KylinMountain KylinMountain merged commit ab14109 into main Jun 25, 2026
1 check passed
@KylinMountain KylinMountain deleted the feat/import-pageindex-cloud branch June 25, 2026 23:08
gwokhou added a commit to gwokhou/OpenKB that referenced this pull request Jun 26, 2026
Resolve conflicts between the concurrent-ingestion/recovery branch and
upstream's PageIndex Cloud import feature (VectifyAI#97) + config/compiler/skill
changes.

cli.py:
- imports: keep HEAD's staged-converter helpers (_convert_document_locked,
  _sanitize_stem, resolve_doc_name); add upstream's import_cloud_document.
- add(): keep HEAD's removal of @_with_kb_lock (the concurrent directory
  batch must release the mutation lock between per-file commits) and adopt
  upstream's from_pageindex_cloud param. Wrap the cloud-import call in
  _kb_mutation_lock so it matches every other add mutation path — the
  auto-merge otherwise left cloud import unlocked after the decorator was
  dropped, dropping both mutual exclusion and journal draining on acquire.
- _add_single_file_locked: remove dead local imports (body now delegates
  to _prepare_add_file / _commit_prepared_add).

test_add_command.py: both sides added disjoint tests at the same site;
keep all of them (HEAD recovery/concurrency tests + upstream cloud tests).

Full suite: 910 passed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant