feat(cli): import existing PageIndex Cloud indices via add --from-pageindex-cloud (closes #88)#97
Merged
Merged
Conversation
f2d427e to
127e75e
Compare
Resolve conflicts: - cli.py: keep top-level compile_long_doc import (PR) + main's expanded openkb.config import (extra_headers/timeout/litellm resolvers). - indexer.py: keep the PR's _write_long_doc_artifacts helper but thread main's new description= through it (used by both local + cloud paths). - pyproject.toml / uv.lock: take main's pinned deps (incl. pageindex==0.3.0.dev1, portalocker) over the PR's loose/git@dev pins.
…DEX_API_KEY These tests drive the LOCAL indexing path with a fake PDF. When a developer has PAGEINDEX_API_KEY set, index_long_document takes the cloud branch and _get_pdf_page_count tries to open the fake PDF and raises (FileDataError). Add a class-scoped autouse fixture that unsets the key by default; the cloud-path tests in the class re-enable it via monkeypatch.setenv.
| def test_remove_cloud_doc_never_touches_pageindex(tmp_path): | ||
| """A pageindex_cloud doc removes only local artifacts; the cloud is | ||
| never contacted even when a pageindex.db happens to exist.""" | ||
| import json |
…; recompile/list recognize pageindex_cloud
Code-review fixes on the cloud-import path:
- indexer: import_cloud_document requested get_page_content(doc_id, "1-<N>") with
N=_max_page_index(structure, default=100000). PageIndex's parse_pages caps a
single range at 1000 pages, so import HARD-FAILED for any cloud tree without
integer page indices (default 100000) or any doc >1000 pages. get_page_content
fetches the whole doc and uses the range only as a client-side filter, so fetch
fixed 1000-page windows via _fetch_cloud_pages and concatenate. Each window
over-covers ("1-1000"), which also fixes last-page truncation when the tree's
max index is 0-based/short (verified live: attention paper imports 15 pages,
was 13).
- cli: recompile misclassified pageindex_cloud docs as SHORT (== "long_pdf"
checks), routing them to the short branch that looks for a .md source cloud
imports never write (.json). Add _is_long_doc({long_pdf, pageindex_cloud}) and
use it in recompile's _classify + dispatch. remove's local-PageIndex cleanup
stays long_pdf-only by design. Add pageindex_cloud to _TYPE_DISPLAY_MAP so
`openkb list` shows "pageindex", not the raw type string.
- pyproject: drop now-dead [tool.hatch.metadata] allow-direct-references (the dep
is a pinned version; no direct git/url references remain).
- tests: windowing (>1000 + no-index), cloud recompile-as-long, predicate/display.
… on failed import - import_from_pageindex_cloud writes the summary + JSON source (in import_cloud_document) and compiles concept/entity pages BEFORE adding the registry entry, so a compile failure stranded wiki artifacts that `openkb remove` (registry-only resolution) could never reach. On failure, best-effort clean them via _cleanup_failed_cloud_import — reusing remove's by-doc_name wiki helpers (summary / source.json / concept + entity pages / index), touching neither the registry (no entry was added, so a retry stays clean) nor PageIndex (the cloud doc is the user's asset). - add: the cloud-import outcome was discarded, so `openkb add --from-pageindex-cloud` exited 0 even on failure. ctx.exit(1) on "failed" so scripted/CI callers can detect it. - tests: compile-failure orphan cleanup + the failure/success exit codes.
…e tree's max index _fetch_cloud_pages bounded its loop by the tree's max page index (_max_page_index). When the cloud tree under-reports the page count — a real case (e.g. a paper whose tree stops a couple pages short of the references) — the loop exited before fetching a later window and silently DROPPED pages. At the exact 1000 boundary this was an off-by-one (a 1001-page doc whose tree maxes at 1000 lost page 1001); more generally any >1000-page doc with an under-reporting tree was truncated. Drop the max-index bound: request fixed 1000-page windows and stop as soon as a window comes back SHORT (PageIndex page numbers are sequential, so a short window means we've passed the last page). The common ≤1000-page doc stays a single request, a larger doc fetches every page, and an under-reported tree no longer truncates. Remove the now-unused _max_page_index. Tests updated + a full-window-triggers-next-fetch regression.
gwokhou
added a commit
to gwokhou/OpenKB
that referenced
this pull request
Jun 26, 2026
Resolve conflicts between the concurrent-ingestion/recovery branch and upstream's PageIndex Cloud import feature (VectifyAI#97) + config/compiler/skill changes. cli.py: - imports: keep HEAD's staged-converter helpers (_convert_document_locked, _sanitize_stem, resolve_doc_name); add upstream's import_cloud_document. - add(): keep HEAD's removal of @_with_kb_lock (the concurrent directory batch must release the mutation lock between per-file commits) and adopt upstream's from_pageindex_cloud param. Wrap the cloud-import call in _kb_mutation_lock so it matches every other add mutation path — the auto-merge otherwise left cloud import unlocked after the decorator was dropped, dropping both mutual exclusion and journal draining on acquire. - _add_single_file_locked: remove dead local imports (body now delegates to _prepare_add_file / _commit_prepared_add). test_add_command.py: both sides added disjoint tests at the same site; keep all of them (HEAD recovery/concurrency tests + upstream cloud tests). Full suite: 910 passed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements #88 — import a document that is already indexed in PageIndex Cloud into a local OpenKB knowledge base, with no local PDF and no re-processing:
doc_id(get_document/get_page_content), bypassing the localconvert → raw → page-count → col.addpipeline entirely._write_long_doc_artifacts) and compiles concepts viacompile_long_doc.pageindex_cloudwith a synthetic identity keypageindex-cloud:<doc_id>(sha256-keyed; re-import of the same doc-id is idempotent/skipped).openkb addruns under the exclusive KB lock, so the cloud-import path is serialized like every other ingest.openkb removeon an imported doc cleans up local artifacts only — the user's cloud corpus is never touched (the existinglong_pdfgate already excludes the new type; proven by a regression test).pageindexswitched to trackgit+https://github.com/VectifyAI/PageIndex.git@devfor the cloud API surface (lock pins a concrete commit). Follow-up: re-pin to an exact tag once a new PageIndex release is published.Notes
doc_namegroundwork (resolve_doc_name, registry path identity), now inmain. Rebased ontomainafter fix: collision-resistant doc_name — same-stem documents no longer overwrite each other #96 merged; base ismain.images: []). Concept compilation reads the summary, so output quality is unaffected.Test Plan
Automated (full suite green — 742 passed on the rebased branch):
uv run --extra dev pytest -quv run --extra dev pytest tests/test_converter.py tests/test_indexer.py tests/test_add_command.py tests/test_remove.py -vManual (requires
PAGEINDEX_API_KEYand a real cloud doc-id):export PAGEINDEX_API_KEY=...thenopenkb add --from-pageindex-cloud <DOC_ID>→ wiki/summaries + wiki/sources written, concepts compiled, doc appears inopenkb list.[SKIP] Already imported(idempotent).openkb add foo.pdf --from-pageindex-cloud X→ errors "not both";openkb addwith neither → "Provide a PATH…".PAGEINDEX_API_KEYthen import → clear error, nothing written.openkb remove <doc>on the imported doc → local artifacts removed, cloud doc still present in PageIndex.