fix(tree_renderer): embed extracted images in the rendered summary #170

Open
designcomputer wants to merge 1 commit into VectifyAI:main from designcomputer:fix/pageindex-embed-images

Conversation

@designcomputer


Closes #166.

Problem

For long documents (doc_type: pageindex), images are correctly extracted to wiki/sources/images/<doc>/ and referenced with correct wiki-relative paths in wiki/sources/<doc>.json — each page object has "images": [{"path": "sources/images/<doc>/pX_imgY.png"}], and the path is also inlined in that page's content.

But render_summary_md — which builds wiki/summaries/<doc>.md, the page a user actually opens in Obsidian — never reads any of this. Net effect: images are on disk and technically "referenced" in a JSON data file, but invisible everywhere a human actually browses the vault — not in the summary, not in any concept/entity page, not in index.md. wiki/sources/<doc>.json isn't rendered as a wiki page by anything, so those references are effectively inert.

Reproduced on openkb 0.4.2 by ingesting a 31-page manual with no PAGEINDEX_API_KEY set (the local pymupdf fallback, which does extract images): 35 images were extracted and 0 surfaced anywhere in the rendered wiki. Full repro details in #166.

Fix

render_summary_md now accepts the same per-page pages list already written to wiki/sources/<doc>.json, builds a page number -> image paths map via a new _build_page_images helper, and embeds each node's page-range images inline. Images are de-duped the same way duplicate summaries already are in this file (a page split across many sibling nodes — common with the "no TOC" fallback — doesn't repeat the same figure at every one of them).
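To make the mapping and de-dup steps concrete, here is a minimal sketch of the idea, not the code in this PR. The function names `build_page_images` and `images_for_node`, the `page` key on each page object, the `start_index`/`end_index` node fields, and the Obsidian-style `![[...]]` embed syntax are all assumptions for illustration; only the `"images": [{"path": ...}]` shape comes from the description above.

```python
from typing import Any


def build_page_images(pages: list[dict[str, Any]]) -> dict[int, list[str]]:
    """Map page number -> image paths, keeping the order found in `pages`."""
    page_images: dict[int, list[str]] = {}
    for page in pages:
        # Assumed shape: {"page": <int>, "images": [{"path": "..."}]}
        paths = [img["path"] for img in page.get("images", []) if img.get("path")]
        if paths:
            page_images.setdefault(page["page"], []).extend(paths)
    return page_images


def images_for_node(
    node: dict[str, Any],
    page_images: dict[int, list[str]],
    seen: set[str],
) -> list[str]:
    """Return embed lines for a node's page range, skipping images already emitted."""
    lines: list[str] = []
    # Assumed node fields: inclusive start_index/end_index page numbers.
    for page_no in range(node["start_index"], node["end_index"] + 1):
        for path in page_images.get(page_no, []):
            if path not in seen:
                seen.add(path)
                lines.append(f"![[{path}]]")  # assumed embed syntax
    return lines


pages = [
    {"page": 1, "images": [{"path": "sources/images/doc/p1_img1.png"},
                           {"path": "sources/images/doc/p1_img2.png"}]},
    {"page": 2, "images": []},
]
page_images = build_page_images(pages)
seen: set[str] = set()
# Two sibling nodes that both cover page 1: only the first one emits the figures.
first = images_for_node({"start_index": 1, "end_index": 1}, page_images, seen)
second = images_for_node({"start_index": 1, "end_index": 2}, page_images, seen)
```

Sharing one `seen` set across the whole render is what keeps a page split across sibling nodes from repeating its figures at each node.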

indexer.py's _write_long_doc_artifacts already had pages in scope at the point it calls render_summary_md — this just threads it through (one call-site change).

Testing

Added 4 unit tests to tests/test_tree_renderer.py (basic embedding, no-pages-argument is a no-op, de-dup across sibling nodes sharing a page, ordering with multiple images on one page) and an integration test to tests/test_indexer.py confirming _write_long_doc_artifacts actually threads it through end-to-end. Full suite (931 tests) passes; ruff format, ruff check, mypy openkb/tree_renderer.py openkb/indexer.py all clean.

Verified against a real ingest of a 31-page manual (separately from these unit tests): 35/35 auto-extracted images now appear exactly once each in the rendered summary, correctly positioned by page range, zero duplicates.
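A check along these lines can be scripted. The following is a rough sketch, assuming the per-page list and the rendered summary text are already loaded into memory; `image_embed_counts` is a hypothetical helper and not part of openkb.

```python
from collections import Counter
from typing import Any


def image_embed_counts(pages: list[dict[str, Any]], summary_md: str) -> Counter[str]:
    """Count how often each extracted image path occurs in the rendered summary."""
    counts: Counter[str] = Counter()
    for page in pages:
        for img in page.get("images", []):
            counts[img["path"]] = summary_md.count(img["path"])
    return counts


pages = [
    {"images": [{"path": "sources/images/doc/p1_img1.png"}]},
    {"images": [{"path": "sources/images/doc/p2_img1.png"}]},
]
summary_md = "# Doc\n\n![[sources/images/doc/p1_img1.png]]\n"

counts = image_embed_counts(pages, summary_md)
missing = [path for path, n in counts.items() if n == 0]
duplicated = [path for path, n in counts.items() if n > 1]
```

For a correct render, both `missing` and `duplicated` should come back empty; in the toy data above the second image is deliberately left out of the summary to show what a miss looks like.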

Sibling PRs from the same investigation: #167 (title truncation), #168 (source text rendering), #169 (duplicate summary collapsing).

For long documents (doc_type: pageindex), images are correctly
extracted to wiki/sources/images/<doc>/ and referenced with correct
wiki-relative paths in wiki/sources/<doc>.json — but nothing in the
actual Obsidian-rendered wiki (summaries, concept pages, entity pages,
index.md) ever surfaced them. That raw JSON file isn't rendered as a
wiki page, so extracted images were effectively invisible anywhere a
human browses the vault.

render_summary_md now accepts the same per-page `pages` list already
written to wiki/sources/<doc>.json, builds a page-number -> image-path
map, and embeds each node's page-range images inline — de-duped the
same way duplicate summaries are (a page split across many sibling
nodes doesn't repeat the same figure at each one).
_write_long_doc_artifacts already had `pages` in scope when calling
render_summary_md; it just wasn't passing it through.

Fixes VectifyAI#166. Verified against a real 31-page manual: 35/35
auto-extracted images now appear exactly once each in the rendered
summary, correctly positioned by page range, with zero duplicates.

Development

Successfully merging this pull request may close these issues.

Long-doc (PageIndex) images are extracted but never surface in the rendered wiki