fix(tree_renderer): embed extracted images in the rendered summary by designcomputer · Pull Request #170 · VectifyAI/OpenKB

designcomputer · 2026-07-03T01:04:19Z

Closes #166.

Problem

For long documents (doc_type: pageindex), images are correctly extracted to wiki/sources/images/<doc>/ and referenced with correct wiki-relative paths in wiki/sources/<doc>.json — each page object has "images": [{"path": "sources/images/<doc>/pX_imgY.png"}], and the path is also inlined in that page's content.

But render_summary_md — which builds wiki/summaries/<doc>.md, the page a user actually opens in Obsidian — never reads any of this. Net effect: images are on disk and technically "referenced" in a JSON data file, but invisible everywhere a human actually browses the vault — not in the summary, not in any concept/entity page, not in index.md. wiki/sources/<doc>.json isn't rendered as a wiki page by anything, so those references are effectively inert.

Reproduced on openkb 0.4.2 with a 31-page manual and no PAGEINDEX_API_KEY set (so the local pymupdf fallback, which does extract images, was used): 35 images extracted, 0 surfaced anywhere in the rendered wiki. Full repro details in #166.
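A rough way to check the "extracted vs. surfaced" numbers on your own vault is sketched below. It is not part of this PR. It assumes only what is described above (each page object carries "images": [{"path": ...}]); the function name count_surfaced_images is hypothetical, and loading the JSON and the summary from disk is left to the caller.

```python
from typing import Any


def count_surfaced_images(
    pages: list[dict[str, Any]], summary_md: str
) -> tuple[int, int]:
    """Return (referenced, surfaced) image counts.

    `pages` is the per-page list stored in wiki/sources/<doc>.json;
    `summary_md` is the text of wiki/summaries/<doc>.md.
    """
    paths: list[str] = []
    for page in pages:
        for image in page.get("images", []):
            path = image.get("path")
            # Count each distinct path once, keeping first-seen order.
            if path and path not in paths:
                paths.append(path)
    # An image counts as surfaced if its path appears anywhere in the summary.
    surfaced = sum(1 for path in paths if path in summary_md)
    return len(paths), surfaced


pages = [
    {"images": [{"path": "sources/images/manual/p1_img1.png"}]},
    {"images": [{"path": "sources/images/manual/p2_img1.png"}]},
]
summary = "# Manual\n\n![[sources/images/manual/p1_img1.png]]\n"
referenced, surfaced = count_surfaced_images(pages, summary)
```

Before this fix, the second number would be 0 for a pageindex document regardless of how many images were extracted.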

Fix

render_summary_md now accepts the same per-page pages list already written to wiki/sources/<doc>.json, builds a page number -> image paths map via a new _build_page_images helper, and embeds each node's page-range images inline. Images are de-duped the same way duplicate summaries already are in this file (a page split across many sibling nodes — common with the "no TOC" fallback — doesn't repeat the same figure at every one of them).

indexer.py's _write_long_doc_artifacts already had pages in scope at the point it calls render_summary_md — this just threads it through (one call-site change).
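The approach described above can be sketched roughly as follows. This is an illustration, not the code in the diff: the "page" key used for the page number, the node page range passed as (start, end), the function names, and the Obsidian-style ![[...]] embed syntax are all assumptions made for the example.

```python
from typing import Any, Optional


def build_page_images(
    pages: Optional[list[dict[str, Any]]],
) -> dict[int, list[str]]:
    """Map page number -> image paths, preserving their order on the page."""
    page_images: dict[int, list[str]] = {}
    for page in pages or []:  # no pages argument -> empty map (no-op)
        number = page.get("page")  # assumed key name for the page number
        if number is None:
            continue
        for image in page.get("images", []):
            path = image.get("path")
            if path:
                page_images.setdefault(number, []).append(path)
    return page_images


def embeds_for_node(
    start: int,
    end: int,
    page_images: dict[int, list[str]],
    seen: set[str],
) -> list[str]:
    """Return embed lines for pages start..end, skipping images already emitted."""
    lines: list[str] = []
    for number in range(start, end + 1):
        for path in page_images.get(number, []):
            if path in seen:
                continue  # a sibling node sharing this page already embedded it
            seen.add(path)
            lines.append(f"![[{path}]]")  # assumed embed syntax
    return lines


pages = [
    {"page": 1, "images": [{"path": "sources/images/doc/p1_img1.png"},
                           {"path": "sources/images/doc/p1_img2.png"}]},
    {"page": 2, "images": [{"path": "sources/images/doc/p2_img1.png"}]},
]
page_images = build_page_images(pages)
seen: set[str] = set()
first = embeds_for_node(1, 1, page_images, seen)   # both page-1 images
second = embeds_for_node(1, 2, page_images, seen)  # only the page-2 image
```

Sharing one `seen` set across the whole render is what keeps a figure from repeating when several sibling nodes cover the same page.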

Testing

Added 4 unit tests to tests/test_tree_renderer.py (basic embedding, no-pages-argument is a no-op, de-dup across sibling nodes sharing a page, ordering with multiple images on one page) and an integration test to tests/test_indexer.py confirming _write_long_doc_artifacts actually threads it through end-to-end. Full suite (931 tests) passes; ruff format, ruff check, mypy openkb/tree_renderer.py openkb/indexer.py all clean.

Verified against a real ingest of a 31-page manual (separately from these unit tests): 35/35 auto-extracted images now appear exactly once each in the rendered summary, correctly positioned by page range, zero duplicates.

Sibling PRs from the same investigation: #167 (title truncation), #168 (source text rendering), #169 (duplicate summary collapsing).

For long documents (doc_type: pageindex), images are correctly extracted to wiki/sources/images/<doc>/ and referenced with correct wiki-relative paths in wiki/sources/<doc>.json, but nothing in the actual Obsidian-rendered wiki (summaries, concept pages, entity pages, index.md) ever surfaced them. That raw JSON file isn't rendered as a wiki page, so extracted images were effectively invisible anywhere a human browses the vault.

render_summary_md now accepts the same per-page `pages` list already written to wiki/sources/<doc>.json, builds a page-number -> image-path map, and embeds each node's page-range images inline, de-duped the same way duplicate summaries are (a page split across many sibling nodes doesn't repeat the same figure at each one). _write_long_doc_artifacts already had `pages` in scope when calling render_summary_md; it just wasn't passing it through.

Fixes VectifyAI#166. Verified against a real 31-page manual: 35/35 auto-extracted images now appear exactly once each in the rendered summary, correctly positioned by page range, with zero duplicates.

designcomputer mentioned this pull request Jul 3, 2026

Long-doc (PageIndex) images are extracted but never surface in the rendered wiki #166

Open

designcomputer wants to merge 1 commit into VectifyAI:main from designcomputer:fix/pageindex-embed-images
