fix(tree_renderer): embed extracted images in the rendered summary #170
Open
designcomputer wants to merge 1 commit into
For long documents (doc_type: pageindex), images are correctly extracted to wiki/sources/images/<doc>/ and referenced with correct wiki-relative paths in wiki/sources/<doc>.json — but nothing in the actual Obsidian-rendered wiki (summaries, concept pages, entity pages, index.md) ever surfaced them. That raw JSON file isn't rendered as a wiki page, so extracted images were effectively invisible anywhere a human browses the vault. render_summary_md now accepts the same per-page `pages` list already written to wiki/sources/<doc>.json, builds a page-number -> image-path map, and embeds each node's page-range images inline — de-duped the same way duplicate summaries are (a page split across many sibling nodes doesn't repeat the same figure at each one). _write_long_doc_artifacts already had `pages` in scope when calling render_summary_md; it just wasn't passing it through. Fixes VectifyAI#166. Verified against a real 31-page manual: 35/35 auto-extracted images now appear exactly once each in the rendered summary, correctly positioned by page range, with zero duplicates.
Closes #166.
Problem
For long documents (`doc_type: pageindex`), images are correctly extracted to `wiki/sources/images/<doc>/` and referenced with correct wiki-relative paths in `wiki/sources/<doc>.json`: each page object has `"images": [{"path": "sources/images/<doc>/pX_imgY.png"}]`, and the path is also inlined in that page's `content`.
But `render_summary_md`, which builds `wiki/summaries/<doc>.md` (the page a user actually opens in Obsidian), never reads any of this. Net effect: images are on disk and technically "referenced" in a JSON data file, but invisible everywhere a human actually browses the vault: not in the summary, not in any concept/entity page, not in `index.md`. `wiki/sources/<doc>.json` isn't rendered as a wiki page by anything, so those references are effectively inert.
Reproduced on `openkb` 0.4.2 with a 31-page manual and no `PAGEINDEX_API_KEY` set (local pymupdf fallback, which does extract images): 35 images extracted, 0 surfaced anywhere in the rendered wiki. Full repro details in #166.
Fix
`render_summary_md` now accepts the same per-page `pages` list already written to `wiki/sources/<doc>.json`, builds a `page number -> image paths` map via a new `_build_page_images` helper, and embeds each node's page-range images inline. Images are de-duped the same way duplicate summaries already are in this file (a page split across many sibling nodes, common with the "no TOC" fallback, doesn't repeat the same figure at every one of them).
`indexer.py`'s `_write_long_doc_artifacts` already had `pages` in scope at the point it calls `render_summary_md`; this just threads it through (one call-site change).
Testing
Added 4 unit tests to `tests/test_tree_renderer.py` (basic embedding, no-pages-argument is a no-op, de-dup across sibling nodes sharing a page, ordering with multiple images on one page) and an integration test to `tests/test_indexer.py` confirming `_write_long_doc_artifacts` actually threads it through end-to-end. Full suite (931 tests) passes; `ruff format`, `ruff check`, and `mypy openkb/tree_renderer.py openkb/indexer.py` are all clean.
Verified against a real ingest of a 31-page manual (separately from these unit tests): 35/35 auto-extracted images now appear exactly once each in the rendered summary, correctly positioned by page range, zero duplicates.
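For readers who want the shape of the change without opening the diff, the mapping and de-dup behaviour described under Fix can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: `build_page_images` and `render_nodes` are hypothetical stand-ins (the real helper is `_build_page_images`, whose signature is not shown here), the `"page"`, `"title"`, `"start"` and `"end"` keys are assumed names, the node list is flat rather than a tree, and the `![](path)` embed syntax is a guess at the output format. Only the `"images": [{"path": ...}]` shape comes from the description above.

```python
from __future__ import annotations

from typing import Any


def build_page_images(pages: list[dict[str, Any]]) -> dict[int, list[str]]:
    """Map page number -> image paths, keeping the on-page order.

    Hypothetical stand-in for the PR's helper; the "page" key is an assumption.
    """
    page_images: dict[int, list[str]] = {}
    for page in pages:
        paths = [img["path"] for img in page.get("images", []) if img.get("path")]
        if paths:
            page_images[page["page"]] = paths
    return page_images


def render_nodes(
    nodes: list[dict[str, Any]], page_images: dict[int, list[str]]
) -> list[str]:
    """Emit a heading per node, then any not-yet-seen images in its page range."""
    seen: set[str] = set()  # paths already embedded by an earlier node
    lines: list[str] = []
    for node in nodes:
        lines.append(f"## {node['title']}")
        for page_no in range(node["start"], node["end"] + 1):
            for path in page_images.get(page_no, []):
                if path not in seen:
                    seen.add(path)
                    lines.append(f"![]({path})")  # assumed embed syntax
    return lines


# "manual" is a made-up document name used only for this example.
pages = [
    {
        "page": 1,
        "images": [
            {"path": "sources/images/manual/p1_img1.png"},
            {"path": "sources/images/manual/p1_img2.png"},
        ],
    },
    {"page": 2, "images": []},
    {"page": 3, "images": [{"path": "sources/images/manual/p3_img1.png"}]},
]
# "Intro" and "Setup" both touch page 1, as with the "no TOC" fallback.
nodes = [
    {"title": "Intro", "start": 1, "end": 1},
    {"title": "Setup", "start": 1, "end": 2},
    {"title": "Usage", "start": 3, "end": 3},
]

rendered = render_nodes(nodes, build_page_images(pages))
print("\n".join(rendered))
```

In this sketch the two page-1 images are embedded under "Intro" only, "Setup" gets none because its only images were already seen, and passing an empty map yields headings alone, which mirrors the "no-pages-argument is a no-op" case listed in the tests.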
Sibling PRs from the same investigation: #167 (title truncation), #168 (source text rendering), #169 (duplicate summary collapsing).