Summary
route .xlsx conversion through a streaming openpyxl reader (read_only=True, data_only=True)
cap scan bounds (max_rows=5000, max_cols=64) to prevent pathological worksheet ranges from exploding memory
stop scanning after sustained empty tails to avoid sparse-sheet runaway processing
Why
Some workbooks report huge used ranges (e.g. max_row=1048571) despite having very little real data, which can cause generic converters to consume excessive RAM.
Result
Significantly lower memory use during XLSX ingest while preserving useful sheet content for KB compilation.
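As a rough illustration of the bounded scan described in this PR description, here is a minimal sketch. It is not the PR's actual code: the name `bounded_rows`, the `max_empty_streak` parameter, and the choice to skip empty rows rather than emit them are assumptions made for the example. Only the default limits (5000 rows, 64 columns, a streak of 200 empty rows) are taken from this thread.

```python
from typing import Any, Iterable, Iterator, Sequence, Tuple


def bounded_rows(
    rows: Iterable[Sequence[Any]],
    max_rows: int = 5000,
    max_cols: int = 64,
    max_empty_streak: int = 200,  # hypothetical parameter name
) -> Iterator[Tuple[Any, ...]]:
    """Yield non-empty rows, clipped to max_cols, until a limit is hit.

    Works on any iterable of row sequences, so it never needs the whole
    sheet in memory at once.
    """
    empty_streak = 0
    for index, row in enumerate(rows):
        if index >= max_rows:
            break  # hard row cap
        cells = tuple(row)[:max_cols]  # hard column cap
        if all(c is None or str(c).strip() == "" for c in cells):
            empty_streak += 1
            if empty_streak >= max_empty_streak:
                break  # sustained empty tail: stop scanning
            continue  # assumption: empty rows are skipped, not emitted
        empty_streak = 0
        yield cells
```

With openpyxl, the `rows` argument could be `ws.iter_rows(values_only=True)` for a worksheet `ws` from a workbook opened with `load_workbook(path, read_only=True, data_only=True)`. Note that this sketch drops data beyond the limits without telling the caller, which is the behavior the review below objects to.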
convert_xlsx_streaming silently truncates spreadsheets at max_rows=5000 and max_cols=64 with no warning or signal back to the caller. When a sheet exceeds either bound, the excess is dropped, the function returns a partial markdown string, convert_document writes it to wiki/sources/<doc_name>.md, and the file hash is registered as fully processed — so the next openkb add will skip it as "already known." The user has no way to discover that a 50k-row workbook only ingested its first 5k rows. The early-exit on empty_streak >= 200 has the same property. Every other branch in convert_document either converts the full document or raises.
The hard caps and the partial-success completion path:
Suggested fix: when the row/col cap is reached or empty_streak breaks the scan, append an explicit truncation marker to the returned markdown (e.g. > Truncated: scan limits reached at row N / col M), log a warning, and ideally surface a flag on ConvertResult so the caller can decide whether to register the hash as complete.
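A minimal sketch of what that signaling could look like, under stated assumptions. `ScanResult` and `rows_to_markdown` are hypothetical names invented for this example; the real `ConvertResult` and `convert_xlsx_streaming` are not shown in this thread, so nothing here reflects their actual fields or signatures.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Iterable, List, Sequence

logger = logging.getLogger(__name__)


@dataclass
class ScanResult:  # hypothetical stand-in, not the project's ConvertResult
    markdown: str
    truncated: bool = False
    reasons: List[str] = field(default_factory=list)


def rows_to_markdown(
    rows: Iterable[Sequence[Any]],
    max_rows: int = 5000,
    max_cols: int = 64,
    max_empty_streak: int = 200,
) -> ScanResult:
    """Render rows as markdown table lines and report any truncation."""
    lines: List[str] = []
    reasons: List[str] = []
    empty_streak = 0
    cols_cut = False
    for index, row in enumerate(rows):
        if index >= max_rows:
            # Only reached when at least one row exists beyond the cap.
            reasons.append(f"row limit {max_rows} reached")
            break
        cells = tuple(row)
        if any(c is not None for c in cells[max_cols:]):
            cols_cut = True  # real data was dropped past the column cap
        cells = cells[:max_cols]
        if all(c is None for c in cells):
            empty_streak += 1
            if empty_streak >= max_empty_streak:
                reasons.append(
                    f"scan stopped after {empty_streak} empty rows at row {index + 1}"
                )
                break
            continue
        empty_streak = 0
        lines.append("| " + " | ".join("" if c is None else str(c) for c in cells) + " |")
    if cols_cut:
        reasons.append(f"column limit {max_cols} reached")
    if reasons:
        note = "; ".join(reasons)
        logger.warning("XLSX scan truncated: %s", note)
        lines += ["", f"> Truncated: {note}"]  # visible marker in the output
    return ScanResult("\n".join(lines), bool(reasons), reasons)
```

A caller could then check `result.truncated` before registering the file hash as fully processed, for example by recording the source as partial instead.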
Thanks @plasma16, and sorry for the slow follow-up. The memory issue is real, but this no longer merges against main (the ingest/convert path was rewritten by the crash-safe mutation work in #142), and the 5,000-row cap truncates large sheets silently while still marking the source fully ingested — silent data loss in a knowledge base. We will handle XLSX memory separately with explicit truncation signaling. Closing — thanks for surfacing it.