feat: idempotent path-keyed indexing + incremental update demo #314
Open
harshrathod0585 wants to merge 2 commits into
Add PageIndexClient.update(doc_id) for MD docs. Detects changed sections via a section-hash diff and re-summarizes only the changed sections plus their ancestors, reusing cached summaries for the rest.

- extract_node_text_content now stamps a hierarchical title_path on each node, giving sections a stable identity across edits.
- utils: hash_text, compute_section_hashes, find_ancestors helpers.
- index() stores file_hash + section_hashes for MD docs so update() has a baseline; _ensure_doc_loaded restores them on demand.
- update() gates on file_hash, then per-section hashes; returns the updated/added/deleted section paths.

Markdown only: its heading structure is parsed deterministically, so the new tree shape is free and the LLM runs only on changed sections.
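To make the section-hash diff concrete, here is a minimal sketch of the idea, not the code from this PR. The helper names mirror the ones listed above, but their bodies are assumptions: sections are modelled as a flat dict from title_path to text, the path separator is assumed to be " > ", SHA-256 is an assumed hash choice, and `diff_sections` is a hypothetical name. Marking the ancestors of deleted sections as dirty is also an assumption.

```python
import hashlib


def hash_text(text: str) -> str:
    # Assumption: SHA-256 over UTF-8 bytes; the PR does not state the hash used.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compute_section_hashes(sections: dict[str, str]) -> dict[str, str]:
    # Map each section's title_path to the hash of its own text.
    return {path: hash_text(body) for path, body in sections.items()}


def find_ancestors(title_path: str, sep: str = " > ") -> list[str]:
    # "A > B > C" -> ["A", "A > B"]; the separator is an assumption.
    parts = title_path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts))]


def diff_sections(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    # Hypothetical helper: compare two {title_path: hash} maps.
    updated = sorted(p for p in new if p in old and new[p] != old[p])
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)

    # Re-summarize changed and added sections, plus every surviving ancestor
    # of anything that changed, was added, or was deleted.
    dirty = set(updated) | set(added)
    for path in updated + added + deleted:
        dirty.update(a for a in find_ancestors(path) if a in new)

    return {
        "updated": updated,
        "added": added,
        "deleted": deleted,
        "resummarize": sorted(dirty),
    }


old_sections = {
    "Guide": "intro",
    "Guide > Install": "pip install",
    "Guide > Usage": "run it",
    "FAQ": "q and a",
}
new_sections = dict(old_sections)
new_sections["Guide > Install"] = "pip install -U"  # edit one leaf section

result = diff_sections(
    compute_section_hashes(old_sections),
    compute_section_hashes(new_sections),
)
print(result["resummarize"])  # → ['Guide', 'Guide > Install']
```

In this sketch, "Guide > Usage" and "FAQ" keep their cached summaries; only the edited leaf and its parent are marked for re-summarization.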
Indexing was non-idempotent: re-ingesting the same file minted a new UUID and wrote a duplicate <doc_id>.json every time, silently bloating the workspace and orphaning prior summaries. index() now resolves a document by its absolute path and reuses the existing doc_id, overwriting in place. New get_doc_id_by_path() exposes this lookup so callers can cleanly branch: index() when new, update() when known. Ships examples/incremental_update_demo.py demonstrating the index-vs-update flow, a PageIndex-themed sample.md, and an examples README.
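The path-keyed behaviour described above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the PageIndexClient implementation: `PathKeyedRegistry` is a hypothetical class, the record layout is invented, and normalizing with `os.path.abspath` is an assumption about what "absolute path" means here.

```python
import os
import uuid
from typing import Optional


class PathKeyedRegistry:
    """Hypothetical stand-in for a workspace that maps doc_id -> record."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def get_doc_id_by_path(self, path: str) -> Optional[str]:
        # Look up an existing document by its normalized absolute path.
        key = os.path.abspath(path)
        for doc_id, record in self._docs.items():
            if record["path"] == key:
                return doc_id
        return None

    def index(self, path: str, tree: dict) -> str:
        # Reuse the existing doc_id when the path is known; mint one otherwise.
        doc_id = self.get_doc_id_by_path(path) or uuid.uuid4().hex
        # Overwrite in place: the previous tree for this path is replaced.
        self._docs[doc_id] = {"path": os.path.abspath(path), "tree": tree}
        return doc_id


registry = PathKeyedRegistry()
first_id = registry.index("docs/sample.md", {"version": 1})
second_id = registry.index("./docs/../docs/sample.md", {"version": 2})

# Both spellings normalize to the same absolute path, so the id is reused.
print(first_id == second_id)  # → True
```

Note that `os.path.abspath` normalizes `.` and `..` components but does not resolve symlinks, so two symlinked spellings of the same file would still get separate ids in this sketch; `os.path.realpath` would be the alternative.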
Summary
Indexing was non-idempotent: re-ingesting the same file minted a new UUID and wrote a duplicate `<doc_id>.json` every time, silently bloating the workspace and orphaning prior summaries.

- `index()` now resolves a document by its absolute path and reuses the existing `doc_id`, overwriting in place.
- New `get_doc_id_by_path()` exposes this lookup so callers can cleanly branch: `index()` when new, `update()` when known.
- Ships `examples/incremental_update_demo.py` demonstrating the index-vs-update flow, a PageIndex-themed `sample.md`, and an `examples/README.md`.

Note for maintainers
This makes PageIndex more powerful and, with that, potentially more dangerous. Because re-indexing now overwrites a document in place by path, an unintended re-ingest can silently replace an existing tree. The path-matching semantics are worth a deliberate look before merge.
I also have another plan for PageIndex I'd love to build on top of this. Thanks for the consideration! 🙏