feat: idempotent path-keyed indexing + incremental update demo #314

Open
harshrathod0585 wants to merge 2 commits into VectifyAI:main from harshrathod0585:feat/incremental-md-update

@harshrathod0585

Summary

Indexing was non-idempotent: re-ingesting the same file minted a new UUID and wrote a duplicate <doc_id>.json every time, silently bloating the workspace and orphaning prior summaries.

  • index() now resolves a document by its absolute path and reuses the existing doc_id, overwriting in place.
  • New get_doc_id_by_path() exposes this lookup so callers can cleanly branch: index() when new, update() when known.
  • Adds examples/incremental_update_demo.py demonstrating the index-vs-update flow, a PageIndex-themed sample.md, and an examples/README.md.
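The index-vs-update branch described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `PathKeyedStore` and its in-memory dictionaries are hypothetical stand-ins for the workspace, and only the name `get_doc_id_by_path` is taken from the description.

```python
import os
import uuid


class PathKeyedStore:
    """Hypothetical in-memory stand-in for a workspace keyed by absolute path."""

    def __init__(self):
        self._doc_id_by_path = {}  # absolute path -> doc_id
        self._docs = {}            # doc_id -> stored payload

    def get_doc_id_by_path(self, path):
        """Return the doc_id previously assigned to this path, or None."""
        return self._doc_id_by_path.get(os.path.abspath(path))

    def index(self, path, payload):
        """Reuse the existing doc_id for a known path; mint one only when new."""
        key = os.path.abspath(path)
        doc_id = self._doc_id_by_path.get(key)
        if doc_id is None:
            doc_id = str(uuid.uuid4())
            self._doc_id_by_path[key] = doc_id
        self._docs[doc_id] = payload  # overwrite in place, no duplicate entry
        return doc_id


store = PathKeyedStore()
first = store.index("docs/sample.md", {"version": 1})
# A differently spelled path to the same file resolves to the same key.
second = store.index("./docs/sample.md", {"version": 2})
```

In this sketch both calls return the same id and the store holds a single entry. Note that `os.path.abspath` normalizes relative segments but does not resolve symlinks, so a symlinked path to the same file would still receive a separate id here.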

Note for maintainers

This makes PageIndex more powerful and, with that, potentially more dangerous: because re-indexing now overwrites a document in place by path, an unintended re-ingest can silently replace an existing tree. The path-matching semantics deserve a deliberate look before merge.

I also have another plan for PageIndex I'd love to build on top of this. Thanks for the consideration! 🙏

Add PageIndexClient.update(doc_id) for MD docs. Detects changed
sections via a section-hash diff and re-summarizes only the changed
sections plus their ancestors, reusing cached summaries for the rest.
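The "changed sections plus their ancestors" rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: section identities are modeled as strings joined by `" > "`, and `ancestor_paths`, `sections_to_resummarize`, and `resummarize` are hypothetical names. The PR mentions a `find_ancestors` helper, but its real signature is not shown in this description.

```python
def ancestor_paths(title_path, sep=" > "):
    """Return every proper ancestor of a section path, root first."""
    parts = title_path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts))]


def sections_to_resummarize(changed_paths):
    """Changed sections plus all of their ancestors, deduplicated."""
    dirty = set(changed_paths)
    for path in changed_paths:
        dirty.update(ancestor_paths(path))
    return dirty


def resummarize(all_paths, changed_paths, cached, summarize):
    """Call summarize() only for dirty or uncached sections; reuse the rest."""
    dirty = sections_to_resummarize(changed_paths)
    return {
        path: summarize(path) if path in dirty or path not in cached else cached[path]
        for path in all_paths
    }


all_paths = ["Guide", "Guide > Setup", "Guide > Setup > Linux", "Guide > Usage"]
cached = {path: "old:" + path for path in all_paths}
calls = []


def fake_summarize(path):
    # Stand-in for the LLM call; records which sections were re-summarized.
    calls.append(path)
    return "new:" + path


result = resummarize(all_paths, ["Guide > Setup > Linux"], cached, fake_summarize)
```

With one changed leaf, only that leaf and its two ancestors reach `fake_summarize`; the sibling `Guide > Usage` keeps its cached summary.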

- extract_node_text_content now stamps a hierarchical title_path on
  each node, giving sections a stable identity across edits.
- utils: hash_text, compute_section_hashes, find_ancestors helpers.
- index() stores file_hash + section_hashes for MD docs so update()
  has a baseline; _ensure_doc_loaded restores them on demand.
- update() gates on file_hash, then per-section hashes; returns the
  updated/added/deleted section paths.
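The two-stage gate (whole-file hash first, then per-section hashes) might look something like the sketch below. It assumes SHA-256 and a plain dictionary baseline; `sha256_hex`, `diff_sections`, and `plan_update` are hypothetical names, and the PR's own `hash_text` and `compute_section_hashes` helpers may differ in both algorithm and signature.

```python
import hashlib


def sha256_hex(text):
    """Hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_sections(old_hashes, new_sections):
    """Compare stored section hashes against freshly hashed section texts."""
    new_hashes = {path: sha256_hex(text) for path, text in new_sections.items()}
    changes = {
        "updated": sorted(p for p in new_hashes
                          if p in old_hashes and new_hashes[p] != old_hashes[p]),
        "added": sorted(p for p in new_hashes if p not in old_hashes),
        "deleted": sorted(p for p in old_hashes if p not in new_hashes),
    }
    return changes, new_hashes


def plan_update(baseline, new_file_text, new_sections):
    """Gate on the file hash, then fall through to the per-section diff."""
    new_file_hash = sha256_hex(new_file_text)
    if new_file_hash == baseline["file_hash"]:
        return {"updated": [], "added": [], "deleted": []}  # nothing to do
    changes, new_hashes = diff_sections(baseline["section_hashes"], new_sections)
    baseline["file_hash"] = new_file_hash      # store the new baseline
    baseline["section_hashes"] = new_hashes
    return changes


old_sections = {"Intro": "hello", "Intro > Setup": "pip install", "Intro > Old": "legacy"}
baseline = {
    "file_hash": sha256_hex("v1"),
    "section_hashes": {p: sha256_hex(t) for p, t in old_sections.items()},
}
new_sections = {"Intro": "hello", "Intro > Setup": "pip install -U", "Intro > New": "fresh"}

unchanged = plan_update(baseline, "v1", old_sections)
changes = plan_update(baseline, "v2", new_sections)
```

Because sections are keyed by path in this sketch, renaming a heading shows up as one deletion plus one addition rather than as an update.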

Markdown only: its heading structure is parsed deterministically, so
rebuilding the tree shape costs no LLM calls and the LLM runs only on
changed sections.
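To illustrate why the Markdown tree shape needs no LLM, here is a rough sketch that derives a hierarchical path for every ATX heading with a regex and a stack. It is an assumption-laden illustration, not the PR's `extract_node_text_content`: `title_paths` and the `" > "` separator are hypothetical, and Setext headings and tilde fences are ignored.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # fenced code block marker, built this way to keep the example readable


def title_paths(markdown, sep=" > "):
    """Return one hierarchical path per heading, in document order."""
    stack = []   # (level, title) pairs from the root down to the current heading
    paths = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' lines inside code fences
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Pop siblings and deeper headings so the stack holds only ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(sep.join(t for _, t in stack))
    return paths


doc = "\n".join([
    "# Guide", "", "intro", "",
    "## Setup", "",
    FENCE, "# not a heading", FENCE, "",
    "### Linux", "",
    "## Usage",
])
paths = title_paths(doc)
```

The same input always yields the same paths, which is what lets a section keep a stable identity across edits to its body text.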