
fix: return only requested pages from get_page_content on Markdown #280

Open
akhilesharora wants to merge 1 commit into VectifyAI:main from akhilesharora:fix/md-discrete-pages-overcollection

@akhilesharora

get_page_content's docstring says '3,8' means pages 3 and 8 (pageindex/retrieve.py:111-119). PDFs honor that. Markdown doesn't: _get_md_page_content takes min(page_nums)/max(page_nums) and returns everything in between.

Same input, before:

md  pages="5,100" -> [5, 10, 50, 100]
pdf pages="5,100" -> [5, 100]

After:

md  pages="5,100" -> [5, 100]
pdf pages="5,100" -> [5, 100]

_parse_pages already returns a discrete sorted list, so the simplest fix is to match against set(page_nums) instead of [min..max]:

wanted = set(page_nums)
...
if ln in wanted and ln not in seen:

Range form ('5-7') still parses to [5,6,7] so it keeps working. Only the comma-list shape changes.
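
To make the difference concrete, here is a minimal sketch of the two matching strategies. It is not the code in pageindex/retrieve.py: `parse_pages`, `select_range`, and `select_discrete` are hypothetical stand-ins, and the list of heading line numbers is made up to mirror the before/after example above.

```python
# Hypothetical stand-ins for illustration only; the real helpers live in
# pageindex/retrieve.py and may differ in signature and detail.

def parse_pages(spec):
    """Parse '5', '5,100', or '5-7' into a sorted list of distinct page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            pages.update(range(start, end + 1))  # range form is inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


def select_range(line_nums, page_nums):
    """Old Markdown behavior: keep everything in the inclusive span [min..max]."""
    lo, hi = min(page_nums), max(page_nums)
    return [ln for ln in line_nums if lo <= ln <= hi]


def select_discrete(line_nums, page_nums):
    """Fixed behavior: keep only the explicitly requested numbers, once each."""
    wanted = set(page_nums)
    seen = set()
    selected = []
    for ln in line_nums:
        if ln in wanted and ln not in seen:
            seen.add(ln)
            selected.append(ln)
    return selected


# Made-up heading line numbers for a Markdown document.
headings = [1, 5, 10, 50, 100]

print(select_range(headings, parse_pages("5,100")))     # [5, 10, 50, 100]
print(select_discrete(headings, parse_pages("5,100")))  # [5, 100]
print(parse_pages("5-7"))                               # [5, 6, 7]
```

Because the range form is expanded during parsing, set membership gives the same result as the old span check for '5-7', which is why only the comma-list shape changes.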

Added tests/test_retrieve_pages.py with four cases. Two of them fail on main and pass with the fix. The other two cover the range and single-page forms to make sure nothing regresses there.

$ python -m pytest tests/test_retrieve_pages.py -v
==================== 4 passed in 1.87s =====================================
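
For readers without the branch checked out, the four cases might look roughly like the sketch below. This is an assumption about their shape, not the contents of tests/test_retrieve_pages.py: `pages_returned` is a hypothetical stand-in for the Markdown retrieval path, and `AVAILABLE` is invented data.

```python
# Illustrative only: a stand-in for the Markdown branch plus four pytest-style
# cases. Names and data here are hypothetical, not taken from the PR's test file.

def pages_returned(spec, available):
    """Return the entries of `available` that the page spec explicitly requests."""
    wanted = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-", 1))
            wanted.update(range(lo, hi + 1))  # inclusive range form
        else:
            wanted.add(int(part))
    return [p for p in available if p in wanted]


AVAILABLE = [1, 5, 6, 7, 10, 50, 100]


def test_comma_list_returns_only_listed_pages():
    assert pages_returned("5,100", AVAILABLE) == [5, 100]


def test_comma_list_skips_pages_in_between():
    result = pages_returned("5,100", AVAILABLE)
    assert 10 not in result and 50 not in result


def test_range_form_is_inclusive():
    assert pages_returned("5-7", AVAILABLE) == [5, 6, 7]


def test_single_page():
    assert pages_returned("10", AVAILABLE) == [10]
```

The first two cases are the kind that would fail against a min/max span implementation; the last two guard the forms that are meant to stay unchanged.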

Files touched:

  • pageindex/retrieve.py: 4 lines of logic, plus a docstring update so it matches the new behavior.
  • tests/test_retrieve_pages.py: new, self-contained, no LLM or PDF needed.

No PDF behavior change, no API surface change.

Closes #279

get_page_content's docstring describes '3,8' as two discrete pages.
The PDF branch honors that. The Markdown branch was treating the list
as the inclusive range [min..max] and pulling in every heading between
them. Match against the parsed set instead so both branches agree.

Range form '5-7' still parses to [5,6,7] so range queries are unchanged;
only the comma-list shape changes behavior.

Closes VectifyAI#279
Development

Successfully merging this pull request may close these issues.

get_page_content over-collects on Markdown when given a comma-separated page list
