[Optimization] ~42% LLM call reduction via hybrid heuristics (font detection + fuzzy matching) #232

@ShainaHussain

Hi team! We've been using PageIndex on large documents (50–100+ pages) and found that a significant portion of LLM
calls during indexing can be resolved locally without losing accuracy.

Findings

Many LLM calls are for high-confidence decisions that local heuristics can handle:

| Decision point | Current behavior | Proposed optimization |
| --- | --- | --- |
| find_toc_pages | 1 LLM call per page scanned | Font/layout analysis (dot leaders, digit-ending lines, TOC keywords) resolves obvious cases locally |
| verify_toc | 1 LLM call per TOC entry | Fuzzy string matching pre-confirms title presence when confidence is high |
| check_title_appearance_in_start | 1 LLM call per node | Fuzzy match on the first ~300 chars of the target page |
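
To make the fuzzy-match row concrete, here is a minimal sketch of a title check against the first ~300 characters of a page. It is illustrative only and not the code in the linked repository: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz so it runs without extra dependencies, and the helper names (`_normalize`, `title_match_score`) and the normalization rule are assumptions.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def title_match_score(title: str, page_text: str, window: int = 300) -> float:
    """Return a 0..1 similarity between `title` and the start of the page.

    Hypothetical helper: slides a title-sized window over the first
    `window` characters and keeps the best SequenceMatcher ratio.
    """
    needle = _normalize(title)
    haystack = _normalize(page_text[:window])
    if not needle or not haystack:
        return 0.0
    if needle in haystack:  # exact hit after normalization
        return 1.0
    n = len(needle)
    best = 0.0
    for start in range(max(1, len(haystack) - n + 1)):
        chunk = haystack[start:start + n]
        best = max(best, SequenceMatcher(None, needle, chunk).ratio())
    return best
```

A score near 1.0 would let the heuristic confirm the title locally; anything below the chosen threshold would still go to the LLM.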

Initial test on an 86-page paper: ~260 LLM calls → ~150 (~42% reduction) with identical verification accuracy.
Tested on a small set of documents — broader benchmarking across document types would be a useful next step.

Key constraint: heuristics only skip a call when confidence is high. Uncertain cases always escalate to the LLM. The self-verification loop is never bypassed.
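
The escalation rule above can be sketched as a small gate around each decision point. This is an assumption-laden illustration rather than the actual implementation: the issue names `LLMCallCounter` but does not describe its interface, so the fields below, the `decide` function, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LLMCallCounter:
    """Hypothetical counter: tracks LLM calls made versus skipped."""
    made: int = 0
    skipped: int = 0

    @property
    def reduction(self) -> float:
        """Fraction of would-be LLM calls that were resolved locally."""
        total = self.made + self.skipped
        return self.skipped / total if total else 0.0


def decide(
    heuristic: Callable[[], Tuple[bool, float]],
    ask_llm: Callable[[], bool],
    counter: LLMCallCounter,
    threshold: float = 0.9,  # illustrative value, not from the issue
    use_heuristics: bool = True,
) -> bool:
    """Answer locally only when the heuristic is confident; else escalate."""
    if use_heuristics:
        answer, confidence = heuristic()
        if confidence >= threshold:
            counter.skipped += 1
            return answer
    # Uncertain (or heuristics disabled): the LLM always decides.
    counter.made += 1
    return ask_llm()
```

With the numbers reported above, a counter holding 150 calls made and 110 skipped gives 110 / 260 ≈ 0.42, which matches the stated reduction.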

Note: Path 4 documents (no TOC, no headings) see minimal savings: no structural signals exist for heuristics to exploit, so the LLM remains essential there.

Approach

  • Font-based TOC detection using PyMuPDF page layout signals (line lengths, dot-leader patterns, digit-ending ratios, TOC keywords)

  • Fuzzy title matching via rapidfuzz with per-path confidence thresholds

  • LLMCallCounter to track actual savings per document

  • if_use_heuristics config flag (default: yes); set it to no to restore the original behavior exactly.
    Fully backward compatible, with no breaking changes.
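
As a rough illustration of the TOC-detection signals listed above, the sketch below scores a page from its plain text. It assumes the text has already been extracted (for example with PyMuPDF's `page.get_text()`), ignores font information entirely, and uses made-up names and thresholds (`toc_signals`, `classify_toc_page`, 0.5, 0.6, 0.1) that would need tuning against real documents.

```python
import re
from typing import Optional

# A run of dots (optionally space-separated) followed by a page number.
DOT_LEADER = re.compile(r"(?:\.\s?){3,}\s*\d+\s*$")
ENDS_WITH_DIGIT = re.compile(r"\d+\s*$")
TOC_KEYWORDS = ("table of contents", "contents")


def toc_signals(page_text: str) -> dict:
    """Compute simple per-page layout signals from extracted text."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return {"dot_leader_ratio": 0.0, "digit_end_ratio": 0.0, "has_keyword": False}
    head = " ".join(lines[:5]).lower()  # keywords usually sit near the top
    return {
        "dot_leader_ratio": sum(bool(DOT_LEADER.search(ln)) for ln in lines) / len(lines),
        "digit_end_ratio": sum(bool(ENDS_WITH_DIGIT.search(ln)) for ln in lines) / len(lines),
        "has_keyword": any(kw in head for kw in TOC_KEYWORDS),
    }


def classify_toc_page(page_text: str) -> Optional[bool]:
    """Return True/False for obvious cases, or None to escalate to the LLM."""
    s = toc_signals(page_text)
    if s["dot_leader_ratio"] >= 0.5 or (s["has_keyword"] and s["digit_end_ratio"] >= 0.6):
        return True
    if not s["has_keyword"] and s["dot_leader_ratio"] == 0 and s["digit_end_ratio"] <= 0.1:
        return False
    return None  # mixed signals: leave the decision to the LLM
```

Returning `None` for mixed signals keeps the heuristic conservative: only pages that look clearly like a TOC, or clearly unlike one, would skip the LLM call.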

Implementation

Full implementation and test results:
https://github.com/Unizoy/pageindex-optimized

Diff showing exact changes:
Unizoy/pageindex-optimized#1

The implementation is complete and tested. Early results look promising, and we would love feedback from the team on
edge cases we may not have hit yet. Happy to open a PR or discuss a different integration approach if preferred.
