
Optionally keep original title headers for main content extraction accuracy#1006

Open
mcPear wants to merge 3 commits into mozilla:main from surferseo:mg/title

mcPear commented Apr 17, 2026

Summary

Reader extraction currently rewrites all in-article h1 elements to h2 so the article title can remain the sole top-level heading in the reader UI. It also removes the first heading after the title whose text is similar to the title. These normalizations improve classic “reader mode” presentation but weaken the semantic outline of the page: crawlers, SEO tooling, and systems that infer structure from HTML (including retrieval and “reverse engineering” of how a page is organized) rely on stable heading levels that match the publisher’s markup.

This change preserves the original heading tag names and levels in the extracted content wherever we are not explicitly removing noise, so the serialized article HTML stays closer to the source document’s hierarchy. The new behavior is gated behind an option.

What changes (high level)

  • Stop blanket h1 → h2 replacement in article content
  • Stop duplicate-title header removal
  • Add unit tests
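
To illustrate the gating described above, here is a minimal sketch on a simplified model: headings are plain `{ tag, text }` objects rather than DOM nodes, and title similarity is reduced to an exact match after whitespace and case normalization. The function name `normalizeHeadings` and the option name `keepOriginalHeadings` are hypothetical and chosen for illustration only; this is not the code in the PR or in Readability itself.

```javascript
// Illustrative model only, not Readability's implementation.
// `keepOriginalHeadings` is a hypothetical option name used for this sketch.
function normalizeHeadings(headings, articleTitle, options = {}) {
  if (options.keepOriginalHeadings) {
    // Option on: return copies with tags and order exactly as in the source.
    return headings.map((h) => ({ ...h }));
  }

  // Simplified similarity: exact match after trimming, lowercasing,
  // and collapsing runs of whitespace.
  const normalize = (s) => s.trim().toLowerCase().replace(/\s+/g, " ");
  const title = normalize(articleTitle);

  let duplicateRemoved = false;
  const result = [];
  for (const h of headings) {
    // Existing behavior 1: drop the first heading that duplicates the title.
    if (!duplicateRemoved && normalize(h.text) === title) {
      duplicateRemoved = true;
      continue;
    }
    // Existing behavior 2: demote every remaining h1 to h2.
    result.push({ tag: h.tag === "h1" ? "h2" : h.tag, text: h.text });
  }
  return result;
}

// Usage: the same input with the option off (default) and on.
const source = [
  { tag: "h1", text: "My Title" },
  { tag: "h1", text: "Section" },
  { tag: "h3", text: "Sub" },
];
console.log(normalizeHeadings(source, "My Title"));
console.log(normalizeHeadings(source, "My Title", { keepOriginalHeadings: true }));
```

With the option off, the title-matching h1 is dropped and the remaining h1 becomes an h2; with it on, all three headings come back unchanged. Defaulting to off in this sketch keeps the existing output for consumers that do not pass the option.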

mcPear added 2 commits April 17, 2026 16:19
Keep original title headers for SEO reverse-engineering research accuracy
mcPear changed the title from "chore: optionally keep original title headers" to "Optionally keep original title headers for main content extraction accuracy" Apr 17, 2026
gijsk (Contributor) left a comment

Thanks for submitting a PR.

I'm confused about this being labeled a "chore". And I think that the relabeling of h1 to h2 and the removal of headers that match the title should be split into different options. The h1-to-h2 replacement option would fix #863. I would be tempted to make not swapping the default, though that could break existing consumers: I can fix Firefox, but I'm not sure about others.
