Optionally keep original title headers for main content extraction accuracy by mcPear · Pull Request #1006 · mozilla/readability

mcPear · 2026-04-17T14:21:48Z

Summary

Reader extraction currently rewrites all in-article h1 elements to h2 so the article title can remain the sole top-level heading in the reader UI. It also removes the first heading after the title that closely resembles it. These normalizations improve classic “reader mode” presentation but weaken the semantic outline of the page: crawlers, SEO tooling, and systems that infer structure from HTML (including retrieval and “reverse engineering” of how a page is organized) rely on stable heading levels that match the publisher’s markup.

This change preserves the original heading tag names and levels in the extracted content wherever we are not explicitly removing noise, so the serialized article HTML stays closer to the source document’s hierarchy. The new behavior is gated behind an option.
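To make the two normalizations and the gating concrete, here is a minimal sketch on a simplified model. It is not the library's code: headings are plain objects rather than DOM nodes, the option name `keepOriginalHeadings` is hypothetical (the actual option name is not stated in this description), and the title check is a crude stand-in for whatever similarity heuristic the library uses.

```javascript
// Hypothetical, simplified model: nodes are plain objects, not DOM elements.
// `keepOriginalHeadings` is an illustrative option name, not taken from the PR.

// Crude stand-in for a title-similarity check:
// case-insensitive equality after collapsing whitespace.
function looksLikeTitle(text, title) {
  const norm = (s) => s.trim().replace(/\s+/g, " ").toLowerCase();
  return norm(text) === norm(title);
}

function normalizeHeadings(nodes, title, options = {}) {
  if (options.keepOriginalHeadings === true) {
    // Preserve the publisher's heading tags and levels untouched.
    return nodes.map((n) => ({ ...n }));
  }
  let removedDuplicate = false;
  const out = [];
  for (const node of nodes) {
    const isHeading = node.tag === "h1" || node.tag === "h2";
    // Drop only the first heading that repeats the title.
    if (!removedDuplicate && isHeading && looksLikeTitle(node.text, title)) {
      removedDuplicate = true;
      continue;
    }
    // Demote every remaining h1 so the title stays the only top-level heading.
    out.push(node.tag === "h1" ? { ...node, tag: "h2" } : { ...node });
  }
  return out;
}

const article = [
  { tag: "h1", text: "My Article" },
  { tag: "p", text: "Intro." },
  { tag: "h1", text: "Part One" },
];

// Default path: duplicate title heading removed, remaining h1 demoted.
console.log(normalizeHeadings(article, "My Article"));
// Opt-in path: the original outline is returned unchanged.
console.log(normalizeHeadings(article, "My Article", { keepOriginalHeadings: true }));
```

With the flag off, the sketch mirrors the reader-mode presentation described above; with it on, the serialized outline matches the source markup.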

What changes (high level)

Stop blanket h1 → h2 replacement in article content
Stop duplicate-title header removal
Add unit tests

Keep original title headers for SEO reverse-engineering research accuracy

gijsk

Thanks for submitting a PR.

I'm confused about this being labeled a "chore". And I think that the relabeling of h1 to h2 and removing headers that match the titles should be split into different options. The h1-to-h2 replacement would fix #863. I would be tempted to make not swapping it the default, though it would potentially break existing consumers - I can fix Firefox but I'm not sure about others.
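One way the suggested split could look, as a rough sketch only: two independent booleans, each defaulting to the existing behavior so current consumers are unaffected. Both option names (`keepH1`, `keepTitleDuplicates`) and the helper `resolveHeadingOptions` are hypothetical and do not come from the PR or the library.

```javascript
// Hypothetical option names; neither is taken from the PR or the library.
const DEFAULTS = Object.freeze({
  keepH1: false,              // false = existing behavior: rewrite h1 to h2
  keepTitleDuplicates: false, // false = existing behavior: drop the title-like heading
});

// Merge user-supplied options over the defaults, ignoring non-boolean values
// so a malformed option cannot silently change extraction behavior.
function resolveHeadingOptions(userOptions = {}) {
  const resolved = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS)) {
    if (typeof userOptions[key] === "boolean") {
      resolved[key] = userOptions[key];
    }
  }
  return resolved;
}

// No options: both normalizations stay on, as today.
console.log(resolveHeadingOptions());
// A consumer can opt out of h1 demotion without touching duplicate removal.
console.log(resolveHeadingOptions({ keepH1: true }));
```

In this shape, changing the default for h1 swapping would be a one-line edit to `DEFAULTS`, which is also where the compatibility risk for existing consumers would sit.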

mcPear added 2 commits April 17, 2026 16:19

chore: optionally keep original title headers

1d7e86c

Merge pull request #1 from surferseo/mg/title

ca7c406

Keep original title headers for SEO reverse-engineering research accuracy

mcPear changed the title ~~chore: optionally keep original title headers~~ Optionally keep original title headers for main content extraction accuracy Apr 17, 2026

gijsk requested changes Apr 20, 2026


fix: preserve all h1s before the article subtree

657a64a


mcPear wants to merge 3 commits into mozilla:main from surferseo:mg/title
