Fix : Decode HTML entities in author names before tokenization #4761

Open
dikshaa2909 wants to merge 1 commit into aboutcode-org:develop from dikshaa2909:fix-author-html-entity-4760

@dikshaa2909 dikshaa2909 commented Feb 18, 2026

Fixes #4760

Summary

This PR fixes incorrect author detection when author names contain HTML entities or the accented characters they decode to (e.g., &uuml;, &ouml;, or a literal ü).

Previously, author names like Ceki G&uuml;lc&uuml; (Ceki Gülcü) were missed entirely during copyright scanning because the tokenizer's proper noun (NNP) pattern did not match the non-ASCII characters that result from HTML entity decoding.

Root Cause

HTML entities were already being decoded with html.unescape() (line 427 in src/cluecode/copyrights.py), but the resulting non-ASCII characters (ü, ö, ä, etc.) were not recognized by the NNP tokenizer patterns. The [a-z0-9]+ character class in the patterns on lines 2221 and 2261 matches only ASCII lowercase letters and digits, so names with accented characters were tagged as common nouns (NN) instead of proper nouns (NNP), which prevented author detection.
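The failure can be reproduced outside the scanner with a minimal sketch. It uses only the standard library and the pre-fix pattern quoted in this description; the whitespace split is an assumption standing in for the real tokenizer, and the variable names are illustrative.

```python
import html
import re

# Pre-fix NNP pattern as quoted in this PR description (line 2221).
OLD_NNP = re.compile(r'^([A-Z][a-z0-9]+){1,2}[\.,]?$')

# Entity decoding itself works: &uuml; becomes the character ü.
decoded = html.unescape("Ceki G&uuml;lc&uuml;")

# Assumption: a plain whitespace split stands in for the real tokenizer.
tokens = decoded.split()

# "Ceki" is ASCII-only and matches; "Gülcü" fails at the first "ü".
matches = {token: bool(OLD_NNP.match(token)) for token in tokens}
print(matches)  # {'Ceki': True, 'Gülcü': False}
```

The second token never reaches the NNP tag, so the name is not assembled into an author.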

Fix

Extended the NNP (proper noun) tokenizer patterns in src/cluecode/copyrights.py to also accept the lowercase accented letters in the range à-ÿ:

  • Line 2221: Changed r'^([A-Z][a-z0-9]+){1,2}[\.,]?$' to r'^([A-Z][a-zà-ÿ0-9]+){1,2}[\.,]?$'
  • Line 2261: Changed r'^([A-Z][a-z0-9]+){1,2}\.?,?$' to r'^([A-Z][a-zà-ÿ0-9]+){1,2}\.?,?$'

This ensures:

  • Proper recognition of names with extended Latin characters (à-ÿ) after HTML entity decoding
  • Correct author extraction for names that contain these accented letters
  • No impact on existing ASCII-only author names
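As a rough check of the claims above, the sketch below compares the old and new patterns quoted in this description on a few sample tokens. The sample names and the helper `matches_all` are hypothetical and are not taken from the test data. The last list illustrates, as an observation rather than part of this PR, that the range à-ÿ spans only U+00E0 to U+00FF and that the leading [A-Z] is unchanged.

```python
import re

# Patterns quoted in this PR description (lines 2221 and 2261).
OLD_PATTERNS = [
    r'^([A-Z][a-z0-9]+){1,2}[\.,]?$',
    r'^([A-Z][a-z0-9]+){1,2}\.?,?$',
]
NEW_PATTERNS = [
    r'^([A-Z][a-zà-ÿ0-9]+){1,2}[\.,]?$',
    r'^([A-Z][a-zà-ÿ0-9]+){1,2}\.?,?$',
]


def matches_all(patterns, token):
    """Return True if every pattern in the list matches the token."""
    return all(re.match(p, token) is not None for p in patterns)


# Hypothetical sample tokens, chosen for illustration only.
ascii_tokens = ["Ceki", "Smith,", "JavaDoc."]
accented_tokens = ["Gülcü", "Müller,", "Böhm."]
outside_range = ["Dvořák", "Élodie"]  # ř is U+0159; É is an uppercase initial

# ASCII-only tokens behave the same under both pattern sets.
ascii_unchanged = all(
    matches_all(OLD_PATTERNS, t) == matches_all(NEW_PATTERNS, t)
    for t in ascii_tokens
)

# Tokens with lowercase accented letters match only the new patterns.
accented_now_match = all(
    not matches_all(OLD_PATTERNS, t) and matches_all(NEW_PATTERNS, t)
    for t in accented_tokens
)

# Letters outside à-ÿ, or an accented capital, still do not match.
still_unmatched = [t for t in outside_range if not matches_all(NEW_PATTERNS, t)]

print(ascii_unchanged, accented_now_match, still_unmatched)
```

Under these assumptions the ASCII tokens are unaffected and the accented tokens become NNP candidates, while tokens such as Dvořák or Élodie would need a wider character class.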

Tests

  • Verified fix with existing test:
    • tests/cluecode/data/authors/author_html_entity.java - now correctly detects all 3 authors including Ceki Gülcü
    • Corresponding .yml expected output validation passes
  • Ran full test suite: pytest tests/cluecode/test_copyrights.py --test-suite all
    • Before: 4618 passed, 2 failed, 13 xfailed
    • After: 4620 passed, 0 failed, 13 xfailed
  • Specific test cases now passing:
    • test_authors_author_html_entity_java_14

No regressions introduced.

Checklist

  • Reviewed contribution guidelines
  • PR is descriptively titled and links the original issue (Author missed due to HTML encoded characters #4760)
  • Tests pass locally
  • Commits are in a uniquely-named feature branch
  • Updated documentation pages (not applicable)
  • Updated CHANGELOG.rst (not required for this fix)

…okenization

Signed-off-by: dikshaa2909 <dikshadeware@gmail.com>
@dikshaa2909 dikshaa2909 changed the title from "Fix #4760: Decode HTML entities in author names before tokenization" to "Fix : Decode HTML entities in author names before tokenization" Feb 18, 2026

Development

Successfully merging this pull request may close these issues.

Author missed due to HTML encoded characters