Fix : Decode HTML entities in author names before tokenization by dikshaa2909 · Pull Request #4761 · aboutcode-org/scancode-toolkit

dikshaa2909 · 2026-02-18T17:31:42Z

Fixes #4760

Summary

This PR fixes missed author detections when author names contain HTML entities (e.g., &uuml; and &ouml;, which decode to ü and ö).

Previously, author names like Ceki G&uuml;lc&uuml; (decoded: Ceki Gülcü) were missed entirely during copyright scanning because the tokenizer's proper noun (NNP) pattern did not match the non-ASCII characters produced by HTML entity decoding.

Root Cause

While HTML entities were being decoded using html.unescape() (line 427 in src/cluecode/copyrights.py), the resulting Unicode characters (ü, ö, ä, etc.) were not recognized by the NNP tokenizer patterns. The regex patterns [a-z0-9]+ on lines 2221 and 2261 only matched ASCII lowercase letters and digits, causing names with accented characters to be misclassified as common nouns (NN) instead of proper nouns (NNP), which prevented author detection.
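The root cause can be reproduced in isolation. The following is a minimal sketch, not the scanner's actual code path: it uses the pre-fix pattern quoted above, and the whitespace split and the helper name old_nnp_matches are assumptions made for illustration only.

```python
import html
import re

# Pre-fix pattern as quoted in this PR description (ASCII-only lowercase class).
OLD_NNP = re.compile(r'^([A-Z][a-z0-9]+){1,2}[\.,]?$')


def old_nnp_matches(token):
    """Return True if the pre-fix pattern accepts the token as a proper noun."""
    return OLD_NNP.match(token) is not None


# Entity decoding itself already worked before the fix.
decoded = html.unescape("Ceki G&uuml;lc&uuml;")

# Hypothetical whitespace split; the real tokenizer is more involved.
results = {token: old_nnp_matches(token) for token in decoded.split()}

print(decoded)
print(results)
```

In this sketch the decoded surname is rejected by the ASCII-only class while the first name is accepted, which is consistent with the misclassification described above.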

Fix

Extended the NNP (proper noun) tokenizer patterns in src/cluecode/copyrights.py to accept lowercase extended Latin characters in the range à-ÿ:

Line 2221: Changed r'^([A-Z][a-z0-9]+){1,2}[\.,]?$' to r'^([A-Z][a-zà-ÿ0-9]+){1,2}[\.,]?$'
Line 2261: Changed r'^([A-Z][a-z0-9]+){1,2}\.?,?$' to r'^([A-Z][a-zà-ÿ0-9]+){1,2}\.?,?$'
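To see what the extended character class accepts, here is a small sketch that applies both updated patterns to individual tokens. The helper name accepted and the sample tokens are illustrative assumptions, not part of the PR. The range à-ÿ spans U+00E0 to U+00FF, so it also contains the division sign ÷ (U+00F7), it does not cover letters outside Latin-1 such as ř (U+0159), and the leading [A-Z] still requires an ASCII capital.

```python
import re

# Post-fix patterns as quoted in this PR description.
NEW_NNP_A = re.compile(r'^([A-Z][a-zà-ÿ0-9]+){1,2}[\.,]?$')
NEW_NNP_B = re.compile(r'^([A-Z][a-zà-ÿ0-9]+){1,2}\.?,?$')


def accepted(token):
    """Return (pattern A result, pattern B result) for a single token."""
    return (
        NEW_NNP_A.match(token) is not None,
        NEW_NNP_B.match(token) is not None,
    )


# Illustrative tokens chosen to probe the edges of the new class.
samples = ["Gülcü", "Gülcü,", "Öztürk", "Dvořák", "A÷b"]

for token in samples:
    print(token, accepted(token))
```

Under these assumptions, Gülcü is accepted with or without a trailing comma, while Öztürk (non-ASCII capital) and Dvořák (letter outside à-ÿ) are still rejected, and a token containing ÷ is accepted. Whether those edge cases matter in practice depends on the rest of the grammar, which this sketch does not model.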

This ensures:

Proper recognition of names with extended Latin characters (à-ÿ) after HTML entity decoding
Correct author extraction for international names
No impact on existing ASCII-only author names
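The "no impact on ASCII-only names" point follows from the new class being a superset of the old one: for a token made only of ASCII characters, the added à-ÿ range can never match anything, so both patterns should return the same result. A rough spot check, using the line 2221 patterns quoted above and a hand-picked (hypothetical) token list, might look like this:

```python
import re

# Line 2221 pattern before and after the change, as quoted in this PR description.
OLD = re.compile(r'^([A-Z][a-z0-9]+){1,2}[\.,]?$')
NEW = re.compile(r'^([A-Z][a-zà-ÿ0-9]+){1,2}[\.,]?$')

# Hypothetical ASCII-only tokens, including some that should not match.
ascii_tokens = [
    "Ceki", "McDonald", "Smith,", "Jr.", "IBM",
    "john", "Linux2", "O'Neil", "JavaScript",
]

old_results = [OLD.match(t) is not None for t in ascii_tokens]
new_results = [NEW.match(t) is not None for t in ascii_tokens]

# Any token listed here would indicate a behavior change on ASCII input.
differences = [
    t for t, o, n in zip(ascii_tokens, old_results, new_results) if o != n
]

print(dict(zip(ascii_tokens, new_results)))
print("differences:", differences)
```

This is only a spot check on a few tokens; the full test suite results reported under Tests are the actual regression evidence.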

Tests

Verified fix with existing test:
- tests/cluecode/data/authors/author_html_entity.java - now correctly detects all 3 authors including Ceki Gülcü
- Corresponding .yml expected output validation passes
Ran full test suite: pytest tests/cluecode/test_copyrights.py --test-suite all
- Before: 4618 passed, 2 failed, 13 xfailed
- After: 4620 passed, 0 failed, 13 xfailed
Specific test cases now passing:
- test_authors_author_html_entity_java_14

No regressions introduced.

Checklist

Reviewed contribution guidelines
PR is descriptively titled and links the original issue (Author missed due to HTML encoded characters #4760)
Tests pass locally
Commits are in a uniquely-named feature branch
Updated documentation pages (not applicable)
Updated CHANGELOG.rst (not required for this fix)

Fix aboutcode-org#4760: Decode HTML entities in author names before t…

3264e56

…okenization Signed-off-by: dikshaa2909 <dikshadeware@gmail.com>

dikshaa2909 changed the title ~~Fix #4760: Decode HTML entities in author names before tokenization~~ Fix : Decode HTML entities in author names before tokenization Feb 18, 2026

dikshaa2909 mentioned this pull request Feb 18, 2026

Author missed due to HTML encoded characters #4760

Open

dikshaa2909 wants to merge 1 commit into aboutcode-org:develop from dikshaa2909:fix-author-html-entity-4760
