Add UN M.49 region containment support #193
Merged
Resolve #175 - match a UN M.49 region group against a contained region, e.g. es-419 (Latin America) matches es-MX (Mexico).

- Add UnM49Data sourced from the CLDR territoryContainment data, parsed with XmlReader to keep the library AOT compatible, generated and embedded following the existing dataset pattern.
- Add LanguageLookup.IsMatch(prefix, tag, regionContainment) as an opt-in overload; the existing two-argument IsMatch is unchanged.
- Add LanguageLookup.ExpandRegion to expand a region into its containing UN M.49 groups.
- Fix ValidateExtendedLanguage to require 3 alpha so a numeric region following the language parses as a region, e.g. es-419.
- Restore previously trimmed control flow comments in the parser and lookup.
- Update README and HISTORY, add UN M.49 references, bump version to 1.4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds UN M.49 (CLDR territoryContainment) region containment support to the LanguageTags library so numeric region groups (e.g., 419) can be matched against contained regions (e.g., MX), and enables region expansion to ancestor group regions.
Changes:
- Introduces `UnM49Data` (XML loader + JSON/codegen + embedded generated dataset) sourced from Unicode CLDR territory containment.
- Adds `LanguageLookup.IsMatch(prefix, tag, regionContainment)` overload and `LanguageLookup.ExpandRegion()` for containment-based matching/expansion.
- Fixes parsing so `extlang` requires `3ALPHA`, ensuring numeric regions like `es-419` parse as a region.
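To make the parsing fix concrete, here is a minimal sketch of how the subtag that follows the primary language can be classified using the RFC 5646 length and character rules. `ClassifySubtagAfterLanguage` is a hypothetical helper written for illustration; it is not the library's `ValidateExtendedLanguage` and says nothing about how the real parser is structured.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of LanguageTags): classifies the subtag that
// follows the primary language by RFC 5646 length and character class.
static string ClassifySubtagAfterLanguage(string subtag)
{
    static bool IsAlpha(char c) => c is (>= 'a' and <= 'z') or (>= 'A' and <= 'Z');
    static bool IsDigit(char c) => c is >= '0' and <= '9';

    bool allAlpha = subtag.All(IsAlpha);
    bool allDigit = subtag.All(IsDigit);
    bool allAlphaNum = subtag.All(c => IsAlpha(c) || IsDigit(c));

    return subtag.Length switch
    {
        3 when allAlpha => "extlang",   // extlang = 3ALPHA, so "419" cannot be one
        4 when allAlpha => "script",    // script = 4ALPHA
        2 when allAlpha => "region",    // region = 2ALPHA
        3 when allDigit => "region",    // region = 3DIGIT (UN M.49 code)
        4 when allAlphaNum && IsDigit(subtag[0]) => "variant", // DIGIT 3alphanum
        >= 5 and <= 8 when allAlphaNum => "variant",           // 5*8alphanum
        _ => "other",                   // singletons, malformed subtags, etc.
    };
}

Console.WriteLine(ClassifySubtagAfterLanguage("419")); // region
Console.WriteLine(ClassifySubtagAfterLanguage("yue")); // extlang
```

Because the extlang rule demands three letters, a three-digit subtag falls through to the region rule instead of being consumed as an extended language.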
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| version.json | Bumps library version to 1.4. |
| README.md | Documents UN M.49 containment matching and region expansion; updates links and release-notes snippet. |
| HISTORY.md | Adds 1.4 release entry describing containment support and parser fix. |
| LanguageTagsCreate/CreateTagData.cs | Extends codegen pipeline to download/convert/generate UN M.49 data and code. |
| LanguageTags/UnM49Data.cs | Implements UN M.49 loader (XmlReader), JSON serialization, code generation, and query APIs (Find/Contains/GetAncestors). |
| LanguageTags/UnM49DataGen.cs | Adds generated embedded UN M.49 dataset used by UnM49Data.Create(). |
| LanguageData/unm49.json | Adds generated JSON form of the UN M.49 dataset. |
| LanguageTags/LanguageSchema.cs | Registers UnM49Data in the source-generated JsonSerializerContext. |
| LanguageTags/LanguageLookup.cs | Adds containment-aware matching overload and ExpandRegion(); wires in UN M.49 dataset usage. |
| LanguageTags/LanguageTagParser.cs | Tightens extlang validation to 3ALPHA so numeric regions don’t misparse as extlang; restores control-flow comments. |
| LanguageTagsTests/UnM49Tests.cs | Adds tests for UN M.49 dataset loading/round-tripping and basic query behavior. |
| LanguageTagsTests/LanguageLookupTests.cs | Adds tests for containment matching (directional + opt-in) and ExpandRegion(). |
- Region containment matching now substitutes the candidate region and reuses the plain matcher, preserving variant, extension, and private use semantics to avoid false positives, e.g. es-419-nedis no longer matches es-MX while es-419 still matches es-MX-nedis.
- Clarify UnM49Data.Find returns the first of possibly multiple records for a code, and point to GetAncestors/Contains for full transitive containment.
- Clarify UnM49Record.Code may be an alphabetic CLDR grouping code (EU, EZ, UN).
- Add region containment tests for script preservation and prefix variants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
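The substitute-then-reuse approach described in this commit can be sketched roughly as below. Everything here is an assumption made for illustration: the `parents` map is a tiny hand-written subset of the containment data, `PlainMatch` assumes the plain matcher is a case-insensitive subtag-prefix comparison, and `RegionIndex`, `GroupContains`, and `IsMatchWithContainment` are hypothetical helpers, not the library's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hand-written subset of region containment (child -> direct parents), for illustration only.
var parents = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
{
    ["MX"] = new[] { "013" },
    ["013"] = new[] { "419", "019" },
    ["419"] = new[] { "019" },
    ["019"] = new[] { "001" },
};

// Assumed plain matcher: the tag equals the prefix or extends it at a subtag boundary.
static bool PlainMatch(string prefix, string tag) =>
    tag.Equals(prefix, StringComparison.OrdinalIgnoreCase) ||
    tag.StartsWith(prefix + "-", StringComparison.OrdinalIgnoreCase);

// Simplified region finder: skips extlang (3 alpha) and script (4 alpha) subtags.
static int RegionIndex(string[] subtags)
{
    for (int i = 1; i < subtags.Length; i++)
    {
        string s = subtags[i];
        bool alpha = s.All(char.IsLetter);
        bool digit = s.All(char.IsDigit);
        if ((s.Length == 2 && alpha) || (s.Length == 3 && digit)) return i;
        if (!(alpha && (s.Length == 3 || s.Length == 4))) break;
    }
    return -1;
}

// True when 'group' transitively contains 'region'.
bool GroupContains(string group, string region)
{
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    var pending = new Queue<string>();
    pending.Enqueue(region);
    while (pending.Count > 0)
    {
        if (!parents.TryGetValue(pending.Dequeue(), out var direct)) continue;
        foreach (var parent in direct)
        {
            if (parent.Equals(group, StringComparison.OrdinalIgnoreCase)) return true;
            if (seen.Add(parent)) pending.Enqueue(parent);
        }
    }
    return false;
}

// Substitute the prefix's group for the tag's region, then reuse the plain matcher,
// so any trailing variant, extension, or private use subtags are compared as usual.
bool IsMatchWithContainment(string prefix, string tag)
{
    if (PlainMatch(prefix, tag)) return true;
    var p = prefix.Split('-');
    var t = tag.Split('-');
    int pi = RegionIndex(p), ti = RegionIndex(t);
    if (pi < 0 || ti < 0 || !GroupContains(p[pi], t[ti])) return false;
    t[ti] = p[pi];
    return PlainMatch(prefix, string.Join("-", t));
}

Console.WriteLine(IsMatchWithContainment("es-419", "es-MX-nedis"));  // True
Console.WriteLine(IsMatchWithContainment("es-419-nedis", "es-MX"));  // False
```

Substituting only the region and delegating the rest keeps the containment logic small: the plain matcher still decides whether the remaining subtags line up.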
"variant" has a specific RFC 5646 meaning, so describe the expanded entries as region substituted tags instead, in the XML doc and README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ptr727 added a commit that referenced this pull request on Jun 26, 2026
Release: promote develop to main

- UN M.49 region containment (#193, version floor 1.4)
- Release-workflow fixes re-synced from template (#196: #213/#214/#217)

Conflicts in three workflow files were the parallel setup-dotnet dependabot bump (identical on both branches); resolved to develop's authoritative versions (newer template re-sync + the workflow fixes).
Resolves #175.

What

Matching a UN M.49 region group against a contained region now works, e.g. `es-419` (Latin America and the Caribbean) matches `es-MX` (Mexico).

Changes

- `UnM49Data` dataset sourced from Unicode CLDR `territoryContainment`, parsed with `XmlReader` (reflection-free, which keeps the library AOT compatible; `XmlSerializer` is avoided), then JSON converted and code generated following the existing ISO/RFC dataset pattern.
- `LanguageLookup.IsMatch(prefix, tag, regionContainment)`: opt-in overload. The existing two-argument `IsMatch` is unchanged and delegates with `regionContainment: false`, so there is no behaviour change and no extra work in the default path. Matching is directional: the broad group in the prefix matches the specific region in the tag, not the reverse.
- `LanguageLookup.ExpandRegion`: expands a tag region into the tag plus one region-substituted tag per containing UN M.49 group, e.g. `es-MX` → `es-013`, `es-419`, `es-019`, `es-001`.
- `ValidateExtendedLanguage` now requires 3 alpha per RFC 5646 (`extlang = 3ALPHA`), so a numeric region following the language parses as a region instead of an extended language, e.g. `es-419`.
- `version.json` bumped to 1.4.

Design notes

- `419` is a `grouping="true"` overlay in CLDR (not a canonical tree node), so the loader keeps grouping overlays and skips only `status="deprecated"` entries; dropping overlays would remove `419` entirely.
- `001` (World) is the universal ancestor, so `es-001` matches any `es` tag with a region; this is documented in the API and covered by a test.

Testing

`UnM49Tests` (dataset round-trip, `Find`, `Contains`, `GetAncestors`) and `LanguageLookupTests` theories covering the containment matching and `ExpandRegion`, including the directional and backward-compatibility cases.

🤖 Generated with Claude Code
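As a rough illustration of the loader rule and the ancestor expansion described above, the sketch below reads a small hand-written XML fragment with `XmlReader`, keeps grouping overlays, skips deprecated entries, and then walks upwards from a region. The fragment is simplified and is not a copy of the CLDR file; `Ancestors` and the `parents` map are hypothetical names, and the library's actual `UnM49Data` and `ExpandRegion` implementations may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;

// Hand-written, simplified fragment in the shape of territoryContainment data.
const string Xml = @"
<supplementalData>
  <territoryContainment>
    <group type='001' contains='019'/>
    <group type='419' contains='013' grouping='true'/>
    <group type='019' contains='013'/>
    <group type='019' contains='419' grouping='true'/>
    <group type='013' contains='MX'/>
    <group type='172' contains='AM' status='deprecated'/>
  </territoryContainment>
</supplementalData>";

// child -> direct parents, built with the forward-only, reflection-free XmlReader.
var parents = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
using (var reader = XmlReader.Create(new StringReader(Xml)))
{
    while (reader.Read())
    {
        if (reader.NodeType != XmlNodeType.Element || reader.Name != "group") continue;
        // Skip only deprecated entries; grouping overlays such as 419 are kept.
        if (reader.GetAttribute("status") == "deprecated") continue;
        var group = reader.GetAttribute("type");
        var contains = reader.GetAttribute("contains");
        if (group is null || contains is null) continue;
        foreach (var child in contains.Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            if (!parents.TryGetValue(child, out var list))
                parents[child] = list = new List<string>();
            list.Add(group);
        }
    }
}

// Breadth-first walk from a region to every group that transitively contains it.
List<string> Ancestors(string region)
{
    var result = new List<string>();
    var pending = new Queue<string>();
    pending.Enqueue(region);
    while (pending.Count > 0)
    {
        if (!parents.TryGetValue(pending.Dequeue(), out var direct)) continue;
        foreach (var parent in direct)
        {
            if (result.Contains(parent, StringComparer.OrdinalIgnoreCase)) continue;
            result.Add(parent);
            pending.Enqueue(parent);
        }
    }
    return result;
}

// Region-substituted tags for es-MX under this fragment.
var expanded = Ancestors("MX").Select(g => "es-" + g).ToList();
Console.WriteLine(string.Join(", ", expanded)); // es-013, es-419, es-019, es-001
```

With this fragment, dropping the `grouping='true'` entries would leave `419` out of the walk altogether, which is the reason given above for keeping overlays.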