
Add Thai shaper, ported from HarfBuzz #376

Open
lehni wants to merge 1 commit into foliojs:master from lineto:feature/thai-shaper


@lehni lehni commented May 20, 2026

fontkit currently has no dedicated Thai shaper — Thai (thai script) falls through to DefaultShaper, which leaves SARA AM (U+0E33) intact. This breaks the GSUB chain rules every modern Thai font ships: with the buffer in [base, tone, SARA_AM] order the tone-mark-shifting ccmp lookups don't fire, so e.g. น้ำ keeps the regular uni0E49 instead of the uni0E49.small HarfBuzz produces.

This PR ports hb-ot-shaper-thai.cc:

  • SARA AM decomposition + NIKHAHIT reorder (always on). SARA AM (Thai U+0E33 / Lao U+0EB3) is split into NIKHAHIT (U+0E4D / U+0ECD) + SARA AA (U+0E32 / U+0EB2), and the NIKHAHIT walks backward over any above-base marks. After this [base, tone, SARA_AM] becomes [base, NIKHAHIT, tone, SARA_AA], which is the shape the font's ccmp was designed to match.
  • PUA tone/vowel-shift fallback for fonts without Thai GSUB. Mirrors HB's do_thai_pua_shaping: an above/below state machine assigns one of NOP/SD/SL/SDL/RD actions to each tone or vowel mark, then we remap the codepoint to its Windows or Mac PUA variant if the font ships one. Gated on the absence of a Thai script in GSUB.
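For reviewers, the SARA AM step can be sketched in isolation on a plain array of codepoints. This is a minimal illustration, not the code in this PR: the helper names (`isSaraAm`, `isAboveBaseMark`, `decomposeSaraAm`) are hypothetical, and the above-base mark set is an assumption patterned on HarfBuzz's `IS_ABOVE_BASE_MARK`. The real shaper works on glyph infos and also has to keep clusters consistent, which is left out here.

```javascript
// Illustrative sketch only; helper names are hypothetical, not fontkit's API.

// SARA AM is U+0E33 (Thai) or U+0EB3 (Lao); the two blocks differ by 0x80.
const isSaraAm = cp => (cp & ~0x80) === 0x0e33;
const nikhahitFromSaraAm = cp => cp - 0x0e33 + 0x0e4d; // U+0E4D / U+0ECD
const saraAaFromSaraAm = cp => cp - 1;                 // U+0E32 / U+0EB2

// Assumption: above-base mark set patterned on HarfBuzz's IS_ABOVE_BASE_MARK.
function isAboveBaseMark(cp) {
  const u = cp & ~0x80; // fold Lao onto Thai
  return (u >= 0x0e34 && u <= 0x0e37) ||
         (u >= 0x0e47 && u <= 0x0e4e) ||
         u === 0x0e31 || u === 0x0e3b;
}

function decomposeSaraAm(codePoints) {
  const out = [];
  for (const cp of codePoints) {
    if (!isSaraAm(cp)) {
      out.push(cp);
      continue;
    }
    // Walk backward over any above-base marks already emitted.
    let i = out.length;
    while (i > 0 && isAboveBaseMark(out[i - 1])) i--;
    out.splice(i, 0, nikhahitFromSaraAm(cp)); // NIKHAHIT goes before them
    out.push(saraAaFromSaraAm(cp));           // SARA AA stays in place
  }
  return out;
}

const hex = cps => cps.map(cp => cp.toString(16).toUpperCase().padStart(4, '0'));

// น้ำ = NO NU, MAI THO, SARA AM
console.log(hex(decomposeSaraAm([0x0e19, 0x0e49, 0x0e33])));
// prints [ '0E19', '0E4D', '0E49', '0E32' ]
```

Folding Lao onto Thai with `& ~0x80` lets one set of range checks serve both scripts.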

Also folds in a one-character typo fix in UnicodeLayoutEngine.js's Thai mark classification: 0x0E3D (an unassigned codepoint in the Thai block) was sitting in the Above_Right switch where 0x0E4D (NIKHAHIT) belongs. NIKHAHIT and the surrounding cases (MAI HAN-AKAT, SARA I/II/UE/UEE, MAITAIKHU, THANTHAKHAT, YAMAKKAN) are all classified as Top in Unicode's IndicPositionalCategory.

Tests

Four tests added under test/shaping.js, ported from HarfBuzz's sara-am.tests and hand-picked Thai phrases. Test fixture is the hinted Noto Sans Thai from googlefonts (SIL OFL). Three of the four exercise the SARA AM path and fail without this shaper registered.

Lao (lao script) is mapped to the same shaper since the reorder applies identically (codepoints offset by 0x80).

Closes #134, closes #133.

@lehni lehni force-pushed the feature/thai-shaper branch from 9b49c12 to 13766bc Compare May 20, 2026 19:14
- Decompose SARA AM (U+0E33) into NIKHAHIT + SARA AA and reorder NIKHAHIT past above-base marks, matching `preprocess_text_thai`
- Apply PUA tone/vowel shift fallback for legacy fonts without Thai GSUB, mirroring `do_thai_pua_shaping` (above/below state machines + Windows/Mac PUA mapping tables)
- Fall back to the buffer's Unicode script in OTLayoutEngine when neither GSUB nor GPOS picks an OT script — lets script-specific shapers run on fonts without matching GSUB/GPOS so the Thai PUA fallback is actually reachable
- Fix typo in `UnicodeLayoutEngine` Thai mark classification: `0x0E3D` (unassigned) → `0x0E4D` (NIKHAHIT)
- Register `thai` and `'lao '` (the 4-char OT tag with trailing space), add Noto Sans Thai + Noto Sans Lao (OFL) as test fixtures with 8 shaping tests
@lehni lehni force-pushed the feature/thai-shaper branch from 13766bc to 61bffba Compare May 20, 2026 20:10

Development

Successfully merging this pull request may close these issues.

Thai mark-to-mark positioning issue
Typo in Thai codepoint?
