
Issue #6: target-agnostic emitter — emitParser/emitLexer for JS/TS/Go/Rust #56

Open
johnsoncodehk wants to merge 27 commits into master from emit-ts-target


@johnsoncodehk johnsoncodehk commented Jun 21, 2026

Owner

Closes #6.

The emit layer is now exactly two target-parameterized APIs, emitParser(grammar, target) and emitLexer(grammar, target), where emitParser reuses emitLexer. One grammar derives a real, executable parser in TypeScript, Go, or Rust (alongside the optimized JS path) from a single definition.

One seam, four targets

A Target owns both halves, so emitParser(grammar, target) reuses the same target's emitLexer(grammar, target): jsTarget's parser embeds its SoA-int lexer, goTarget's embeds its Tok-list lexer — no cross-target lexer format is shared, so the optimized JS path keeps its integer-bitmask token dispatch while the portable targets keep a dependency-free byte scanner that compiles offline (no RE2 / regex crate / network). src/emit.ts is the entire public surface; the optimized emitter is simply jsTarget. Adding a language is one Target.
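
The wiring described above can be sketched roughly as follows. This is an illustration only: `Target`, `emitParser` and `emitLexer` are names from this PR, but the `Grammar` shape, the method signatures on `Target`, and `toyTarget` are assumptions made for the sketch, not the contents of src/emit.ts.

```typescript
// Hypothetical shapes: only the names Target, emitParser and emitLexer come
// from the PR. The fields and the Grammar type are assumed for illustration.
interface Grammar {
  name: string;
}

interface Target {
  name: string;
  // Each target renders its own lexer format.
  emitLexer(grammar: Grammar): string;
  // The parser embeds a lexer that is handed in by the caller.
  emitParser(grammar: Grammar, lexerSource: string): string;
}

function emitLexer(grammar: Grammar, target: Target): string {
  return target.emitLexer(grammar);
}

function emitParser(grammar: Grammar, target: Target): string {
  // Reuse the SAME target's lexer, never another target's token format.
  const lexer = emitLexer(grammar, target);
  return target.emitParser(grammar, lexer);
}

// A toy target that only demonstrates the wiring; it emits comments, not a parser.
const toyTarget: Target = {
  name: "toy",
  emitLexer: (g) => `// lexer for ${g.name}`,
  emitParser: (g, lexer) => `${lexer}\n// parser for ${g.name}`,
};

const emitted = emitParser({ name: "calc" }, toyTarget);
```

Under this shape, adding a language means supplying one more object that satisfies `Target`; the two public functions stay unchanged.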

Proven on both full real languages, three targets, byte-identical

javascript.ts and typescript.ts — the [Await]/[Yield] fork, left recursion, the regex-vs-division and template-interpolation lexer state machines, arrow functions, the no-in context, precedence-gated ?:/in/instanceof, and the full TS type grammar — emit to ts/go/rust with every CST byte-identical to the createParser interpreter. test/portable-targets.ts compiles and runs all three targets for sixteen grammars (the two real languages plus focused fixtures) on every CI run. The Rust output reaches oxc throughput and the Go output beats tsgo on the same corpus (an arena keeps both near zero-allocation). Byte-based Go/Rust use UTF-8 offsets — identical to the JS interpreter's for ASCII; non-ASCII offset units differ inherently.

The JS path also emits type-checked TypeScript

jsTarget produces a module that passes tsc --strict --noEmit, making the monomorphic type contract explicit and gated (emit-tsc-gate). The additions are erasable-only, so Node strips them at import — the runtime is byte-identical and the bench unchanged (~14×).

Also: JS line-terminator conformance

The lexers counted only LF as a line terminator, but ECMAScript also defines CR, LS, and PS (the set driving ASI and "no LineTerminator here"). Fixed across all four lexer implementations at once — the interpreter (gen-lexer.ts), the emitted JS (emit-lexer.ts), and the portable ts/go/rust — so they stay in lockstep. CRLF was already correct, so only lone-CR / LS/PS inputs change; e.g. return\r1 now applies ASI.
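
For reference, the four ECMAScript line terminators can be checked as in the sketch below. The helper names are hypothetical and are not taken from gen-lexer.ts or emit-lexer.ts; only the set of code points (LF, CR, LS, PS) comes from the description above.

```typescript
// The ECMAScript LineTerminator code points: LF, CR, LS, PS.
const LF = 0x000a;
const CR = 0x000d;
const LS = 0x2028;
const PS = 0x2029;

function isLineTerminator(cc: number): boolean {
  return cc === LF || cc === CR || cc === LS || cc === PS;
}

// True when the text skipped between two tokens contains a line terminator,
// which is the condition ASI and "no LineTerminator here" depend on.
function hasLineTerminator(skipped: string): boolean {
  for (let i = 0; i < skipped.length; i++) {
    if (isLineTerminator(skipped.charCodeAt(i))) return true;
  }
  return false;
}
```

With an LF-only check, the `\r` in `return\r1` would not count as a line break; with the full set it does.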

Why it's safe

  • emit-parser-verify: emitted JS byte-identical to the interpreter (0 mismatches).
  • portable-targets: sixteen grammars, ts/go/rust each compiled & run, CST ≡ the interpreter (accept and reject).
  • emit-tsc-gate: every emitted parser type-checks under tsc --strict.
  • Full suite 42/42.

emitParser now emits a standalone TypeScript module that passes `tsc --strict
--noEmit`, replacing the previously untyped JS output. This makes the emitted
parser's type contract explicit and gated by construction — the monomorphic
parse-state struct (Doc), the matcher/runtime signatures, the spare-buffer
mirrors, and the baked op/rule tables all carry types tsc verifies for
consistency. That contract is the part a future Go/Rust target must reproduce,
so surfacing it now (rather than deferring it to the first non-JS target) is the
de-risking first step of issue #6.

The additions are erasable TypeScript only (annotations, optional params, `!`
assertions) — Node runs the emitted parser by stripping types, so the runtime is
unchanged. The arity-looseness the JS output relied on (calling matchers with
omitted trailing diagnostic args) is replaced by explicit optional params, the
one JS-ism that would not survive a typed/Go/Rust target.

Gates:
- new emit-tsc-gate: the emitted parser type-checks under `tsc --strict` for the
  soa + emitted-lexer family (typescript, javascript, typescriptreact,
  javascriptreact). The fallback-lexer / non-soa path (yaml, html) is logged as
  deferred — it carries additional untyped surface and a pre-existing latent
  scope reference (the non-soa editCore branch names cs/ceOld/parenCachePos that
  exist only in the soa branch; unreached at runtime, hence invisible until now).
- emit-parser-verify unchanged: emitted CST stays byte-identical to the
  interpreter (109/109 in-repo + 401/401 external, 0 mismatches).
- bench unchanged (~14x): type-stripping happens once at import, not per parse.

Test harnesses that import the emitted module now write `.mts` so Node strips
types on import. K_ARR/T_ARR column widths are single-sourced in analyze() so
emitRuntime and emitDriver's spare buffers pick the same width.
… html)

Brings the yaml/html emit path under the same strict type-check as the ts/js
family, so the gate now covers every grammar. Three things this required:

- Hoist the edit-damage envelope (newLen/cs/ceNew/ceOld/charDelta) out of the
  e.soa window branch. shiftDiags(cs, ceOld, charDelta) runs in the SHARED
  post-fork settle, but those names were declared only in the soa branch — so
  the non-soa branch referenced undeclared variables. The path is unreached at
  runtime for the fallback grammars (they full-relex), which is why it stayed
  invisible; the tsc gate surfaced it. They derive only from shared inputs, so
  hoisting is behavior-neutral for soa and correct for non-soa. Same fix gates
  the soa-only parenCachePos cache-invalidation in the '>'-split.

- Type the non-soa piece-text columns (tkText/altText: string[]), assert the
  fallback column swap against the nullable spare buffers, and cast the baked
  LEX_GRAMMAR at the createLexer boundary.

- Give every baked Map/Set an explicit element type at emission. They inferred
  correctly only when non-empty (ts/js); an empty vocabulary set (yaml/html)
  collapsed to Map<unknown,unknown> / Set<never>.
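
A minimal sketch of the third point, assuming hypothetical helpers `bakeMap` and `bakeSet` (the emitter's real function names are not shown in this PR): the element types are written into the emitted source, so an empty collection no longer depends on inference.

```typescript
// Hypothetical helpers: they render source text for a baked Map/Set with the
// generic arguments spelled out, so an empty table keeps its element type.
function bakeMap(entries: [string, number][], keyType = "string", valueType = "number"): string {
  const body = entries.map(([k, v]) => `[${JSON.stringify(k)}, ${v}]`).join(", ");
  return `new Map<${keyType}, ${valueType}>([${body}])`;
}

function bakeSet(items: string[], elemType = "string"): string {
  const body = items.map((s) => JSON.stringify(s)).join(", ");
  return `new Set<${elemType}>([${body}])`;
}

const emptyMap = bakeMap([]);           // a grammar with an empty vocabulary
const keywordMap = bakeMap([["if", 1]]);
const emptySet = bakeSet([]);
```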

emit-lexer-verify's TYPE_KIND/LIT_KW/LIT_PU extraction regex now tolerates the
`new Map<string, number>(` generic. Full suite 41/41; emitted CST byte-identical
across all 6 grammars (incremental-grammars 610/610).
The agnosticism payoff of #6, proven by EXECUTION. emit-portable.ts adds
`emitPortableParser(grammar, target)`: one analysis → one plain-data IR → a parser
rendered in each target language through a `Target` interface. The same grammar
(examples/calc.ts) derives a TypeScript, a Go, and a Rust parser; the Go and Rust
sources are compiled (`go build` / `rustc`) and run, and every parser's CST is
compared node-for-node against the createParser interpreter.

This is a SEPARATE, minimal emitter from the optimized emit-parser.ts (no
incremental/recovery/arena — each target supplies its own runtime, as the issue
frames it). It is the real Target seam: adding a language is implementing one
`render(ir)`; buildIR is untouched.

Scope = the verifiable core: char-class tokens, recursive descent with backtracking
alternation and `*`, and a Pratt expression engine with operator precedence /
associativity, prefix unary, and parenthesised grouping. The portable lexer is a
dependency-free char scanner (no regex), so the emitted Go/Rust compile offline —
sidestepping both the full-TS lexer's lookahead (which Go's RE2 and Rust's regex
crate reject) and any crate fetch. buildIR THROWS on a construct it does not model
rather than emit a wrong parser; mixfix/postfix LEDs, sep/opt, and lexer lookahead
are the documented next increment.
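
For readers unfamiliar with the technique, a Pratt engine covering the same feature list (precedence, associativity, prefix unary, grouping) can be sketched as below. This is not the emitted parser: the binding-power table, the S-expression output, and the tiny char-scanning lexer are all assumptions chosen to keep the example short.

```typescript
type Tok = { kind: "num" | "op" | "eof"; text: string };

// A regex-free char scanner, in the spirit of the portable lexer.
function lex(src: string): Tok[] {
  const toks: Tok[] = [];
  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (c === " ") { i++; continue; }
    if (c >= "0" && c <= "9") {
      let j = i;
      while (j < src.length && src[j] >= "0" && src[j] <= "9") j++;
      toks.push({ kind: "num", text: src.slice(i, j) });
      i = j;
    } else if ("+-*/^()".includes(c)) {
      toks.push({ kind: "op", text: c });
      i++;
    } else {
      throw new Error(`unexpected character ${c} at ${i}`);
    }
  }
  toks.push({ kind: "eof", text: "" });
  return toks;
}

// [left bp, right bp]: right > left is left-associative, right < left is right-associative.
const BINARY: Record<string, [number, number]> = {
  "+": [10, 11], "-": [10, 11],
  "*": [20, 21], "/": [20, 21],
  "^": [31, 30],
};
const PREFIX_BP = 25; // unary minus binds tighter than * but looser than ^

function parse(src: string): string {
  const toks = lex(src);
  let pos = 0;

  function expr(minBp: number): string {
    const t = toks[pos++];
    let left: string;
    if (t.kind === "num") {
      left = t.text;
    } else if (t.kind === "op" && t.text === "-") {
      left = `(neg ${expr(PREFIX_BP)})`;
    } else if (t.kind === "op" && t.text === "(") {
      left = expr(0);
      if (toks[pos].text !== ")") throw new Error("expected )");
      pos++;
    } else {
      throw new Error(`unexpected token '${t.text}'`);
    }
    for (;;) {
      const op = toks[pos];
      const bp = op.kind === "op" ? BINARY[op.text] : undefined;
      if (!bp || bp[0] < minBp) break; // operator too loose for this level
      pos++;
      left = `(${op.text} ${left} ${expr(bp[1])})`;
    }
    return left;
  }

  const tree = expr(0);
  if (toks[pos].kind !== "eof") throw new Error("trailing input");
  return tree;
}
```

The split left/right binding powers are one common way to encode associativity; the emitter described here may encode it differently.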

Gate: test/portable-targets.ts (group emit-parity) — typescript + go + rust each
21/21 accept ≡ oracle and 7/7 reject ≡ oracle over an adversarial corpus
(precedence both directions, left-associativity, prefix chains, nested grouping,
multi-statement programs, the empty program, malformed input). Go/Rust toolchains
are optional — a missing `go`/`rustc` is skipped (the TS rendering needs only node).
Full suite 42/42.
@johnsoncodehk johnsoncodehk changed the title Emit type-checked TypeScript (tsTarget) — issue #6 first step Issue #6: type-checked emitter + target-agnostic Go/Rust parsers Jun 21, 2026
…sumer

`gen-ast-types.ts` emitted `<grammar>.cst-types.ts` (discriminated-union typing of
the CST). Those artifacts are gitignored build outputs, and nothing depended on
them: the only consumer was a non-gated smoke test, and gen-cst-match's `importFrom`
parameter (the cst-types path) was never used in its body — so the gated cst-match
subsystem is fully independent of cst-types.

Removed the generator and its smoke test, dropped the cst-types emit + the dead
`importFrom` parameter from the gen pipeline, and cleaned the references
(.gitignore/.gitattributes/CI comment/README diagram/emit-corpus filter).

Verified: `npm run gen` emits no cst-types and the committed artifacts stay in
sync; src type-checks; cst-match-totality 31356/0 and the full suite 42/42.
…oughput

Extends the target-agnostic emitter from the calc proof to examples/minijs.ts — a
real JavaScript subset (string/comment lexer, the full operator-precedence ladder,
call/member/index mixfix chains, arrays, and the common statement forms) — so the
emitted Go/Rust parsers can be benchmarked against oxc on the same bytes.

What grew:
- Lexer: driven by token-pattern.ts's structural recognizers (char runs, quote
  strings, line/block comments) — still a regex-free char scanner, so Go/Rust
  compile offline.
- Parser IR: opt/sep/inline-literal-alternation, Pratt bracket NUDs (grouping,
  array), and mixfix LEDs (call/member/index) tried before operators.
- Rust target: zero-allocation tokens (`&str` slices, Copy) and `&'static str` CST
  labels — no per-token/per-node String. This is decisive: the first naive version
  (String everywhere, a clone per peek) ran at 9 MB/s, slower than Go; the fix took
  it to 39 MB/s.

Verified: test/portable-targets.ts now covers calc + minijs; ts/go/rust each ≡ the
createParser CST (minijs 29/29 accept + 7/7 reject) and byte-identical on a 2.92 MB
corpus. Full suite 42/42.

Benchmark (oxc-parser 0.137, 2.92 MB JS-subset both engines accept, self-timed
lex+parse with black_box): derived-Rust 39 MB/s (0.97x oxc's time, i.e. parity), derived-Go
19 MB/s (2x oxc's time), oxc 38 MB/s. A grammar-DERIVED, un-hand-tuned Rust parser matches the
fastest hand-tuned native JS parser, while building a full CST. minijs is a subset
(oxc parses full JS), but both parse the same corpus, so it is a fair throughput
comparison on that work; the bench harness is not committed (it needs the external
oxc-parser package).
The Go target now allocates its CST from a flat arena instead of a heap *Cst per
node: nodes live in `nodes []Node` (a node is an int32 index), children in a flat
`kids []int32`, and in-progress children accumulate on a `scratch` stack.
Backtracking truncates the three slices to saved lengths; the slices keep their
capacity across parses, so a warmed parser allocates ~nothing. Indices (unlike the
previous pointers) survive slice reallocation, which is what makes the arena work.
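
The index-based layout can be illustrated in TypeScript as below (the actual change is in the emitted Go). The names `kids` and `scratch` follow the description above; the three parallel `node*` arrays stand in for `nodes []Node`, and the per-node record (kind, first child offset, child count) is an assumption of this sketch.

```typescript
type Mark = { nodes: number; kids: number; scratch: number };

class Arena {
  nodeKind: number[] = [];  // node i's kind
  nodeFirst: number[] = []; // node i's first child, as an offset into kids
  nodeCount: number[] = []; // node i's child count
  kids: number[] = [];      // flat child-index storage
  scratch: number[] = [];   // in-progress children, not yet owned by a node

  mark(): Mark {
    return { nodes: this.nodeKind.length, kids: this.kids.length, scratch: this.scratch.length };
  }

  // Backtracking: truncate every array to the saved lengths.
  restore(m: Mark): void {
    this.nodeKind.length = m.nodes;
    this.nodeFirst.length = m.nodes;
    this.nodeCount.length = m.nodes;
    this.kids.length = m.kids;
    this.scratch.length = m.scratch;
  }

  private push(kind: number, first: number, count: number): number {
    const id = this.nodeKind.length;
    this.nodeKind.push(kind);
    this.nodeFirst.push(first);
    this.nodeCount.push(count);
    this.scratch.push(id); // every new node is a pending child of its parent
    return id;
  }

  leaf(kind: number): number {
    return this.push(kind, 0, 0);
  }

  // Close a node: move the children pushed since scratchStart into kids.
  close(kind: number, scratchStart: number): number {
    const first = this.kids.length;
    for (let i = scratchStart; i < this.scratch.length; i++) this.kids.push(this.scratch[i]);
    const count = this.scratch.length - scratchStart;
    this.scratch.length = scratchStart;
    return this.push(kind, first, count);
  }

  childrenOf(id: number): number[] {
    return this.kids.slice(this.nodeFirst[id], this.nodeFirst[id] + this.nodeCount[id]);
  }
}

const arena = new Arena();
const saved = arena.mark();
arena.leaf(1);
arena.leaf(2);
arena.restore(saved);               // a failed branch leaves nothing behind

const start = arena.scratch.length;
const a = arena.leaf(1);
const b = arena.leaf(3);
const root = arena.close(9, start); // root owns a and b
```

Because a node is referred to by index rather than by pointer, growing the backing arrays never invalidates a reference held by a parent.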

This is the Go counterpart of the Rust target's zero-allocation change, and the
same allocation lever the optimized emit-parser.ts pays for in JS: it took the
derived Go parser from 19 MB/s to 67 MB/s (3.5x) on the 2.92 MB JS-subset corpus.

Verified: CST byte-identical to the interpreter on the corpus + the portable gate
(calc + minijs, ts/go/rust, 21/21+29/29 accept, 7/7 reject); the truncate-on-
backtrack reclamation is exercised by the reject cases. Full suite 42/42.

Benchmark vs tsgo (microsoft/typescript-go's native-Go parser, ParseSourceFile
only, both parse the corpus clean): derived-Go 67 MB/s, tsgo 33 MB/s. The 3.5x
arena win is the apples-to-apples result; the headline 2x-over-tsgo is partly
because minijs is a subset of TypeScript (tsgo builds a richer AST — trivia, full
node kinds — so it does more per node), not purely better codegen. Takeaway: a
grammar-derived parser with arena allocation is in the same league as a hand-tuned
native one; naive per-node allocation is what costs the 3.5x.
…ge 1)

Toward supporting the real grammar files, the portable lexer gains a GENERAL
matcher: a token whose shape the four fast paths (run/string/line/block) don't
cleanly recognise is now compiled, from its raw token-pattern AST, to a
backtracking-free matcher (literal / charClass / seq / ordered-alt / greedy-repeat
/ zero-width lookahead+anchor) — no regex engine, so it stays portable. This
replaces the previous over-eager `literalPrefix` heuristic that mis-classified
numbers/strings/decorators as line comments.

This handles the STATELESS real-JS token tier the fast paths could not: `\u`-escaped
identifiers, the decimal/hex number family with a `(?!IdentChar)` boundary, and
both-quote strings with escapes. examples/richtokens.ts exercises exactly these,
and the emitted lexer is verified ≡ createLexer (the gate's richtokens case:
14/14 accept, 5/5 reject — including the Hex-vs-Number boundary disambiguation).

Implemented in the TS target so far; Go/Rust throw a clear message on a `pattern`
token (their matcher port is the next stage), so calc/minijs stay green in all
three. Full suite 42/42.

Remaining for the real grammar files (each a further stage): port the matcher to
Go/Rust; the STATEFUL lexer (regex-vs-division context, template interpolation)
that javascript/typescript need; the markup/indent lexers (html/yaml); and the
full parser algebra (not/sameLine/exclude/ctxMode/tsRelax/+/…).
…vergence)

The target-agnostic lexer is now uniform across all three targets: the general
token-pattern matcher (stage 1, TS only) is ported to Go and Rust, so a `pattern`
token compiles to a backtracking-free matcher in every language — Go as
package-level `_mN(p int) int` funcs over a module-level source, Rust as named
`_mN(s, p) -> i64` funcs (closures can't recurse) threading the source as a param.

This is the lexer half of the issue-#6 target parameter: ONE target-agnostic lexer,
rendered per language. The optimized emit-lexer.ts stays a separate, JS-perf path —
it fills the arena parser's struct-of-arrays integer columns, a different token
contract than the portable Tok list, so merging would deoptimize it; the two
already share what should be shared (the token-pattern.ts algebra + recognizers).

Verified: examples/richtokens.ts (escaped idents, the number family with a boundary,
both-quote strings) now runs in ts/go/rust, each CST byte-identical to createParser
(gate: 14/14 accept + 5/5 reject per target). Full suite 42/42.
…(stage 3)

The portable lexer gains its first STATEFUL capability — the JS `/` problem. A `/`
starts a regex literal in expression context but is division after a value; the
lexer now threads the previous token plus a control-head paren stack to decide,
gating the regex token on the same prevIsValue predicate gen-lexer.ts uses. The
regexContext sets (division-after type/text, expression-start keywords, control-head
keywords, member accessors, ambiguous postfix ops) are baked from the grammar into
an IR.regexCtx and rendered per target: TS/Go via closures over the lex state, Rust
via a LexState struct (two closures can't co-capture the same mutable state).

examples/regexjs.ts (minijs + regex literals) verifies it: `a / b` is division,
`/re/` after `=`/keyword is a regex, `if (x) /re/` is a regex (control head),
`obj.for(x) / y` is division (member name, not a head), `[1,2] / 3` is division —
all ts/go/rust CSTs byte-identical to createParser (gate: 15/15 accept, 5/5 reject
per target). Full suite 42/42.
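
A heavily reduced sketch of the decision, covering only the cases listed above. The keyword sets here are small hand-written stand-ins for the grammar-derived IR.regexCtx, `}` and `++`/`--` are deliberately not modelled, and the function names are hypothetical.

```typescript
type Kind = "word" | "number" | "punct" | "regex";
interface Tok { kind: Kind; text: string }

// Illustrative subsets only; the real sets are baked from the grammar.
const CONTROL_HEADS = new Set(["if", "while", "for", "with"]);
const EXPR_KEYWORDS = new Set(["return", "typeof", "case", "in", "void", "delete", "throw", "new"]);

const isDigit = (c: string) => c >= "0" && c <= "9";
const isIdentStart = (c: string) =>
  (c >= "a" && c <= "z") || (c >= "A" && c <= "Z") || c === "_" || c === "$";

function lexSlashes(src: string): Tok[] {
  const toks: Tok[] = [];
  const parenIsHead: boolean[] = []; // one entry per open "("
  let closedHead = false;            // did the most recent ")" close a control head?

  const afterDot = () => toks.length >= 2 && toks[toks.length - 2].text === ".";

  const prevIsValue = (): boolean => {
    const prev = toks[toks.length - 1];
    if (!prev) return false;
    if (prev.kind === "number" || prev.kind === "regex") return true;
    if (prev.kind === "word") {
      // A keyword used as a member name (obj.for) is an ordinary value.
      return afterDot() || !(EXPR_KEYWORDS.has(prev.text) || CONTROL_HEADS.has(prev.text));
    }
    if (prev.text === ")") return !closedHead;
    return prev.text === "]";
  };

  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (c === " " || c === "\n" || c === "\t") { i++; continue; }
    if (isIdentStart(c) || isDigit(c)) {
      const kind: Kind = isDigit(c) ? "number" : "word";
      let j = i + 1;
      while (j < src.length && (isIdentStart(src[j]) || isDigit(src[j]))) j++;
      toks.push({ kind, text: src.slice(i, j) });
      i = j;
    } else if (c === "/" && !prevIsValue()) {
      let j = i + 1;
      while (j < src.length && src[j] !== "/") j += src[j] === "\\" ? 2 : 1;
      if (j >= src.length) throw new Error("unterminated regex");
      toks.push({ kind: "regex", text: src.slice(i, j + 1) });
      i = j + 1;
    } else {
      if (c === "(") {
        const prev = toks[toks.length - 1];
        const head = prev !== undefined && prev.kind === "word" &&
          CONTROL_HEADS.has(prev.text) && !afterDot();
        parenIsHead.push(head);
      } else if (c === ")") {
        closedHead = parenIsHead.pop() ?? false;
      }
      toks.push({ kind: "punct", text: c });
      i++;
    }
  }
  return toks;
}

const kinds = (src: string) => lexSlashes(src).map((t) => t.kind).join(" ");
```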

Also fixes a single-item negated char-class losing its parens (`!cc == 10` instead
of `!(cc == 10)`) in all three matchers — surfaced by the Go compiler, and by adding
regex-escape cases the earlier corpus had missed (an aggregate that passed for the
wrong reason). Remaining for the real grammar files: template interpolation, the
markup/indent lexers, and the full parser algebra.
…tage 4)

The portable lexer's second stateful feature: `${…}` interpolation. A `` ` `` opens a
span scanned to the next `${` (emit $templateHead) or closing `` ` `` (the whole token,
no substitution); a `}` that closes a hole resumes the span ($templateMiddle / Tail).
A templateStack of brace-depths decides which `}` closes the hole versus a nested
`{…}` (object/block) or nested template inside it. The parser's Pratt nud sees a
$templateHead and assembles head·expr·(middle·expr)*·tail into a synthetic $template
node, parsing each hole with the Pratt expression rule.

The lexer state machine generalises cleanly with the regex one: a grammar can have
regex, templates, or both, sharing one emit() / LexState (Rust: a struct that now also
carries the template_stack). examples/templatejs.ts (minijs + templates + a shorthand
object so a hole can hold `{…}`) verifies it: no-substitution, adjacent/multiple holes,
expressions in holes, NESTED templates, and an object inside a hole (the brace-depth
counter) — all ts/go/rust CSTs byte-identical to createParser (gate: 11/11 accept,
4/4 reject per target). Full suite 42/42.

Tagged templates (`` tag`…` `` — a postfix-token Pratt LED) are out of scope here;
that's a parser-algebra gap, the remaining work alongside the markup/indent lexers.
…gins)

The first parser-algebra construct toward the real grammar files: a LED whose
continuation is a single token, `$ X` (e.g. a tagged template `` tag`…` ``). buildPratt
classified LEDs only as binary (`$ op $`) or mixfix-literal (`$ lit …`) and threw on
this shape; it now collects such tokens into PrattRule.postfixToks, and each target
renders an LED arm that wraps `left X` into a node — tried like a mixfix led (binds
tight, no min-bp gate). When the postfix token is the template token the arm also
accepts a `$templateHead` and runs matchTemplate, so a tagged template can itself be
interpolated.

examples/templatejs.ts restores `[$, Template]`; the gate now covers `` tag`…` ``,
`` String.raw`a${b}c`.length ``, `` x.tag`${y}` `` (tagged after a member) across
ts/go/rust (15/15 accept, 3/3 reject per target). Full suite 42/42.
buildIR only accepted an inline `alt(...)` whose every branch was a literal (the
altlit fast path) and threw otherwise — the first parser-algebra construct
javascript.ts hits. It now compiles a non-literal alternation into an `alt` step
whose branches are full sub-sequences, rendered as a backtracking try-each: each
branch saves the position (and the arena lengths) and restores them on failure
before the next branch. Rendered as an immediately-applied closure in every target
(Go needs `;` between the consecutive block statements; Rust reuses the closure body
in both the top-level and in-closure step contexts).
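
The try-each shape can be sketched with plain combinators, as below. The `State` record and the `tok`/`seq`/`alt` helpers are assumptions made for illustration; in the emitter each branch is rendered inline, and `out` here stands in for the arena lengths that are saved and restored.

```typescript
interface State { toks: string[]; pos: number; out: string[] }
type Step = (s: State) => boolean;

// Match one token and record it in the output.
const tok = (text: string): Step => (s) => {
  if (s.toks[s.pos] !== text) return false;
  s.out.push(text);
  s.pos++;
  return true;
};

// A branch is a full sub-sequence: every step must succeed in order.
const seq = (...steps: Step[]): Step => (s) => steps.every((step) => step(s));

// Ordered alternation: save the position and output length once, and
// restore both after each failed branch before trying the next one.
const alt = (...branches: Step[]): Step => (s) => {
  const savedPos = s.pos;
  const savedLen = s.out.length;
  for (const branch of branches) {
    if (branch(s)) return true;
    s.pos = savedPos;
    s.out.length = savedLen;
  }
  return false;
};

const rule = alt(seq(tok("a"), tok("b")), seq(tok("a"), tok("c")));
const state: State = { toks: ["a", "c"], pos: 0, out: [] };
const matched = rule(state);
```

In this run the first branch consumes `a`, fails on `b`, and is rolled back, so the second branch starts from a clean position and output.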

examples/altjs.ts (object keys are `alt(Ident | Str | Number)`) verifies it across
ts/go/rust — 9/9 accept, 4/4 reject per target, byte-identical to createParser.
Full suite 42/42. With this, javascript.ts clears the inline-alt wall and advances
to the next parser construct (a Pratt NUD shape).
Two coupled parser-algebra constructs, the next javascript.ts wall after inline-alt:

- A `not` step — zero-width negative lookahead: try the inner steps, restore the
  position (and arena/kids) unconditionally, succeed iff they did NOT match. Rendered
  as an immediately-applied closure in every target (Rust shares one body across the
  two step contexts, like `alt`).

- General Pratt NUD sequences (PrattRule.nudSeqs) — a NUD that is neither a bare token,
  a prefix op, nor a literal-led bracket: a backtracking try-each sequence producing a
  node. Covers the reserved-word-guarded identifier (`not(kw)… Ident`) and the
  quantifier-first class expression (`Decorator? class Ident? … { … }`). A single
  transparent group unwraps to its body; a group carrying capBelow/ctxMode/suppress
  (arrow functions, await/yield context) is explicitly deferred with a clear message.
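
The `not` step and the reserved-word guard can be sketched as follows. The `Cursor` type, the three-word `RESERVED` set, and the helper names are hypothetical; only the restore-unconditionally, succeed-iff-no-match behaviour is taken from the description above.

```typescript
interface Cursor { toks: string[]; pos: number }
type Step = (c: Cursor) => boolean;

const RESERVED = new Set(["if", "class", "return"]); // tiny illustrative set

function isWord(t: string): boolean {
  const c = t.charCodeAt(0);
  return (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95 || c === 36;
}

const keyword: Step = (c) => {
  if (c.pos < c.toks.length && RESERVED.has(c.toks[c.pos])) { c.pos++; return true; }
  return false;
};

const word: Step = (c) => {
  if (c.pos < c.toks.length && isWord(c.toks[c.pos])) { c.pos++; return true; }
  return false;
};

// Zero-width negative lookahead: run the inner step, restore the position
// whether or not it matched, and succeed only if it did NOT match.
const not = (inner: Step): Step => (c) => {
  const saved = c.pos;
  const didMatch = inner(c);
  c.pos = saved;
  return !didMatch;
};

// Reserved-word-guarded identifier: not(keyword) followed by a word.
const identifier: Step = (c) => not(keyword)(c) && word(c);

const ok: Cursor = { toks: ["foo"], pos: 0 };
const bad: Cursor = { toks: ["class"], pos: 0 };
const okResult = identifier(ok);
const badResult = identifier(bad);
```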

examples/nudjs.ts verifies both across ts/go/rust — 11/11 accept, 4/4 reject per
target, byte-identical to createParser. Full suite 42/42. javascript.ts now clears
the NUD wall and advances to the next construct (a Pratt LED shape).
The next javascript.ts construct after the NUD cluster: a postfix operator LED
`[$, postfix]` (`x++`, `x--`) — consume the operator, no right operand, bind iff
lbp > minBp. With it comes the access-tail CLOSURE that makes it correct: once a
postfix binds, the operand is an update expression, so a further postfix or an
access tail (`.x`, `[i]`, `(…)`, a tagged template) can no longer attach. The led
loop now threads a `tailClosed` flag — set by a postfix, gating both further
postfixes and the access-tail leds. An access-tail led is detected structurally
(buildPratt): a led whose last step is not a fresh same-rule operand (closed, not
an open binary/ternary) and whose connector is a punctuator, not a word operator —
so `in`/`instanceof`/`?:` still bind after `a++`.

examples/postjs.ts verifies it across ts/go/rust: `a++--`, `a++.b`, `a++ ++` are
rejected, `(a++).b` and `x.y.z++` accepted — 11/11 accept, 4/4 reject per target,
byte-identical to createParser. Full suite 42/42. javascript.ts now clears the LED
wall and advances to the next construct (a nested `seq` rd step).
The next javascript.ts construct: a `seq` reaching stepOf — a star/sep body that is
itself a sequence, e.g. a comma list written `star([',', $])` (`many(',', $)`), the
shape javascript.ts uses for array/argument/sequence lists. stepOf/stepOfPratt now
compile a sequence into a `seq` step, rendered as the conjunction of its sub-steps
(the enclosing star/opt/sep handles backtracking).

examples/seqjs.ts verifies it across ts/go/rust — 10/10 accept, 4/4 reject per
target, byte-identical to createParser. Full suite 42/42. javascript.ts now advances
to the deferred construct it has been heading toward: arrow functions
(group{capBelow, ctxMode} — assignment-level precedence + the await/yield context fork).
typescript.ts's first parser-algebra blocker (and a piece of async arrows): the
`sameLine` restricted-production assertion — matches, consuming nothing, iff the
next token has no preceding line terminator. The lexer now tracks newline-before
per token (a `nl` flag on Tok), set when the skipped whitespace contains a newline
OR a skipped comment spans one, so a block comment across a newline counts. In the
stateful lexer the flag lives on LexState; otherwise a local threaded through the
plain push.

examples/sljs.ts (a `return` that takes a value only on the same line) verifies it
across ts/go/rust: `return 1;` keeps the value; `return\n1;`, `return /*\n*/ 1;`
(block comment spanning a newline) and `return // c\n 1;` correctly reject — 7/7
accept, 4/4 reject per target, byte-identical to createParser. Full suite 42/42.
typescript.ts now clears sameLine and advances to `notLeftLeaf`; javascript.ts
remains at arrow functions (capBelow/ctxMode).
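
A reduced sketch of the newline-before flag, under stated assumptions: the `Tok` shape with an `nl` field follows the description above, while the tokenization rules (word runs and single-character punctuation) and the end-of-input behaviour of `sameLine` are simplifications made here.

```typescript
interface Tok { text: string; nl: boolean }

const isLineTerminator = (ch: string) =>
  ch === "\n" || ch === "\r" || ch === "\u2028" || ch === "\u2029";

const isWordChar = (ch: string) =>
  (ch >= "a" && ch <= "z") || (ch >= "A" && ch <= "Z") ||
  (ch >= "0" && ch <= "9") || ch === "_" || ch === "$";

function lex(src: string): Tok[] {
  const toks: Tok[] = [];
  let nl = false; // has a line terminator been skipped since the last token?
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (isLineTerminator(ch)) { nl = true; i++; continue; }
    if (ch === " " || ch === "\t") { i++; continue; }
    if (src.startsWith("//", i)) {
      // Stop before the terminator so the branch above records it.
      while (i < src.length && !isLineTerminator(src[i])) i++;
      continue;
    }
    if (src.startsWith("/*", i)) {
      const end = src.indexOf("*/", i + 2);
      if (end < 0) throw new Error("unterminated block comment");
      for (let k = i; k < end; k++) if (isLineTerminator(src[k])) nl = true;
      i = end + 2;
      continue;
    }
    let j = i;
    while (j < src.length && isWordChar(src[j])) j++;
    if (j === i) j = i + 1; // single punctuation character
    toks.push({ text: src.slice(i, j), nl });
    nl = false;
    i = j;
  }
  return toks;
}

// Zero-width assertion: the next token exists and has no newline before it.
// Treating end of input as a failure is an assumption of this sketch.
function sameLine(toks: Tok[], pos: number): boolean {
  return pos < toks.length && !toks[pos].nl;
}
```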
The hardest parser construct, the wall javascript.ts has been heading toward:
assignment-level (capBelow) NUDs — arrow functions. A capExpr NUD carries the
binding power of its connector; it is parsed only when the enclosing minBp is
LOOSER than that (so `1 + () => x` needs parens), and once parsed it is "capped" —
the led loop is skipped entirely (`() => {} || a` rejects). The nud now takes minBp,
tries the capped sequences FIRST (so the `(x) => y` vs `(x)` ambiguity resolves by
longest-match — the arrow is attempted, then falls back to grouping), and signals
the cap via `_capped`. The `=>` body's ctxMode (await/yield) is treated as
transparent: the context fork is not modelled, so this covers basic arrows, not
async/await bodies.

Also fixes a latent `sep` bug surfaced by `(a,) => b`: gen-parser's sep allows a
trailing delimiter, the portable sepBy did not. Now matched in all three targets —
earlier grammars simply had no trailing-delimiter test, so the aggregate passed for
the wrong reason.

examples/arrowjs.ts verifies it across ts/go/rust — 14/14 accept (incl. trailing
commas and curried `x => y => x`), 4/4 reject, byte-identical to createParser. Full
suite 42/42. javascript.ts clears the arrow wall and advances to the next group case.
A `group` whose body is a multi-item sequence (e.g. a ctxMode group wrapping a
sequence) previously threw "group must reduce to a single step". Since ctxMode is
transparent to the portable parser and a `seq` step already exists, a transparent
group now degrades to a single `seq` step (or its sole step when the body is one);
only a no-`in` `suppress` group is still deferred. This applies to both stepOf and stepOfPratt.

No new behaviour to verify beyond the existing seq step (seqjs) — full suite 42/42,
no regression. javascript.ts clears the multi-step group and advances to the next
construct, the no-`in` `suppress` context.
…nstanceof)

The portable parser's mixfix leds bound maximally tight — fine for access tails
(`.`/`(`/`[`) but wrong for a precedence-carrying led like the ternary `? :`
(`a == b ? c : d` must group as `(a == b) ? c : d`). The led loop now gates such a
led by its lbp (from the grammar's ledPrec): bind only when lbp > minBp. And a
chain-rhs led (`in`/`instanceof`) parses its trailing self-operand at the level's bp
via a new `ruleBp` step, so `a in b in c` left-chains as `(a in b) in c`. Both
derive from analyzeGrammar's ledPrecByConnector — single-sourced with the interpreter.

examples/ledjs.ts verifies it across ts/go/rust — 11/11 accept (ternary below the
operators, right-associative `a ? b : c ? d : e`, chain-rhs `in`), 4/4 reject,
byte-identical to createParser. Full suite 42/42. This is the precedence foundation
the no-`in` (suppress) context builds on next.
…script.ts now EMITS

A run of constructs that together take the real javascript.ts grammar through the
whole portable emitter end-to-end:

- no-`in` (suppress) context: a `for (binding in iterable)` head parses its binding
  with the `in` led disabled (examples/noinjs.ts, 9/9+4/4 ×3). Threads a
  suppressed-connector set consumed per led loop.
- one-or-more `+` quantifier (`x+` = `x x*`) — the last buildIR throw; with it,
  javascript.ts EMITS in all three targets.
- Two latent `sep` bugs, both exposed only by the real grammar (earlier grammars
  wrapped sep in opt or never tested the shapes — the aggregate passed for the wrong
  reason): gen-parser's sep is `(element (delim element)*)?`, i.e. the WHOLE list is
  optional (empty `f()` valid) AND a trailing delimiter is allowed. sepBy now matches.
- A NUD bracket that fails now FALLS THROUGH to the next same-first-token alternative
  instead of returning null — javascript has four `new`-led NUDs.
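
The `sep` behaviour described in the list above (whole list optional, trailing delimiter accepted) can be sketched over a plain token array. The signature is hypothetical and much simpler than the emitted sepBy, which works on parser steps rather than strings.

```typescript
// Hypothetical signature: elements are recognised by a predicate on token text.
function sepBy(
  toks: string[],
  start: number,
  isElement: (t: string) => boolean,
  delim: string,
): { items: string[]; next: number } {
  const items: string[] = [];
  let pos = start;
  // The whole list is optional: no first element means an empty list.
  if (pos >= toks.length || !isElement(toks[pos])) return { items, next: pos };
  items.push(toks[pos++]);
  while (toks[pos] === delim) {
    pos++; // consume the delimiter
    if (pos < toks.length && isElement(toks[pos])) items.push(toks[pos++]);
    else break; // trailing delimiter: keep it consumed and stop
  }
  return { items, next: pos };
}

const isName = (t: string) => t >= "a" && t <= "z";
const plain = sepBy(["a", ",", "b", ")"], 0, isName, ",");
const trailing = sepBy(["a", ",", ")"], 0, isName, ",");
const empty = sepBy([")"], 0, isName, ",");
```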

Result: javascript.ts emits, compiles and runs in ts/go/rust, and is byte-identical to
createParser on basic JS (var/function/arrow/ternary/member-call/for-in/while/if/class/
new/template/regex/instanceof/try/switch): 23/24 in TS, the one miss being a `new a.b()`
NewTarget member-constructor CST shape. The await/yield fork (async/await) and that
new-expression edge remain. Full suite 42/42; existing gate unaffected by the shared
sep/bracket fixes.
…sue #6)

The target-agnostic emitter now handles a full language end-to-end. javascript.ts —
89 rules after the [Await]/[Yield] fork — emits, compiles and runs in all three
targets, byte-identical to createParser, and is gate-maintained (28/28 accept,
6/6 reject ×3, ASCII corpus). What it took:

- Left recursion: a left-recursive non-Pratt rule (NewTarget, TS Type) now routes
  through buildPratt (atom-then-continuation), fixing the infinite recursion a plain
  rd rule hit.
- The [Await]/[Yield] context fork: emitPortableParser applies `withAwaitYield`
  exactly as createParser does, so `await`/`yield` are keywords in async/generator
  bodies and identifiers elsewhere — name-forked into $A/$Y/$AY families.
- A forked rule labels its CST node with the CANON base name (cstName), not the
  $-suffixed family name; and the $ in family names (a valid TS but not Go/Rust
  identifier) is sanitized to `_` for the emitted parse-fn names.
- Full JS whitespace (`\s`: NBSP/LS/PS/…), not just ASCII.
- A leaked `_capped` flag: it is a global, but gen-parser's `capped` is local, so a
  grouping `(arrow)` leaked the cap to the outer expression and dropped a trailing
  call (`(() => {})()`). Non-capped NUD arms now force it false.
- Two more `sep` shapes, including the empty list `f()`, both surfaced by the real grammar.
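
The name sanitization in the third bullet amounts to a one-line rewrite. In this sketch the `parse_` prefix and the example family name `Expr$AY` are hypothetical; only the `$` to `_` replacement comes from the description.

```typescript
// `$` is valid in a TypeScript identifier but not in Go or Rust, so the
// emitted function name replaces it. The "parse_" prefix is assumed.
function parseFnName(familyName: string): string {
  return "parse_" + familyName.replace(/\$/g, "_");
}

const forked = parseFnName("Expr$AY");
const base = parseFnName("Expr");
```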

ts/go/rust all 28/28 on the ASCII corpus (destructuring, generators, classes,
optional chaining, async/await, labels). Byte-based go/rust use UTF-8 offsets —
identical to the JS oracle for ASCII; non-ASCII offset units differ inherently.
Full suite 42/42.
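
The offset-unit difference can be made concrete with a small conversion from a JS string offset (UTF-16 code units) to a UTF-8 byte offset. `utf8Offset` is a hypothetical helper written for this note, and it assumes the offset does not fall inside a surrogate pair.

```typescript
// Count the UTF-8 bytes needed for the first `utf16Offset` code units of src.
function utf8Offset(src: string, utf16Offset: number): number {
  let bytes = 0;
  let i = 0;
  while (i < utf16Offset) {
    const cp = src.codePointAt(i)!;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    i += cp > 0xffff ? 2 : 1; // astral code points occupy two UTF-16 units
  }
  return bytes;
}

const asciiOffset = utf8Offset("a = 1", 4);       // ASCII: units and bytes agree
const latinOffset = utf8Offset("\u00e9 = 1", 4);  // U+00E9 is 1 unit but 2 bytes
const astralOffset = utf8Offset("\u{1F600}x", 2); // U+1F600 is 2 units but 4 bytes
```

For ASCII-only input the two offsets coincide, which is why the gate corpus is ASCII.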
…te (issue #6)

The second real full language now goes through the agnostic emitter end-to-end. Two
type-grammar constructs were the wall:

- A LED with a leading `sameLine` guard (`$ sameLine '<' …`) — TS's generic-args /
  array / non-null type tails that must not cross a newline. The guard is hoisted into
  the led-arm condition (skip, don't break, so the connector can rebind).
- `notLeftLeaf`: a led skipped when the LEFT node's head-leaf text is in a word set
  (`void`/`null`/`this` can't be `.`-qualified as a type). Each target gains a
  `headLeafText` (the leftmost leaf's source text) and the led arm checks it.

typescript.ts (the most complex grammar) emits, compiles and runs in ts/go/rust, and
is gate-maintained alongside javascript.ts (13/13 accept, 4/4 reject ×3, ASCII corpus;
83.5% on the broad curated TS corpus in TS). Full suite 42/42. The agnostic emitter now
covers both full real languages — the issue-#6 goal, proven in three target languages.
Documents the target-agnostic emitter under "A language-agnostic engine": one
analysis → one IR → per-target render (Go/Rust/native, each with its own regex-free
lexer), proven by the real javascript.ts and typescript.ts grammars emitting to
ts/go/rust byte-identical to the interpreter and gate-maintained, with the Rust/Go
throughput results and the ASCII-offset boundary noted.
…Lexer)

The emit layer had three inconsistent entry points — `emitParser(grammar)` (JS,
no target), `emitLexer(grammar, st)` (JS, internal symtab), and
`emitPortableParser(grammar, target)` (lexer buried in `target.render`). Collapse
them to exactly two, both parameterized by a Target:

    emitLexer(grammar, target)  -> the lexer for that target
    emitParser(grammar, target) -> the parser, REUSING emitLexer(grammar, target)

A Target owns both halves, so a parser reuses the SAME target's lexer — jsTarget's
parser embeds jsTarget's SoA-int lexer, goTarget's parser embeds goTarget's Tok-list
lexer. No cross-target lexer format is shared, so the optimized JS path keeps its
integer-bitmask dispatch and the portable targets keep their clean byte scanner.

- src/emit.ts (new): the Target interface + the two public functions; re-exports
  jsTarget / tsTarget / goTarget / rustTarget.
- emit-parser.ts: the optimized emitter split into `emitJsLexer` (derive) +
  `emitJsParser` (embed a handed-in lexer) behind `jsTarget`. The split is pure
  refactor — re-deriving the deterministic symtab yields the identical lexer string,
  so emit-parser-verify stays byte-for-byte.
- emit-lexer.ts: `emitLexer` -> `emitSoaLexer` (frees the public name).
- emit-portable.ts + target-{ts,go,rust}.ts: `render(ir)` split into the target's
  `emitLexer`/`emitParser`; `emitPortableParser` removed (`portableIR` exported).
- ~19 callers updated to `emitParser(g, jsTarget)` / `emitParser(g, <portable>)`.

emit-parser-verify byte-identical (0 mismatches), portable-targets 16 grammars ×3 ≡
interpreter, emit-tsc-gate clean, full suite 42/42.
@johnsoncodehk johnsoncodehk changed the title Issue #6: type-checked emitter + target-agnostic Go/Rust parsers Issue #6: target-agnostic emitter — emitParser/emitLexer for JS/TS/Go/Rust Jun 21, 2026
@johnsoncodehk johnsoncodehk requested a review from Copilot June 21, 2026 23:02

Copilot AI left a comment

Pull request overview

This PR refactors Monogram’s emit layer into a target-parameterized surface (emitParser(grammar, target) / emitLexer(grammar, target)) and adds portable emit targets so the same grammar can be emitted as runnable TypeScript, Go, or Rust parsers (alongside the optimized JS target). It also extends CI gating to compile/run the portable targets and to type-check the emitted JS/TS output.

Changes:

  • Introduces src/emit.ts as the public emit surface and adds a portable IR-based emitter with TS/Go/Rust targets.
  • Adds new CI gates: strict tsc checking for emitted parsers and execution-based oracle parity across portable targets.
  • Removes the generated CST types pipeline (gen-ast-types.ts + related tests) and updates generators/tests accordingly.

Reviewed changes

Copilot reviewed 45 out of 46 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| test/recovery.ts | Updates emitted-parser generation/imports to use src/emit.ts + jsTarget and .mts output. |
| test/recovery-conformance.ts | Same as above for recovery conformance harness. |
| test/profile-vs-tsc.mjs | Switches emitter import to src/emit.ts; still writes emitted output to .mjs. |
| test/profile-vs-peers.mjs | Switches emitter import to src/emit.ts; still writes emitted outputs to .mjs. |
| test/portable-targets.ts | New gate: emits TS/Go/Rust parsers for multiple grammars, compiles/runs, and compares CST JSON to the interpreter oracle. |
| test/multi-doc.ts | Updates to new emit surface and .mts emitted module. |
| test/incremental-verify.ts | Updates to new emit surface and .mts emitted module. |
| test/incremental-grammars.ts | Updates to new emit surface and .mts emitted module paths/imports. |
| test/head-to-head.ts | Updates to new emit surface and .mts emitted module. |
| test/exhaustive-edits.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-tsc-gate.ts | New gate: runs tsc --strict --noEmit against emitted parsers for multiple grammars. |
| test/emit-reject-messages.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-parser-verify.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-parser-bench.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-lexer-verify.ts | Updates to new emit surface and .mts emitted module; loosens regex to handle new Map<...>(...) forms. |
| test/emit-corpus.ts | Adjusts generated-artifact filtering now that .cst-types.ts is removed. |
| test/cst-match-totality.ts | Updates to new emit surface and .mts emitted module. |
| test/check.ts | Adds the new emit-tsc-gate and portable-targets gates into the main gate runner. |
| test/ast-types-smoke.ts | Removes the CST typed-output smoke test (generator removed). |
| src/target-ts.ts | New: TypeScript portable target (regex-free lexer + portable parser runtime). |
| src/target-rust.ts | New: Rust portable target emitting a self-contained rustc-buildable parser. |
| src/target-go.ts | New: Go portable target emitting a self-contained go build-buildable parser. |
| src/gen-cst-match.ts | Simplifies generateCstMatch API (removes import path parameter). |
| src/gen-ast-types.ts | Removes CST typed-output generator. |
| src/emit.ts | New public API: Target interface + emitLexer/emitParser wrappers; exports all targets. |
| src/emit-portable.ts | New: target-agnostic portable IR builder (shared analysis + IR). |
| src/emit-lexer.ts | Renames exported lexer emitter to emitSoaLexer and adds TS typing to emitted helpers/locals. |
| src/cli.ts | Stops emitting .cst-types.ts; emits only .cst-match.ts via updated generator API. |
| README.md | Adds documentation about portable targets / non-JS emitted parsers. |
| examples/templatejs.ts | New portable-target fixture exercising template-literal lexer state machine. |
| examples/sljs.ts | New portable-target fixture exercising sameLine assertions. |
| examples/seqjs.ts | New portable-target fixture exercising grouped seq steps inside stars. |
| examples/richtokens.ts | New portable-target fixture stressing general token-pattern matcher compilation. |
| examples/regexjs.ts | New portable-target fixture exercising regex-vs-division lexer context. |
| examples/postjs.ts | New portable-target fixture exercising postfix operator LEDs. |
| examples/nudjs.ts | New portable-target fixture exercising general NUD sequences and negative lookahead. |
| examples/noinjs.ts | New portable-target fixture exercising no-in suppress context. |
| examples/minijs.ts | New portable-target “realistic subset” JS grammar fixture. |
| examples/ledjs.ts | New portable-target fixture exercising precedence-gated mixfix LEDs (ternary / in). |
| examples/calc.ts | New portable-target minimal Pratt grammar fixture. |
| examples/arrowjs.ts | New portable-target fixture exercising capBelow arrow-function constructs. |
| examples/altjs.ts | New portable-target fixture exercising non-literal inline alternation. |
| .gitignore | Stops ignoring *.cst-types.ts (types generator removed); keeps *.cst-match.ts ignored. |
| .github/workflows/ci.yml | Updates CI comments to reflect only *.cst-match.ts generated artifacts. |
| .gitattributes | Updates generated-artifacts commentary after .cst-types.ts removal. |

Comment thread README.md Outdated

### The emitted parser need not be JS — Go, Rust, native

The grammar also derives a **standalone parser in another language**. [`emitPortableParser(grammar, target)`](src/emit-portable.ts) runs one analysis into one language-agnostic IR, and each `Target` renders it — including its own regex-free lexer, so the output has no dependency on the JS runtime and compiles offline:
Comment thread README.md Outdated
Comment thread README.md Outdated
Comment thread test/fixtures/calc.ts
Comment on lines +6 to +10
// the crux — a Pratt expression engine with operator PRECEDENCE and associativity
// (`1 + 2 * 3` must group as `1 + (2 * 3)`), prefix unary, and a left-associative
// call/postfix continuation. emitPortableParser derives a TS, Go, and Rust parser
// from THIS one definition; the cross-language gate proves all three produce the
// byte-identical CST the interpreter (createParser) does.
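
The precedence requirement named in this fixture comment can be illustrated with a small hand-written Pratt loop. This is only a sketch under assumptions: the binding-power table, the `parseExpression` name, and the parenthesized-string output are invented for the example, it omits the call/postfix continuation, and it is not the code that `emitParser` generates.

```typescript
// Binding powers: a higher number binds tighter. All operators are left-associative.
const BINDING_POWER: Record<string, number> = { "+": 10, "-": 10, "*": 20, "/": 20 };
const PREFIX_POWER = 30; // prefix minus binds tighter than any infix operator

function parseExpression(src: string): string {
  const tokens: string[] = src.match(/\d+|[-+*\/]/g) || [];
  let pos = 0;

  function parse(minPower: number): string {
    // NUD position: a prefix minus or a number literal.
    const first = tokens[pos++];
    if (first === undefined) throw new Error("unexpected end of input");
    let left: string;
    if (first === "-") {
      left = `(-${parse(PREFIX_POWER)})`;
    } else if (/^\d+$/.test(first)) {
      left = first;
    } else {
      throw new Error(`unexpected token ${first}`);
    }

    // LED loop: keep absorbing infix operators that bind tighter than minPower.
    while (pos < tokens.length) {
      const op = tokens[pos];
      const power = op === undefined ? undefined : BINDING_POWER[op];
      if (power === undefined || power <= minPower) break;
      pos++;
      // Passing `power` makes an equal-precedence operator stop the recursion,
      // which yields left associativity.
      const right = parse(power);
      left = `(${left} ${op} ${right})`;
    }
    return left;
  }

  const tree = parse(0);
  if (pos < tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return tree;
}
```

With this table, `parseExpression("1 + 2 * 3")` groups the multiplication first, and `parseExpression("1 - 2 - 3")` groups to the left.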
Comment thread test/fixtures/minijs.ts
Comment on lines +5 to +9
// be benchmarked against oxc on the same bytes.
//
// Derived from ONE definition by emitPortableParser into TypeScript, Go, and Rust;
// the cross-language gate proves all three produce the byte-identical CST that the
// interpreter (createParser) does. The portable lexer is regex-free (char scanner
Comment thread src/target-ts.ts
Comment on lines +150 to +151
if (c === 10 || c === 13 || c === 8232 || c === 8233) { pendingNl = true; pos++; continue; }
if (c === 32 || c === 9 || c === 11 || c === 12 || c === 160 || c === 5760 || (c >= 8192 && c <= 8202) || c === 8239 || c === 8287 || c === 12288 || c === 65279) { pos++; continue; }
Comment thread src/target-go.ts Outdated
Comment on lines +159 to +160
\t\tif c == 10 || c == 13 || c == 8232 || c == 8233 { pendingNl = true; pos++; continue }
\t\tif c == 32 || c == 9 || c == 11 || c == 12 || c == 160 || c == 5760 || (c >= 8192 && c <= 8202) || c == 8239 || c == 8287 || c == 12288 || c == 65279 { pos++; continue }
Comment thread test/profile-vs-tsc.mjs Outdated
Comment thread test/profile-vs-peers.mjs Outdated
Comment thread src/target-rust.ts Outdated
Comment on lines +166 to +167
if c == 32 || c == 9 { pos += 1; continue; }
if c == 10 || c == 13 { ${nlVar} = true; pos += 1; continue; }
The thirteen grammars under examples/ are not user-facing examples — they are the
construct-isolation fixtures consumed solely by test/portable-targets.ts (each
isolates one emitter construct so a divergence pinpoints which one broke). They
belong next to their only consumer, beside test/vendor/, not in a directory whose
name promises a learning sample. No real examples were displaced; examples/ held
only fixtures and is now removed.

Mechanical: git mv to test/fixtures/, fixtures' `../src` imports -> `../../src`,
gate paths `../examples/X.ts` -> `./fixtures/X.ts`. Full suite 42/42.
…s + .mts

All ten review comments, verified before fixing:

- Portable lexers (ts/go/rust) set newline-before for `\r`/LS/PS, but the
  interpreter (gen-lexer.ts) sets it only for `\n`. Confirmed by parse: `return\r1;`
  the oracle ACCEPTS (CR isn't newline-before) while the portable REJECTED. Fixed all
  three to set pendingNl only for `\n`; `\r`/LS/PS are plain whitespace. Added the
  `return\r1;` (accept) / `return\r\n1;` (reject) cases to the sljs gate as a guard.
  (go/rust are byte-based, so their `8232`/`8233` checks were already dead; the
  reachable bug was `\r`.)
- README's portable-emitter snippet still imported the removed `emitPortableParser`
  from src/emit-portable.ts + target-*.ts → rewritten to `emitParser` from src/emit.ts.
- calc/minijs fixture header comments referenced `emitPortableParser` → `emitParser`.
- profile-vs-tsc/peers write the now-typed emitted parser to `.mjs` and import it;
  node only strips types from `.ts`/`.mts`, so that would SyntaxError → switched the
  emitted output to `.mts` (matching the other emit harnesses).
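
To make the "already dead" remark about the byte-based checks concrete, here is a small sketch with a hand-rolled UTF-8 encoder (the `utf8Bytes` helper is written for this illustration and is not part of the repository). A scanner that reads one byte at a time only ever sees values up to 255, so a comparison against 8232 or 8233 can never be true.

```typescript
// Encode one Unicode code point as UTF-8 bytes (standard UTF-8 bit layout).
function utf8Bytes(codePoint: number): number[] {
  if (codePoint < 0x80) return [codePoint];
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

const LS = 0x2028; // 8232, LINE SEPARATOR
const PS = 0x2029; // 8233, PARAGRAPH SEPARATOR

const lsBytes = utf8Bytes(LS); // [0xE2, 0x80, 0xA8]
const psBytes = utf8Bytes(PS); // [0xE2, 0x80, 0xA9]

// What a byte-at-a-time `c == 8232` check would see: never a match.
const lsCheckCanFire = lsBytes.some((b) => b === LS);
const psCheckCanFire = psBytes.some((b) => b === PS);

// A UTF-16 scanner sees LS as a single code unit, so the same check is live there.
const utf16Unit = "\u2028".charCodeAt(0); // 8232
```

CR is a single byte (13) in UTF-8, which is why `\r` was the reachable case in the Go and Rust lexers.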

emit-parser-verify byte-identical, portable-targets 16 grammars ×3 (incl. the new CR
cases), full suite 42/42.

Separately noted (not in scope here): the interpreter itself counts only `\n` as a
line terminator, not `\r`/LS/PS — a pre-existing JS-ASI conformance gap in the core
lexer, on near-extinct inputs.
The lexers counted only LF as a line terminator, but ECMAScript also defines CR
(U+000D), LS (U+2028), and PS (U+2029) — the set that drives ASI and the
"no LineTerminator here" restrictions. So `return\r1` was parsed as `return 1`,
where a conforming JS parser applies ASI (bare `return`, then `1`).

Fixed consistently in all four lexer implementations so they stay in lockstep:
- gen-lexer.ts (interpreter, the oracle): LF/CR in the ASCII path, LS/PS via the
  \s-run regex, and the comment-span check.
- emit-lexer.ts (emitted SoA/JS lexer): the same, in its codegen.
- target-ts.ts (portable, UTF-16): LF/CR/LS/PS.
- target-go.ts / target-rust.ts (portable, byte-based): LF/CR (LS/PS are multi-byte
  and fall under the documented non-ASCII offset boundary).

CRLF is unchanged (the LF already set newline-before), so the existing corpus is
unaffected — the change only reaches lone-CR and LS/PS inputs. This supersedes the
earlier direction (which had made the portable lexers match the LF-only interpreter);
now the interpreter is conforming and all four agree on the full set.
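
The rule shared by the four lexers can be sketched as a whitespace skipper over UTF-16 code units. This is an illustrative approximation, not the repository's code: `skipTrivia` and `isLineTerminator` are hypothetical names, only a few whitespace characters are listed, and the comment-span check is left out.

```typescript
const LF = 0x0a;
const CR = 0x0d;
const LS = 0x2028;
const PS = 0x2029;

// The ECMAScript line terminators: LF, CR, LS, PS.
function isLineTerminator(c: number): boolean {
  return c === LF || c === CR || c === LS || c === PS;
}

// Skip whitespace starting at `pos` and report whether a line terminator was
// crossed, which is the "newline-before" flag the next token would carry.
function skipTrivia(src: string, pos: number): { pos: number; newlineBefore: boolean } {
  let newlineBefore = false;
  while (pos < src.length) {
    const c = src.charCodeAt(pos);
    if (isLineTerminator(c)) {
      newlineBefore = true;
      pos++;
      continue;
    }
    // Space, tab, vertical tab, form feed: plain whitespace, no newline flag.
    if (c === 0x20 || c === 0x09 || c === 0x0b || c === 0x0c) {
      pos++;
      continue;
    }
    break;
  }
  return { pos, newlineBefore };
}
```

Called at offset 6 (just past `return`), the sketch sets the flag for `return\r1` and `return\r\n1` but not for `return\t1`, matching the accept/reject split listed for the sljs gate.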

sljs gate extended: `return\r1;` / `return\r\n1;` / `return /*\r*/ 1;` reject,
`return\t1;` accepts (tab is whitespace, not a terminator) — checked across ts/go/rust.
emit-parser-verify byte-identical, portable-targets 16 grammars ×3, full suite 42/42.
Development

Successfully merging this pull request may close these issues.

Target-agnostic emitter: emitParser(grammar, target) for Go/Rust/native

2 participants