Issue #6: target-agnostic emitter — emitParser/emitLexer for JS/TS/Go/Rust#56
Open
johnsoncodehk wants to merge 27 commits into
emitParser now emits a standalone TypeScript module that passes `tsc --strict --noEmit`, replacing the previously untyped JS output. This makes the emitted parser's type contract explicit and gated by construction: the monomorphic parse-state struct (Doc), the matcher/runtime signatures, the spare-buffer mirrors, and the baked op/rule tables all carry types tsc verifies for consistency. That contract is the part a future Go/Rust target must reproduce, so surfacing it now (rather than deferring it to the first non-JS target) is the de-risking first step of issue #6.

The additions are erasable TypeScript only (annotations, optional params, `!` assertions). Node runs the emitted parser by stripping types, so the runtime is unchanged. The arity-looseness the JS output relied on (calling matchers with omitted trailing diagnostic args) is replaced by explicit optional params, the one JS-ism that would not survive a typed/Go/Rust target.

Gates:
- new emit-tsc-gate: the emitted parser type-checks under `tsc --strict` for the soa + emitted-lexer family (typescript, javascript, typescriptreact, javascriptreact). The fallback-lexer / non-soa path (yaml, html) is logged as deferred: it carries additional untyped surface and a pre-existing latent scope reference (the non-soa editCore branch names cs/ceOld/parenCachePos that exist only in the soa branch; unreached at runtime, hence invisible until now).
- emit-parser-verify unchanged: emitted CST stays byte-identical to the interpreter (109/109 in-repo + 401/401 external, 0 mismatches).
- bench unchanged (~14x): type-stripping happens once at import, not per parse.

Test harnesses that import the emitted module now write `.mts` so Node strips types on import. K_ARR/T_ARR column widths are single-sourced in analyze() so emitRuntime and emitDriver's spare buffers pick the same width.
… html)
Brings the yaml/html emit path under the same strict type-check as the ts/js family, so the gate now covers every grammar. Three things this required:
- Hoist the edit-damage envelope (newLen/cs/ceNew/ceOld/charDelta) out of the e.soa window branch. shiftDiags(cs, ceOld, charDelta) runs in the SHARED post-fork settle, but those names were declared only in the soa branch, so the non-soa branch referenced undeclared variables. The path is unreached at runtime for the fallback grammars (they full-relex), which is why it stayed invisible; the tsc gate surfaced it. They derive only from shared inputs, so hoisting is behavior-neutral for soa and correct for non-soa. Same fix gates the soa-only parenCachePos cache-invalidation in the '>'-split.
- Type the non-soa piece-text columns (tkText/altText: string[]), assert the fallback column swap against the nullable spare buffers, and cast the baked LEX_GRAMMAR at the createLexer boundary.
- Give every baked Map/Set an explicit element type at emission. They inferred correctly only when non-empty (ts/js); an empty vocabulary set (yaml/html) collapsed to Map<unknown,unknown> / Set<never>. emit-lexer-verify's TYPE_KIND/LIT_KW/LIT_PU extraction regex now tolerates the `new Map<string, number>(` generic.

Full suite 41/41; emitted CST byte-identical across all 6 grammars (incremental-grammars 610/610).
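To illustrate the explicit-element-type point, here is a minimal sketch of emitting a baked table with its generic spelled out. The helper name `emitTypedMap` and the sample entries are hypothetical and not taken from the PR; only the `new Map<string, number>(` shape comes from the message above.

```typescript
// Hypothetical helper (not the PR's code): renders a baked lookup table as
// TypeScript source with an explicit element type, so an empty table still
// type-checks as Map<string, number> instead of relying on inference.
function emitTypedMap(
  name: string,
  entries: ReadonlyArray<readonly [string, number]>,
): string {
  const body = entries
    .map(([key, value]) => `[${JSON.stringify(key)}, ${value}]`)
    .join(", ");
  return `const ${name} = new Map<string, number>([${body}]);`;
}

// A non-empty vocabulary (ts/js-like) and an empty one (yaml/html-like).
const nonEmptyTable = emitTypedMap("LIT_KW", [["if", 1], ["else", 2]]);
const emptyTable = emitTypedMap("LIT_KW", []);
console.log(nonEmptyTable);
console.log(emptyTable);
```

Both outputs carry the same annotation, so downstream code sees one type regardless of how many entries the grammar happens to bake.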
The agnosticism payoff of #6, proven by EXECUTION. emit-portable.ts adds `emitPortableParser(grammar, target)`: one analysis → one plain-data IR → a parser rendered in each target language through a `Target` interface. The same grammar (examples/calc.ts) derives a TypeScript, a Go, and a Rust parser; the Go and Rust sources are compiled (`go build` / `rustc`) and run, and every parser's CST is compared node-for-node against the createParser interpreter. This is a SEPARATE, minimal emitter from the optimized emit-parser.ts (no incremental/recovery/arena — each target supplies its own runtime, as the issue frames it). It is the real Target seam: adding a language is implementing one `render(ir)`; buildIR is untouched. Scope = the verifiable core: char-class tokens, recursive descent with backtracking alternation and `*`, and a Pratt expression engine with operator precedence / associativity, prefix unary, and parenthesised grouping. The portable lexer is a dependency-free char scanner (no regex), so the emitted Go/Rust compile offline — sidestepping both the full-TS lexer's lookahead (which Go's RE2 and Rust's regex crate reject) and any crate fetch. buildIR THROWS on a construct it does not model rather than emit a wrong parser; mixfix/postfix LEDs, sep/opt, and lexer lookahead are the documented next increment. Gate: test/portable-targets.ts (group emit-parity) — typescript + go + rust each 21/21 accept ≡ oracle and 7/7 reject ≡ oracle over an adversarial corpus (precedence both directions, left-associativity, prefix chains, nested grouping, multi-statement programs, the empty program, malformed input). Go/Rust toolchains are optional — a missing `go`/`rustc` is skipped (the TS rendering needs only node). Full suite 42/42.
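For readers unfamiliar with the "verifiable core", the following is a rough, self-contained sketch of a regex-free char scanner feeding a Pratt loop with precedence, associativity, prefix minus, and grouping. It is not the emitted code: the operator table, the binding-power numbers, and the S-expression output are assumptions made for illustration.

```typescript
// Illustrative only: a regex-free scanner plus a Pratt expression loop.
type Tok = { kind: "num" | "op" | "eof"; text: string };

function lex(src: string): Tok[] {
  const toks: Tok[] = [];
  let i = 0;
  while (i < src.length) {
    const c = src.charCodeAt(i);
    if (c === 32) { i++; continue; }                 // skip spaces
    if (c >= 48 && c <= 57) {                        // a run of ASCII digits
      let j = i;
      while (j < src.length && src.charCodeAt(j) >= 48 && src.charCodeAt(j) <= 57) j++;
      toks.push({ kind: "num", text: src.slice(i, j) });
      i = j;
    } else {
      toks.push({ kind: "op", text: src[i] });       // single-char operator or paren
      i++;
    }
  }
  toks.push({ kind: "eof", text: "" });
  return toks;
}

// [left bp, right bp]. right > left is left-associative, right < left is right-associative.
const BINARY: Record<string, [number, number]> = {
  "+": [10, 11], "-": [10, 11], "*": [20, 21], "/": [20, 21], "^": [31, 30],
};
const PREFIX_BP = 25; // assumed binding power of prefix minus

function parse(src: string): string {
  const toks = lex(src);
  let pos = 0;
  function expr(minBp: number): string {
    const t = toks[pos++];
    let left: string;
    if (t.kind === "num") left = t.text;
    else if (t.text === "-") left = `(neg ${expr(PREFIX_BP)})`;
    else if (t.text === "(") {
      left = expr(0);
      if (toks[pos++].text !== ")") throw new Error("expected )");
    } else throw new Error(`unexpected token '${t.text}'`);
    for (;;) {
      const op = toks[pos];
      const bp = op.kind === "op" ? BINARY[op.text] : undefined;
      if (!bp || bp[0] <= minBp) break;              // operator binds too loosely here
      pos++;
      left = `(${op.text} ${left} ${expr(bp[1])})`;
    }
    return left;
  }
  const tree = expr(0);
  if (toks[pos].kind !== "eof") throw new Error("trailing input");
  return tree;
}

console.log(parse("1 + 2 * 3"));
console.log(parse("1 - 2 - 3"));
```

Because the scanner only compares char codes, the same logic ports directly to a Go or Rust byte loop, which is the property the message relies on for offline compilation.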
…sumer `gen-ast-types.ts` emitted `<grammar>.cst-types.ts` (discriminated-union typing of the CST). Those artifacts are gitignored build outputs, and nothing depended on them: the only consumer was a non-gated smoke test, and gen-cst-match's `importFrom` parameter (the cst-types path) was never used in its body — so the gated cst-match subsystem is fully independent of cst-types. Removed the generator and its smoke test, dropped the cst-types emit + the dead `importFrom` parameter from the gen pipeline, and cleaned the references (.gitignore/.gitattributes/CI comment/README diagram/emit-corpus filter). Verified: `npm run gen` emits no cst-types and the committed artifacts stay in sync; src type-checks; cst-match-totality 31356/0 and the full suite 42/42.
…oughput
Extends the target-agnostic emitter from the calc proof to examples/minijs.ts, a real JavaScript subset (string/comment lexer, the full operator-precedence ladder, call/member/index mixfix chains, arrays, and the common statement forms), so the emitted Go/Rust parsers can be benchmarked against oxc on the same bytes.

What grew:
- Lexer: driven by token-pattern.ts's structural recognizers (char runs, quote strings, line/block comments); still a regex-free char scanner, so Go/Rust compile offline.
- Parser IR: opt/sep/inline-literal-alternation, Pratt bracket NUDs (grouping, array), and mixfix LEDs (call/member/index) tried before operators.
- Rust target: zero-allocation tokens (`&str` slices, Copy) and `&'static str` CST labels, with no per-token/per-node String. This is decisive: the first naive version (String everywhere, a clone per peek) ran at 9 MB/s, slower than Go; the fix took it to 39 MB/s.

Verified: test/portable-targets.ts now covers calc + minijs; ts/go/rust each ≡ the createParser CST (minijs 29/29 accept + 7/7 reject) and byte-identical on a 2.92 MB corpus. Full suite 42/42.

Benchmark (oxc-parser 0.137, 2.92 MB JS-subset both engines accept, self-timed lex+parse with black_box): derived-Rust 39 MB/s (0.97x oxc, parity), derived-Go 19 MB/s (2x), oxc 38 MB/s. A grammar-DERIVED, un-hand-tuned Rust parser matches the fastest hand-tuned native JS parser, while building a full CST. minijs is a subset (oxc parses full JS), but both parse the same corpus, so it is a fair throughput comparison on that work; the bench harness is not committed (it needs the external oxc-parser package).
The Go target now allocates its CST from a flat arena instead of a heap *Cst per node: nodes live in `nodes []Node` (a node is an int32 index), children in a flat `kids []int32`, and in-progress children accumulate on a `scratch` stack. Backtracking truncates the three slices to saved lengths; the slices keep their capacity across parses, so a warmed parser allocates ~nothing. Indices (unlike the previous pointers) survive slice reallocation, which is what makes the arena work. This is the Go counterpart of the Rust target's zero-allocation change, and the same allocation lever the optimized emit-parser.ts pays for in JS: it took the derived Go parser from 19 MB/s to 67 MB/s (3.5x) on the 2.92 MB JS-subset corpus. Verified: CST byte-identical to the interpreter on the corpus + the portable gate (calc + minijs, ts/go/rust, 21/21+29/29 accept, 7/7 reject); the truncate-on- backtrack reclamation is exercised by the reject cases. Full suite 42/42. Benchmark vs tsgo (microsoft/typescript-go's native-Go parser, ParseSourceFile only, both parse the corpus clean): derived-Go 67 MB/s, tsgo 33 MB/s. The 3.5x arena win is the apples-to-apples result; the headline 2x-over-tsgo is partly because minijs is a subset of TypeScript (tsgo builds a richer AST — trivia, full node kinds — so it does more per node), not purely better codegen. Takeaway: a grammar-derived parser with arena allocation is in the same league as a hand-tuned native one; naive per-node allocation is what costs the 3.5x.
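The arena layout can be sketched as follows. This is a TypeScript illustration of the idea only (the PR's version is emitted Go); the class name, field names, and the `finish`/`mark`/`reset` methods are hypothetical.

```typescript
// Illustrative arena: nodes are indices, children live in one flat array,
// and backtracking truncates arrays to saved lengths.
class Arena {
  labels: string[] = [];    // label of node i
  kidStart: number[] = [];  // children of node i are kids[kidStart[i] .. kidEnd[i])
  kidEnd: number[] = [];
  kids: number[] = [];      // flat child-index storage
  scratch: number[] = [];   // in-progress children of nodes still being built

  // Save point: the current lengths of the three growing areas.
  mark(): [number, number, number] {
    return [this.labels.length, this.kids.length, this.scratch.length];
  }

  // Backtrack by truncation; nothing is released node by node.
  reset([n, k, s]: [number, number, number]): void {
    this.labels.length = n;
    this.kidStart.length = n;
    this.kidEnd.length = n;
    this.kids.length = k;
    this.scratch.length = s;
  }

  // Close a node: children accumulated since scratchBase move into kids.
  finish(label: string, scratchBase: number): number {
    const start = this.kids.length;
    for (let i = scratchBase; i < this.scratch.length; i++) this.kids.push(this.scratch[i]);
    this.scratch.length = scratchBase;
    const id = this.labels.length;
    this.labels.push(label);
    this.kidStart.push(start);
    this.kidEnd.push(this.kids.length);
    this.scratch.push(id);  // the finished node is now a pending child of its parent
    return id;
  }

  show(id: number): string {
    const parts: string[] = [];
    for (let i = this.kidStart[id]; i < this.kidEnd[id]; i++) parts.push(this.show(this.kids[i]));
    return parts.length ? `${this.labels[id]}(${parts.join(" ")})` : this.labels[id];
  }
}

const arena = new Arena();
const base = arena.scratch.length;
arena.finish("a", arena.scratch.length);    // leaf
const saved = arena.mark();
arena.finish("junk", arena.scratch.length); // speculative branch
arena.reset(saved);                         // branch failed: truncate
arena.finish("b", arena.scratch.length);    // retry with another branch
const root = arena.finish("sum", base);
console.log(arena.show(root));
```

Since every reference is an index, growing the underlying storage never invalidates a node, which is the property the message credits for making the arena work.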
…ge 1) Toward supporting the real grammar files, the portable lexer gains a GENERAL matcher: a token whose shape the four fast paths (run/string/line/block) don't cleanly recognise is now compiled, from its raw token-pattern AST, to a backtracking-free matcher (literal / charClass / seq / ordered-alt / greedy-repeat / zero-width lookahead+anchor) — no regex engine, so it stays portable. This replaces the previous over-eager `literalPrefix` heuristic that mis-classified numbers/strings/decorators as line comments. This handles the STATELESS real-JS token tier the fast paths could not: `\u`-escaped identifiers, the decimal/hex number family with a `(?!IdentChar)` boundary, and both-quote strings with escapes. examples/richtokens.ts exercises exactly these, and the emitted lexer is verified ≡ createLexer (the gate's richtokens case: 14/14 accept, 5/5 reject — including the Hex-vs-Number boundary disambiguation). Implemented in the TS target so far; Go/Rust throw a clear message on a `pattern` token (their matcher port is the next stage), so calc/minijs stay green in all three. Full suite 42/42. Remaining for the real grammar files (each a further stage): port the matcher to Go/Rust; the STATEFUL lexer (regex-vs-division context, template interpolation) that javascript/typescript need; the markup/indent lexers (html/yaml); and the full parser algebra (not/sameLine/exclude/ctxMode/tsRelax/+/…).
…vergence) The target-agnostic lexer is now uniform across all three targets: the general token-pattern matcher (stage 1, TS only) is ported to Go and Rust, so a `pattern` token compiles to a backtracking-free matcher in every language — Go as package-level `_mN(p int) int` funcs over a module-level source, Rust as named `_mN(s, p) -> i64` funcs (closures can't recurse) threading the source as a param. This is the lexer half of the issue-#6 target parameter: ONE target-agnostic lexer, rendered per language. The optimized emit-lexer.ts stays a separate, JS-perf path — it fills the arena parser's struct-of-arrays integer columns, a different token contract than the portable Tok list, so merging would deoptimize it; the two already share what should be shared (the token-pattern.ts algebra + recognizers). Verified: examples/richtokens.ts (escaped idents, the number family with a boundary, both-quote strings) now runs in ts/go/rust, each CST byte-identical to createParser (gate: 14/14 accept + 5/5 reject per target). Full suite 42/42.
…(stage 3) The portable lexer gains its first STATEFUL capability — the JS `/` problem. A `/` starts a regex literal in expression context but is division after a value; the lexer now threads the previous token plus a control-head paren stack to decide, gating the regex token on the same prevIsValue predicate gen-lexer.ts uses. The regexContext sets (division-after type/text, expression-start keywords, control-head keywords, member accessors, ambiguous postfix ops) are baked from the grammar into an IR.regexCtx and rendered per target: TS/Go via closures over the lex state, Rust via a LexState struct (two closures can't co-capture the same mutable state). examples/regexjs.ts (minijs + regex literals) verifies it: `a / b` is division, `/re/` after `=`/keyword is a regex, `if (x) /re/` is a regex (control head), `obj.for(x) / y` is division (member name, not a head), `[1,2] / 3` is division — all ts/go/rust CSTs byte-identical to createParser (gate: 15/15 accept, 5/5 reject per target). Full suite 42/42. Also fixes a single-item negated char-class losing its parens (`!cc == 10` instead of `!(cc == 10)`) in all three matchers — surfaced by the Go compiler, and by adding regex-escape cases the earlier corpus had missed (an aggregate that passed for the wrong reason). Remaining for the real grammar files: template interpolation, the markup/indent lexers, and the full parser algebra.
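A simplified sketch of the `/` decision, assuming the lexer only needs the previous token texts. The two keyword sets and the function name are illustrative stand-ins; the PR bakes the real regexContext sets from the grammar and gates on gen-lexer.ts's predicate, which this does not reproduce.

```typescript
// Illustrative sets, not the grammar-derived ones.
const CONTROL_HEADS = new Set(["if", "while", "for", "with"]);
const EXPR_KEYWORDS = new Set(["return", "typeof", "case", "void", "in", "of"]);

function isWordOrNumber(t: string): boolean {
  const c = t.charCodeAt(0);
  return (c >= 48 && c <= 57) || (c >= 65 && c <= 90) ||
         (c >= 97 && c <= 122) || c === 95 || c === 36;
}

// Decides whether a `/` following the tokens in `prev` starts a regex literal.
function slashStartsRegex(prev: string[]): boolean {
  const parenIsControlHead: boolean[] = []; // one entry per currently open `(`
  let prevIsValue = false;                  // start of input is expression context
  for (let i = 0; i < prev.length; i++) {
    const t = prev[i];
    if (t === "(") {
      // `if (` opens a control head, but `obj.if(` is a member call.
      const head = i > 0 && CONTROL_HEADS.has(prev[i - 1]) && (i < 2 || prev[i - 2] !== ".");
      parenIsControlHead.push(head);
      prevIsValue = false;
    } else if (t === ")") {
      // After a control head a statement follows; after a call a value was produced.
      prevIsValue = !(parenIsControlHead.pop() ?? false);
    } else if (t === "]") {
      prevIsValue = true;
    } else if (isWordOrNumber(t)) {
      const isMemberName = i > 0 && prev[i - 1] === ".";
      prevIsValue = isMemberName || !(EXPR_KEYWORDS.has(t) || CONTROL_HEADS.has(t));
    } else {
      prevIsValue = false;                  // operators and other punctuation
    }
  }
  return !prevIsValue;
}

console.log(slashStartsRegex(["a"]));                               // division
console.log(slashStartsRegex(["if", "(", "x", ")"]));               // regex
console.log(slashStartsRegex(["obj", ".", "for", "(", "x", ")"]));  // division
```

The sketch deliberately ignores `}` and postfix operators, both of which need more context than a token list provides.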
…tage 4)
The portable lexer's second stateful feature: `${…}` interpolation. A `` ` `` opens a
span scanned to the next `${` (emit $templateHead) or closing `` ` `` (the whole token,
no substitution); a `}` that closes a hole resumes the span ($templateMiddle / Tail).
A templateStack of brace-depths decides which `}` closes the hole versus a nested
`{…}` (object/block) or nested template inside it. The parser's Pratt nud sees a
$templateHead and assembles head·expr·(middle·expr)*·tail into a synthetic $template
node, parsing each hole with the Pratt expression rule.
The lexer state machine generalises cleanly with the regex one — a grammar can have
regex, templates, or both share one emit() / LexState (Rust: a struct that now also
carries the template_stack). examples/templatejs.ts (minijs + templates + a shorthand
object so a hole can hold `{…}`) verifies it: no-substitution, adjacent/multiple holes,
expressions in holes, NESTED templates, and an object inside a hole (the brace-depth
counter) — all ts/go/rust CSTs byte-identical to createParser (gate: 11/11 accept,
4/4 reject per target). Full suite 42/42.
Tagged templates (`` tag`…` `` — a postfix-token Pratt LED) are out of scope here;
that's a parser-algebra gap, the remaining work alongside the markup/indent lexers.
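The brace-depth stack can be sketched as below. This is an illustration under simplifying assumptions: it reports only template-related token kinds, ignores braces inside string literals and comments within a hole, and uses the kind name `noSubstitution` for the whole-token case, which is not a name from the PR.

```typescript
// Returns the template-related token kinds found in `src`, in order.
function templateKinds(src: string): string[] {
  const out: string[] = [];
  const templateStack: number[] = []; // brace depth inside each open hole
  let i = 0;

  // Scans template text from i; true if it stopped at `${`, false at a closing backtick.
  function span(): boolean {
    while (i < src.length) {
      const c = src[i];
      if (c === "\\") { i += 2; continue; }          // skip an escaped char
      if (c === "`") { i++; return false; }
      if (c === "$" && src[i + 1] === "{") { i += 2; return true; }
      i++;
    }
    throw new Error("unterminated template");
  }

  while (i < src.length) {
    const c = src[i++];
    if (c === "`") {
      if (span()) { out.push("$templateHead"); templateStack.push(0); }
      else out.push("noSubstitution");
    } else if (c === "{" && templateStack.length > 0) {
      templateStack[templateStack.length - 1]++;     // nested object or block
    } else if (c === "}" && templateStack.length > 0) {
      const top = templateStack.length - 1;
      if (templateStack[top] > 0) {
        templateStack[top]--;                        // closes a nested `{`
      } else {
        templateStack.pop();                         // closes the hole: resume the span
        if (span()) { out.push("$templateMiddle"); templateStack.push(0); }
        else out.push("$templateTail");
      }
    }
  }
  if (templateStack.length > 0) throw new Error("unterminated hole");
  return out;
}

console.log(templateKinds("`a${b}c`"));
console.log(templateKinds("`${ {a:1} }`"));
```

Pushing a fresh depth of 0 for every hole is what lets a nested template inside a hole reuse the same logic without special cases.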
…gins)
The first parser-algebra construct toward the real grammar files: a LED whose
continuation is a single token, `$ X` (e.g. a tagged template `` tag`…` ``). buildPratt
classified LEDs only as binary (`$ op $`) or mixfix-literal (`$ lit …`) and threw on
this shape; it now collects such tokens into PrattRule.postfixToks, and each target
renders an LED arm that wraps `left X` into a node — tried like a mixfix led (binds
tight, no min-bp gate). When the postfix token is the template token the arm also
accepts a `$templateHead` and runs matchTemplate, so a tagged template can itself be
interpolated.
examples/templatejs.ts restores `[$, Template]`; the gate now covers `` tag`…` ``,
`` String.raw`a${b}c`.length ``, `` x.tag`${y}` `` (tagged after a member) across
ts/go/rust (15/15 accept, 3/3 reject per target). Full suite 42/42.
buildIR only accepted an inline `alt(...)` whose every branch was a literal (the altlit fast path) and threw otherwise — the first parser-algebra construct javascript.ts hits. It now compiles a non-literal alternation into an `alt` step whose branches are full sub-sequences, rendered as a backtracking try-each: each branch saves the position (and the arena lengths) and restores them on failure before the next branch. Rendered as an immediately-applied closure in every target (Go needs `;` between the consecutive block statements; Rust reuses the closure body in both the top-level and in-closure step contexts). examples/altjs.ts (object keys are `alt(Ident | Str | Number)`) verifies it across ts/go/rust — 9/9 accept, 4/4 reject per target, byte-identical to createParser. Full suite 42/42. With this, javascript.ts clears the inline-alt wall and advances to the next parser construct (a Pratt NUD shape).
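The backtracking try-each can be pictured with a small combinator sketch. The `State` shape and the `tok`/`seq`/`alt` helpers are hypothetical; the emitted parsers render this inline per target, with arena lengths in place of the `nodes` array used here.

```typescript
// Hypothetical mini-runtime, for illustration only.
type State = { toks: string[]; pos: number; nodes: string[] };
type Step = (s: State) => boolean;

// Match one literal token and record it.
const tok = (text: string): Step => (s) => {
  if (s.toks[s.pos] !== text) return false;
  s.nodes.push(text);
  s.pos++;
  return true;
};

// Conjunction of sub-steps; leaves rollback to the enclosing construct.
const seq = (...steps: Step[]): Step => (s) => steps.every((step) => step(s));

// Ordered alternation: every branch starts from the same saved position and
// node count, and a failed branch is rolled back before the next is tried.
const alt = (...branches: Step[]): Step => (s) => {
  const pos = s.pos;
  const len = s.nodes.length;
  for (const branch of branches) {
    if (branch(s)) return true;
    s.pos = pos;
    s.nodes.length = len;
  }
  return false;
};

const rule = alt(seq(tok("a"), tok("b")), seq(tok("a"), tok("c")));
const state: State = { toks: ["a", "c"], pos: 0, nodes: [] };
const matched = rule(state);
console.log(matched, state.pos, state.nodes);
```

Here the first branch consumes `a` before failing on `b`; without the restore, the second branch would start one token too late and leave a stray node behind.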
Two coupled parser-algebra constructs, the next javascript.ts wall after inline-alt:
- A `not` step — zero-width negative lookahead: try the inner steps, restore the
position (and arena/kids) unconditionally, succeed iff they did NOT match. Rendered
as an immediately-applied closure in every target (Rust shares one body across the
two step contexts, like `alt`).
- General Pratt NUD sequences (PrattRule.nudSeqs) — a NUD that is neither a bare token,
a prefix op, nor a literal-led bracket: a backtracking try-each sequence producing a
node. Covers the reserved-word-guarded identifier (`not(kw)… Ident`) and the
quantifier-first class expression (`Decorator? class Ident? … { … }`). A single
transparent group unwraps to its body; a group carrying capBelow/ctxMode/suppress
(arrow functions, await/yield context) is explicitly deferred with a clear message.
examples/nudjs.ts verifies both across ts/go/rust — 11/11 accept, 4/4 reject per
target, byte-identical to createParser. Full suite 42/42. javascript.ts now clears
the NUD wall and advances to the next construct (a Pratt LED shape).
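The `not` step described above amounts to "run, always restore, invert". A minimal sketch, reusing a hypothetical combinator style (the `State` shape and helper names are not from the PR):

```typescript
// Hypothetical mini-runtime, for illustration only.
type State = { toks: string[]; pos: number; nodes: string[] };
type Step = (s: State) => boolean;

const tok = (text: string): Step => (s) => {
  if (s.toks[s.pos] !== text) return false;
  s.nodes.push(text);
  s.pos++;
  return true;
};

// Matches any token made only of ASCII letters (a stand-in for Ident).
const ident: Step = (s) => {
  const t = s.toks[s.pos];
  if (t === undefined || t.length === 0) return false;
  for (let i = 0; i < t.length; i++) {
    const c = t.charCodeAt(i) | 32;               // fold ASCII case
    if (c < 97 || c > 122) return false;
  }
  s.nodes.push(t);
  s.pos++;
  return true;
};

const seq = (...steps: Step[]): Step => (s) => steps.every((step) => step(s));

// Zero-width negative lookahead: try the inner step, restore position and
// nodes unconditionally, and succeed iff the inner step did NOT match.
const not = (inner: Step): Step => (s) => {
  const pos = s.pos;
  const len = s.nodes.length;
  const innerMatched = inner(s);
  s.pos = pos;
  s.nodes.length = len;
  return !innerMatched;
};

// Reserved-word-guarded identifier: `not(kw) Ident`.
const guardedIdent = seq(not(tok("class")), ident);

const ok: State = { toks: ["foo"], pos: 0, nodes: [] };
const reserved: State = { toks: ["class"], pos: 0, nodes: [] };
console.log(guardedIdent(ok), guardedIdent(reserved));
```

The unconditional restore is what keeps the step zero-width: even when the inner step succeeds and consumes input, nothing of it is visible afterwards.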
The next javascript.ts construct after the NUD cluster: a postfix operator LED `[$, postfix]` (`x++`, `x--`) — consume the operator, no right operand, bind iff lbp > minBp. With it comes the access-tail CLOSURE that makes it correct: once a postfix binds, the operand is an update expression, so a further postfix or an access tail (`.x`, `[i]`, `(…)`, a tagged template) can no longer attach. The led loop now threads a `tailClosed` flag — set by a postfix, gating both further postfixes and the access-tail leds. An access-tail led is detected structurally (buildPratt): a led whose last step is not a fresh same-rule operand (closed, not an open binary/ternary) and whose connector is a punctuator, not a word operator — so `in`/`instanceof`/`?:` still bind after `a++`. examples/postjs.ts verifies it across ts/go/rust: `a++--`, `a++.b`, `a++ ++` are rejected, `(a++).b` and `x.y.z++` accepted — 11/11 accept, 4/4 reject per target, byte-identical to createParser. Full suite 42/42. javascript.ts now clears the LED wall and advances to the next construct (a nested `seq` rd step).
The next javascript.ts construct: a `seq` reaching stepOf — a star/sep body that is
itself a sequence, e.g. a comma list written `star([',', $])` (`many(',', $)`), the
shape javascript.ts uses for array/argument/sequence lists. stepOf/stepOfPratt now
compile a sequence into a `seq` step, rendered as the conjunction of its sub-steps
(the enclosing star/opt/sep handles backtracking).
examples/seqjs.ts verifies it across ts/go/rust — 10/10 accept, 4/4 reject per
target, byte-identical to createParser. Full suite 42/42. javascript.ts now advances
to the deferred construct it has been heading toward: arrow functions
(group{capBelow, ctxMode} — assignment-level precedence + the await/yield context fork).
typescript.ts's first parser-algebra blocker (and a piece of async arrows): the `sameLine` restricted-production assertion — matches, consuming nothing, iff the next token has no preceding line terminator. The lexer now tracks newline-before per token (a `nl` flag on Tok), set when the skipped whitespace contains a newline OR a skipped comment spans one, so a block comment across a newline counts. In the stateful lexer the flag lives on LexState; otherwise a local threaded through the plain push. examples/sljs.ts (a `return` that takes a value only on the same line) verifies it across ts/go/rust: `return 1;` keeps the value; `return\n1;`, `return /*\n*/ 1;` (block comment spanning a newline) and `return // c\n 1;` correctly reject — 7/7 accept, 4/4 reject per target, byte-identical to createParser. Full suite 42/42. typescript.ts now clears sameLine and advances to `notLeftLeaf`; javascript.ts remains at arrow functions (capBelow/ctxMode).
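A minimal sketch of the newline-before flag and the `sameLine` check. The `Tok` shape mirrors the `nl` flag named above, but the scanner itself, its token classes, and the end-of-input behaviour of `sameLine` are assumptions for illustration.

```typescript
type Tok = { text: string; nl: boolean };

function isWordChar(ch: string): boolean {
  const c = ch.charCodeAt(0);
  return (c >= 48 && c <= 57) || (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95;
}

function lex(src: string): Tok[] {
  const toks: Tok[] = [];
  let i = 0;
  let nl = false;                                    // newline seen since the last token
  while (i < src.length) {
    const c = src[i];
    if (c === "\n") { nl = true; i++; }
    else if (c === " " || c === "\t") { i++; }
    else if (src.startsWith("//", i)) {
      while (i < src.length && src[i] !== "\n") i++; // the newline itself is seen next turn
    } else if (src.startsWith("/*", i)) {
      const end = src.indexOf("*/", i + 2);
      const stop = end < 0 ? src.length : end + 2;
      if (src.slice(i, stop).includes("\n")) nl = true; // block comment spanning a newline
      i = stop;
    } else {
      let j = i;
      if (isWordChar(src[j])) { while (j < src.length && isWordChar(src[j])) j++; }
      else j++;
      toks.push({ text: src.slice(i, j), nl });
      nl = false;
      i = j;
    }
  }
  return toks;
}

// Matches, consuming nothing, iff the next token has no preceding line terminator.
// Treating end of input as "no match" is an assumption of this sketch.
function sameLine(toks: Tok[], pos: number): boolean {
  return pos < toks.length && !toks[pos].nl;
}

console.log(sameLine(lex("return 1;"), 1));
console.log(sameLine(lex("return /*\n*/ 1;"), 1));
```

Recording the flag on the token (rather than re-scanning trivia at parse time) keeps the parser-side check a single field read.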
The hardest parser construct, the wall javascript.ts has been heading toward:
assignment-level (capBelow) NUDs — arrow functions. A capExpr NUD carries the
binding power of its connector; it is parsed only when the enclosing minBp is
LOOSER than that (so `1 + () => x` needs parens), and once parsed it is "capped" —
the led loop is skipped entirely (`() => {} || a` rejects). The nud now takes minBp,
tries the capped sequences FIRST (so the `(x) => y` vs `(x)` ambiguity resolves by
longest-match — the arrow is attempted, then falls back to grouping), and signals
the cap via `_capped`. The `=>` body's ctxMode (await/yield) is treated as
transparent: the context fork is not modelled, so this covers basic arrows, not
async/await bodies.
Also fixes a latent `sep` bug surfaced by `(a,) => b`: gen-parser's sep allows a
trailing delimiter, the portable sepBy did not. Now matched in all three targets —
earlier grammars simply had no trailing-delimiter test, so the aggregate passed for
the wrong reason.
examples/arrowjs.ts verifies it across ts/go/rust — 14/14 accept (incl. trailing
commas and curried `x => y => x`), 4/4 reject, byte-identical to createParser. Full
suite 42/42. javascript.ts clears the arrow wall and advances to the next group case.
A `group` whose body is a multi-item sequence (e.g. a ctxMode group wrapping a sequence) previously threw "group must reduce to a single step". Since ctxMode is transparent to the portable parser and a `seq` step already exists, a transparent group now degrades to a single `seq` step (or its sole step when the body is one); only a no-`in` `suppress` group is still deferred. Both stepOf and stepOfPratt. No new behaviour to verify beyond the existing seq step (seqjs) — full suite 42/42, no regression. javascript.ts clears the multi-step group and advances to the next construct, the no-`in` `suppress` context.
…nstanceof) The portable parser's mixfix leds bound maximally tight — fine for access tails (`.`/`(`/`[`) but wrong for a precedence-carrying led like the ternary `? :` (`a == b ? c : d` must group as `(a == b) ? c : d`). The led loop now gates such a led by its lbp (from the grammar's ledPrec): bind only when lbp > minBp. And a chain-rhs led (`in`/`instanceof`) parses its trailing self-operand at the level's bp via a new `ruleBp` step, so `a in b in c` left-chains as `(a in b) in c`. Both derive from analyzeGrammar's ledPrecByConnector — single-sourced with the interpreter. examples/ledjs.ts verifies it across ts/go/rust — 11/11 accept (ternary below the operators, right-associative `a ? b : c ? d : e`, chain-rhs `in`), 4/4 reject, byte-identical to createParser. Full suite 42/42. This is the precedence foundation the no-`in` (suppress) context builds on next.
…script.ts now EMITS
A run of constructs that together take the real javascript.ts grammar through the whole portable emitter end-to-end:
- no-`in` (suppress) context: a `for (binding in iterable)` head parses its binding with the `in` led disabled (examples/noinjs.ts, 9/9+4/4 ×3). Threads a suppressed-connector set consumed per led loop.
- one-or-more `+` quantifier (`x+` = `x x*`): the last buildIR throw; with it, javascript.ts EMITS in all three targets.
- Two latent `sep` bugs, both exposed only by the real grammar (earlier grammars wrapped sep in opt or never tested the shapes, so the aggregate passed for the wrong reason): gen-parser's sep is `(element (delim element)*)?`, i.e. the WHOLE list is optional (empty `f()` valid) AND a trailing delimiter is allowed. sepBy now matches.
- A NUD bracket that fails now FALLS THROUGH to the next same-first-token alternative instead of returning null; javascript has four `new`-led NUDs.

Result: javascript.ts emits, compiles and runs in ts/go/rust, and is byte-identical to createParser on basic JS (var/function/arrow/ternary/member-call/for-in/while/if/class/new/template/regex/instanceof/try/switch): 23/24 in TS, the one miss a `new a.b()` NewTarget member-constructor CST shape. The await/yield fork (async/await) and that new-expression edge remain. Full suite 42/42; existing gate unaffected by the shared sep/bracket fixes.
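The `sep` shape described above, an optional list with an optional trailing delimiter, can be sketched like this. The function signature and the token-array representation are illustrative, not the portable runtime's.

```typescript
// sepBy for `(element (delim element)*)?` with an allowed trailing delimiter.
// Returns the matched elements and the position after the list.
function sepBy(
  toks: string[],
  start: number,
  isElement: (t: string) => boolean,
  delim: string,
): { items: string[]; next: number } {
  const items: string[] = [];
  let pos = start;
  if (pos < toks.length && isElement(toks[pos])) {
    items.push(toks[pos++]);
    while (toks[pos] === delim) {
      pos++;                                         // consume the delimiter
      if (pos < toks.length && isElement(toks[pos])) items.push(toks[pos++]);
      else break;                                    // it was a trailing delimiter
    }
  }
  return { items, next: pos };                       // an empty list is a valid match
}

const isArg = (t: string): boolean => t !== "," && t !== ")" && t !== "(";

console.log(sepBy(["f", "(", ")"], 2, isArg, ","));           // empty list
console.log(sepBy(["a", ",", "b", ")"], 0, isArg, ","));      // two elements
console.log(sepBy(["a", ",", ")"], 0, isArg, ","));           // trailing delimiter
```

In all three cases `next` lands on the closing `)`, so the caller can match it without knowing which list shape occurred.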
…sue #6)
The target-agnostic emitter now handles a full language end-to-end. javascript.ts (89 rules after the [Await]/[Yield] fork) emits, compiles and runs in all three targets, byte-identical to createParser, and is gate-maintained (28/28 accept, 6/6 reject ×3, ASCII corpus). What it took:
- Left recursion: a left-recursive non-Pratt rule (NewTarget, TS Type) now routes through buildPratt (atom-then-continuation), fixing the infinite recursion a plain rd rule hit.
- The [Await]/[Yield] context fork: emitPortableParser applies `withAwaitYield` exactly as createParser does, so `await`/`yield` are keywords in async/generator bodies and identifiers elsewhere, name-forked into $A/$Y/$AY families.
- A forked rule labels its CST node with the CANON base name (cstName), not the $-suffixed family name; and the $ in family names (a valid TS but not Go/Rust identifier) is sanitized to `_` for the emitted parse-fn names.
- Full JS whitespace (`\s`: NBSP/LS/PS/…), not just ASCII.
- A leaked `_capped` flag: it is a global, but gen-parser's `capped` is local, so a grouping `(arrow)` leaked the cap to the outer expression and dropped a trailing call (`(() => {})()`). Non-capped NUD arms now force it false.
- Two more `sep` shapes (empty list `f()`, both surfaced by the real grammar).

ts/go/rust all 28/28 on the ASCII corpus (destructuring, generators, classes, optional chaining, async/await, labels). Byte-based go/rust use UTF-8 offsets, identical to the JS oracle for ASCII; non-ASCII offset units differ inherently. Full suite 42/42.
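The offset-unit caveat can be made concrete with a short sketch. JavaScript strings index by UTF-16 code unit, while a byte-based Go/Rust parser reports UTF-8 byte offsets; the helper name `utf8Offset` is hypothetical and assumes the given offset falls on a code-point boundary.

```typescript
const encoder = new TextEncoder();

// Converts a UTF-16 code-unit offset into the UTF-8 byte offset of the same
// position. Assumes utf16Offset does not split a surrogate pair.
function utf8Offset(src: string, utf16Offset: number): number {
  return encoder.encode(src.slice(0, utf16Offset)).length;
}

const ascii = "a = 1";
const accented = "é = 1";   // é is 1 UTF-16 unit but 2 UTF-8 bytes
const emoji = "😀x";        // 😀 is 2 UTF-16 units but 4 UTF-8 bytes

console.log(utf8Offset(ascii, 4));
console.log(utf8Offset(accented, 4));
console.log(utf8Offset(emoji, 2));
```

For pure-ASCII input the two units coincide, which is why the gate's ASCII corpus can compare offsets directly across targets.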
…te (issue #6)
The second real full language now goes through the agnostic emitter end-to-end. Two type-grammar constructs were the wall:
- A LED with a leading `sameLine` guard (`$ sameLine '<' …`): TS's generic-args / array / non-null type tails that must not cross a newline. The guard is hoisted into the led-arm condition (skip, don't break, so the connector can rebind).
- `notLeftLeaf`: a led skipped when the LEFT node's head-leaf text is in a word set (`void`/`null`/`this` can't be `.`-qualified as a type). Each target gains a `headLeafText` (the leftmost leaf's source text) and the led arm checks it.

typescript.ts (the most complex grammar) emits, compiles and runs in ts/go/rust, and is gate-maintained alongside javascript.ts (13/13 accept, 4/4 reject ×3, ASCII corpus; 83.5% on the broad curated TS corpus in TS). Full suite 42/42. The agnostic emitter now covers both full real languages: the issue-#6 goal, proven in three target languages.
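The `notLeftLeaf` check can be sketched as a leftmost-leaf walk. The `Cst` node shape and the `mayQualify` helper are assumptions for illustration; only the name `headLeafText` and the `void`/`null`/`this` word set come from the message above.

```typescript
// Assumed CST shape: leaves carry `text`, inner nodes carry `children`.
type Cst = { label: string; text?: string; children?: Cst[] };

// Source text of the leftmost leaf under `node`.
function headLeafText(node: Cst): string {
  let cur = node;
  while (cur.children && cur.children.length > 0) cur = cur.children[0];
  return cur.text ?? "";
}

const NOT_LEFT_LEAF = new Set(["void", "null", "this"]);

// A `.`-qualification led is skipped when the left node starts with one of these words.
function mayQualify(left: Cst): boolean {
  return !NOT_LEFT_LEAF.has(headLeafText(left));
}

const named: Cst = {
  label: "TypeRef",
  children: [{ label: "Ident", text: "Foo" }],
};
const voidType: Cst = {
  label: "TypeRef",
  children: [{ label: "Keyword", text: "void" }],
};

console.log(mayQualify(named), mayQualify(voidType));
```

Because the check reads only the left node, it fits inside the led-arm condition next to the `sameLine` guard without any extra lookahead.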
Documents the target-agnostic emitter under "A language-agnostic engine": one analysis → one IR → per-target render (Go/Rust/native, each with its own regex-free lexer), proven by the real javascript.ts and typescript.ts grammars emitting to ts/go/rust byte-identical to the interpreter and gate-maintained, with the Rust/Go throughput results and the ASCII-offset boundary noted.
…Lexer)
The emit layer had three inconsistent entry points — `emitParser(grammar)` (JS,
no target), `emitLexer(grammar, st)` (JS, internal symtab), and
`emitPortableParser(grammar, target)` (lexer buried in `target.render`). Collapse
them to exactly two, both parameterized by a Target:
emitLexer(grammar, target) -> the lexer for that target
emitParser(grammar, target) -> the parser, REUSING emitLexer(grammar, target)
A Target owns both halves, so a parser reuses the SAME target's lexer — jsTarget's
parser embeds jsTarget's SoA-int lexer, goTarget's parser embeds goTarget's Tok-list
lexer. No cross-target lexer format is shared, so the optimized JS path keeps its
integer-bitmask dispatch and the portable targets keep their clean byte scanner.
- src/emit.ts (new): the Target interface + the two public functions; re-exports
jsTarget / tsTarget / goTarget / rustTarget.
- emit-parser.ts: the optimized emitter split into `emitJsLexer` (derive) +
`emitJsParser` (embed a handed-in lexer) behind `jsTarget`. The split is pure
refactor — re-deriving the deterministic symtab yields the identical lexer string,
so emit-parser-verify stays byte-for-byte.
- emit-lexer.ts: `emitLexer` -> `emitSoaLexer` (frees the public name).
- emit-portable.ts + target-{ts,go,rust}.ts: `render(ir)` split into the target's
`emitLexer`/`emitParser`; `emitPortableParser` removed (`portableIR` exported).
- ~19 callers updated to `emitParser(g, jsTarget)` / `emitParser(g, <portable>)`.
emit-parser-verify byte-identical (0 mismatches), portable-targets 16 grammars ×3 ≡
interpreter, emit-tsc-gate clean, full suite 42/42.
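The two-function surface can be pictured with the sketch below. The real `Target` interface in src/emit.ts is not shown in this conversation, so the field names, the `Grammar` stand-in, and the toy target are all hypothetical; the sketch only demonstrates the stated property that `emitParser` reuses the same target's `emitLexer`.

```typescript
// Stand-in for the real grammar type (assumption).
interface Grammar { name: string }

// Hypothetical shape: a target owns both halves of the emit.
interface Target {
  name: string;
  emitLexer(grammar: Grammar): string;
  emitParser(grammar: Grammar, lexerSource: string): string;
}

// The lexer for that target.
function emitLexer(grammar: Grammar, target: Target): string {
  return target.emitLexer(grammar);
}

// The parser, reusing emitLexer(grammar, target) so both halves share one token format.
function emitParser(grammar: Grammar, target: Target): string {
  return target.emitParser(grammar, emitLexer(grammar, target));
}

// A toy target that emits placeholder source text, for demonstration only.
const toyTarget: Target = {
  name: "toy",
  emitLexer: (g) => `// lexer for ${g.name}`,
  emitParser: (g, lexerSource) => `${lexerSource}\n// parser for ${g.name}`,
};

const emitted = emitParser({ name: "calc" }, toyTarget);
console.log(emitted);
```

Keeping the lexer format private to each target is what allows the optimized JS path and the portable targets to differ internally while presenting the same two entry points.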
Pull request overview
This PR refactors Monogram’s emit layer into a target-parameterized surface (emitParser(grammar, target) / emitLexer(grammar, target)) and adds portable emit targets so the same grammar can be emitted as runnable TypeScript, Go, or Rust parsers (alongside the optimized JS target). It also extends CI gating to compile/run the portable targets and to type-check the emitted JS/TS output.
Changes:
- Introduces `src/emit.ts` as the public emit surface and adds a portable IR-based emitter with TS/Go/Rust targets.
- Adds new CI gates: strict `tsc` checking for emitted parsers and execution-based oracle parity across portable targets.
- Removes the generated CST types pipeline (`gen-ast-types.ts` + related tests) and updates generators/tests accordingly.
Reviewed changes
Copilot reviewed 45 out of 46 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| test/recovery.ts | Updates emitted-parser generation/imports to use src/emit.ts + jsTarget and .mts output. |
| test/recovery-conformance.ts | Same as above for recovery conformance harness. |
| test/profile-vs-tsc.mjs | Switches emitter import to src/emit.ts; still writes emitted output to .mjs. |
| test/profile-vs-peers.mjs | Switches emitter import to src/emit.ts; still writes emitted outputs to .mjs. |
| test/portable-targets.ts | New gate: emits TS/Go/Rust parsers for multiple grammars, compiles/runs, and compares CST JSON to the interpreter oracle. |
| test/multi-doc.ts | Updates to new emit surface and .mts emitted module. |
| test/incremental-verify.ts | Updates to new emit surface and .mts emitted module. |
| test/incremental-grammars.ts | Updates to new emit surface and .mts emitted module paths/imports. |
| test/head-to-head.ts | Updates to new emit surface and .mts emitted module. |
| test/exhaustive-edits.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-tsc-gate.ts | New gate: runs tsc --strict --noEmit against emitted parsers for multiple grammars. |
| test/emit-reject-messages.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-parser-verify.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-parser-bench.ts | Updates to new emit surface and .mts emitted module. |
| test/emit-lexer-verify.ts | Updates to new emit surface and .mts emitted module; loosens regex to handle new Map<...>(...) forms. |
| test/emit-corpus.ts | Adjusts generated-artifact filtering now that .cst-types.ts is removed. |
| test/cst-match-totality.ts | Updates to new emit surface and .mts emitted module. |
| test/check.ts | Adds the new emit-tsc-gate and portable-targets gates into the main gate runner. |
| test/ast-types-smoke.ts | Removes the CST typed-output smoke test (generator removed). |
| src/target-ts.ts | New: TypeScript portable target (regex-free lexer + portable parser runtime). |
| src/target-rust.ts | New: Rust portable target emitting a self-contained rustc-buildable parser. |
| src/target-go.ts | New: Go portable target emitting a self-contained go build-buildable parser. |
| src/gen-cst-match.ts | Simplifies generateCstMatch API (removes import path parameter). |
| src/gen-ast-types.ts | Removes CST typed-output generator. |
| src/emit.ts | New public API: Target interface + emitLexer/emitParser wrappers; exports all targets. |
| src/emit-portable.ts | New: target-agnostic portable IR builder (shared analysis + IR). |
| src/emit-lexer.ts | Renames exported lexer emitter to emitSoaLexer and adds TS typing to emitted helpers/locals. |
| src/cli.ts | Stops emitting .cst-types.ts; emits only .cst-match.ts via updated generator API. |
| README.md | Adds documentation about portable targets / non-JS emitted parsers. |
| examples/templatejs.ts | New portable-target fixture exercising template-literal lexer state machine. |
| examples/sljs.ts | New portable-target fixture exercising sameLine assertions. |
| examples/seqjs.ts | New portable-target fixture exercising grouped seq steps inside stars. |
| examples/richtokens.ts | New portable-target fixture stressing general token-pattern matcher compilation. |
| examples/regexjs.ts | New portable-target fixture exercising regex-vs-division lexer context. |
| examples/postjs.ts | New portable-target fixture exercising postfix operator LEDs. |
| examples/nudjs.ts | New portable-target fixture exercising general NUD sequences and negative lookahead. |
| examples/noinjs.ts | New portable-target fixture exercising no-in suppress context. |
| examples/minijs.ts | New portable-target “realistic subset” JS grammar fixture. |
| examples/ledjs.ts | New portable-target fixture exercising precedence-gated mixfix LEDs (ternary / in). |
| examples/calc.ts | New portable-target minimal Pratt grammar fixture. |
| examples/arrowjs.ts | New portable-target fixture exercising capBelow arrow-function constructs. |
| examples/altjs.ts | New portable-target fixture exercising non-literal inline alternation. |
| .gitignore | Stops ignoring *.cst-types.ts (types generator removed); keeps *.cst-match.ts ignored. |
| .github/workflows/ci.yml | Updates CI comments to reflect only *.cst-match.ts generated artifacts. |
| .gitattributes | Updates generated-artifacts commentary after .cst-types.ts removal. |
### The emitted parser need not be JS: Go, Rust, native

The grammar also derives a **standalone parser in another language**. [`emitPortableParser(grammar, target)`](src/emit-portable.ts) runs one analysis into one language-agnostic IR, and each `Target` renders it, including its own regex-free lexer, so the output has no dependency on the JS runtime and compiles offline:
Comment on lines +6 to +10
// the crux: a Pratt expression engine with operator PRECEDENCE and associativity
// (`1 + 2 * 3` must group as `1 + (2 * 3)`), prefix unary, and a left-associative
// call/postfix continuation. emitPortableParser derives a TS, Go, and Rust parser
// from THIS one definition; the cross-language gate proves all three produce the
// byte-identical CST the interpreter (createParser) does.
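For readers who want the grouping rule in that fixture comment spelled out, here is a minimal Pratt-style sketch. It is not the engine this PR emits: `BINDING_POWER`, `parseExpression`, and `show` are hypothetical names, and it covers only binary operators and parentheses, with no error handling.

```typescript
// Hypothetical, self-contained sketch of precedence climbing (Pratt style).
// Not the project's emitted parser; all names here are illustrative.
type Expr = number | { op: string; left: Expr; right: Expr };

// Higher binding power binds tighter: `*` and `/` outrank `+` and `-`.
const BINDING_POWER: Record<string, number> = { "+": 10, "-": 10, "*": 20, "/": 20 };

function parseExpression(src: string): Expr {
  const tokens: string[] = src.match(/\d+|[-+*/()]/g) ?? [];
  let pos = 0;

  function parsePrimary(): Expr {
    const tok = tokens[pos++];
    if (tok === "(") {
      const inner = parseBinary(0);
      pos++; // skip the closing ")"
      return inner;
    }
    return Number(tok);
  }

  function parseBinary(minPower: number): Expr {
    let left = parsePrimary();
    for (;;) {
      const op = tokens[pos];
      if (op === undefined) return left;
      const power = BINDING_POWER[op];
      // Stop at non-operators and at operators that do not bind tighter than
      // the caller's floor; the strict comparison yields left associativity.
      if (power === undefined || power <= minPower) return left;
      pos++;
      const right = parseBinary(power);
      left = { op, left, right };
    }
  }

  return parseBinary(0);
}

// Render the tree with explicit parentheses so the grouping is visible.
function show(e: Expr): string {
  return typeof e === "number" ? String(e) : `(${show(e.left)} ${e.op} ${show(e.right)})`;
}
```

With this sketch, `show(parseExpression("1 + 2 * 3"))` yields `(1 + (2 * 3))`, and `1 - 2 - 3` groups to the left as `((1 - 2) - 3)`.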
Comment on lines +5 to +9
// be benchmarked against oxc on the same bytes.
//
// Derived from ONE definition by emitPortableParser into TypeScript, Go, and Rust;
// the cross-language gate proves all three produce the byte-identical CST that the
// interpreter (createParser) does. The portable lexer is regex-free (char scanner
Comment on lines +150 to +151
if (c === 10 || c === 13 || c === 8232 || c === 8233) { pendingNl = true; pos++; continue; }
if (c === 32 || c === 9 || c === 11 || c === 12 || c === 160 || c === 5760 || (c >= 8192 && c <= 8202) || c === 8239 || c === 8287 || c === 12288 || c === 65279) { pos++; continue; }
Comment on lines +159 to +160
\t\tif c == 10 || c == 13 || c == 8232 || c == 8233 { pendingNl = true; pos++; continue }
\t\tif c == 32 || c == 9 || c == 11 || c == 12 || c == 160 || c == 5760 || (c >= 8192 && c <= 8202) || c == 8239 || c == 8287 || c == 12288 || c == 65279 { pos++; continue }
Comment on lines +166 to +167
if c == 32 || c == 9 { pos += 1; continue; }
if c == 10 || c == 13 { ${nlVar} = true; pos += 1; continue; }
The thirteen grammars under examples/ are not user-facing examples — they are the construct-isolation fixtures consumed solely by test/portable-targets.ts (each isolates one emitter construct so a divergence pinpoints which one broke). They belong next to their only consumer, beside test/vendor/, not in a directory whose name promises a learning sample. No real examples were displaced; examples/ held only fixtures and is now removed. Mechanical: git mv to test/fixtures/, fixtures' `../src` imports -> `../../src`, gate paths `../examples/X.ts` -> `./fixtures/X.ts`. Full suite 42/42.
…s + .mts

All ten review comments, verified before fixing:

- Portable lexers (ts/go/rust) set newline-before for `\r`/LS/PS, but the interpreter (gen-lexer.ts) sets it only for `\n`. Confirmed by parse: `return\r1;` the oracle ACCEPTS (CR isn't newline-before) while the portable REJECTED. Fixed all three to set pendingNl only for `\n`; `\r`/LS/PS are plain whitespace. Added the `return\r1;` (accept) / `return\r\n1;` (reject) cases to the sljs gate as a guard. (go/rust are byte-based, so their `8232`/`8233` checks were already dead; the reachable bug was `\r`.)
- README's portable-emitter snippet still imported the removed `emitPortableParser` from src/emit-portable.ts + target-*.ts → rewritten to `emitParser` from src/emit.ts.
- calc/minijs fixture header comments referenced `emitPortableParser` → `emitParser`.
- profile-vs-tsc/peers write the now-typed emitted parser to `.mjs` and import it; node only strips types from `.ts`/`.mts`, so that would SyntaxError → switched the emitted output to `.mts` (matching the other emit harnesses).

emit-parser-verify byte-identical, portable-targets 16 grammars ×3 (incl. the new CR cases), full suite 42/42.

Separately noted (not in scope here): the interpreter itself counts only `\n` as a line terminator, not `\r`/LS/PS. This is a pre-existing JS-ASI conformance gap in the core lexer, on near-extinct inputs.
The lexers counted only LF as a line terminator, but ECMAScript also defines CR (U+000D), LS (U+2028), and PS (U+2029): the set that drives ASI and the "no LineTerminator here" restrictions. So `return\r1` was parsed `return 1` where a conforming JS parser applies ASI (bare `return`, then `1`).

Fixed consistently in all four lexer implementations so they stay in lockstep:

- gen-lexer.ts (interpreter, the oracle): LF/CR in the ASCII path, LS/PS via the \s-run regex, and the comment-span check.
- emit-lexer.ts (emitted SoA/JS lexer): the same, in its codegen.
- target-ts.ts (portable, UTF-16): LF/CR/LS/PS.
- target-go.ts / target-rust.ts (portable, byte-based): LF/CR (LS/PS are multi-byte and fall under the documented non-ASCII offset boundary).

CRLF is unchanged (the LF already set newline-before), so the existing corpus is unaffected; the change only reaches lone-CR and LS/PS inputs. This supersedes the earlier direction (which had made the portable lexers match the LF-only interpreter); now the interpreter is conforming and all four agree on the full set.

sljs gate extended: `return\r1;` / `return\r\n1;` / `return /*\r*/ 1;` reject, `return\t1;` accepts (tab is whitespace, not a terminator), checked across ts/go/rust. emit-parser-verify byte-identical, portable-targets 16 grammars ×3, full suite 42/42.
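The terminator set described in this commit can be sketched for a UTF-16 scanner roughly as follows. This is an illustration only, not the code in `target-ts.ts`: `isLineTerminator` and `skipWhitespace` are hypothetical names, the whitespace list is abbreviated, and terminators inside comments (the `return /*\r*/ 1;` case) are not handled here.

```typescript
// ECMAScript LineTerminator code points: LF, CR, LS (U+2028), PS (U+2029).
const LF = 0x0a, CR = 0x0d, LS = 0x2028, PS = 0x2029;

function isLineTerminator(c: number): boolean {
  return c === LF || c === CR || c === LS || c === PS;
}

// Hypothetical helper: skips a whitespace run starting at `pos` and reports
// whether any line terminator was crossed (the "newline-before" flag that
// ASI and "no LineTerminator here" restrictions consult).
function skipWhitespace(src: string, pos: number): { pos: number; newlineBefore: boolean } {
  let newlineBefore = false;
  while (pos < src.length) {
    const c = src.charCodeAt(pos);
    if (isLineTerminator(c)) { newlineBefore = true; pos++; continue; }
    // Abbreviated whitespace set: space, tab, VT, FF, NBSP, BOM.
    if (c === 0x20 || c === 0x09 || c === 0x0b || c === 0x0c || c === 0xa0 || c === 0xfeff) { pos++; continue; }
    break;
  }
  return { pos, newlineBefore };
}
```

Under this sketch, scanning after `return` in `return\r1` sets `newlineBefore`, while `return\t1` does not. A byte-based scanner over UTF-8, as the Go and Rust targets are described, would see LS and PS as three-byte sequences rather than single units.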
Closes #6.
The emit layer is now exactly two target-parameterized APIs, `emitParser(grammar, target)` and `emitLexer(grammar, target)`, where `emitParser` reuses `emitLexer`, and one grammar derives a real, executable parser in TypeScript, Go, or Rust (alongside the optimized JS path) from a single definition.

### One seam, four targets

A `Target` owns both halves, so `emitParser(grammar, target)` reuses the same target's `emitLexer(grammar, target)`: `jsTarget`'s parser embeds its SoA-int lexer, `goTarget`'s embeds its Tok-list lexer. No cross-target lexer format is shared, so the optimized JS path keeps its integer-bitmask token dispatch while the portable targets keep a dependency-free byte scanner that compiles offline (no RE2 / `regex` crate / network). `src/emit.ts` is the entire public surface; the optimized emitter is simply `jsTarget`. Adding a language is one `Target`.

### Proven on both full real languages, three targets, byte-identical

`javascript.ts` and `typescript.ts` (the `[Await]`/`[Yield]` fork, left recursion, the regex-vs-division and template-interpolation lexer state machines, arrow functions, the no-`in` context, precedence-gated `?:`/`in`/`instanceof`, and the full TS type grammar) emit to ts/go/rust with every CST byte-identical to the `createParser` interpreter. `test/portable-targets.ts` compiles and runs all three targets for sixteen grammars (the two real languages plus focused fixtures) on every CI run. The Rust output reaches oxc throughput and the Go output beats tsgo on the same corpus (an arena keeps both near zero-allocation). Byte-based Go/Rust use UTF-8 offsets, identical to the JS interpreter's for ASCII; non-ASCII offset units differ inherently.

### The JS path also emits type-checked TypeScript

`jsTarget` produces a module that passes `tsc --strict --noEmit`, making the monomorphic type contract explicit and gated (`emit-tsc-gate`). The additions are erasable-only, so Node strips them at import; the runtime is byte-identical and the bench unchanged (~14×).

### Also: JS line-terminator conformance

The lexers counted only LF as a line terminator, but ECMAScript also defines CR, LS, and PS (the set driving ASI and "no LineTerminator here"). Fixed across all four lexer implementations at once (the interpreter `gen-lexer.ts`, the emitted JS `emit-lexer.ts`, and the portable ts/go/rust) so they stay in lockstep. CRLF was already correct, so only lone-CR / LS/PS inputs change; e.g. `return\r1` now applies ASI.

### Why it's safe

- `emit-parser-verify`: emitted JS byte-identical to the interpreter (0 mismatches).
- `portable-targets`: sixteen grammars, ts/go/rust each compiled & run, CST ≡ the interpreter (accept and reject).
- `emit-tsc-gate`: every emitted parser type-checks under `tsc --strict`.
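The "one seam" idea from this description can be pictured with a small sketch. Only the names `Target`, `emitParser`, and `emitLexer` and the `(grammar, target)` call shape come from the PR text; the interface members, the `Grammar` shape, the `Sketch` suffixes, and the toy target below are assumptions made for illustration, not the contents of `src/emit.ts`.

```typescript
// Assumed, simplified shapes; the real types in src/emit.ts are not shown in this PR page.
interface Grammar { name: string; rules: string[]; }

// A target owns both halves: its lexer format and the parser that consumes it.
interface Target {
  name: string;
  renderLexer(grammar: Grammar): string;
  renderParser(grammar: Grammar, lexerSource: string): string;
}

// Sketch of the two target-parameterized entry points.
function emitLexerSketch(grammar: Grammar, target: Target): string {
  return target.renderLexer(grammar);
}

function emitParserSketch(grammar: Grammar, target: Target): string {
  // The parser reuses the SAME target's lexer, so no lexer format crosses targets.
  const lexerSource = emitLexerSketch(grammar, target);
  return target.renderParser(grammar, lexerSource);
}

// Toy target that emits placeholder text instead of real source code.
const toyTarget: Target = {
  name: "toy",
  renderLexer: (g) => `// toy lexer for ${g.name}`,
  renderParser: (g, lexer) => `${lexer}\n// toy parser for ${g.name}: ${g.rules.join(", ")}`,
};

const output = emitParserSketch({ name: "calc", rules: ["Expr", "Term"] }, toyTarget);
```

Under these assumptions, adding a language means supplying one more object that satisfies `Target`; the two entry points stay unchanged.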