
fix(no-optimize): allocate stack frame + i64 local storage in select_with_stack #85

Closed
avrabe wants to merge 4 commits into main from fix/synth-i64-locals-and-frame

Conversation


avrabe (Contributor) commented on Apr 28, 2026

Summary

Two AAPCS-violating bugs in synth's --no-optimize lowering of spilled wasm locals, both surfaced while running pulseengine/gale (verified Rust kernel primitives) through synth on a real Cortex-M Zephyr engine bench.

  1. i64 local storage dropped the upper half. LocalSet/LocalTee/LocalGet for spilled locals always used 4-byte Str/Ldr regardless of the local's wasm type. For i64 locals, only the lower 32 bits round-tripped; the upper bits were whatever happened to be in the register pair partner. For gale's u64-packed FFI decision structs ({action: u32, new_count: u32} packed), this corrupted the action field, breaking every ISR-driven semaphore operation in the engine bench (count=0 [drain_timeout] at every RPM step, 60-min Renode timeout).

  2. Locals aliased the callee-saved spill area. The legacy offset formula (local_idx - 4) * 4 was hardcoded for num_params == 4 and produced negative offsets for other configurations, which the I64Ldr/I64Str encoder silently clamps to 0. Result: the locals were stored at [sp, #0] — exactly where stmdb sp!, {r4, r5, ...} had just saved r4 and r5. After ldmia, the caller saw its r4/r5 corrupted with the inner function's local values. Pure AAPCS violation.
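The aliasing in bug 2 can be made concrete with a small sketch (hypothetical helper names, not synth's actual code): the legacy formula goes negative for any local index below 4, and the encoder's silent clamp lands the local at `[sp, #0]`, on top of the saved r4/r5.

```rust
// Sketch of the legacy offset bug described above (hypothetical function
// names; not synth's real code). The formula was hardcoded for
// num_params == 4, so other configurations produce negative offsets.
fn legacy_local_offset(local_idx: i32) -> i32 {
    (local_idx - 4) * 4
}

// The I64Ldr/I64Str encoder silently clamped negative offsets to 0,
// i.e. [sp, #0] -- exactly where stmdb sp!, {r4, r5, ...} just spilled.
fn clamped_encoder_offset(local_idx: i32) -> u32 {
    legacy_local_offset(local_idx).max(0) as u32
}
```

For gale_k_sem_give_decide (3 i32 params + 1 i64 local, so local index 3), the formula yields -4, clamped to 0 — the aliasing seen in the disassembly below.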

Fix

  • Add compute_local_layout(wasm_ops, num_params) that walks the wasm ops once to determine each non-param local's width (via infer_i64_locals) and assigns stack offsets in ascending-index order, with i64 locals 8-byte aligned. The frame size is rounded up to a multiple of 8 bytes for AAPCS SP alignment at call sites.
  • Prologue: emit sub sp, sp, #frame_size after the callee-saved push.
  • Epilogue (and explicit Return): emit add sp, sp, #frame_size before the pop.
  • LocalGet/LocalSet/LocalTee dispatch on the layout entry: i64 locals use I64Ldr/I64Str (which already emit two 32-bit ops at offsets N and N+4); i32 locals use plain Ldr/Str. Offsets come from the layout, not the legacy formula.
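The layout logic above can be sketched as follows (a simplified stand-in: it takes per-local widths directly, whereas the real compute_local_layout derives them from the wasm op stream via infer_i64_locals):

```rust
// Simplified sketch of the layout computation, assuming widths in bytes
// (4 for i32, 8 for i64) for each non-param local in ascending index
// order. Returns per-local offsets and the total frame size.
fn compute_layout(widths: &[u32]) -> (Vec<u32>, u32) {
    let mut offsets = Vec::with_capacity(widths.len());
    let mut next = 0u32;
    for &w in widths {
        if w == 8 {
            // 8-byte align i64 locals so I64Ldr/I64Str hit N and N+4.
            next = (next + 7) & !7;
        }
        offsets.push(next);
        next += w;
    }
    // Round the frame up to a multiple of 8 for AAPCS SP alignment.
    let frame_size = (next + 7) & !7;
    (offsets, frame_size)
}
```

E.g. locals (i32, i64, i32) get offsets 0, 8, 16 with a 24-byte frame; a lone i64 local gets offset 0 with an 8-byte frame.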

Verification

/tmp/repro_i64.wat (1 i32 param + 1 i64 local round-trip):

Before: str.w r0, [sp]; (no upper store)
After:  sub.w sp, sp, #8;
        str.w r0, [sp]; str.w r1, [sp, #4];
        ldr.w r2, [sp]; ldr.w r3, [sp, #4];
        add.w sp, sp, #8; ldmia.

gale_k_sem_give_decide (3 i32 params + 1 i64 local):

Before: str.w r3, [sp]; str.w r5, [sp, #4]; — clobbered saved r4/r5
After:  sub.w sp, sp, #8 → str into locals frame, not spill;
        add.w sp, sp, #8 before ldmia — proper AAPCS preservation.

Engine-bench build with GALE_USE_SYNTH=ON now produces a 22768 B zephyr.elf (was 22692 B with the buggy synth). Renode validation pending in CI on the gale side.

Out of scope (separate work)

  • Optimized-mode regalloc clobbers r0..r3. optimizer_bridge.rs::ir_to_arm hardcodes R0:R1 and R2:R3 for i64 ops, ignoring AAPCS parameter live-ins. Repro: /tmp/match_gale.wat. Affects the path without --no-optimize. Needs a separate PR — touches every Opcode::I64* arm.
  • Implicit-return-to-R0 elision. Small functions whose final wasm result lands in a non-R0 temp don't get a closing mov r0, ? in --no-optimize mode. Pre-existing; affects addone-style i32-returning functions. Doesn't affect i64 returns (natural R0/R1 lowering).
  • Br with depth ≥ stack emits bx lr without restoring callee-saved regs or deallocating the frame. Pre-existing.

Test plan

  • cargo build --release --bin synth — builds clean
  • cargo install --path crates/synth-cli --force — installs clean
  • /tmp/repro_i64.wat round-trips correctly (both halves stored/loaded)
  • gale_k_sem_give_decide produces proper AAPCS prologue/epilogue
  • i32-only test cases (identity, addone, three_const) still produce valid code
  • Engine bench links with GALE_USE_SYNTH=ON → 22768 B FLASH
  • gale CI Renode bench passes (next step)

avrabe and others added 2 commits April 27, 2026 22:24
Currently synth produces a relocatable object (.o, ET_REL) only when the
input wasm has imports — the relocations those imports generate trigger
the relocatable code path. Self-contained wasm modules with no imports
produce a complete ET_EXEC firmware with vector table, Reset_Handler,
linear_memory section, etc.

For linking into a host build system (e.g. integrating verified Rust
kernel primitives compiled to wasm into a Zephyr build), the host
expects a relocatable .o it can pull into its existing link step. Add
a --relocatable CLI flag that forces ET_REL output regardless of
whether the wasm has imports.

The flag is additive — default behaviour is unchanged.

Tested with gale-ffi.wasm (200 functions, 0 imports): output is now
a 26645-byte ET_REL ARM EABI5 object with all gale_* symbols defined
and no vector-table machinery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in select_with_stack's spilled-local handling, both surfaced
running gale (formally-verified Zephyr kernel primitives) through synth
on real Cortex-M code:

1. **i64 local storage only writes one half.** LocalSet/LocalTee/LocalGet
   for spilled locals always emitted plain Str/Ldr (4 bytes), even for
   i64 locals. The upper half was never stored, so local.get returned
   garbage for 32 high bits. For gale's u64-packed FFI decision structs
   this corrupted the action field, breaking every ISR-driven semaphore
   operation in the engine bench (count=0 / drain_timeout at every step).

2. **Locals area aliased the callee-saved spill area.** The legacy
   offset formula `(local_idx - 4) * 4` was hardcoded for num_params==4
   and produced negative offsets for other configurations, which the
   I64Ldr/I64Str encoder silently clamped to 0. With offset 0, locals
   landed exactly where stmdb had just saved r4/r5 — corrupting the
   callee-saved register spill and propagating wrong values back to the
   caller after ldmia. Pure AAPCS violation.

Fix:

- Add `compute_local_layout(wasm_ops, num_params) -> LocalLayout` that
  walks the wasm op stream once to determine each non-param local's
  width (i32/i64) and assigns proper stack offsets from a base of 0.
  Uses `infer_i64_locals` (also new) which simulates a width vstack
  to classify locals based on what gets stored into them.

- Prologue emits `sub sp, sp, #frame_size` after the callee-saved
  push, allocating the locals area below the saved-register spills.

- Epilogue emits `add sp, sp, #frame_size` before the pop, restoring
  SP to the callee-saved spills before they get popped. Also applied
  to the explicit Return opcode handler.

- LocalGet/LocalSet/LocalTee dispatch on the layout entry: i64 locals
  use I64Ldr/I64Str (which already emit two 32-bit memory ops at offset
  N and N+4); i32 locals use plain Ldr/Str. Offsets come from the layout,
  not from the legacy formula. The frame size is rounded up to a multiple
  of 8 bytes for AAPCS SP alignment at call sites.
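The width-vstack idea behind infer_i64_locals can be sketched like this (a toy op set for illustration; the real function covers the full wasm opcode space):

```rust
// Simplified sketch of infer_i64_locals: simulate a stack of value
// widths and mark a local i64 whenever an 8-byte value is stored into
// it. Hypothetical Op enum; not synth's real IR.
#[derive(Clone, Copy)]
enum Op {
    I32Const,
    I64Const,
    LocalGet(usize),
    LocalSet(usize),
}

fn infer_i64_locals(ops: &[Op], num_locals: usize) -> Vec<bool> {
    let mut is_i64 = vec![false; num_locals];
    let mut vstack: Vec<u32> = Vec::new(); // widths in bytes
    for op in ops {
        match *op {
            Op::I32Const => vstack.push(4),
            Op::I64Const => vstack.push(8),
            Op::LocalGet(i) => vstack.push(if is_i64[i] { 8 } else { 4 }),
            Op::LocalSet(i) => {
                if vstack.pop() == Some(8) {
                    is_i64[i] = true;
                }
            }
        }
    }
    is_i64
}
```

Classification propagates: once a local is marked i64, a later local.get of it pushes width 8, so a local.set fed from it is classified i64 too.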

Verified locally:

  /tmp/repro_i64.wat (1 i32 param + 1 i64 local round-trip):
    Before: str.w r0, [sp]; (no upper store) — upper half lost.
    After:  sub.w sp, sp, #8;
            str.w r0, [sp]; str.w r1, [sp, #4]; (both halves stored)
            ldr.w r2, [sp]; ldr.w r3, [sp, #4]; (both halves loaded)
            add.w sp, sp, #8; ldmia.

  gale_k_sem_give_decide (3 i32 params + 1 i64 local, the function
  whose runtime miscompilation hung the engine bench in CI):
    Before: str.w r3, [sp]; str.w r5, [sp, #4]; — clobbered the saved
            r4/r5 from stmdb, AAPCS-violating.
    After:  sub.w sp, sp, #8 → str into the locals frame, not the spill;
            add.w sp, sp, #8 before ldmia — proper AAPCS.

Engine-bench build with GALE_USE_SYNTH=ON now produces a 22768 B
zephyr.elf (was 22692 B with the buggy synth). Renode validation
pending in CI.

Out of scope:
- The optimized regalloc bug in optimizer_bridge.rs (clobbers r0..r3
  parameter registers — see /tmp/match_gale.wat repro). This fix lets
  --no-optimize work; the optimized path needs a separate pass.
- Implicit-return-to-R0 elision in select_with_stack: small functions
  whose result lands in a non-R0 temp don't get a final mov to R0 in
  --no-optimize mode. Pre-existing, unrelated to this fix; affects
  i32-returning functions with a single intermediate value. Doesn't
  affect i64 returns (which use the natural R0/R1 pair).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit to pulseengine/gale that referenced this pull request Apr 28, 2026
…frame fixes

The previous CI run booted Zephyr with the synth-built ELF but every
RPM step reported count=0 [drain_timeout]. Diagnosed two synth bugs:

1. i64 local storage dropped the upper half (--no-optimize path)
2. Locals area aliased the saved-register spill (also --no-optimize)

Both fixed in pulseengine/synth#85. This commit points the workflow
at that branch so the next CI run uses the fixed synth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe and others added 2 commits April 28, 2026 22:19
PR #85 fixed i64 local STORAGE (both halves stored to consecutive stack
slots) but introduced a follow-on bug: when the i64 register pair gets
allocated via two separate alloc_temp_safe calls, the resulting pair
can be NON-CONSECUTIVE in ALLOCATABLE_REGS if a register in between is
live on the wasm stack.

Subsequent i64 ops downstream call i64_pair_hi(rdlo) to recover the
high register, which assumes the pair is consecutive. With a
non-consecutive pair from earlier, i64_pair_hi returns the WRONG
register and the op reads garbage.

Witnessed on gale_k_sem_give_decide:

  ldr.w r5, [sp]      ; LocalGet i64 lo - r5 picked
  ldr.w r6, [sp, #4]  ; LocalGet i64 hi - r6 picked (consecutive ✓)
  ...later...
  orr.w r0, r0, r2    ; i64.or expected (r5,r6) but used (r2,r3)

Fix: add `alloc_consecutive_pair` helper that ensures two consecutive
free entries in ALLOCATABLE_REGS. Use it everywhere a pair is allocated
for an i64 result: I64Const (line 4567), I64Add/Sub/Mul result allocs
(lines 4914+, 4958+, 4996+), and the new i64-LocalGet path from PR #85.
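The consecutive-pair scan can be sketched as follows (a hypothetical simplification over a flat register list; the real helper works against synth's ALLOCATABLE_REGS and its live-stack bookkeeping):

```rust
// Sketch: find two numerically consecutive free registers in the
// allocatable set, so i64_pair_hi(lo) == lo + 1 holds for the result.
// Register numbers are illustrative (r2..r8, ip), not synth's actual set.
const ALLOCATABLE_REGS: [u8; 8] = [2, 3, 4, 5, 6, 7, 8, 12];

fn alloc_consecutive_pair(live: &[u8]) -> Option<(u8, u8)> {
    for w in ALLOCATABLE_REGS.windows(2) {
        let (lo, hi) = (w[0], w[1]);
        // Pair must be numerically consecutive AND both free; this is
        // what two independent alloc_temp_safe calls failed to guarantee.
        if hi == lo + 1 && !live.contains(&lo) && !live.contains(&hi) {
            return Some((lo, hi));
        }
    }
    None
}
```

With r5 live on the wasm stack, the scan skips past it and returns (r6, r7) rather than handing out r4 and r6 as a broken "pair".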

Verified locally: /tmp/repro_i64.wat round-trips correctly with consecutive
register pair (lo→r2, hi→r3). gale_k_sem_give_decide's LocalGet 3 now
loads to consecutive r5/r6.

Note: the engine bench in Renode still hangs after this fix. Further
diagnostic shows i64.or's ARM lowering uses register pairs (r0,r1) and
(r2,r3) regardless of what's on the wasm-tracked stack — a separate
bug in synth's wildcard fallthrough for I64Or in select_with_stack.
That fix is out of scope for this PR; tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three deeper bugs surfaced when running gale_k_sem_give_decide on
Renode after PR #85 + the consecutive-pair fix:

1. **i64 ops fall through to select_default** in select_with_stack,
   which assumes inputs are in R0:R1 / R2:R3. Wasm-stack-tracked
   pairs from earlier ops never make it. Symptoms: i64.or used
   register pairs from previous shift ops instead of the just-loaded
   LocalGet result, producing a corrupted return value.

   Fix: explicit handlers for I64Or / I64And / I64Xor / I64ExtendI32U /
   I64ExtendI32S / I64Shl / I64ShrU / I64ShrS in select_with_stack,
   each popping the wasm-tracked pair, deriving its hi via
   i64_pair_hi, and emitting the right ARM instructions / pseudo-ops
   with arbitrary registers (the existing I64Shl/etc. ArmOp pseudo-ops
   accept arbitrary register operands).

2. **alloc_consecutive_pair didn't reserve implicit hi registers**.
   The wasm stack only tracks the lo register of each i64; the hi is
   conventional but invisible to a naive scan. A fresh allocation
   could overwrite an earlier i64's implicit hi, breaking subsequent
   i64_pair_hi lookups. Witnessed: I64Const 32 allocated (r1, r2),
   clobbering the hi of a previously-extended i64 in (r0, r1).

   Fix: alloc_consecutive_pair now scans every stack entry and
   reserves both lo AND its conventional pair_hi (over-reserves for
   i32 stack entries — safe).

3. **alloc_consecutive_pair didn't reserve just-popped operands**.
   When an i64 op popped operands and then allocated a result pair,
   the popped values were temporarily off the stack. The allocation
   could pick a register that's about to be read by the op's own
   second instruction (e.g. dst_lo == a_hi means the lo Or write
   clobbers a_hi before the hi Or reads it).

   Fix: extra_avoid parameter on alloc_consecutive_pair. I64Add /
   I64Sub / I64Or / I64Shl / I64Load / extends pass their popped
   operand registers via extra_avoid.
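Points 2 and 3 together amount to two extra reservation rules on top of the basic consecutive scan, sketched here (hypothetical simplification; register numbers and the flat list are illustrative, not synth's real data structures):

```rust
// Sketch: reserve each stack entry's lo AND its conventional pair hi
// (lo + 1, over-reserving for i32 entries -- safe), plus any just-popped
// operand registers passed via extra_avoid.
const ALLOCATABLE_REGS: [u8; 8] = [2, 3, 4, 5, 6, 7, 8, 12];

fn alloc_consecutive_pair(stack_los: &[u8], extra_avoid: &[u8]) -> Option<(u8, u8)> {
    let mut reserved: Vec<u8> = Vec::new();
    for &lo in stack_los {
        reserved.push(lo);
        reserved.push(lo + 1); // implicit hi, invisible to a naive scan
    }
    reserved.extend_from_slice(extra_avoid);
    for w in ALLOCATABLE_REGS.windows(2) {
        if w[1] == w[0] + 1 && !reserved.contains(&w[0]) && !reserved.contains(&w[1]) {
            return Some((w[0], w[1]));
        }
    }
    None
}
```

So an i64 tracked as lo = r2 blocks both r2 and r3, and an op that just popped (r4, r5) passes them in extra_avoid so its result pair cannot overlap the operands it is about to read.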

Verified locally: gale_k_sem_give_decide now produces:
  orr r0, r6, r8   ; lo = shift_result_lo OR local3_lo
  orr r1, r7, ip   ; hi = shift_result_hi OR local3_hi
matching the wasm semantics for i64.or. Engine bench builds clean
with 22644 B FLASH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit to pulseengine/gale that referenced this pull request May 1, 2026
…ulator-artifactual (#27)

* experiment: add GALE_USE_SYNTH build mode (wasm → synth → Cortex-M)

When GALE_USE_SYNTH=ON, the gale-ffi crate is compiled to wasm32 first,
then run through the synth AOT compiler (pulseengine/synth) which emits
a Cortex-M ET_REL relocatable object. The object is wrapped into the
same libgale_ffi.a path the rest of the build expects via ar, so the
per-module gale_sem.c / gale_mutex.c / etc. consumers need no changes.

This is the build-system half of the 4th-variant experiment for the
cross-language LTO blog post: same engine bench, three existing builds
(GCC baseline / GCC + Gale / LLVM + LTO + Gale) plus a 4th data point
where verified Rust reaches Cortex-M via Verus → rustc → wasm → synth's
Rocq-proved i32 instruction selection.

Requires synth with --relocatable flag (pulseengine/synth#83).

Default behaviour (GALE_USE_SYNTH=OFF) is unchanged: rustc-direct-to-Cortex-M
is still the production path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add engine-bench-renode-synth workflow for the 4th-variant experiment

Adds the gale-via-synth lane to the engine-bench Renode matrix as a
follow-up to the cross-language LTO post. Builds the GCC baseline and
the GALE_USE_SYNTH=ON variant (wasm32 -> synth -> Cortex-M ET_REL ->
libgale_ffi.a) on the same CI run, then sweeps both through Renode at
the long sample count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+cmake: insert wasm-opt + loom optional passes before synth

Adds two optional formally-verified-(or-not) wasm optimizers between
rustc and synth in the GALE_USE_SYNTH pipeline:

  rustc -> [wasm-opt -Oz] -> [loom optimize] -> synth -> ar -> .a

Both are detected via find_program() and only inserted into the
pipeline if found. If neither is on the path the pipeline reduces
to rustc -> synth (unchanged behaviour).

Effect on engine bench (stm32f4_disco, prj-gale.conf, GCC C kernel):
  synth alone:                  text=22448, total=38533
  synth + wasm-opt + loom:      text=22420, total=38505 (-28 B)

Wasm-level reduction is dramatic (-34% from wasm-opt -Oz), but the
synth-emitted ARM code is dominated by per-function instruction
selection overhead, so the final ELF only moves a few dozen bytes.
The verification-chain story is the bigger win: loom proves each
pass it applies preserves semantics; rejected passes are skipped
rather than applied unsoundly.

CI workflow installs both: binaryen apt package (wasm-opt) and
loom-cli from pulseengine/loom main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: trigger engine-bench-renode-synth on push to experiment branch

workflow_dispatch alone requires the workflow to exist on the
default branch before it can be invoked, which we can't do without
merging the whole experiment first. Adding a push trigger on
experiment/gale-via-synth so the workflow auto-runs whenever the
experiment branch advances. Strip this trigger before any merge
to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: fix cargo install syntax (--git takes package name positional, not --path)

cargo install --git URL --path PATH is invalid — the two flags are
mutually exclusive. When installing a sub-crate from a git workspace,
pass the package name as a positional argument:

  cargo install --git URL [--branch B] PACKAGE_NAME --force

Fixes the workflow's synth-cli install (was rejected with: 'the
argument --git <URL> cannot be used with --path <PATH>') and the
identical pattern used for loom-cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* workaround: synth --no-optimize (regalloc clobbers param registers in optimize mode)

Diagnosed during local debugging: synth's optimized register-allocation
path clobbers r0/r1 (input parameter registers) at function entry when
the wasm body pushes i32 constants before the first local.get. The
function's prologue ends up looking like:

  movs r0, #1     ← clobbers param 0 (count)
  movs r1, #0     ← clobbers param 1 (limit)
  ...
  cmp r0, r1      ← compares clobbered values, not the actual params

This crashes the engine bench in Renode (HardFault on first gale call,
infinite handler loop, never reaches the test's Zero Drops assertion).
The CI run hit a 60-min step timeout without producing a single sample.

Minimal repro saved at /tmp/match_gale.wat (3 i32 params, i64 local,
push 3 i32 constants, then local.get 0). Worth filing as a synth issue
once the experiment lands.

Workaround: synth --no-optimize. Disables the offending pass and emits
proper AAPCS prologue (push r4..r8/lr, locals on stack, params read
from r0/r1/r2 unchanged). Verified locally: same gale_k_sem_give_decide
function now starts with `stmdb sp!, {r4..r8, lr}` and reads r0/r1
correctly.

Cost: ~68 bytes of additional flash (22624 → 22692) and unknown cycles.
The --no-optimize path uses stack-based locals which is wasteful but
correct. Stack frame size also goes up — synth reserves ~4KB per
function for locals which may be excessive; will need to validate
no stack overflow on the engine bench worker thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: install synth from fix/synth-i64-locals-and-frame to pick up i64+frame fixes

The previous CI run booted Zephyr with the synth-built ELF but every
RPM step reported count=0 [drain_timeout]. Diagnosed two synth bugs:

1. i64 local storage dropped the upper half (--no-optimize path)
2. Locals area aliased the saved-register spill (also --no-optimize)

Both fixed in pulseengine/synth#85. This commit points the workflow
at that branch so the next CI run uses the fixed synth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: re-trigger with synth fix/synth-i64-locals-and-frame@b8da214

Pulls in:
- explicit I64Or/And/Xor/ExtendI32U/ExtendI32S/Shl/ShrU/ShrS handlers
  in synth's select_with_stack (no more wildcard fallthrough to
  select_default's R0:R1/R2:R3 assumption)
- alloc_consecutive_pair now reserves implicit pair_hi of every
  stack entry plus extra_avoid for popped operands

Local build verified: gale_k_sem_give_decide ends with
  orr r0, r6, r8
  orr r1, r7, ip
matching the wasm i64.or semantics. 22644 B FLASH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): skip loom install to A/B test cycle delta source

Tracing the −34.5% handoff cycle delta in the synth bench. Found that
loom's optimizer hoists `local.set 3 = 0` from the fall-through arm
of gale_k_sem_give_decide to BEFORE the dispatcher, dropping the
WakeThread/Increment distinction at the wasm level — synth then emits
ARM that always returns action=INCREMENT regardless of has_waiter.

The bench passes (samples=7750, drops=0) because the engine_control
worker is rarely actually blocked at sem_give time, so the WAKE path
is rarely needed for correctness. But the cycle delta is then
comparing a degenerate always-INCREMENT path to rustc's correct
WAKE/INCREMENT discrimination — apples to oranges.

This run skips loom in CI so we can A/B against the loom-on result and
validate the hypothesis. CMakeLists' find_program(LOOM) fails when
loom isn't on PATH, falling through to wasm-opt -> synth without loom.

Filed for follow-up: pulseengine/loom optimizer bug. The hoisting is
unsound for this control-flow pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): A/B same ELFs under Renode nightly to attribute -34.5% delta

The container ships Renode 1.16.0 (ARG RENODE_VERSION=1.16.0 in
zephyrproject-rtos/docker-image Dockerfile.base, unchanged across
v0.28.x..v0.29.2). 1.16.1 (Feb 2026) touches several Cortex-M paths
that could shift cycle accounting on the gale instruction stream:
fixed ARMv8-M Thumb2 data-processing instructions, fixed Stack
Pointer bits[1:0] handling, fixed wrong exception when FPU is
disabled. None is labelled a cycle-counter fix in the changelog,
but Thumb-2 dispatch changes shift cycle accounting whenever an
instruction takes a different micro-op path.

Adds nightly Renode (builds.renode.io) alongside 1.16.0 and runs
the same two ELFs under both. Yields three controls:

  (a) baseline vs gale, both under nightly
      — does the gale delta change when the cycle model changes?
  (b) baseline_1.16.0 vs baseline_nightly (same ELF)
      — control: cycle-model drift on identical instructions.
  (c) gale_1.16.0 vs gale_nightly (same ELF)
      — does the model shift gale's instructions more than baseline's?
        If yes, the 1.16.0 model is mis-scoring gale-specific
        instructions and the delta is partly artifactual.

Implementation: PATH override puts /opt/renode-nightly first for the
two new run steps. Robot file unchanged (it reads ELF / BENCH_CSV_OUT
from env). Existing 1.16.0 comparison is undisturbed; nightly outputs
go to events-nightly.csv and a separate report section.

Timeout bumped 120 -> 240 min to cover all four Renode runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* synth CI: B6 (rustc-direct cross-check) + D6 (manifest.txt)

Two audit-driven additions to the synth Renode A/B:

B6 — synth correctness control (audit P7, DO-330 §6.1.4):
   The headline -34.5% gale-vs-baseline delta has been ruled non-
   simulator-artifactual by the cycle-model A/B (controls a/b/c
   showed 0.0% drift on identical ELFs across Renode 1.16.0 and
   nightly). What remains is "could synth be miscompiling in a way
   that produces semantically-equivalent-but-faster code?" — a
   research compiler's normal failure mode.

   Adds a third gale build via rustc-direct (no synth in the chain;
   default GALE_USE_SYNTH=OFF path), runs it under 1.16.0, and
   renders an additional comparison "synth vs rustc-direct" in the
   step summary. If the two agree on handoff median within ~3%, the
   synth-emitted ELF is a faithful Cortex-M codegen of the same
   Rust source. This is the cheapest possible "tool error
   mitigation" in the DO-330 §6.1.4 sense — a redundant computation
   by a battle-tested compiler with the same input contract.

   Total Renode runs in this workflow: 5
   (baseline 1.16.0, synth 1.16.0, rustc-direct 1.16.0, baseline
   nightly, synth nightly). timeout-minutes bumped 240 -> 300.

D6 — manifest.txt artifact (audit P9):
   Single point of provenance. The bench's input layer is layered
   with branch-tip pointers (synth --branch fix/..., zephyr --mr
   gale/sem-replacement) and mutable tags (container :v0.29.0).
   Without a captured manifest, "I ran exactly this configuration"
   reduces to "trust the gale_sha and hope nothing else moved" —
   wrong by construction.

   Adds a "Compose build manifest" step at the end of the job that
   captures: rustc/cargo/synth/wasm-opt/west/robotframework/SDK
   versions, Renode 1.16.0 and nightly versions, sha256 of the
   nightly tarball, sha256 of all built ELFs, sha256 + byte-size of
   all CSV outputs, and `west list` for Zephyr fork + module shas.
   Output to /tmp/manifest.txt, included in the artifact bundle
   alongside the three ELFs themselves. The nightly tarball is
   downloaded to /tmp/renode-nightly.tgz first so the sha256 can be
   recorded (was: piped curl -> tar with no intermediate file).

   Six months from now, every reproducibility claim reduces to
   downloading the artifact, opening manifest.txt, and re-resolving
   the three sha-pinned things plus the listed branch tips.

Both YAMLs validated; no behavior changes to existing arms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit to pulseengine/gale that referenced this pull request May 1, 2026
…28)

* experiment: add GALE_USE_SYNTH build mode (wasm → synth → Cortex-M)

When GALE_USE_SYNTH=ON, the gale-ffi crate is compiled to wasm32 first,
then run through the synth AOT compiler (pulseengine/synth) which emits
a Cortex-M ET_REL relocatable object. The object is wrapped into the
same libgale_ffi.a path the rest of the build expects via ar, so the
per-module gale_sem.c / gale_mutex.c / etc. consumers need no changes.

This is the build-system half of the 4th-variant experiment for the
cross-language LTO blog post: same engine bench, three existing builds
(GCC baseline / GCC + Gale / LLVM + LTO + Gale) plus a 4th data point
where verified Rust reaches Cortex-M via Verus → rustc → wasm → synth's
Rocq-proved i32 instruction selection.

Requires synth with --relocatable flag (pulseengine/synth#83).

Default behaviour (GALE_USE_SYNTH=OFF) is unchanged: rustc-direct-to-Cortex-M
is still the production path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add engine-bench-renode-synth workflow for the 4th-variant experiment

Adds the gale-via-synth lane to the engine-bench Renode matrix as a
follow-up to the cross-language LTO post. Builds the GCC baseline and
the GALE_USE_SYNTH=ON variant (wasm32 -> synth -> Cortex-M ET_REL ->
libgale_ffi.a) on the same CI run, then sweeps both through Renode at
the long sample count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+cmake: insert wasm-opt + loom optional passes before synth

Adds two optional formally-verified-(or-not) wasm optimizers between
rustc and synth in the GALE_USE_SYNTH pipeline:

  rustc -> [wasm-opt -Oz] -> [loom optimize] -> synth -> ar -> .a

Both are detected via find_program() and only inserted into the
pipeline if found. If neither is on the path the pipeline reduces
to rustc -> synth (unchanged behaviour).

Effect on engine bench (stm32f4_disco, prj-gale.conf, GCC C kernel):
  synth alone:                  text=22448, total=38533
  synth + wasm-opt + loom:      text=22420, total=38505 (-28 B)

Wasm-level reduction is dramatic (-34% from wasm-opt -Oz), but the
synth-emitted ARM code is dominated by per-function instruction
selection overhead, so the final ELF only moves a few dozen bytes.
The verification-chain story is the bigger win: loom proves each
pass it applies preserves semantics; rejected passes are skipped
rather than applied unsoundly.

CI workflow installs both: binaryen apt package (wasm-opt) and
loom-cli from pulseengine/loom main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: trigger engine-bench-renode-synth on push to experiment branch

workflow_dispatch alone requires the workflow to exist on the
default branch before it can be invoked, which we can't do without
merging the whole experiment first. Adding a push trigger on
experiment/gale-via-synth so the workflow auto-runs whenever the
experiment branch advances. Strip this trigger before any merge
to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: fix cargo install syntax (--git takes package name positional, not --path)

cargo install --git URL --path PATH is invalid — the two flags are
mutually exclusive. When installing a sub-crate from a git workspace,
pass the package name as a positional argument:

  cargo install --git URL [--branch B] PACKAGE_NAME --force

Fixes the workflow's synth-cli install (was rejected with: 'the
argument --git <URL> cannot be used with --path <PATH>') and the
identical pattern used for loom-cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* workaround: synth --no-optimize (regalloc clobbers param registers in optimize mode)

Diagnosed during local debugging: synth's optimized register-allocation
path clobbers r0/r1 (input parameter registers) at function entry when
the wasm body pushes i32 constants before the first local.get. The
function's prologue ends up looking like:

  movs r0, #1     ← clobbers param 0 (count)
  movs r1, #0     ← clobbers param 1 (limit)
  ...
  cmp r0, r1      ← compares clobbered values, not the actual params

This crashes the engine bench in Renode (HardFault on first gale call,
infinite handler loop, never reaches the test's Zero Drops assertion).
The CI run hit a 60-min step timeout without producing a single sample.

Minimal repro saved at /tmp/match_gale.wat (3 i32 params, i64 local,
push 3 i32 constants, then local.get 0). Worth filing as a synth issue
once the experiment lands.

Workaround: synth --no-optimize. Disables the offending pass and emits
proper AAPCS prologue (push r4..r8/lr, locals on stack, params read
from r0/r1/r2 unchanged). Verified locally: same gale_k_sem_give_decide
function now starts with `stmdb sp!, {r4..r8, lr}` and reads r0/r1
correctly.

Cost: ~68 bytes of additional flash (22624 → 22692) and unknown cycles.
The --no-optimize path uses stack-based locals which is wasteful but
correct. Stack frame size also goes up — synth reserves ~4KB per
function for locals which may be excessive; will need to validate
no stack overflow on the engine bench worker thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: install synth from fix/synth-i64-locals-and-frame to pick up i64+frame fixes

The previous CI run booted Zephyr with the synth-built ELF but every
RPM step reported count=0 [drain_timeout]. Diagnosed two synth bugs:

1. i64 local storage dropped the upper half (--no-optimize path)
2. Locals area aliased the saved-register spill (also --no-optimize)

Both fixed in pulseengine/synth#85. This commit points the workflow
at that branch so the next CI run uses the fixed synth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: re-trigger with synth fix/synth-i64-locals-and-frame@b8da214

Pulls in:
- explicit I64Or/And/Xor/ExtendI32U/ExtendI32S/Shl/ShrU/ShrS handlers
  in synth's select_with_stack (no more wildcard fallthrough to
  select_default's R0:R1/R2:R3 assumption)
- alloc_consecutive_pair now reserves implicit pair_hi of every
  stack entry plus extra_avoid for popped operands

Local build verified: gale_k_sem_give_decide ends with
  orr r0, r6, r8
  orr r1, r7, ip
matching the wasm i64.or semantics. 22644 B FLASH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): skip loom install to A/B test cycle delta source

Tracing the −34.5% handoff cycle delta in the synth bench. Found that
loom's optimizer hoists `local.set 3 = 0` from the fall-through arm
of gale_k_sem_give_decide to BEFORE the dispatcher, dropping the
WakeThread/Increment distinction at the wasm level — synth then emits
ARM that always returns action=INCREMENT regardless of has_waiter.

The bench passes (samples=7750, drops=0) because the engine_control
worker is rarely actually blocked at sem_give time, so the WAKE path
is rarely needed for correctness. But the cycle delta is then
comparing a degenerate always-INCREMENT path to rustc's correct
WAKE/INCREMENT discrimination — apples to oranges.

This run skips loom in CI so we can A/B against the loom-on result and
validate the hypothesis. CMakeLists' find_program(LOOM) fails when
loom isn't on PATH, falling through to wasm-opt -> synth without loom.

Filed for follow-up: pulseengine/loom optimizer bug. The hoisting is
unsound for this control-flow pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
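
The observed effect of the hoist can be modeled at the source level. This is a behavioral sketch with hypothetical names (the real dispatch lives in gale_k_sem_give_decide's wasm, not Python), showing why moving the fall-through arm's store above the dispatcher collapses the decision:

```python
WAKE, INCREMENT = 1, 0

def decide_correct(has_waiter):
    # wasm shape: each arm writes local 3 (action) AFTER the
    # dispatcher branch decides which arm runs.
    if has_waiter:
        action = WAKE
    else:
        action = INCREMENT
    return action

def decide_hoisted(has_waiter):
    # loom's hoist moves `local.set 3 = 0` BEFORE the dispatcher;
    # downstream the WAKE arm's store is lost, so every path
    # observably returns INCREMENT regardless of has_waiter.
    action = INCREMENT
    return action

assert decide_correct(True) != decide_hoisted(True)    # WAKE case diverges
assert decide_correct(False) == decide_hoisted(False)  # fall-through agrees
```

The bench's assertions pass either way because the fall-through case dominates at runtime, which is exactly why the bug only shows up in the cycle delta, not in correctness checks.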

* ci(investigate): A/B same ELFs under Renode nightly to attribute -34.5% delta

The container ships Renode 1.16.0 (ARG RENODE_VERSION=1.16.0 in
zephyrproject-rtos/docker-image Dockerfile.base, unchanged across
v0.28.x..v0.29.2). 1.16.1 (Feb 2026) touches several Cortex-M paths
that could shift cycle accounting on the gale instruction stream:
fixed ARMv8-M Thumb2 data-processing instructions, fixed Stack
Pointer bits[1:0] handling, fixed wrong exception when FPU is
disabled. None is labelled a cycle-counter fix in the changelog,
but Thumb-2 dispatch changes shift cycle accounting whenever an
instruction takes a different micro-op path.

Adds nightly Renode (builds.renode.io) alongside 1.16.0 and runs
the same two ELFs under both. Yields three controls:

  (a) baseline vs gale, both under nightly
      — does the gale delta change when the cycle model changes?
  (b) baseline_1.16.0 vs baseline_nightly (same ELF)
      — control: cycle-model drift on identical instructions.
  (c) gale_1.16.0 vs gale_nightly (same ELF)
      — does the model shift gale's instructions more than baseline's?
        If yes, the 1.16.0 model is mis-scoring gale-specific
        instructions and the delta is partly artifactual.

Implementation: PATH override puts /opt/renode-nightly first for the
two new run steps. Robot file unchanged (it reads ELF / BENCH_CSV_OUT
from env). Existing 1.16.0 comparison is undisturbed; nightly outputs
go to events-nightly.csv and a separate report section.

Timeout bumped 120 -> 240 min to cover all four Renode runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
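
The three controls reduce to simple delta arithmetic. A sketch with purely hypothetical cycle counts (chosen only so (a) reproduces a -34.5% delta; the real numbers come from the Renode runs):

```python
def rel_delta(gale_cycles, baseline_cycles):
    """Relative handoff-cycle delta; negative means gale is faster."""
    return (gale_cycles - baseline_cycles) / baseline_cycles

# Hypothetical per-handoff cycle counts, for illustration only.
base_1160, gale_1160 = 1000, 655
base_night, gale_night = 990, 700

delta_a = rel_delta(gale_night, base_night)   # (a) gale delta under nightly
drift_b = rel_delta(base_night, base_1160)    # (b) model drift, same baseline ELF
drift_c = rel_delta(gale_night, gale_1160)    # (c) model drift, same gale ELF
# If |drift_c| is much larger than |drift_b|, the 1.16.0 model scores
# gale-specific instructions differently and the delta is partly artifactual.
```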

* flight_control: macro-bench skeleton (Phases 1-4)

Add the six-thread / two-timer / five-primitive macro benchmark
described in docs/research/macro-bench-design.md. Composes ring_buf +
sem + mutex + msgq + condvar on a 100 Hz fixed-rate flight-control
loop, capturing per-sensor-ISR algo + handoff (engine_control parity)
plus per-controller-period t_lock, t_post, t_round, t_bcast.

Single CSV row per sensor sample with -1 sentinels for the segments
not measured on that row (~9-of-10 cycles have no t_bcast, ~9-of-10
sensor rows have no t_lock/t_post/t_round). 3-axis sweep
(sensor_hz x contention x payload) totalling ~4500 events on the
long sweep, matching engine_control's Renode lane density.

Verified:
  - Builds clean for qemu_cortex_m3 (baseline + gale variants).
  - QEMU smoke run: 150/150 samples, drops=0, telemetry_emits=11
    (priority inheritance keeps the lowest-priority telemetry thread
    alive under fusion/actuator contention).
  - All four new cycle-delta segments populate as expected.

Two-ring split (sensor_ring -> fusion -> emit_ring -> reader) avoids
the single-sem race where reader_loop and the fusion thread would
otherwise compete for sensor_data_ready and steal samples from each
other. Reader thread runs at priority 10 (below all workers) so its
UART back-pressure never starves the measured chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight_control: analyzer + QEMU runner + Renode robot (Phase 5a)

Add analyze.py extending engine_control's per-step + Mann-Whitney
shape with four new metric columns (t_lock, t_post, t_round,
t_bcast). Negative values in the new columns are the "not measured
on this row" sentinel and are filtered out per metric.

New asserts beyond the engine_control set:
  - telemetry_emits > 0 on both variants (design doc Section
    "Risks": priority-inheritance must keep the lowest-priority
    telemetry thread off the starvation floor)
  - gale p99 <= 2 * baseline p99 on each of t_lock, t_post,
    t_round, t_bcast (one regression guard per primitive segment)

run_qemu_bench.sh + tag_events.py mirror engine_control's shape
1:1 (same env conventions, same per-run-id tagging). Renode robot
file is engine_stm32f4.robot with the wait line updated to match
the macro bench's startup banner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
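
The per-metric sentinel filtering described above is mechanically simple; a sketch of the shape (column names are the bench's, the helper itself is illustrative, not analyze.py's actual code):

```python
def pooled_metric(rows, col):
    """Collect one metric across rows, dropping the negative
    "not measured on this row" sentinel per metric — a row missing
    t_bcast can still contribute its t_lock, and vice versa."""
    return [r[col] for r in rows if r[col] >= 0]

rows = [
    {"t_lock": 120, "t_bcast": -1},
    {"t_lock": -1,  "t_bcast": 480},
    {"t_lock": 131, "t_bcast": -1},
]
assert pooled_metric(rows, "t_lock") == [120, 131]
assert pooled_metric(rows, "t_bcast") == [480]
```

Filtering per metric (rather than dropping any row containing a sentinel) is what keeps the single-CSV-row-per-sample design from discarding the ~9-of-10 rows that carry only a subset of the segments.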

* ci: add engine-bench-renode-flight workflow (Phase 5b)

Modeled on engine-bench-renode-synth.yml, runs the new macro
benchmark on stm32f4_disco under Renode for the long sweep
(~4500 events). Same variant matrix (baseline + gale), same
artifact upload shape, same MD report rendered into the job
summary.

Triggered on push to experiment/macro-bench-flight-control and
manually via workflow_dispatch. Uses only safe GitHub contexts
(github.workspace, github.ref) — no untrusted inputs flow into
shell commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight: emit controller-rate rows only; trim long sweep to 9 cells

Run 25135494876 timed out at step 13 of 27 (120-min budget) because
the bench emitted one row per sensor ISR (~1 kHz) of which only ~5%
carried t_lock/t_post/t_round/t_bcast — controller runs at 100 Hz, so
the matching pair-tag covers 1 in ~10 sensor rows, and the partial CSV
shows only 109 of 2012 emitted rows had a real measurement. The other
~95% were near-empty rows whose UART traffic was throttling Renode.

Two changes, applied together:

1. emit_event returns bool. Rows whose slot has no controller-cycle
   pair-tag (t_lock == 0) are dropped. reader_count++ only when a
   row was actually emitted; reader_skipped tracks the dropped
   sensor-rate rows for visibility. UART traffic falls ~10x.

2. Long sweep trimmed from 27 cells (sensor_hz x contention x payload)
   to 9 cells (sensor_hz=1000 only x 3 contention x 3 payload). The
   sensor_hz=2000 axis was the timeout cause; sensor_hz=500 carries
   identical primitive signal at lower rate. Per-cell sensor budget
   bumped from 150–200 to 1000 so each cell yields ~100 controller-
   tagged rows (samples * 100 / sensor_hz). TOTAL_SAMPLES recomputed
   to 900.

The drain loop is rewritten in controller-rate units: it now waits
for `expected_ctrl = samples * 100 / sensor_hz` rows to land, with a
short 5s drain timeout because the cell already waited budget_ms for
sensor ISRs to retire.

CI timeout bumped 120 -> 180 min. Workflow comment block updated to
match.

Per audit P3 #1 #5 (cycle-delta column names) and the partial-CSV
diagnostic from run 25135494876. Local build clean for both
qemu_cortex_m3 baseline (16,392 B FLASH, 41,488 B RAM) and gale
variant (18,480 B FLASH, 41,488 B RAM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
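
The controller-rate accounting above is just the 100 Hz / sensor_hz ratio; a sketch of the arithmetic (function name is illustrative, the bench computes this in C):

```python
def expected_ctrl_rows(samples, sensor_hz, ctrl_hz=100):
    """Controller-tagged rows a cell should yield: the controller runs
    at ctrl_hz, so only samples * ctrl_hz / sensor_hz of the sensor-rate
    rows carry a controller-cycle pair-tag."""
    return samples * ctrl_hz // sensor_hz

# Trimmed long sweep: 9 cells at sensor_hz=1000, 1000 samples each.
per_cell = expected_ctrl_rows(samples=1000, sensor_hz=1000)  # 100 rows/cell
total = 9 * per_cell                                         # TOTAL_SAMPLES
```

This is also the `expected_ctrl` the rewritten drain loop waits on before falling back to the short 5s timeout.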

* flight: audit fixes B2/B3/B4 (stale-token drain, slot wrap, emit_drops)

Three correctness fixes from the Mythos audit (10 personas, 1
fresh-session validator). All three are confirmed by the partial
CSV from run 25135494876 — none of them is hypothetical.

B2 — actuator_done stale-token drain (P1 #1):
   ctrl_loop's K_MSEC(2) timeout path on actuator_done leaves the
   sem token uncollected if the actuator gives later. The next
   cycle's k_sem_take returns 0 immediately, reads a previous-cycle
   g_actuator_done_cyc, and computes
       t_round = old_done_cyc - new_t_post_out
   which underflows to ~2^32 cycles. Add a drain loop before each
   k_msgq_put to flush stale tokens.
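
   The B2 underflow reproduces numerically; a sketch assuming 32-bit cycle counters (as on Cortex-M DWT):

```python
def t_round_u32(done_cyc, post_out_cyc):
    """32-bit cycle-counter subtraction as the bench computes t_round."""
    return (done_cyc - post_out_cyc) & 0xFFFF_FFFF

# Stale token: the actuator-done stamp is from the PREVIOUS cycle, so
# it is older than the new post stamp and the subtraction wraps.
stale = t_round_u32(done_cyc=50_000, post_out_cyc=60_000)
assert stale == 2**32 - 10_000   # ~4.29e9 "cycles", the observed garbage
```

Draining stale tokens guarantees the k_sem_take that succeeds is paired with a stamp from the current cycle, so the subtraction is always new_done - new_post.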

B3 — slot collision wrap (P1 #3):
   Bump SLOT_COUNT 512 → 1024 so the per-cell sensor-ISR budget
   (1000 in the trimmed long sweep) cannot wrap within a single
   cell. Cross-cell wrap remains harmless: sweep_driver stops the
   sensor timer and drains the reader between cells, so any
   in-flight controller stamps from cell X land in slots the
   reader has already consumed before cell X+1 starts.
   RAM cost: 5 arrays × 512 extra slots × 4 bytes = +10 KB. RAM use
   41,488 B → 51,728 B (78.9 %); FLASH +52 B.

B4 — emit_ring drops counter (P4 #2):
   ring_buf_put failures into emit_ring were silently dropped.
   gale's potentially-faster sem_give could mean the reader drains
   emit_ring better → fewer dropped emits → more rows in gale's CSV
   than baseline → biased comparison. Add g_emit_drops volatile,
   emit it in the === END === footer, and assert == 0 in the
   analyzer for both variants. Forces both onto the same
   denominator.

Build clean for qemu_cortex_m3 baseline (16,444 B FLASH, 51,728 B
RAM = 78.9 %).

Audit cross-references:
  P1 (Cortex-M RTOS engineer) — B2, B3
  P4 (counter-attacker)       — B4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight: B1 — column-semantics block in main.c + analyzer report

The column names (t_round, t_bcast) are inherited from the design
doc, but the actual measurement windows are narrower than the names
suggest:

  - t_round is named "round-trip" but measures only
    controller_post_exit → actuator-0 stamp; it does NOT include the
    controller's post-wake sem_take. It also includes actuator 0's
    cycles_busy=100 busy-loop (same for both variants).
  - t_bcast is named "broadcast" but measures the broadcaster's own
    lock+broadcast+unlock window on the fusion thread; the telemetry
    wake is never sampled.

Per audit P3 #1 #5: a reader who treats these names as stated will
over-attribute the measurement scope. The cheapest fix that protects
publication credibility is to define the columns precisely where the
reader will look — at the top of the analyzer's markdown report and
in main.c's file header. Numbers stay; honest scope-setting is
appended.

CSV column positions are unchanged so engine_control's analyzer
docstring's "strict superset" claim remains true.

Build clean; analyzer parses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight CI: D6 — manifest.txt artifact (single point of provenance)

Per audit P9: a reader downloading flight-bench-renode-long.zip six
months from now should be able to identify exactly which Renode,
Zephyr fork, rustc, SDK, and gale_sha produced the cycles. Without
this, "I ran exactly this configuration" reduces to "trust the
gale_sha and hope nothing else moved" — but every input below the
gale repo (Zephyr fork at branch tip, container at mutable tag,
rustc on stable channel) is a moving target.

Adds one new step "Compose build manifest" right before the upload
step. The manifest captures: rustc / cargo / west / robotframework /
SDK versions, Renode 1.16.0 version, Zephyr fork + modules sha via
`west list`, and sha256 + byte-size of every built ELF and emitted
CSV. Output goes to /tmp/manifest.txt and is included in the
artifact bundle. Both ELFs are also uploaded so binary-level
reproducibility can be verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
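
The per-artifact lines in the manifest are sha256 + byte-size + path. A sketch of that shape in Python (illustrative; the actual "Compose build manifest" CI step is shell using sha256sum):

```python
import hashlib
import os

def manifest_line(path):
    """One provenance line for an ELF or CSV: content hash, size, path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # stream, don't slurp
            h.update(chunk)
    return f"{h.hexdigest()}  {os.path.getsize(path)}  {path}"
```

Hashing the ELFs (not just recording the gale_sha) is what makes "I ran exactly this configuration" checkable at the binary level.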

* flight: B5 — moving-block bootstrap CI on pooled p99 + multi-comparison note

Per audit P4 #6 / P2 F1 / P2 F2: the flight bench reports per-step
medians with naive-bootstrap CI and pooled p99 as a point estimate
with no CI. Two related issues:

1. The bench's samples are consecutive 100 Hz controller cycles —
   autocorrelated. A slow noise burst contaminates 5–10 consecutive
   samples; naive bootstrap underestimates the CI by treating
   dependent samples as independent. Politis-Romano predict the
   correction factor is ~sqrt((1+ρ)/(1-ρ)) for first-order auto-
   correlation.
2. The Mann-Whitney p-values are reported uncorrected across
   162 simultaneous tests (27 cells × 6 metrics). At α=0.05 under H0
   that yields ~8 false-positive cells by chance; a reader scanning
   the per-step table for "where did gale win?" will pick those up
   as signal.

Fixes:
- New helper `block_bootstrap_percentile_ci` (block_size=10,
  iters=2000). Used for pooled p50/p75/p95/p99 in the per-metric
  pooled tables. Per-step medians keep the naive bootstrap (median
  is robust to autocorrelation; the issue is tails, not central
  tendency).
- One-paragraph note above the pooled tables explains the bootstrap
  choice and points readers at Holm-Bonferroni / BH-FDR for the
  per-step MW-U cells.

Smoke-tested against engine_control's events.csv (different schema
so 0 rows, but the report header + column-semantics block render
correctly). Block bootstrap on synthetic xs=range(100) gives p99
CI [66, 98] for point=98 — wider than naive, as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
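
A minimal version of the moving-block idea, for readers who want the shape without opening analyze.py (illustrative; the analyzer's helper differs in details like seeding and block handling at the tail):

```python
import random

def block_bootstrap_percentile_ci(xs, q, block_size=10, iters=2000, seed=0):
    """Moving-block bootstrap CI for a percentile: resample contiguous
    fixed-length blocks so short-range autocorrelation (noise bursts
    spanning several consecutive 100 Hz cycles) survives into each
    replicate, instead of being destroyed by i.i.d. resampling."""
    rng = random.Random(seed)
    n = len(xs)
    n_blocks = (n + block_size - 1) // block_size
    stats = []
    for _ in range(iters):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(0, n - block_size + 1)
            sample.extend(xs[start:start + block_size])
        sample.sort()
        stats.append(sample[min(int(q * len(sample)), len(sample) - 1)])
    stats.sort()
    return stats[int(0.025 * iters)], stats[int(0.975 * iters)]
```

On autocorrelated data the resulting interval is wider than the naive bootstrap's, which is the point: the naive CI's narrowness is an artifact of pretending consecutive controller cycles are independent.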

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avrabe avrabe changed the base branch from feat/synth-relocatable-flag to main May 3, 2026 13:16
@avrabe

avrabe commented May 3, 2026

Superseded by #86 — bundled into one CI-runnable PR per the medium-large PR strategy. Same diff is included in #86.

@avrabe avrabe closed this May 3, 2026
