cranelift: inline small constant-length `array.copy` for performance by gfx · Pull Request #13460 · bytecodealliance/wasmtime

gfx · 2026-05-23T01:01:49Z

Small, constant-length bulk copies — memory.copy, table.copy, and array.copy — currently go through the memory_copy libcall regardless of size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actual memmove, so they are far slower than they need to be.

This PR expands such copies inline, as wide loads followed by stores, when the copy length is a compile-time constant of at most 128 bytes.

Updated after code review (thanks @alexcrichton): this started out array.copy-specific and element-count-based. Following the review suggestion it now lives in raw_bulk_memory_operation, keyed on the constant byte length, so it covers all constant-length bulk copies uniformly — and it uses width-agnostic accesses (greedy i8x16 → i64 → … → i8) instead of one load/store per element.

Speed

i32 array copies, ns per copy, release build, aarch64 (Apple M3 Pro), lower is better:

bytes	libcall	inline	speedup
8	3.9	1.9	2.0×
16	4.8	2.0	2.4×
32	5.2	2.4	2.2×
64	4.1	2.3	1.8×
128	4.6	2.7	1.7×
256	5.9	5.9	1.0× (libcall)

Because the inline path is width-agnostic, i8/i16 arrays and memory.copy of the same byte length behave the same.

Methodology

ns per copy = wall time / iteration count for a function that runs ~1e6 copies in a loop (so the per-call host↔wasm transition and the allocation are amortized away), taking the median of several repetitions. The "libcall" column uses a dynamic length (forcing the libcall) and the "inline" column a constant length, in the same module at the same size. A small fixed per-iteration readback keeps the copy from being optimized away, which puts a ~2 ns floor on the absolute numbers; the inline-vs-libcall ratio is the signal.

Correctness & scope

The byte range is covered greedily with the widest convenient access; every chunk is loaded before any is stored, so overlapping source/destination ranges keep memmove semantics.
BulkOp::MemoryCopy carries entity-appropriate access flags (GC-heap corruption trap for arrays, little-endian heap region for linear memory, table region for tables); none assume alignment, since wide chunks may straddle element boundaries.
The length must be a compile-time constant: the expansion is fully unrolled, branch-free straight-line code, so the chunk widths and offsets are fixed when the code is emitted. Dynamic lengths (and constant lengths above 128 bytes) keep the libcall, whose memmove is the right tool when the length is unknown or large — past 128 bytes inline merely ties it (see the 256-byte row). Reference-typed array elements and lazily-initialized funcref tables still take the per-element loop, unchanged.
The 128-byte bound is from measurement: inline wins by ~1.7–2.7× through 128 bytes and ties the libcall by 256, where the cost of loading every chunk before storing catches up.
Fuel for an inlined copy is charged by byte length, matching the libcall.

Tests

tests/disas/array-copy-inline.wat and tests/disas/memory-copy-inline.wat lock the codegen (the libcall is gone; wide loads-then-stores remain, including the greedy i8x16 + i64 + i32 split).
tests/misc_testsuite/gc/array-copy-inline.wast checks correctness for every element width (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (16 inline vs 17 libcall i64s, i.e. 128 vs 136 bytes), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.

🤖 Generated with Claude Code

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an optimization to inline small, statically-sized array.copy operations in Cranelift to avoid the fixed overhead of the memory_copy libcall, and introduces tests to lock down both codegen shape and runtime semantics.

Changes:

Inline-expand constant-length array.copy for small copies (<= 8 elements) into explicit loads+stores with memmove semantics.
Add runtime GC tests covering element sizes/types, overlap semantics, and the 8/9-element threshold.
Add a disassembly-based “shape” test to ensure the inline expansion remains stable.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
tests/misc_testsuite/gc/array-copy-inline.wast	New runtime tests validating correctness across element sizes, threshold behavior, and overlap (`memmove`) semantics.
tests/disas/array-copy-inline.wat	New disassembly test to lock in the expected inline load-then-store codegen sequence.
crates/cranelift/src/func_environ.rs	Implements the inline expansion for small constant-length `array.copy` and a helper to detect constant lengths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When a scalar-element `array.copy` has a compile-time-constant length of at most 8, expand it inline as loads-then-stores instead of calling the `memory_copy` libcall. The libcall's fixed per-call cost (a wasm/host transition and an indirect call) dominates for tiny copies, and is especially visible in unoptimized builds where the libcall body itself is not optimized. Every element is loaded before any is stored so overlapping ranges keep memmove semantics. Dynamic or larger lengths, and tables, still use the libcall, whose `memmove` amortizes the overhead. The bound of 8 trades a little perf (the inline-vs-libcall crossover is ~16 elements) for bounded code size, capturing the largest wins, which cluster at <= 8 elements. Assisted-by: Claude Code:claude-opus-4-7

Reword `emit_inline_array_copy`'s doc to describe a bitwise copy by element width (it also handles `v128` and copies `f32`/`f64` via integer types), and replace the `elem_size`/`n` parameter shadowing with `stride`/`count`. No codegen change. Assisted-by: Claude Code:claude-opus-4-7

alexcrichton · 2026-05-23T15:34:35Z

Thanks for this! If you're willing I believe there's a better, more general, but slightly-more-difficult-to-work-with, location for this optimization to be located. Specifically within raw_bulk_memory_operation I think it'd be reasonable to test for the byte length and see if it's a constant. That way this would uniformly handle all constant-length bulk copies rather than just array.copy. Additionally it's also fine to perform "wrong width" loads/stores where, for example, i8x16 could be used to load values all the time when the size is > 16 bytes as opposed to only when the array element is v128.

Does that sound reasonable to you? And would you be interested to move this optimization to there?

gfx · 2026-05-24T00:44:51Z

Done in d9a6a14 — moved into raw_bulk_memory_operation, keyed on a constant byte length, so memory.copy and table.copy get it too, not just array.copy. Loads/stores are now width-agnostic (greedy i8x16→i64→…→i8), and BulkOp::MemoryCopy carries entity-appropriate flags (GC corrupt-trap / little+heap / table region; none assume alignment).

The threshold is now byte-based at 128 B. Measured on aarch64, inline is ~1.7–2.7× faster than the libcall up to 128 B; at 256 B the load-all-then-store-all sequence ties the libcall, so it falls back there. Inlined-copy fuel is charged by byte length to match the libcall.

Per review, move the inline expansion into `raw_bulk_memory_operation`, keyed on a constant byte length, so it also covers `memory.copy` and `table.copy`, not just `array.copy`. It now uses width-agnostic wide accesses (`i8x16` down to `i8`), and `BulkOp::MemoryCopy` carries entity-appropriate flags. Threshold 128 bytes, from measurement (~1.7-2.7x faster up to 128 B, ties by 256). Assisted-by: Claude Code:claude-opus-4-7

alexcrichton · 2026-05-24T14:59:50Z

+        } = op
+        {
+            if let Some(bytes) = Self::value_as_const_int(builder, len) {
+                if (1..=INLINE_COPY_MAX_BYTES).contains(&bytes) {


For this I'd recommend having if bytes <= INLINE_COPY_MAX_BYTES with perhaps an early-return of 0 to avoid generating any code. (using a range with contains feels a bit overkill here)

Done in fa4a65f: the gate is now bytes <= INLINE_COPY_MAX_BYTES. The zero-length case falls out naturally — emit_inline_memcpy(0) decomposes into zero chunks, so the load/store loops emit nothing.

alexcrichton · 2026-05-24T15:05:04Z

+        let inst = builder.func.dfg.value_def(value).inst()?;
+        Some(match builder.func.dfg.insts[inst] {
+            ir::InstructionData::UnaryImm {
+                opcode: ir::Opcode::Iconst,
+                imm,
+            } => imm.bits().cast_unsigned(),
+            ir::InstructionData::Unary {
+                opcode: ir::Opcode::Uextend | ir::Opcode::Ireduce,
+                arg,
+            } => Self::value_as_const_int(builder, arg)?,
+            ir::InstructionData::BinaryImm64 {
+                opcode: ir::Opcode::ImulImm,
+                arg,
+                imm,
+            } => Self::value_as_const_int(builder, arg)?.wrapping_mul(imm.bits().cast_unsigned()),
+            _ => return None,
+        })


One thing that I'd be a bit worried about here is that types are ignored and the operations aren't actually performed, this is just plumbing iconst values. I think that this'll require actually handling the input/output types of the operation, for example ireduce would alter the value of a immediate with the upper bits set.

One possible alternative to this, however, is to plumb the raw length of the operation through the BulkOp::MemoryCopy value. It's already a bit unfortunate to duplicate const-eval here and while understandable we might be able to sidestep it. Ideally, for example, this would purely match iconst and nothing else, and the need to match uextend/ireduce/imul_imm is due to the code generated here otherwise. If, though, the original length were plumbed (as-is present in wasm) through the BulkOp::MemoryCopy that might mean that iconst would only be necessary here.

To handle that, could you see if this could be updated to only match iconst and then the necessary variable plumbed through to here? Failing that, though, I think this'll need to be a more complicated implementation which handles the types that are happening in each operation.

Done in fa4a65f: plumbed the wasm length through BulkOp::MemoryCopy { const_len: Option<u64> }, so value_as_const_int now matches only iconst (no more uextend/ireduce/imul_imm). The byte length is computed in Rust at the call site as count.checked_mul(elem_size).

alexcrichton · 2026-05-24T15:06:26Z

+        }
+        let vals: SmallVec<[ir::Value; 8]> = chunks
+            .iter()
+            .map(|&(off, ty)| builder.ins().load(ty, flags, src_addr, off))


Threading flags here can be a bit finnicky -- would you be up for benchmarking without threading flags to here? For example using MemFlagsData::trusted() here for loads/stores. While it makes sense that we could do it and theoretically optimize further I'd ideally like to start simpler if we can and skip passing flags around. If there's a noticable performance boost threading flags around, though, that's a different situation to be in.

Done in fa4a65f: dropped the threaded flags; emit_inline_memcpy now uses MemFlagsData::trusted(). Inspected the cross-compiled asm via the disas tests — movdqu on x86_64, ldr q on aarch64. Each load only feeds its paired store, so the x64 aligned flag never lets a load fold into an alignment-requiring SSE operand; there is a comment in the body explaining this. (Runtime exercised on aarch64 locally; CI covers x86_64 execution.)

- Use `MemFlagsData::trusted()` in `emit_inline_memcpy` and drop the per-entity flags threaded through `BulkOp::MemoryCopy`; each load only feeds its paired store, so unaligned moves are selected regardless. - Detect a constant length by matching only `iconst` on the raw wasm length (plumbed as `const_len`) instead of folding the width casts and `* element_size` multiply, which was type-unfaithful. - Gate on `bytes <= 128` with a zero-length early return. Assisted-by: Claude Code:claude-opus-4-7

alexcrichton

Looks great, thanks again for this!

alexcrichton · 2026-05-26T21:55:16Z

Locally the test is failing as:

running 1 test
test fuel::craneliftpulley_run ... FAILED

failures:

---- fuel::craneliftpulley_run stdout ----

thread '<unnamed>' (3621049) panicked at cranelift/codegen/src/isa/pulley_shared/inst/emit.rs:469:13:
assertion `left == right` failed
  left: Big
 right: Little
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    fuel::craneliftpulley_run

I'll try to dig in tomorrow to take a look

alexcrichton · 2026-05-26T21:56:09Z

In the meantime, if you're willing to test, unconditionally doing little endian loads/stores would probably fix this since I think the issue is Pulley doesn't implement big-endian v128 loads/stores. I think at least, would still need to confirm.

gfx requested a review from a team as a code owner May 23, 2026 01:01

Copilot AI review requested due to automatic review settings May 23, 2026 01:01

gfx requested a review from a team as a code owner May 23, 2026 01:01

gfx requested review from cfallin and removed request for a team May 23, 2026 01:01

Copilot AI reviewed May 23, 2026

View reviewed changes

Comment thread crates/cranelift/src/func_environ.rs Outdated

Comment thread crates/cranelift/src/func_environ.rs Outdated

gfx force-pushed the array-copy-inline branch from 7569784 to 1c48b7d Compare May 23, 2026 01:06

gfx force-pushed the array-copy-inline branch from d9a6a14 to 08a7533 Compare May 24, 2026 00:48

gfx force-pushed the array-copy-inline branch 2 times, most recently from 5d3059f to 3f896f4 Compare May 24, 2026 01:02

gfx changed the title ~~cranelift: inline small constant-length array.copy~~ cranelift: inline small constant-length array.copy for performance May 24, 2026

alexcrichton requested review from alexcrichton and removed request for cfallin May 24, 2026 14:58

alexcrichton reviewed May 24, 2026

View reviewed changes

alexcrichton approved these changes May 26, 2026

View reviewed changes

alexcrichton added this pull request to the merge queue May 26, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026

Conversation

gfx commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Speed

Correctness & scope

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

alexcrichton commented May 23, 2026

Uh oh!

gfx commented May 24, 2026

Uh oh!

alexcrichton May 24, 2026

Choose a reason for hiding this comment

Uh oh!

gfx May 26, 2026

Choose a reason for hiding this comment

Uh oh!

alexcrichton May 24, 2026

Choose a reason for hiding this comment

Uh oh!

gfx May 26, 2026

Choose a reason for hiding this comment

Uh oh!

alexcrichton May 24, 2026

Choose a reason for hiding this comment

Uh oh!

gfx May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexcrichton left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alexcrichton commented May 26, 2026

Uh oh!

alexcrichton commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

gfx commented May 23, 2026 •

edited

Loading

gfx May 26, 2026 •

edited

Loading