Skip to content

cranelift: inline small constant-length array.copy for performance#13460

Open
gfx wants to merge 4 commits into
bytecodealliance:mainfrom
wado-lang:array-copy-inline
Open

cranelift: inline small constant-length array.copy for performance#13460
gfx wants to merge 4 commits into
bytecodealliance:mainfrom
wado-lang:array-copy-inline

Conversation

@gfx
Copy link
Copy Markdown

@gfx gfx commented May 23, 2026

Small, constant-length bulk copies — memory.copy, table.copy, and array.copy — currently go through the memory_copy libcall regardless of size. For tiny copies the libcall's fixed per-call cost — a wasm↔host transition plus an indirect call — dominates the actual memmove, so they are far slower than they need to be.

This PR expands such copies inline, as wide loads followed by stores, when the copy length is a compile-time constant of at most 128 bytes.

Updated after code review (thanks @alexcrichton): this started out array.copy-specific and element-count-based. Following the review suggestion it now lives in raw_bulk_memory_operation, keyed on the constant byte length, so it covers all constant-length bulk copies uniformly — and it uses width-agnostic accesses (greedy i8x16i64 → … → i8) instead of one load/store per element.

Speed

i32 array copies, ns per copy, release build, aarch64 (Apple M3 Pro), lower is better:

bytes libcall inline speedup
8 3.9 1.9 2.0×
16 4.8 2.0 2.4×
32 5.2 2.4 2.2×
64 4.1 2.3 1.8×
128 4.6 2.7 1.7×
256 5.9 5.9 1.0× (libcall)

Because the inline path is width-agnostic, i8/i16 arrays and memory.copy of the same byte length behave the same.

Methodology

ns per copy = wall time / iteration count for a function that runs ~1e6 copies in a loop (so the per-call host↔wasm transition and the allocation are amortized away), taking the median of several repetitions. The "libcall" column uses a dynamic length (forcing the libcall) and the "inline" column a constant length, in the same module at the same size. A small fixed per-iteration readback keeps the copy from being optimized away, which puts a ~2 ns floor on the absolute numbers; the inline-vs-libcall ratio is the signal.

Correctness & scope

  • The byte range is covered greedily with the widest convenient access; every chunk is loaded before any is stored, so overlapping source/destination ranges keep memmove semantics.
  • BulkOp::MemoryCopy carries entity-appropriate access flags (GC-heap corruption trap for arrays, little-endian heap region for linear memory, table region for tables); none assume alignment, since wide chunks may straddle element boundaries.
  • The length must be a compile-time constant: the expansion is fully unrolled, branch-free straight-line code, so the chunk widths and offsets are fixed when the code is emitted. Dynamic lengths (and constant lengths above 128 bytes) keep the libcall, whose memmove is the right tool when the length is unknown or large — past 128 bytes inline merely ties it (see the 256-byte row). Reference-typed array elements and lazily-initialized funcref tables still take the per-element loop, unchanged.
  • The 128-byte bound is from measurement: inline wins by ~1.7–2.7× through 128 bytes and ties the libcall by 256, where the cost of loading every chunk before storing catches up.
  • Fuel for an inlined copy is charged by byte length, matching the libcall.

Tests

  • tests/disas/array-copy-inline.wat and tests/disas/memory-copy-inline.wat lock the codegen (the libcall is gone; wide loads-then-stores remain, including the greedy i8x16 + i64 + i32 split).
  • tests/misc_testsuite/gc/array-copy-inline.wast checks correctness for every element width (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (16 inline vs 17 libcall i64s, i.e. 128 vs 136 bytes), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.

🤖 Generated with Claude Code

@gfx gfx requested a review from a team as a code owner May 23, 2026 01:01
Copilot AI review requested due to automatic review settings May 23, 2026 01:01
@gfx gfx requested a review from a team as a code owner May 23, 2026 01:01
@gfx gfx requested review from cfallin and removed request for a team May 23, 2026 01:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an optimization to inline small, statically-sized array.copy operations in Cranelift to avoid the fixed overhead of the memory_copy libcall, and introduces tests to lock down both codegen shape and runtime semantics.

Changes:

  • Inline-expand constant-length array.copy for small copies (<= 8 elements) into explicit loads+stores with memmove semantics.
  • Add runtime GC tests covering element sizes/types, overlap semantics, and the 8/9-element threshold.
  • Add a disassembly-based “shape” test to ensure the inline expansion remains stable.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/misc_testsuite/gc/array-copy-inline.wast New runtime tests validating correctness across element sizes, threshold behavior, and overlap (memmove) semantics.
tests/disas/array-copy-inline.wat New disassembly test to lock in the expected inline load-then-store codegen sequence.
crates/cranelift/src/func_environ.rs Implements the inline expansion for small constant-length array.copy and a helper to detect constant lengths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/cranelift/src/func_environ.rs Outdated
Comment thread crates/cranelift/src/func_environ.rs Outdated
When a scalar-element `array.copy` has a compile-time-constant length of at
most 8, expand it inline as loads-then-stores instead of calling the
`memory_copy` libcall. The libcall's fixed per-call cost (a wasm/host
transition and an indirect call) dominates for tiny copies, and is
especially visible in unoptimized builds where the libcall body itself is
not optimized.

Every element is loaded before any is stored so overlapping ranges keep
memmove semantics. Dynamic or larger lengths, and tables, still use the
libcall, whose `memmove` amortizes the overhead. The bound of 8 trades a
little perf (the inline-vs-libcall crossover is ~16 elements) for bounded
code size, capturing the largest wins, which cluster at <= 8 elements.

Assisted-by: Claude Code:claude-opus-4-7
@gfx gfx force-pushed the array-copy-inline branch from 7569784 to 1c48b7d Compare May 23, 2026 01:06
Reword `emit_inline_array_copy`'s doc to describe a bitwise copy by element
width (it also handles `v128` and copies `f32`/`f64` via integer types), and
replace the `elem_size`/`n` parameter shadowing with `stride`/`count`. No
codegen change.

Assisted-by: Claude Code:claude-opus-4-7
@alexcrichton
Copy link
Copy Markdown
Member

Thanks for this! If you're willing I believe there's a better, more general, but slightly-more-difficult-to-work-with, location for this optimization to be located. Specifically within raw_bulk_memory_operation I think it'd be reasonable to test for the byte length and see if it's a constant. That way this would uniformly handle all constant-length bulk copies rather than just array.copy. Additionally it's also fine to perform "wrong width" loads/stores where, for example, i8x16 could be used to load values all the time when the size is > 16 bytes as opposed to only when the array element is v128.

Does that sound reasonable to you? And would you be interested to move this optimization to there?

@gfx
Copy link
Copy Markdown
Author

gfx commented May 24, 2026

Done in d9a6a14 — moved into raw_bulk_memory_operation, keyed on a constant byte length, so memory.copy and table.copy get it too, not just array.copy. Loads/stores are now width-agnostic (greedy i8x16i64→…→i8), and BulkOp::MemoryCopy carries entity-appropriate flags (GC corrupt-trap / little+heap / table region; none assume alignment).

The threshold is now byte-based at 128 B. Measured on aarch64, inline is ~1.7–2.7× faster than the libcall up to 128 B; at 256 B the load-all-then-store-all sequence ties the libcall, so it falls back there. Inlined-copy fuel is charged by byte length to match the libcall.

@gfx gfx force-pushed the array-copy-inline branch from d9a6a14 to 08a7533 Compare May 24, 2026 00:48
Per review, move the inline expansion into `raw_bulk_memory_operation`, keyed
on a constant byte length, so it also covers `memory.copy` and `table.copy`,
not just `array.copy`. It now uses width-agnostic wide accesses (`i8x16` down
to `i8`), and `BulkOp::MemoryCopy` carries entity-appropriate flags. Threshold
128 bytes, from measurement (~1.7-2.7x faster up to 128 B, ties by 256).

Assisted-by: Claude Code:claude-opus-4-7
@gfx gfx force-pushed the array-copy-inline branch 2 times, most recently from 5d3059f to 3f896f4 Compare May 24, 2026 01:02
@gfx gfx changed the title cranelift: inline small constant-length array.copy cranelift: inline small constant-length array.copy for performance May 24, 2026
@alexcrichton alexcrichton requested review from alexcrichton and removed request for cfallin May 24, 2026 14:58
Comment thread crates/cranelift/src/func_environ.rs Outdated
} = op
{
if let Some(bytes) = Self::value_as_const_int(builder, len) {
if (1..=INLINE_COPY_MAX_BYTES).contains(&bytes) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this I'd recommend having if bytes <= INLINE_COPY_MAX_BYTES with perhaps an early-return of 0 to avoid generating any code. (using a range with contains feels a bit overkill here)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fa4a65f: the gate is now bytes <= INLINE_COPY_MAX_BYTES. The zero-length case falls out naturally — emit_inline_memcpy(0) decomposes into zero chunks, so the load/store loops emit nothing.

Comment thread crates/cranelift/src/func_environ.rs Outdated
Comment on lines +4435 to +4451
let inst = builder.func.dfg.value_def(value).inst()?;
Some(match builder.func.dfg.insts[inst] {
ir::InstructionData::UnaryImm {
opcode: ir::Opcode::Iconst,
imm,
} => imm.bits().cast_unsigned(),
ir::InstructionData::Unary {
opcode: ir::Opcode::Uextend | ir::Opcode::Ireduce,
arg,
} => Self::value_as_const_int(builder, arg)?,
ir::InstructionData::BinaryImm64 {
opcode: ir::Opcode::ImulImm,
arg,
imm,
} => Self::value_as_const_int(builder, arg)?.wrapping_mul(imm.bits().cast_unsigned()),
_ => return None,
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that I'd be a bit worried about here is that types are ignored and the operations aren't actually performed, this is just plumbing iconst values. I think that this'll require actually handling the input/output types of the operation, for example ireduce would alter the value of a immediate with the upper bits set.

One possible alternative to this, however, is to plumb the raw length of the operation through the BulkOp::MemoryCopy value. It's already a bit unfortunate to duplicate const-eval here and while understandable we might be able to sidestep it. Ideally, for example, this would purely match iconst and nothing else, and the need to match uextend/ireduce/imul_imm is due to the code generated here otherwise. If, though, the original length were plumbed (as-is present in wasm) through the BulkOp::MemoryCopy that might mean that iconst would only be necessary here.

To handle that, could you see if this could be updated to only match iconst and then the necessary variable plumbed through to here? Failing that, though, I think this'll need to be a more complicated implementation which handles the types that are happening in each operation.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fa4a65f: plumbed the wasm length through BulkOp::MemoryCopy { const_len: Option<u64> }, so value_as_const_int now matches only iconst (no more uextend/ireduce/imul_imm). The byte length is computed in Rust at the call site as count.checked_mul(elem_size).

}
let vals: SmallVec<[ir::Value; 8]> = chunks
.iter()
.map(|&(off, ty)| builder.ins().load(ty, flags, src_addr, off))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Threading flags here can be a bit finnicky -- would you be up for benchmarking without threading flags to here? For example using MemFlagsData::trusted() here for loads/stores. While it makes sense that we could do it and theoretically optimize further I'd ideally like to start simpler if we can and skip passing flags around. If there's a noticable performance boost threading flags around, though, that's a different situation to be in.

Copy link
Copy Markdown
Author

@gfx gfx May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fa4a65f: dropped the threaded flags; emit_inline_memcpy now uses MemFlagsData::trusted(). Inspected the cross-compiled asm via the disas tests — movdqu on x86_64, ldr q on aarch64. Each load only feeds its paired store, so the x64 aligned flag never lets a load fold into an alignment-requiring SSE operand; there is a comment in the body explaining this. (Runtime exercised on aarch64 locally; CI covers x86_64 execution.)

- Use `MemFlagsData::trusted()` in `emit_inline_memcpy` and drop the
  per-entity flags threaded through `BulkOp::MemoryCopy`; each load only
  feeds its paired store, so unaligned moves are selected regardless.
- Detect a constant length by matching only `iconst` on the raw wasm
  length (plumbed as `const_len`) instead of folding the width casts and
  `* element_size` multiply, which was type-unfaithful.
- Gate on `bytes <= 128` with a zero-length early return.

Assisted-by: Claude Code:claude-opus-4-7
Copy link
Copy Markdown
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks again for this!

@alexcrichton alexcrichton added this pull request to the merge queue May 26, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026
@alexcrichton
Copy link
Copy Markdown
Member

Locally the test is failing as:

running 1 test
test fuel::craneliftpulley_run ... FAILED

failures:

---- fuel::craneliftpulley_run stdout ----

thread '<unnamed>' (3621049) panicked at cranelift/codegen/src/isa/pulley_shared/inst/emit.rs:469:13:
assertion `left == right` failed
  left: Big
 right: Little
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    fuel::craneliftpulley_run

I'll try to dig in tomorrow to take a look

@alexcrichton
Copy link
Copy Markdown
Member

In the meantime, if you're willing to test, unconditionally doing little endian loads/stores would probably fix this since I think the issue is Pulley doesn't implement big-endian v128 loads/stores. I think at least, would still need to confirm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants