Reproducer (Apple M-series, Julia 1.11.8, LoopVectorization master, VectorizationBase 0.21.72)
```julia
using LoopVectorization
function f!(out, arr)
    @turbo for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
end
A = rand(6, 2)
out = fill(NaN, 3, 1) # 3 i-iterations, 1 j-iteration
f!(out, A)
@show out # last entry remains NaN — the loop never wrote it
```
The strided load on the contiguous axis (`arr[2i, ...]`) combined with `@turbo`'s default unroll on `:i` makes the cleanup tail skip the final iteration(s) on Apple ARM.
Pattern
The number of dropped trailing `out[i, j]` writes is roughly `out_i mod (unroll_factor * W)` whenever that remainder is nonzero, where `out_i` is the extent of `out` along `:i` and `W` is the SIMD vector width in elements. Concretely, on Apple aarch64 (NEON 128-bit):
Float64 (W=2): bug when `out_i` is odd and ≥ 3 — last 1 iteration dropped.
| M | out_i | last entry written? |
|---|---|---|
| 4..5 | 2 | yes |
| 6..7 | 3 | NO (last entry stays at NaN) |
| 8..9 | 4 | yes |
| 10..11 | 5 | NO |
| 12..13 | 6 | yes |
| 14..15 | 7 | NO |
Float32 (W=4): bug when `out_i mod 4 ∈ {1, 2, 3}` and out_i ≥ 4 — last `out_i mod 4` iterations dropped.
| M | out_i | dropped |
|---|---|---|
| 8..9 | 4 | 0 |
| 10..11 | 5 | 1 |
| 12..13 | 6 | 2 |
| 14..15 | 7 | 3 |
| 16..17 | 8 | 0 |
| 18..19 | 9 | 1 |
| 20..21 | 10 | 2 |
| 22..23 | 11 | 3 |
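The rows in both tables match `out_i mod W` once `out_i ≥ W`. A minimal sketch of that empirical fit follows; `predicted_dropped` is a hypothetical helper name, and the function only encodes the tabulated observations, not anything derived from LoopVectorization's codegen.

```julia
# Hypothetical model of the two tables above (empirical fit only).
# W is the SIMD width in elements: 2 for Float64 and 4 for Float32 on 128-bit NEON.
predicted_dropped(out_i::Integer, W::Integer) = out_i >= W ? out_i % W : 0

# Float64 rows (out_i = 2..7) and Float32 rows (out_i = 4..11) from the tables.
f64_rows = [predicted_dropped(n, 2) for n in 2:7]    # [0, 1, 0, 1, 0, 1]
f32_rows = [predicted_dropped(n, 4) for n in 4:11]   # [0, 1, 2, 3, 0, 1, 2, 3]
```

For Float64 a nonzero prediction corresponds to a "NO" row in the first table; for Float32 the value is the "dropped" column of the second.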
Trigger conditions
- Apple aarch64 (NEON 128-bit register width).
- An axis with a strided load on the contiguous dimension, e.g. `arr[2i, ...]` or `arr[3i, ...]` or `arr[2i-1, ...]`.
- LV's default unroll on that axis (`@turbo unroll=1` and `@turbo unroll=(1,1)` both make the bug go away).
- Independent of the other axis: holds whether `:j` has 1 or 40 iterations, whether the body has 1 term or 4.
The non-strided baseline `out[i,j] = arr[i, j]` does not trigger the bug — only strided loads on the unrolled axis do.
Test gating
This is the cluster of `@test_broken`s in `test/shuffleloadstores.jl` around line 494 (the `tullio_issue_131` pattern with `(j+1) % 4 ∈ (2, 3) && (j+1) ≥ 6`). On the underlying access pattern: `(j+1) = M`, and the failures collapse exactly to `out_i = M÷2` being odd and ≥ 3.
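The equivalence between the gate in the test file and the "`out_i` odd and ≥ 3" statement can be checked by brute force. The sketch below uses two hypothetical predicate names and treats `M` as a plain integer standing in for `(j+1)`; it checks only the arithmetic, not the test file itself.

```julia
# Gate as quoted from the test file, with M standing in for (j+1).
gate_in_test(M) = M % 4 in (2, 3) && M >= 6

# Restatement in terms of the output extent out_i = M ÷ 2.
gate_by_out_i(M) = isodd(M ÷ 2) && M ÷ 2 >= 3

# The two predicates agree on every M in a generous range.
gates_agree = all(gate_in_test(M) == gate_by_out_i(M) for M in 0:1000)
```

This works because `M = 4k + 2` and `M = 4k + 3` both give `M ÷ 2 = 2k + 1`, which is odd, while `M ≥ 6` is the same condition as `M ÷ 2 ≥ 3`.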
Workarounds for users
- `@turbo unroll=1 for ...` — disable unrolling on the unrolled axis. Tested correct for all shapes I tried.
- `@turbo unroll=(2, 2) for ...` — cross-unroll both axes by 2. Also tested correct.
- `@turbo unroll=(1, 4)` (only unroll the non-strided axis) does not fix it on this access pattern — same cleanup tail bug appears.
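To check a workaround across many shapes, one option is a small harness that runs a kernel on a NaN-filled output and counts the entries left unwritten. The sketch below is an assumption about how such a check could look: `count_unwritten` and `ref_kernel!` are hypothetical names, and the reference kernel is a plain scalar loop so the block runs without LoopVectorization.

```julia
# Plain scalar reference kernel (no @turbo); same access pattern as the reproducer.
function ref_kernel!(out, arr)
    for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
    return out
end

# Hypothetical helper: run `kernel!` on an out_i × out_j problem and count the
# entries that are still NaN afterwards, i.e. entries the kernel never wrote.
function count_unwritten(kernel!, ::Type{T}, out_i, out_j = 1) where {T}
    arr = rand(T, 2out_i, 2out_j)       # values in [0, 1), never NaN
    out = fill(T(NaN), out_i, out_j)
    kernel!(out, arr)
    return count(isnan, out)
end

# The scalar reference writes every entry for every shape.
ref_counts = [count_unwritten(ref_kernel!, Float64, n) for n in 1:12]
```

Passing the reproducer's `f!`, or a variant written with `@turbo unroll=1`, in place of `ref_kernel!` gives the per-shape count of dropped writes for that kernel.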
Likely fix area
`src/codegen/lowering.jl` lines ~363-425 (the `unsigned(Ureduct) < unsigned(UF)` branch that generates the unroll cleanup), and/or `terminatecondition` at `src/codegen/loopstartstopmanager.jl:1378`. The cleanup termination check appears to use `UF` instead of `1` for the final scalar phase when the unrolled axis is contiguous and has strided access. I have not isolated the exact off-by-one yet.
Context
Discovered while investigating the SciML small grant for getting LoopVectorization tests green on macOS ARM (companion PRs #569 and JuliaSIMD/VectorizationBase.jl#127, which fix the W=1 nested VecUnroll store and BitVector dynamic-index load issues respectively).
Filing this so the remaining `@test_broken` in `shuffleloadstores.jl` has a concrete pointer to the bug.