⚡ Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling by bugparty · Pull Request #56 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-17T20:03:54Z

💡 What: Added softmax_v6, a heterogeneous unrolled AVX2 implementation (8x max reduction, 4x Horner polynomial exp, 8x normalization).
🎯 Why: Uniformly unrolling Pass 2 (the Horner polynomial exp approximation) by 8x maxes out the 16 available YMM registers in AVX2, causing register spilling to the stack. By strategically keeping the dense compute phase at a 4x unroll while pushing the simpler passes to 8x, we minimize loop overhead and saturate the execution ports without incurring spill penalties.
🏗️ How: Separated the unroll factors manually in the 3-pass loop. Pass 1 and Pass 3 are explicitly unrolled 8x (processing 64 floats per iteration). Pass 2 remains at 4x (32 floats).
📊 Impact:
N=1048576 (Pool Mode): softmax_v6 (1.620ms avg, 2.59 GFLOP/s) ran about 3.6% faster on average than softmax_v5 (1.681ms avg, 2.49 GFLOP/s), a small margin.
Microbenchmark of loop combinations showed ~10% latency reduction (1040us -> 933us) in L1 cache-resident buffers.
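
For reference, the percentages behind these figures can be recomputed from the reported numbers. The helpers below are a minimal sketch with illustrative names. The result of roughly 4 FLOPs per element is only inferred from the reported times and GFLOP/s values; the PR does not state how the harness counts FLOPs.

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction in time going from `before` to `after` (same units).
constexpr double reduction_percent(double before, double after) {
    return 100.0 * (before - after) / before;
}

// FLOPs per element implied by a reported throughput and average time.
// gflops is in 1e9 FLOP/s, millis is the average time in milliseconds.
constexpr double implied_flops_per_element(double gflops, double millis, double n) {
    return gflops * 1e9 * (millis * 1e-3) / n;
}

// reduction_percent(1040.0, 933.0)  -> about 10.3 (L1-resident microbenchmark)
// reduction_percent(1.681, 1.620)   -> about 3.6  (N=1048576, v5 vs v6)
// implied_flops_per_element(2.59, 1.620, 1048576.0) -> about 4.0
```
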
🖥️ Tested on: GCC 13.3.0, AVX2 environment
🔬 How to reproduce:
cd build && make ml_kernel_bench
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 1048576 --iters 50 --filter "softmax"

PR created automatically by Jules for task 16743516370251725513 started by @bugparty

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new optimized softmax kernel with enhanced multi-pass vectorization approach.
Tests
- Added comprehensive validation testing for the new kernel with diverse input configurations.
Documentation
- Added technical documentation on optimization strategies and best practices for vectorized kernel development.

Implements `softmax_v6` with 8x-4x-8x heterogeneous loop unrolling. This avoids YMM register spilling in the Horner polynomial phase while saturating the execution ports for the max and norm reductions.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-17T20:03:56Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

coderabbitai · 2026-06-17T20:04:10Z

📝 Walkthrough

Adds softmax_v6, an AVX2 inline kernel with a 3-pass structure: 8-way unrolled max reduction, 4-way exp256_ps_v2 exp/sum accumulation, and 8-way normalization. Wires it into the benchmark harness (SoftmaxV6Benchmark) and correctness tests (test_softmax_v6). Appends a dev-log entry on heterogeneous-unrolling register-pressure tradeoffs.
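
The 8x-4x-8x structure described above can be sketched in portable C++. This is not the code from `softmax.h`: it assumes one "vector" is 8 floats, models each unrolled group of YMM operations as a fixed-count inner loop, and uses `std::exp` as a stand-in for `exp256_ps_v2`. The names `max_pass`, `exp_sum_pass`, `normalize_pass`, and `softmax_hetero_sketch` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;  // floats per 256-bit YMM register

// Pass 1: running max with Unroll independent accumulator "vectors".
template <std::size_t Unroll>
float max_pass(const float* x, std::size_t n) {
    assert(n > 0);
    constexpr std::size_t kStep = Unroll * kLanes;
    float acc[kStep];
    std::fill(acc, acc + kStep, x[0]);
    std::size_t i = 0;
    for (; i + kStep <= n; i += kStep)
        for (std::size_t j = 0; j < kStep; ++j)
            acc[j] = std::max(acc[j], x[i + j]);
    float m = *std::max_element(acc, acc + kStep);
    for (; i < n; ++i) m = std::max(m, x[i]);  // scalar tail
    return m;
}

// Pass 2: out[i] = exp(x[i] - m), accumulating the sum. This is the
// register-hungry pass, so it is instantiated with a smaller Unroll.
template <std::size_t Unroll>
float exp_sum_pass(const float* x, float* out, std::size_t n, float m) {
    constexpr std::size_t kStep = Unroll * kLanes;
    float acc[kStep] = {};
    std::size_t i = 0;
    for (; i + kStep <= n; i += kStep)
        for (std::size_t j = 0; j < kStep; ++j) {
            out[i + j] = std::exp(x[i + j] - m);  // stand-in for exp256_ps_v2
            acc[j] += out[i + j];
        }
    float sum = 0.0f;
    for (std::size_t j = 0; j < kStep; ++j) sum += acc[j];
    for (; i < n; ++i) {  // scalar tail
        out[i] = std::exp(x[i] - m);
        sum += out[i];
    }
    return sum;
}

// Pass 3: scale every element by 1 / sum.
template <std::size_t Unroll>
void normalize_pass(float* out, std::size_t n, float sum) {
    constexpr std::size_t kStep = Unroll * kLanes;
    const float inv = 1.0f / sum;
    std::size_t i = 0;
    for (; i + kStep <= n; i += kStep)
        for (std::size_t j = 0; j < kStep; ++j) out[i + j] *= inv;
    for (; i < n; ++i) out[i] *= inv;  // scalar tail
}

// Heterogeneous unrolling: 8x for the simple passes, 4x for the exp pass.
void softmax_hetero_sketch(const float* x, float* out, std::size_t n) {
    if (n == 0) return;
    const float m = max_pass<8>(x, n);
    const float sum = exp_sum_pass<4>(x, out, n, m);
    if (sum == 0.0f) return;
    normalize_pass<8>(out, n, sum);
}
```

Making the unroll factor a template parameter keeps the three passes independently tunable, which is the point of the heterogeneous approach. Whether a compiler turns the inner loops into the same instruction mix as hand-written intrinsics is not guaranteed.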

Changes

softmax_v6 Kernel, Benchmark, Test, and Docs

| Layer / File(s) | Summary |
|---|---|
| softmax_v6 three-pass AVX2 kernel `ml_kernels/include/ml_kernels/softmax.h` | Implements `softmax_v6` with Pass 1 (8-way 64-element max reduction + tails), Pass 2 (4-way 32-element `exp256_ps_v2` accumulation + scalar tail), and Pass 3 (8-way 64-element normalization + tails), with early returns for `n==0` and `sum_val==0.0f`. |
| Benchmark registration and correctness test `ml_kernels/src/kernel_bench.cpp`, `ml_kernels/src/test_naive_ops.cpp` | `SoftmaxV6Benchmark` overrides `name()`/`run()` to invoke `softmax_v6` on pooled buffers and is registered. `test_softmax_v6()` runs both naive and v6 on a 72-element input, compares outputs with tolerance, and calls `exit(1)` on mismatch; invoked from `main()`. |
| Register-pressure design notes `.jules/thunderbolt.md` | Adds a 2024-06-17 dev-log entry documenting mixed-ILP register-spilling behavior in AVX2 kernels and recommending heterogeneous unrolling sized by counting registers in the most complex pass. |
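
The "count registers in the most complex pass" rule from the dev-log entry can be illustrated with a rough budget. This is a sketch only: the per-stream and shared-constant counts used in the comments are assumptions chosen for illustration, not measurements of the actual kernel, and the function names are hypothetical. The real allocation depends on the compiler.

```cpp
#include <cassert>

constexpr int kYmmRegisters = 16;  // architectural YMM registers with AVX2 on x86-64

// Rough live-register estimate: each unrolled stream keeps some registers
// live, and loop-invariant constants (coefficients, broadcasts) are shared.
constexpr int live_registers(int unroll, int regs_per_stream, int shared_constants) {
    return unroll * regs_per_stream + shared_constants;
}

constexpr bool fits_without_spilling(int unroll, int regs_per_stream, int shared_constants) {
    return live_registers(unroll, regs_per_stream, shared_constants) <= kYmmRegisters;
}

// Largest power-of-two unroll factor (capped at 64) whose estimate still fits.
constexpr int max_unroll(int regs_per_stream, int shared_constants) {
    int unroll = 1;
    while (unroll < 64 &&
           fits_without_spilling(unroll * 2, regs_per_stream, shared_constants)) {
        unroll *= 2;
    }
    return unroll;
}

// Assumed Horner exp pass: 2 registers per stream, 6 shared constants.
//   4x -> 4 * 2 + 6 = 14 registers (fits), 8x -> 8 * 2 + 6 = 22 (spills).
// Assumed normalization pass: 1 register per stream, 1 shared constant.
//   8x -> 8 * 1 + 1 = 9 registers (fits).
static_assert(max_unroll(2, 6) == 4, "exp pass estimate caps at 4x");
static_assert(max_unroll(1, 1) == 8, "simple pass estimate allows 8x");
```
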

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Introduces the earlier softmax variant and exp256_ps_v2 that softmax_v6 directly builds on, following the same 3-pass max→exp+sum→normalize pattern with the same benchmark/test wiring structure.

Poem

🐇 Hop through registers, eight at a time,
AVX2 vectors marching in rhyme,
Max and exp and normalize too —
No spilling allowed in this kernel stew!
The rabbit unrolls what the compiler can't see,
Heterogeneous phases, as they should be! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling' directly and specifically summarizes the main change: introducing a new softmax_v6 implementation that uses heterogeneous AVX2 loop unrolling. It is clear, concise, and accurately represents the core contribution of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

🧹 Nitpick comments (1)

ml_kernels/src/test_naive_ops.cpp (1)

184-207: ⚡ Quick win

Consider adding a test case for the scalar tail path.

The 72-element input exercises the 64-element and 8-element vector loops, but not the scalar remainder path (requires n % 8 != 0). Adding a second test with, e.g., 73 or 75 elements would improve coverage for the scalar cleanup loops in all three passes.

Also, for consistency with test_softmax_v3/v4/v5, consider adding the sum-to-one verification:

float sum = 0.0f;
for (size_t i = 0; i < input.size(); ++i) {
    // existing comparison...
    sum += output_v6[i];
}
if (std::abs(sum - 1.0f) > 1e-4f) {
    std::cerr << "Sum mismatch: expected 1.0, got " << sum << std::endl;
    exit(1);
}
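
Both suggestions (a length that is not a multiple of 8, plus the sum-to-one check) could be combined into one reusable checker along these lines. This is a sketch under assumptions: the `SoftmaxFn` signature is a guess at how the kernels are called, and `softmax_reference` and `check_softmax` are hypothetical names, not functions from `test_naive_ops.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed kernel signature: (input, output, element count).
using SoftmaxFn = void (*)(const float*, float*, std::size_t);

// Straightforward reference softmax used as ground truth.
void softmax_reference(const float* x, float* out, std::size_t n) {
    if (n == 0) return;
    const float m = *std::max_element(x, x + n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(x[i] - m);
        sum += out[i];
    }
    for (std::size_t i = 0; i < n; ++i) out[i] /= sum;
}

// Returns true if `kernel` matches the reference element-wise and its
// outputs sum to one, both within `tol`. Expects n > 0.
bool check_softmax(SoftmaxFn kernel, std::size_t n, float tol = 1e-4f) {
    std::vector<float> input(n), expected(n), actual(n);
    for (std::size_t i = 0; i < n; ++i)
        input[i] = 0.25f * static_cast<float>(i % 17) - 2.0f;

    softmax_reference(input.data(), expected.data(), n);
    kernel(input.data(), actual.data(), n);

    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fabs(actual[i] - expected[i]) > tol) return false;
        sum += actual[i];
    }
    return std::fabs(sum - 1.0f) <= tol;
}
```

Calling it with sizes such as 72, 73, and 75 would cover the vector-only case and two different scalar-remainder lengths; in the real test the kernel under test would be passed in place of the reference.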

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 207, The
test_softmax_v6 function is missing two things: (1) it does not verify that the
output values sum to 1.0, which is necessary for consistency with
test_softmax_v3, test_softmax_v4, and test_softmax_v5, and (2) the 72-element
input does not trigger the scalar remainder path because 72 is divisible by 8,
so it does not properly test the scalar cleanup loops. To fix this, add a
sum-to-one verification loop after the element-wise comparison that accumulates
output_v6 values and checks if the sum is approximately 1.0, and change the
input vector size from 72 elements to 73 or 75 elements to ensure the scalar
tail path is exercised during all three passes of the softmax computation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3144497e-a9b0-4b9f-932e-8eff88be55c3

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and abb7357.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai Bot reviewed Jun 17, 2026

bugparty wants to merge 1 commit into main from thunderbolt-avx2-softmax-v6-16743516370251725513
