Skip to content

⚡ Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling#56

Open
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-avx2-softmax-v6-16743516370251725513
Open

⚡ Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling#56
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-avx2-softmax-v6-16743516370251725513

Conversation

@bugparty

@bugparty bugparty commented Jun 17, 2026

Copy link
Copy Markdown
Owner

💡 What: Added softmax_v6, a heterogeneous unrolled AVX2 implementation (8x max reduction, 4x Horner polynomial exp, 8x normalization).
🎯 Why: Uniformly unrolling Pass 2 (the Horner polynomial exp approximation) by 8x maxes out the 16 available YMM registers in AVX2, causing register spilling to the stack. By strategically keeping the dense compute phase at a 4x unroll while pushing the simpler passes to 8x, we minimize loop overhead and saturate the execution ports without incurring spill penalties.
🏗️ How: Separated the unroll factors manually in the 3-pass loop. Pass 1 and Pass 3 are explicitly unrolled 8x (processing 64 floats per iteration). Pass 2 remains at 4x (32 floats).
📊 Impact:
N=1048576 (Pool Mode): softmax_v6 (1.620ms avg, 2.59 GFLOP/s) matched or slightly beat softmax_v5 (1.681ms avg, 2.49 GFLOP/s).
Microbenchmark of loop combinations showed ~10% latency reduction (1040us -> 933us) in L1 cache-resident buffers.
🖥️ Tested on: GCC 13.3.0, AVX2 environment
🔬 How to reproduce:
cd build && make ml_kernel_bench
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 1048576 --iters 50 --filter "softmax"


PR created automatically by Jules for task 16743516370251725513 started by @bugparty

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new optimized softmax kernel with enhanced multi-pass vectorization approach.
  • Tests

    • Added comprehensive validation testing for the new kernel with diverse input configurations.
  • Documentation

    • Added technical documentation on optimization strategies and best practices for vectorized kernel development.

Implements `softmax_v6` with 8x-4x-8x heterogeneous loop unrolling.
This avoids YMM register spilling in the Horner polynomial phase while
saturating the execution ports for the max and norm reductions.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules

Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds softmax_v6, an AVX2 inline kernel with a 3-pass structure: 8-way unrolled max reduction, 4-way exp256_ps_v2 exp/sum accumulation, and 8-way normalization. Wires it into the benchmark harness (SoftmaxV6Benchmark) and correctness tests (test_softmax_v6). Appends a dev-log entry on heterogeneous-unrolling register-pressure tradeoffs.

Changes

softmax_v6 Kernel, Benchmark, Test, and Docs

Layer / File(s) Summary
softmax_v6 three-pass AVX2 kernel
ml_kernels/include/ml_kernels/softmax.h
Implements softmax_v6 with Pass 1 (8-way 64-element max reduction + tails), Pass 2 (4-way 32-element exp256_ps_v2 accumulation + scalar tail), and Pass 3 (8-way 64-element normalization + tails), with early returns for n==0 and sum_val==0.0f.
Benchmark registration and correctness test
ml_kernels/src/kernel_bench.cpp, ml_kernels/src/test_naive_ops.cpp
SoftmaxV6Benchmark overrides name()/run() to invoke softmax_v6 on pooled buffers and is registered. test_softmax_v6() runs both naive and v6 on a 72-element input, compares outputs with tolerance, and calls exit(1) on mismatch; invoked from main().
Register-pressure design notes
.jules/thunderbolt.md
Adds a 2024-06-17 dev-log entry documenting mixed-ILP register-spilling behavior in AVX2 kernels and recommending heterogeneous unrolling sized by counting registers in the most complex pass.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Introduces the earlier softmax variant and exp256_ps_v2 that softmax_v6 directly builds on, following the same 3-pass max→exp+sum→normalize pattern with the same benchmark/test wiring structure.

Poem

🐇 Hop through registers, eight at a time,
AVX2 vectors marching in rhyme,
Max and exp and normalize too —
No spilling allowed in this kernel stew!
The rabbit unrolls what the compiler can't see,
Heterogeneous phases, as they should be! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling' directly and specifically summarizes the main change: introducing a new softmax_v6 implementation that uses heterogeneous AVX2 loop unrolling. It is clear, concise, and accurately represents the core contribution of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thunderbolt-avx2-softmax-v6-16743516370251725513

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
ml_kernels/src/test_naive_ops.cpp (1)

184-207: ⚡ Quick win

Consider adding a test case for the scalar tail path.

The 72-element input exercises the 64-element and 8-element vector loops, but not the scalar remainder path (requires n % 8 != 0). Adding a second test with, e.g., 73 or 75 elements would improve coverage for the scalar cleanup loops in all three passes.

Also, for consistency with test_softmax_v3/v4/v5, consider adding the sum-to-one verification:

float sum = 0.0f;
for (size_t i = 0; i < input.size(); ++i) {
    // existing comparison...
    sum += output_v6[i];
}
if (std::abs(sum - 1.0f) > 1e-4f) {
    std::cerr << "Sum mismatch: expected 1.0, got " << sum << std::endl;
    exit(1);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 207, The
test_softmax_v6 function is missing two things: (1) it does not verify that the
output values sum to 1.0, which is necessary for consistency with
test_softmax_v3, test_softmax_v4, and test_softmax_v5, and (2) the 72-element
input does not trigger the scalar remainder path because 72 is divisible by 8,
so it does not properly test the scalar cleanup loops. To fix this, add a
sum-to-one verification loop after the element-wise comparison that accumulates
output_v6 values and checks if the sum is approximately 1.0, and change the
input vector size from 72 elements to 73 or 75 elements to ensure the scalar
tail path is exercised during all three passes of the softmax computation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 184-207: The test_softmax_v6 function is missing two things: (1)
it does not verify that the output values sum to 1.0, which is necessary for
consistency with test_softmax_v3, test_softmax_v4, and test_softmax_v5, and (2)
the 72-element input does not trigger the scalar remainder path because 72 is
divisible by 8, so it does not properly test the scalar cleanup loops. To fix
this, add a sum-to-one verification loop after the element-wise comparison that
accumulates output_v6 values and checks if the sum is approximately 1.0, and
change the input vector size from 72 elements to 73 or 75 elements to ensure the
scalar tail path is exercised during all three passes of the softmax
computation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3144497e-a9b0-4b9f-932e-8eff88be55c3

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and abb7357.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant