⚡ Thunderbolt: softmax_v6 — Heterogeneous Unrolling (8x-4x-8x) by bugparty · Pull Request #57 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-19T20:08:00Z

💡 What
Added softmax_v6 to ml_kernels which utilizes a heterogeneous unrolling strategy (8x-4x-8x) across the max reduction, exponentiation, and normalization phases.

🎯 Why
Multi-phase AVX2 kernels like Softmax have varying register pressure per phase. The exp phase uses Horner's scheme, requiring multiple YMM registers for constants, making it susceptible to register spilling if unrolled too aggressively (e.g. 8x). However, simpler phases like max reduction and normalization can safely be unrolled 8x to fully saturate the CPU's execution ports and hide operation latency.

🏗️ How
The loop unrolling factors were split:

Max reduction phase: 8x unroll (8 independent accumulators)
Exp/sum phase: 4x unroll (to avoid YMM register spilling)
Normalization phase: 8x unroll (to saturate ports)

📊 Impact
Benchmarking with ml_kernel_bench on N=4096 (fixed memory allocation) shows softmax_v6 performing comparably or slightly faster than softmax_v5 (~5.35 GFLOP/s vs ~5.22 GFLOP/s) with stable cache utilization while avoiding register spills.

🖥️ Tested on
Tested on the standard repository CI environments (AVX2-capable CPUs) using DISABLE_CPU_BINDING=1.

🔬 How to reproduce

cd build && make ml_kernel_bench
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 4096,16384,65536 --iters 5000 --warmup 100 --filter 'softmax'

PR created automatically by Jules for task 4712665452163057862 started by @bugparty

Summary by CodeRabbit

New Features
- Added an optimized softmax implementation using adaptive SIMD unroll factors across different computation phases for improved performance and reduced register pressure.
Documentation
- Added guidance on heterogeneous unrolling techniques for multi-phase SIMD kernels, including benchmark comparisons and phase-by-phase profiling strategies.
Tests
- Added comprehensive unit tests and benchmarks for the new optimized implementation.

Implements `softmax_v6` utilizing an 8x-4x-8x unrolling strategy for AVX2 to balance execution port saturation with YMM register pressure. Includes corresponding tests and benchmarks. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-19T20:08:02Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-19T20:08:17Z

📝 Walkthrough

Walkthrough

Adds softmax_v6, an AVX2 softmax kernel using heterogeneous SIMD unrolling: 8x for max-reduction and normalization phases, 4x for the exponentiation/sum phase. A matching SoftmaxV6Benchmark, a 72-element correctness test against softmax_naive, and a design note in .jules/thunderbolt.md are also added.

Changes

softmax_v6 kernel, benchmark, test, and docs

Layer / File(s)	Summary
`softmax_v6` AVX2 kernel `ml_kernels/include/ml_kernels/softmax.h`	New `softmax_v6` function (130 lines): 8x-unrolled max reduction, 4x-unrolled exp+sum phase with 8-wide tail, 8x-unrolled normalization, and early returns for `n==0` / `sum==0`.
Benchmark and unit test `ml_kernels/src/kernel_bench.cpp`, `ml_kernels/src/test_naive_ops.cpp`	`SoftmaxV6Benchmark` extends `SoftmaxBenchmark` and is registered; `test_softmax_v6()` validates against `softmax_naive` on a 72-element input with 1e-4 tolerance and a sum-to-1 check, then wired into `main()`.
Design documentation `.jules/thunderbolt.md`	New section on heterogeneous unrolling strategy with phase-specific guidance, `softmax_v6` vs `softmax_v5` benchmark evidence, and a per-phase profiling action rule.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Introduced exp256_ps_v2 and the 3-phase max→exp+sum→normalize softmax structure that softmax_v6 directly extends with heterogeneous unroll tuning.

Poem

🐇 Hop, hop through the SIMD lane,
Eight-wide max, then four for exp's gain,
Normalization springs back to eight,
No register spills — oh, isn't that great?
The rabbit unrolls at just the right rate! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: introducing softmax_v6 with a heterogeneous unrolling strategy (8x-4x-8x), which is the core focus of all file changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt-softmax-v6-4712665452163057862

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (4)

ml_kernels/include/ml_kernels/softmax.h (1)
511-512: ⚡ Quick win

Place softmax_v6 function braces on their own lines.

This function body opening brace is inline with the signature; repository style requires function-body braces on separate lines.
♻️ Proposed fix
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{
As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: "Keep braces on their own lines for function bodies."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 511 - 512, The opening
brace for the softmax_v6 function is placed inline with the function signature
on the same line, but the repository style guide requires function body braces
to be on separate lines. Move the opening brace of softmax_v6 to its own line
immediately following the function signature line, ensuring it adheres to the
coding guidelines for C/C++ header files.
Source: Coding guidelines
ml_kernels/src/test_naive_ops.cpp (2)
185-186: ⚡ Quick win

Apply function brace style to test_softmax_v6 and main.

Both changed function bodies use inline opening braces; move braces to their own lines to match project style.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: "Keep braces on their own lines for function bodies."

Also applies to: 216-216
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 185 - 186, Move the opening
braces in the function declarations for test_softmax_v6 and main to their own
lines to comply with the project's coding style guidelines. In both functions,
the opening brace currently appears on the same line as the function signature;
instead, place each opening brace on a separate line immediately following the
function declaration.
Source: Coding guidelines

187-198: ⚡ Quick win

Add one non-multiple-of-8 case to cover the scalar remainder path.

The new 72-element case validates 64 + 8 vector processing, but it does not execute the scalar tail in softmax_v6. Add a case like n=73 (or n=71) to exercise that branch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 187 - 198, The current
72-element test input is a multiple of 8, so it doesn't exercise the scalar
remainder path in the softmax_v6 function. Add an additional test case with an
input size that is not a multiple of 8, such as 73 or 71 elements, to ensure the
scalar tail handling code path is properly tested. Create this new test input
vector with the appropriate non-multiple-of-8 size to validate the remainder
processing logic.
ml_kernels/src/kernel_bench.cpp (1)
338-344: ⚡ Quick win

Use separate-line braces for new function bodies in SoftmaxV6Benchmark.

The newly added methods still use inline opening braces; switch to the repository’s function brace style.
♻️ Proposed fix
-    const char *name() const override { return "softmax_v6"; }
+    const char *name() const override
+    {
+        return "softmax_v6";
+    }
 
-    void run() override {
+    void run() override
+    {
         ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
         current_idx_ = (current_idx_ + 1) % pool_size_;
     }
As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: "Keep braces on their own lines for function bodies."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 338 - 344, The name() method in
the SoftmaxV6Benchmark class uses inline opening braces (the opening brace is on
the same line as the function declaration), which violates the repository's C++
coding style that requires braces to be on separate lines for function bodies.
Refactor the name() method by moving the opening brace to its own line, placing
the method body (the return statement) on the following line(s), and keeping the
closing brace on its own line.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 511-512: The opening brace for the softmax_v6 function is placed
inline with the function signature on the same line, but the repository style
guide requires function body braces to be on separate lines. Move the opening
brace of softmax_v6 to its own line immediately following the function signature
line, ensuring it adheres to the coding guidelines for C/C++ header files.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 338-344: The name() method in the SoftmaxV6Benchmark class uses
inline opening braces (the opening brace is on the same line as the function
declaration), which violates the repository's C++ coding style that requires
braces to be on separate lines for function bodies. Refactor the name() method
by moving the opening brace to its own line, placing the method body (the return
statement) on the following line(s), and keeping the closing brace on its own
line.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 185-186: Move the opening braces in the function declarations for
test_softmax_v6 and main to their own lines to comply with the project's coding
style guidelines. In both functions, the opening brace currently appears on the
same line as the function signature; instead, place each opening brace on a
separate line immediately following the function declaration.
- Around line 187-198: The current 72-element test input is a multiple of 8, so
it doesn't exercise the scalar remainder path in the softmax_v6 function. Add an
additional test case with an input size that is not a multiple of 8, such as 73
or 71 elements, to ensure the scalar tail handling code path is properly tested.
Create this new test input vector with the appropriate non-multiple-of-8 size to
validate the remainder processing logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a678e815-f7fe-4f0b-82d4-8ad40914be25

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 78fd7d3.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: softmax_v6 — Heterogeneous Unrolling (8x-4x-8x)#57

⚡ Thunderbolt: softmax_v6 — Heterogeneous Unrolling (8x-4x-8x)#57
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-v6-4712665452163057862

bugparty commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bugparty commented Jun 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

google-labs-jules Bot commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading