Skip to content

⚡ Thunderbolt: softmax_v6 — Heterogeneous Unrolling (8x-4x-8x)#57

Open
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-v6-4712665452163057862
Open

⚡ Thunderbolt: softmax_v6 — Heterogeneous Unrolling (8x-4x-8x)#57
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-v6-4712665452163057862

Conversation

@bugparty

@bugparty bugparty commented Jun 19, 2026

Copy link
Copy Markdown
Owner

💡 What
Added softmax_v6 to ml_kernels which utilizes a heterogeneous unrolling strategy (8x-4x-8x) across the max reduction, exponentiation, and normalization phases.

🎯 Why
Multi-phase AVX2 kernels like Softmax have varying register pressure per phase. The exp phase uses Horner's scheme, requiring multiple YMM registers for constants, making it susceptible to register spilling if unrolled too aggressively (e.g. 8x). However, simpler phases like max reduction and normalization can safely be unrolled 8x to fully saturate the CPU's execution ports and hide operation latency.

🏗️ How
The loop unrolling factors were split:

  • Max reduction phase: 8x unroll (8 independent accumulators)
  • Exp/sum phase: 4x unroll (to avoid YMM register spilling)
  • Normalization phase: 8x unroll (to saturate ports)

📊 Impact
Benchmarking with ml_kernel_bench on N=4096 (fixed memory allocation) shows softmax_v6 performing comparably or slightly faster than softmax_v5 (~5.35 GFLOP/s vs ~5.22 GFLOP/s) with stable cache utilization while avoiding register spills.

🖥️ Tested on
Tested on the standard repository CI environments (AVX2-capable CPUs) using DISABLE_CPU_BINDING=1.

🔬 How to reproduce

cd build && make ml_kernel_bench
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 4096,16384,65536 --iters 5000 --warmup 100 --filter 'softmax'

PR created automatically by Jules for task 4712665452163057862 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Added an optimized softmax implementation using adaptive SIMD unroll factors across different computation phases for improved performance and reduced register pressure.
  • Documentation

    • Added guidance on heterogeneous unrolling techniques for multi-phase SIMD kernels, including benchmark comparisons and phase-by-phase profiling strategies.
  • Tests

    • Added comprehensive unit tests and benchmarks for the new optimized implementation.

Implements `softmax_v6` utilizing an 8x-4x-8x unrolling strategy for AVX2
to balance execution port saturation with YMM register pressure. Includes
corresponding tests and benchmarks.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules

Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds softmax_v6, an AVX2 softmax kernel using heterogeneous SIMD unrolling: 8x for max-reduction and normalization phases, 4x for the exponentiation/sum phase. A matching SoftmaxV6Benchmark, a 72-element correctness test against softmax_naive, and a design note in .jules/thunderbolt.md are also added.

Changes

softmax_v6 kernel, benchmark, test, and docs

Layer / File(s) Summary
softmax_v6 AVX2 kernel
ml_kernels/include/ml_kernels/softmax.h
New softmax_v6 function (130 lines): 8x-unrolled max reduction, 4x-unrolled exp+sum phase with 8-wide tail, 8x-unrolled normalization, and early returns for n==0 / sum==0.
Benchmark and unit test
ml_kernels/src/kernel_bench.cpp, ml_kernels/src/test_naive_ops.cpp
SoftmaxV6Benchmark extends SoftmaxBenchmark and is registered; test_softmax_v6() validates against softmax_naive on a 72-element input with 1e-4 tolerance and a sum-to-1 check, then wired into main().
Design documentation
.jules/thunderbolt.md
New section on heterogeneous unrolling strategy with phase-specific guidance, softmax_v6 vs softmax_v5 benchmark evidence, and a per-phase profiling action rule.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Introduced exp256_ps_v2 and the 3-phase max→exp+sum→normalize softmax structure that softmax_v6 directly extends with heterogeneous unroll tuning.

Poem

🐇 Hop, hop through the SIMD lane,
Eight-wide max, then four for exp's gain,
Normalization springs back to eight,
No register spills — oh, isn't that great?
The rabbit unrolls at just the right rate! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: introducing softmax_v6 with a heterogeneous unrolling strategy (8x-4x-8x), which is the core focus of all file changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thunderbolt-softmax-v6-4712665452163057862

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
ml_kernels/include/ml_kernels/softmax.h (1)

511-512: ⚡ Quick win

Place softmax_v6 function braces on their own lines.

This function body opening brace is inline with the signature; repository style requires function-body braces on separate lines.

♻️ Proposed fix
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: "Keep braces on their own lines for function bodies."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 511 - 512, The opening
brace for the softmax_v6 function is placed inline with the function signature
on the same line, but the repository style guide requires function body braces
to be on separate lines. Move the opening brace of softmax_v6 to its own line
immediately following the function signature line, ensuring it adheres to the
coding guidelines for C/C++ header files.

Source: Coding guidelines

ml_kernels/src/test_naive_ops.cpp (2)

185-186: ⚡ Quick win

Apply function brace style to test_softmax_v6 and main.

Both changed function bodies use inline opening braces; move braces to their own lines to match project style.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: "Keep braces on their own lines for function bodies."

Also applies to: 216-216

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 185 - 186, Move the opening
braces in the function declarations for test_softmax_v6 and main to their own
lines to comply with the project's coding style guidelines. In both functions,
the opening brace currently appears on the same line as the function signature;
instead, place each opening brace on a separate line immediately following the
function declaration.

Source: Coding guidelines


187-198: ⚡ Quick win

Add one non-multiple-of-8 case to cover the scalar remainder path.

The new 72-element case validates 64 + 8 vector processing, but it does not execute the scalar tail in softmax_v6. Add a case like n=73 (or n=71) to exercise that branch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 187 - 198, The current
72-element test input is a multiple of 8, so it doesn't exercise the scalar
remainder path in the softmax_v6 function. Add an additional test case with an
input size that is not a multiple of 8, such as 73 or 71 elements, to ensure the
scalar tail handling code path is properly tested. Create this new test input
vector with the appropriate non-multiple-of-8 size to validate the remainder
processing logic.
ml_kernels/src/kernel_bench.cpp (1)

338-344: ⚡ Quick win

Use separate-line braces for new function bodies in SoftmaxV6Benchmark.

The newly added methods still use inline opening braces; switch to the repository’s function brace style.

♻️ Proposed fix
-    const char *name() const override { return "softmax_v6"; }
+    const char *name() const override
+    {
+        return "softmax_v6";
+    }
 
-    void run() override {
+    void run() override
+    {
         ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
         current_idx_ = (current_idx_ + 1) % pool_size_;
     }

As per coding guidelines, **/*.{c,cpp,cc,h,hpp} requires: "Keep braces on their own lines for function bodies."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 338 - 344, The name() method in
the SoftmaxV6Benchmark class uses inline opening braces (the opening brace is on
the same line as the function declaration), which violates the repository's C++
coding style that requires braces to be on separate lines for function bodies.
Refactor the name() method by moving the opening brace to its own line, placing
the method body (the return statement) on the following line(s), and keeping the
closing brace on its own line.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 511-512: The opening brace for the softmax_v6 function is placed
inline with the function signature on the same line, but the repository style
guide requires function body braces to be on separate lines. Move the opening
brace of softmax_v6 to its own line immediately following the function signature
line, ensuring it adheres to the coding guidelines for C/C++ header files.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 338-344: The name() method in the SoftmaxV6Benchmark class uses
inline opening braces (the opening brace is on the same line as the function
declaration), which violates the repository's C++ coding style that requires
braces to be on separate lines for function bodies. Refactor the name() method
by moving the opening brace to its own line, placing the method body (the return
statement) on the following line(s), and keeping the closing brace on its own
line.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 185-186: Move the opening braces in the function declarations for
test_softmax_v6 and main to their own lines to comply with the project's coding
style guidelines. In both functions, the opening brace currently appears on the
same line as the function signature; instead, place each opening brace on a
separate line immediately following the function declaration.
- Around line 187-198: The current 72-element test input is a multiple of 8, so
it doesn't exercise the scalar remainder path in the softmax_v6 function. Add an
additional test case with an input size that is not a multiple of 8, such as 73
or 71 elements, to ensure the scalar tail handling code path is properly tested.
Create this new test input vector with the appropriate non-multiple-of-8 size to
validate the remainder processing logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a678e815-f7fe-4f0b-82d4-8ad40914be25

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 78fd7d3.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant