Skip to content

⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction optimization#60

Open
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-v6-8697455812306532988
Open

⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction optimization#60
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-v6-8697455812306532988

Conversation

@bugparty

@bugparty bugparty commented Jun 22, 2026

Copy link
Copy Markdown
Owner

💡 What: Implemented softmax_v6 with a new exp256_ps_v3 AVX2 approximation that uses a single FMA for range reduction instead of splitting ln(2).
🎯 Why: In transcendental approximations, splitting ln(2) for exact precision adds instructions to the critical path. ML workloads like Softmax are shift-invariant and can tolerate slightly reduced precision.
🏗️ How: Replaced the split range reduction step with a single _mm256_fnmadd_ps(n, ln2_constant, x). Unrolled the inner loop 4x to hide latency.
📊 Impact: ~10% throughput improvement. softmax_v6 achieves 5.30 GFLOP/s compared to softmax_v5 at 4.81 GFLOP/s (N=65536, Fixed Memory mode). Passes all validation tests within 1e-4 tolerance.
🖥️ Tested on: Haswell+ AVX2 CPU (Linux build container)
🔬 How to reproduce: Compile and run DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax_v[56]' --sizes 65536


PR created automatically by Jules for task 8697455812306532988 started by @bugparty

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optimized softmax implementation for improved performance in ML kernel operations.
  • Documentation

    • Added technical documentation on range reduction optimization for transcendental approximations.
  • Tests

    • Added comprehensive testing for the new softmax variant.

💡 What: Implemented `softmax_v6` with a new `exp256_ps_v3` AVX2 approximation that uses a single FMA for range reduction instead of splitting `ln(2)`.
🎯 Why: In transcendental approximations, splitting `ln(2)` for exact precision adds instructions to the critical path. ML workloads like Softmax are shift-invariant and can tolerate slightly reduced precision.
🏗️ How: Replaced `_mm256_fnmadd_ps(n, c1, x)` and `_mm256_fnmadd_ps(n, c2, r)` with a single `_mm256_fnmadd_ps(n, ln2_constant, x)`. Unrolled the inner loop 4x to hide latency as before.
📊 Impact: ~10% throughput improvement. `softmax_v6` achieves 5.30 GFLOP/s compared to `softmax_v5` at 4.81 GFLOP/s (N=65536, Fixed Memory mode). Passes all validation tests within 1e-4 tolerance.
🖥️ Tested on: Haswell+ AVX2 CPU
🔬 How to reproduce: `DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax_v[56]' --sizes 65536`

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules

Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds exp256_ps_v3, an AVX2 exponential using a single-FMA range reduction with a combined ln(2) constant, and softmax_v6, a 32-wide AVX2 softmax using that helper. A matching benchmark class and correctness test against softmax_naive are added, plus a design note in .jules/thunderbolt.md.

softmax_v6 AVX2 Implementation

Layer / File(s) Summary
exp256_ps_v3 and softmax_v6 core implementation
ml_kernels/include/ml_kernels/softmax.h
exp256_ps_v3 performs range reduction via a single FNMADD with a combined ln(2) constant, followed by polynomial evaluation and 2^n bit construction. softmax_v6 runs a four-accumulator 32-wide max reduction, calls exp256_ps_v3 on shifted inputs, accumulates the exp sum, then normalizes with 1/sum_val in 32-wide, 8-wide, and scalar tail passes.
Benchmark and correctness tests
ml_kernels/src/kernel_bench.cpp, ml_kernels/src/test_naive_ops.cpp
SoftmaxV6Benchmark derives from SoftmaxBenchmark, calls softmax_v6 with pooled buffers, and is registered via REGISTER_BENCHMARK. test_softmax_v6 validates per-element outputs against softmax_naive and asserts the sum is approximately 1.0; main() is updated to invoke it.
Design note
.jules/thunderbolt.md
Dated subsection documents the single-FMA technique, cites benchmark evidence comparing softmax_v6 to softmax_v5, and states a guideline to prefer this approach for ML workloads unless strict IEEE-754 precision is required.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#28: Extends the same softmax.h with earlier exp256_ps_v* helpers and softmax_v* kernels using the same benchmark/test integration pattern.
  • bugparty/cpu_math_kernels_pri#31: Introduces exp256_ps_v2 and softmax_v5 in the same file using a related FMA-based range reduction approach that exp256_ps_v3 directly improves upon.

Poem

🐇 A bunny once measured each ln(2) split,
Two constants for precision — a fiddly bit!
One FNMADD to rule them all,
The latency dropped, the benchmarks stood tall.
softmax_v6 hops to the ML feast,
Single FMA — the fastest beast! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and clearly describes the main change: introducing softmax_v6 with single-FMA range reduction optimization, matching the core objective of this changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thunderbolt-softmax-v6-8697455812306532988

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
ml_kernels/include/ml_kernels/softmax.h (1)

505-505: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Move function opening braces to their own lines for the new softmax/exp helpers.

exp256_ps_v3 and softmax_v6 currently place { on the signature line; this conflicts with the repo’s C/C++ function-body brace rule.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.

Also applies to: 539-539

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` at line 505, The functions
`exp256_ps_v3` and `softmax_v6` have their opening braces `{` on the same line
as the function signature, which violates the repository's C/C++ coding style
guidelines that require function body braces to be on their own lines. Move the
opening brace for both `exp256_ps_v3` (at line 505) and `softmax_v6` (at line
539) to a new line immediately following their respective function signatures.

Source: Coding guidelines

ml_kernels/src/test_naive_ops.cpp (2)

185-185: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Move function opening braces to their own lines in the newly added test entry points.

test_softmax_v6 and main use same-line opening braces, which conflicts with the repo C/C++ function-body brace rule.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.

Also applies to: 220-220

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 185, The function declarations for
test_softmax_v6 and main have their opening braces on the same line as the
function signature, which violates the repo's C/C++ coding guidelines requiring
function body braces to be on separate lines. Move the opening brace to its own
line for both functions: refactor test_softmax_v6 to place the opening brace on
a new line after the function signature, and apply the same change to the main
function around line 220. This ensures consistency with the repository's brace
placement rule for function bodies.

Source: Coding guidelines


187-215: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

test_softmax_v6 misses scalar-tail coverage.

Current input size is 72, so softmax_v6 only exercises the 32-wide and 8-wide loops; the scalar i < n tails are never tested. Add one extra element (e.g., size 73) to cover the scalar fallback path.

Suggested test tweak
-        0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f
+        0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f,
+        0.123f // force scalar tail path (n % 8 != 0)
     };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 187 - 215, The
test_softmax_v6 function currently uses an input vector with 72 elements, which
allows softmax_v6 to process all data through the 32-wide and 8-wide vector
loops without triggering the scalar tail fallback path. To ensure the scalar
remainder handling is tested, add one additional element to the input vector
initialization (making it 73 elements total) so that the scalar loop handling
for the remaining elements is exercised during the function calls to
softmax_naive and softmax_v6.
ml_kernels/src/kernel_bench.cpp (1)

337-343: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Keep benchmark method braces on their own lines to match repo C/C++ style.

name() and run() use same-line opening braces in this changed block, which violates the project rule for function bodies.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 343, The opening braces
for the `name()` and `run()` methods are placed on the same line as the method
declarations, violating the project's C++ style guidelines that require braces
on their own lines for function bodies. Move the opening brace of both the
`name()` method (currently containing the return statement on the same line) and
the `run()` method to separate lines below their respective method signatures to
match the coding standard.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 335-344: The SoftmaxV6Benchmark class does not override the
verify() method and inherits the stricter tolerance of 1e-5f from
SoftmaxBenchmark, while the test_softmax_v6 test uses a tolerance of 1e-4f. This
mismatch causes false verification failures for acceptable v6 accuracy. Add an
override of the verify() method in SoftmaxV6Benchmark that uses the same 1e-4f
tolerance as test_softmax_v6 to ensure consistency between the benchmark and
test acceptance criteria.

---

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 505: The functions `exp256_ps_v3` and `softmax_v6` have their opening
braces `{` on the same line as the function signature, which violates the
repository's C/C++ coding style guidelines that require function body braces to
be on their own lines. Move the opening brace for both `exp256_ps_v3` (at line
505) and `softmax_v6` (at line 539) to a new line immediately following their
respective function signatures.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-343: The opening braces for the `name()` and `run()` methods
are placed on the same line as the method declarations, violating the project's
C++ style guidelines that require braces on their own lines for function bodies.
Move the opening brace of both the `name()` method (currently containing the
return statement on the same line) and the `run()` method to separate lines
below their respective method signatures to match the coding standard.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 185: The function declarations for test_softmax_v6 and main have their
opening braces on the same line as the function signature, which violates the
repo's C/C++ coding guidelines requiring function body braces to be on separate
lines. Move the opening brace to its own line for both functions: refactor
test_softmax_v6 to place the opening brace on a new line after the function
signature, and apply the same change to the main function around line 220. This
ensures consistency with the repository's brace placement rule for function
bodies.
- Around line 187-215: The test_softmax_v6 function currently uses an input
vector with 72 elements, which allows softmax_v6 to process all data through the
32-wide and 8-wide vector loops without triggering the scalar tail fallback
path. To ensure the scalar remainder handling is tested, add one additional
element to the input vector initialization (making it 73 elements total) so that
the scalar loop handling for the remaining elements is exercised during the
function calls to softmax_naive and softmax_v6.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 27f309aa-d0bf-4386-aa65-300d24342507

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 79c1580.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

Comment on lines +335 to +344
class SoftmaxV6Benchmark : public SoftmaxBenchmark {
public:
const char *name() const override { return "softmax_v6"; }

void run() override {
ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
current_idx_ = (current_idx_ + 1) % pool_size_;
}
};
REGISTER_BENCHMARK(SoftmaxV6Benchmark);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SoftmaxV6Benchmark inherits a stricter verify tolerance than the new v6 acceptance target.

Line 335 adds SoftmaxV6Benchmark without overriding verify(), so it inherits SoftmaxBenchmark::verify() (1e-5f) while test_softmax_v6 validates v6 with 1e-4f. This mismatch can produce false benchmark verification failures for otherwise acceptable v6 accuracy.

Suggested fix
 class SoftmaxV6Benchmark : public SoftmaxBenchmark {
 public:
     const char *name() const override { return "softmax_v6"; }

     void run() override {
         ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
         current_idx_ = (current_idx_ + 1) % pool_size_;
     }
+
+    bool verify() override {
+        current_idx_ = 0;
+        run();
+        constexpr float tol = 1e-4f;
+        for (std::size_t i = 0; i < outputs_[0].size(); ++i) {
+            if (std::fabs(outputs_[0][i] - output_ref_[i]) > tol) {
+                return false;
+            }
+        }
+        return true;
+    }
 };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 335 - 344, The
SoftmaxV6Benchmark class does not override the verify() method and inherits the
stricter tolerance of 1e-5f from SoftmaxBenchmark, while the test_softmax_v6
test uses a tolerance of 1e-4f. This mismatch causes false verification failures
for acceptable v6 accuracy. Add an override of the verify() method in
SoftmaxV6Benchmark that uses the same 1e-4f tolerance as test_softmax_v6 to
ensure consistency between the benchmark and test acceptance criteria.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant