⚡ Thunderbolt: softmax_v6 — AVX2 single-FMA exp256 optimization by bugparty · Pull Request #59 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-21T20:16:58Z

💡 What:
Implemented softmax_v6 using a new exp256_ps_v3 AVX2 approximation. This new version combines the split ln(2) subtraction constants into a single FMA instruction (_mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x)).

🎯 Why:
The inner loop of softmax_v5 was bottlenecked by execution port pressure. The standard split ln(2) precision technique (r = x - n*ln2_hi - n*ln2_lo) requires two subtraction instructions.

🏗️ How:
By leveraging the shift-invariance of Softmax, we can tolerate the slight loss in precision from using the single 32-bit floating point constant 0.6931471805599453f for ln(2). This replaces two instructions with one fused multiply-subtract, reducing port contention within the heavily unrolled (4x) loop. The test suite correctly verifies outputs remain strictly within the 1e-4 tolerance.

📊 Impact:

Softmax N=16384 (Fixed Memory): 4.87 GFLOP/s -> 5.44 GFLOP/s (~11.7% increase).
Softmax N=1048576 (Fixed Memory): 3.86 GFLOP/s -> 4.20 GFLOP/s (~8.8% increase).

🖥️ Tested on:
Haswell+ x86-64 CPU environment (via ml_kernel_bench).

🔬 How to reproduce:
cd build && DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench | grep -E 'benchmark|softmax_v5|softmax_v6'

PR created automatically by Jules for task 18069717889667174757 started by @bugparty

Summary by CodeRabbit

Release Notes

New Features
- Introduced an optimized softmax kernel with improved AVX2 performance through enhanced range reduction and FMA operations.
Tests
- Added validation tests for the new softmax kernel variant with comprehensive assertions and tolerance checks.
Documentation
- Added optimization notes documenting softmax performance improvements and implementation guidelines.

… share with you: ⚡ Thunderbolt: softmax_v6 — AVX2 single-FMA exp256 optimization 💡 What: I implemented `softmax_v6` using a new `exp256_ps_v3` AVX2 approximation. This new version combines the split `ln(2)` subtraction constants into a single FMA instruction (`_mm256_fnmadd_ps`). 🎯 Why: The inner loop of `softmax_v5` was bottlenecked by execution port pressure. The split `ln(2)` precision technique (`r = x - n*ln2_hi - n*ln2_lo`) requires two subtraction instructions. 🏗️ How: By leveraging the shift-invariance of Softmax, we can tolerate the slight loss in precision from using the single 32-bit floating point constant `0.6931471805599453f` for `ln(2)`. This replaces two instructions with one fused multiply-add, reducing port contention within the heavily unrolled (4x) loop while keeping outputs within the 1e-4 tolerance. 📊 Impact: - Softmax N=1048576 (Fixed Memory): 3.86 GFLOP/s -> 4.20 GFLOP/s (~8.8% throughput increase). 🖥️ Tested on: Haswell+ x86-64 CPU environment (via `ml_kernel_bench`). 🔬 How to reproduce: You can run the benchmarks with `cd build && DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench` and then filter the output for 'benchmark', 'softmax_v5', or 'softmax_v6' to see the results. Let me know if you need any further adjustments! Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-21T20:17:00Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-21T20:17:12Z

📝 Walkthrough

Walkthrough

Adds exp256_ps_v3, an AVX2 exp approximation using a single fnmadd_ps for ln(2) range reduction, and softmax_v6 that uses it for numerically-stable softmax. A matching benchmark and correctness test are registered, and the FMA optimization rationale is logged to the dev journal.

Changes

AVX2 softmax_v6 with FMA-optimized exp helper

Layer / File(s)	Summary
`exp256_ps_v3`: FMA range-reduction exp helper `ml_kernels/include/ml_kernels/softmax.h`	Adds `exp256_ps_v3` which clamps inputs, converts `x*log2(e)` to an integer exponent via `cvtps_epi32`, computes the reduced residual with a single `fnmadd_ps` using the full `ln(2)` constant, evaluates the polynomial via Horner's method, and reconstructs `exp(x)` from integer exponent bits.
`softmax_v6`: 32-wide vectorized softmax `ml_kernels/include/ml_kernels/softmax.h`	Adds `softmax_v6` with 32-wide unrolled max reduction, vectorized exp/sum accumulation calling `exp256_ps_v3`, early-return on zero sum, and vectorized normalization with 32/8-stride and scalar tail loops.
Benchmark and correctness test `ml_kernels/src/kernel_bench.cpp`, `ml_kernels/src/test_naive_ops.cpp`	Adds `SoftmaxV6Benchmark` derived from `SoftmaxBenchmark` registered via `REGISTER_BENCHMARK`; adds `test_softmax_v6()` asserting per-element agreement with `softmax_naive` within `1e-4f` and verifying probability sum; `main()` updated to invoke the new test.
Dev journal entry `.jules/thunderbolt.md`	Records the FMA constant consolidation optimization with estimated instruction savings and the shift-invariance precondition for applying the technique.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Directly parallels this PR's structure — adds exp256_ps_v2 and softmax_v5 with the same benchmark/test wiring pattern that softmax_v6 now extends.
bugparty/cpu_math_kernels_pri#7: Adds unit tests for softmax_naive, which is the reference implementation test_softmax_v6() compares against.

Poem

🐰 A rabbit hops through SIMD lanes so wide,
One fnmadd where two constants once vied,
exp256_ps_v3 blooms with Horner's grace,
softmax_v6 sprints at a vectorized pace —
The journal logs what every kernel knows:
fewer instructions, the faster math flows! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately captures the main optimization: a new softmax variant (v6) using an AVX2 single-FMA exp256 approach, which aligns with the core technical change and the journal entry added.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt-softmax-fma-exp-18069717889667174757

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

ml_kernels/src/test_naive_ops.cpp (1)

187-219: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a non-multiple-of-8 length case to cover scalar tail paths in softmax_v6.

This input has 72 elements, so only 32-wide and 8-wide loops run. The scalar tail in softmax_v6 remains untested.

✅ Minimal test tweak

-        -3.14f, -2.71f, -1.41f, -1.73f
+        -3.14f, -2.71f, -1.41f, -1.73f,
+        0.123f

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 187 - 219, The current test
case uses an input vector with 72 elements, which is a multiple of 8, so the
scalar tail path in the softmax_v6 function is never executed during testing.
Modify the input vector in this test to have a length that is not a multiple of
8 (for example, add or remove a few elements to make it 73, 75, or another
non-multiple-of-8 value) so that the scalar tail handling code in softmax_v6
gets properly tested alongside the vectorized paths.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 505-506: Move the opening brace from the same line as the function
signature to its own separate line for the functions exp256_ps_v3 and any other
functions in this header file that have the same pattern (around line 544-545).
In each case, place the opening brace { on a new line immediately following the
function signature to comply with the header file coding guidelines that require
function body braces to be on their own lines.
- Around line 509-511: The `_mm256_cvtps_epi32` conversion in the `exp256_ps_v3`
function follows the current MXCSR rounding mode rather than forcing
round-to-nearest as the comment suggests, which can cause softmax accuracy drift
if callers modify MXCSR rounding control bits. Force deterministic rounding by
first applying `_mm256_round_ps(x_log2e, _MM_FROUND_TO_NEAREST_INT)` to
explicitly round to nearest before converting the result to int32 with
`_mm256_cvtepi32_ps`, ensuring range reduction behavior is stable regardless of
MXCSR settings.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-343: Reformat the opening braces for the name() and run()
methods to comply with the coding guidelines. Move the opening brace for the
name() const override method from the same line as the function signature to its
own line before the return statement. Similarly, move the opening brace for the
run() override method to its own line before the method body containing the
ml_kernels::softmax_v6 call and current_idx_ update.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 185-186: Reformat the opening braces for the test_softmax_v6()
function and the main() function to follow the coding guidelines. Currently,
both functions have their opening brace placed on the same line as the function
signature (e.g., void test_softmax_v6() {). Move each opening brace to its own
line so that the function body starts on the line following the signature,
ensuring consistency with the required code style for C/C++ files.

---

Nitpick comments:
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 187-219: The current test case uses an input vector with 72
elements, which is a multiple of 8, so the scalar tail path in the softmax_v6
function is never executed during testing. Modify the input vector in this test
to have a length that is not a multiple of 8 (for example, add or remove a few
elements to make it 73, 75, or another non-multiple-of-8 value) so that the
scalar tail handling code in softmax_v6 gets properly tested alongside the
vectorized paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8da8d9a-2e5c-4833-96e4-27c06c38486b

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 94b2504.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai · 2026-06-21T20:22:11Z

+inline __m256 exp256_ps_v3(__m256 x) {
+    x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f));


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Place function opening braces on their own lines in this header.

Line 505 and Line 544 currently place { on the same line as the function signature.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: “Keep braces on their own lines for function bodies”.

Also applies to: 544-545

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` around lines 505 - 506, Move the opening brace from the same line as the function signature to its own separate line for the functions exp256_ps_v3 and any other functions in this header file that have the same pattern (around line 544-545). In each case, place the opening brace { on a new line immediately following the function signature to comply with the header file coding guidelines that require function body braces to be on their own lines.

Source: Coding guidelines

coderabbitai · 2026-06-21T20:22:11Z

+    // cvtps_epi32 defaults to round-to-nearest in AVX2, avoiding round_ps
+    __m256i n_int = _mm256_cvtps_epi32(x_log2e);
+    __m256 n = _mm256_cvtepi32_ps(n_int);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For AVX2 intrinsics, does _mm256_cvtps_epi32 use MXCSR rounding mode, or is it always round-to-nearest?

💡 Result:

The _mm256_cvtps_epi32 intrinsic uses the current rounding mode defined in the MXCSR register [1][2][3]. It does not default to round-to-nearest in a hardcoded manner, but rather follows the rounding control bits specified in the MXCSR register (which default to round-to-nearest-even on most systems) [4][5][6]. When this intrinsic performs a conversion and the result is inexact, it rounds the value according to the current setting of the rounding control (RC) bits in the MXCSR register [2][3][7]. If you require a different behavior, such as truncation (round toward zero), you should use the _mm256_cvttps_epi32 intrinsic, which performs truncation regardless of the MXCSR rounding mode [5][6].

Citations:

1: https://intel-intrinsics.dpldocs.info/inteli.avxintrin._mm256_cvtps_epi32.html

2: https://www.felixcloutier.com/x86/cvtps2dq.html

3: https://hjlebbink.github.io/x86doc/html/CVTPS2DQ.html

4: https://stackoverflow.com/questions/63680437/avx2-equivalent-of-lrintf

5: https://stackoverflow.com/questions/77821832/what-conversion-is-the-simd-packed-fp32-to-packed-int32-doing

6: https://laplace.physics.ubc.ca/Doc/intel50/c_ug/linux62a.htm

7: https://www.laruence.com/x86/CVTPS2DQ.html

🏁 Script executed:

# Check if the file exists and read the relevant lines if [ -f "ml_kernels/include/ml_kernels/softmax.h" ]; then echo "=== File found ===" wc -l ml_kernels/include/ml_kernels/softmax.h echo "=== Lines 505-515 ===" sed -n '505,515p' ml_kernels/include/ml_kernels/softmax.h else echo "File not found at ml_kernels/include/ml_kernels/softmax.h" # Try to find the file find . -name "softmax.h" -type f 2>/dev/null fi

Repository: bugparty/cpu_math_kernels_pri

Length of output: 693

🏁 Script executed:

# Also search for the _mm256_cvtps_epi32 usage pattern to understand context rg "_mm256_cvtps_epi32" -B 3 -A 3

Repository: bugparty/cpu_math_kernels_pri

Length of output: 2414

Force deterministic rounding in exp256_ps_v3 range reduction.

Line 510 uses _mm256_cvtps_epi32(x_log2e), which follows the current MXCSR rounding mode rather than being hardcoded to round-to-nearest as the code comment suggests. If any caller changes the MXCSR rounding control bits, range reduction behavior changes and softmax accuracy can drift.

Proposed fix

- // cvtps_epi32 defaults to round-to-nearest in AVX2, avoiding round_ps - __m256i n_int = _mm256_cvtps_epi32(x_log2e); - __m256 n = _mm256_cvtepi32_ps(n_int); + // Force deterministic round-to-nearest for range reduction. + __m256 x_rounded = _mm256_round_ps( + x_log2e, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC); + __m256i n_int = _mm256_cvtps_epi32(x_rounded); + __m256 n = _mm256_cvtepi32_ps(n_int);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` around lines 509 - 511, The `_mm256_cvtps_epi32` conversion in the `exp256_ps_v3` function follows the current MXCSR rounding mode rather than forcing round-to-nearest as the comment suggests, which can cause softmax accuracy drift if callers modify MXCSR rounding control bits. Force deterministic rounding by first applying `_mm256_round_ps(x_log2e, _MM_FROUND_TO_NEAREST_INT)` to explicitly round to nearest before converting the result to int32 with `_mm256_cvtepi32_ps`, ensuring range reduction behavior is stable regardless of MXCSR settings.

coderabbitai · 2026-06-21T20:22:11Z

+    const char *name() const override { return "softmax_v6"; }
+
+    void run() override {
+        ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
+        current_idx_ = (current_idx_ + 1) % pool_size_;
+    }
+};


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use own-line opening braces for function bodies in the new benchmark methods.

name() and run() in this added block currently use same-line {.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: “Keep braces on their own lines for function bodies”.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 343, Reformat the opening braces for the name() and run() methods to comply with the coding guidelines. Move the opening brace for the name() const override method from the same line as the function signature to its own line before the return statement. Similarly, move the opening brace for the run() override method to its own line before the method body containing the ml_kernels::softmax_v6 call and current_idx_ update.

Source: Coding guidelines

coderabbitai · 2026-06-21T20:22:12Z

+void test_softmax_v6() {
+    std::cout << "Running test_softmax_v6..." << std::endl;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep function opening braces on their own lines in the new test additions.

test_softmax_v6() and main() currently place { on the same line as the signature.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: “Keep braces on their own lines for function bodies”.

Also applies to: 224-224

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 185 - 186, Reformat the opening braces for the test_softmax_v6() function and the main() function to follow the coding guidelines. Currently, both functions have their opening brace placed on the same line as the function signature (e.g., void test_softmax_v6() {). Move each opening brace to its own line so that the function body starts on the line following the signature, ensuring consistency with the required code style for C/C++ files.

Source: Coding guidelines

coderabbitai Bot reviewed Jun 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: softmax_v6 — AVX2 single-FMA exp256 optimization#59

⚡ Thunderbolt: softmax_v6 — AVX2 single-FMA exp256 optimization#59
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-fma-exp-18069717889667174757

bugparty commented Jun 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented Jun 21, 2026

Uh oh!

coderabbitai Bot commented Jun 21, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 21, 2026

Uh oh!

coderabbitai Bot Jun 21, 2026

Uh oh!

coderabbitai Bot Jun 21, 2026

Uh oh!

coderabbitai Bot Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		inline __m256 exp256_ps_v3(__m256 x) {
		x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f));

		void test_softmax_v6() {
		std::cout << "Running test_softmax_v6..." << std::endl;

Conversation

bugparty commented Jun 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

google-labs-jules Bot commented Jun 21, 2026

Uh oh!

coderabbitai Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented Jun 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 21, 2026 •

edited

Loading