⚡ Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling#56
⚡ Thunderbolt: softmax_v6 — Heterogeneous AVX2 Loop Unrolling#56bugparty wants to merge 1 commit into
Conversation
Implements `softmax_v6` with 8x-4x-8x heterogeneous loop unrolling. This avoids YMM register spilling in the Horner polynomial phase while saturating the execution ports for the max and norm reductions. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughAdds Changessoftmax_v6 Kernel, Benchmark, Test, and Docs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
ml_kernels/src/test_naive_ops.cpp (1)
184-207: ⚡ Quick winConsider adding a test case for the scalar tail path.
The 72-element input exercises the 64-element and 8-element vector loops, but not the scalar remainder path (requires
n % 8 != 0). Adding a second test with, e.g., 73 or 75 elements would improve coverage for the scalar cleanup loops in all three passes.Also, for consistency with
test_softmax_v3/v4/v5, consider adding the sum-to-one verification:float sum = 0.0f; for (size_t i = 0; i < input.size(); ++i) { // existing comparison... sum += output_v6[i]; } if (std::abs(sum - 1.0f) > 1e-4f) { std::cerr << "Sum mismatch: expected 1.0, got " << sum << std::endl; exit(1); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 207, The test_softmax_v6 function is missing two things: (1) it does not verify that the output values sum to 1.0, which is necessary for consistency with test_softmax_v3, test_softmax_v4, and test_softmax_v5, and (2) the 72-element input does not trigger the scalar remainder path because 72 is divisible by 8, so it does not properly test the scalar cleanup loops. To fix this, add a sum-to-one verification loop after the element-wise comparison that accumulates output_v6 values and checks if the sum is approximately 1.0, and change the input vector size from 72 elements to 73 or 75 elements to ensure the scalar tail path is exercised during all three passes of the softmax computation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 184-207: The test_softmax_v6 function is missing two things: (1)
it does not verify that the output values sum to 1.0, which is necessary for
consistency with test_softmax_v3, test_softmax_v4, and test_softmax_v5, and (2)
the 72-element input does not trigger the scalar remainder path because 72 is
divisible by 8, so it does not properly test the scalar cleanup loops. To fix
this, add a sum-to-one verification loop after the element-wise comparison that
accumulates output_v6 values and checks if the sum is approximately 1.0, and
change the input vector size from 72 elements to 73 or 75 elements to ensure the
scalar tail path is exercised during all three passes of the softmax
computation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3144497e-a9b0-4b9f-932e-8eff88be55c3
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
💡 What: Added
softmax_v6, a heterogeneous unrolled AVX2 implementation (8x max reduction, 4x Horner polynomial exp, 8x normalization).🎯 Why: Uniformly unrolling Pass 2 (the Horner polynomial exp approximation) by 8x maxes out the 16 available YMM registers in AVX2, causing register spilling to the stack. By strategically keeping the dense compute phase at a 4x unroll while pushing the simpler passes to 8x, we minimize loop overhead and saturate the execution ports without incurring spill penalties.
🏗️ How: Separated the unroll factors manually in the 3-pass loop. Pass 1 and Pass 3 are explicitly unrolled 8x (processing 64 floats per iteration). Pass 2 remains at 4x (32 floats).
📊 Impact:
N=1048576 (Pool Mode):
softmax_v6(1.620ms avg, 2.59 GFLOP/s) matched or slightly beatsoftmax_v5(1.681ms avg, 2.49 GFLOP/s).Microbenchmark of loop combinations showed ~10% latency reduction (1040us -> 933us) in L1 cache-resident buffers.
🖥️ Tested on: GCC 13.3.0, AVX2 environment
🔬 How to reproduce:
cd build && make ml_kernel_benchDISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --sizes 1048576 --iters 50 --filter "softmax"PR created automatically by Jules for task 16743516370251725513 started by @bugparty
Summary by CodeRabbit
Release Notes
New Features
Tests
Documentation