⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction optimization#60
⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction optimization#60bugparty wants to merge 1 commit into
Conversation
💡 What: Implemented `softmax_v6` with a new `exp256_ps_v3` AVX2 approximation that uses a single FMA for range reduction instead of splitting `ln(2)`. 🎯 Why: In transcendental approximations, splitting `ln(2)` for exact precision adds instructions to the critical path. ML workloads like Softmax are shift-invariant and can tolerate slightly reduced precision. 🏗️ How: Replaced `_mm256_fnmadd_ps(n, c1, x)` and `_mm256_fnmadd_ps(n, c2, r)` with a single `_mm256_fnmadd_ps(n, ln2_constant, x)`. Unrolled the inner loop 4x to hide latency as before. 📊 Impact: ~10% throughput improvement. `softmax_v6` achieves 5.30 GFLOP/s compared to `softmax_v5` at 4.81 GFLOP/s (N=65536, Fixed Memory mode). Passes all validation tests within 1e-4 tolerance. 🖥️ Tested on: Haswell+ AVX2 CPU 🔬 How to reproduce: `DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax_v[56]' --sizes 65536` Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughAdds softmax_v6 AVX2 Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
ml_kernels/include/ml_kernels/softmax.h (1)
505-505: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winMove function opening braces to their own lines for the new softmax/exp helpers.
exp256_ps_v3andsoftmax_v6currently place{on the signature line; this conflicts with the repo’s C/C++ function-body brace rule.As per coding guidelines,
**/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.Also applies to: 539-539
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` at line 505, The functions `exp256_ps_v3` and `softmax_v6` have their opening braces `{` on the same line as the function signature, which violates the repository's C/C++ coding style guidelines that require function body braces to be on their own lines. Move the opening brace for both `exp256_ps_v3` (at line 505) and `softmax_v6` (at line 539) to a new line immediately following their respective function signatures.Source: Coding guidelines
ml_kernels/src/test_naive_ops.cpp (2)
185-185: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winMove function opening braces to their own lines in the newly added test entry points.
test_softmax_v6andmainuse same-line opening braces, which conflicts with the repo C/C++ function-body brace rule.As per coding guidelines,
**/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.Also applies to: 220-220
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` at line 185, The function declarations for test_softmax_v6 and main have their opening braces on the same line as the function signature, which violates the repo's C/C++ coding guidelines requiring function body braces to be on separate lines. Move the opening brace to its own line for both functions: refactor test_softmax_v6 to place the opening brace on a new line after the function signature, and apply the same change to the main function around line 220. This ensures consistency with the repository's brace placement rule for function bodies.Source: Coding guidelines
187-215: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
test_softmax_v6misses scalar-tail coverage.Current input size is 72, so
softmax_v6only exercises the 32-wide and 8-wide loops; the scalari < ntails are never tested. Add one extra element (e.g., size 73) to cover the scalar fallback path.Suggested test tweak
- 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f + 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, + 0.123f // force scalar tail path (n % 8 != 0) };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 187 - 215, The test_softmax_v6 function currently uses an input vector with 72 elements, which allows softmax_v6 to process all data through the 32-wide and 8-wide vector loops without triggering the scalar tail fallback path. To ensure the scalar remainder handling is tested, add one additional element to the input vector initialization (making it 73 elements total) so that the scalar loop handling for the remaining elements is exercised during the function calls to softmax_naive and softmax_v6.ml_kernels/src/kernel_bench.cpp (1)
337-343: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winKeep benchmark method braces on their own lines to match repo C/C++ style.
name()andrun()use same-line opening braces in this changed block, which violates the project rule for function bodies.As per coding guidelines,
**/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 343, The opening braces for the `name()` and `run()` methods are placed on the same line as the method declarations, violating the project's C++ style guidelines that require braces on their own lines for function bodies. Move the opening brace of both the `name()` method (currently containing the return statement on the same line) and the `run()` method to separate lines below their respective method signatures to match the coding standard.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 335-344: The SoftmaxV6Benchmark class does not override the
verify() method and inherits the stricter tolerance of 1e-5f from
SoftmaxBenchmark, while the test_softmax_v6 test uses a tolerance of 1e-4f. This
mismatch causes false verification failures for acceptable v6 accuracy. Add an
override of the verify() method in SoftmaxV6Benchmark that uses the same 1e-4f
tolerance as test_softmax_v6 to ensure consistency between the benchmark and
test acceptance criteria.
---
Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 505: The functions `exp256_ps_v3` and `softmax_v6` have their opening
braces `{` on the same line as the function signature, which violates the
repository's C/C++ coding style guidelines that require function body braces to
be on their own lines. Move the opening brace for both `exp256_ps_v3` (at line
505) and `softmax_v6` (at line 539) to a new line immediately following their
respective function signatures.
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-343: The opening braces for the `name()` and `run()` methods
are placed on the same line as the method declarations, violating the project's
C++ style guidelines that require braces on their own lines for function bodies.
Move the opening brace of both the `name()` method (currently containing the
return statement on the same line) and the `run()` method to separate lines
below their respective method signatures to match the coding standard.
In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 185: The function declarations for test_softmax_v6 and main have their
opening braces on the same line as the function signature, which violates the
repo's C/C++ coding guidelines requiring function body braces to be on separate
lines. Move the opening brace to its own line for both functions: refactor
test_softmax_v6 to place the opening brace on a new line after the function
signature, and apply the same change to the main function around line 220. This
ensures consistency with the repository's brace placement rule for function
bodies.
- Around line 187-215: The test_softmax_v6 function currently uses an input
vector with 72 elements, which allows softmax_v6 to process all data through the
32-wide and 8-wide vector loops without triggering the scalar tail fallback
path. To ensure the scalar remainder handling is tested, add one additional
element to the input vector initialization (making it 73 elements total) so that
the scalar loop handling for the remaining elements is exercised during the
function calls to softmax_naive and softmax_v6.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 27f309aa-d0bf-4386-aa65-300d24342507
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
| class SoftmaxV6Benchmark : public SoftmaxBenchmark { | ||
| public: | ||
| const char *name() const override { return "softmax_v6"; } | ||
|
|
||
| void run() override { | ||
| ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size()); | ||
| current_idx_ = (current_idx_ + 1) % pool_size_; | ||
| } | ||
| }; | ||
| REGISTER_BENCHMARK(SoftmaxV6Benchmark); |
There was a problem hiding this comment.
SoftmaxV6Benchmark inherits a stricter verify tolerance than the new v6 acceptance target.
Line 335 adds SoftmaxV6Benchmark without overriding verify(), so it inherits SoftmaxBenchmark::verify() (1e-5f) while test_softmax_v6 validates v6 with 1e-4f. This mismatch can produce false benchmark verification failures for otherwise acceptable v6 accuracy.
Suggested fix
class SoftmaxV6Benchmark : public SoftmaxBenchmark {
public:
const char *name() const override { return "softmax_v6"; }
void run() override {
ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
current_idx_ = (current_idx_ + 1) % pool_size_;
}
+
+ bool verify() override {
+ current_idx_ = 0;
+ run();
+ constexpr float tol = 1e-4f;
+ for (std::size_t i = 0; i < outputs_[0].size(); ++i) {
+ if (std::fabs(outputs_[0][i] - output_ref_[i]) > tol) {
+ return false;
+ }
+ }
+ return true;
+ }
};🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ml_kernels/src/kernel_bench.cpp` around lines 335 - 344, The
SoftmaxV6Benchmark class does not override the verify() method and inherits the
stricter tolerance of 1e-5f from SoftmaxBenchmark, while the test_softmax_v6
test uses a tolerance of 1e-4f. This mismatch causes false verification failures
for acceptable v6 accuracy. Add an override of the verify() method in
SoftmaxV6Benchmark that uses the same 1e-4f tolerance as test_softmax_v6 to
ensure consistency between the benchmark and test acceptance criteria.
💡 What: Implemented
softmax_v6with a newexp256_ps_v3AVX2 approximation that uses a single FMA for range reduction instead of splittingln(2).🎯 Why: In transcendental approximations, splitting
ln(2)for exact precision adds instructions to the critical path. ML workloads like Softmax are shift-invariant and can tolerate slightly reduced precision.🏗️ How: Replaced the split range reduction step with a single
_mm256_fnmadd_ps(n, ln2_constant, x). Unrolled the inner loop 4x to hide latency.📊 Impact: ~10% throughput improvement.
softmax_v6achieves 5.30 GFLOP/s compared tosoftmax_v5at 4.81 GFLOP/s (N=65536, Fixed Memory mode). Passes all validation tests within 1e-4 tolerance.🖥️ Tested on: Haswell+ AVX2 CPU (Linux build container)
🔬 How to reproduce: Compile and run
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax_v[56]' --sizes 65536PR created automatically by Jules for task 8697455812306532988 started by @bugparty
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests