Replace platform-specific SIMD with portable wide crate #127
lilith wants to merge 3 commits into ImageOptim:main
Conversation
Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in `f_pixel::diff()` with safe, portable SIMD using the `wide` crate's `f32x4`.

- Direct `bytemuck` cast from `ARGBF` to `f32x4` (both are `Pod`)
- Single implementation works on all platforms (x86, ARM, WASM, etc.)
- Includes a scalar reference and a brute-force comparison test
- ~70 lines removed; eliminates all unsafe in this function
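For reference, here is a minimal sketch of the pattern described above. The `ArgbF` struct and the function name are placeholders, and the math is a plain sum of squared differences rather than the crate's actual `diff()` metric (which also weights alpha); it only shows the `bytemuck` cast into `f32x4` and the portable arithmetic.

```rust
// Assumed dependencies: wide = "0.7", bytemuck = { version = "1", features = ["derive"] }
use bytemuck::{Pod, Zeroable};
use wide::f32x4;

/// Stand-in for the crate's ARGBF pixel: four f32 channels, 16 bytes, Pod.
#[repr(C)]
#[derive(Copy, Clone, Pod, Zeroable)]
struct ArgbF {
    a: f32,
    r: f32,
    g: f32,
    b: f32,
}

/// One portable implementation for every target; the compiler lowers the
/// f32x4 arithmetic to SSE2 on x86_64 and NEON on aarch64.
fn diff_sketch(px: ArgbF, other: ArgbF) -> f32 {
    // Both ArgbF and f32x4 are Pod and 16 bytes wide, so this is a plain reinterpret.
    let a: f32x4 = bytemuck::cast(px);
    let b: f32x4 = bytemuck::cast(other);
    let d = a - b;
    // Horizontal sum of the squared per-channel differences.
    (d * d).to_array().iter().sum()
}
```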
The union was used to reinterpret the same memory as either `mc_sort_value` (`u32`) during median-cut sorting, or `likely_palette_index` (`PalIndex`) afterwards. Since `PalIndex` fits in `u32`, we can simply store a `u32` and cast on read. This eliminates the unsafe union field access with no runtime cost.
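A rough sketch of the idea, with made-up method names and `PalIndex` assumed to be `u8` for illustration (the crate may use a wider type behind a feature flag):

```rust
type PalIndex = u8; // illustrative only

// Before (conceptually): both values shared the same 4 bytes through a union,
// so every read of `likely_palette_index` needed `unsafe`.
//
// After: store a plain u32 and narrow on read. Since PalIndex always fits in
// u32, the cast is lossless and no unsafe is needed.
struct HistItemSortTmp {
    tmp: u32,
}

impl HistItemSortTmp {
    fn set_mc_sort_value(&mut self, v: u32) {
        self.tmp = v;
    }

    fn set_likely_palette_index(&mut self, idx: PalIndex) {
        self.tmp = u32::from(idx);
    }

    fn likely_palette_index(&self) -> PalIndex {
        // Only read after a PalIndex has been stored; the value fits by construction.
        self.tmp as PalIndex
    }
}
```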
This sorts 3 pivot indices and accesses them 3 times total. The bounds check overhead is negligible compared to the partitioning work that follows. Safe indexing is simpler and equally fast.
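For illustration, a median-of-three pivot selection written with ordinary checked indexing; this is a generic sketch, not the crate's actual `qsort_pivot`:

```rust
/// Pick the median of the first, middle, and last element as the pivot index.
/// Three checked slice accesses in total; their cost is negligible next to the
/// partitioning pass that follows, so get_unchecked buys nothing here.
fn qsort_pivot_sketch<T: Ord>(items: &[T]) -> usize {
    debug_assert!(items.len() >= 3);
    let (lo, mid, hi) = (0, items.len() / 2, items.len() - 1);
    let (a, b, c) = (&items[lo], &items[mid], &items[hi]);
    // Return the index of the median of a, b, c.
    if a <= b {
        if b <= c { mid } else if a <= c { hi } else { lo }
    } else {
        if a <= c { lo } else if b <= c { hi } else { mid }
    }
}
```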
Follow-up: MaybeUninit removal feasibility

I've prototyped making this change. Benchmark results (512×512 image, 3 runs each):
The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work.

Changes required
Would you be interested in this approach for the main repo? It would allow removing more of the remaining unsafe code.
Thanks for the PR. Sorry, I'm busy and can't review PRs in a timely manner. It's nice to have portable SIMD, but it's a shame it requires adding another dependency. I'd rather keep the existing implementations for x86 and ARM. How about making the
I'd prefer changes in separate PRs, so I can merge them separately. The removal of
I really wanted to get rid of panics in quicksort. It used to have only that one bad case, but I checked it now, and there were half a dozen other missed cases ;( LLVM's optimizations there have regressed a lot. I suspect it's all due to

I fudged it to be panic-free. Hopefully we'll get all the perf back when LLVM fixes the regressions.
Sorry for the lousy PR, I actually did abandon this approach and should have commented when I closed it. Specialized intrinsics might be best for this crate.
But I've moved about seven projects over to my archmage crate, which allows `#![forbid(unsafe_code)]` + runtime dispatch + safe `core::arch` + `safe_unaligned_simd` for SIMD load/store/mem ops. Normally, intrinsics are a pain, but when they can be compile-time-verified safe, I don't mind having Claude generate and brute-force-test them against the scalar implementations. I made
Summary
- Replace the platform-specific SIMD in `f_pixel::diff()` with portable `wide::f32x4`
- Remove the `HistSortTmp` union in favor of plain `u32` (the union was unnecessary since `PalIndex` fits in `u32`)
- Remove `get_unchecked` in `qsort_pivot` (3 bounds checks are negligible overhead)

Details
The `diff()` function had three separate implementations using unsafe intrinsics:

- `#[cfg(target_arch = "x86_64")]` with SSE2
- `#[cfg(target_arch = "aarch64")]` with NEON
- a scalar fallback for other targets

Now there's a single portable implementation using `wide::f32x4` that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under `#[cfg(test)]` with a brute-force comparison test that verifies both implementations match across edge cases.

Test plan
- `diff_simd_matches_scalar` test verifies the SIMD implementation matches the scalar reference
- `cargo clippy` passes
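A minimal sketch of what such a brute-force comparison could look like; the helper functions here are stand-ins operating on plain `[f32; 4]` arrays rather than the crate's actual pixel type and `diff()` implementation:

```rust
#[cfg(test)]
mod tests {
    use wide::f32x4;

    // Stand-ins for the PR's portable diff() and its scalar reference.
    fn diff_wide(a: [f32; 4], b: [f32; 4]) -> f32 {
        let d = f32x4::from(a) - f32x4::from(b);
        (d * d).to_array().iter().sum()
    }

    fn diff_scalar(a: [f32; 4], b: [f32; 4]) -> f32 {
        a.iter().zip(&b).map(|(x, y)| (x - y) * (x - y)).sum()
    }

    #[test]
    fn diff_simd_matches_scalar() {
        // Brute-force sweep over a coarse channel grid, including the 0.0 / 1.0 edges.
        let steps = [0.0f32, 0.25, 0.5, 0.75, 1.0];
        for &a in &steps {
            for &r in &steps {
                for &g in &steps {
                    for &b in &steps {
                        let p = [a, r, g, b];
                        let q = [1.0 - a, 1.0 - r, 1.0 - g, 1.0 - b];
                        let (s, w) = (diff_scalar(p, q), diff_wide(p, q));
                        assert!((s - w).abs() <= 1e-6, "scalar {s} != simd {w}");
                    }
                }
            }
        }
    }
}
```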