Replace platform-specific SIMD with portable wide crate #127

Closed

lilith wants to merge 3 commits into ImageOptim:main from lilith:pr-safe-simd

Conversation


lilith commented Jan 18, 2026

Summary

  • Replace platform-specific SSE2/NEON intrinsics in f_pixel::diff() with portable wide::f32x4
  • Remove HistSortTmp union in favor of plain u32 (union was unnecessary since PalIndex fits in u32)
  • Remove unnecessary get_unchecked in qsort_pivot (three bounds checks are negligible overhead)

Details

The diff() function had three separate implementations using unsafe intrinsics:

  • #[cfg(target_arch = "x86_64")] with SSE2
  • #[cfg(target_arch = "aarch64")] with NEON
  • Scalar fallback

Now there's a single portable implementation using wide::f32x4 that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under #[cfg(test)] with a brute-force comparison test that verifies both implementations match across edge cases.
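
For illustration, a minimal sketch of the shape of that function, assuming plain [f32; 4] pixels and a simple squared distance rather than the actual alpha-weighted formula in f_pixel::diff():

use wide::f32x4;

// Sketch only: plain squared distance between two RGBA pixels; the real
// diff() applies alpha weighting on top of this shape.
fn diff_portable(a: [f32; 4], b: [f32; 4]) -> f32 {
    let d = f32x4::from(a) - f32x4::from(b);
    let sq = d * d;
    // Horizontal sum; the same source compiles to SSE2, NEON, or WASM SIMD.
    sq.to_array().iter().sum()
}

// Scalar reference kept under #[cfg(test)] for the brute-force comparison test.
#[cfg(test)]
fn diff_scalar(a: [f32; 4], b: [f32; 4]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}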

Test plan

  • All existing tests pass
  • New diff_simd_matches_scalar test verifies SIMD matches scalar reference
  • cargo clippy passes

Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in
f_pixel::diff() with safe, portable SIMD using the wide crate's f32x4.

- Direct bytemuck cast from ARGBF to f32x4 (both are Pod)
- Single implementation works on all platforms (x86, ARM, WASM, etc.)
- Includes scalar reference and brute-force comparison test
- ~70 lines removed, eliminates all unsafe in this function

The union was used to reinterpret the same memory as either mc_sort_value
(u32) during median cut sorting, or likely_palette_index (PalIndex) after.

Since PalIndex fits in u32, we can simply store u32 and cast on read.
This eliminates unsafe union field access with no runtime cost.

This sorts 3 pivot indices and accesses them 3 times total.
The bounds check overhead is negligible compared to the partitioning
work that follows. Safe indexing is simpler and equally fast.
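
For illustration, roughly what the union removal described above amounts to; the struct name and PalIndex definition here are guesses, only mc_sort_value and likely_palette_index come from the change itself:

type PalIndex = u8; // assumption: some integer type that fits in u32

struct HistItemTmp {
    // Was a HistSortTmp union over mc_sort_value (u32) / likely_palette_index (PalIndex).
    tmp: u32,
}

impl HistItemTmp {
    fn set_mc_sort_value(&mut self, v: u32) {
        self.tmp = v;
    }
    fn likely_palette_index(&self) -> PalIndex {
        // PalIndex fits in u32, so a plain narrowing cast replaces the union read.
        self.tmp as PalIndex
    }
}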

lilith commented Jan 18, 2026

Follow-up: MaybeUninit removal feasibility

I've prototyped making MaybeUninit opt-in via an unsafe-uninit feature flag. The default path uses zero-initialized buffers instead, which eliminates most unsafe code for pure Rust users.

Benchmark results (512×512 image, 3 runs each)

Mode               quantize                 remap/dither             histogram
Safe (zero-init)   35.9 / 35.7 / 36.9 ms    11.6 / 11.9 / 11.5 ms    5.5 / 6.4 / 5.3 ms
MaybeUninit        35.5 / 35.2 / 35.9 ms    11.9 / 11.2 / 11.5 ms    5.7 / 5.9 / 5.9 ms

The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work.

Changes required

  • Slot<T> type alias: MaybeUninit<T> or T based on the feature (see the sketch after this list)
  • Helper functions that work in both modes
  • Image::new_fn gated behind unsafe-uninit (callback API inherently requires unsafe)
  • _internal_c_ffi implies unsafe-uninit (C users keep current behavior)
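
A rough sketch of what the Slot<T> alias could look like (the helper and its signature are hypothetical, not the actual prototype):

use std::mem::MaybeUninit;

// With the feature: buffers skip initialization, as today.
#[cfg(feature = "unsafe-uninit")]
pub(crate) type Slot<T> = MaybeUninit<T>;

// Without it: buffers hold plain T and pay for default (zero) initialization.
#[cfg(not(feature = "unsafe-uninit"))]
pub(crate) type Slot<T> = T;

// Hypothetical helper usable from both modes.
#[cfg(feature = "unsafe-uninit")]
pub(crate) fn alloc_slots<T>(len: usize) -> Vec<Slot<T>> {
    let mut v = Vec::with_capacity(len);
    // SAFETY: MaybeUninit<T> is allowed to be left uninitialized.
    unsafe { v.set_len(len) };
    v
}

#[cfg(not(feature = "unsafe-uninit"))]
pub(crate) fn alloc_slots<T: Default + Clone>(len: usize) -> Vec<Slot<T>> {
    vec![T::default(); len]
}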

Would you be interested in this approach for the main repo? It would allow #![deny(unsafe_code)] for pure Rust builds while maintaining backward compatibility for C FFI users.

@lilith lilith closed this Feb 8, 2026
@kornelski kornelski reopened this Feb 10, 2026
@kornelski (Member)

Thanks for the PR. Sorry, I'm busy and can't review PRs in a timely manner.

It's nice to have portable SIMD, but it's a shame it requires adding another dependency. I'd rather keep the existing implementations for x86 and ARM.

How about making the wide dependency target-specific, used only on non-ARM, non-x86 platforms?
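
For illustration, something along these lines in Cargo.toml (version number illustrative):

# Pull in `wide` only for targets that don't have a hand-written SIMD path.
[target.'cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))'.dependencies]
wide = "0.7"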


kornelski commented Feb 11, 2026

I'd prefer changes in separate PRs, so I can merge them separately.

The removal of the union is fine, but it left the code that sets the field a bit cryptic. I've cherry-picked it and added a setter: 21a6396


kornelski commented Feb 11, 2026

I really wanted to get rid of panics in quicksort. It used to have only that one bad case, but I checked it now, and there were half a dozen other missed cases ;( LLVM's optimizations there have regressed a lot. I suspect it's all due to if/else making LLVM forget which variables were in bounds.

I fudged it to be panic-free. Hopefully we'll get all the perf back when LLVM fixes the regressions.
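
For illustration, one common panic-free shape for this kind of pivot selection (not taken from the actual commit):

// Sketch only: median-of-three pivot with no panicking index.
fn median_of_three(items: &[f32]) -> Option<f32> {
    // One fixed-size view up front; the reads below cannot go out of bounds.
    let [a, b, c] = *items.first_chunk::<3>()?;
    let lo = a.min(b);
    let hi = a.max(b);
    // Clamping c into [lo, hi] yields the median of the three values.
    Some(c.clamp(lo, hi))
}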


lilith commented Feb 11, 2026

Sorry for the lousy PR; I actually abandoned this approach and should have commented when I closed it. Specialized intrinsics might be best for this crate.

wide is great for WASM SIMD128 and compile-time optimization - really a beautiful approach that gets 80% of the perf gains. Unfortunately, it doesn't do a runtime 'upgrade' based on CPU features, so you have to combine it with multiversion-style macros to generate specialized, optimized versions with runtime dispatch. And since wide is compile-time, it produces substandard intrinsics, and we have to rely on LLVM to hopefully upgrade them, which it does absurdly well about 60% of the time.

But wide probably isn't for libimagequant. Nor is archmage, but I wanted to mention it since I've been using it in this type of library a lot recently.

I've moved about seven projects over to my archmage crate, which allows #![forbid(unsafe_code)] + runtime dispatch + safe core::arch + safe_unaligned_simd for SIMD load/store/memory ops. Normally, intrinsics are a pain, but when they can be compile-time verified as safe, I don't mind having Claude generate and brute-force-test them against the scalar implementations. I made magetypes as a token-proof runtime version of wide, but I'm getting less and less invested in portable-SIMD-style solutions as testing and compile-time safety become enough.

// Your code:
#[arcane]
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Generated:
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b)  // Safe inside #[target_feature]!
    }
    // SAFETY: Token proves CPU support was verified
    unsafe { __inner(token, a, b) }
}

use archmage::{Desktop64, Arm64, Simd128Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // Try x86 AVX2
    if let Some(token) = Desktop64::summon() {
        return process_x86(token, data);
    }

    // Try ARM NEON
    if let Some(token) = Arm64::summon() {
        return process_arm(token, data);
    }

    // Try WASM SIMD
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Scalar fallback
    process_scalar(data);
}

#[arcane]
fn process_x86(token: Desktop64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_arm(token: Arm64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_wasm(token: Simd128Token, data: &mut [f32]) { /* ... */ }

fn process_scalar(data: &mut [f32]) { /* ... */ }

@kornelski kornelski closed this Feb 11, 2026
