tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path #13167

Draft
phongn wants to merge 1 commit into apache:master from phongn:simd-bulk-tolower

Conversation


phongn (Collaborator) commented May 14, 2026

Summary

Add a SIMD-accelerated bulk ASCII tolower helper ts::memcpy_tolower in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.
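For orientation, the gating rides on the compilers' standard predefined macros. Roughly (a sketch; the real header's layout may differ):

```cpp
// Sketch of the compile-time ISA cascade; the actual
// include/tscore/ink_memcpy_tolower.h may be organized differently.
#if defined(__SSE2__)
#include <immintrin.h> // only pulled in on x86 builds
#endif
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#if defined(__AVX512BW__) // -mavx512bw / -march=x86-64-v4
// 64-byte bodies; n < 64 drains into the AVX2 path below
#endif
#if defined(__AVX2__)     // -mavx2 / -march=x86-64-v3
// 32-byte bodies; drain into SSE2
#endif
#if defined(__SSE2__)     // baseline in the x86_64 ABI
// 16-byte bodies
#endif
#if defined(__ARM_NEON)   // baseline in the ARMv8 ABI
// 16-byte NEON bodies
#endif
// a scalar tail handles the final < 16 bytes on every build
```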

Behavior matches ParseRules::ink_tolower exactly: bytes in A..Z map to a..z, all others (including 0x80..0xFF) pass through unchanged.
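A minimal scalar statement of that contract (the PR's actual scalar tail goes through the ParseRules table, but is equivalent):

```cpp
#include <cstddef>

// Reference semantics: only 'A'..'Z' move; every other byte value,
// including 0x80..0xFF, is copied through untouched.
inline void
memcpy_tolower_ref(char *dst, const char *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    dst[i] = (c >= 'A' && c <= 'Z') ? static_cast<char>(c + 0x20) : src[i];
  }
}
```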

Implementation notes

  • Cascade: wider bodies drain into narrower ones, so the worst-case scalar tail is always <16 bytes regardless of build flags.
  • AVX-512BW kernel uses _mm512_mask_add_epi8 to fuse the conditional +0x20 into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass (sketched after this list). Inspired by Tony Finch's copytolower64.c.
  • The whole AVX-512BW block is gated at n ≥ 64, because the masked load/store carries ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes the AVX-512BW build falls through to its AVX2 + SSE2 cascade.
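For reviewers who want the shape of the AVX-512BW body without opening the header, a sketch of the mask_add + masked-tail idea (illustrative names and structure, not the PR's actual code):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Bytes in 'A'..'Z' land in 0..25 after the subtract, so one unsigned
// compare yields the "is upper" mask, and _mm512_mask_add_epi8 applies
// the +0x20 only under that mask.
static void
tolower64_sketch(char *dst, const char *src, std::size_t n)
{
  const __m512i bias   = _mm512_set1_epi8('A');
  const __m512i limit  = _mm512_set1_epi8(26);
  const __m512i offset = _mm512_set1_epi8(0x20);

  std::size_t i = 0;
  for (; i + 64 <= n; i += 64) {
    __m512i   v     = _mm512_loadu_si512(src + i);
    __mmask64 upper = _mm512_cmplt_epu8_mask(_mm512_sub_epi8(v, bias), limit);
    _mm512_storeu_si512(dst + i, _mm512_mask_add_epi8(v, upper, v, offset));
  }
  if (i < n) {
    // Masked load/store covers the 1..63-byte tail in one SIMD pass.
    __mmask64 tail  = ~UINT64_C(0) >> (64 - (n - i));
    __m512i   v     = _mm512_maskz_loadu_epi8(tail, src + i);
    __mmask64 upper = _mm512_cmplt_epu8_mask(_mm512_sub_epi8(v, bias), limit);
    _mm512_mask_storeu_epi8(dst + i, tail, _mm512_mask_add_epi8(v, upper, v, offset));
  }
}
```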

src/proxy/hdrs/URL.cc drops its static-inline memcpy_tolower and calls ts::memcpy_tolower instead.
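Illustratively, the call-site change looks like this (the local helper is paraphrased from the common pattern, not quoted from the diff):

```cpp
// Before: a local byte-at-a-time helper driving the ParseRules table.
static inline void
memcpy_tolower(char *d, const char *s, int n)
{
  while (n--) {
    *d++ = ParseRules::ink_tolower(*s++);
  }
}

// After: drop the local helper and call the shared SIMD one.
#include "tscore/ink_memcpy_tolower.h"
// ...
ts::memcpy_tolower(d, s, n);
```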

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Mean ns per call from tools/benchmark/benchmark_memcpy_tolower:

  Size     scalar   SSE2    AVX2 (-mavx2)   AVX-512BW (-mavx512bw)
  ------   ------   -----   -------------   ----------------------
  16 B     10.4     2.15    1.75            1.98
  32 B     15.4     2.90    2.24            2.31
  64 B     28.0     4.43    2.85            2.61
  256 B    113      13.87   7.57            6.20
  1024 B   425      50.47   24.23           17.49

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.

URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail, unchanged. Typical host names (16+ bytes) see a 4–14× speedup depending on build flags.

Test plan

  • Microbench tools/benchmark/benchmark_memcpy_tolower runs 269 correctness assertions covering:
    • Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing each SIMD body) against the scalar reference.
    • An exhaustive sweep of all 256 byte values verifying that only A..Z are remapped, guarding against any future widening of the case-fold range (see the sketch after this list).
  • All paths run correctness clean on:
    • Broadwell (AVX2-capable) with -mavx2
    • Ice Lake (AVX-512BW) with -mavx2, -mavx512bw, and the default
  • cmake --build build -t format runs clean.
  • src/proxy/hdrs/libhdrs.a builds clean with the updated URL.cc.
  • Jenkins CI green.
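The exhaustive sweep boils down to something like this (a sketch assuming the (dst, src, n) signature used at the URL.cc call site; the benchmark's real assertions are organized differently and also cover multi-byte buffers):

```cpp
#include <cassert>
#include "tscore/ink_memcpy_tolower.h" // header added by this PR

void
sweep_all_byte_values()
{
  for (int b = 0; b < 256; ++b) {
    char in     = static_cast<char>(b);
    char out    = 0;
    ts::memcpy_tolower(&out, &in, 1);
    char expect = (b >= 'A' && b <= 'Z') ? static_cast<char>(b + 0x20) : in;
    assert(out == expect); // only A..Z may be remapped
  }
}
```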

Notes for reviewers

  • No new compile flags or dependencies. Just baseline SSE2 (x86_64 ABI) and baseline NEON (ARMv8 ABI); wider paths kick in automatically with -march=x86-64-v3 or -march=x86-64-v4.
  • The header includes <immintrin.h> / <arm_neon.h> only inside the #if that needs them, so other architectures don't pull them in.
  • AVX-512BW kernel design (mask_add + masked tail) was adapted from Tony Finch's vectolower, license-compatible (0BSD / MIT-0).
  • Other call sites with the same byte-at-a-time tolower pattern (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code

phongn force-pushed the simd-bulk-tolower branch from ed9a596 to d975fb2 on May 14, 2026 at 20:30

The bulk ASCII tolower loop used to canonicalize the scheme and host
portions of a URL before hashing into the cache key runs at ~1.5 GB/s
scalar (one byte and one ParseRules table lookup per iteration). The
work is trivially data-parallel and there is no per-byte branching, so
a SIMD kernel that lowercases a whole register at once gives a
straightforward speedup once the input is long enough to amortize the
vector setup.

Add a header-only helper ts::memcpy_tolower under
include/tscore/ink_memcpy_tolower.h with a compile-time-selected
cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2
on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to
narrower drain loops, so the worst-case scalar tail is always <16
bytes. Selection is purely compile-time; runtime ifunc dispatch is
left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional
"+0x20 where upper" into a single op, and a masked load/store handles
1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's
copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole
AVX-512BW block is gated at n >= 64 because the masked load/store has
~7 ns of fixed setup that loses to the narrower paths for short
inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly:
bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF)
pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with
this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds
that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW)
get the wider bodies automatically. Sub-16-byte inputs (e.g. short
HTTP schemes like "http") use the scalar tail and see no perf change.

Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

  size   scalar   SSE2     AVX2     AVX-512BW
  ----   ------   ----     ----     ---------
  16 B   10.4     2.15     1.75     1.98
  32 B   15.4     2.90     2.24     2.31
  64 B   28.0     4.43     2.85     2.61
  256 B  113      13.87    7.57     6.20
  1024 B 425      50.47    24.23    17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across
sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-
value sweep that guards against any future widening of the case-fold
range, alongside scalar-vs-SIMD throughput numbers and a config-print
case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
phongn force-pushed the simd-bulk-tolower branch from d975fb2 to 32016aa on May 14, 2026 at 20:56