tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path #13167

Draft
phongn wants to merge 1 commit into apache:master from phongn:simd-bulk-tolower

Conversation


phongn (Collaborator) commented May 14, 2026

Summary

Add a SIMD-accelerated bulk ASCII tolower helper ts::memcpy_tolower in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.
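For orientation, the gating rides on the compilers' standard predefined macros. Roughly (a sketch; the real header's layout may differ):

```cpp
// Sketch of the compile-time ISA cascade; the actual
// include/tscore/ink_memcpy_tolower.h may be organized differently.
#if defined(__SSE2__)
#include <immintrin.h> // only pulled in on x86 builds
#endif
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#if defined(__AVX512BW__) // -mavx512bw / -march=x86-64-v4
// 64-byte bodies; n < 64 drains into the AVX2 path below
#endif
#if defined(__AVX2__)     // -mavx2 / -march=x86-64-v3
// 32-byte bodies; drain into SSE2
#endif
#if defined(__SSE2__)     // baseline in the x86_64 ABI
// 16-byte bodies
#endif
#if defined(__ARM_NEON)   // baseline in the ARMv8 ABI
// 16-byte NEON bodies
#endif
// a scalar tail handles the final < 16 bytes on every build
```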

Behavior matches ParseRules::ink_tolower exactly: bytes in A..Z map to a..z, all others (including 0x80..0xFF) pass through unchanged.
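A minimal scalar statement of that contract (the PR's actual scalar tail goes through the ParseRules table, but is equivalent):

```cpp
#include <cstddef>

// Reference semantics: only 'A'..'Z' move; every other byte value,
// including 0x80..0xFF, is copied through untouched.
inline void
memcpy_tolower_ref(char *dst, const char *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    dst[i] = (c >= 'A' && c <= 'Z') ? static_cast<char>(c + 0x20) : src[i];
  }
}
```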

Implementation notes

  • Cascade: wider bodies drain into narrower ones, so the worst-case scalar tail is always <16 bytes regardless of build flags.
  • AVX-512BW kernel uses _mm512_mask_add_epi8 to fuse the conditional +0x20 into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass (sketched after this list). Inspired by Tony Finch's copytolower64.c.
  • The whole AVX-512BW block is gated at n ≥ 64, because the masked load/store carries ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes the AVX-512BW build falls through to its AVX2 + SSE2 cascade.
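For reviewers who want the shape of the AVX-512BW body without opening the header, a sketch of the mask_add + masked-tail idea (illustrative names and structure, not the PR's actual code):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Bytes in 'A'..'Z' land in 0..25 after the subtract, so one unsigned
// compare yields the "is upper" mask, and _mm512_mask_add_epi8 applies
// the +0x20 only under that mask.
static void
tolower64_sketch(char *dst, const char *src, std::size_t n)
{
  const __m512i bias   = _mm512_set1_epi8('A');
  const __m512i limit  = _mm512_set1_epi8(26);
  const __m512i offset = _mm512_set1_epi8(0x20);

  std::size_t i = 0;
  for (; i + 64 <= n; i += 64) {
    __m512i   v     = _mm512_loadu_si512(src + i);
    __mmask64 upper = _mm512_cmplt_epu8_mask(_mm512_sub_epi8(v, bias), limit);
    _mm512_storeu_si512(dst + i, _mm512_mask_add_epi8(v, upper, v, offset));
  }
  if (i < n) {
    // Masked load/store covers the 1..63-byte tail in one SIMD pass.
    __mmask64 tail  = ~UINT64_C(0) >> (64 - (n - i));
    __m512i   v     = _mm512_maskz_loadu_epi8(tail, src + i);
    __mmask64 upper = _mm512_cmplt_epu8_mask(_mm512_sub_epi8(v, bias), limit);
    _mm512_mask_storeu_epi8(dst + i, tail, _mm512_mask_add_epi8(v, upper, v, offset));
  }
}
```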

src/proxy/hdrs/URL.cc drops its static-inline memcpy_tolower and calls ts::memcpy_tolower instead.
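Illustratively, the call-site change looks like this (the local helper is paraphrased from the common pattern, not quoted from the diff):

```cpp
// Before: a local byte-at-a-time helper driving the ParseRules table.
static inline void
memcpy_tolower(char *d, const char *s, int n)
{
  while (n--) {
    *d++ = ParseRules::ink_tolower(*s++);
  }
}

// After: drop the local helper and call the shared SIMD one.
#include "tscore/ink_memcpy_tolower.h"
// ...
ts::memcpy_tolower(d, s, n);
```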

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Mean ns per call from tools/benchmark/benchmark_memcpy_tolower:

  Size     scalar   SSE2    AVX2 (-mavx2)   AVX-512BW (-mavx512bw)
  ------   ------   -----   -------------   ----------------------
  16 B     10.4     2.15    1.75            1.98
  32 B     15.4     2.90    2.24            2.31
  64 B     28.0     4.43    2.85            2.61
  256 B    113      13.87   7.57            6.20
  1024 B   425      50.47   24.23           17.49

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.

URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail, unchanged. Typical host names (16+ bytes) see a 4–14× speedup depending on build flags.

Test plan

  • Microbench tools/benchmark/benchmark_memcpy_tolower runs 269 correctness assertions covering:
    • Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing each SIMD body) against the scalar reference.
    • An exhaustive sweep of all 256 byte values verifying that only A..Z are remapped, guarding against any future widening of the case-fold range (see the sketch after this list).
  • All paths run correctness clean on:
    • Broadwell (AVX2-capable) with -mavx2
    • Ice Lake (AVX-512BW) with -mavx2, -mavx512bw, and the default
  • cmake --build build -t format runs clean.
  • src/proxy/hdrs/libhdrs.a builds clean with the updated URL.cc.
  • Jenkins CI green.
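The exhaustive sweep boils down to something like this (a sketch assuming the (dst, src, n) signature used at the URL.cc call site; the benchmark's real assertions are organized differently and also cover multi-byte buffers):

```cpp
#include <cassert>
#include "tscore/ink_memcpy_tolower.h" // header added by this PR

void
sweep_all_byte_values()
{
  for (int b = 0; b < 256; ++b) {
    char in     = static_cast<char>(b);
    char out    = 0;
    ts::memcpy_tolower(&out, &in, 1);
    char expect = (b >= 'A' && b <= 'Z') ? static_cast<char>(b + 0x20) : in;
    assert(out == expect); // only A..Z may be remapped
  }
}
```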

Notes for reviewers

  • No new compile flags or dependencies. Just baseline SSE2 (x86_64 ABI) and baseline NEON (ARMv8 ABI); wider paths kick in automatically with -march=x86-64-v3 or -march=x86-64-v4.
  • The header includes <immintrin.h> / <arm_neon.h> only inside the #if that needs them, so other architectures don't pull them in.
  • AVX-512BW kernel design (mask_add + masked tail) was adapted from Tony Finch's vectolower, license-compatible (0BSD / MIT-0).
  • Other call sites with the same byte-at-a-time tolower pattern (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code

phongn force-pushed the simd-bulk-tolower branch from ed9a596 to d975fb2 on May 14, 2026 at 20:30

The bulk ASCII tolower loop used to canonicalize the scheme and host
portions of a URL before hashing into the cache key runs at ~1.5 GB/s
scalar (one byte and one ParseRules table lookup per iteration). The
work is trivially data-parallel and there is no per-byte branching, so
a SIMD kernel that lowercases a whole register at once gives a
straightforward speedup once the input is long enough to amortize the
vector setup.

Add a header-only helper ts::memcpy_tolower under
include/tscore/ink_memcpy_tolower.h with a compile-time-selected
cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2
on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to
narrower drain loops, so the worst-case scalar tail is always <16
bytes. Selection is purely compile-time; runtime ifunc dispatch is
left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional
"+0x20 where upper" into a single op, and a masked load/store handles
1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's
copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole
AVX-512BW block is gated at n >= 64 because the masked load/store has
~7 ns of fixed setup that loses to the narrower paths for short
inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly:
bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF)
pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with
this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds
that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW)
get the wider bodies automatically. Sub-16-byte inputs (e.g. short
HTTP schemes like "http") use the scalar tail and see no perf change.

Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

  size   scalar   SSE2     AVX2     AVX-512BW
  ----   ------   ----     ----     ---------
  16 B   10.4     2.15     1.75     1.98
  32 B   15.4     2.90     2.24     2.31
  64 B   28.0     4.43     2.85     2.61
  256 B  113      13.87    7.57     6.20
  1024 B 425      50.47    24.23    17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across
sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-
value sweep that guards against any future widening of the case-fold
range, alongside scalar-vs-SIMD throughput numbers and a config-print
case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
phongn force-pushed the simd-bulk-tolower branch from d975fb2 to 32016aa on May 14, 2026 at 20:56