tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
Draft
phongn wants to merge 1 commit into
The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup.

Add a header-only helper `ts::memcpy_tolower` under `include/tscore/ink_memcpy_tolower.h` with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up.

The AVX-512BW body uses `_mm512_mask_add_epi8` to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing `ParseRules::ink_tolower` table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged.

Replace the static inline `memcpy_tolower` in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider `-march` (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change.
Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

| size | scalar | SSE2 | AVX2 | AVX-512BW |
|-----:|-------:|-----:|-----:|----------:|
| 16 B | 10.4 | 2.15 | 1.75 | 1.98 |
| 32 B | 15.4 | 2.90 | 2.24 | 2.31 |
| 64 B | 28.0 | 4.43 | 2.85 | 2.61 |
| 256 B | 113 | 13.87 | 7.57 | 6.20 |
| 1024 B | 425 | 50.47 | 24.23 | 17.49 |

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
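The shape of the correctness sweep can be sketched as below. Names and the driving pattern are illustrative; the real harness exercises `ts::memcpy_tolower`, while a scalar stand-in keeps this sketch self-contained.

```cpp
#include <cstddef>
#include <vector>

// Scalar stand-in for the implementation under test. The real harness
// would call ts::memcpy_tolower here instead.
static void
tolower_under_test(char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    dst[i] = static_cast<char>((c >= 'A' && c <= 'Z') ? c + 0x20 : c);
  }
}

// Reference semantics: exactly the ParseRules::ink_tolower table.
static unsigned char
ref_tolower(unsigned char c)
{
  return (c >= 'A' && c <= 'Z') ? c + 0x20 : c;
}

// Sweep every size 0..257 so each SIMD body width (64/32/16) and every
// tail length is exercised at least once.
static bool
sweep_sizes()
{
  std::vector<char> src(258), dst(258);
  for (size_t i = 0; i < src.size(); ++i)
    src[i] = static_cast<char>(i * 131 + 17); // varied, full-range byte pattern
  for (size_t n = 0; n <= 257; ++n) {
    tolower_under_test(dst.data(), src.data(), n);
    for (size_t i = 0; i < n; ++i)
      if (static_cast<unsigned char>(dst[i]) !=
          ref_tolower(static_cast<unsigned char>(src[i])))
        return false;
  }
  return true;
}

// Exhaustive byte-value sweep: only 'A'..'Z' may change. This is the check
// that guards against any future widening of the case-fold range.
static bool
sweep_bytes()
{
  for (int b = 0; b < 256; ++b) {
    char in = static_cast<char>(b), out;
    tolower_under_test(&out, &in, 1);
    bool changed  = (out != in);
    bool is_upper = (b >= 'A' && b <= 'Z');
    if (changed != is_upper)
      return false;
  }
  return true;
}
```

Both sweeps compare against the per-byte reference rather than a snapshot of output, so a regression in either direction (missed uppercase, or folding bytes outside 'A'..'Z') fails loudly.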
Summary
Add a SIMD-accelerated bulk ASCII tolower helper `ts::memcpy_tolower` in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest.

Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising `-march` (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.

Behavior matches
`ParseRules::ink_tolower` exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged.

Implementation notes
- The AVX-512BW body uses `_mm512_mask_add_epi8` to fuse the conditional `+0x20` into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.
- `src/proxy/hdrs/URL.cc` drops its static-inline `memcpy_tolower` and calls `ts::memcpy_tolower` instead.

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)
Mean ns per call from `tools/benchmark/benchmark_memcpy_tolower`:

| size | scalar | SSE2 | AVX2 (`-mavx2`) | AVX-512BW (`-mavx512bw`) |
|-----:|-------:|-----:|----------------:|-------------------------:|
| 16 B | 10.4 | 2.15 | 1.75 | 1.98 |
| 32 B | 15.4 | 2.90 | 2.24 | 2.31 |
| 64 B | 28.0 | 4.43 | 2.85 | 2.61 |
| 256 B | 113 | 13.87 | 7.57 | 6.20 |
| 1024 B | 425 | 50.47 | 24.23 | 17.49 |

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.
URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.
Test plan
- `tools/benchmark/benchmark_memcpy_tolower` runs 269 correctness assertions covering sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-value sweep verifying only 'A'..'Z' are remapped, which guards against any future widening of the case-fold range.
- Built and ran the benchmark with `-mavx2`, with `-mavx512bw`, and with the default flags.
- `cmake --build build -t format` clean.
- `src/proxy/hdrs/libhdrs.a` builds clean with the updated URL.cc.

Notes for reviewers
- The wider bodies require opting into `-march=x86-64-v3` or `-march=x86-64-v4`.
- The header pulls in `<immintrin.h>`/`<arm_neon.h>` only inside the `#if` that needs them, so other architectures don't pull them in.
- Other bulk tolower call sites (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code