TASK-083: apply iteration-1 review fixes to warm-path and route-lookup benches

etr · claude · etr · commit df74e8361b5b · 2026-06-22T19:26:55.000-07:00
Address worked review findings on top of the CI-gate wiring:

- bench_warm_path:
  - bump OUTER 11->51 so the median is the 26th of 51 sorted rounds and 1-2 OS-scheduling outlier rounds cannot flip a sub-10ns gate;
  - hoist the 64 KiB pmr backing buffer out of the timed lambda (buf5/buf6) so the per-iteration 64 KiB zeroing no longer swamps the measured pmr/string cost;
  - switch the arena upstream to null_memory_resource so overflow is a hard error instead of a silent heap fallback;
  - add an absolute additive noise floor (5 ns) to the regression gate so sub-ns timer quantization cannot false-trip the sub-10ns measurements.
- bench_route_lookup: document the <=256-iteration LRU re-warm window per outer round (~0.25% of INNER_RADIX, conservative) and the kNumPaths > ROUTE_CACHE_MAX_SIZE coupling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Rkuh4aSmrD8m2f2vYqakb6
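Illustration of the buffer hoist and the null_memory_resource upstream described above. This is a minimal standalone sketch, not the bench code: the helper names run_with_reused_buffer and overflow_is_hard_error are hypothetical, and a plain std::pmr::string stands in for the http_request_impl allocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <new>
#include <string>

// Hypothetical helper: each iteration builds a fresh monotonic_buffer_resource
// over the SAME caller-owned buffer, so the bump pointer restarts at
// buf.data() and the buffer is never re-zeroed inside the loop.
template <std::size_t N>
std::size_t run_with_reused_buffer(std::array<std::byte, N>& buf,
                                   std::size_t iters) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < iters; ++i) {
        std::pmr::monotonic_buffer_resource arena(
            buf.data(), buf.size(), std::pmr::null_memory_resource());
        // 64 chars is assumed to exceed the small-string buffer, so this
        // allocates from the arena; it would throw if the arena were full.
        std::pmr::string s(64, 'x', &arena);
        assert(s.size() == 64);
        total += s.size();
    }
    return total;
}

// Hypothetical helper: with null_memory_resource as upstream, a request that
// does not fit in the buffer throws std::bad_alloc instead of silently
// falling back to the global heap.
template <std::size_t N>
bool overflow_is_hard_error(std::array<std::byte, N>& buf) {
    std::pmr::monotonic_buffer_resource arena(
        buf.data(), buf.size(), std::pmr::null_memory_resource());
    try {
        (void)arena.allocate(buf.size() + 1);
    } catch (const std::bad_alloc&) {
        return true;
    }
    return false;
}
```

With a 4096-byte buffer, 1000 iterations of a 64-byte string only succeed because the arena is rebuilt on every iteration; a single long-lived arena would exhaust the buffer and, with this upstream, throw rather than spill to the heap.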
diff --git a/test/bench_route_lookup.cpp b/test/bench_route_lookup.cpp
@@ -239,6 +239,23 @@ int main() {
     // its capacity, so the steady-state hit rate is effectively zero and
     // each lookup pays the full radix walk.
     //
+    // Note (performance-reviewer-iter1-1): during the first ≤256 iterations
+    // of each outer round, the LRU progressively re-warms from the
+    // invalidated state. These iterations see a mix of cold-miss and
+    // warming costs rather than pure radix latency. However, with
+    // INNER_RADIX=100K this warm-up window is only ~0.25% of inner
+    // iterations per round (256 / 100K), so its effect on the median
+    // ns/call across OUTER rounds is negligible and conservative (slightly
+    // inflates the measured cost, making the gate stricter, not looser).
+    // The comment "each lookup pays the full radix walk" holds for ~99.74%
+    // of measured iterations; the first-256-per-round warm-up does not
+    // compromise the gate intent.
+    //
+    // If kNumPaths is ever changed, keep it above ROUTE_CACHE_MAX_SIZE
+    // (currently 256) so the LRU-eviction guarantee holds. If
+    // ROUTE_CACHE_MAX_SIZE itself changes, kNumPaths must be re-checked and
+    // updated manually -- see test-quality-reviewer-iter1-3.
+    //
     // invalidate_route_cache() runs once per OUTER round (not per inner
     // iteration), so its cost (clearing a ≤256-entry map) is amortised
     // across INNER_RADIX (100K) lookups and does not taint the median.
diff --git a/test/bench_warm_path.cpp b/test/bench_warm_path.cpp
@@ -195,11 +195,20 @@ int main() {
         return 0;
     }
 
-    // Project convention: 11 outer rounds, 1M inner iterations
-    // (TASK-058 acceptance criterion).  serialize_allow_405 is more
-    // expensive than the other measurements, so 100K inner is enough
-    // to keep the wall time bounded.
-    constexpr std::size_t OUTER = 11;
+    // OUTER=51 matches bench_hook_overhead and bench_route_lookup: with 51
+    // rounds the median is the 26th sorted value, so 1-2 high outlier rounds
+    // from OS scheduling noise move it by at most 1-2 ranks within a dense
+    // cluster of samples.  With OUTER=11 (the old value) the median was the
+    // 6th value and a single outlier round could flip a <10 ns gate; on a
+    // shared CI runner this produced spurious failures independent of real regressions.
+    // Wall-clock cost: 51 rounds * 1M inner at ~12 ns/call ≈ 612 ms total,
+    // acceptable for an out-of-band `make bench` run. (test-quality-reviewer-iter1-1)
+    //
+    // INNER_405=100K: serialize_allow_405 is more expensive than the other
+    // measurements (~40 ns), so 100K inner (4 ms of signal per outer round)
+    // keeps wall time bounded. The 51-outer median is still stable at this
+    // inner count because we care about median ns/call, not absolute signal.
+    constexpr std::size_t OUTER = 51;
     constexpr std::size_t INNER = 1'000'000;
     constexpr std::size_t INNER_405 = 100'000;
 
@@ -306,14 +315,24 @@ int main() {
     // allocations stay inside the arena and zero global-heap allocation
     // occurs during the measured window.
     //
-    // Prior design flaw: the old loop accumulated all 1M values under a
-    // single key without resetting the arena.  The arena was exhausted
-    // after ~819 iterations (~65536 / 80 bytes per pmr::string), so
-    // every subsequent call spilled to the upstream heap -- measuring
+    // Prior design flaw (TASK-083): the old loop accumulated all 1M values
+    // under a single key without resetting the arena.  The arena was
+    // exhausted after ~819 iterations (~65536 / 80 bytes per pmr::string),
+    // so every subsequent call spilled to the upstream heap -- measuring
     // exactly the allocation overhead the task was supposed to eliminate.
-    // The new design (fresh impl per call) removes this flaw and makes
-    // the bench an honest proof of the zero-heap-alloc guarantee.
-    // (performance-reviewer-iter1-1) -----
+    // The new design (fresh monotonic_buffer_resource per call) removes
+    // this flaw and makes the bench an honest proof of the zero-heap-alloc
+    // guarantee.
+    //
+    // buf is declared OUTSIDE the lambda (performance-reviewer-iter1-2):
+    // zero-initialising a 65536-byte stack array inside the inner loop
+    // would write 64 KiB per iteration (~64 GB per 1M-iteration round),
+    // swamping the measured pmr/string cost.  Instead, buf is zeroed ONCE
+    // before the timed loop and the lambda constructs a fresh
+    // monotonic_buffer_resource from the same backing storage on every
+    // call (cheap: just sets two internal pointers), which resets the
+    // bump-pointer without re-zeroing the memory.  The arena writes over
+    // stale bytes from the previous call, which is correct. -----
     {
         using httpserver::detail::http_request_impl;
         using httpserver::detail::arguments_accumulator;
@@ -325,17 +344,20 @@ int main() {
         static const char* kValue =
             "a%2Fbcdefghijklmnopqrstuvwxyz_padding_to_force_heap";
 
+        // Backing buffer zeroed once; reused across all inner iterations
+        // by constructing a fresh monotonic_buffer_resource each call.
+        alignas(std::max_align_t) std::array<std::byte, 65536> buf5{};
+        do_not_optimize(buf5);  // prevent compiler from eliding the buffer
+
         std::printf("bench_warm_path (5): build_request_args "
                     "(%%2F unescape via arena, fresh impl per call)\n");
         med_build_args_pct2f = measure_median_ns(
             "build_request_args_pct2f",
             [&]() {
-                // Each call: fresh arena-backed impl, one insert, destroy.
-                // This mirrors the production per-request lifecycle and
-                // keeps the arena within capacity on every call.
-                alignas(std::max_align_t) std::array<std::byte, 65536> buf{};
+                // Fresh monotonic_buffer_resource resets the bump pointer
+                // to buf5.data() without touching the bytes -- cheap.
                 std::pmr::monotonic_buffer_resource arena(
-                    buf.data(), buf.size(), std::pmr::new_delete_resource());
+                    buf5.data(), buf5.size(), std::pmr::null_memory_resource());
                 impl_alloc_t alloc(&arena);
                 auto* p = alloc.new_object<http_request_impl>(
                     nullptr, nullptr, alloc);
@@ -356,7 +378,8 @@ int main() {
     // Same fresh-impl-per-call structure as (5) so the timings are
     // directly comparable.  The median for (5) should land within noise
     // of this baseline once TASK-072 lands.
-    // (performance-reviewer-iter1-1) -----
+    // buf declared outside the lambda for the same reason as (5) above
+    // (performance-reviewer-iter1-2). -----
     {
         using httpserver::detail::http_request_impl;
         using httpserver::detail::arguments_accumulator;
@@ -366,14 +389,16 @@ int main() {
         static const char* kValue =
             "abcdefghijklmnopqrstuvwxyz_no_escape_baseline_padding";
 
+        alignas(std::max_align_t) std::array<std::byte, 65536> buf6{};
+        do_not_optimize(buf6);
+
         std::printf("bench_warm_path (6): build_request_args "
                     "(no-escape baseline, fresh impl per call)\n");
         med_build_args_plain = measure_median_ns(
             "build_request_args_plain",
             [&]() {
-                alignas(std::max_align_t) std::array<std::byte, 65536> buf{};
                 std::pmr::monotonic_buffer_resource arena(
-                    buf.data(), buf.size(), std::pmr::new_delete_resource());
+                    buf6.data(), buf6.size(), std::pmr::null_memory_resource());
                 impl_alloc_t alloc(&arena);
                 auto* p = alloc.new_object<http_request_impl>(
                     nullptr, nullptr, alloc);
@@ -394,23 +419,39 @@ int main() {
     // Each median is compared against its committed per-platform baseline
     // (bench_baseline.hpp). A median more than kAllowedRegressionRatio
     // (5%) over baseline fails the bench.
+    //
+    // Noise floor: for sub-10 ns measurements (e.g. WARM_SHOULD_SKIP_AUTH_EMPTY_NS
+    // baseline = 2 ns) a pure 5% ratio produces an allowed window of only 0.1 ns,
+    // which is below steady_clock resolution on many platforms. We add an absolute
+    // additive floor so the gate ceiling is at least baseline + kAbsoluteNoiseFloorNs
+    // regardless of the ratio, matching the pattern used in bench_hook_overhead.
+    // (test-quality-reviewer-iter1-2, performance-reviewer-iter1-3)
+    constexpr double kAbsoluteNoiseFloorNs = 5.0;
+
     namespace bb = httpserver::bench_baseline;
     std::printf("\nbench_warm_path summary (baselines from "
-                "bench_baseline.hpp, +%.0f%% allowed):\n",
-                100.0 * (bb::kAllowedRegressionRatio - 1.0));
+                "bench_baseline.hpp, +%.0f%% or +%.0f ns floor allowed):\n",
+                100.0 * (bb::kAllowedRegressionRatio - 1.0),
+                kAbsoluteNoiseFloorNs);
 
     int rc = 0;
     const auto check = [&](const char* label, double measured,
                            double baseline) {
-        const double allowed = baseline * bb::kAllowedRegressionRatio;
+        // Use the larger of the ratio-based ceiling and the absolute noise
+        // floor so sub-ns timer quantization cannot false-trip the gate.
+        const double ratio_ceiling  = baseline * bb::kAllowedRegressionRatio;
+        const double floor_ceiling  = baseline + kAbsoluteNoiseFloorNs;
+        const double allowed = std::max(ratio_ceiling, floor_ceiling);
         const double pct = 100.0 * (measured / baseline - 1.0);
         std::printf("  %-26s median=%8.3f ns  baseline=%8.3f ns  %+6.1f%%\n",
                     label, measured, baseline, pct);
         if (measured > allowed) {
-            std::printf("FAIL: %s median %.3f ns exceeds baseline*%.2f = "
-                        "%.3f ns (regression %+.1f%%)\n",
-                        label, measured, bb::kAllowedRegressionRatio,
-                        allowed, pct);
+            std::printf("FAIL: %s median %.3f ns exceeds gate ceiling %.3f ns "
+                        "(baseline*%.2f=%.3f, baseline+%.0fns=%.3f, "
+                        "regression %+.1f%%)\n",
+                        label, measured, allowed,
+                        bb::kAllowedRegressionRatio, ratio_ceiling,
+                        kAbsoluteNoiseFloorNs, floor_ceiling, pct);
             rc = 1;
         }
     };
@@ -427,8 +468,10 @@ int main() {
           bb::WARM_BUILD_REQUEST_ARGS_PLAIN_NS);
 
     if (rc == 0) {
-        std::printf("PASS: all warm-path medians within %.0f%% of baseline\n",
-                    100.0 * (bb::kAllowedRegressionRatio - 1.0));
+        std::printf("PASS: all warm-path medians within gate "
+                    "(+%.0f%% or +%.0fns floor over baseline)\n",
+                    100.0 * (bb::kAllowedRegressionRatio - 1.0),
+                    kAbsoluteNoiseFloorNs);
     }
     return rc;
 }
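Worked example of the gate ceiling introduced in the last hunk. This is a standalone sketch: kRatio and kFloorNs are hypothetical stand-ins for bb::kAllowedRegressionRatio (assumed to be 1.05, per the "5%" in the comment) and kAbsoluteNoiseFloorNs, and gate_ceiling / passes_gate are illustrative names, not functions in the bench.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for the constants used by the bench.
constexpr double kRatio = 1.05;   // assumed value of kAllowedRegressionRatio
constexpr double kFloorNs = 5.0;  // kAbsoluteNoiseFloorNs from the diff

// Ceiling is the larger of the ratio-based limit and baseline + floor.
double gate_ceiling(double baseline_ns) {
    assert(baseline_ns > 0.0);
    return std::max(baseline_ns * kRatio, baseline_ns + kFloorNs);
}

// A measurement passes when it does not exceed the ceiling.
bool passes_gate(double measured_ns, double baseline_ns) {
    return measured_ns <= gate_ceiling(baseline_ns);
}
```

For the 2 ns baseline the ceiling becomes 7 ns instead of 2.1 ns. Under the assumed 5% ratio, the two limits cross at a 100 ns baseline (5 ns / 0.05), so the floor, not the ratio, governs every measurement below 100 ns; a 40 ns baseline, for instance, is allowed up to 45 ns.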