[Multi-Tenancy Test] WireConnectionSharingInBenchmark #48131
xinlian12 wants to merge 31 commits into Azure:main
Conversation
Code changes:
- Add connectionSharingAcrossClientsEnabled field/getter/setter to TenantWorkloadConfig
- Add switch case in applyField() so the tenants.json value is properly applied
- Add -connectionSharingAcrossClientsEnabled CLI parameter to Configuration (JCommander)
- Apply connectionSharingAcrossClientsEnabled on CosmosClientBuilder in AsyncBenchmark
- Wire through fromConfiguration() for the legacy CLI path
- Add to toString() for debug visibility

Test plan updates:
- Expand S8 from 9 to 30 scenarios (3 protocols x 5 workloads x 2 sharing modes)
- Add ReadLatency and WriteLatency workloads
- Add isolated vs shared connection pool dimension
- Update operations per tenant to 1,000,000
- Add metrics catalog with availability status (available vs needs SDK change)
- Update execution runbook B13-B42 for the 30-scenario matrix
- Update run-baseline-matrix.sh script for 30 scenarios
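The applyField() wiring described above (the fix for the "dead config" finding) can be sketched as follows. The class and method names come from the PR description, but the surrounding class shape is an illustrative stand-in, not the actual benchmark source:

```java
// Stand-in sketch of TenantWorkloadConfig's per-tenant field plumbing.
public class TenantWorkloadConfig {
    private boolean connectionSharingAcrossClientsEnabled = false;

    public boolean isConnectionSharingAcrossClientsEnabled() {
        return connectionSharingAcrossClientsEnabled;
    }

    public void setConnectionSharingAcrossClientsEnabled(boolean enabled) {
        this.connectionSharingAcrossClientsEnabled = enabled;
    }

    // applyField() maps a key/value pair from tenants.json onto this config.
    // Before this PR the field existed but had no switch case here, so the
    // tenants.json value was silently dropped.
    public void applyField(String name, String value) {
        switch (name) {
            case "connectionSharingAcrossClientsEnabled":
                setConnectionSharingAcrossClientsEnabled(Boolean.parseBoolean(value));
                break;
            default:
                throw new IllegalArgumentException("Unknown tenant config field: " + name);
        }
    }
}
```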
- F2: Align with analysis doc; IMDS client is now ephemeral, A25/A27 resolved
- F4: Add detailed before/after table for every A1/A2 resource claim
- F5: New finding: connectionSharingAcrossClientsEnabled was dead config (now fixed)
- F6: New finding: Reactor Netty pool metrics and H2 stream metrics are gaps
SDK change:
- Add fixedConnectionProviderBuilder.metrics(true) in HttpClient.createFixed()
Emits reactor.netty.connection.provider.{total,active,idle,pending}.connections
gauges tagged by remote.address (hostname:port) to Micrometer globalRegistry
Benchmark change:
- Add SimpleMeterRegistry to Metrics.globalRegistry so pool metrics are queryable
- Add logPoolMetrics() helper that logs all pool metrics at POST_CREATE and POST_WORKLOAD
Shows remote.address tag to verify pooling is by hostname (not resolved IP)
- Isolated mode: pool name = 'cosmos-pool-<endpoint-host>'
- Shared mode: pool name = 'cosmos-shared-pool'
- Enables distinguishing pools by name in Reactor Netty metrics tags
…metrics export
- Build cosmosMicrometerRegistry once in run(), add to Metrics.globalRegistry
- Pass to prepareTenants() as parameter (no duplicate creation)
- Reactor Netty pool metrics now export to both SimpleMeterRegistry (local) and App Insights/Graphite (if configured) via globalRegistry
New class:
- NettyHttpMetricsReporter: periodically samples Reactor Netty connection pool metrics from the Micrometer registry and writes to netty-pool-metrics.csv
- Columns: timestamp, metric, pool_id, pool_name, remote_address, value
- Started/stopped alongside the Dropwizard CsvReporter in BenchmarkOrchestrator

Cleanup:
- Remove logPoolMetrics() ad-hoc method and all its calls
- Remove SimpleMeterRegistry (cosmosMicrometerRegistry on globalRegistry is sufficient)
- Remove POOL_METRICS_TAGS debug dump
- Remove unused Gauge/Meter imports
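The CSV row shape listed above can be sketched as a simple formatter. Only the column layout comes from the PR; the class and method names here are illustrative, and the actual sampling from Micrometer gauges is omitted:

```java
// Sketch of the netty-pool-metrics.csv row layout:
// timestamp, metric, pool_id, pool_name, remote_address, value
public class NettyPoolCsv {
    public static String header() {
        return "timestamp,metric,pool_id,pool_name,remote_address,value";
    }

    // One sampled gauge value becomes one CSV row.
    public static String row(long timestampMillis, String metric, String poolId,
                             String poolName, String remoteAddress, double value) {
        return String.join(",",
            Long.toString(timestampMillis), metric, poolId, poolName,
            remoteAddress, Double.toString(value));
    }
}
```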
SDK:
- Add COSMOS.NETTY_HTTP_CLIENT_METRICS_ENABLED system property (default false)
- Add COSMOS_NETTY_HTTP_CLIENT_METRICS_ENABLED env var fallback
- ConnectionProvider.metrics(true) only called when the property is enabled
- Generic name allows enabling future Netty HTTP metrics beyond pool gauges

Benchmark:
- Add -enableNettyHttpMetrics CLI flag to Configuration
- Wire through BenchmarkConfig to BenchmarkOrchestrator
- Orchestrator sets the system property before client creation
- NettyHttpMetricsReporter only starts when the flag is enabled
- run-baseline-matrix.sh passes -enableNettyHttpMetrics
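The property/env-var gate above can be sketched as follows. The property and env-var names come from the PR; the lookup order (system property first, then environment fallback, default false) is assumed from the description, and the class name is a stand-in for the real Configs helper:

```java
// Sketch of the COSMOS.NETTY_HTTP_CLIENT_METRICS_ENABLED gate.
public class NettyMetricsConfig {
    static final String PROPERTY = "COSMOS.NETTY_HTTP_CLIENT_METRICS_ENABLED";
    static final String ENV_VAR = "COSMOS_NETTY_HTTP_CLIENT_METRICS_ENABLED";

    // System property wins; env var is the fallback; absent means disabled.
    public static boolean isNettyHttpClientMetricsEnabled() {
        String value = System.getProperty(PROPERTY);
        if (value == null) {
            value = System.getenv(ENV_VAR);
        }
        return Boolean.parseBoolean(value); // Boolean.parseBoolean(null) == false
    }
}
```

The SDK would then call ConnectionProvider.metrics(true) only when this returns true, so the gauges cost nothing by default.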
…pplyField) Completes the http2Enabled wiring in TenantWorkloadConfig so it can be set via tenants.json globalDefaults or per-tenant overrides. HTTP/2 can also be enabled via -DCOSMOS.HTTP2_ENABLED=true system property (existing path).
The reporter was defined but never instantiated in the run() method. Now creates and starts it before PRE_CREATE, stops it in cleanup.
…gauges Without a backing registry in Metrics.globalRegistry, the CompositeMeterRegistry registers gauges but Gauge.value() returns 0. Adding a SimpleMeterRegistry ensures the gauges have actual storage for their values, making netty-pool-metrics.csv report real connection counts.
…coded 100 Flux.merge concurrency during document pre-population was hardcoded to 100, causing ~100 TCP connections to be opened per tenant regardless of the configured concurrency (typically 20). Now uses min(cfg.getConcurrency(), 100) so the number of pre-warmed connections matches the actual workload concurrency.
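The cap described above, plus the clamp-to-1 guard added later in this PR, amounts to the following (class and method names are illustrative, not the benchmark's actual API):

```java
// Sketch of the pre-population merge concurrency selection.
public class PrePopulation {
    // Was: Flux.merge(..., 100) regardless of configured concurrency,
    // opening ~100 TCP connections per tenant.
    // Now: capped by the tenant's configured concurrency, clamped to >= 1
    // so a zero/empty config cannot produce merge(..., 0).
    public static int prePopConcurrency(int configuredConcurrency) {
        return Math.max(1, Math.min(configuredConcurrency, 100));
    }
}
```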
Add .gitignore entries for: - .github/agents/ - .github/skills/ - sdk/cosmos/azure-cosmos-benchmark/docs/ - sdk/cosmos/azure-cosmos-benchmark/scripts/ These are local-only files that should not be tracked in the repository. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/xinlian12/azure-sdk-for-java into wireConnectionSharingInBenchmark
All 50 tenants were sharing a single Timer with synchronized HdrHistogramResetOnSnapshotReservoir,
causing ~7x throughput reduction for *Latency operations (9,400 vs 68,600 ops/s).
Fix: prefix meter names with tenant ID (e.g., 'tenant-0.Latency') so each tenant
gets its own Timer instance. Throughput/failure Meters also prefixed for consistency.
Root cause: MetricRegistry.register('Latency', timer) registered ONE timer in the shared
registry. All 50 tenants' LatencySubscriber.hookOnComplete() called context.stop() which
serialized through the same synchronized reservoir.update() method.
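The prefixing fix (note that a later commit in this PR reverts it) can be sketched with a plain map standing in for Dropwizard's MetricRegistry; everything except the 'tenant-0.Latency' naming scheme is illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for MetricRegistry: one entry per distinct meter name.
public class TenantMeterNames {
    private final Map<String, Object> meters = new ConcurrentHashMap<>();

    // Prefixing with the tenant id ("tenant-0.Latency") yields one Timer per
    // tenant instead of 50 tenants serializing on one synchronized reservoir.
    public static String meterName(String tenantId, String base) {
        return tenantId + "." + base;
    }

    // Object stands in for com.codahale.metrics.Timer.
    public Object timerFor(String tenantId) {
        return meters.computeIfAbsent(meterName(tenantId, "Latency"), k -> new Object());
    }
}
```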
…contention" This reverts commit 922835a.
Pull request overview
This PR wires connectionSharingAcrossClientsEnabled and a new Netty HTTP pool metrics toggle through the Cosmos multi-tenancy benchmark harness, and adds an SDK-side switch to enable Reactor Netty ConnectionProvider Micrometer metrics via a system property.
Changes:
- Add COSMOS.NETTY_HTTP_CLIENT_METRICS_ENABLED config and conditionally enable Reactor Netty connection pool metrics in the Cosmos SDK HTTP client.
- Extend the benchmark harness to support connectionSharingAcrossClientsEnabled and -enableNettyHttpMetrics, including a new CSV reporter for Netty pool gauges.
- Adjust benchmark document pre-population merge concurrency to be capped by tenant concurrency.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpClient.java | Enables ConnectionProvider.metrics(true) when the new config flag is set. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/clienttelemetry/ClientTelemetry.java | Consolidates IMDS failure debug logging into a single message. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Adds COSMOS.NETTY_HTTP_CLIENT_METRICS_ENABLED system-property/env-var toggle. |
| sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/TenantWorkloadConfig.java | Adds per-tenant connectionSharingAcrossClientsEnabled config plumbing. |
| sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/NettyHttpMetricsReporter.java | New reporter that samples Reactor Netty pool gauges from Micrometer and writes CSV. |
| sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/Configuration.java | Adds CLI flags for connection sharing and Netty metrics enablement. |
| sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/BenchmarkOrchestrator.java | Turns on the system property and attaches a SimpleMeterRegistry for Netty gauge backing. |
| sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/BenchmarkConfig.java | Wires the new enableNettyHttpMetrics setting into the internal config model. |
| sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/AsyncBenchmark.java | Wires connection sharing into the client builder and changes pre-pop merge concurrency selection. |
- Move NETTY_HTTP_CLIENT_METRICS_ENABLED system property into setGlobalSystemProperties
- Wrap run() lifecycle in try/finally to ensure cleanup on exceptions
- Stop NettyHttpMetricsReporter and remove SimpleMeterRegistry in cleanup
- Guard against zero prePopConcurrency by clamping to 1 and skipping empty list
- Log IOException with full stack trace in NettyHttpMetricsReporter
- Reword IMDS metadata debug message to avoid definitive claim

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
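The try/finally lifecycle change above can be sketched with stand-in types (the real classes are BenchmarkOrchestrator and NettyHttpMetricsReporter):

```java
// Sketch: the reporter must be stopped even when the workload throws.
public class OrchestratorLifecycle {
    public interface Reporter {
        void start();
        void stop();
    }

    // Wrapping the run in try/finally guarantees cleanup on exceptions,
    // which the pre-fix code path did not.
    public static void runSafely(Reporter reporter, Runnable workload) {
        reporter.start();
        try {
            workload.run();
        } finally {
            reporter.stop(); // always reached, even on exception
        }
    }
}
```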
This is an orchestrator-level JVM-global system property, not a per-tenant config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```java
 *
 * <p>CSV columns: timestamp, metric, pool_id, pool_name, remote_address, value</p>
 */
public class NettyHttpMetricsReporter {
```
Can you please also add a way to report accumulated results/stats into the Cosmos reporter. It makes it so much easier to produce reports when running different test cases repeatedly; manually merging/combining CSVs is just a headache. So, reporting min/max/average of values across the lifecycle of the process, or similar. This exists for system CPU and memory already and could follow a similar pattern.
FabianMeiswinkel
left a comment
LGTM except for one small ask.
Summary
Wire connectionSharingAcrossClientsEnabled through the benchmark harness and add Reactor Netty HTTP client connection pool metrics for multi-tenancy testing.
Code Changes
Benchmark Harness (azure-cosmos-benchmark)
New classes:
- NettyHttpMetricsReporter: samples Reactor Netty connection pool metrics to CSV
Wired new config fields:
- connectionSharingAcrossClientsEnabled: CLI flag + tenants.json + CosmosClientBuilder
- enableNettyHttpMetrics: CLI flag to enable Reactor Netty pool metrics
SDK Changes (azure-cosmos)
- Configs.java: add COSMOS.NETTY_HTTP_CLIENT_METRICS_ENABLED system property (default false)
- HttpClient.java: conditionally call ConnectionProvider.metrics(true) when enabled
Baseline Test Results
Test environment: Azure VM D16s_v5 (16 cores, 64 GB) in West US 2, same region as Cosmos DB accounts
Common parameters: -Xmx8g -Xms8g, G1GC
ReadThroughput: HTTP/1.1 vs HTTP/2 x Isolated vs Shared
Measures aggregate ops/sec via Codahale Meter.
Throughput:
```mermaid
xychart-beta
    title "ReadThroughput: HTTP/1.1 vs HTTP/2 (ops/s, 1-min rate)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "ops/sec" 40000 --> 75000
    line "H1.1" [55467,68904,70545,69989,69072,69141,69231,69159,68968,66870,65884,67137,67847,67957,68467]
    line "H2" [49116,56673,57755,57508,57254,57217,57164,57170,57183,57171,57214,57178,57244,57237,56784]
```
Resource utilization over time (ReadThroughput):
```mermaid
xychart-beta
    title "CPU Usage: HTTP/1.1 vs HTTP/2 (% of 16 cores)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "CPU %" 0 --> 100
    line "H1.1" [0,77,83,86,89,93,93,94,95,94,95,95,95,97,97]
    line "H2" [5,62,72,77,81,84,87,89,91,93,93,94,94,94,95]
```
Resource Consumption:
(resource table lost in export; CPU/memory sampled via /proc, threads via ThreadMXBean)
Connection Pool:
Connection utilization over time (ReadThroughput, regional endpoints):
```mermaid
xychart-beta
    title "HTTP/1.1: Active vs Idle Connections (total=1000)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "Connections" 0 --> 1100
    line "Active" [681,670,711,688,693,698,678,632,725,692,655,654,664,759,676]
    line "Idle" [321,323,267,340,295,302,344,377,295,289,338,356,297,244,339]
```
```mermaid
xychart-beta
    title "HTTP/2: Active Connections vs Active Streams (total TCP=800)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "Count" 0 --> 850
    line "TCP conns" [720,800,800,800,800,800,800,800,800,800,800,800,800,800,800]
    line "H2 active conns" [0,218,216,210,228,204,236,255,165,183,208,227,201,198,220]
    line "Active streams" [19,765,734,642,672,706,736,761,753,719,792,728,742,720,690]
```
HTTP/2 Connection Metrics: Per-Account Breakdown (4 sample accounts, regional endpoints):
(per-account table lost in export; columns: total.connections, active.connections, idle.connections, H2 active.connections (carrying streams), H2 idle.connections (no active streams), active.streams)
All accounts show the same pattern: 16 TCP connections (= minConnectionPoolSize), 5-8 actively streaming, 13-19 active streams (~2-3 streams/connection). The base pool reports all 16 as "active" (open TCP), while the H2 pool shows real stream-level usage. Multiplexing is happening, but at a low ratio; higher concurrency or fewer connections would increase stream density.
Thread Breakdown (Java threads only, mid-run snapshot via ThreadMXBean):
(thread-count table lost in export; pools: transport-response-bounded-elastic, tenant-worker, partition-availability-staleness-check, cosmos-daemon-cosmos-global-endpoint-mgr, reactor-http-epoll, parallel, cosmos-parallel, boundedElastic-evictor)
ReadLatency: HTTP/1.1 vs HTTP/2
Measures per-operation latency via Codahale Timer with HDR Histogram.
Latency Percentiles (ms):
(percentile table partially lost in export; surviving row: P99 = 4.60 / 4.56 / 6.18 / 2.99)
Throughput:
Latency over time (ms):
```mermaid
xychart-beta
    title "ReadLatency P50: HTTP/1.1 vs HTTP/2 (ms)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "P50 ms" 1.5 --> 2.5
    line "H1.1" [2.02,1.99,2.00,1.97,1.98,1.98,1.98,1.98,1.99,1.98,1.98,1.97,1.98,1.98,1.97]
    line "H2" [2.31,2.11,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13]
```
```mermaid
xychart-beta
    title "ReadLatency P99: HTTP/1.1 vs HTTP/2 (ms)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "P99 ms" 3 --> 8
    line "H1.1" [4.85,4.55,4.62,4.55,4.59,4.62,4.55,4.65,4.62,4.62,4.65,4.52,4.62,4.55,4.55]
    line "H2" [5.70,5.77,5.73,5.73,5.70,5.73,5.80,5.77,5.73,5.73,5.67,5.67,5.67,5.80,5.73]
```
Resource Consumption:
WriteThroughput: HTTP/1.1 vs HTTP/2 x Isolated vs Shared
Measures aggregate write ops/sec via Codahale Meter.
Throughput:
Throughput over time (1-min rate, ops/s):
```mermaid
xychart-beta
    title "WriteThroughput: HTTP/1.1 vs HTTP/2 (ops/s)"
    x-axis "Minutes" [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29]
    y-axis "ops/sec" 25000 --> 75000
    line "H1.1" [34511,63717,66585,66737,67877,68544,68763,68833,69028,69154,67873,68678,68903,68913,68853]
    line "H2" [28474,51141,53366,53708,54359,54755,54865,54788,54685,54810,54838,54768,54786,54791,54905]
```
Resource Consumption:
Connection Pool:
WriteLatency: HTTP/1.1 vs HTTP/2
Measures per-operation write latency via Codahale Timer with HDR Histogram.
Latency Percentiles (ms):
Throughput:
Resource Consumption:
Key Findings
F7: Per-Client Thread Cost (~6.2 threads each)
(per-client threads: cosmos-global-endpoint-mgr, partition-availability-staleness-check, transport-response-bounded-elastic)
F8: Pre-Population Concurrency Fix
FDs: 5,100 -> 1,100. Connection utilization: 15% -> 67.6%. Throughput unchanged.
F9: Connection Pool Keyed by Hostname, Not IP
50 accounts -> 4 IPs -> but 100 pool slots (hostname-based).
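The hostname-keyed behavior above can be sketched as follows. The endpoint hostnames are fabricated for illustration; the point is that pools are counted per distinct hostname:port, so shared gateway IPs do not collapse them:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: connection pools keyed by remote hostname:port, not resolved IP.
public class PoolKeying {
    public static int poolCount(List<String> endpointHosts) {
        Map<String, Integer> pools = new HashMap<>();
        for (String host : endpointHosts) {
            // Each distinct hostname gets its own pool slot, even if DNS
            // resolves several hostnames to the same gateway IP.
            pools.merge(host + ":443", 1, Integer::sum);
        }
        return pools.size();
    }
}
```

Under this keying, sharing one client across accounts still yields one pool per account endpoint, which is why connection sharing does not reduce pool count in the multi-account setup.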
connectionSharingAcrossClientsEnabled is a no-op for multi-account.
Future Optimization Opportunities