
[GATEWAY V2]: Add aggressive HTTP timeout policies. #47879

Open
jeet1995 wants to merge 103 commits into Azure:main from jeet1995:AzCosmos_HttpTimeoutPolicyChangesGatewayV2


@jeet1995
Member

@jeet1995 jeet1995 commented Feb 3, 2026

Aggressive HTTP Timeout Policies for Gateway V2 (Thin Client)

Description

Adds shorter HTTP timeout policies for document operations routed through Gateway V2 (Thin Client). The default Gateway timeout is 60 seconds, but Thin Client operations are expected to complete in single-digit seconds. This PR introduces HttpTimeoutPolicyForGatewayV2 with 6s/6s/10s timeouts to enable faster failover and retry.

Changes

  • New HttpTimeoutPolicyForGatewayV2 with two singleton instances:

    • INSTANCE_FOR_POINT_READ — for point read operations
    • INSTANCE_FOR_QUERY_AND_CHANGE_FEED — for query and change feed operations
    • Both use a 6s → 6s → 10s retry progression with zero backoff delay (vs. 60s/60s/60s for default Gateway)
  • Updated HttpTimeoutPolicy.getTimeoutPolicy() routing logic:

    • When useThinClientMode == true and resourceType == Document:
      • OperationType.Read → point read policy
      • OperationType.Query / change feed → query policy
    • Write operations remain unretried on timeout (existing behavior preserved)
  • Extended ResponseTimeoutAndDelays with a Duration-based constructor overload to support zero-duration delays

  • HTTP response timeout now surfaced in Gateway diagnostics: ClientSideRequestStatistics.GatewayStatistics captures the httpResponseTimeout from each request and serializes it as httpNwResponseTimeout in the diagnostics JSON. This shows exactly which timeout value was applied to each gateway call, which helps debugging when different policies (default vs. Gateway V2) are in play.

Note: The timeout values for point read and query/change feed are currently identical (6s/6s/10s), but they are maintained as separate singleton instances intentionally. This separation allows either policy to be tuned independently via a targeted hotfix without risk of affecting the other operation type.
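
For reviewers who want the routing rule in one place, here is a minimal sketch of the selection logic described above. It is not the SDK code: `GatewayV2TimeoutRouting`, `Attempt`, `select`, and the `ChangeFeed` enum value are hypothetical stand-ins, and the write path (no retry on timeout) is not modeled.

```java
import java.time.Duration;
import java.util.List;

// Simplified stand-ins for the SDK types named above; not the real implementation.
final class GatewayV2TimeoutRouting {
    enum ResourceType { Document, DocumentCollection }
    // ChangeFeed is a hypothetical enum value used only to keep the sketch short.
    enum OperationType { Read, Query, ChangeFeed, Create }

    /** Response timeout for one attempt plus the pause before the next attempt. */
    static final class Attempt {
        final Duration responseTimeout;
        final Duration delayBeforeNext;

        Attempt(long timeoutSeconds, Duration delayBeforeNext) {
            this.responseTimeout = Duration.ofSeconds(timeoutSeconds);
            this.delayBeforeNext = delayBeforeNext;
        }
    }

    // 6s -> 6s -> 10s with zero backoff, as described in this PR.
    static List<Attempt> aggressive() {
        return List.of(
            new Attempt(6, Duration.ZERO),
            new Attempt(6, Duration.ZERO),
            new Attempt(10, Duration.ZERO));
    }

    // Two separate instances with identical values, so each can be tuned on its own.
    static final List<Attempt> POINT_READ = aggressive();
    static final List<Attempt> QUERY_AND_CHANGE_FEED = aggressive();

    // Default Gateway values; the trailing zero delay is an assumption of this sketch.
    static final List<Attempt> DEFAULT_GATEWAY = List.of(
        new Attempt(60, Duration.ZERO),
        new Attempt(60, Duration.ofSeconds(1)),
        new Attempt(60, Duration.ZERO));

    static List<Attempt> select(boolean useThinClientMode,
                                ResourceType resourceType,
                                OperationType operationType) {
        if (useThinClientMode && resourceType == ResourceType.Document) {
            if (operationType == OperationType.Read) {
                return POINT_READ;
            }
            if (operationType == OperationType.Query || operationType == OperationType.ChangeFeed) {
                return QUERY_AND_CHANGE_FEED;
            }
        }
        // Everything else keeps the existing behavior.
        return DEFAULT_GATEWAY;
    }
}
```

Keeping `POINT_READ` and `QUERY_AND_CHANGE_FEED` as distinct objects mirrors the intent of the note above: changing one list does not touch the other.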

Timeout Comparison

| Policy | Attempt 1 | Attempt 2 | Attempt 3 | Backoff |
|---|---|---|---|---|
| Default Gateway | 60s | 60s | 60s | 0s, 1s |
| Gateway V2 (Point Read) | 6s | 6s | 10s | 0s, 0s |
| Gateway V2 (Query/Change Feed) | 6s | 6s | 10s | 0s, 0s |
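
To make the difference concrete, the worst-case time spent waiting when every attempt times out can be added up from the values above. This is a rough sketch: it assumes the two backoff values are the pauses before attempts 2 and 3, and it ignores connection setup and any other overhead. `TimeoutBudget` and `worstCase` are illustrative names only.

```java
import java.time.Duration;
import java.util.List;

final class TimeoutBudget {
    /** Sum of all response timeouts and all backoff pauses. */
    static Duration worstCase(List<Duration> timeouts, List<Duration> backoffs) {
        Duration total = Duration.ZERO;
        for (Duration timeout : timeouts) {
            total = total.plus(timeout);
        }
        for (Duration backoff : backoffs) {
            total = total.plus(backoff);
        }
        return total;
    }

    public static void main(String[] args) {
        Duration defaultGateway = worstCase(
            List.of(Duration.ofSeconds(60), Duration.ofSeconds(60), Duration.ofSeconds(60)),
            List.of(Duration.ZERO, Duration.ofSeconds(1)));
        Duration gatewayV2 = worstCase(
            List.of(Duration.ofSeconds(6), Duration.ofSeconds(6), Duration.ofSeconds(10)),
            List.of(Duration.ZERO, Duration.ZERO));

        System.out.println("default gateway worst case: " + defaultGateway.getSeconds() + "s"); // 181s
        System.out.println("gateway v2 worst case: " + gatewayV2.getSeconds() + "s");           // 22s
    }
}
```

Under these assumptions a request that never gets a response gives up after about 22 seconds instead of about 181 seconds.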

H2 Connection Lifecycle Instrumentation

Every gateway request now surfaces HTTP/2 connection identity in CosmosDiagnostics:

"gatewayStatisticsList": [{
  "channelId": "f37707c7/2",
  "parentChannelId": "f37707c7",
  "isHttp2": true,
  "httpNetworkResponseTimeout": "PT6S"
}]
  • channelId — H2 stream channel (e.g., f37707c7/2 = stream 2 on parent f37707c7). New stream per request (RFC 9113 §5.1.1).
  • parentChannelId — TCP connection (NioSocketChannel) multiplexing all streams. This is the connection reuse identity.
  • isHttp2 — protocol flag. Only present for HTTP/2.

These appear on every gatewayStatisticsList entry — success (200), timeout (408/10002), and e2e cancel (408/20008) — enabling support engineers to correlate stream failures to their parent TCP connection.
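
When reading diagnostics by hand or in a log tool, the correlation can be done by splitting channelId on the slash. The sketch below assumes the `<parent>/<stream>` shape shown in the example and that the timeout is an ISO-8601 duration string such as "PT6S"; `ChannelIds` and its methods are hypothetical helpers, not part of the SDK.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ChannelIds {
    /** Parent (TCP connection) part of a channel ID, or the whole ID if there is no slash. */
    static String parentOf(String channelId) {
        int slash = channelId.indexOf('/');
        return slash < 0 ? channelId : channelId.substring(0, slash);
    }

    /** Numeric suffix after the slash, or -1 if the ID has no stream part. */
    static int streamOf(String channelId) {
        int slash = channelId.indexOf('/');
        return slash < 0 ? -1 : Integer.parseInt(channelId.substring(slash + 1));
    }

    /** Groups stream channel IDs by the parent connection they ran on, in first-seen order. */
    static Map<String, List<String>> groupByParent(List<String> channelIds) {
        Map<String, List<String>> byParent = new LinkedHashMap<>();
        for (String id : channelIds) {
            byParent.computeIfAbsent(parentOf(id), key -> new ArrayList<>()).add(id);
        }
        return byParent;
    }

    public static void main(String[] args) {
        System.out.println(parentOf("f37707c7/2"));                // f37707c7
        System.out.println(streamOf("f37707c7/2"));                // 2
        System.out.println(Duration.parse("PT6S").getSeconds());   // 6
    }
}
```

Grouping failed and successful entries by parent makes it easy to see whether a timed-out stream and the next successful request shared a TCP connection.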

Additional production changes

| File | Change |
|---|---|
| ReactorNettyClient | ConnectionObserver captures channel IDs at CONNECTED / ACQUIRED / STREAM_CONFIGURED. Extracted captureChannelIds() and getRequestRecordFromConnection() helpers. |
| ReactorNettyRequestRecord | New fields: channelId, parentChannelId, isHttp2 |
| StoreResponse / StoreResponseDiagnostics | Channel ID fields threaded through success and error paths |
| ClientSideRequestStatistics | GatewayStatistics serializes channel IDs; isHttp2 only when true |
| BridgeInternal | 5-arg recordGatewayResponse overload accepting ReactorNettyRequestRecord |
| RxGatewayStoreModel | Channel IDs threaded through success / error / cancel paths. Cancel path now always sets requestUri. |
| ThinClientStoreModel | refCnt() guards at all 6 ByteBuf release sites |

Connection Lifecycle Tests

Why

Before enabling aggressive timeouts, we needed to prove that stream-level ReadTimeoutException does not close the parent TCP connection. Confirmed from reactor-netty 1.2.13 source: ReadTimeoutHandler is on the H2 stream pipeline (HttpClientOperations.onOutboundComplete()), not the parent.

SDK Fault Injection (in CI)

3 tests in FaultInjectionServerErrorRuleOnGatewayV2Tests — connection reuse after timeout, connection survives e2e timeout, connection survives for next request.

Docker tc netem (manual — ManualNetworkDelayConnectionLifecycleTests)

| Test | E2E | Delay | What fires | What it proves |
|---|---|---|---|---|
| connectionReuseAfterRealNettyTimeout | 15s | 8s | ReadTimeoutHandler (6s) | 408/10002 during delay; recovery 27ms, same parentChannelId |
| connectionSurvivesE2ETimeoutWithRealDelay | 7s | 8s | ReadTimeout + e2e cancel | Both fire; parent survives both |
| multiParentChannelConnectionReuse | 15s | 8s | ReadTimeoutHandler | 100-concurrency → 10 parents; survivalRate=10/10 |
| parentChannelSurvivesE2ECancelWithoutReadTimeout | 3s | 8s | Only e2e cancel | No 408/10002; recovery parent from known pool |
| retryUsesConsistentParentChannelId | 25s | 8s | ReadTimeout ×2 + 3rd retry succeeds | All retries on same parent; 3rd retry (10s) absorbs 8s delay |
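
One plausible way to express the survival check is to compare the set of parentChannelId values seen before the delay with the set seen after recovery. The sketch below is an illustration of that idea only; it is not taken from the manual tests, and `ParentChannelSurvival`, `survivors`, and the sample IDs are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class ParentChannelSurvival {
    /** Number of parent channels seen before the fault that are still in use afterwards. */
    static int survivors(Set<String> parentsBefore, Set<String> parentsAfter) {
        Set<String> stillAlive = new HashSet<>(parentsBefore);
        stillAlive.retainAll(parentsAfter);
        return stillAlive.size();
    }

    public static void main(String[] args) {
        Set<String> before = Set.of("f37707c7", "a1b2c3d4", "0badc0de");
        Set<String> after = Set.of("f37707c7", "a1b2c3d4", "0badc0de");
        // Prints survivalRate=3/3 when every parent channel was reused after recovery.
        System.out.println("survivalRate=" + survivors(before, after) + "/" + before.size());
    }
}
```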

Follow-ups (separate PRs)

  • Part 2 — TCP connect timeout bifurcation: 1s for GW V2 data plane (port 10250), 45s for metadata (port 443). Branch: AzCosmos_H2ConnectAcquireTimeout.
  • Part 3 — H2 PING health checker: detects dead connections via PING timeout, GOAWAY handling, CPU-aware eviction. Branch: AzCosmos_H2ChannelHealthChecker.
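
As a rough illustration of what the Part 2 bifurcation could look like, the sketch below picks a connect timeout from the endpoint port. It is only an assumption about the shape of that follow-up, not code from the branch; `ConnectTimeoutByEndpoint`, `connectTimeoutFor`, and the example host are hypothetical.

```java
import java.net.URI;
import java.time.Duration;

final class ConnectTimeoutByEndpoint {
    static final int GATEWAY_V2_DATA_PLANE_PORT = 10250;
    static final Duration GATEWAY_V2_CONNECT_TIMEOUT = Duration.ofSeconds(1);
    static final Duration METADATA_CONNECT_TIMEOUT = Duration.ofSeconds(45);

    /** 1s for the Gateway V2 data plane port, 45s for everything else (for example port 443). */
    static Duration connectTimeoutFor(URI endpoint) {
        // URI.getPort() returns -1 when no port is given; that case falls through to 45s.
        return endpoint.getPort() == GATEWAY_V2_DATA_PLANE_PORT
            ? GATEWAY_V2_CONNECT_TIMEOUT
            : METADATA_CONNECT_TIMEOUT;
    }

    public static void main(String[] args) {
        System.out.println(connectTimeoutFor(URI.create("https://account.example.com:10250"))); // PT1S
        System.out.println(connectTimeoutFor(URI.create("https://account.example.com:443")));   // PT45S
    }
}
```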

Testing

  • Unit tests in WebExceptionRetryPolicyTest: verify timeout progression (6s/6s/10s) and zero-backoff for both point read and query in Thin Client mode; confirm writes are not retried
  • Fault injection E2E test in FaultInjectionServerErrorRuleOnGatewayV2Tests: injects a single 61s server response delay — the first attempt times out at 6s, the retry hits the server after the injected delay has cleared and succeeds, asserting end-to-end latency stays under 8 seconds
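
The expected timing of the fault injection test can be sanity-checked with a small simulation of the 6s/6s/10s progression. This is a sketch under stated assumptions: each attempt either finishes within its timeout or costs the full timeout, backoff is zero, and the 50 ms latency of the healthy retry is an invented value. `RetrySimulation` and `elapsedMillis` are illustrative names.

```java
final class RetrySimulation {
    /**
     * Returns the elapsed time in milliseconds until the first attempt that finishes
     * within its timeout, or -1 if every attempt times out.
     */
    static long elapsedMillis(long[] timeoutsMs, long[] backoffsMs, long[] serverLatencyMs) {
        long elapsed = 0;
        for (int attempt = 0; attempt < timeoutsMs.length; attempt++) {
            if (serverLatencyMs[attempt] < timeoutsMs[attempt]) {
                return elapsed + serverLatencyMs[attempt];
            }
            // The attempt timed out: pay the full timeout, then the backoff (if any).
            elapsed += timeoutsMs[attempt];
            if (attempt < backoffsMs.length) {
                elapsed += backoffsMs[attempt];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] timeouts = {6_000, 6_000, 10_000};
        long[] backoffs = {0, 0};

        // Single injected 61s delay on the first attempt; retry is healthy (50 ms assumed).
        System.out.println(elapsedMillis(timeouts, backoffs, new long[] {61_000, 50, 50})); // 6050

        // Constant 8s network delay: two 6s timeouts, then the 10s attempt absorbs the delay.
        System.out.println(elapsedMillis(timeouts, backoffs, new long[] {8_000, 8_000, 8_000})); // 20000
    }
}
```

The first case lands at roughly 6 seconds, which is consistent with the under-8-second assertion; the second lands at 20 seconds, inside the 25s end-to-end budget used by retryUsesConsistentParentChannelId.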

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions bot added the Cosmos label Feb 3, 2026
@jeet1995
Member Author

jeet1995 commented Feb 3, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

jeet1995 commented Feb 5, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

jeet1995 commented Feb 9, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 changed the title from "Az cosmos http timeout policy changes gateway v2" to "[GATEWAY V2]: Add aggressive HTTP timeout policies." on Feb 9, 2026
jeet1995 and others added 16 commits February 9, 2026 10:44
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

jeet1995 commented Feb 26, 2026

@FabianMeiswinkel and @xinlian12 - as discussed offline, an RST_STREAM does not close the parent channel, so we need an explicit mechanism to close channels. Also, connect timeouts and connection acquire timeouts should be bifurcated between the Gateway V1 and Gateway V2 endpoints to prevent regressions, as there is no latency SLA for metadata requests.

Two tracking issues:

  1. [FEATURE REQ] [Thin Client][Gateway V2] Connect timeout bifurcation — 1s for thin client proxy (port 10250) vs 45s for gateway (port 443) #48092
  2. [FEATURE REQ] [Thin Client][Gateway V2] HTTP/2 channel health checker — PING-based liveness and broken connection eviction #48093

