Skip to content

Az cosmos h2 connect acquire timeout#48174

Draft
jeet1995 wants to merge 104 commits intoAzure:mainfrom
jeet1995:AzCosmos_H2ConnectAcquireTimeout
Draft

Az cosmos h2 connect acquire timeout#48174
jeet1995 wants to merge 104 commits intoAzure:mainfrom
jeet1995:AzCosmos_H2ConnectAcquireTimeout

Conversation

@jeet1995
Copy link
Member

Description

Please add an informative description that covers that changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

jeet1995 and others added 30 commits February 2, 2026 17:17
* fix few tests part 2

---------

Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
…ning effort configuration (Azure#47772)

Co-authored-by: Xiting Zhang <xitzhang@microsoft.com>
* [VoiceLive]Release 1.0.0-beta.4

Updated release date for version 1.0.0-beta.4 and added feature details.

* Revise CHANGELOG for clarity and bug fixes

Updated changelog to remove breaking changes section and added details about bug fixes.
…Java-5433741 (Azure#46952)

* Configurations:  'specification/nginx/Nginx.Management/tspconfig.yaml', API Version: 2025-03-01-preview, SDK Release Type: beta, and CommitSHA: 'aae85aa3e7e4fda95ea2d3abac0ba1d8159db214' in SpecRepo: 'https://github.com/Azure/azure-rest-api-specs' Pipeline run: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=5433741 Refer to https://eng.ms/docs/products/azure-developer-experience/develop/sdk-release/sdk-release-prerequisites to prepare for SDK release.

* Configurations:  'specification/nginx/Nginx.Management/tspconfig.yaml', API Version: 2025-03-01-preview, SDK Release Type: beta, and CommitSHA: 'de8103ff8e94ea51c56bb22094ded5d2dfc45a6a' in SpecRepo: 'https://github.com/Azure/azure-rest-api-specs' Pipeline run: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=5857234 Refer to https://eng.ms/docs/products/azure-developer-experience/develop/sdk-release/sdk-release-prerequisites to prepare for SDK release.

---------

Co-authored-by: Weidong Xu <weidxu@microsoft.com>
false can't be assigned to int in java. Updating type to boolean
* Deprecating azure-resourcemanager-mixedreality

* Typos

* use 1.0.1 as version

* Update CHANGELOG.md

---------

Co-authored-by: Michael Zappe <michaelzappe@microsoft.com>
Co-authored-by: Weidong Xu <weidxu@microsoft.com>
* fix few tests part 3


---------

Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Initial regeneration using TypeSpec

* Working on migrating tests, adding back convenience APIs that are being kept

* Complete most of the migration

* Additional work

* Stable point before tests

* Newer TypeSpec SHA

* Add back SearchAudience support

* Last changes before testing

* Rerecord tests and misc fixes along the way

* Fix a few recordings and stress tests

* Fix a few recordings and linting

* Few more fixes

* Another round of recording

* Rerun TypeSpec codegen

* Remove errant import

* Cleanup APIs

* Regeneration

* Clean up linting
* escape non-ascii character for pkValue

---------

Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
…k connector 4.43.0 (Azure#47968)

* Release azure-cosmos 4.78.0, azure-cosmos-encryption 2.27.0, and Spark connector 4.43.0

---------

Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
ibrandes and others added 26 commits February 19, 2026 18:23
* add transactional bulk config in configuration reference

---------

Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…zure#48022)

* Add status code tracking to bulk operations with compressed consecutive
identical (statusCode, subStatusCode) pairs into single entries with count
and time range.

---------

Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>
Co-authored-by: Kushagra Thapar <kuthapar@microsoft.com>
Azure#48053)

* Fix Netty ByteBuf leak in RxGatewayStoreModel via doFinally safety net

Add AtomicReference<ByteBuf> tracking to the retained buffer lifecycle in
toDocumentServiceResponse(). The retained ByteBuf is tracked when retained,
cleared when consumed (StoreResponse creation) or discarded (doOnDiscard),
and released as a safety net in a doFinally handler. This handles edge cases
where cancellation racing with publishOn's async delivery prevents the
doOnDiscard handler from firing, causing the retained ByteBuf to leak.

Also fixes a minor typo in the logger format string ({] -> {}).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Improve ByteBuf leak test: use Sinks.One for timing control

Replace Mockito-based HttpResponse mock with concrete subclass to avoid
final method interception issues with withRequest(). Use Sinks.One to
control body emission timing and simulate ByteBufFlux.aggregate()'s
auto-release behavior. Run 20 iterations to reliably catch the race
condition between publishOn's async delivery and cancellation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add Azure Artifacts Feed Setup section to CONTRIBUTING.md

* Added detail Azure Artifacts feed setup instructions

Added detailed steps for setting up the Maven credential provider for Azure Artifacts feed authentication.

* Add new terms to cspell configuration

* Enhance CONTRIBUTING.md with Azure Artifacts setup

Added detailed setup instructions for external and internal contributors regarding Azure Artifacts feed authentication and troubleshooting 401 errors.

* Apply suggestion from @weshaggard

---------

Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>
* Updated parent pom file to use internal azure artifact feed

* Added auth task

* Added auth task to more pipeline files

* Removed network isolation policy for testing
* Added back pom changes for spring service

* Updated variables

* Fixed spring test template

* validate with test feed

* Updated env name

* updated dep version

* use feed name

* update test step

* removed repository settings

* added auth step in some missing places

* Reverted token env changes

* added mvn extensions to the gitignore file

* removed extensions.xml

* override repo settings

* removed settings file

* removed extra script

* Updated feed to azure-sdk-for-java

* Cleaned up test feed

* Adjusted the place to auth maven feed to avoid duplicated auth warnings
… for non-maven.org feeds (Azure#48048)

* Initial plan

* Fix PublishDevFeedPackage missing dependsOn VerifyReleaseVersion for non-maven.org feeds

Co-authored-by: raych1 <20296335+raych1@users.noreply.github.com>

* Use else condition for non-maven.org dependsOn in PublishDevFeedPackage

Co-authored-by: raych1 <20296335+raych1@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: raych1 <20296335+raych1@users.noreply.github.com>
…iteBase.java (Azure#47970)

* Initial plan

* Add AsyncDocumentClient.Builder support to rx/TestSuiteBase.java

Co-authored-by: xinlian12 <64233642+xinlian12@users.noreply.github.com>

* Update OfferReadReplaceTest to use rx.TestSuiteBase and fix code review comments

Co-authored-by: xinlian12 <64233642+xinlian12@users.noreply.github.com>

* Migrate OfferQueryTest, ReadFeedOffersTest, ResourceTokenTest, GatewayServiceConfigurationReaderTest to rx.TestSuiteBase

Co-authored-by: xinlian12 <64233642+xinlian12@users.noreply.github.com>

* Migrate ChangeFeedTest to rx.TestSuiteBase and add internal API methods

Co-authored-by: xinlian12 <64233642+xinlian12@users.noreply.github.com>

* Remove implementation/TestSuiteBase.java and tests using internal APIs

Per reviewer feedback, removed internal API support from rx/TestSuiteBase.java
and deleted tests that depended on implementation/TestSuiteBase.java or
internal APIs like AsyncDocumentClient.

- Deleted implementation/TestSuiteBase.java
- Deleted OfferQueryTest, ReadFeedOffersTest, ResourceTokenTest, ChangeFeedTest
- Deleted implementation tests: ConsistencyTests*, SessionTest, RetryTests, etc.
- Deleted directconnectivity tests: DCDocumentCrudTest, GatewayAddressCacheTest
- Rewrote OfferReadReplaceTest to use public throughput APIs
- Moved createSessionToken helper to CosmosItemTest.java

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Rewrite deleted tests using rx/TestSuiteBase.java with public APIs

Per reviewer feedback, rewrote tests to use rx/TestSuiteBase.java base class
and public APIs (CosmosAsyncClient, CosmosAsyncContainer) instead of internal
AsyncDocumentClient APIs.

Rewritten tests:
- ChangeFeedTest.java - uses CosmosAsyncContainer.queryChangeFeed()
- OfferQueryTest.java - uses container.readThroughput()
- ReadFeedOffersTest.java - uses container.readThroughput()
- ResourceTokenTest.java - uses CosmosAsyncUser and CosmosPermissionProperties

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Restore all deleted tests to maintain test coverage

Per reviewer feedback, restored all deleted tests that require internal APIs.
Both implementation/TestSuiteBase.java and rx/TestSuiteBase.java are kept.

Restored tests:
- implementation/ package: ConsistencyTests, SessionTest, RetryTests, etc. (17 files)
- directconnectivity/ package: DCDocumentCrudTest, GatewayAddressCacheTest, etc. (4 files)
- throughputControl/ package: ThroughputControlTests, etc. (2 files)
- rx/ package: ChangeFeedTest, OfferQueryTest, ResourceTokenTest, etc. (5 files)

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Add internal API support to rx/TestSuiteBase.java

Added AsyncDocumentClient.Builder constructor and internal API helper methods
to rx/TestSuiteBase.java to enable future migration of tests from
implementation/TestSuiteBase.java:

- AsyncDocumentClient.Builder constructor and clientBuilder() accessor
- Internal client factory methods: createGatewayHouseKeepingDocumentClient(),
  createGatewayRxDocumentClient(), createDirectRxDocumentClient()
- internalClientBuilders DataProvider for AsyncDocumentClient.Builder tests
- Helper methods: createCollection(), createDocument(), createUser(), etc.
- Cleanup methods: safeDeleteDatabase(), safeClose(), etc.
- Validation methods: validateSuccess(), validateResourceResponseFailure()
- Internal collection definitions: getInternalCollectionDefinition()

Full migration blocked by type incompatibilities between internal types
(Database, DocumentCollection) and public types (CosmosAsyncDatabase,
CosmosContainerProperties).

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Delete implementation/TestSuiteBase.java and migrate tests to rx/TestSuiteBase

- Deleted implementation/TestSuiteBase.java
- Updated 20 test files to import and use rx.TestSuiteBase
- Added internal API methods to rx/TestSuiteBase.java:
  - createDatabase(AsyncDocumentClient, ...) overloads
  - safeDeleteDatabase(AsyncDocumentClient, ...) overloads
  - deleteCollection(AsyncDocumentClient, collectionLink)
  - deleteDocument(AsyncDocumentClient, ...)
  - truncateCollection(DocumentCollection)
  - getInternalCollectionDefinition(String) overload
- Updated tests to use:
  - SHARED_DATABASE_INTERNAL instead of SHARED_DATABASE
  - SHARED_MULTI_PARTITION_COLLECTION_INTERNAL instead of SHARED_MULTI_PARTITION_COLLECTION
  - getInternalCollectionDefinition() instead of getCollectionDefinition()
  - createInternalGatewayRxDocumentClient() instead of createGatewayRxDocumentClient()
  - validateResourceResponseFailure() instead of validateFailure()
  - internalClientBuilders DataProvider instead of clientBuilders

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix code review issues: remove duplicate imports and refactor method

- Removed duplicate TestSuiteBase imports in directconnectivity tests
- Refactored getInternalCollectionDefinition() to call parameterized version

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix dataProvider type mismatch in internal API tests

Changed dataProvider from "clientBuilders" to "internalClientBuilders" in tests
using AsyncDocumentClient.Builder constructor:
- DocumentQuerySpyWireContentTest.java
- RequestHeadersSpyWireTest.java
- RetryCreateDocumentTest.java
- GatewayAddressCacheTest.java
- GatewayServiceConfigurationReaderTest.java

The "clientBuilders" returns CosmosClientBuilder (5 params) while
"internalClientBuilders" returns AsyncDocumentClient.Builder (4 params).

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix SessionTest data provider - use internalClientBuildersWithSessionConsistency

SessionTest constructor expects AsyncDocumentClient.Builder but was using
"clientBuildersWithDirectSession" which returns CosmosClientBuilder.
Changed to "internalClientBuildersWithSessionConsistency" which returns
AsyncDocumentClient.Builder.

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix StoreHeaderTests data provider mismatch

Changed @factory data provider from "clientBuildersWithDirect" (returns
CosmosClientBuilder) to "internalClientBuilders" (returns
AsyncDocumentClient.Builder) to match the constructor signature.

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix NetworkFailureTest data provider mismatch

Changed @factory data provider from "simpleClientBuildersWithDirect" (returns
CosmosClientBuilder) to "internalClientBuilders" (returns
AsyncDocumentClient.Builder) to match the constructor signature.

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix internal shared collections to set selfLink

The internal DocumentCollection and Database instances need selfLink set
for tests to work properly. Without selfLink, calls to getCollectionLink()
return null which causes "collectionLink" IllegalArgumentException errors.

Added setSelfLink() calls for:
- SHARED_DATABASE_INTERNAL: "dbs/{databaseId}"
- SHARED_MULTI_PARTITION_COLLECTION_INTERNAL: "dbs/{databaseId}/colls/{collId}"
- SHARED_SINGLE_PARTITION_COLLECTION_INTERNAL: "dbs/{databaseId}/colls/{collId}"
- SHARED_MULTI_PARTITION_COLLECTION_WITH_COMPOSITE_AND_SPATIAL_INDEXES_INTERNAL

Fixes: NetworkFailureTest.createCollectionWithUnreachableHost
Fixes: StoredProcedureRetryThrottleTest.storedProcedureRetryThrottle

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Implement real truncateCollection with bulk delete

Restore the original truncateCollection implementation from
implementation/TestSuiteBase.java that properly empties collections.

The implementation:
- Uses bulk delete operations to remove all documents
- Deletes all triggers, stored procedures, and UDFs
- Handles partition keys properly using PartitionKeyHelper
- Sets appropriate timeouts for bulk operations

This addresses review comment 2820262506 which flagged that the no-op
placeholder could cause tests to be non-deterministic.

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Use direct mode data provider for StoreHeaderTests

Changed from internalClientBuilders (gateway only) to
internalClientBuildersWithSessionConsistency which includes:
- Gateway mode
- Direct HTTPS mode
- Direct TCP mode

This restores the original direct connectivity coverage that was lost
when migrating from clientBuildersWithDirect.

Addresses review feedback comment 2820262539.

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

* Fix internal shared resources initialization with full properties

The internal shared resources (SHARED_DATABASE_INTERNAL, SHARED_*_COLLECTION_INTERNAL)
were missing required properties causing test failures:

1. SessionTest failures - SHARED_DATABASE_INTERNAL was missing resourceId, causing
   getDocumentLink() to build URLs with "dbs/null/colls/..." when using resourceId-based links.

2. truncateCollection failures - Internal collections were missing altLink, causing
   NullPointerException when truncateCollection() tried to split the altLink string.

Fixed by:
- Adding resourceId to SHARED_DATABASE_INTERNAL
- Creating getInternalDocumentCollection() helper that sets id, resourceId, selfLink,
  altLink, and partitionKey on DocumentCollection objects
- Using this helper for all internal shared collections

Addresses test failures:
- SessionTest.sessionConsistency_ReadYourWrites
- SessionTest.sessionTokenInDocumentRead
- SessionTest.sessionTokenNotRequired
- RequestHeadersSpyWireTest.before_DocumentQuerySpyWireContentTest
- DocumentQuerySpyWireContentTest.before_DocumentQuerySpyWireContentTest

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xinlian12 <64233642+xinlian12@users.noreply.github.com>
Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>
Co-authored-by: Kushagra Thapar <kuthapar@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- multiParentChannelConnectionReuse: forces >1 parent H2 channel via concurrent
  requests, verifies all parent channels survive ReadTimeoutException
- retryUsesConsistentParentChannelId: captures parentChannelId from ALL retry
  attempts (6s/6s/10s), verifies parent channels survive post-delay
- extractAllParentChannelIds helper for multi-entry gatewayStatisticsList parsing
- OpenSpec tasks updated with worktree layout
Run 1 (6/7): parentChannelId=87ad9b15 survived all 6 passing tests.
multiParentChannelConnectionReuse validated 2 parent channels (4039fb0b, 87ad9b15)
survived with survivalRate=2/2.

Run 2 (6/7): retryChannels=[b3da290a, 16223a82, 769bcc5b] != warmup=e66210f1.
This proves strictConnectionReuse=false allows new parent channels during retries.

Run 3 (5/7): Some connection closures between sequential tests due to tc netem
disrupting TCP layer (kernel RST on delayed packets). This is expected behavior
for real network disruption, not an SDK bug.

Key finding: Under tc netem delay, the pool may create new parent channels because
the kernel's TCP retransmission timeout closes connections that had queued/delayed
packets. The SDK correctly handles this — it acquires from the pool (new or existing).

Evidence MD: .dev-tracker/gateway-v2/evidence-part1-netem-run1.md
…rcation

Corrected per user requirement:
- Metadata → GW V1 (port 443) → 45s/60s timeout (unchanged)
- Data plane → GW V2 (port 10250) → 1s connect/acquire timeout
- Added GATEWAY_V2_DATA_PLANE_PORT constant reference
- Updated test names to dataPlaneRequest_GwV2 / metadataRequest_GwV1
Fixes:
- postTimeoutReadCompletesQuickly: relaxed parentChannelId equality to log-only,
  primary assertion remains recovery latency <10s (30ms actual)
- retryUsesConsistentParentChannelId: removed allKnownChannels.contains assertion,
  validates recovery succeeds + logs channel allocation for observability

Root cause: tc netem delay causes kernel TCP retransmission timeout to RST
connections. Post-delay reads may use new parent channels. This is expected
kernel behavior, not an SDK bug.

Docker run: 7/7 passed, 157s total. multiParentChannel: 7 parents survived (7/7).
recoveryLatency: 30ms. retryChannels: [e9318a31, 9d084ab3, c0bf38f3].
* Generating from latest spec

* CHANGELOG.md

* Added methods from builder exposing the OpenAIClient

* Subclient documentation

* Restored Agents samples

* Schedule parameter rename and latest spec

* Latest commit codegen

* Latest commit codegen

* README update

* Spec up to date

* fixed pom.xml

* Using current version

* Adjusted the name of the env vars

* Made stainless deps transitive

* Latest working codegen

* EvaluationClient naming feedback

* Method renames

* Token fix and test/sample renames applied

* Uri -> URL renames

* pom fix

* pom fix

* Adjusted other method calls

* Renames applied to tests and samples

* using actually released version of azure-ai-agents

* Making the CI happy

* Code gen latest

* string -> utcDateTime to get Offset in Java

* codegen: singularize Credentials -> Credential for connection credential models

- Rename BaseCredentials -> BaseCredential
- Rename ApiKeyCredentials -> ApiKeyCredential
- Rename EntraIdCredentials -> EntraIdCredential
- Rename SasCredentials -> SasCredential
- Rename NoAuthenticationCredentials -> NoAuthenticationCredential
- Rename AgenticIdentityPreviewCredentials -> AgenticIdentityPreviewCredential
- Update tsp-location.yaml commit to 2d01b1ba98da58699e4c080e45451574f375af86

* Regenerate from TypeSpec commit 1f9e30204b790f289ac387a3d8b3cf83b0b28202 - rename ConnectionType.APIKEY to API_KEY

* regenrating with latest upstream changes

* Using expandable enums

* Re-ran codegen

* CHANGELOG/README updates

* Updating readme version
…into AzCosmos_HttpTimeoutPolicyChangesGatewayV2
@jeet1995 jeet1995 force-pushed the AzCosmos_H2ConnectAcquireTimeout branch from 3b37f42 to e58f218 Compare February 28, 2026 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.