HTTP/2 server communication triggers random timeouts #2160
Description
Since upgrading to 3.0.8, we started seeing roughly 0.5% of our relevant requests (the chart below is as filtered as I could get it) time out, exclusively when communicating with a service that uses HTTP/2. Conveniently, the same backend service also communicates with an identical API on a different server, and that server requires HTTP/1.1.
Once we eliminated other changes from being a factor, we downgraded from 3.0.8 to 3.0.7 and the timeouts went away while talking to the exact same services with no other code changes (literally just the downgrade).
In this chart, you can see the errors, and you can see that shortly after 18:00 we began rolling back the async-http-client version across our various deployments, at which point the timeouts started disappearing.
There are a few changes in 3.0.8, but I think the one that stands out given the noted behavioral difference is #2144 (HTTP/2 multiplexing).
Some background details that will hopefully help:
- We never found the internal AHC cause of the timeouts (our focus was on eliminating them).
- In our APM, when these timeouts happened, the core giveaway that `async-http-client` was the cause was that the other end, which we can also see, never even received the request (indicating it was never actually sent).
- The symptom we saw was timeouts while making API calls via `BoundRequestBuilder::execute` followed by a simple `future.get()`, with the timeouts configured on the client, forcing the call to become blocking. (Stack trace below.)
- Retries (from our own wrapper code, not AHC) would work and happen as expected.
- It was 0.5% of our relevant requests overall, but a larger percentage of the requests that used HTTP/2 (though definitely not the majority), because that service is used much less.
- The remote server says that it supports HTTP/2 (I can see its APM data, but not its internals). It could obviously be causing issues via bad responses, but it inherently worked most of the time.
Example stacktrace:

```
java.util.concurrent.TimeoutException: Request timeout to our-service/1.2.3.4:4567 after 50000 ms
	at org.asynchttpclient.netty.timeout.TimeoutTimerTask.expire(TimeoutTimerTask.java:45)
	at org.asynchttpclient.netty.timeout.RequestTimeoutTimerTask.run(RequestTimeoutTimerTask.java:54)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:717)
	at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:705)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:791)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:501)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1474)
```
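For anyone trying to reproduce the call shape described above without our wrapper code: this is a minimal sketch using the JDK's built-in `java.net.http` client rather than AHC itself (so it compiles standalone), showing the same async-execute-then-blocking-`get()` pattern and an explicitly pinned protocol version. The URL, timeout values, and class name are hypothetical, not taken from our codebase.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class BlockingCallSketch {
    public static void main(String[] args) {
        // Pin the protocol version on the client. An HTTP_1_1-pinned client is
        // the analogue of the sibling service in this report that never timed out.
        HttpClient h2Client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        // Hypothetical endpoint; the per-request timeout mirrors the
        // "after 50000 ms" value seen in the TimeoutException above.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://our-service.example/api"))
                .timeout(Duration.ofMillis(50_000))
                .GET()
                .build();

        // The actual call is commented out so the sketch runs without network
        // access; sendAsync(...) returns a CompletableFuture, and get() blocks
        // until the response arrives or the request timeout fires -- the same
        // shape as BoundRequestBuilder::execute followed by future.get() in AHC.
        // HttpResponse<String> response =
        //         h2Client.sendAsync(request, HttpResponse.BodyHandlers.ofString()).get();

        System.out.println(h2Client.version());
        System.out.println(request.timeout().orElseThrow().toMillis());
    }
}
```

With AHC itself, the equivalent knobs live on the client config (request timeout) and the request builder, which is why our retries had to come from wrapper code rather than the future.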