HTTP/2 server communication triggers random timeouts #2160
Description
Since upgrading to 3.0.8, we started seeing roughly 0.5% of our relevant requests (the chart below is as filtered as I could get it) time out, exclusively when communicating with a service that uses HTTP/2. Conveniently, the same backend service also communicates with an identical API on a different server, and that server requires HTTP/1.1.
Once we eliminated other changes from being a factor, we downgraded from 3.0.8 to 3.0.7 and the timeouts went away while talking to the exact same services with no other code changes (literally just the downgrade).
In this chart, you can see the errors, and you can see that shortly after 18:00 we began rolling back the async-http-client version across our various deployments, at which point the timeouts started disappearing.
There are a few changes in 3.0.8, but I think the one that stands out given the noted behavioral difference is #2144 (HTTP/2 multiplexing).
Some background details that will hopefully help:
- We never found the internal AHC cause of the timeouts (our focus was on eliminating them).
- In our APM, when these timeouts happened, the core giveaway that `async-http-client` was the cause was that the other end, which we can also see, never even received the request (indicating it was never actually sent).
- The symptom we saw was timeouts while making API calls via `BoundRequestBuilder::execute` followed by a simple `future.get()`, with the timeouts configured on the client, forcing the call to become blocking. (Stack trace below.)
- Retries (from our own wrapper code, not AHC) would work and happen as expected.
- It was 0.5% of our relevant requests overall, but a larger percentage of the requests that used HTTP/2 (though definitely not the majority), because that service is used much less.
- The remote server says that it supports HTTP/2 (I can see its APM data, but not its internals). It could obviously be causing issues via bad responses, but it inherently worked most of the time.
Example stacktrace:

```
java.util.concurrent.TimeoutException: Request timeout to our-service/1.2.3.4:4567 after 50000 ms
	at org.asynchttpclient.netty.timeout.TimeoutTimerTask.expire(TimeoutTimerTask.java:45)
	at org.asynchttpclient.netty.timeout.RequestTimeoutTimerTask.run(RequestTimeoutTimerTask.java:54)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:717)
	at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:705)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:791)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:501)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1474)
```
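For anyone trying to reproduce the call shape described above without our wrapper code: this is a minimal sketch using the JDK's built-in `java.net.http` client rather than AHC itself (so it compiles standalone), showing the same async-execute-then-blocking-`get()` pattern and an explicitly pinned protocol version. The URL, timeout values, and class name are hypothetical, not taken from our codebase.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class BlockingCallSketch {
    public static void main(String[] args) {
        // Pin the protocol version on the client. An HTTP_1_1-pinned client is
        // the analogue of the sibling service in this report that never timed out.
        HttpClient h2Client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        // Hypothetical endpoint; the per-request timeout mirrors the
        // "after 50000 ms" value seen in the TimeoutException above.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://our-service.example/api"))
                .timeout(Duration.ofMillis(50_000))
                .GET()
                .build();

        // The actual call is commented out so the sketch runs without network
        // access; sendAsync(...) returns a CompletableFuture, and get() blocks
        // until the response arrives or the request timeout fires -- the same
        // shape as BoundRequestBuilder::execute followed by future.get() in AHC.
        // HttpResponse<String> response =
        //         h2Client.sendAsync(request, HttpResponse.BodyHandlers.ofString()).get();

        System.out.println(h2Client.version());
        System.out.println(request.timeout().orElseThrow().toMillis());
    }
}
```

With AHC itself, the equivalent knobs live on the client config (request timeout) and the request builder, which is why our retries had to come from wrapper code rather than the future.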