[improve][build] Upgrade Netty from 4.1.132.Final to 4.2.12.Final #25522
merlimat wants to merge 4 commits into apache:master
This upgrade brings Pulsar onto the Netty 4.2 line in preparation for the async-http-client 3.x upgrade (apache#25023), which transitively depends on Netty 4.2 and cannot be landed while Pulsar force-pins Netty 4.1. Netty 4.1 and 4.2 cannot co-exist on the classpath (same io.netty.* package namespace), so the upgrade has to be done in a single step.

The Netty team asserts source/binary forward compatibility from 4.1 to 4.2 for regular API users:
https://netty.io/news/2025/04/03/4-2-0.html
https://github.com/netty/netty/wiki/Netty-4.2-Migration-Guide

Changes in this PR:

* gradle/libs.versions.toml:
  - Bump netty from 4.1.132.Final to 4.2.12.Final.
  - Drop the separate netty-iouring version (0.0.26.Final). io_uring has graduated from incubator (io.netty.incubator:netty-incubator-transport-*-io_uring) to a first-class Netty artifact (io.netty:netty-transport-{classes,native}-io_uring), now pinned to the same Netty version.
* pulsar-common/build.gradle.kts: Point the io_uring consumer at the renamed aliases.
* pulsar-common/.../EventLoopUtil.java: Netty 4.2 removed the dedicated IOUringEventLoopGroup class. io_uring now uses the generic MultiThreadIoEventLoopGroup + IoUringIoHandler factory pattern, which makes io_uring groups indistinguishable from any other MultiThreadIoEventLoopGroup by type, breaking the existing instanceof-based channel class dispatch. Fix: introduce a private marker subclass IoUringMultiThreadIoEventLoopGroup used at construction. Also repoint the incubator imports (io.netty.incubator.channel.uring.*) to the core package (io.netty.channel.uring.*) and adjust class names (IOUring -> IoUring).
* build-logic/conventions/.../pulsar.java-conventions.gradle.kts: Exclude io.netty.incubator from all configurations. BookKeeper 4.17.3 (bookkeeper-common and stream-storage-java-client) still declares a transitive dependency on the 0.0.26.Final incubator io_uring jars, which are compiled against Netty 4.1 internals and are not safe to leave on the 4.2 classpath. Pulsar uses the core io_uring API via EventLoopUtil; BK stream-storage is an optional feature that Pulsar does not expose in its default surface.
* distribution/{server,shell}/src/assemble/LICENSE.bin.txt: Reflect the actual Netty jar set shipped after the upgrade:
  - Bump all 4.1.132.Final entries to 4.2.12.Final.
  - Replace the monolithic netty-codec-*.jar with its 4.2 split-out sub-modules netty-codec-base and netty-codec-compression (netty-codec is now an aggregator POM that ships no classes).
  - Rename the incubator io_uring entries (io.netty.incubator-netty-incubator-transport-*-io_uring-0.0.26.Final) to the core io_uring artifacts (io.netty-netty-transport-{classes,native}-io_uring-4.2.12.Final).
  The jar set was cross-checked against the output of :distribution:pulsar-server-distribution:serverDistTar and :distribution:pulsar-shell-distribution:shellDistTar.
* pulsar-common/.../BitSetRecyclableRecyclableTest and ConcurrentBitSetRecyclableTest: Relax the testRecycle assertion. Netty 4.2's io.netty.util.Recycler (which is itself deprecated in 4.2) no longer guarantees same-thread immediate reuse, so we only assert functional behavior: any recycled instance must come back cleared, and distinct create() calls must return distinct objects.

Verification:

* ./gradlew compileJava compileTestJava: clean across the entire project, only deprecation warnings (NioEventLoopGroup, EpollEventLoopGroup, DefaultEventLoopGroup, ChannelOption.RCVBUF_ALLOCATOR, EpollMode, Recycler, PlatformDependent.threadLocalRandom). These are compat shims that still function in 4.2; cleanup can follow in a separate PR.
* :pulsar-common:test: passes (678 tests).
* :pulsar-broker:test --tests BrokerServiceTest: passes (broker startup, producer/consumer flow, Netty transport end-to-end).
* :pulsar-proxy:test --tests ProxyServiceTlsStarterTest: passes (proxy, TLS handshake, tcnative-boringssl integration).
* :distribution:pulsar-server-distribution:serverDistTar and :distribution:pulsar-shell-distribution:shellDistTar both build, and the Netty jar set inside each tarball matches the LICENSE.bin.txt files exactly.

Known Netty 4.2 behavior changes that this PR does NOT address:

* The default SslContextBuilder.endpointIdentificationAlgorithm changed from null to HTTPS in 4.2. Pulsar's TLS client sites need to be audited and explicitly configured. This is intentionally out of scope here because the audit touches many modules (pulsar-client, pulsar-broker, pulsar-proxy, pulsar-broker-auth-oidc, admin) and should be its own PR.
* The default ByteBufAllocator changed from pooled to adaptive in 4.2. Pulsar is not setting io.netty.allocator.type=pooled in this PR; if CI soak tests show regressions, the pooled override can be added to the launch scripts as a follow-up.
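The marker-subclass fix described for EventLoopUtil above can be illustrated in isolation. The following is only a sketch under stated assumptions: every type in it (GenericIoEventLoopGroup, IoUringMarkerGroup, channelKindFor) is a hypothetical stand-in written so the example runs without Netty on the classpath, and it is not the code in this PR. It shows why a group built from a generic class plus a handler factory cannot be told apart by instanceof, and how an empty subclass restores the type-based dispatch.

```java
import java.util.function.Supplier;

/** Sketch only: all types here are hypothetical stand-ins, not Netty or Pulsar classes. */
final class MarkerSubclassSketch {

    /** Stand-in for a generic event loop group configured through a handler factory. */
    static class GenericIoEventLoopGroup {
        private final String handlerName;

        GenericIoEventLoopGroup(Supplier<String> handlerFactory) {
            this.handlerName = handlerFactory.get();
        }

        String handlerName() {
            return handlerName;
        }
    }

    /** Marker subclass: adds no behavior, only a type that instanceof can detect. */
    static final class IoUringMarkerGroup extends GenericIoEventLoopGroup {
        IoUringMarkerGroup() {
            super(() -> "io_uring");
        }
    }

    /** Type-based dispatch, in the spirit of selecting a channel class per group type. */
    static String channelKindFor(GenericIoEventLoopGroup group) {
        if (group instanceof IoUringMarkerGroup) {
            return "io_uring-channel";
        }
        return "nio-channel";
    }

    public static void main(String[] args) {
        // Built directly from the generic class: the io_uring factory is invisible to instanceof.
        GenericIoEventLoopGroup unmarked = new GenericIoEventLoopGroup(() -> "io_uring");
        // Built through the marker subclass: the dispatch can recognize it again.
        GenericIoEventLoopGroup marked = new IoUringMarkerGroup();
        System.out.println(channelKindFor(unmarked));
        System.out.println(channelKindFor(marked));
    }
}
```

The trade-off in this sketch is that the dispatch only works for groups created through the marker type; a group created elsewhere with the same factory would still fall through to the default branch.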
checkstyleMain flagged an UnusedImports violation on the EpollIoHandler import that was added speculatively in the Netty 4.2 bump. The io_uring path was refactored to use IoUringIoHandler + MultiThreadIoEventLoopGroup, but the epoll path still uses the 4.2 EpollEventLoopGroup compat shim directly, so EpollIoHandler is never referenced.
CI surfaced a NoClassDefFoundError during BookKeeper broker setup:
java.lang.NoClassDefFoundError: io/netty/incubator/channel/uring/IOUringEventLoopGroup
java.lang.ClassNotFoundException: io.netty.incubator.channel.uring.IOUringEventLoopGroup
at org.apache.pulsar.broker.LedgerLostAndSkipNonRecoverableTest.setupSharedCluster
BookKeeper 4.17.3 bytecode still directly references
io.netty.incubator.channel.uring.IOUringEventLoopGroup — its own
event-loop selection path loads the class eagerly. The previous commit
excluded the entire io.netty.incubator group from all configurations
to stop shipping the 4.1-era incubator jars alongside Netty 4.2, but
that also prevented BK from resolving the class it links against,
turning a dependency hygiene concern into a hard runtime failure.
Revert the exclusion. The incubator classes are in a separate package
(io.netty.incubator.channel.uring.*) from the new core io_uring API
(io.netty.channel.uring.*), so they can coexist on the classpath
without symbol conflicts. The incubator bytecode was compiled against
Netty 4.1.x, and Netty 4.2 asserts forward compatibility for 4.1
bytecode, so BK's lazy io_uring usage (only when io_uring is enabled)
should continue to work at runtime. Pulsar's own EventLoopUtil already
uses the core io_uring API introduced in 4.2, not the incubator one.
Add the 0.0.26.Final incubator io_uring jars back to
distribution/server LICENSE.bin.txt (the shell distribution does not
pull BK's stream-storage and does not receive them).
The proper long-term fix is a BookKeeper release that moves off the
incubator io_uring onto Netty 4.2's core io_uring module — that can
be picked up in a later Pulsar BK version bump.
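To check which io_uring classes are visible after a dependency change like the exclusion discussed above, a small classpath probe could be used. This is a sketch, not part of the PR: ClassPresenceProbe and isPresent are hypothetical names, and the two class names are the ones mentioned in this commit message. It relies only on the JDK's Class.forName and reports false when a class cannot be resolved.

```java
/** Sketch only: a hypothetical helper, not part of Pulsar or BookKeeper. */
final class ClassPresenceProbe {

    /** Returns true if the class resolves on this class loader, without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClassPresenceProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class or a class that fails to link: treat both as "not usable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the commit message above.
        String incubatorGroup = "io.netty.incubator.channel.uring.IOUringEventLoopGroup";
        String coreHandler = "io.netty.channel.uring.IoUringIoHandler";
        System.out.println(incubatorGroup + " present: " + isPresent(incubatorGroup));
        System.out.println(coreHandler + " present: " + isPresent(coreHandler));
    }
}
```

What the probe prints depends entirely on the classpath it runs with, so it is most useful when run against the resolved runtime classpath of the module under test.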
… tests

Netty 4.2 changed the default SslContextBuilder.endpointIdentificationAlgorithm from null to "HTTPS", which enables hostname verification on every TLS client connection that does not explicitly set the algorithm. Four Pulsar tests fail on the Netty 4.2 bump because they were written against the 4.1 default of no hostname verification:

* pulsar-broker: AuthenticationTlsHostnameVerificationTest.testTlsSyncProducerAndConsumerWithInvalidBrokerHost: exercises the "hostname verification disabled" path and expects a connection to an invalid broker host to succeed.
* pulsar-broker: TlsWithECCertificateFileTest.testConnectionSuccessWithCertificate: the EC test certs do not contain a SAN that matches the bind address, so broker-internal TLS now fails the hostname check and topic load times out.
* pulsar-proxy: ProxyMutualTlsTest.testProducerByAuthenticationTls: proxy test certs have the same SAN mismatch.
* integration: ClientTlsTest.testClient: integration test certs likewise.

Set -Dio.netty.handler.ssl.defaultEndpointVerificationAlgorithm=NONE for the test JVM only. This is the Netty-supplied migration override documented at https://github.com/netty/netty/wiki/Netty-4.2-Migration-Guide and it restores the 4.1 default for the test suite without touching production code. Production launch scripts are deliberately not changed.

The correct long-term fix is to audit Pulsar's SslContextBuilder call sites and explicitly set endpointIdentificationAlgorithm based on the user's hostnameVerification configuration, after which this test override can be removed. That audit should be its own PR, since it touches pulsar-client, pulsar-broker, pulsar-proxy, pulsar-broker-auth-oidc, and admin.

Note on production behavior: with this PR, Pulsar users who upgrade will get Netty 4.2's stricter default, so any TLS client that implicitly relied on the 4.1 "no hostname verification" default will start enforcing it. Users who want to keep the old behavior can set -Dio.netty.handler.ssl.defaultEndpointVerificationAlgorithm=NONE on the JVM until the per-call-site audit lands.
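The per-call-site audit mentioned above amounts to deriving the endpoint identification algorithm from configuration instead of relying on a library default. A minimal sketch of that idea follows. It assumes the JDK's javax.net.ssl.SSLParameters rather than Netty's SslContextBuilder so that it runs without Netty, and the hostnameVerificationEnabled flag and the class name are hypothetical, not Pulsar's actual configuration API.

```java
import javax.net.ssl.SSLParameters;

/** Sketch only: uses JDK JSSE types; the flag name is a hypothetical stand-in. */
final class EndpointIdentificationSketch {

    /** Maps an explicit hostname-verification flag to the JSSE endpoint identification algorithm. */
    static SSLParameters parametersFor(boolean hostnameVerificationEnabled) {
        SSLParameters params = new SSLParameters();
        // "HTTPS" enables hostname checks against the certificate; null leaves them disabled.
        params.setEndpointIdentificationAlgorithm(hostnameVerificationEnabled ? "HTTPS" : null);
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parametersFor(true).getEndpointIdentificationAlgorithm());
        System.out.println(parametersFor(false).getEndpointIdentificationAlgorithm());
    }
}
```

Setting the value explicitly in both branches is the point of the sketch: the resulting behavior no longer depends on which default the underlying library version ships.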
FYI, issue for the AsyncHttpClient 2.12.x backport of GHSA-cmxv-58fp-fm3g: AsyncHttpClient/async-http-client#2167
Based on the scenarios I’m currently using, when brokers and consumer networks are throttled, message dispatch can slow down, causing back pressure on the broker side and leading to high direct memory usage. If, at this time, the Pulsar ledger handler attempts to read entries from BK, it may encounter transient direct memory OOM due to direct memory exhaustion. However, in the current architecture, even with the I personally think the current exception handling logic is quite good, as it helps avoid OOM that could be triggered by back pressure when brokers and consumer networks are throttled. I’m not entirely sure whether Netty 4.2’s |
@geniusjoe The back pressure for BK reads is handled by configuring In tuning, it's necessary to set the limits low enough that memory limit doesn't ever reach the scenario where Netty direct memory would run out. That might be impossible to achieve in all cases. One of the reasons is what you touched upon in #25274. Memory allocations, either pooled or "direct" (handled by malloc/OS), will cause fragmentation over time and there will be less memory available in certain workloads.
As mentioned in my comment #25021 (comment), AutoMQ's blog post "Challenges of Custom Cache Implementation in Netty-Based Streaming Systems: Memory Fragmentation and OOM Issues" describes some of the problems with fragmentation in the Netty PooledByteBufAllocator. The AutoMQ article doesn't evaluate AdaptiveByteBufAllocator. I'd assume that it handles caching use cases better than PooledByteBufAllocator does. It would be great to get some feedback from real tests.
Lari, thank you for your response. I think the I also agree with your point. Ideally, we could compare the two memory allocation algorithms— |
Deferring this upgrade until BK 4.18, which will also be using Netty 4.2.x |
Motivation
This upgrade brings Pulsar onto the Netty 4.2 line. The direct motivation is to unblock the async-http-client 3.x upgrade (#25023), which transitively depends on Netty 4.2 and cannot be landed while Pulsar force-pins Netty 4.1: AHC 3.x bytecode is compiled against Netty 4.2 and would silently be downgraded to Pulsar's Netty 4.1 at resolution time, with NoSuchMethodError/NoClassDefFoundError surfacing in integration tests (which is what #25023's CI is hitting).

Netty 4.1 and 4.2 cannot co-exist on the classpath (same io.netty.* package namespace), so the upgrade has to be done in a single step. The Netty team asserts source/binary forward compatibility from 4.1 to 4.2 for regular API users:
* https://netty.io/news/2025/04/03/4-2-0.html
* https://github.com/netty/netty/wiki/Netty-4.2-Migration-Guide

Modifications

* gradle/libs.versions.toml
  - Bump netty from 4.1.132.Final to 4.2.12.Final.
  - Drop the separate netty-iouring = "0.0.26.Final" version. io_uring has graduated from incubator (io.netty.incubator:netty-incubator-transport-*-io_uring) to a first-class Netty artifact (io.netty:netty-transport-{classes,native}-io_uring) and is now pinned to the same Netty version.
* pulsar-common/build.gradle.kts
  - Point the io_uring consumer at the renamed aliases.
* pulsar-common/.../EventLoopUtil.java
  - Netty 4.2 removed the dedicated IOUringEventLoopGroup class. io_uring now uses the generic MultiThreadIoEventLoopGroup + IoUringIoHandler.newFactory() pattern, which makes io_uring groups indistinguishable from any other MultiThreadIoEventLoopGroup by type, breaking the existing instanceof-based channel class dispatch in getClientSocketChannelClass, getServerSocketChannelClass, getDatagramChannelClass. Fix: introduce a private marker subclass IoUringMultiThreadIoEventLoopGroup used at construction so instanceof keeps discriminating.
  - Repoint the incubator imports from io.netty.incubator.channel.uring.* to io.netty.channel.uring.* and adjust class names (IOUring → IoUring).
* build-logic/conventions/.../pulsar.java-conventions.gradle.kts
  - Exclude io.netty.incubator from all configurations. BookKeeper 4.17.3 (bookkeeper-common and stream-storage-java-client) still declares a transitive dependency on the 0.0.26.Final incubator io_uring jars. Those jars are compiled against Netty 4.1 internals and are not safe to leave on the 4.2 classpath. Pulsar uses the core io_uring API via EventLoopUtil; BK stream-storage is an optional feature that Pulsar does not expose in its default surface.
* distribution/{server,shell}/src/assemble/LICENSE.bin.txt
  - Bump all 4.1.132.Final entries to 4.2.12.Final.
  - Replace the monolithic netty-codec-*.jar with its 4.2 split-out sub-modules netty-codec-base and netty-codec-compression (netty-codec is now an aggregator POM that ships no classes).
  - The jar set was cross-checked against the output of :distribution:pulsar-server-distribution:serverDistTar and :distribution:pulsar-shell-distribution:shellDistTar.
* pulsar-common/.../BitSetRecyclableRecyclableTest and ConcurrentBitSetRecyclableTest
  - Relax the testRecycle assertion. Netty 4.2's io.netty.util.Recycler (itself deprecated in 4.2) no longer guarantees same-thread immediate reuse, so we only assert functional behavior: any recycled instance must come back cleared, and distinct create() calls must return distinct objects.

Verification

* ./gradlew compileJava compileTestJava: clean across the entire project. Only deprecation warnings for the 4.2 compat shims (NioEventLoopGroup, EpollEventLoopGroup, DefaultEventLoopGroup, ChannelOption.RCVBUF_ALLOCATOR, EpollMode, Recycler, PlatformDependent.threadLocalRandom). These are still functional in 4.2; cleanup can follow in a separate PR.
* :pulsar-common:test: passes.
* :pulsar-broker:test --tests BrokerServiceTest: passes (broker startup, producer/consumer flow, Netty transport end-to-end).
* :pulsar-proxy:test --tests ProxyServiceTlsStarterTest: passes (proxy, TLS handshake, tcnative-boringssl integration).
* :distribution:pulsar-server-distribution:serverDistTar and :distribution:pulsar-shell-distribution:shellDistTar: both build, and the Netty jar set inside each tarball matches LICENSE.bin.txt exactly.

Out of scope (follow-ups)

* Netty 4.2 changed the SslContextBuilder.endpointIdentificationAlgorithm default from null to "HTTPS". Pulsar's TLS client sites need to be audited and explicitly configured. This is deliberately not in this PR because the audit touches many modules (pulsar-client, pulsar-broker, pulsar-proxy, pulsar-broker-auth-oidc, admin) and should land as its own reviewable change.
* Netty 4.2 changed the default ByteBufAllocator from pooled to adaptive. Pulsar is insulated for its main allocator path because PulsarByteBufAllocator explicitly constructs a PooledByteBufAllocator.DEFAULT via BK's ByteBufAllocatorBuilder, so Pulsar's own allocation does not go through Netty's ByteBufAllocator.DEFAULT. If CI soak tests surface regressions in Netty-internal allocation paths, -Dio.netty.allocator.type=pooled can be added to the launch scripts as a follow-up.
* The deprecated compat shims (NioEventLoopGroup, EpollEventLoopGroup, RCVBUF_ALLOCATOR, etc.) can be migrated to the new APIs in a follow-up. They all still work in 4.2.
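The relaxed testRecycle assertion described under Modifications can be sketched without Netty. The example below is an assumption-laden illustration, not the actual test: PooledBits is a hypothetical, single-threaded stand-in for a recyclable bit set backed by a simple queue, and it happens to reuse instances, but the check deliberately does not depend on that. It asserts only the two functional properties named above: a recycled instance comes back cleared, and distinct create() calls return distinct objects.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

/** Sketch only: PooledBits is a hypothetical stand-in, not the Pulsar or Netty class. */
final class RecycleAssertionSketch {

    /** Minimal, non-thread-safe recyclable object backed by a simple pool. */
    static final class PooledBits {
        private static final ArrayDeque<PooledBits> POOL = new ArrayDeque<>();
        final BitSet bits = new BitSet();

        static PooledBits create() {
            PooledBits reused = POOL.poll();
            return reused != null ? reused : new PooledBits();
        }

        void recycle() {
            bits.clear();      // state must be reset before the instance can be handed out again
            POOL.offer(this);
        }
    }

    /** Checks the functional guarantees only; it never asserts that an instance was reused. */
    static boolean functionalGuaranteesHold() {
        PooledBits first = PooledBits.create();
        first.bits.set(3);
        first.recycle();

        PooledBits second = PooledBits.create();
        PooledBits third = PooledBits.create();
        // Reuse of 'first' is allowed but not required, so there is no identity check against it.
        return second.bits.isEmpty() && third.bits.isEmpty() && second != third;
    }

    public static void main(String[] args) {
        System.out.println(functionalGuaranteesHold());
    }
}
```

A check written this way stays valid whether the pool returns the recycled instance immediately, later, or never.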