
[improve][build] Upgrade Netty from 4.1.132.Final to 4.2.12.Final #25522

Draft
merlimat wants to merge 4 commits into apache:master from merlimat:mmerli/upgrade-netty-4.2

Conversation

@merlimat
Contributor

Motivation

This upgrade brings Pulsar onto the Netty 4.2 line. The direct motivation is to unblock the async-http-client 3.x upgrade (#25023), which transitively depends on Netty 4.2 and cannot land while Pulsar force-pins Netty 4.1: AHC 3.x bytecode is compiled against Netty 4.2 and would silently be downgraded to Pulsar's Netty 4.1 at resolution time, with NoSuchMethodError / NoClassDefFoundError surfacing in integration tests (which is what #25023's CI is hitting).

Netty 4.1 and 4.2 cannot co-exist on the classpath (same io.netty.* package namespace), so the upgrade has to be done in a single step. The Netty team asserts source/binary forward compatibility from 4.1 to 4.2 for regular API users:

  • https://netty.io/news/2025/04/03/4-2-0.html
  • https://github.com/netty/netty/wiki/Netty-4.2-Migration-Guide

Modifications

gradle/libs.versions.toml

  • Bump netty from 4.1.132.Final to 4.2.12.Final.
  • Drop the separate netty-iouring = "0.0.26.Final" version. io_uring has graduated from incubator (io.netty.incubator:netty-incubator-transport-*-io_uring) to a first-class Netty artifact (io.netty:netty-transport-{classes,native}-io_uring) and is now pinned to the same Netty version.

pulsar-common/build.gradle.kts

  • Point the io_uring consumer at the renamed aliases.

pulsar-common/.../EventLoopUtil.java

  • Netty 4.2 removed the dedicated IOUringEventLoopGroup class. io_uring now uses the generic MultiThreadIoEventLoopGroup + IoUringIoHandler.newFactory() pattern, which makes io_uring groups indistinguishable from any other MultiThreadIoEventLoopGroup by type — breaking the existing instanceof-based channel class dispatch in getClientSocketChannelClass, getServerSocketChannelClass, getDatagramChannelClass. Fix: introduce a private marker subclass IoUringMultiThreadIoEventLoopGroup used at construction so instanceof keeps discriminating.
  • Repoint imports from io.netty.incubator.channel.uring.* to io.netty.channel.uring.* and adjust class names (IOUring -> IoUring).
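The marker-subclass dispatch described above can be illustrated with a self-contained sketch. All type names below are minimal stand-ins for the Netty and Pulsar classes (the real change would pass IoUringIoHandler.newFactory() up to the actual MultiThreadIoEventLoopGroup constructor); only the pattern itself mirrors the EventLoopUtil fix.

```java
// Stand-in for Netty 4.2's generic MultiThreadIoEventLoopGroup.
class MultiThreadIoEventLoopGroup { }

class MarkerSubclassSketch {
    // Private marker subclass: behaviorally identical to its parent, but its
    // type records at construction time that the group was built for io_uring,
    // so instanceof can still discriminate.
    private static final class IoUringMultiThreadIoEventLoopGroup
            extends MultiThreadIoEventLoopGroup { }

    // Stand-ins for the channel classes the dispatch chooses between.
    static class NioSocketChannel { }
    static class IoUringSocketChannel { }

    // Construction is the only place the marker type appears.
    static MultiThreadIoEventLoopGroup newIoUringGroup() {
        return new IoUringMultiThreadIoEventLoopGroup();
    }

    // The instanceof-based dispatch keeps working in Netty 4.2.
    static Class<?> getClientSocketChannelClass(MultiThreadIoEventLoopGroup group) {
        return group instanceof IoUringMultiThreadIoEventLoopGroup
                ? IoUringSocketChannel.class
                : NioSocketChannel.class;
    }

    public static void main(String[] args) {
        System.out.println(getClientSocketChannelClass(newIoUringGroup()).getSimpleName());
        System.out.println(getClientSocketChannelClass(new MultiThreadIoEventLoopGroup()).getSimpleName());
    }
}
```

Callers elsewhere keep holding a plain EventLoopGroup reference; only the construction site and the dispatch methods need to know about the marker type.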

build-logic/conventions/.../pulsar.java-conventions.gradle.kts

  • Exclude io.netty.incubator from all configurations. BookKeeper 4.17.3 (bookkeeper-common and stream-storage-java-client) still declares a transitive dependency on the 0.0.26.Final incubator io_uring jars. Those jars are compiled against Netty 4.1 internals and are not safe to leave on the 4.2 classpath. Pulsar uses the core io_uring API via EventLoopUtil; BK stream-storage is an optional feature that Pulsar does not expose in its default surface.

distribution/{server,shell}/src/assemble/LICENSE.bin.txt

  • Bump all 4.1.132.Final entries to 4.2.12.Final.
  • Replace the monolithic netty-codec-*.jar with its 4.2 split-out sub-modules netty-codec-base and netty-codec-compression (netty-codec is now an aggregator POM that ships no classes).
  • Rename the incubator io_uring entries to the core io_uring artifacts.
  • The jar set was cross-checked against the output of :distribution:pulsar-server-distribution:serverDistTar and :distribution:pulsar-shell-distribution:shellDistTar.

pulsar-common/.../BitSetRecyclableRecyclableTest and ConcurrentBitSetRecyclableTest

  • Relax the testRecycle assertion. Netty 4.2's io.netty.util.Recycler (itself deprecated in 4.2) no longer guarantees same-thread immediate reuse, so we only assert functional behavior: any recycled instance must come back cleared, and distinct create() calls must return distinct objects.
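The relaxed contract can be shown with a stand-alone sketch that uses a trivial stand-in pool in place of Netty's Recycler (the names and pool are illustrative, not Pulsar's actual test code): the only safe assertions are that recycled instances come back cleared and that distinct create() calls return distinct objects, never same-thread immediate reuse.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

class RecyclerAssertionSketch {
    // Trivial stand-in pool; NOT Netty's Recycler. This one happens to reuse
    // immediately, but callers must not rely on that, which is exactly the
    // guarantee Netty 4.2 dropped.
    private static final ArrayDeque<BitSet> POOL = new ArrayDeque<>();

    static BitSet create() {
        BitSet pooled = POOL.poll();
        return pooled != null ? pooled : new BitSet();
    }

    static void recycle(BitSet b) {
        b.clear();          // recycled instances must come back cleared
        POOL.offer(b);
    }

    public static void main(String[] args) {
        BitSet a = create();
        BitSet b = create();
        if (a == b) {
            throw new AssertionError("distinct create() calls must return distinct objects");
        }
        a.set(42);
        recycle(a);
        // Safe assertion: whatever create() returns next is cleared.
        // Unsafe (removed) assertion: create() == a, i.e. immediate reuse.
        if (!create().isEmpty()) {
            throw new AssertionError("recycled instance not cleared");
        }
        System.out.println("relaxed recycler contract holds");
    }
}
```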

Verification

  • ./gradlew compileJava compileTestJava — clean across the entire project, with only deprecation warnings for the 4.2 compat shims (NioEventLoopGroup, EpollEventLoopGroup, DefaultEventLoopGroup, ChannelOption.RCVBUF_ALLOCATOR, EpollMode, Recycler, PlatformDependent.threadLocalRandom). These are still functional in 4.2; cleanup can follow in a separate PR.
  • :pulsar-common:test — passes (678 tests).
  • :pulsar-broker:test --tests BrokerServiceTest — passes (broker startup, producer/consumer flow, Netty transport end-to-end).
  • :pulsar-proxy:test --tests ProxyServiceTlsStarterTest — passes (proxy, TLS handshake, tcnative-boringssl integration).
  • :distribution:pulsar-server-distribution:serverDistTar and :distribution:pulsar-shell-distribution:shellDistTar — both build and the Netty jar set inside each tarball matches LICENSE.bin.txt exactly.

Out of scope (follow-ups)

  • TLS endpoint identification default change. Netty 4.2 changes SslContextBuilder.endpointIdentificationAlgorithm default from null to "HTTPS". Pulsar's TLS client sites need to be audited and explicitly configured. This is deliberately not in this PR because the audit touches many modules (pulsar-client, pulsar-broker, pulsar-proxy, pulsar-broker-auth-oidc, admin) and should land as its own reviewable change.
  • Default allocator change. Netty 4.2 changes the default ByteBufAllocator from pooled to adaptive. Pulsar is insulated for its main allocator path because PulsarByteBufAllocator explicitly constructs a PooledByteBufAllocator.DEFAULT via BK's ByteBufAllocatorBuilder — so Pulsar's own allocation does not go through Netty's ByteBufAllocator.DEFAULT. If CI soak tests surface regressions in Netty-internal allocation paths, -Dio.netty.allocator.type=pooled can be added to the launch scripts as a follow-up.
  • Deprecation cleanups. The 4.2 compat shims (NioEventLoopGroup, EpollEventLoopGroup, RCVBUF_ALLOCATOR, etc.) can be migrated to the new APIs in a follow-up. They all still work in 4.2.
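The pooled override mentioned above is an ordinary JVM system property; the property name io.netty.allocator.type=pooled is taken from the text, and this snippet only demonstrates the mechanism without touching Netty itself.

```java
class AllocatorOverrideSketch {
    public static void main(String[] args) {
        // Must be set before any Netty class initializes its default
        // allocator; in practice this would be passed on the command line
        // in the launch scripts as -Dio.netty.allocator.type=pooled.
        System.setProperty("io.netty.allocator.type", "pooled");
        System.out.println(System.getProperty("io.netty.allocator.type"));
    }
}
```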

checkstyleMain flagged an UnusedImports violation on the EpollIoHandler
import that was added speculatively in the Netty 4.2 bump. The io_uring
path was refactored to use IoUringIoHandler + MultiThreadIoEventLoopGroup,
but the epoll path still uses the 4.2 EpollEventLoopGroup compat shim
directly, so EpollIoHandler is never referenced.
@merlimat merlimat marked this pull request as draft April 14, 2026 19:33
CI surfaced a NoClassDefFoundError during BookKeeper broker setup:

    java.lang.NoClassDefFoundError: io/netty/incubator/channel/uring/IOUringEventLoopGroup
    java.lang.ClassNotFoundException: io.netty.incubator.channel.uring.IOUringEventLoopGroup

    at org.apache.pulsar.broker.LedgerLostAndSkipNonRecoverableTest.setupSharedCluster

BookKeeper 4.17.3 bytecode still directly references
io.netty.incubator.channel.uring.IOUringEventLoopGroup — its own
event-loop selection path loads the class eagerly. The previous commit
excluded the entire io.netty.incubator group from all configurations
to stop shipping the 4.1-era incubator jars alongside Netty 4.2, but
that also prevented BK from resolving the class it links against,
turning a dependency hygiene concern into a hard runtime failure.

Revert the exclusion. The incubator classes are in a separate package
(io.netty.incubator.channel.uring.*) from the new core io_uring API
(io.netty.channel.uring.*), so they can coexist on the classpath
without symbol conflicts. The incubator bytecode was compiled against
Netty 4.1.x, and Netty 4.2 asserts forward compatibility for 4.1
bytecode, so BK's lazy io_uring usage (only when io_uring is enabled)
should continue to work at runtime. Pulsar's own EventLoopUtil already
uses the core io_uring API introduced in 4.2, not the incubator one.

Add the 0.0.26.Final incubator io_uring jars back to
distribution/server LICENSE.bin.txt (the shell distribution does not
pull BK's stream-storage and does not receive them).

The proper long-term fix is a BookKeeper release that moves off the
incubator io_uring onto Netty 4.2's core io_uring module — that can
be picked up in a later Pulsar BK version bump.
… tests

Netty 4.2 changed the default SslContextBuilder.endpointIdentificationAlgorithm
from null to "HTTPS", which enables hostname verification on every TLS client
connection that does not explicitly set the algorithm. Four Pulsar tests fail
on the Netty 4.2 bump because they were written against the 4.1 default of no
hostname verification:

* pulsar-broker: AuthenticationTlsHostnameVerificationTest
  .testTlsSyncProducerAndConsumerWithInvalidBrokerHost — exercises the
  "hostname verification disabled" path, expects a connection to an invalid
  broker host to succeed.
* pulsar-broker: TlsWithECCertificateFileTest
  .testConnectionSuccessWithCertificate — the EC test certs do not contain a
  SAN that matches the bind address, so broker-internal TLS now fails the
  hostname check and topic load times out.
* pulsar-proxy: ProxyMutualTlsTest.testProducerByAuthenticationTls — proxy
  test certs have the same SAN mismatch.
* integration: ClientTlsTest.testClient — integration test certs likewise.

Set -Dio.netty.handler.ssl.defaultEndpointVerificationAlgorithm=NONE for the
test JVM only. This is the Netty-supplied migration override documented at
https://github.com/netty/netty/wiki/Netty-4.2-Migration-Guide and it restores
the 4.1 default for the test suite without touching production code.

Production launch scripts are deliberately not changed. The correct long-term
fix is to audit Pulsar's SslContextBuilder call sites and explicitly set
endpointIdentificationAlgorithm based on the user's hostnameVerification
configuration, after which this test override can be removed. That audit
should be its own PR — it touches pulsar-client, pulsar-broker, pulsar-proxy,
pulsar-broker-auth-oidc, and admin.

Note on production behavior: with this PR, Pulsar users who upgrade will get
Netty 4.2's stricter default, so any TLS client that implicitly relied on the
4.1 "no hostname verification" default will start enforcing it. Users who want
to keep the old behavior can set
-Dio.netty.handler.ssl.defaultEndpointVerificationAlgorithm=NONE on the JVM
until the per-call-site audit lands.
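At the JDK level, the behavior change amounts to the endpoint identification algorithm on SSLParameters going from unset to "HTTPS". A minimal JDK-only illustration (no Netty involved; the mapping from SslContextBuilder.endpointIdentificationAlgorithm down to the engine's SSLParameters is as described above):

```java
import javax.net.ssl.SSLParameters;

class EndpointIdentificationSketch {
    public static void main(String[] args) {
        SSLParameters params = new SSLParameters();
        // The JDK default mirrors Netty 4.1's old behavior: the algorithm is
        // unset, so no hostname verification is performed by the engine.
        System.out.println(params.getEndpointIdentificationAlgorithm());
        // Netty 4.2's new default is equivalent to configuring "HTTPS", which
        // turns on RFC 2818 hostname verification against the certificate.
        params.setEndpointIdentificationAlgorithm("HTTPS");
        System.out.println(params.getEndpointIdentificationAlgorithm());
    }
}
```

A client certificate whose SANs do not match the peer hostname passes the first configuration and fails the second, which is exactly why the four tests above broke.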
@lhotari
Member

lhotari commented Apr 15, 2026

FYI, Issue for AsyncHttpClient 2.12.x backport of GHSA-cmxv-58fp-fm3g: AsyncHttpClient/async-http-client#2167

@geniusjoe
Contributor

geniusjoe commented Apr 23, 2026

Based on the scenarios I’m currently using, when brokers and consumer networks are throttled, message dispatch can slow down, causing back pressure on the broker side and leading to high direct memory usage. If, at this time, the Pulsar ledger handler attempts to read entries from BK, it may encounter transient direct memory OOM due to direct memory exhaustion. However, in the current architecture, even with the ExitOnOutOfMemoryError JVM parameter configured, the system does not shut down. Instead, it backs off and retries the read, and the part of the entry bytebuf that was not successfully read is automatically reclaimed by Netty’s internal exception handling, preventing any memory leaks.

I personally think the current exception handling logic is quite good, as it helps avoid OOM that could be triggered by back pressure when brokers and consumer networks are throttled. I’m not entirely sure whether Netty 4.2’s adaptiveByteBuf is fully compatible with the logic of Netty 4.1’s pooledByteBuf in this part. When upgrading to a major version of Netty, if the new adaptiveByteBuf does not provide significant improvements in performance or other aspects, it may be worth evaluating whether to adopt the new adaptiveByteBuf or continue reusing the old pooledByteBuf mode.

@lhotari
Member

lhotari commented Apr 23, 2026

> Based on the scenarios I’m currently using, when brokers and consumer networks are throttled, message dispatch can slow down, causing back pressure on the broker side and leading to high direct memory usage. If, at this time, the Pulsar ledger handler attempts to read entries from BK, it may encounter transient direct memory OOM due to direct memory exhaustion.

@geniusjoe The back pressure for BK reads is handled by configuring managedLedgerMaxReadsInFlightSizeInMB. Please see https://github.com/apache/pulsar/blob/master/pip/pip-442.md for more details about broker memory management and related backpressure configuration. The memory limits aren't accurate since a Netty ByteBuf can be holding on to a larger underlying buffer where it was split off.

In tuning, it's necessary to set the limits low enough that Netty direct memory never runs out. That might be impossible to achieve in all cases. One of the reasons is what you touched upon in #25274. Memory allocations, either pooled or "direct" (handled by malloc/OS), will cause fragmentation over time and there will be less memory available in certain workloads.

> I personally think the current exception handling logic is quite good, as it helps avoid OOM that could be triggered by back pressure when brokers and consumer networks are throttled. I’m not entirely sure whether Netty 4.2’s adaptiveByteBuf is fully compatible with the logic of Netty 4.1’s pooledByteBuf in this part. When upgrading to a major version of Netty, if the new adaptiveByteBuf does not provide significant improvements in performance or other aspects, it may be worth evaluating whether to adopt the new adaptiveByteBuf or continue reusing the old pooledByteBuf mode.

As mentioned in my comment #25021 (comment), AutoMQ's blog post "Challenges of Custom Cache Implementation in Netty-Based Streaming Systems: Memory Fragmentation and OOM Issues" describes some of the problems with fragmentation in the Netty PooledByteBufAllocator. The AutoMQ article doesn't evaluate AdaptiveByteBufAllocator. I'd assume that it handles caching usecases in a better way than PooledByteBufAllocator. It would be great to get some feedback from real tests.

@geniusjoe
Contributor

> @geniusjoe The back pressure for BK reads is handled by configuring managedLedgerMaxReadsInFlightSizeInMB. Please see https://github.com/apache/pulsar/blob/master/pip/pip-442.md for more details about broker memory management and related backpressure configuration. The memory limits aren't accurate since a Netty ByteBuf can be holding on to a larger underlying buffer where it was split off.
>
> In tuning, it's necessary to set the limits low enough that Netty direct memory never runs out. That might be impossible to achieve in all cases. One of the reasons is what you touched upon in #25274. Memory allocations, either pooled or "direct" (handled by malloc/OS), will cause fragmentation over time and there will be less memory available in certain workloads.
>
> As mentioned in my comment #25021 (comment), AutoMQ's blog post "Challenges of Custom Cache Implementation in Netty-Based Streaming Systems: Memory Fragmentation and OOM Issues" describes some of the problems with fragmentation in the Netty PooledByteBufAllocator. The AutoMQ article doesn't evaluate AdaptiveByteBufAllocator. I'd assume that it handles caching usecases in a better way than PooledByteBufAllocator. It would be great to get some feedback from real tests.

Lari, thank you for your response. I think the managedLedgerMaxReadsInFlightSizeInMB configuration you mentioned is a good workaround to avoid back pressure in bandwidth‑limited scenarios.

I also agree with your point. Ideally, we could compare the two memory allocation algorithms, AdaptiveByteBufAllocator and PooledByteBufAllocator, under the same Pulsar conditions. By using the top command to observe the actual memory consumption of the Java process and leveraging the JVM’s Native Memory Tracking to obtain metrics on direct physical memory usage, we could likely reveal the performance differences between the two allocators.

@merlimat
Contributor Author

Deferring this upgrade until BK 4.18, which will also be using Netty 4.2.x
