[SPARK-XXXXX][CORE] Add IO_URING transport mode #55880
Draft
LuciferYang wants to merge 8 commits into
### What changes were proposed in this pull request?

Enable Netty's io_uring native transport in Spark by:

1. Removing the `netty-transport-classes-io_uring` and `netty-transport-native-io_uring` exclusions from `netty-all` in the root `pom.xml`, and adding explicit `linux-x86_64` / `linux-aarch_64` classifier dependencies under `dependencyManagement`.
2. Mirroring the same native classifier dependencies in `common/network-common/pom.xml` and `core/pom.xml`.
3. Adding `IO_URING` to `IOMode` and wiring it into `NettyUtils.createEventLoop` / `getClientChannelClass` / `getServerChannelClass`. In `AUTO`, io_uring is preferred on Linux when `IoUring.isAvailable()` reports that the running kernel supports it, then EPOLL, then KQUEUE on macOS, then NIO.
4. Adding `ShuffleNettyIoUringSuite` (gated on `Utils.isLinux && IoUring.isAvailable`) so the existing shuffle coverage exercises the new mode where the platform supports it.
5. Refreshing `NettyTransportBenchmark` comments so the AUTO behavior change is visible at the call sites; the existing `NIO vs AUTO` suites automatically exercise io_uring on Linux 5.10+.
6. Regenerating `dev/deps/spark-deps-hadoop-3-hive-2.3` to include the new `netty-transport-classes-io_uring` and `netty-transport-native-io_uring` (linux-x86_64 / linux-aarch_64 / linux-riscv64) entries.

### Why are the changes needed?

io_uring graduated from incubator to a first-class transport in Netty 4.2 (`io.netty.channel.uring`). Compared to EPOLL, it batches I/O operations through submission/completion queues, reducing per-op syscall overhead on busy executors, and uses `IORING_OP_SPLICE` for `FileRegion` writes, which is functionally equivalent to `sendfile()` but fully asynchronous.

SPARK-56279 already updated `MessageEncoder` to emit the header `ByteBuf` and the bare `DefaultFileRegion` separately when the body is a `FileSegmentManagedBuffer`, which means the io_uring write path can recognize the `DefaultFileRegion` and apply splice without any additional Spark-side change.

### Does this PR introduce _any_ user-facing change?

Yes. On Linux kernels 5.10+, `spark.shuffle.io.mode=AUTO` (the default) now selects io_uring instead of EPOLL when io_uring is available. Operators who want the previous behavior can set `spark.shuffle.io.mode=EPOLL` explicitly. A new explicit `IO_URING` mode is also available.

### How was this patch tested?

- Manual SBT compile of `network-common`, `core`, `core/Test`, `network-shuffle`, and (with `-Pyarn`) `network-yarn` on macOS.
- `ShuffleNettyIoUringSuite` mirrors `ShuffleNettyEpollSuite` and runs the existing `ShuffleSuite` cases under `IO_URING` on Linux 5.10+ via GitHub Actions.
- macOS runs continue to take the KQUEUE path; Linux runs without io_uring kernel support fall back to EPOLL.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
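The AUTO preference order described above can be sketched as follows. This is an illustrative stand-in, not Spark's actual `NettyUtils` code: the method name and boolean parameters are assumptions that stand in for the real runtime checks (`IoUring.isAvailable()`, `Epoll.isAvailable()`, `KQueue.isAvailable()`).

```java
// Hypothetical sketch of the AUTO transport selection order; names and
// signatures are illustrative, not Spark's actual NettyUtils API.
enum Transport { IO_URING, EPOLL, KQUEUE, NIO }

final class AutoModeSketch {
    // Each boolean stands in for the corresponding availability check.
    static Transport pickAuto(boolean isLinux, boolean isMacOs,
                              boolean ioUringOk, boolean epollOk, boolean kqueueOk) {
        if (isLinux && ioUringOk) return Transport.IO_URING; // preferred on Linux 5.10+
        if (isLinux && epollOk)   return Transport.EPOLL;
        if (isMacOs && kqueueOk)  return Transport.KQUEUE;
        return Transport.NIO;                                // portable fallback
    }
}
```

An explicit `spark.shuffle.io.mode=EPOLL` or `=IO_URING` bypasses this ordering entirely.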
… NettyUtils helper

Replace the `_root_.io.netty.channel.uring.IoUring.isAvailable` reference in `ShuffleNettyIoUringSuite` with a small `NettyUtils.isIoUringAvailable()` helper. The `_root_.` prefix was needed because `org.apache.spark.io` shadows `io.netty.*` in this file's package, but it reads as unusual; the helper also keeps the Netty-specific class out of the test scope.
…fle service
Add antrun `move` rules so the YARN external shuffle service relocates
the io_uring native libraries alongside the existing epoll/kqueue/tcnative
ones. Without this, the shaded `org.sparkproject.io.netty` classes look
for `liborg_sparkproject_netty_transport_native_io_uring42_*.so`, but
the unshaded files are named `libnetty_transport_native_io_uring42_*.so`,
so `IoUring.isAvailable()` returns `false` inside the YARN shuffle
service JVM and io_uring is silently unused.
Note: Netty 4.2 names the io_uring native lib `io_uring42_<arch>` (with
the major+minor version suffix to allow multiple Netty versions to
coexist), unlike epoll which uses the unsuffixed `epoll_<arch>`.
Verified with `build/mvn -pl common/network-yarn -am -Pyarn -DskipTests
package`: the resulting `spark-*-yarn-shuffle.jar` contains
`META-INF/native/liborg_sparkproject_netty_transport_native_io_uring42_{x86_64,aarch_64,riscv64}.so`.
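As a sketch of the relocation, the rule has roughly the following shape (element names follow the maven-antrun-plugin's Ant tasks; the `${unpack.dir}` property and the surrounding execution binding in `common/network-yarn/pom.xml` are assumptions, not the actual pom contents):

```xml
<!-- Illustrative antrun <move> rule; ${unpack.dir} and the execution
     binding are assumptions, not the actual pom.xml contents. -->
<move file="${unpack.dir}/META-INF/native/libnetty_transport_native_io_uring42_x86_64.so"
      tofile="${unpack.dir}/META-INF/native/liborg_sparkproject_netty_transport_native_io_uring42_x86_64.so" />
```

Analogous rules cover `aarch_64` and `riscv64`, mirroring the existing epoll/kqueue/tcnative relocations.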
… fallback
`IoUring.isAvailable()` only verifies that the JNI library loaded and
the basic syscalls work; it does not detect environments where the
kernel supports io_uring but `RLIMIT_MEMLOCK` is too low to actually
allocate the submission/completion queue rings. This is common in
containers, GitHub Actions runners, and other restricted environments,
and surfaces as:
    java.lang.IllegalStateException: failed to create a child event loop
    Caused by: java.lang.RuntimeException: failed to allocate memory for
      io_uring ring; try raising memlock limit (see getrlimit(RLIMIT_MEMLOCK,
      ...) or ulimit -l): Cannot allocate memory
        at io.netty.channel.uring.IoUringIoHandler.<init>(...)
After SPARK-XXXXX (the parent change) made AUTO prefer io_uring on
Linux, this caused unconditional failures in such environments rather
than graceful fallback to EPOLL.
This change adds a one-time JVM-wide probe in `NettyUtils` that creates
a single-thread `MultiThreadIoEventLoopGroup` with the io_uring handler
factory and shuts it down. If construction throws, the result is cached
as `false` and AUTO falls back to EPOLL. The probe is consulted by AUTO
mode and by `ShuffleNettyIoUringSuite.shouldRunTests`. An explicit
`IOMode.IO_URING` does not consult the probe and surfaces the
underlying error so users see what's wrong.
The previous `isIoUringAvailable()` helper (which just delegated to
`IoUring.isAvailable()`) is replaced by `isIoUringUsable()`, which
returns the probed result.
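The one-time, JVM-wide caching pattern described above can be sketched like this. This is a stand-in, not the actual `NettyUtils` code: the class name is hypothetical, and the caller-supplied `BooleanSupplier` stands in for constructing and shutting down a single-thread `MultiThreadIoEventLoopGroup` with the io_uring handler factory.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a one-time, JVM-wide capability probe with a cached
// result, mirroring the isIoUringUsable() approach described above.
final class IoUringProbe {
    private static volatile Boolean cached; // null = not yet probed

    static boolean isUsable(BooleanSupplier probe) {
        Boolean result = cached;
        if (result == null) {
            synchronized (IoUringProbe.class) {
                if (cached == null) {
                    boolean ok;
                    try {
                        // Real probe: build and shut down one event loop group.
                        ok = probe.getAsBoolean();
                    } catch (Throwable t) {
                        // Any failure (e.g. ENOMEM from ring allocation)
                        // caches false so AUTO falls back to EPOLL.
                        ok = false;
                    }
                    cached = ok;
                }
                result = cached;
            }
        }
        return result;
    }
}
```

Because the result is cached, a probe that fails once (for example under a low `RLIMIT_MEMLOCK`) keeps returning `false` for the life of the JVM.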
… is exercised
The probe-based fallback added by the previous follow-up makes Spark
gracefully degrade to EPOLL when AUTO cannot allocate an io_uring ring
(low `RLIMIT_MEMLOCK`, common in containers and GitHub Actions runners).
Without raising the limit in CI, the io_uring code path would be
silently skipped on every PR and never exercised before release.
Add `sudo prlimit --pid $$ --memlock=unlimited:unlimited` (Linux-only,
fail-soft via `2>/dev/null || true` so it's a no-op on macOS/Windows
runners) at the top of:
- `.github/workflows/build_and_test.yml` "Run tests" step, so module
builds (yarn, core, network-shuffle, mllib, etc.) that hit
`IOMode.AUTO` actually use io_uring on Linux 5.10+ and
`ShuffleNettyIoUringSuite` runs instead of skipping via
`NettyUtils.isIoUringUsable`.
- `.github/workflows/benchmark.yml` "Run benchmarks" step, so
`NettyTransportBenchmark`'s NIO-vs-AUTO comparison and
file-backed shuffle suite measure io_uring rather than EPOLL.
The fail-soft is important: stock GHA Linux runners support sudo
prlimit, but stripped-down environments (e.g., custom containers used
by some matrix jobs) might not, and we don't want the CI step to fail
just because memlock could not be raised. The probe in `NettyUtils`
will then degrade to EPOLL as designed.
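Under these assumptions, the change amounts to prepending a fail-soft prlimit call to the existing shell steps; the step name and surrounding structure below are illustrative, not the actual workflow contents:

```yaml
# Hypothetical excerpt; the real steps in build_and_test.yml /
# benchmark.yml have different names and bodies.
- name: Run tests
  run: |
    # Linux-only: raise memlock so the io_uring probe can allocate its rings.
    # Fail-soft: a no-op on macOS/Windows runners and restricted containers.
    sudo prlimit --pid $$ --memlock=unlimited:unlimited 2>/dev/null || true
    ./dev/run-tests
```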
…nd_test.yml line 417

Inadvertently stripped by the previous CI commit's surrounding edit. Pure whitespace; no behavioral change.
…ount, not 1

The previous probe created a one-thread `MultiThreadIoEventLoopGroup` to verify that io_uring ring allocation works. This is insufficient in environments (e.g., GHA Docker container jobs for pyspark) where the container's `RLIMIT_MEMLOCK` is just large enough for one io_uring ring but not the eight rings Spark allocates by default per event loop group. The probe would succeed, AUTO would pick io_uring, and then `TransportServer.init` -> `createEventLoop(numThreads=8)` would crash with `failed to allocate memory for io_uring ring` and propagate the exception out of `SparkContext` construction.

Probe with `MAX_DEFAULT_NETTY_THREADS` rings instead. This matches the worst-case allocation size Spark uses by default for a single event loop group, so any environment whose memlock cannot support real Spark usage now correctly falls back to EPOLL at probe time. Users who explicitly raise `spark.shuffle.io.serverThreads` (or the analogous client/chunk-fetch knobs) above `MAX_DEFAULT_NETTY_THREADS` remain responsible for ensuring their environment can support the larger ring count; otherwise they should set `spark.shuffle.io.mode` to EPOLL explicitly.

Observed in the pyspark CI matrix, where runs sit inside a Docker container that does not honor `sudo prlimit --memlock=unlimited` from the workflow shell, leaving the JVM with the container's default memlock.
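The failure mode described above reduces to a simple counting argument, sketched here with a stand-in: `allocRing` represents creating a single io_uring ring (one per event loop thread), while the real probe builds a `MultiThreadIoEventLoopGroup` with `MAX_DEFAULT_NETTY_THREADS` threads and shuts it down. The class and method names are hypothetical.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: probe with the worst-case ring count, not one ring.
final class RingCountProbe {
    static boolean canAllocate(int rings, IntPredicate allocRing) {
        for (int i = 0; i < rings; i++) {
            if (!allocRing.test(i)) {
                return false; // e.g. ENOMEM once past the memlock limit
            }
        }
        return true; // real code shuts down the group, releasing all rings
    }
}
```

In an environment whose memlock covers exactly one ring, a one-ring probe passes while an eight-ring probe correctly fails, which is precisely the gap this follow-up closes.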