[SPARK-XXXXX][CORE] Add IO_URING transport mode #55880

Draft · LuciferYang wants to merge 8 commits into apache:master from LuciferYang:iouring-transport

Conversation

@LuciferYang
Contributor


### What changes were proposed in this pull request?

Enable Netty's io_uring native transport in Spark by:

1. Removing the `netty-transport-classes-io_uring` and
   `netty-transport-native-io_uring` exclusions from `netty-all` in the
   root `pom.xml`, and adding explicit `linux-x86_64` / `linux-aarch_64`
   classifier dependencies under `dependencyManagement`.
2. Mirroring the same native classifier dependencies in
   `common/network-common/pom.xml` and `core/pom.xml`.
3. Adding `IO_URING` to `IOMode` and wiring it into
   `NettyUtils.createEventLoop` / `getClientChannelClass` /
   `getServerChannelClass`. In `AUTO` mode, io_uring is preferred on
   Linux when `IoUring.isAvailable()` reports kernel support, falling
   back to EPOLL on Linux, KQUEUE on macOS, and NIO elsewhere.
4. Adding `ShuffleNettyIoUringSuite` (gated on
   `Utils.isLinux && IoUring.isAvailable`) so the existing shuffle
   coverage exercises the new mode where the platform supports it.
5. Refreshing `NettyTransportBenchmark` comments so the AUTO behavior
   change is visible at the call sites; the existing `NIO vs AUTO`
   suites automatically exercise io_uring on Linux 5.10+.
6. Regenerating `dev/deps/spark-deps-hadoop-3-hive-2.3` to include the
   new `netty-transport-classes-io_uring` and
   `netty-transport-native-io_uring` (linux-x86_64 / linux-aarch_64 /
   linux-riscv64) entries.
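
For reference, the `dependencyManagement` additions in items 1 and 2 might look roughly like the fragment below. This is an illustrative sketch; the `${netty.version}` property name and exact coordinates should be checked against the actual `pom.xml`:

```xml
<!-- Sketch only: classifier dependencies for the io_uring native libs. -->
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-transport-native-io_uring</artifactId>
  <version>${netty.version}</version>
  <classifier>linux-x86_64</classifier>
</dependency>
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-transport-native-io_uring</artifactId>
  <version>${netty.version}</version>
  <classifier>linux-aarch_64</classifier>
</dependency>
```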

### Why are the changes needed?

io_uring graduated from incubator to a first-class transport in Netty
4.2 (`io.netty.channel.uring`). Compared to EPOLL it batches I/O
operations through submission/completion queues, reducing per-op syscall
overhead on busy executors, and uses `IORING_OP_SPLICE` for `FileRegion`
writes -- functionally equivalent to `sendfile()` but fully
asynchronous. SPARK-56279 already updated `MessageEncoder` to emit the
header `ByteBuf` and the bare `DefaultFileRegion` separately when the
body is a `FileSegmentManagedBuffer`, which means the io_uring write
path can recognize the `DefaultFileRegion` and apply splice without any
additional Spark-side change.

### Does this PR introduce _any_ user-facing change?

Yes. On Linux kernels 5.10+, `spark.shuffle.io.mode=AUTO` (the default)
now selects io_uring instead of EPOLL when io_uring is available.
Operators who want the previous behavior can set
`spark.shuffle.io.mode=EPOLL` explicitly. A new explicit
`IO_URING` mode is also available.
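
The AUTO preference order can be modeled as a small decision function. This is a simplified sketch, not Spark's actual code: the class name `TransportSelection` and the boolean inputs are illustrative stand-ins for the platform checks and `IoUring.isAvailable()` probe described above.

```java
// Simplified model of the AUTO transport preference described above.
// The availability flags are passed in here; in Spark they would come
// from platform detection and Netty's IoUring.isAvailable().
enum IOMode { NIO, EPOLL, KQUEUE, IO_URING, AUTO }

final class TransportSelection {
    static IOMode resolve(IOMode requested, boolean isLinux,
                          boolean isMac, boolean ioUringUsable) {
        if (requested != IOMode.AUTO) {
            return requested; // explicit modes are honored as-is
        }
        if (isLinux && ioUringUsable) return IOMode.IO_URING;
        if (isLinux) return IOMode.EPOLL;
        if (isMac) return IOMode.KQUEUE;
        return IOMode.NIO;
    }
}
```

Note that an explicit mode bypasses the availability checks entirely, which matches the opt-out story: `spark.shuffle.io.mode=EPOLL` always yields EPOLL.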

### How was this patch tested?

- Manual SBT compile of `network-common`, `core`, `core/Test`,
  `network-shuffle`, and (with `-Pyarn`) `network-yarn` on macOS.
- `ShuffleNettyIoUringSuite` mirrors `ShuffleNettyEpollSuite` and runs
  the existing `ShuffleSuite` cases under `IO_URING` on Linux 5.10+ via
  GitHub Actions.
- macOS runs continue to take the KQUEUE path; Linux runs without
  io_uring kernel support fall back to EPOLL.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
… NettyUtils helper

Replace the `_root_.io.netty.channel.uring.IoUring.isAvailable` reference
in `ShuffleNettyIoUringSuite` with a small `NettyUtils.isIoUringAvailable()`
helper. The `_root_.` prefix was needed because `org.apache.spark.io`
shadows `io.netty.*` in this file's package, but it reads awkwardly.
The helper also keeps the Netty-specific class out of the test scope.
…fle service

Add antrun `move` rules so the YARN external shuffle service relocates
the io_uring native libraries alongside the existing epoll/kqueue/tcnative
ones. Without this, the shaded `org.sparkproject.io.netty` classes look
for `liborg_sparkproject_netty_transport_native_io_uring42_*.so`, but
the unshaded files are named `libnetty_transport_native_io_uring42_*.so`,
so `IoUring.isAvailable()` returns `false` inside the YARN shuffle
service JVM and io_uring is silently unused.

Note: Netty 4.2 names the io_uring native lib `io_uring42_<arch>` (with
the major+minor version suffix to allow multiple Netty versions to
coexist), unlike epoll which uses the unsuffixed `epoll_<arch>`.

Verified with `build/mvn -pl common/network-yarn -am -Pyarn -DskipTests
package`: the resulting `spark-*-yarn-shuffle.jar` contains
`META-INF/native/liborg_sparkproject_netty_transport_native_io_uring42_{x86_64,aarch_64,riscv64}.so`.
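
The relocation rule might be sketched as below. This is illustrative only; the real antrun configuration in `common/network-yarn/pom.xml` should mirror the shape of the existing epoll/kqueue/tcnative rules, and the property names here are placeholders:

```xml
<!-- Sketch only: rename the unshaded native lib to the shaded prefix,
     following the pattern of the existing epoll/kqueue move rules. -->
<move file="${unpack.dir}/META-INF/native/libnetty_transport_native_io_uring42_x86_64.so"
      tofile="${unpack.dir}/META-INF/native/liborg_sparkproject_netty_transport_native_io_uring42_x86_64.so"/>
```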
LuciferYang marked this pull request as draft on May 14, 2026 at 12:19.
… fallback

`IoUring.isAvailable()` only verifies that the JNI library loaded and
the basic syscalls work; it does not detect environments where the
kernel supports io_uring but `RLIMIT_MEMLOCK` is too low to actually
allocate the submission/completion queue rings. This is common in
containers, GitHub Actions runners, and other restricted environments,
and surfaces as:

    java.lang.IllegalStateException: failed to create a child event loop
    Caused by: java.lang.RuntimeException: failed to allocate memory for
      io_uring ring; try raising memlock limit (see getrlimit(RLIMIT_MEMLOCK,
      ...) or ulimit -l): Cannot allocate memory
        at io.netty.channel.uring.IoUringIoHandler.<init>(...)

After SPARK-XXXXX (the parent change) made AUTO prefer io_uring on
Linux, this caused unconditional failures in such environments rather
than graceful fallback to EPOLL.

This change adds a one-time JVM-wide probe in `NettyUtils` that creates
a single-thread `MultiThreadIoEventLoopGroup` with the io_uring handler
factory and shuts it down. If construction throws, the result is cached
as `false` and AUTO falls back to EPOLL. The probe is consulted by AUTO
mode and by `ShuffleNettyIoUringSuite.shouldRunTests`. An explicit
`IOMode.IO_URING` does not consult the probe and surfaces the
underlying error so users see what's wrong.

The previous `isIoUringAvailable()` helper (which just delegated to
`IoUring.isAvailable()`) is replaced by `isIoUringUsable()`, which
returns the probed result.
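
The one-time, JVM-wide caching described above follows a common pattern: run the probe once, cache the outcome, and treat any thrown error as "not usable". The sketch below isolates that pattern; the probe body is injected rather than building a real io_uring event loop group, so the class name `CachedProbe` and the supplier-based design are illustrative, not Spark's actual code.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// One-time, JVM-wide probe cache. In Spark the probe body would build a
// MultiThreadIoEventLoopGroup with the io_uring handler factory and shut
// it down; here the probe is injected so the pattern stands alone.
final class CachedProbe {
    private final AtomicReference<Boolean> result = new AtomicReference<>();
    private final BooleanSupplier probe;

    CachedProbe(BooleanSupplier probe) { this.probe = probe; }

    boolean isUsable() {
        Boolean cached = result.get();
        if (cached != null) return cached;
        boolean ok;
        try {
            ok = probe.getAsBoolean();
        } catch (Throwable t) {
            ok = false; // any construction failure means "not usable"
        }
        // First writer wins; concurrent callers all see one result.
        result.compareAndSet(null, ok);
        return result.get();
    }
}
```

An explicit `IOMode.IO_URING` would skip this cache and let the underlying error propagate, as the commit message notes.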
… is exercised

The probe-based fallback added by the previous follow-up makes Spark
gracefully degrade to EPOLL when AUTO cannot allocate an io_uring ring
(low `RLIMIT_MEMLOCK`, common in containers and GitHub Actions runners).
Without raising the limit in CI, the io_uring code path would be
silently skipped on every PR and never exercised before release.

Add `sudo prlimit --pid $$ --memlock=unlimited:unlimited` (Linux-only,
fail-soft via `2>/dev/null || true` so it's a no-op on macOS/Windows
runners) at the top of:

  - `.github/workflows/build_and_test.yml` "Run tests" step, so module
    builds (yarn, core, network-shuffle, mllib, etc.) that hit
    `IOMode.AUTO` actually use io_uring on Linux 5.10+ and
    `ShuffleNettyIoUringSuite` runs instead of skipping via
    `NettyUtils.isIoUringUsable`.
  - `.github/workflows/benchmark.yml` "Run benchmarks" step, so
    `NettyTransportBenchmark`'s NIO-vs-AUTO comparison and
    file-backed shuffle suite measure io_uring rather than EPOLL.

The fail-soft is important: stock GHA Linux runners support sudo
prlimit, but stripped-down environments (e.g., custom containers used
by some matrix jobs) might not, and we don't want the CI step to fail
just because memlock could not be raised. The probe in `NettyUtils`
will then degrade to EPOLL as designed.
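
In workflow terms, the fail-soft line might sit at the top of the step's `run` block along these lines (step name and the test-runner invocation are illustrative, not the exact workflow contents):

```yaml
- name: Run tests
  run: |
    # Linux-only; failures are swallowed so macOS/Windows runners and
    # stripped-down containers are unaffected.
    sudo prlimit --pid $$ --memlock=unlimited:unlimited 2>/dev/null || true
    ./dev/run-tests
```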
…nd_test.yml line 417

Inadvertently stripped by the previous CI commit's surrounding edit.
Pure whitespace; no behavioral change.
…ount, not 1

The previous probe created a one-thread `MultiThreadIoEventLoopGroup` to
verify io_uring ring allocation works. This is insufficient in
environments (e.g., GHA Docker container jobs for pyspark) where the
container's `RLIMIT_MEMLOCK` is just large enough for one io_uring ring
but not the eight rings Spark allocates by default per event loop
group. The probe would succeed, AUTO would pick io_uring, then
`TransportServer.init` -> `createEventLoop(numThreads=8)` would crash
with `failed to allocate memory for io_uring ring` and propagate the
exception out of `SparkContext` construction.

Probe with `MAX_DEFAULT_NETTY_THREADS` rings instead. This matches the
worst-case allocation size Spark uses by default for a single event
loop group, so any environment whose memlock can't support real Spark
usage now correctly falls back to EPOLL at probe time.

Users who explicitly raise `spark.shuffle.io.serverThreads` (or the
analogous client/chunk-fetch knobs) above `MAX_DEFAULT_NETTY_THREADS`
remain responsible for ensuring their environment can support the
larger ring count; otherwise they should set `spark.shuffle.io.mode=EPOLL`
explicitly.

Observed in the pyspark CI matrix where runs sit inside a Docker
container that does not honor `sudo prlimit --memlock=unlimited` from
the workflow shell, leaving the JVM with the container's default
memlock.