feat: experimental two-phase (head-chunked) Ulysses all-to-all by csgoogle · Pull Request #428 · AI-Hypercomputer/maxdiffusion

csgoogle · 2026-06-24T13:56:30Z

Add an opt-in ULYSSES_ATTENTION_CHUNKS env var to split the Ulysses all-to-all into per-head-group passes, so XLA's async-collective scheduler can overlap one group's attention compute with the next group's all-to-all. Defaults to 1 (current single-shot path, no behavior change). Numerically identical to single-shot since heads are independent.

Notes:

- Requires async-collective LIBTPU flags to actually overlap.
- The gain is largest when the all-to-all is a meaningful fraction of attention time (high context parallelism, shorter sequences). At WAN 2.2 720p (seq ~75600) attention is compute-bound, so the win is small (~3% in a microbenchmark); at seqlen ~24k we observe ~10% gains.
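The "numerically identical since heads are independent" claim can be checked on a single device. The sketch below is not the PR's code: it uses NumPy instead of JAX, omits the all-to-all entirely, and the names `attention` and `chunked_attention` are hypothetical. It only shows that running attention one head group at a time and concatenating the results matches the single-shot result.

```python
import numpy as np


def attention(q, k, v):
    """Softmax attention over arrays shaped [heads, seq, head_dim]."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum("hqd,hkd->hqk", q, k) * scale
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,hkd->hqd", weights, v)


def chunked_attention(q, k, v, num_chunks):
    """Run attention one head group at a time, then stitch the groups together."""
    num_heads = q.shape[0]
    if num_heads % num_chunks != 0:
        raise ValueError(f"heads={num_heads} not divisible by chunks={num_chunks}")
    group = num_heads // num_chunks
    outputs = []
    for start in range(0, num_heads, group):
        sl = slice(start, start + group)
        # In the PR, each head group would get its own all-to-all before this
        # call, which is what gives the scheduler something to overlap.
        outputs.append(attention(q[sl], k[sl], v[sl]))
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16, 4)) for _ in range(3))

single_shot = attention(q, k, v)
two_phase = chunked_attention(q, k, v, num_chunks=2)
max_abs_diff = float(np.abs(single_shot - two_phase).max())
```

Because no head's output depends on any other head, `max_abs_diff` should be zero or at floating-point noise level regardless of `num_chunks`.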

github-actions · 2026-06-24T13:56:43Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

google-cla · 2026-06-24T13:56:46Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Add an opt-in ULYSSES_ATTENTION_CHUNKS env var to split the Ulysses all-to-all into per-head-group passes, so XLA's async-collective scheduler can overlap one group's attention compute with the next group's all-to-all. Defaults to 1 (current single-shot path, no behavior change). Numerically identical to single-shot since heads are independent.

Notes:

- Requires async-collective LIBTPU flags to actually overlap.
- Needs heads % (context_shards * chunks) == 0.
- Gain is largest when all-to-all is a meaningful fraction of attention time (high context-parallelism / shorter sequences); at WAN 2.2 720p (seq~75600) it is compute-bound so the win is small (~3% in microbench).
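The requirement `heads % (context_shards * chunks) == 0` from the notes above can be illustrated with a small helper that validates the chunk count and returns the head range of each chunk. This is a sketch under assumptions: `head_chunk_bounds` is a hypothetical name, and the head and shard counts in the example are made up for illustration rather than taken from the PR.

```python
def head_chunk_bounds(num_heads, context_shards, num_chunks):
    """Return [(start, stop), ...] head ranges, one per chunk.

    Raises ValueError when a chunk could not be split evenly across the
    context axis, i.e. when num_heads % (context_shards * num_chunks) != 0.
    """
    if num_chunks < 1 or context_shards < 1:
        raise ValueError("num_chunks and context_shards must be >= 1")
    if num_heads % (context_shards * num_chunks) != 0:
        raise ValueError(
            f"heads={num_heads} must be divisible by "
            f"context_shards * chunks = {context_shards * num_chunks}"
        )
    per_chunk = num_heads // num_chunks
    return [(i * per_chunk, (i + 1) * per_chunk) for i in range(num_chunks)]


# Hypothetical sizes: 40 heads, 4 context shards, 2 chunks.
bounds = head_chunk_bounds(num_heads=40, context_shards=4, num_chunks=2)
# bounds == [(0, 20), (20, 40)]; each 20-head chunk splits into 5 heads per shard.
```

With the same hypothetical sizes, `num_chunks=3` is rejected because 40 is not divisible by 12.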

Perseus14 · 2026-06-24T20:13:14Z

+  # math is identical to the single-shot path (heads are independent); requires
+  # async-collective LIBTPU flags to actually overlap, and the per-chunk head
+  # count must still be shardable across the context axis.
+  num_chunks = int(os.environ.get("ULYSSES_ATTENTION_CHUNKS", "1"))


Let's move this to the config file so it can be used by any Ulysses-type kernel.
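One way to read this suggestion is to carry the chunk count as a config field instead of reading the environment inside the kernel. The sketch below is only an illustration: `UlyssesConfig`, `load_ulysses_config`, and the key `ulysses_attention_chunks` are hypothetical names, and it does not use maxdiffusion's actual config machinery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UlyssesConfig:
    """Hypothetical config holder shared by Ulysses-style kernels."""

    ulysses_attention_chunks: int = 1  # 1 keeps the single-shot path

    def __post_init__(self):
        if self.ulysses_attention_chunks < 1:
            raise ValueError("ulysses_attention_chunks must be >= 1")


def load_ulysses_config(raw):
    """Build the config from a parsed config mapping (e.g. a loaded YAML dict)."""
    chunks = int(raw.get("ulysses_attention_chunks", 1))
    return UlyssesConfig(ulysses_attention_chunks=chunks)


default_cfg = load_ulysses_config({})
chunked_cfg = load_ulysses_config({"ulysses_attention_chunks": 2})
```

Keeping the default at 1 preserves the opt-in behavior described in the PR, and any kernel that receives the config can read the same field.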

Perseus14 · 2026-06-24T20:14:24Z

        f"got heads={num_heads} and context_shards={num_shards}."
    )
+
+  # EXPERIMENTAL: split the all-to-all into `num_chunks` head-groups so XLA's


Does this work on ulysses + ring as well?

csgoogle force-pushed the sagarchapara/ulysses-two-phase branch from 7240f50 to 0d936f8 on June 24, 2026 13:59

csgoogle requested a review from Perseus14 on June 24, 2026 14:05

Perseus14 reviewed on Jun 24, 2026

Perseus14 requested a review from eltsai on June 24, 2026 20:14

csgoogle wants to merge 1 commit into main from sagarchapara/ulysses-two-phase
