[SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA #55879

Draft

zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:ci-pin-checkout-sha

Conversation

@zhengruifeng (Contributor) commented May 14, 2026

What changes were proposed in this pull request?

In `.github/workflows/build_and_test.yml`, add a step to the `precondition` job that captures `git rev-parse HEAD` right after the apache/spark checkout and exposes it as a `head_sha` output, then switch every downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to `ref: ${{ needs.precondition.outputs.head_sha }}`. The `precondition` job's own checkout still resolves `inputs.branch`; the 11 downstream checkouts (build, infra-image, precompile, pyspark, sparkr, buf, lint, docs, tpcds-1g, docker-integration-tests, k8s-integration-tests) now all pin to the same SHA, as sketched below.
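
A minimal sketch of the pattern (job names, runner labels, and the checkout action version here are illustrative; the real workflow has many more steps):

```yaml
jobs:
  precondition:
    runs-on: ubuntu-latest
    outputs:
      head_sha: ${{ steps.resolve.outputs.head_sha }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: apache/spark
          ref: ${{ inputs.branch }}        # still resolves the branch, exactly once
      - id: resolve
        run: echo "head_sha=$(git rev-parse HEAD)" >> "$GITHUB_OUTPUT"

  pyspark:                                 # same one-line edit in all 11 downstream jobs
    needs: precondition
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: apache/spark
          ref: ${{ needs.precondition.outputs.head_sha }}  # full SHA, never re-resolved
```

Because `head_sha` is a full commit SHA, `actions/checkout` fetches exactly that commit regardless of how much later the downstream runner starts.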

Why are the changes needed?

Today each `actions/checkout` step independently re-resolves `ref: ${{ inputs.branch }}` (default master) at the moment the runner picks it up. Different jobs in the same workflow run can therefore end up testing different commits.
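
For reference, the pattern every downstream job uses today looks roughly like this (simplified):

```yaml
- uses: actions/checkout@v4
  with:
    repository: apache/spark
    ref: ${{ inputs.branch }}   # a branch name: resolved to a commit only at
                                # the moment this job's runner picks the step up
```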

This is a long-standing issue: `ref: ${{ inputs.branch }}` has been in `build_and_test.yml` since commit 9e468cf010f (SPARK-39521, 2022-06-21), roughly 3.5 years, so the race has existed the entire time. It usually goes unnoticed because a normal master commit doesn't cross the JVM/Python boundary, so even when jobs do see different commits the tests stay consistent within each job.

It becomes a real problem during merge bursts. Commits per hour on master vary wildly; release-prep windows, end-of-week merges, and the APAC + EU overlap regularly push 3–6 commits in 20 minutes. The drift window for pyspark jobs is structurally ~17 minutes (precompile time) plus runner queue wait, so during a merge burst the probability that at least one commit lands inside that window approaches 1 (see the estimate after the error message below). When the unlucky commit happens to add a tightly coupled change (a new Spark Connect relation, new proto field, new server planner, and new Python tests in one PR), every NEAREST-BY-style test in the previous run then fails with:

[CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]
The Spark Connect plan is invalid. This oneOf field in spark.connect.Relation is not set: RELTYPE_NOT_SET
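
A back-of-the-envelope estimate (illustrative only, modeling burst commits as a Poisson process at ~4 commits per 20 minutes from the range quoted above, i.e. λ ≈ 0.2 commits/minute):

P(≥1 commit in a 17-minute window) = 1 − e^(−λt) = 1 − e^(−0.2 × 17) ≈ 1 − e^(−3.4) ≈ 0.97

so during a sustained burst nearly every run is exposed to the race.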

Concrete example from 2026-05-14:

  • Run 25835824862 triggered by e19bc35c (SPARK-56844) — pyspark-connect failed with 19 NEAREST BY errors.
  • Run 25835929554 triggered ~3 minutes later by the next commit 13380e78 (SPARK-56395, which added the NEAREST BY feature) — same job passed.

The first run's `precompile` checked out e19bc35c (no NEAREST BY server code), but by the time its pyspark-connect job actually started 17 minutes later, master was at 13380e78, and `actions/checkout` resolved that newer commit (with the new Python test files). Pinning every job to the SHA that `precondition` resolved makes this impossible.

The fix is also forward-looking: as Spark's release cadence and contributor count grow, the merge-burst probability only goes up; without pinning, "spurious red CI on the previous PR every time someone merges a Connect feature" will keep recurring.

Does this PR introduce any user-facing change?

No. CI infrastructure only.

How was this patch tested?

YAML syntax validated locally. CI will exercise the change end-to-end.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Have the `precondition` job capture `git rev-parse HEAD` right after
its `actions/checkout`, expose it as `head_sha`, and switch every
downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to
`ref: ${{ needs.precondition.outputs.head_sha }}`.

Without this, each downstream job independently re-resolves the
branch tip at the moment it starts. Slow-to-start jobs (the `pyspark`
matrix waits on `precompile` and typically begins ~17 minutes after
the run is created) can pick up a newer commit than the one the
compiled JAR they download was built from. When the intervening
commit adds a tightly coupled change (new Spark Connect relation,
new proto field, new server planner, new Python tests) the test
job loads the new Python sources against an older JAR and every
test fails with `[CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]`.

Generated-by: Claude Code (claude-opus-4-7)
zhengruifeng changed the title from "[INFRA] Pin downstream actions/checkout to a single resolved SHA" to "[SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA" on May 14, 2026