[SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA #55879

Draft

zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:ci-pin-checkout-sha

Conversation

@zhengruifeng (Contributor) commented May 14, 2026

What changes were proposed in this pull request?

In `.github/workflows/build_and_test.yml`, add a step to the `precondition` job that captures `git rev-parse HEAD` right after the apache/spark checkout and exposes it as a `head_sha` output, then switch every downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to `ref: ${{ needs.precondition.outputs.head_sha }}`. The `precondition` job's own checkout still resolves `inputs.branch`; the 11 downstream checkouts (build, infra-image, precompile, pyspark, sparkr, buf, lint, docs, tpcds-1g, docker-integration-tests, k8s-integration-tests) now all pin to the same SHA, as sketched below.
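
A minimal sketch of the pattern (job names, runner labels, and the checkout action version here are illustrative; the real workflow has many more steps):

```yaml
jobs:
  precondition:
    runs-on: ubuntu-latest
    outputs:
      head_sha: ${{ steps.resolve.outputs.head_sha }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: apache/spark
          ref: ${{ inputs.branch }}        # still resolves the branch, exactly once
      - id: resolve
        run: echo "head_sha=$(git rev-parse HEAD)" >> "$GITHUB_OUTPUT"

  pyspark:                                 # same one-line edit in all 11 downstream jobs
    needs: precondition
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: apache/spark
          ref: ${{ needs.precondition.outputs.head_sha }}  # full SHA, never re-resolved
```

Because `head_sha` is a full commit SHA, `actions/checkout` fetches exactly that commit regardless of how much later the downstream runner starts.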

Why are the changes needed?

Today each `actions/checkout` step independently re-resolves `ref: ${{ inputs.branch }}` (default master) at the moment the runner picks it up. Different jobs in the same workflow run can therefore end up testing different commits.
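
For reference, the pattern every downstream job uses today looks roughly like this (simplified):

```yaml
- uses: actions/checkout@v4
  with:
    repository: apache/spark
    ref: ${{ inputs.branch }}   # a branch name: resolved to a commit only at
                                # the moment this job's runner picks the step up
```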

This is a long-standing issue: `ref: ${{ inputs.branch }}` has been in `build_and_test.yml` since commit 9e468cf010f (SPARK-39521, 2022-06-21), roughly 3.5 years, so the race has existed the entire time. It usually goes unnoticed because a normal master commit doesn't cross the JVM/Python boundary, so even when jobs do see different commits the tests stay consistent within each job.

It becomes a real problem during merge bursts. Commits per hour on master vary wildly; release-prep windows, end-of-week merges, and the APAC + EU overlap regularly push 3–6 commits in 20 minutes. The drift window for pyspark jobs is structurally ~17 minutes (precompile time) plus runner queue wait, so during a merge burst the probability that at least one commit lands inside that window approaches 1 (see the estimate after the error message below). When the unlucky commit happens to add a tightly coupled change (a new Spark Connect relation, new proto field, new server planner, and new Python tests in one PR), every NEAREST-BY-style test in the previous run then fails with:

[CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]
The Spark Connect plan is invalid. This oneOf field in spark.connect.Relation is not set: RELTYPE_NOT_SET
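
A back-of-the-envelope estimate (illustrative only, modeling burst commits as a Poisson process at ~4 commits per 20 minutes from the range quoted above, i.e. λ ≈ 0.2 commits/minute):

P(≥1 commit in a 17-minute window) = 1 − e^(−λt) = 1 − e^(−0.2 × 17) ≈ 1 − e^(−3.4) ≈ 0.97

so during a sustained burst nearly every run is exposed to the race.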

Concrete example from 2026-05-14:

  • Run 25835824862 triggered by e19bc35c (SPARK-56844) — pyspark-connect failed with 19 NEAREST BY errors.
  • Run 25835929554 triggered ~3 minutes later by the next commit 13380e78 (SPARK-56395, which added the NEAREST BY feature) — same job passed.

The first run's `precompile` checked out e19bc35c (no NEAREST BY server code), but by the time its pyspark-connect job actually started 17 minutes later, master was at 13380e78, and `actions/checkout` resolved that newer commit (with the new Python test files). Pinning every job to the SHA that `precondition` resolved makes this impossible.

The fix is also forward-looking: as Spark's release cadence and contributor count grow, the merge-burst probability only goes up; without pinning, "spurious red CI on the previous PR every time someone merges a Connect feature" will keep recurring.

Does this PR introduce any user-facing change?

No. CI infrastructure only.

How was this patch tested?

YAML syntax validated locally. CI will exercise the change end-to-end.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Have the `precondition` job capture `git rev-parse HEAD` right after
its `actions/checkout`, expose it as `head_sha`, and switch every
downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to
`ref: ${{ needs.precondition.outputs.head_sha }}`.

Without this, each downstream job independently re-resolves the
branch tip at the moment it starts. Slow-to-start jobs (the `pyspark`
matrix waits on `precompile` and typically begins ~17 minutes after
the run is created) can pick up a newer commit than the one the
compiled JAR they download was built from. When the intervening
commit adds a tightly coupled change (new Spark Connect relation,
new proto field, new server planner, new Python tests) the test
job loads the new Python sources against an older JAR and every
test fails with `[CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]`.

Generated-by: Claude Code (claude-opus-4-7)
zhengruifeng changed the title from "[INFRA] Pin downstream actions/checkout to a single resolved SHA" to "[SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA" on May 14, 2026