[SPARK-56866][INFRA] Pin downstream actions/checkout to a single resolved SHA #55879
Draft
zhengruifeng wants to merge 1 commit into
Conversation
Have the `precondition` job capture `git rev-parse HEAD` right after
its `actions/checkout`, expose it as `head_sha`, and switch every
downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to
`ref: ${{ needs.precondition.outputs.head_sha }}`.
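A sketch of the `precondition` side of this change. The step id `get-head-sha`, the runner label, and the `actions/checkout` version pin are illustrative assumptions, not the exact contents of `build_and_test.yml`:

```yaml
precondition:
  runs-on: ubuntu-latest
  outputs:
    # Exposed to downstream jobs as needs.precondition.outputs.head_sha
    head_sha: ${{ steps.get-head-sha.outputs.head_sha }}
  steps:
    - uses: actions/checkout@v4
      with:
        # Only this checkout still resolves the branch name.
        ref: ${{ inputs.branch }}
    - id: get-head-sha
      # Record the exact commit the checkout above resolved.
      run: echo "head_sha=$(git rev-parse HEAD)" >> "$GITHUB_OUTPUT"
```

Writing `key=value` to `$GITHUB_OUTPUT` is the standard mechanism for step outputs; the job-level `outputs` block then makes the value visible to any job that lists `precondition` in its `needs`.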
Without this, each downstream job independently re-resolves the
branch tip at the moment it starts. Slow-to-start jobs (the `pyspark`
matrix waits on `precompile` and typically begins ~17 minutes after
the run is created) can pick up a newer commit than the one the
compiled JAR they download was built from. When the intervening
commit adds a tightly coupled change (new Spark Connect relation,
new proto field, new server planner, new Python tests) the test
job loads the new Python sources against an older JAR and every
test fails with `[CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]`.
Generated-by: Claude Code (claude-opus-4-7)
What changes were proposed in this pull request?
In `.github/workflows/build_and_test.yml`, add a step to the `precondition` job that captures `git rev-parse HEAD` right after the apache/spark checkout and exposes it as a `head_sha` output, then switch every downstream `actions/checkout` from `ref: ${{ inputs.branch }}` to `ref: ${{ needs.precondition.outputs.head_sha }}`. The `precondition` job's own checkout still resolves `inputs.branch`; the 11 downstream checkouts (`build`, `infra-image`, `precompile`, `pyspark`, `sparkr`, `buf`, `lint`, `docs`, `tpcds-1g`, `docker-integration-tests`, `k8s-integration-tests`) now all pin to the same SHA.
Why are the changes needed?
Today each `actions/checkout` step independently re-resolves `ref: ${{ inputs.branch }}` (default `master`) at the moment the runner picks it up. Different jobs in the same workflow run can therefore end up testing different commits.

This is a long-standing issue. `ref: ${{ inputs.branch }}` has been in `build_and_test.yml` since commit `9e468cf010f` (SPARK-39521, 2022-06-21), roughly 3.5 years, so the race has existed the entire time. It usually goes unnoticed because a normal master commit doesn't cross the JVM/Python boundary, so even when jobs do see different commits, the tests stay consistent within each job.

It becomes a real problem during merge bursts. Commits per hour on master vary wildly; release-prep windows, end-of-week merges, and the APAC + EU overlap regularly push 3–6 commits in 20 minutes. The drift window for `pyspark` jobs is structurally ~17 minutes (the `precompile` time) plus runner queue wait, so during a merge burst the probability that at least one commit lands inside that window approaches 1. When the unlucky commit happens to add a tightly coupled change (new Spark Connect relation, new proto field, new server planner, and new Python tests in one PR), every NEAREST-BY-style test in the previous run then fails with `[CONNECT_INVALID_PLAN.INVALID_ONE_OF_FIELD_NOT_SET]`.

Concrete example from 2026-05-14:
`e19bc35c` (SPARK-56844): `pyspark-connect` failed with 19 NEAREST BY errors.
`13380e78` (SPARK-56395, which added the NEAREST BY feature): the same job passed.

The first run's `precompile` checked out `e19bc35c` (no NEAREST BY server code), but by the time its `pyspark-connect` job actually started 17 minutes later, master was at `13380e78` and `actions/checkout` resolved that newer commit (with the new Python test files). Pinning every job to the SHA `precondition` saw makes this impossible.

The fix is also forward-leaning: as Spark's release cadence and contributor count grow, the merge-burst probability only goes up; without pinning, "spurious red CI on the previous PR every time someone merges a Connect feature" will keep recurring.
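The per-job side of the fix is small. A sketch of one downstream job's checkout under this change; the `needs` list, runner label, and `actions/checkout` version pin are abbreviated assumptions, not the exact file contents:

```yaml
pyspark:
  needs: [precondition, precompile]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        # Pin to the commit the precondition job resolved,
        # not the live branch tip at the moment this runner starts.
        ref: ${{ needs.precondition.outputs.head_sha }}
```

Because `actions/checkout` accepts a full commit SHA in `ref`, every job in the run fetches exactly the commit `precondition` saw, regardless of how long the job waits in the queue.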
Does this PR introduce any user-facing change?
No. CI infrastructure only.
How was this patch tested?
YAML syntax validated locally. CI will exercise the change end-to-end.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)