Release Pipeline Reliability and Manifest Correctness
Overview
A user-facing issue was observed where setup-python-pz downloaded an s390x tarball on ppc64le systems.
Investigation showed that this originated from incorrect architecture mappings in the version manifests generated by the release pipeline.
This issue documents the observed behavior, explains why the previous workflow allowed this failure, and describes the architectural changes required to prevent this class of problem going forward.
Context
- Version manifests are JSON files that map:
  (python version, architecture) → release tarball URL
  These manifests are consumed by downstream tooling (e.g. setup-python-pz) and are treated as authoritative.
- The release pipeline builds Python binaries for multiple architectures (currently ppc64le and s390x) and publishes both the artifacts and the corresponding manifest updates.
  Any corruption in these manifests directly affects end users.
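To make the mapping concrete, here is a minimal sketch of how a consumer might resolve a URL from a manifest. The JSON layout, the version number, and the example.invalid URLs are assumptions for illustration only; the real manifest schema is not shown in this issue, and resolve_url is a hypothetical helper, not part of setup-python-pz.

```python
import json

# Hypothetical manifest content; the real schema and URLs are not shown in this issue.
MANIFEST_JSON = """
{
  "3.12.1": {
    "ppc64le": "https://example.invalid/python-3.12.1-ppc64le.tar.gz",
    "s390x": "https://example.invalid/python-3.12.1-s390x.tar.gz"
  }
}
"""


def resolve_url(manifest: dict, version: str, arch: str) -> str:
    """Return the tarball URL recorded for (version, arch); KeyError if missing."""
    return manifest[version][arch]


manifest = json.loads(MANIFEST_JSON)
url = resolve_url(manifest, "3.12.1", "ppc64le")
```

Because the consumer trusts whatever URL is stored under its key, a wrong entry is returned without any error, which is why correctness has to be enforced when the manifest is generated.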
Impact
User impact
- ppc64le users received incompatible s390x binaries
Engineering impact
- Architecture-specific manifests became unreliable
- Entire workflows were cancelled due to unrelated infrastructure failures
- Release behavior became timing-dependent and difficult to reason about
- Recovery often required full workflow re-runs
Observed Behavior
Incorrect Architecture Mapping (Critical)
While consuming the manifests via setup-python-pz:
- On ppc64le systems, the resolved download URL pointed to an s390x tarball
- This indicates a correctness failure in manifest generation rather than in the consumer
Manifest Pollution After Introducing Trivy
After Trivy scanning was added to the release process:
- Trivy SBOMs and scan reports were uploaded as release artifacts
- Manifest parsing logic did not strictly filter artifact types
- Architecture-specific manifests contained URLs pointing to:
  - Trivy artifacts
  - Tarballs built for the wrong architecture
Expected behavior:
- A ppc64le manifest should reference only ppc64le tarballs
- An s390x manifest should reference only s390x tarballs
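The expected behavior can be expressed as a strict allow-list check on asset names. The sketch below is an illustration under assumptions: the asset naming pattern (a .tar.gz file with the architecture as a separate token in the name) and the example file names are hypothetical, since the actual naming convention is not given in this issue.

```python
import re

KNOWN_ARCHES = ("ppc64le", "s390x")


def is_release_tarball(asset_name: str, arch: str) -> bool:
    """True only for .tar.gz assets whose name mentions exactly the requested arch."""
    if not asset_name.endswith(".tar.gz"):
        # Rejects non-tarball assets such as SBOMs or scan reports.
        return False
    # Match the arch only as a standalone token, not as part of a longer word.
    found = [
        a
        for a in KNOWN_ARCHES
        if re.search(rf"(?<![A-Za-z0-9]){re.escape(a)}(?![A-Za-z0-9])", asset_name)
    ]
    return found == [arch]


# Hypothetical asset names for one release.
assets = [
    "python-3.12.1-linux-22.04-ppc64le.tar.gz",
    "python-3.12.1-linux-22.04-s390x.tar.gz",
    "trivy-sbom-ppc64le.json",
]
ppc64le_assets = [a for a in assets if is_release_tarball(a, "ppc64le")]
```

An allow-list (accept only what matches the tarball pattern for this architecture) is safer here than a deny-list of Trivy file names, because any new artifact type added to the release is rejected by default.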
Workflow Cancellation Due to dotnet Setup Failures
When multiple jobs were triggered concurrently:
- s390x runners frequently failed during dotnet-install.py due to transient network timeouts
- A dotnet setup failure for one architecture cancelled the entire workflow
- Successful builds on ppc64le were discarded
This coupled transient infrastructure failures to release correctness.
Retry-Based Git Push Behavior
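The previous release workflow relied on retries to handle concurrent updates:
- Each architecture-specific job (ppc64le, s390x) generated and pushed its own manifest
- If a push failed, the job rebased and retried
Failure scenario:
- ppc64le and s390x jobs complete at roughly the same time
- Both attempt to push manifest updates
- One succeeds; the other rebases and retries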
Under concurrency, this resulted in:
- Non-deterministic behavior
- Flaky failures
- Race conditions being masked rather than eliminated
Old Release Workflow (Before Refactor)
Structure
Trigger (workflow_dispatch)
              |
              v
+---------------------------+
| get-tags                  |
|---------------------------|
| - Determine Python tags   |
| - Output tags_json        |
+-------------+-------------+
              |
              v
+-----------------------------------------------+
| build-and-release-matrix (parallel)           |
|-----------------------------------------------|
| arch: ppc64le, s390x                          |
| platform: ubuntu-22.04, ubuntu-24.04          |
|                                               |
| For each (tag, arch, platform):               |
| - Build binaries                              |
| - Upload release artifacts                    |
| - Update architecture manifest                |
| - git commit + git push                       |
+---------------------------+-------------------+
                            |
                            v
+-----------------------------------------------+
| release-assets (per tag)                      |
|-----------------------------------------------|
| - Finalize GitHub Release assets              |
+-----------------------------------------------+
Properties of the Old Model
- Multiple parallel jobs wrote directly to versions-manifests/
- No serialization or concurrency control
- Retry + rebase logic used to resolve conflicts
- Infrastructure failures cancelled the entire workflow
Root Cause
Parallel jobs for ppc64le and s390x were allowed to directly mutate shared, user-facing state (version manifests), and retry logic was used to handle write conflicts instead of enforcing serialization.
Proposed Change
Refactor the release pipeline to separate artifact generation from manifest updates, and ensure that shared state is updated exactly once in a controlled manner.
New Release Workflow (After Refactor)
Structure (Matches "Release Matching Python Tags")
Trigger
(push to python-tag-filter.txt
 or workflow_dispatch)
              |
              v
+---------------------------+
| get-tags                  |
|---------------------------|
| - Read python-tag-filter  |
|   OR derive latest tag    |
| - Produce tags_json       |
+-------------+-------------+
              |
              v
+-----------------------------------------------+
| build-and-release-matrix (fan-out)            |
|-----------------------------------------------|
| arch: ppc64le, s390x                          |
| platform: ubuntu-22.04, ubuntu-24.04          |
|                                               |
| For each (tag, arch, platform):               |
| - Build binaries                              |
| - Upload release artifacts                    |
| - Emit manifest-part-* artifact               |
|                                               |
| fail-fast: false                              |
+---------------------------+-------------------+
                            |
                            v
+-----------------------------------------------+
| release-assets (per tag)                      |
|-----------------------------------------------|
| - Finalize GitHub Release assets              |
| - Does NOT modify manifests                   |
+---------------------------+-------------------+
                            |
                            v
+-----------------------------------------------+
| update-manifests (fan-in, single writer)      |
|-----------------------------------------------|
| - Download manifest-part-* artifacts          |
| - Filter Trivy artifacts                      |
| - Validate arch ↔ tarball mapping             |
| - Merge manifests                             |
| - Single git commit + push                    |
|                                               |
| concurrency: manifests-${{ github.ref }}      |
+-----------------------------------------------+
Implementation Outline
Build Phase (Parallel)
For each architecture (ppc64le, s390x):
- Build and upload release tarballs
- Generate an architecture-scoped partial manifest
- Do not write to git
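As a rough sketch of the build-phase output, each matrix job could write one small JSON file and upload it as its manifest-part-* artifact instead of touching git. The record fields, the file name pattern after the manifest-part- prefix, and the write_manifest_part helper are all hypothetical; only the manifest-part-* artifact name comes from the workflow described above.

```python
import json
from pathlib import Path


def write_manifest_part(out_dir, arch, platform, version, url):
    """Write one partial-manifest record to disk and return its path.

    The record layout and file name are assumptions for illustration.
    """
    part = {"arch": arch, "platform": platform, "version": version, "url": url}
    # Including arch, platform, and version keeps file names unique per matrix job.
    path = Path(out_dir) / f"manifest-part-{arch}-{platform}-{version}.json"
    path.write_text(json.dumps(part, sort_keys=True, indent=2) + "\n", encoding="utf-8")
    return path
```

Each job writes only its own file, so no two jobs ever contend for the same path or for the repository.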
Aggregation Phase (Serialized)
A single job:
- Downloads all available partial manifests
- Filters out non-release artifacts (e.g. Trivy outputs)
- Validates architecture-to-tarball mapping
- Merges results deterministically
- Commits once
Partial failures are tolerated: a failure on s390x does not invalidate a successful ppc64le build, and vice versa.
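The aggregation steps above can be sketched as a single merge function. This is an illustration under assumptions, not the pipeline's actual code: the record fields (arch, version, url), the rule that the architecture name must appear in the tarball file name, and the merged output shape are all hypothetical.

```python
def merge_parts(parts):
    """Merge partial-manifest records into {arch: {version: [urls]}}.

    The result is independent of the order in which parts were downloaded.
    """
    merged = {}
    for part in parts:
        arch, version, url = part["arch"], part["version"], part["url"]
        filename = url.rsplit("/", 1)[-1]
        # Assumed convention: the tarball file name contains its architecture.
        if not filename.endswith(".tar.gz") or arch not in filename:
            raise ValueError(f"{url!r} is not a {arch} release tarball")
        merged.setdefault(arch, {}).setdefault(version, set()).add(url)
    # Sort every level so repeated runs over the same parts give identical output.
    # Version keys are sorted as plain strings here, which is deterministic
    # but not semantic version order.
    return {
        arch: {version: sorted(urls) for version, urls in sorted(versions.items())}
        for arch, versions in sorted(merged.items())
    }
```

If the s390x parts are missing because that build failed, the function simply merges the ppc64le parts it was given, which matches the partial-failure behavior described above.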
Infrastructure Considerations
- Add retry logic to dotnet-install.py for transient network failures
- Prevent dotnet setup failures on s390x from cancelling ppc64le releases
- Decouple infrastructure instability from manifest correctness
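One possible shape for the retry logic is exponential backoff around the download call. The sketch below makes no claim about how dotnet-install.py is actually structured; fetch_with_retry, its parameters, and the choice to retry only on network errors and 5xx responses are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=4, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Download url, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            # Client errors such as 404 are not transient; fail immediately.
            if exc.code < 500 or attempt == attempts - 1:
                raise
        except (urllib.error.URLError, TimeoutError):
            # Network-level failure or timeout: retry unless out of attempts.
            if attempt == attempts - 1:
                raise
        # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
        sleep(base_delay * 2 ** attempt)
```

The opener and sleep parameters exist only so the function can be exercised without real network access or real waiting.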
Acceptance Criteria
- ppc64le manifests reference only ppc64le tarballs
- s390x manifests reference only s390x tarballs
- Trivy artifacts do not appear in version manifests
- setup-python-pz resolves correct binaries for each architecture
- No parallel git writes during release
- One atomic manifest update per release run
Rationale for Documentation
This issue documents a real failure mode and the architectural reasoning behind the fix, so future changes to the release pipeline can be evaluated against these constraints and invariants.