Release Pipeline Reliability and Manifest Correctness #15

@adilhusain-s

Overview

A user-facing issue was observed where setup-python-pz downloaded an s390x tarball on ppc64le systems.
Investigation showed that this originated from incorrect architecture mappings in the version manifests generated by the release pipeline.

This issue documents the observed behavior, explains why the previous workflow allowed this failure, and describes the architectural changes required to prevent this class of problem going forward.


Context

  • Version manifests are JSON files that map:

    (Python version, architecture) → release tarball URL
    

    These manifests are consumed by downstream tooling (e.g. setup-python-pz) and are treated as authoritative.

  • The release pipeline builds Python binaries for multiple architectures (currently ppc64le and s390x) and publishes both the artifacts and the corresponding manifest updates.

Any corruption in these manifests directly affects end users.
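
As a rough illustration of the lookup a consumer performs, the sketch below resolves a tarball URL from a manifest. The manifest schema (a list of entries with `version`, `arch`, and `download_url` keys), the function name `resolve_url`, and the example URLs are assumptions made for illustration; the real manifest layout is not shown in this issue.

```python
from typing import Optional

# Hypothetical manifest: field names and URLs are illustrative only.
MANIFEST = [
    {"version": "3.12.1", "arch": "ppc64le",
     "download_url": "https://example.invalid/python-3.12.1-linux-ppc64le.tar.gz"},
    {"version": "3.12.1", "arch": "s390x",
     "download_url": "https://example.invalid/python-3.12.1-linux-s390x.tar.gz"},
]


def resolve_url(manifest: list, version: str, arch: str) -> Optional[str]:
    """Return the URL for an exact (version, arch) match, or None."""
    for entry in manifest:
        if entry["version"] == version and entry["arch"] == arch:
            return entry["download_url"]
    return None
```

A consumer written this way trusts the `arch` field of each entry. If the generator writes an s390x URL into a ppc64le entry, the lookup returns it without complaint, which is why manifest correctness has to be enforced at generation time.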


Impact

User impact

  • ppc64le users received incompatible s390x binaries

Engineering impact

  • Architecture-specific manifests became unreliable
  • Entire workflows were cancelled due to unrelated infrastructure failures
  • Release behavior became timing-dependent and difficult to reason about
  • Recovery often required full workflow re-runs

Observed Behavior

Incorrect Architecture Mapping (Critical)

While consuming the manifests via setup-python-pz:

  • On ppc64le systems, the resolved download URL pointed to an s390x tarball
  • This indicates a correctness failure in manifest generation rather than in the consumer

Manifest Pollution After Introducing Trivy

After Trivy scanning was added to the release process:

  • Trivy SBOMs and scan reports were uploaded as release artifacts

  • Manifest parsing logic did not strictly filter artifact types

  • Architecture-specific manifests contained URLs pointing to:

    • Trivy artifacts
    • Tarballs built for the wrong architecture

Expected behavior:

  • A ppc64le manifest should reference only ppc64le tarballs
  • An s390x manifest should reference only s390x tarballs
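
One way to enforce the expected behavior is an allowlist: only asset names that match the release tarball pattern for the requested architecture are kept. The sketch below assumes a naming convention of `python-<version>-linux-<platform>-<arch>.tar.gz`; that pattern and the helper names `tarball_arch` and `assets_for_manifest` are hypothetical and would need to match the pipeline's actual artifact names.

```python
import re
from typing import Optional

# Assumed naming convention, for illustration only:
#   python-<version>-linux-<platform>-<arch>.tar.gz
TARBALL_RE = re.compile(
    r"^python-(?P<version>\d+\.\d+\.\d+)-linux-(?P<platform>\d+\.\d+)"
    r"-(?P<arch>ppc64le|s390x)\.tar\.gz$"
)


def tarball_arch(asset_name: str) -> Optional[str]:
    """Return the architecture encoded in a release tarball name, else None."""
    match = TARBALL_RE.match(asset_name)
    return match.group("arch") if match else None


def assets_for_manifest(asset_names: list, arch: str) -> list:
    """Keep only release tarballs built for `arch`; drop everything else."""
    return sorted(name for name in asset_names if tarball_arch(name) == arch)
```

Because anything that does not match the pattern is dropped, a new artifact type (such as an SBOM or scan report) stays out of the manifest by default instead of needing its own exclusion rule.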

Workflow Cancellation Due to dotnet Setup Failures

When multiple jobs were triggered concurrently:

  • s390x runners frequently failed during dotnet-install.py due to transient network timeouts
  • A dotnet setup failure for one architecture cancelled the entire workflow
  • Successful builds on ppc64le were discarded

This coupled transient infrastructure failures to release correctness.


Retry-Based Git Push Behavior

The previous release workflow relied on retries to handle concurrent updates:

  • Each architecture-specific job (ppc64le, s390x) generated and pushed its own manifest

  • If a push failed:

    • The job rebased
    • The push was retried until it succeeded

Failure scenario:

  • ppc64le and s390x jobs complete at roughly the same time
  • Both attempt to push manifest updates
  • One succeeds; the other rebases and retries

Under concurrency, this resulted in:

  • Non-deterministic behavior
  • Flaky failures
  • Race conditions being masked rather than eliminated

Old Release Workflow (Before Refactor)

Structure

Trigger (workflow_dispatch)
        |
        v
+---------------------------+
| get-tags                  |
|---------------------------|
| - Determine Python tags   |
| - Output tags_json        |
+-------------+-------------+
              |
              v
+-----------------------------------------------+
| build-and-release-matrix (parallel)           |
|-----------------------------------------------|
| arch: ppc64le, s390x                          |
| platform: ubuntu-22.04, ubuntu-24.04          |
|                                               |
| For each (tag, arch, platform):               |
|   - Build binaries                            |
|   - Upload release artifacts                  |
|   - Update architecture manifest              |
|   - git commit + git push                     |
+---------------------------+-------------------+
                            |
                            v
+-----------------------------------------------+
| release-assets (per tag)                      |
|-----------------------------------------------|
| - Finalize GitHub Release assets              |
+-----------------------------------------------+

Properties of the Old Model

  • Multiple parallel jobs wrote directly to versions-manifests/
  • No serialization or concurrency control
  • Retry + rebase logic used to resolve conflicts
  • Infrastructure failures cancelled the entire workflow

Root Cause

Parallel jobs for ppc64le and s390x were allowed to directly mutate shared, user-facing state (version manifests), and retry logic was used to handle write conflicts instead of enforcing serialization.


Proposed Change

Refactor the release pipeline to separate artifact generation from manifest updates, and ensure that shared state is updated exactly once in a controlled manner.


New Release Workflow (After Refactor)

Structure (Matches "Release Matching Python Tags")

Trigger
(push to python-tag-filter.txt
 or workflow_dispatch)
        |
        v
+---------------------------+
| get-tags                  |
|---------------------------|
| - Read python-tag-filter  |
|   OR derive latest tag    |
| - Produce tags_json       |
+-------------+-------------+
              |
              v
+-----------------------------------------------+
| build-and-release-matrix (fan-out)            |
|-----------------------------------------------|
| arch: ppc64le, s390x                          |
| platform: ubuntu-22.04, ubuntu-24.04          |
|                                               |
| For each (tag, arch, platform):               |
|   - Build binaries                            |
|   - Upload release artifacts                  |
|   - Emit manifest-part-* artifact             |
|                                               |
| fail-fast: false                              |
+---------------------------+-------------------+
                            |
                            v
+-----------------------------------------------+
| release-assets (per tag)                      |
|-----------------------------------------------|
| - Finalize GitHub Release assets              |
| - Does NOT modify manifests                   |
+---------------------------+-------------------+
                            |
                            v
+-----------------------------------------------+
| update-manifests (fan-in, single writer)      |
|-----------------------------------------------|
| - Download manifest-part-* artifacts          |
| - Filter Trivy artifacts                      |
| - Validate arch ↔ tarball mapping             |
| - Merge manifests                             |
| - Single git commit + push                    |
|                                               |
| concurrency: manifests-${{ github.ref }}      |
+-----------------------------------------------+

Implementation Outline

Build Phase (Parallel)

For each architecture (ppc64le, s390x):

  • Build and upload release tarballs
  • Generate an architecture-scoped partial manifest
  • Do not write to git

Aggregation Phase (Serialized)

A single job:

  • Downloads all available partial manifests
  • Filters out non-release artifacts (e.g. Trivy outputs)
  • Validates architecture-to-tarball mapping
  • Merges results deterministically
  • Commits once

Partial failures are tolerated: a failure on s390x does not invalidate a successful ppc64le build, and vice versa.
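
The aggregation steps could look roughly like the following sketch. It assumes each partial manifest is a dict with an `arch` key and a list of `entries` (each carrying `version`, `platform`, and `download_url`), and that tarball names end in `-<arch>.tar.gz`. Both the schema and the function names `merge_partials` and `render` are hypothetical, not the pipeline's actual format.

```python
import json


def merge_partials(partials: list) -> dict:
    """Merge partial manifests into {arch: [entries]} in a stable order.

    Non-tarball URLs are filtered out; a tarball whose name does not match
    the partial's architecture raises ValueError.
    """
    merged: dict = {}
    for partial in partials:
        arch = partial["arch"]
        for entry in partial["entries"]:
            url = entry["download_url"]
            if not url.endswith(".tar.gz"):
                continue  # filter: scan reports, SBOMs, other non-release assets
            # Assumed convention: tarball names end in "-<arch>.tar.gz".
            if not url.endswith(f"-{arch}.tar.gz"):
                raise ValueError(f"{arch} manifest references {url}")
            key = (entry["version"], entry["platform"])
            merged.setdefault(arch, {})[key] = entry
    # Sort architectures and entry keys (plain string sort, not version-aware)
    # so the result does not depend on which build job finished first.
    return {
        arch: [merged[arch][key] for key in sorted(merged[arch])]
        for arch in sorted(merged)
    }


def render(manifest: dict) -> str:
    """Serialize with sorted keys so identical input yields identical text."""
    return json.dumps(manifest, indent=2, sort_keys=True) + "\n"
```

Since only the partials that were actually uploaded are passed in, a missing s390x partial leaves the ppc64le output unaffected. A mismatched tarball raises before anything is committed.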


Infrastructure Considerations

  • Add retry logic to dotnet-install.py for transient network failures
  • Prevent dotnet setup failures on s390x from cancelling ppc64le releases
  • Decouple infrastructure instability from manifest correctness
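
For the retry item, a minimal sketch of exponential backoff around a download is shown below. The internals of dotnet-install.py are not described in this issue, so `download_with_retry`, its parameters, and the choice of `urllib` are assumptions for illustration, not the script's actual code.

```python
import time
import urllib.error
import urllib.request


def download_with_retry(url, attempts=4, base_delay=1.0, timeout=30.0,
                        opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying transient network errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            # Client errors such as 404 will not fix themselves; fail at once.
            if isinstance(exc, urllib.error.HTTPError) and exc.code < 500:
                raise
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `opener` and `sleep` parameters exist only so the behavior can be exercised without real network access or real waiting.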

Acceptance Criteria

  • ppc64le manifests reference only ppc64le tarballs
  • s390x manifests reference only s390x tarballs
  • Trivy artifacts do not appear in version manifests
  • setup-python-pz resolves correct binaries for each architecture
  • No parallel git writes during release
  • One atomic manifest update per release run

Rationale for Documentation

This issue documents a real failure mode and the architectural reasoning behind the fix, so future changes to the release pipeline can be evaluated against these constraints and invariants.
