Skip to content

CI: flaky integration jobs — Preload Images step fails on transient Docker Hub pull (context deadline exceeded) #7598

@sandy2008

Description

@sandy2008

Drafted with AI assistance (Claude Code) and reviewed before submission, per the Generative AI Contribution Policy.

Describe the bug
The integration CI matrix jobs (workflow ci, defined in .github/workflows/test-build-deploy.yml) intermittently fail in the Preload Images step when a docker pull from Docker Hub fails with a transient network timeout:

Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 1.

The step is a plain sequence of docker pull commands with no retry, so a single transient registry timeout (or Docker Hub throttling) fails the whole step and aborts the integration job. This is a CI infrastructure/flakiness problem; it is unrelated to the code under test.

To Reproduce
This is a non-deterministic CI flake rather than a Cortex runtime bug. It surfaces whenever Docker Hub momentarily times out or throttles a pull:

  1. Push a commit / open a PR so the integration matrix runs.
  2. The Preload Images step runs, in order: docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z, docker pull consul:1.8.4, docker pull quay.io/coreos/etcd:v3.5.29, docker pull memcached:1.6.1, docker pull redis:7.0.4-alpine.
  3. If any single pull from registry-1.docker.io (Docker Hub) times out, the step exits 1 and the job fails — even though nothing is wrong with the change or the tests.

Observed occurrence: run 27092748117, job integration (ubuntu-24.04, amd64, integration_overrides), on 2026-06-07. The failing pull was the first one (minio/minio). The change under test was a 4-line, test-only modernization in pkg/ruler and could not have caused this.

Expected behavior
A transient Docker Hub pull error should not fail the entire integration job. Image preloads should be retried (and/or authenticated/mirrored) so a job only fails on a genuine, persistent problem.

Environment:

  • Infrastructure: GitHub Actions hosted runner (ubuntu-24.04, amd64)
  • Deployment tool: N/A — CI (.github/workflows/test-build-deploy.yml, Preload Images step)

Additional Context

The Preload Images step has no retry around the pulls (.github/workflows/test-build-deploy.yml):

      - name: Preload Images
        # We download docker images used by integration tests so that all images are available
        # locally and the download time doesn't account in the test execution time, which is subject
        # to a timeout
        run: |
          docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z
          docker pull consul:1.8.4
          docker pull quay.io/coreos/etcd:v3.5.29
          ...
          docker pull memcached:1.6.1
          docker pull redis:7.0.4-alpine

Proposed fix (any of):

  • Wrap the docker pull commands in a small retry loop (a few attempts with backoff), or use a retry action such as nick-fields/retry, so a transient registry timeout is retried instead of failing the job.
  • Optionally docker login to Docker Hub for these pulls to reduce unauthenticated rate-limit/throttle flakiness (the Docker Hub-hosted images are minio/minio, consul, memcached, redis).

Related prior Docker Hub CI flakiness (closed, different mechanism — unauthenticated tonistiigi/binfmt rate-limit on ARM runners): #7414.

Failing Preload Images step log (job 79959583989)
##[group]Run docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z
docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z
docker pull consul:1.8.4
docker pull quay.io/coreos/etcd:v3.5.29
if [ "$TEST_TAGS" = "integration_backward_compatibility" ]; then
  docker pull quay.io/cortexproject/cortex:v1.16.1
  ...
elif [ "$TEST_TAGS" = "integration_query_fuzz" ]; then
  docker pull quay.io/cortexproject/cortex:v1.20.1
  docker pull quay.io/prometheus/prometheus:v3.8.1
fi
docker pull memcached:1.6.1
docker pull redis:7.0.4-alpine
shell: /usr/bin/bash -e {0}
env:
  GOTOOLCHAIN: local
  TEST_TAGS: integration_overrides
##[endgroup]
Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 1.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions