Drafted with AI assistance (Claude Code) and reviewed before submission, per the Generative AI Contribution Policy.
Describe the bug
The integration CI matrix jobs (workflow ci, defined in .github/workflows/test-build-deploy.yml) intermittently fail in the Preload Images step when a docker pull from Docker Hub fails with a transient network timeout:
Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 1.
The step is a plain sequence of docker pull commands with no retry, so a single transient registry timeout (or Docker Hub throttling) fails the whole step and aborts the integration job. This is a CI infrastructure/flakiness problem; it is unrelated to the code under test.
To Reproduce
This is a non-deterministic CI flake rather than a Cortex runtime bug. It surfaces whenever Docker Hub momentarily times out or throttles a pull:
- Push a commit / open a PR so the
integration matrix runs.
- The
Preload Images step runs, in order: docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z, docker pull consul:1.8.4, docker pull quay.io/coreos/etcd:v3.5.29, docker pull memcached:1.6.1, docker pull redis:7.0.4-alpine.
- If any single pull from
registry-1.docker.io (Docker Hub) times out, the step exits 1 and the job fails — even though nothing is wrong with the change or the tests.
Observed occurrence: run 27092748117, job integration (ubuntu-24.04, amd64, integration_overrides), on 2026-06-07. The failing pull was the first one (minio/minio). The change under test was a 4-line, test-only modernization in pkg/ruler and could not have caused this.
Expected behavior
A transient Docker Hub pull error should not fail the entire integration job. Image preloads should be retried (and/or authenticated/mirrored) so a job only fails on a genuine, persistent problem.
Environment:
- Infrastructure: GitHub Actions hosted runner (
ubuntu-24.04, amd64)
- Deployment tool: N/A — CI (
.github/workflows/test-build-deploy.yml, Preload Images step)
Additional Context
The Preload Images step has no retry around the pulls (.github/workflows/test-build-deploy.yml):
- name: Preload Images
# We download docker images used by integration tests so that all images are available
# locally and the download time doesn't account in the test execution time, which is subject
# to a timeout
run: |
docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z
docker pull consul:1.8.4
docker pull quay.io/coreos/etcd:v3.5.29
...
docker pull memcached:1.6.1
docker pull redis:7.0.4-alpine
Proposed fix (any of):
- Wrap the
docker pull commands in a small retry loop (a few attempts with backoff), or use a retry action such as nick-fields/retry, so a transient registry timeout is retried instead of failing the job.
- Optionally
docker login to Docker Hub for these pulls to reduce unauthenticated rate-limit/throttle flakiness (the Docker Hub-hosted images are minio/minio, consul, memcached, redis).
Related prior Docker Hub CI flakiness (closed, different mechanism — unauthenticated tonistiigi/binfmt rate-limit on ARM runners): #7414.
Failing Preload Images step log (job 79959583989)
##[group]Run docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z
docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z
docker pull consul:1.8.4
docker pull quay.io/coreos/etcd:v3.5.29
if [ "$TEST_TAGS" = "integration_backward_compatibility" ]; then
docker pull quay.io/cortexproject/cortex:v1.16.1
...
elif [ "$TEST_TAGS" = "integration_query_fuzz" ]; then
docker pull quay.io/cortexproject/cortex:v1.20.1
docker pull quay.io/prometheus/prometheus:v3.8.1
fi
docker pull memcached:1.6.1
docker pull redis:7.0.4-alpine
shell: /usr/bin/bash -e {0}
env:
GOTOOLCHAIN: local
TEST_TAGS: integration_overrides
##[endgroup]
Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
##[error]Process completed with exit code 1.
Drafted with AI assistance (Claude Code) and reviewed before submission, per the Generative AI Contribution Policy.
Describe the bug
The
integrationCI matrix jobs (workflowci, defined in.github/workflows/test-build-deploy.yml) intermittently fail in the Preload Images step when adocker pullfrom Docker Hub fails with a transient network timeout:The step is a plain sequence of
docker pullcommands with no retry, so a single transient registry timeout (or Docker Hub throttling) fails the whole step and aborts the integration job. This is a CI infrastructure/flakiness problem; it is unrelated to the code under test.To Reproduce
This is a non-deterministic CI flake rather than a Cortex runtime bug. It surfaces whenever Docker Hub momentarily times out or throttles a pull:
integrationmatrix runs.Preload Imagesstep runs, in order:docker pull minio/minio:RELEASE.2024-05-28T17-19-04Z,docker pull consul:1.8.4,docker pull quay.io/coreos/etcd:v3.5.29,docker pull memcached:1.6.1,docker pull redis:7.0.4-alpine.registry-1.docker.io(Docker Hub) times out, the step exits 1 and the job fails — even though nothing is wrong with the change or the tests.Observed occurrence: run
27092748117, jobintegration (ubuntu-24.04, amd64, integration_overrides), on 2026-06-07. The failing pull was the first one (minio/minio). The change under test was a 4-line, test-only modernization inpkg/rulerand could not have caused this.Expected behavior
A transient Docker Hub pull error should not fail the entire integration job. Image preloads should be retried (and/or authenticated/mirrored) so a job only fails on a genuine, persistent problem.
Environment:
ubuntu-24.04,amd64).github/workflows/test-build-deploy.yml,Preload Imagesstep)Additional Context
The
Preload Imagesstep has no retry around the pulls (.github/workflows/test-build-deploy.yml):Proposed fix (any of):
docker pullcommands in a small retry loop (a few attempts with backoff), or use a retry action such asnick-fields/retry, so a transient registry timeout is retried instead of failing the job.docker loginto Docker Hub for these pulls to reduce unauthenticated rate-limit/throttle flakiness (the Docker Hub-hosted images areminio/minio,consul,memcached,redis).Related prior Docker Hub CI flakiness (closed, different mechanism — unauthenticated
tonistiigi/binfmtrate-limit on ARM runners): #7414.Failing
Preload Imagesstep log (job 79959583989)