[build] retry Bazel on transient GitHub CDN 5xx errors during repo fetch#17568
[build] retry Bazel on transient GitHub CDN 5xx errors during repo fetch#17568titusfortner wants to merge 2 commits into
Conversation
Review Summary by QodoAdd retry logic for Bazel on transient CDN errors
WalkthroughsDescription• Implements retry mechanism for transient GitHub CDN 5xx errors • Detects HTTP 502/503/504 errors in Bazel output logs • Retries failed Bazel commands up to 3 times with exponential backoff • Preserves original exit codes for non-transient failures Diagramflowchart LR
A["Bazel Command"] --> B["Execute with Log Capture"]
B --> C{"Exit Code = 0?"}
C -->|Yes| D["Success"]
C -->|No| E{"5xx Error Detected?"}
E -->|No| F["Exit with Error"]
E -->|Yes| G{"Max Retries?"}
G -->|Yes| H["Exit Failure"]
G -->|No| I["Wait & Retry"]
I --> B
File Changes1. scripts/github-actions/run-bazel-with-retry.sh
|
Code Review by Qodo
1. Tee failures ignored
|
|
Persistent review updated to latest commit 2ba409c |
| BAZEL_ERROR_PATTERN='GET returned 50[234] ' | ||
|
|
||
| for i in $(seq 1 "$BAZEL_MAX_ATTEMPTS"); do | ||
| bash -c "$BAZEL_RUN_CMD" 2>&1 | tee "$LOG_FILE" |
There was a problem hiding this comment.
1. Tee failures ignored 🐞 Bug ☼ Reliability
run-bazel-with-retry.sh only considers Bazel’s exit code, so if tee fails to write the console log (e.g., permission/disk-full), the step can still exit 0 and silently drop build/bazel-console.log. This can hide infrastructure problems and break downstream tooling that parses that log on failures.
Agent Prompt
## Issue description
`run-bazel-with-retry.sh` pipes Bazel output to `tee`, but it only uses Bazel’s exit code (`PIPESTATUS[0]`). If `tee` fails (can’t write the log), the script may still report success and/or proceed with retry logic based on incomplete logs.
## Issue Context
Downstream logic expects `build/bazel-console.log` to be present and readable when diagnosing failures.
## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[20-21]
### Implementation guidance
After the pipeline, capture both statuses (Bazel + tee) and fail fast on `tee` errors, e.g.:
- `BAZEL_EXIT_CODE=${PIPESTATUS[0]}`
- `TEE_EXIT_CODE=${PIPESTATUS[1]}`
- if `TEE_EXIT_CODE != 0`, print an error and `exit $TEE_EXIT_CODE` (or a fixed non-zero code).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
I'm seeing a bunch of 502 Gateway errors trying to run tests:
https://github.com/SeleniumHQ/selenium/actions/runs/26376849996
Options:
--repository_downloader_retrieswas the answer until I realized that Bazel considers 502 errors non-transient and doesn't retry them.This PR is Option 5
💥 What does this PR do?
Wraps our current execution in a retry loop that parses the output log we're already generating and if it shows there was a 502 Gateway error, it reruns.
🔧 Implementation Notes
💡 Additional Considerations
I'm not sure how common this is going to be going forward
I think this may cause issues with mirrors if we use them
🤖 AI assistance
🔄 Types of changes