Skip to content

[build] retry Bazel on transient GitHub CDN 5xx errors during repo fetch#17568

Draft
titusfortner wants to merge 2 commits into
trunkfrom
bazel_retry
Draft

[build] retry Bazel on transient GitHub CDN 5xx errors during repo fetch#17568
titusfortner wants to merge 2 commits into
trunkfrom
bazel_retry

Conversation

@titusfortner
Copy link
Copy Markdown
Member

@titusfortner titusfortner commented May 25, 2026

I'm seeing a bunch of 502 Gateway errors trying to run tests:
https://github.com/SeleniumHQ/selenium/actions/runs/26376849996

Options:

  1. I thought --repository_downloader_retries was the answer until I realized that Bazel considers 502 errors non-transient and doesn't retry them.
  2. Figure out mirrors - this looks complicated
  3. Ask EngFlow to let us use remote download, but we're already getting a lot from them
  4. Use Github Cache - I have another PR to investigate this incoming
  5. Retry the whole run when we detect a problem

This PR is Option 5

💥 What does this PR do?

Wraps our current execution in a retry loop that parses the output log we're already generating and if it shows there was a 502 Gateway error, it reruns.

🔧 Implementation Notes

  • The log being parsed is overwritten on future runs so it won't flag on 502 errors from previous attempts
  • Github log will still show everything
  • Updated to use ENV for the inputs.run rather than positional arg so quote syntax doesn't collide

💡 Additional Considerations

I'm not sure how common this is going to be going forward
I think this may cause issues with mirrors if we use them

🤖 AI assistance

  • No substantial AI assistance used
  • AI assisted (complete below)
    • Tool(s):
    • What was generated:
    • I reviewed all AI output and can explain the change

🔄 Types of changes

  • Bug fix (backwards compatible)

@selenium-ci selenium-ci added the B-build Includes scripting, bazel and CI integrations label May 25, 2026
@qodo-code-review
Copy link
Copy Markdown
Contributor

Review Summary by Qodo

Add retry logic for Bazel on transient CDN errors

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Implements retry mechanism for transient GitHub CDN 5xx errors
• Detects HTTP 502/503/504 errors in Bazel output logs
• Retries failed Bazel commands up to 3 times with exponential backoff
• Preserves original exit codes for non-transient failures
Diagram
flowchart LR
  A["Bazel Command"] --> B["Execute with Log Capture"]
  B --> C{"Exit Code = 0?"}
  C -->|Yes| D["Success"]
  C -->|No| E{"5xx Error Detected?"}
  E -->|No| F["Exit with Error"]
  E -->|Yes| G{"Max Retries?"}
  G -->|Yes| H["Exit Failure"]
  G -->|No| I["Wait & Retry"]
  I --> B

Loading

File Changes

1. scripts/github-actions/run-bazel-with-retry.sh ✨ Enhancement +44/-0

New retry wrapper script for Bazel CDN errors

• New shell script that wraps Bazel commands with retry logic
• Detects transient HTTP 5xx errors using regex pattern matching
• Implements exponential backoff (15s, 30s, 45s) between retry attempts
• Logs output to file and preserves non-transient error exit codes

scripts/github-actions/run-bazel-with-retry.sh


2. .github/workflows/bazel.yml ✨ Enhancement +1/-5

Integrate retry script into Bazel workflow

• Replaces inline Bazel execution with call to retry wrapper script
• Removes manual log file creation and tee piping logic
• Simplifies "Run Bazel" step by delegating retry handling to script

.github/workflows/bazel.yml


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (1)

Grey Divider


Action required

1. Tee failures ignored 🐞 Bug ☼ Reliability ⭐ New
Description
run-bazel-with-retry.sh only considers Bazel’s exit code, so if tee fails to write the console log
(e.g., permission/disk-full), the step can still exit 0 and silently drop build/bazel-console.log.
This can hide infrastructure problems and break downstream tooling that parses that log on failures.
Code

scripts/github-actions/run-bazel-with-retry.sh[20]

Evidence
The script pipes output to tee but only reads PIPESTATUS[0] (Bazel), so tee write failures are
ignored; meanwhile downstream parsing relies on that log existing.

scripts/github-actions/run-bazel-with-retry.sh[19-22]
scripts/github-actions/rerun-failures.sh[4-10]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`run-bazel-with-retry.sh` pipes Bazel output to `tee`, but it only uses Bazel’s exit code (`PIPESTATUS[0]`). If `tee` fails (can’t write the log), the script may still report success and/or proceed with retry logic based on incomplete logs.

## Issue Context
Downstream logic expects `build/bazel-console.log` to be present and readable when diagnosing failures.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[20-21]

### Implementation guidance
After the pipeline, capture both statuses (Bazel + tee) and fail fast on `tee` errors, e.g.:
- `BAZEL_EXIT_CODE=${PIPESTATUS[0]}`
- `TEE_EXIT_CODE=${PIPESTATUS[1]}`
- if `TEE_EXIT_CODE != 0`, print an error and `exit $TEE_EXIT_CODE` (or a fixed non-zero code).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Retry check never matches ✓ Resolved 🐞 Bug ≡ Correctness
Description
run-bazel-with-retry.sh searches the log for the literal text "BAZEL_ERROR_PATTERN" (missing $),
so 502/503/504 fetch errors will never be detected and the Bazel run will not be retried.
Code

scripts/github-actions/run-bazel-with-retry.sh[R28-36]

Evidence
The script defines a regex in BAZEL_ERROR_PATTERN, but grep is invoked with the quoted literal
"BAZEL_ERROR_PATTERN" (no $), so it will only match if the log contains that exact text, not the
intended 5xx error lines.

scripts/github-actions/run-bazel-with-retry.sh[16-40]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The retry condition uses `grep -qE "BAZEL_ERROR_PATTERN"` and `grep -E "BAZEL_ERROR_PATTERN"`, which greps for the literal string `BAZEL_ERROR_PATTERN` instead of the regex stored in the `BAZEL_ERROR_PATTERN` variable. This prevents the retry loop from ever triggering on 5xx CDN errors.

## Issue Context
The script defines `BAZEL_ERROR_PATTERN='GET returned 50[234] '`, but never dereferences it when calling `grep`.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[16-36]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. tee/grep missing -- 📘 Rule violation ⛨ Security ⭐ New
Description
The script passes $LOG_FILE to tee and grep without a -- argument separator, so a log path
beginning with - could be interpreted as options (option/argument injection) instead of a
filename. This violates the safe argument-handling requirement for CI scripts and can cause
incorrect behavior or unexpected flags being applied.
Code

scripts/github-actions/run-bazel-with-retry.sh[R20-34]

Evidence
PR Compliance ID 14 requires safe argument handling in build/CI scripts. The new code invokes `tee
"$LOG_FILE" and grep ... "$LOG_FILE" without --, allowing a -`-prefixed log filename (via
env/config) to be parsed as options.

scripts/github-actions/run-bazel-with-retry.sh[20-34]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/github-actions/run-bazel-with-retry.sh` passes `$LOG_FILE` as a positional argument to `tee`/`grep` without a `--` separator. If `$LOG_FILE` begins with `-`, these tools may treat it as an option rather than a filename.

## Issue Context
This script is used in GitHub Actions. Rule requires safe argument handling; adding `--` prevents option parsing for filename-like arguments.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[20-34]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. bash -c executes CMD ✓ Resolved 📘 Rule violation ⛨ Security
Description
The workflow passes a single command string into run-bazel-with-retry.sh, which then executes it
via bash -c, enabling shell interpretation/injection and brittle argument parsing. This violates
the safe argument-handling requirement for build/CI scripts.
Code

scripts/github-actions/run-bazel-with-retry.sh[R11-22]

Evidence
PR Compliance ID 16 requires safe argument handling in build/CI scripts. The script builds CMD
from a single string argument and runs it via bash -c, which performs shell interpretation rather
than executing a safely-quoted argv vector.

scripts/github-actions/run-bazel-with-retry.sh[11-22]
.github/workflows/bazel.yml[295-305]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/github-actions/run-bazel-with-retry.sh` executes a caller-provided command string using `bash -c "$CMD"`, which relies on shell parsing and can lead to command injection or incorrect argument handling.

## Issue Context
This script is invoked from `.github/workflows/bazel.yml` with `"${{ inputs.run }}"`, so the called command is provided as a single string. Compliance requires safe argument handling for build/CI scripts.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[11-22]
- .github/workflows/bazel.yml[295-305]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Retries lose Bazel exit ✓ Resolved 🐞 Bug ≡ Correctness
Description
If CDN errors persist through the last attempt, the script exits with code 1 instead of Bazel’s
actual exit code, contradicting the script’s stated behavior and potentially breaking downstream
error handling.
Code

scripts/github-actions/run-bazel-with-retry.sh[R43-44]

Evidence
The header comment claims the script exits with Bazel’s actual exit code, but after breaking out of
the retry loop it unconditionally exits 1, discarding the previously captured BAZEL_EXIT_CODE.

scripts/github-actions/run-bazel-with-retry.sh[2-6]
scripts/github-actions/run-bazel-with-retry.sh[19-44]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
When retries are exhausted, the script unconditionally `exit 1` rather than returning the last Bazel exit code it captured. This violates the script’s own contract (“exits with Bazel's actual exit code”).

## Issue Context
The script already captures Bazel’s exit code via `BAZEL_EXIT_CODE=${PIPESTATUS[0]}`. Preserve and return that value after the retry loop when it terminates due to max attempts.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[19-44]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit 2ba409c

Results up to commit 61b582b


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Retry check never matches ✓ Resolved 🐞 Bug ≡ Correctness
Description
run-bazel-with-retry.sh searches the log for the literal text "BAZEL_ERROR_PATTERN" (missing $),
so 502/503/504 fetch errors will never be detected and the Bazel run will not be retried.
Code

scripts/github-actions/run-bazel-with-retry.sh[R28-36]

Evidence
The script defines a regex in BAZEL_ERROR_PATTERN, but grep is invoked with the quoted literal
"BAZEL_ERROR_PATTERN" (no $), so it will only match if the log contains that exact text, not the
intended 5xx error lines.

scripts/github-actions/run-bazel-with-retry.sh[16-40]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The retry condition uses `grep -qE "BAZEL_ERROR_PATTERN"` and `grep -E "BAZEL_ERROR_PATTERN"`, which greps for the literal string `BAZEL_ERROR_PATTERN` instead of the regex stored in the `BAZEL_ERROR_PATTERN` variable. This prevents the retry loop from ever triggering on 5xx CDN errors.

## Issue Context
The script defines `BAZEL_ERROR_PATTERN='GET returned 50[234] '`, but never dereferences it when calling `grep`.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[16-36]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended
2. bash -c executes CMD ✓ Resolved 📘 Rule violation ⛨ Security
Description
The workflow passes a single command string into run-bazel-with-retry.sh, which then executes it
via bash -c, enabling shell interpretation/injection and brittle argument parsing. This violates
the safe argument-handling requirement for build/CI scripts.
Code

scripts/github-actions/run-bazel-with-retry.sh[R11-22]

Evidence
PR Compliance ID 16 requires safe argument handling in build/CI scripts. The script builds CMD
from a single string argument and runs it via bash -c, which performs shell interpretation rather
than executing a safely-quoted argv vector.

scripts/github-actions/run-bazel-with-retry.sh[11-22]
.github/workflows/bazel.yml[295-305]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/github-actions/run-bazel-with-retry.sh` executes a caller-provided command string using `bash -c "$CMD"`, which relies on shell parsing and can lead to command injection or incorrect argument handling.

## Issue Context
This script is invoked from `.github/workflows/bazel.yml` with `"${{ inputs.run }}"`, so the called command is provided as a single string. Compliance requires safe argument handling for build/CI scripts.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[11-22]
- .github/workflows/bazel.yml[295-305]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Retries lose Bazel exit ✓ Resolved 🐞 Bug ≡ Correctness
Description
If CDN errors persist through the last attempt, the script exits with code 1 instead of Bazel’s
actual exit code, contradicting the script’s stated behavior and potentially breaking downstream
error handling.
Code

scripts/github-actions/run-bazel-with-retry.sh[R43-44]

Evidence
The header comment claims the script exits with Bazel’s actual exit code, but after breaking out of
the retry loop it unconditionally exits 1, discarding the previously captured BAZEL_EXIT_CODE.

scripts/github-actions/run-bazel-with-retry.sh[2-6]
scripts/github-actions/run-bazel-with-retry.sh[19-44]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
When retries are exhausted, the script unconditionally `exit 1` rather than returning the last Bazel exit code it captured. This violates the script’s own contract (“exits with Bazel's actual exit code”).

## Issue Context
The script already captures Bazel’s exit code via `BAZEL_EXIT_CODE=${PIPESTATUS[0]}`. Preserve and return that value after the retry loop when it terminates due to max attempts.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[19-44]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Comment thread scripts/github-actions/run-bazel-with-retry.sh Outdated
@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit 2ba409c

BAZEL_ERROR_PATTERN='GET returned 50[234] '

for i in $(seq 1 "$BAZEL_MAX_ATTEMPTS"); do
bash -c "$BAZEL_RUN_CMD" 2>&1 | tee "$LOG_FILE"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Tee failures ignored 🐞 Bug ☼ Reliability

run-bazel-with-retry.sh only considers Bazel’s exit code, so if tee fails to write the console log
(e.g., permission/disk-full), the step can still exit 0 and silently drop build/bazel-console.log.
This can hide infrastructure problems and break downstream tooling that parses that log on failures.
Agent Prompt
## Issue description
`run-bazel-with-retry.sh` pipes Bazel output to `tee`, but it only uses Bazel’s exit code (`PIPESTATUS[0]`). If `tee` fails (can’t write the log), the script may still report success and/or proceed with retry logic based on incomplete logs.

## Issue Context
Downstream logic expects `build/bazel-console.log` to be present and readable when diagnosing failures.

## Fix Focus Areas
- scripts/github-actions/run-bazel-with-retry.sh[20-21]

### Implementation guidance
After the pipeline, capture both statuses (Bazel + tee) and fail fast on `tee` errors, e.g.:
- `BAZEL_EXIT_CODE=${PIPESTATUS[0]}`
- `TEE_EXIT_CODE=${PIPESTATUS[1]}`
- if `TEE_EXIT_CODE != 0`, print an error and `exit $TEE_EXIT_CODE` (or a fixed non-zero code).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@titusfortner titusfortner marked this pull request as draft May 25, 2026 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-build Includes scripting, bazel and CI integrations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants