Skip to content

[build] monitor disk and cache usage throughout a bazel job#17569

Merged
titusfortner merged 6 commits into
trunkfrom
disk_size
May 25, 2026
Merged

[build] monitor disk and cache usage throughout a bazel job#17569
titusfortner merged 6 commits into
trunkfrom
disk_size

Conversation

@titusfortner
Copy link
Copy Markdown
Member

This is all about gathering more info from the logs about how much space and cache are in use throughout a bazel job.

🔗 Related Issues

One alternative to #17568 is to enable caching on RBE runs, but I want to see what the sizes look like. I have another PR lined up for bazel-setup to restrict the newer repository contents cache which is less useful on CI, but I also want to see the real numbers before submitting it.

On one of my runs today the "Remove extra tools to free disk space" step took 10 minutes, which, thankfully turned out to be a major outlier, but the total time for this step adds up with how many bazel jobs we run right now.

💥 What does this PR do?

  • Creates a disk status script that will display current state of the disk and caches
  • moves the cleanup code pieces to separate scripts with additional logging

🔧 Implementation Notes

  • I actually restored the previous free-disk-space.sh and added a couple things from that
  • The extra dot is to run it in the same shell so the script can set $AVAIL_GB value
  • Moving the first disk space check to after checking out the repo so it can access the script, and because we can't really change the size of the repo

🤖 AI assistance

  • No substantial AI assistance used
  • AI assisted (complete below)
    • Tool(s):
    • What was generated:
    • I reviewed all AI output and can explain the change

@selenium-ci selenium-ci added the B-build Includes scripting, bazel and CI integrations label May 25, 2026
@qodo-code-review
Copy link
Copy Markdown
Contributor

Review Summary by Qodo

Refactor disk monitoring and cleanup into reusable scripts

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Refactor disk cleanup into dedicated scripts for better reusability
• Add comprehensive disk and cache usage monitoring throughout CI job
• Consolidate disk space checks with standardized reporting format
• Support both Ubuntu (bash) and Windows (PowerShell) platforms
Diagram
flowchart LR
  A["Inline cleanup code"] -->|Extract| B["free-disk-space.sh<br/>free-disk-space.ps1"]
  C["Scattered disk checks"] -->|Consolidate| D["disk-status.sh"]
  B -->|Called from| E["bazel.yml workflow"]
  D -->|Called from| E
  E -->|Provides| F["Standardized monitoring<br/>at key checkpoints"]

Loading

File Changes

1. scripts/github-actions/disk-status.sh ✨ Enhancement +43/-0

New disk status monitoring script

• New script that captures disk space snapshot with available GB calculation
• Measures Bazel output_base, external cache, repository cache, and bazelisk cache sizes
• Exports AVAIL_GB variable for use in other scripts
• Handles both Ubuntu and Windows path differences

scripts/github-actions/disk-status.sh


2. scripts/github-actions/free-disk-space.sh ✨ Enhancement +55/-0

New Ubuntu disk cleanup script

• New bash script that removes pre-installed tools and Docker images
• Logs time taken and disk space freed for each cleanup step
• Replaces inline cleanup commands from workflow
• Provides detailed before/after disk usage reporting

scripts/github-actions/free-disk-space.sh


3. scripts/github-actions/free-disk-space.ps1 ✨ Enhancement +36/-0

New Windows disk cleanup script

• New PowerShell script for Windows disk cleanup
• Removes language toolchains, SDKs, and WebDriver binaries
• Tracks duration and freed space for each deletion step
• Replaces inline Windows cleanup commands from workflow

scripts/github-actions/free-disk-space.ps1


View more (1)
4. .github/workflows/bazel.yml ✨ Enhancement +20/-69

Refactor workflow to use new disk scripts

• Replace inline disk cleanup steps with calls to new free-disk-space.sh and free-disk-space.ps1
 scripts
• Consolidate scattered disk space checks into calls to disk-status.sh script
• Remove WebDriver cleanup steps (now included in cleanup scripts)
• Simplify disk space validation logic by using exported AVAIL_GB variable
• Update "Disk usage after cache restore" step to use new disk-status script and handle cache setup
 failures
• Change final disk check from error to warning for low disk space condition

.github/workflows/bazel.yml


Grey Divider

Qodo Logo

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 25, 2026

CLA assistant check
All committers have signed the CLA.

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Code Review by Qodo

🐞 Bugs (8) 📘 Rule violations (6) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Disk status step can fail ✓ Resolved 🐞 Bug ☼ Reliability
Description
disk-status.sh runs du -sh "$path"/* | sort -h when the directory exists, which returns a
non-zero status when the glob doesn’t match (e.g., empty directory), and this script is sourced
directly inside multiple workflow steps. That non-zero can abort those steps, turning an
observability checkpoint into a CI failure.
Code

scripts/github-actions/disk-status.sh[R29-35]

Evidence
disk-status.sh performs du over a globbed path without guarding the command’s exit status, and
the workflow sources this script in multiple shell: bash steps; any non-zero status inside the
sourced script can therefore terminate the step.

scripts/github-actions/disk-status.sh[29-37]
.github/workflows/bazel.yml[206-214]
.github/workflows/bazel.yml[239-247]
.github/workflows/bazel.yml[293-300]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`scripts/github-actions/disk-status.sh` is sourced from workflow steps. Inside `measure()`, the `du` pipelines can exit non-zero (e.g., when `$path/*` doesn’t match because the directory is empty, or due to permissions). When sourced, that failure can propagate and fail the workflow step.

### Issue Context
This script is intended for diagnostics only; it should be best-effort and never fail CI.

### Fix Focus Areas
- scripts/github-actions/disk-status.sh[29-37]
- .github/workflows/bazel.yml[206-214]
- .github/workflows/bazel.yml[239-247]
- .github/workflows/bazel.yml[293-300]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. AVAIL_GB used without validation 📘 Rule violation ☼ Reliability
Description
The workflow now sources disk-status.sh and compares AVAIL_GB numerically without ensuring it is
set and numeric, which can cause non-deterministic CI failures with unclear errors if df/env
assumptions break. External/config-derived values should be validated early with actionable failure
messages.
Code

.github/workflows/bazel.yml[R243-245]

Evidence
PR Compliance ID 12 requires validating external/config-derived inputs early and failing with
deterministic, actionable errors. The workflow now performs numeric comparisons on AVAIL_GB
immediately after sourcing disk-status.sh, but the script can produce an empty/non-numeric
AVAIL_GB (and masks some df failures with || true), making the workflow behavior fragile.

.github/workflows/bazel.yml[243-245]
.github/workflows/bazel.yml[297-299]
scripts/github-actions/disk-status.sh[6-17]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`.github/workflows/bazel.yml` relies on `AVAIL_GB` set by sourcing `scripts/github-actions/disk-status.sh` and then performs integer comparisons. If `AVAIL_GB` is empty/non-numeric (e.g., `GITHUB_WORKSPACE` unset or `df` fails), the `[` comparison can fail with a non-actionable error.

## Issue Context
`disk-status.sh` also uses `df -h ... || true`, masking failures while still exporting `AVAIL_GB` based on later `df` output. This makes failures harder to diagnose and can lead to brittle CI.

## Fix Focus Areas
- .github/workflows/bazel.yml[243-245]
- .github/workflows/bazel.yml[297-299]
- scripts/github-actions/disk-status.sh[6-17]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. free-disk-space.sh unsafe rm -rf 📘 Rule violation ⛨ Security
Description
The new cleanup script executes sudo rm -rf $path without --, quoting, or an empty-path guard,
which can cause destructive behavior or unexpected failures when $path is empty or malformed and
violates the requirement for safe argument handling in CI scripts. In particular, callers pass
potentially-empty WebDriver paths, leading to misleading cleanup output today and potential hard
failures if stricter shell settings (e.g., set -e) are introduced later.
Code

scripts/github-actions/free-disk-space.sh[R12-13]

Evidence
PR Compliance ID 14 requires safe argument handling in CI/build scripts (including quoting and using
rm -rf --). In scripts/github-actions/free-disk-space.sh, the script passes empty-default
parameter expansions for WebDriver locations (e.g., ${CHROMEWEBDRIVER:-}, ${EDGEWEBDRIVER:-},
${GECKOWEBDRIVER:-}) into clean, and clean() then runs sudo rm -rf $path directly on
whatever string it receives, without quoting, a -- separator, or an emptiness check; this means an
empty value can result in rm being invoked with no operands (emitting errors) and malformed values
are not safely handled.

scripts/github-actions/free-disk-space.sh[12-13]
scripts/github-actions/free-disk-space.sh[39-42]
scripts/github-actions/free-disk-space.sh[7-17]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/github-actions/free-disk-space.sh` performs privileged deletion using `sudo rm -rf $path` without safe argument handling (no quoting, no `--` separator, and no guard for empty paths). Some callers pass potentially-empty strings via `${VAR:-}` (notably the WebDriver env vars), which can invoke `rm` with no operands (printing errors/misleading output) and becomes riskier or a hard failure under stricter shell error handling.

## Issue Context
- PR Compliance ID 14 requires CI/scripts to use safe argument handling (quoting, `--` separators, and avoiding unsafe patterns).
- The bash script currently forwards potentially-empty WebDriver paths (`${CHROMEWEBDRIVER:-}`, `${EDGEWEBDRIVER:-}`, `${GECKOWEBDRIVER:-}`) into `clean()`.
- The PowerShell version already skips when the path is empty; the bash implementation should match that behavior.

## Fix Focus Areas
- scripts/github-actions/free-disk-space.sh[7-17]
- scripts/github-actions/free-disk-space.sh[39-42]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. Parse failure misread as low disk 🐞 Bug ◔ Observability
Description
When disk-status.sh can’t parse disk availability, it logs an error but then forces AVAIL_GB=0,
so downstream checks interpret the failure as actual low disk space and emit secondary
“insufficient/low disk” messages.
This makes CI logs harder to interpret because the job may fail/warn for “0GB remaining” even though
the root cause is a measurement/parsing failure.
Code

scripts/github-actions/disk-status.sh[R16-19]

Evidence
The script explicitly rewrites a non-numeric AVAIL_GB into 0, and the workflow then uses
AVAIL_GB in numeric comparisons for both the pre-build hard gate and the end-of-job warning, so a
parse failure will reliably trigger those conditions as if disk were actually exhausted.

scripts/github-actions/disk-status.sh[16-19]
.github/workflows/bazel.yml[239-246]
.github/workflows/bazel.yml[293-300]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`disk-status.sh` currently converts an `AVAIL_GB` parse failure into `AVAIL_GB=0` while continuing. Downstream steps then treat that as a real measurement and emit “insufficient disk” / “low disk” messages, obscuring that the underlying problem was an inability to measure disk space.

## Issue Context
The workflow sources (`.`) this script and uses `AVAIL_GB` numerically to gate the build and to emit an end-of-job low-disk warning.

## Fix Focus Areas
- scripts/github-actions/disk-status.sh[16-19]
- .github/workflows/bazel.yml[239-246]
- .github/workflows/bazel.yml[293-300]

## Suggested fix
1. In `disk-status.sh`, on parse failure:
  - Export an explicit validity flag (e.g., `AVAIL_GB_VALID=0`) and keep the raw value for debugging.
  - Avoid synthesizing `AVAIL_GB=0` as if it were a real measurement (or at least ensure callers can distinguish the “invalid” case).
2. In the workflow:
  - Before-build gate: if invalid, fail with a clear message about disk measurement failure (not “0GB available”).
  - Low-disk warning step: if invalid, emit a warning like “could not determine available disk space” and skip the numeric comparison.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Sourced script leaks globals 🐞 Bug ⚙ Maintainability
Description
disk-status.sh is sourced (not executed) in multiple workflow steps to export AVAIL_GB, but it
also leaves behind helper functions and non-exported global variables (external, repos,
bazelisk, loop vars) in the caller shell. This increases the chance of accidental name collisions
and makes later step edits harder to reason about because the step environment is silently mutated
beyond AVAIL_GB.
Code

scripts/github-actions/disk-status.sh[R19-54]

Evidence
The workflow sources the script (so any variables/functions it defines persist in that step’s
shell), and the script defines a function plus multiple non-local variables beyond the intended
AVAIL_GB export.

.github/workflows/bazel.yml[206-214]
.github/workflows/bazel.yml[239-247]
.github/workflows/bazel.yml[293-300]
scripts/github-actions/disk-status.sh[19-54]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`scripts/github-actions/disk-status.sh` is sourced in workflow steps to export `AVAIL_GB`, but it also defines `cache_size()` and assigns additional global variables (`external`, `repos`, `bazelisk`, `label`, `sub`). Because the script is sourced, these names persist in the current shell and can collide with future variables/functions added to those steps.

### Issue Context
The workflow currently sources the script in multiple `bash` run blocks, so any globals/functions defined by the script remain in that step’s shell process.

### Fix Focus Areas
- scripts/github-actions/disk-status.sh[19-54]
- .github/workflows/bazel.yml[206-214]
- .github/workflows/bazel.yml[239-247]
- .github/workflows/bazel.yml[293-300]

### Suggested fix
Keep exporting `AVAIL_GB`, but avoid leaving other globals behind. Options:
1) Define everything inside a function and call it, then `unset -f` that function; or
2) After printing, explicitly `unset` temporary variables and `unset -f cache_size` at the end of the script.

Example approach (2):
- At end of `disk-status.sh`, add:
 - `unset -f cache_size`
 - `unset external repos bazelisk label sub`
 - (leave `AVAIL_GB` exported)
This keeps the intended export while minimizing side effects.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


6. free_mb() value not validated 📘 Rule violation ☼ Reliability
Description
free_mb() parses df output and assumes it will always be on the second line and end with an M;
if df output wraps/changes or parsing fails, it can return empty/non-numeric values that later
break arithmetic expansions and/or log misleading space deltas. CI scripts should defensively
validate external command output and fail with a clear, actionable message when required inputs
can’t be reliably derived.
Code

scripts/github-actions/free-disk-space.sh[R5-17]

Evidence
Rule 16 calls for hardening CI/scripts by avoiding fragile behavior and safely handling external
command output. In scripts/github-actions/free-disk-space.sh, free space is derived via a brittle
df -BM / | awk 'NR==2 {print $4}' | tr -d 'M' pipeline, then reused as numeric input in arithmetic
expansions such as $((before / 1024)) and $((after - before)) without validating that
before/after are present and numeric; because the workflow executes this script directly, any
parse-to-empty/non-numeric result will surface as arithmetic errors (or misleading logs) and can
fail the Ubuntu cleanup/diagnostic step.

scripts/github-actions/free-disk-space.sh[5-17]
.github/workflows/bazel.yml[134-137]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/github-actions/free-disk-space.sh` uses `free_mb()` (parsed from `df`) as numeric input for arithmetic expansions without validating that the parsed value is present and numeric, and the parsing is brittle due to assumptions like `NR==2` and stripping a trailing `M`. If `df` output format changes, wraps, `df` fails, or parsing returns empty/non-numeric output, the later GB/MB arithmetic can fail and/or produce misleading log output.

## Issue Context
This script runs in CI to free disk space and provide diagnostic logging, and it is executed directly by the Ubuntu GitHub Actions workflow step. Because it depends on external command output (`df`) and then performs arithmetic expansions, it should be robust against unexpected output and either (a) fail fast with a clear, actionable error when it cannot determine free space, or (b) explicitly tolerate/handle failures if best-effort behavior is desired.

## Fix Focus Areas
- scripts/github-actions/free-disk-space.sh[5-17]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (8)
7. Repo cache size omitted 🐞 Bug ≡ Correctness
Description
disk-status.sh only reports repository-cache subdirectories ("$repos"/*/), so if the repository
cache contains only files (e.g., downloaded tarballs) or has no subdirectories, it prints no
“Repository Cache” line at all. This silently removes the very cache metric the workflow is trying
to observe.
Code

scripts/github-actions/disk-status.sh[R41-53]

Evidence
The script only enters the else branch when $repos does not exist; when it exists but has no
matching subdirectories, the loop emits nothing and no fallback runs. The repo docs describe
--repository_cache storing downloaded dependencies like tarballs (files), which means a file-only
cache directory is a normal state and would trigger the omission.

scripts/github-actions/disk-status.sh[40-53]
README.md[141-155]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`disk-status.sh` prints repository cache sizes by iterating only over subdirectories (`"$repos"/*/`). If the repository cache directory exists but contains only files (common for `--repository_cache`) or is empty/no subdirs, the loop produces no output and the fallback path is not executed, so the “Repository Cache” line is missing.

### Issue Context
This script is sourced in CI to provide disk/cache checkpoints; missing a cache section makes the logs misleading.

### Fix Focus Areas
- scripts/github-actions/disk-status.sh[41-53]

### Suggested change
Add a fallback when no subdirectories are found, e.g.:
- enable `nullglob` locally and track whether any subdirs were printed; if none, call `cache_size "Repository Cache" "$repos"` (or print an explicit “(empty)” line).
- alternatively, always call `cache_size "Repository Cache" "$repos"` first, then additionally enumerate known subdirs when present.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


8. du errors suppressed in cache_size() 📘 Rule violation ☼ Reliability
Description
cache_size() suppresses du stderr and does not check the command/pipeline exit status, so
permission or I/O failures can be hidden and result in empty sizes being logged as if they were real
measurements in CI. This reduces diagnosability and conflicts with guidance to avoid masking
failures in CI scripts.
Code

scripts/github-actions/disk-status.sh[R33-35]

Evidence
PR Compliance ID 11 requires CI/scripts to avoid masking failures, yet the cache_size()
implementation redirects du stderr to /dev/null and then pipes into awk without validating the
computed size before printing. If du fails or produces no output, awk can still succeed and
size becomes empty, causing the logs to show the cache-size label with a blank/incorrect value,
making the output look like a successful measurement even when it was not.

scripts/github-actions/disk-status.sh[33-35]
scripts/github-actions/disk-status.sh[30-38]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/github-actions/disk-status.sh` computes cache sizes using `du` while suppressing errors (`2>/dev/null`) and not checking exit status (including the pipeline behavior where `awk` can succeed even if `du` fails). This can hide permission/I/O failures and produce misleading blank cache-size values in CI logs.

## Issue Context
This script is used as a CI diagnostic checkpoint to support disk/cache investigations, so failures should be visible and the output should remain deterministic, trustworthy, and actionable rather than silently blank.

## Fix Focus Areas
- scripts/github-actions/disk-status.sh[32-36]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


9. Low disk no longer fails 🐞 Bug ☼ Reliability
Description
The workflow’s final low-disk check now only emits a warning and does not fail the job when
available space is critically low (<5GB). This can let CI pass while masking disk regressions that
increase flakiness in later runs.
Code

.github/workflows/bazel.yml[R293-300]

Evidence
The current 'Low Disk Warning' step only prints a warning with no exit, while the pre-build disk
gate explicitly errors and exits on low disk, demonstrating the behavioral regression at the end of
the job.

.github/workflows/bazel.yml[293-300]
.github/workflows/bazel.yml[239-247]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The end-of-job disk check no longer exits non-zero when disk space is below the critical threshold. That reduces the signal for disk regressions.

### Issue Context
There is still a pre-build hard failure on low disk, but the post-build check is useful to detect disk growth/regressions during the job.

### Fix Focus Areas
- .github/workflows/bazel.yml[293-300]

### Suggested fix
- Either restore failure semantics for `<5GB` (use `::error::` + `exit 1`), or make it configurable (e.g., an input like `fail-on-low-disk` defaulting to true for trunk).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


10. disk-status.sh what-only comment 📘 Rule violation ⚙ Maintainability
Description
The new header comment in disk-status.sh primarily describes what the script does (and contains an
incomplete parenthesis) rather than explaining the rationale/intent. This reduces maintainability
because future readers won’t learn why the snapshot/export pattern is needed in the workflow.
Code

scripts/github-actions/disk-status.sh[R2-5]

Evidence
PR Compliance ID 7 requires comments to explain rationale rather than restating behavior. The added
header comment is a functional description of printing a disk snapshot and exporting AVAIL_GB,
without explaining the motivation/constraints for doing so.

AGENTS.md: Comments Should Explain Why, Not What
scripts/github-actions/disk-status.sh[2-5]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The comment block at the top of `scripts/github-actions/disk-status.sh` describes what the script does rather than why it exists/why it is designed to be sourced to export `AVAIL_GB`.

## Issue Context
Per the repo guidelines, comments should capture rationale/constraints/intent (the “why”), while code/naming covers the “what”. This file is newly introduced in this PR.

## Fix Focus Areas
- scripts/github-actions/disk-status.sh[2-5]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


11. Hardcoded Bazel output_base 🐞 Bug ◔ Observability
Description
disk-status.sh hardcodes the Bazel output_base location ("$HOME/.bazel" on non-Windows), so the
snapshot can miss the real Bazel output tree and report "(not present)" or incorrect sizes when
output_base differs. This reduces the usefulness of the new diagnostics when output_base is
configured/overridden (which this repo already supports).
Code

scripts/github-actions/disk-status.sh[R19-27]

Evidence
The workflow only explicitly pins output-base on Windows; on other OSes it may vary. The repo
itself documents/uses dynamic discovery of output_base via bazel info, showing hardcoding is not
reliable long-term.

.github/workflows/bazel.yml[159-170]
dotnet/update-deps.sh[6-16]
AGENTS.md[40-43]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`scripts/github-actions/disk-status.sh` hardcodes `output_base` paths (`$HOME/.bazel` on non-Windows). This can cause the disk snapshot to omit the actual Bazel output tree and external cache usage, reducing diagnostic accuracy.

### Issue Context
The repo already treats `output_base` as configurable/variable and uses `bazel info output_base` elsewhere.

### Fix Focus Areas
- scripts/github-actions/disk-status.sh[19-27]

### Suggested fix
- Prefer to *discover* paths when Bazel is available:
 - `output_base=$(cd "$GITHUB_WORKSPACE" && bazel info output_base 2>/dev/null || true)`
 - If available, also consider `repos=$(bazel info repository_cache ...)` / `bazelisk` via known cache envs.
- Keep the current hardcoded fallbacks only if `bazel info` fails (or before Bazel is installed).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


12. Cache fallback warning skipped 🐞 Bug ◔ Observability
Description
The explicit warning about cached Bazel setup failure is only emitted inside the "Disk status after
cache restore" step, which is skipped on macOS. As a result, macOS jobs can silently fall back to
the uncached setup without the workflow-level warning that used to make this visible.
Code

.github/workflows/bazel.yml[R206-213]

Evidence
The fallback warning is printed only in a step gated off on macOS, while macOS workflows do invoke
this shared bazel workflow; therefore macOS runs won't receive this explicit warning even when
fallback occurs.

.github/workflows/bazel.yml[159-213]
.github/workflows/ci-java.yml[34-47]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When `setup-bazel-cached` fails, the workflow falls back to the uncached setup. The warning message about that fallback is currently printed only in the `Disk status after cache restore` step, but that step is gated to `inputs.os != 'macos'`, so macOS runs won't see the warning.

### Issue Context
This is a workflow-level observability regression for macOS jobs that use this shared bazel workflow.

### Fix Focus Areas
- .github/workflows/bazel.yml[206-213]

### Suggested fix
Move the fallback warning to a separate step (or an earlier always-run step) that is **not** gated to non-macOS, e.g.:
- Add a small bash step after the setup steps:
 - `if: steps.setup-bazel-cached.outcome == 'failure'`
 - `run: echo "::warning::..."`
Keep the disk-status snapshot step gated as desired.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


13. Disk-status runs on macOS 🐞 Bug ☼ Reliability
Description
The "Low Disk Warning" step sources disk-status.sh unconditionally, so macOS jobs now execute the
full disk snapshot (including du/sort over cache directories) even though earlier disk-status steps
intentionally skip macOS. This can add unnecessary time and/or introduce end-of-job failures on
macOS runs from what is meant to be a diagnostic check.
Code

.github/workflows/bazel.yml[R293-299]

Evidence
The workflow skips disk-status steps on macOS elsewhere, but still sources the same script at the
end for all OSes; macOS workflows exist and call this shared bazel workflow, and disk-status.sh
performs a potentially heavy du/sort scan when directories exist.

.github/workflows/bazel.yml[206-213]
.github/workflows/bazel.yml[293-300]
.github/workflows/ci-java.yml[34-47]
scripts/github-actions/disk-status.sh[29-35]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Low Disk Warning` runs on all OSes, but it sources `scripts/github-actions/disk-status.sh` which the workflow otherwise avoids running on macOS (`if: inputs.os != 'macos'`). This means macOS jobs now run the heavier disk snapshot unexpectedly at the end of the job.

### Issue Context
- Other disk-status checkpoints are explicitly skipped on macOS.
- The low-disk step only needs `AVAIL_GB`, not the full `du` breakdown.

### Fix Focus Areas
- .github/workflows/bazel.yml[293-300]

### Suggested fix
Either:
1) Add `if: always() && inputs.os != 'macos'` to `Low Disk Warning` (and optionally keep the old lightweight macOS-only `df` check inline), **or**
2) Split `disk-status.sh` into a lightweight `disk-avail.sh` (just computes/exports `AVAIL_GB`) and a heavier `disk-status.sh` (does `du`), and use the lightweight one in `Low Disk Warning` for all OSes.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


14. PowerShell delete errors hidden 📘 Rule violation ☼ Reliability
Description
The disk cleanup scripts free-disk-space.ps1 and free-disk-space.sh blanket-suppress errors
during deletion/pruning (-ErrorAction SilentlyContinue and || true), which can hide cleanup
failures and prevent actionable diagnostics in CI logs. This reduces CI robustness and can make the
reported freed-space delta misleading, violating guidance to avoid error swallowing in build/CI
scripts.
Code

scripts/github-actions/free-disk-space.ps1[R5-13]

Evidence
PR Compliance ID 15 requires robust error handling and explicitly flags blanket error suppression as
a failure mode. In the PowerShell script, the cleanup function uses `Remove-Item ... -ErrorAction
SilentlyContinue`, which discards deletion errors entirely and obscures why disk space may not
improve; in the shell script, docker image prune -af >/dev/null 2>&1 || true ignores any prune
failure, allowing the step to silently do nothing (or partially fail) while still reporting a
freed-space delta.

scripts/github-actions/free-disk-space.ps1[5-13]
scripts/github-actions/free-disk-space.sh[44-46]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The disk cleanup scripts suppress errors in ways that can hide real cleanup failures: `scripts/github-actions/free-disk-space.ps1` uses `Remove-Item ... -ErrorAction SilentlyContinue`, and `scripts/github-actions/free-disk-space.sh` uses `docker image prune ... >/dev/null 2>&1 || true`. This makes CI logs non-actionable when cleanup fails and can lead to misleading “freed space” reporting.

## Issue Context
Compliance (PR Compliance ID 15) requires robust error handling in CI/scripts and discourages blanket error suppression patterns that eliminate visibility into failures.

## Fix Focus Areas
- scripts/github-actions/free-disk-space.ps1[5-13]
- scripts/github-actions/free-disk-space.sh[44-46]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

15. Rounded delta can mislead 🐞 Bug ◔ Observability
Description
The Windows cleanup script rounds free space to whole MB before subtracting, so the reported “freed
XMB” can be off by up to ~1MB (or appear slightly negative) due to rounding and concurrent disk
activity, reducing diagnostic accuracy.
Code

scripts/github-actions/free-disk-space.ps1[3]

Evidence
FreeMB rounds (Get-PSDrive C).Free to whole MB and the formatted log prints the delta as
($afterMB - $beforeMB), so rounding occurs before subtraction.

scripts/github-actions/free-disk-space.ps1[3-14]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`FreeMB` rounds to whole MB before computing the delta (`afterMB - beforeMB`). This can slightly skew the logged freed-space delta (up to ~1MB) and can occasionally show a small negative value if the filesystem free space fluctuates during the operation.

### Issue Context
This output is used for CI diagnostics; improving accuracy makes the logs more trustworthy.

### Fix Focus Areas
- scripts/github-actions/free-disk-space.ps1[3-14]

### Suggested fix
- Measure in bytes (or unrounded MB) for `before`/`after`, compute the delta first, and only round for display.
- Optionally clamp the displayed freed delta at `>= 0` if you want to avoid confusing negative values in logs.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread scripts/github-actions/free-disk-space.sh Outdated
Comment thread .github/workflows/bazel.yml
Comment thread scripts/github-actions/disk-status.sh Outdated
@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit 3fa553c

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit f471125

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit 61b962f

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit 0afe908

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit 7d7fb92

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit 962cb10

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review Bot commented May 25, 2026

Persistent review updated to latest commit bebf1b5

@titusfortner titusfortner merged commit 26794a0 into trunk May 25, 2026
27 checks passed
@titusfortner titusfortner deleted the disk_size branch May 25, 2026 12:39
titusfortner added a commit that referenced this pull request May 25, 2026
* [build] monitor disk and cache usage throughout a bazel job

* make the build not need android
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-build Includes scripting, bazel and CI integrations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants