fix(swtbench): prevent build workflow from hanging indefinitely #403
juanmichelini wants to merge 1 commit into main
This commit addresses issue #400 where SWT-Bench image build workflows were hanging indefinitely (2+ hours) at the build step.

Changes:

1. Add preflight step to prune BuildKit cache and verify disk space
   - Matches the existing safeguard in build-swebench-images.yml
   - Prunes cache to 60GB max and checks for 75GB free space
   - Fails early if disk space is insufficient
2. Add timeout-minutes to build steps
   - 30 minutes for main image build (expected ~10 min)
   - 60 minutes for prebaked eval env images
   - Prevents indefinite hangs from blocking runners
3. Add BUILDKIT_RESET_ON_FAILURE=1 environment variable
   - Helps recover from BuildKit failures/corruption
4. Change concurrency group to global (not per-ref)
   - Prevents concurrent builds from interfering with each other
   - Uses cancel-in-progress: true to cancel old runs

Fixes #400

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
This fix will prevent indefinite hangs, but the global concurrency serialization is a significant behavioral change that may delay builds when multiple PRs are active. See inline comments for details.
- group: build-swt-bench-${{ github.ref }}
- cancel-in-progress: false
+ group: build-swt-bench-images
+ cancel-in-progress: true
🔴 Critical - Behavioral Change: This changes from per-ref to global concurrency with cancel-in-progress: true.
Impact:
- Only ONE build can run at a time across the entire repository
- New builds on ANY branch will cancel existing builds on OTHER branches
- Multiple active PRs will compete - newest build cancels older ones
Before: Builds on feature-a and feature-b ran in parallel
After: Build on feature-b cancels the build on feature-a
Question: Was concurrent build interference actually confirmed as the root cause? Have you considered alternatives like:
- Separate BuildKit builder instances per ref
- Per-ref cache volumes
- BuildKit lock file improvements
This is a pragmatic fix but comes with significant tradeoffs for CI throughput.
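One of the alternatives listed above, a separate BuildKit builder instance per ref, could be sketched roughly as follows. This is only an illustration and not part of the PR: `builder_name_for_ref` and the `swt-bench-` prefix are hypothetical names, and the sketch assumes the runner has Docker with the buildx plugin and that `GITHUB_REF` is set by GitHub Actions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): derive a builder name from a Git ref
# so that each ref can get its own BuildKit builder and cache.
builder_name_for_ref() {
  local ref="$1"
  local slug
  # Replace every character outside [A-Za-z0-9_.-] with a dash.
  slug=$(printf '%s' "$ref" | tr -c 'A-Za-z0-9_.-' '-')
  printf 'swt-bench-%s\n' "$slug"
}

# Intended usage in a workflow step (shown as comments, not executed here):
#   NAME=$(builder_name_for_ref "$GITHUB_REF")
#   docker buildx create --name "$NAME" --driver docker-container --use
#   ... run the image build ...
#   docker buildx rm "$NAME"
```

Whether isolated builders would actually remove the interference is untested here; the trade-off is that each builder starts with a cold cache, so builds may take longer than with a shared one.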
USED=$(echo "$LINE" | awk '{print $3}')
FREE=$(echo "$LINE" | awk '{print $4}')
if [ -n "$TOTAL" ] && [ -n "$FREE" ]; then
  PCT=$(( 100 * USED / TOTAL ))
🟠 Important - Root Cause Unknown: The PR description mentions builds started hanging "after 14:12 UTC on 2026-02-06" but doesn't explain what changed at that time.
Without understanding the root cause (BuildKit version? Disk space exhaustion? Cache corruption?), these safeguards are defensive but not diagnostic.
Consider: What will you do if builds start hanging again despite these changes? Having telemetry/logs around the actual failure would help debug future issues.
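As a rough illustration of the telemetry suggested here, a step could print a snapshot of disk and BuildKit state before the build and again on failure. The function name `collect_build_diagnostics` is hypothetical and the set of commands is an assumption about what would be useful; nothing below comes from the PR itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print a timestamped snapshot of
# state that might help explain a hung or failed build.
collect_build_diagnostics() {
  echo "=== build diagnostics $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
  echo "--- disk usage ---"
  df -h / || true
  if command -v docker >/dev/null 2>&1; then
    echo "--- buildx version ---"
    docker buildx version || true
    echo "--- buildx cache usage ---"
    docker buildx du || true
  else
    # Keep the snapshot best-effort on machines without Docker.
    echo "docker not found; skipping BuildKit details"
  fi
}
```

Calling this once before the build step and once in a later step guarded by `if: failure()` would leave a before/after record in the job log, which should make it easier to tell disk exhaustion apart from a BuildKit problem.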
# Prefer newer max-storage flag; fall back to keep-storage if not supported.
if ! docker buildx prune --all --force --max-storage ${KEEP_GB}g; then
  docker buildx prune --all --force --keep-storage ${KEEP_GB}g || true
fi
🟡 Suggestion - Error Handling: The || true on the fallback means if both --max-storage and --keep-storage fail, the script continues without pruning.
This is pragmatic (don't fail builds if pruning fails), but means your "preflight" safeguard is best-effort, not required.
Similarly, if /var/lib/buildkit doesn't exist, the disk check is skipped (line 160).
Recommendation: Document this behavior explicitly - are these checks required safeguards or optional nice-to-haves? The current implementation makes them optional.
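One way to make the required-versus-optional choice explicit is a single switch that decides whether a failed check stops the job. The sketch below is an assumption about how that could look: `require_free_space` and the `PREFLIGHT_STRICT` variable are hypothetical and do not appear in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): compare free space against a minimum.
# PREFLIGHT_STRICT=1 makes the check required; any other value makes it best-effort.
require_free_space() {
  local free_gb="$1" min_gb="$2"
  if [ "$free_gb" -ge "$min_gb" ]; then
    echo "ok: ${free_gb}GB free (minimum ${min_gb}GB)"
    return 0
  fi
  if [ "${PREFLIGHT_STRICT:-0}" = "1" ]; then
    echo "error: only ${free_gb}GB free, need ${min_gb}GB" >&2
    return 1
  fi
  # Best-effort mode: report the shortfall but let the build proceed.
  echo "warning: only ${free_gb}GB free, need ${min_gb}GB; continuing (best-effort)" >&2
  return 0
}
```

With this shape, `PREFLIGHT_STRICT=1 require_free_space "$FREE_GB" 75` fails the step when space is short, while omitting the variable only logs a warning, so the intended behavior is visible in the workflow file rather than implied by `|| true`.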
fi

- name: Build and push SWT-Bench images
  timeout-minutes: 30
🟡 Suggestion - Timeout Justification: timeout-minutes: 30 for builds that normally take ~10 minutes gives a 3x buffer, which is reasonable.
But where does timeout-minutes: 60 for prebaked eval env come from? Is this based on historical data (P95/P99 of build times)?
Recommendation: Add comments documenting typical build times and why these specific timeout values were chosen. This helps future maintainers adjust them appropriately.
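If historical build durations are available, a percentile can be computed with standard tools and used to justify the timeout. The sketch below uses the nearest-rank method; the `percentile` function is hypothetical, it expects an integer percentile, and the durations in the usage note are made-up placeholders, not measurements from this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): nearest-rank percentile of numeric
# durations read from stdin, one value per line. $1 is an integer percentile.
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest rank: smallest k with k >= p/100 * NR (integer ceiling).
      k = int((p * NR + 99) / 100)
      if (k < 1) k = 1
      print v[k]
    }'
}
```

For example, `printf '%s\n' 8 9 10 10 11 12 9 10 14 25 | percentile 95` prints the largest of those ten placeholder values. A timeout could then be set to that percentile plus a margin, with the reasoning recorded in a comment next to `timeout-minutes`.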
Summary
This PR addresses issue #400 where SWT-Bench image build workflows were hanging indefinitely (2+ hours) at the "Build and push SWT-Bench images" step, blocking evaluation runs.
Problem
SWT-Bench builds after 14:12 UTC on 2026-02-06 were freezing with no progress updates, while earlier builds completed successfully in ~10 minutes. The builds were consuming runner resources indefinitely and blocking evaluation pods.
Root Cause Analysis
Comparing the SWT-Bench workflow with the working SWE-Bench workflow revealed several missing safeguards:
Changes
1. Add preflight step to prune BuildKit cache and verify disk space
   - Matches the existing safeguard in build-swebench-images.yml
2. Add timeout-minutes to build steps
3. Add BUILDKIT_RESET_ON_FAILURE=1 environment variable
4. Change concurrency group to global (not per-ref)
   - Uses cancel-in-progress: true to cancel old runs when new ones start
Testing
This is a workflow configuration change. The fix will be validated when the workflow runs in GitHub Actions. The changes align with the working SWE-Bench workflow configuration.
Fixes #400