
fix(swtbench): prevent build workflow from hanging indefinitely#403

Open
juanmichelini wants to merge 1 commit into main from openhands/fix-swtbench-build-hanging-400

Conversation

@juanmichelini
Collaborator

Summary

This PR addresses issue #400 where SWT-Bench image build workflows were hanging indefinitely (2+ hours) at the "Build and push SWT-Bench images" step, blocking evaluation runs.

Problem

SWT-Bench builds after 14:12 UTC on 2026-02-06 were freezing with no progress updates, while earlier builds completed successfully in ~10 minutes. The builds were consuming runner resources indefinitely and blocking evaluation pods.

Root Cause Analysis

Comparing the SWT-Bench workflow with the working SWE-Bench workflow revealed several missing safeguards:

  1. No preflight step - SWE-Bench has a preflight step that prunes BuildKit cache and verifies disk space before building
  2. No BUILDKIT_RESET_ON_FAILURE - SWE-Bench sets this env var to help recover from BuildKit failures
  3. No timeout - Neither workflow had timeouts, but SWT-Bench was more susceptible to hangs
  4. Per-ref concurrency - Concurrent builds on different refs could interfere with each other

Changes

  1. Add preflight step to prune BuildKit cache and verify disk space

    • Matches the existing safeguard in build-swebench-images.yml
    • Prunes cache to 60GB max and checks for 75GB free space
    • Fails early if disk space is insufficient
  2. Add timeout-minutes to build steps

    • 30 minutes for main image build (expected ~10 min)
    • 60 minutes for prebaked eval env images
    • Prevents indefinite hangs from blocking runners
  3. Add BUILDKIT_RESET_ON_FAILURE=1 environment variable

    • Helps recover from BuildKit failures/corruption
  4. Change concurrency group to global (not per-ref)

    • Prevents concurrent builds from interfering with each other
    • Uses cancel-in-progress: true to cancel old runs when new ones start
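Taken together, the safeguards above would look roughly like the following workflow fragment. This is an illustrative sketch, not the file's exact contents: the job name, step bodies, and `run` commands are placeholders, while the concurrency group, env var, cache/disk limits, and timeout values come from this PR.

```yaml
concurrency:
  # Global group: at most one SWT-Bench build runs repo-wide,
  # and a newer run cancels any in-progress one.
  group: build-swt-bench-images
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    env:
      # Reset the BuildKit builder after a failure to recover from
      # cache corruption (mirrors the SWE-Bench workflow).
      BUILDKIT_RESET_ON_FAILURE: "1"
    steps:
      - name: Preflight (prune BuildKit cache, verify disk space)
        run: |
          # Prune the cache down to 60 GB, then fail early
          # if less than 75 GB of disk space is free.
          docker buildx prune --all --force --keep-storage 60g || true
      - name: Build and push SWT-Bench images
        timeout-minutes: 30   # normal builds finish in ~10 min
        run: ./build-images.sh          # placeholder
      - name: Build prebaked eval env images
        timeout-minutes: 60
        run: ./build-eval-env-images.sh # placeholder
```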

Testing

This is a workflow configuration change. The fix will be validated when the workflow runs in GitHub Actions. The changes align with the working SWE-Bench workflow configuration.

Fixes #400


Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini marked this pull request as draft February 25, 2026 23:24
@juanmichelini juanmichelini marked this pull request as ready for review February 25, 2026 23:24
Collaborator

@all-hands-bot all-hands-bot left a comment


This fix will prevent indefinite hangs, but the global concurrency serialization is a significant behavioral change that may delay builds when multiple PRs are active. See inline comments for details.

Before:
  group: build-swt-bench-${{ github.ref }}
  cancel-in-progress: false
After:
  group: build-swt-bench-images
  cancel-in-progress: true

🔴 Critical - Behavioral Change: This changes from per-ref to global concurrency with cancel-in-progress: true.

Impact:

  • Only ONE build can run at a time across the entire repository
  • New builds on ANY branch will cancel existing builds on OTHER branches
  • Multiple active PRs will compete - newest build cancels older ones

Before: Builds on feature-a and feature-b ran in parallel
After: Build on feature-b cancels the build on feature-a

Question: Was concurrent build interference actually confirmed as the root cause? Have you considered alternatives like:

  • Separate BuildKit builder instances per ref
  • Per-ref cache volumes
  • BuildKit lock file improvements

This is a pragmatic fix but comes with significant tradeoffs for CI throughput.
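One middle-ground alternative (hypothetical, not part of this PR) would keep per-ref groups but cancel superseded runs on the same ref, preserving cross-branch parallelism while still avoiding pile-ups of stale builds:

```yaml
concurrency:
  # Per-ref group: builds on different branches still run in parallel,
  # but a new push to the same ref cancels its own stale build.
  group: build-swt-bench-${{ github.ref }}
  cancel-in-progress: true
```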

TOTAL=$(echo "$LINE" | awk '{print $2}')   # total 1K blocks (column 2 of POSIX df output)
USED=$(echo "$LINE" | awk '{print $3}')
FREE=$(echo "$LINE" | awk '{print $4}')
if [ -n "$TOTAL" ] && [ -n "$FREE" ]; then
  PCT=$(( 100 * USED / TOTAL ))

🟠 Important - Root Cause Unknown: The PR description mentions builds started hanging "after 14:12 UTC on 2026-02-06" but doesn't explain what changed at that time.

Without understanding the root cause (BuildKit version? Disk space exhaustion? Cache corruption?), these safeguards are defensive but not diagnostic.

Consider: What will you do if builds start hanging again despite these changes? Having telemetry/logs around the actual failure would help debug future issues.
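The quoted preflight disk check can be sketched end-to-end as follows. The mount point, variable names, and output format here are assumptions for illustration; the 75 GB free-space requirement comes from the PR description, and the workflow checks the BuildKit directory rather than `/`.

```shell
# Sketch of the preflight disk check (illustrative, not the PR's exact script).
MOUNT="/"                            # the workflow targets /var/lib/buildkit
MIN_FREE_KB=$((75 * 1024 * 1024))    # 75 GB expressed in 1K blocks

# `df -P` guarantees one POSIX-format data line:
#   Filesystem 1024-blocks Used Available Capacity Mounted-on
LINE=$(df -P "$MOUNT" | tail -n 1)
TOTAL=$(echo "$LINE" | awk '{print $2}')
USED=$(echo "$LINE" | awk '{print $3}')
FREE=$(echo "$LINE" | awk '{print $4}')

if [ -n "$TOTAL" ] && [ -n "$FREE" ]; then
  PCT=$(( 100 * USED / TOTAL ))
  echo "disk: ${PCT}% used, ${FREE} KB available"
  if [ "$FREE" -lt "$MIN_FREE_KB" ]; then
    # In the workflow this is where the job would fail early.
    echo "not enough free space for the build"
    DISK_OK=0
  else
    DISK_OK=1
  fi
fi
```

Note that if `df` prints nothing for the mount point, the check is silently skipped, which matches the best-effort behavior the reviewer flags below.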

# Prefer newer max-storage flag; fall back to keep-storage if not supported.
if ! docker buildx prune --all --force --max-storage ${KEEP_GB}g; then
  docker buildx prune --all --force --keep-storage ${KEEP_GB}g || true
fi

🟡 Suggestion - Error Handling: The || true on the fallback means if both --max-storage and --keep-storage fail, the script continues without pruning.

This is pragmatic (don't fail builds if pruning fails), but means your "preflight" safeguard is best-effort, not required.

Similarly, if /var/lib/buildkit doesn't exist, the disk check is skipped (line 160).

Recommendation: Document this behavior explicitly - are these checks required safeguards or optional nice-to-haves? The current implementation makes them optional.

fi

- name: Build and push SWT-Bench images
timeout-minutes: 30

🟡 Suggestion - Timeout Justification: timeout-minutes: 30 for builds that normally take ~10 minutes gives 3x buffer, which is reasonable.

But where does timeout-minutes: 60 for prebaked eval env come from? Is this based on historical data (P95/P99 of build times)?

Recommendation: Add comments documenting typical build times and why these specific timeout values were chosen. This helps future maintainers adjust them appropriately.



Development

Successfully merging this pull request may close these issues.

SWT-Bench image build workflows hanging indefinitely (2+ hours)

3 participants