Commit 154c4c2

MarcCote, matheper, and xingdi-eric-yuan authored
Adding r2egym support (#232)
* Pull changes from #107 and #212
* Add a workspace class. Refactoring + tests
* Refactor SWE-Bench and make sure all 500 tasks can be solved.
* Adding r2egym support
* Pass revision id to parent class
* Tests should check for default SWE-bench/SWE-bench_Verified
* Make sure we apply any changes needed for setting up the environment.
* Rename RemoteWorkspace -> Workspace
* Fix typos.
* Use better delimiter for here-document + add explanation to workspace.write_file
* Show hidden files when listing files with workspace
* Fix interact_with_pdb test
* Raise if git diff fails
* Cleanup
* Update test, now log file is named debug_gym.log
* Adding r2egym support
* Add tests for r2egym + Refactor tests
* GH Action
* Add step to download and combine coverage artifacts for multiple tests
* Try current dir
* Update tests.yml
* Update tests.yml

---------

Co-authored-by: Matheus Pereira <matpereira@microsoft.com>
Co-authored-by: Xingdi (Eric) Yuan <xingdi-eric-yuan@users.noreply.github.com>
1 parent 689199a commit 154c4c2

15 files changed: +825 -254 lines changed

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+
+name: Run env tests.
+inputs:
+  name:
+    description: 'Name for the coverage file.'
+    required: true
+  changed-files:
+    description: 'Files to look for changes.'
+    required: true
+  test-files:
+    description: 'Test files to run.'
+    required: true
+
+runs:
+  using: "composite"
+  steps:
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: '3.12'
+        cache: 'pip'
+    - name: Install dependencies
+      shell: bash
+      run: |
+        pip install --upgrade pip
+        pip install -e '.[dev]'
+    - name: Run tests
+      env:
+        FORCE_DOCKER_TERMINAL: false
+      shell: bash
+      run: |
+        DEBUG_GYM_DEBUG=1 pytest ${{ inputs.test-files }} -vv -n 16 --timeout=600 --cov=debug_gym --cov-report=term-missing
+    - name: Store coverage report
+      uses: actions/upload-artifact@v4
+      with:
+        name: .coverage-${{ inputs.name }}
+        path: .coverage
+        if-no-files-found: error
+        include-hidden-files: true
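The action declares a `changed-files` input, but none of the steps shown in this diff reference it. As a rough illustration only, the check implied by the input's description ("Files to look for changes.") could look like the sketch below. The helper name `any_changed` is hypothetical and not part of the repository, and `fnmatchcase` is a stand-in whose glob semantics may differ from whatever matcher a real workflow would use.

```python
from fnmatch import fnmatchcase


def any_changed(changed_paths: list[str], patterns: list[str]) -> bool:
    """Return True if at least one changed path matches at least one pattern.

    Hypothetical helper: it only illustrates the idea of gating a test job
    on a list of watched files.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in patterns
    )


# Watched paths copied from the `changed-files` list of the R2E-Gym job.
R2EGYM_PATTERNS = [
    "debug_gym/gym/envs/r2egym.py",
    "tests/gym/envs/test_r2egym.py",
]

if __name__ == "__main__":
    print(any_changed(["debug_gym/gym/envs/r2egym.py"], R2EGYM_PATTERNS))
    print(any_changed(["README.md"], R2EGYM_PATTERNS))
```

Note that with `fnmatchcase` a `*` also matches `/`, so a pattern such as `debug_gym/gym/envs/swe_*.py` would match files in nested directories as well.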

.github/workflows/tests.yml

Lines changed: 95 additions & 21 deletions
@@ -10,6 +10,63 @@ on:
     - cron: '30 8 * * *'
 
 jobs:
+  test-swebench:
+    name: "Testing SWE-Bench"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        language: [ "python" ]
+        os: [ubuntu-latest]
+
+    steps:
+      - uses: actions/checkout@v5
+      - uses: ./.github/actions/test-if-changes
+        with:
+          name: swebench
+          test-files: "tests/gym/envs/test_swe_bench.py"
+          changed-files: |
+            debug_gym/gym/envs/swe_bench.py
+            tests/gym/envs/test_swe_bench.py
+
+  test-swesmith:
+    name: "Testing SWE-Smith"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        language: [ "python" ]
+        os: [ubuntu-latest]
+
+    steps:
+      - uses: actions/checkout@v5
+      - uses: ./.github/actions/test-if-changes
+        with:
+          name: swesmith
+          test-files: "tests/gym/envs/test_swe_smith.py"
+          changed-files: |
+            debug_gym/gym/envs/swe_smith.py
+            tests/gym/envs/test_swe_smith.py
+
+  test-r2egym:
+    name: "Testing R2E-Gym"
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        language: [ "python" ]
+        os: [ubuntu-latest]
+
+    steps:
+      - uses: actions/checkout@v5
+      - uses: ./.github/actions/test-if-changes
+        with:
+          name: r2egym
+          test-files: "tests/gym/envs/test_r2egym.py"
+          changed-files: |
+            debug_gym/gym/envs/r2egym.py
+            tests/gym/envs/test_r2egym.py
+
   tests:
     name: Test
     runs-on: ${{ matrix.os }}
@@ -21,7 +78,7 @@ jobs:
 
     steps:
       - name: Checkout repository
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
@@ -31,28 +88,45 @@ jobs:
         run: |
           pip install --upgrade pip
           pip install -e '.[dev]'
-      - name: Get changed files related to SWE-Bench or SWE-Smith
-        id: changed-files-specific
-        uses: tj-actions/changed-files@v46.0.5
-        with:
-          files: |
-            debug_gym/gym/envs/swe_*.py
-            tests/gym/envs/test_swe_*.py
       - name: Test - PR - Fast
-        if: github.event_name == 'pull_request' && steps.changed-files-specific.outputs.any_changed != 'true'
-        env:
-          FORCE_DOCKER_TERMINAL: false
-        run: |
-          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=600
-      - name: Test - PR - Slow
-        if: github.event_name == 'pull_request' && steps.changed-files-specific.outputs.any_changed == 'true'
         env:
           FORCE_DOCKER_TERMINAL: false
         run: |
-          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 --cov=debug_gym --cov-report=term-missing --cov-fail-under=85 --timeout=600
-      - name: Test - main
-        if: github.event_name != 'pull_request'
-        env:
-          FORCE_DOCKER_TERMINAL: false
+          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith and not test_r2egym" --cov=debug_gym --cov-report=term-missing --timeout=600
+      - name: Store coverage report
+        uses: actions/upload-artifact@v4
+        with:
+          name: .coverage-main
+          path: .coverage
+          if-no-files-found: error
+          include-hidden-files: true
+
+  report-coverage:
+    runs-on: ubuntu-latest
+    needs: [tests, test-swebench, test-swesmith, test-r2egym]
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v5
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+          cache: 'pip'
+      - name: Install dependencies
+        run: pip install coverage
+      - name: Download coverage reports
+        uses: actions/download-artifact@v4
+      - name: Combine and report coverage.
         run: |
-          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 --cov=debug_gym --cov-report=term-missing --cov-fail-under=85 --timeout=600
+          ls -la
+          # The artifacts are downloaded as directories, but coverage combine expects files
+          # Move the .coverage files from directories to the current directory
+          for dir in .coverage-*; do
+            if [ -d "$dir" ] && [ -f "$dir/.coverage" ]; then
+              cp "$dir/.coverage" "${dir}.coverage"
+              echo "Moved coverage file from $dir to ${dir}.coverage"
+            fi
+          done
+          ls -la .*.coverage
+          coverage combine --keep .*.coverage
+          coverage report --fail-under=85
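The shell loop in the `report-coverage` job copies each downloaded artifact's `.coverage` data file next to its directory (for example `.coverage-main/.coverage` becomes `.coverage-main.coverage`) before `coverage combine` runs. A minimal Python sketch of that same flattening step follows, mainly to make the naming scheme explicit. The function name `flatten_coverage_artifacts` is an assumption for illustration and does not exist in the repository; the sketch does not invoke `coverage` itself.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_coverage_artifacts(root: Path) -> list[Path]:
    """Copy <root>/.coverage-NAME/.coverage to <root>/.coverage-NAME.coverage.

    Mirrors the shell loop in the workflow: entries that are not directories,
    or directories without a `.coverage` file, are skipped.
    """
    copied = []
    for artifact_dir in sorted(root.glob(".coverage-*")):
        data_file = artifact_dir / ".coverage"
        if artifact_dir.is_dir() and data_file.is_file():
            target = root / f"{artifact_dir.name}.coverage"
            shutil.copy(data_file, target)
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Simulate two downloaded artifacts in a scratch directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("main", "r2egym"):
            artifact = root / f".coverage-{name}"
            artifact.mkdir()
            (artifact / ".coverage").write_bytes(b"placeholder")
        for path in flatten_coverage_artifacts(root):
            print(path.name)
```

As in the workflow, the files are copied rather than moved, so the original artifact directories stay in place.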

debug_gym/gym/envs/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,7 @@
 from debug_gym.gym.envs.aider import AiderBenchmarkEnv
 from debug_gym.gym.envs.env import RepoEnv, TooledEnv
 from debug_gym.gym.envs.mini_nightmare import MiniNightmareEnv
+from debug_gym.gym.envs.r2egym import R2EGymEnv
 from debug_gym.gym.envs.swe_bench import SWEBenchEnv
 from debug_gym.gym.envs.swe_smith import SWESmithEnv
 
@@ -17,5 +18,7 @@ def select_env(env_type: str = None) -> type[RepoEnv]:
             return SWESmithEnv
         case "mini_nightmare":
             return MiniNightmareEnv
+        case "r2egym":
+            return R2EGymEnv
         case _:
             raise ValueError(f"Unknown benchmark {env_type}")

debug_gym/gym/envs/configs/r2egym.yaml

Lines changed: 17 additions & 0 deletions
Large diffs are not rendered by default.

debug_gym/gym/envs/env.py

Lines changed: 4 additions & 0 deletions
@@ -278,6 +278,10 @@ def _prepare_entrypoint(entrypoint):
             entrypoint_list[2] = f"$(which {entrypoint_list[2]})"
             entrypoint_list = entrypoint_list[:2] + ["python"] + entrypoint_list[2:]
 
+        elif "xvfb" in entrypoint:
+            # parse "xvfb-run --auto-servernum .venv/bin/python -W ignore -m pytest -rA r2e_tests"
+            return entrypoint
+
         # For non-python commands, ensure we have the absolute path to the Python executable
         # and explicitly run it through Python for consistent execution behavior.
         elif entrypoint_list[0] != "python":
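The new branch returns the entrypoint untouched whenever the substring `xvfb` appears anywhere in it. The sketch below contrasts that substring test with a stricter first-token test. Both helper names are hypothetical, and the stricter variant is only an alternative for comparison, not something the commit implements.

```python
import shlex


def matches_xvfb_substring(entrypoint: str) -> bool:
    """The check used in the diff: true if "xvfb" occurs anywhere."""
    return "xvfb" in entrypoint


def starts_with_xvfb_run(entrypoint: str) -> bool:
    """Hypothetical stricter check: the first shell token must be xvfb-run."""
    tokens = shlex.split(entrypoint)
    return bool(tokens) and tokens[0] == "xvfb-run"


if __name__ == "__main__":
    # Entrypoint quoted in the diff comment.
    r2e_entrypoint = (
        "xvfb-run --auto-servernum .venv/bin/python -W ignore -m pytest -rA r2e_tests"
    )
    # Made-up entrypoint that merely mentions xvfb in a file name.
    other_entrypoint = "python tests/test_xvfb_helpers.py"
    for cmd in (r2e_entrypoint, other_entrypoint):
        print(matches_xvfb_substring(cmd), starts_with_xvfb_run(cmd))
```

The second command shows the trade-off: the substring test is simpler but would also bypass entrypoint rewriting for any command that happens to contain `xvfb`.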
