Commit 5353750

Merge branch 'main' into hitl_agent
2 parents bc9dcfd + f328f9e commit 5353750

72 files changed, 4711 additions and 3197 deletions

.github/workflows/tests.yml

Lines changed: 17 additions & 5 deletions
@@ -12,11 +12,12 @@ on:
1212
jobs:
1313
tests:
1414
name: Test
15-
runs-on: ubuntu-latest
15+
runs-on: ${{ matrix.os }}
1616
strategy:
1717
fail-fast: false
1818
matrix:
1919
language: [ "python" ]
20+
os: [ubuntu-latest]
2021

2122
steps:
2223
- name: Checkout repository
@@ -30,11 +31,22 @@ jobs:
3031
run: |
3132
pip install --upgrade pip
3233
pip install -e '.[dev]'
33-
- name: Test with pytest - PR
34-
if: github.event_name == 'pull_request'
34+
- name: Get changed files related to SWE-Bench or SWE-Smith
35+
id: changed-files-specific
36+
uses: tj-actions/changed-files@v46.0.5
37+
with:
38+
files: |
39+
debug_gym/gym/envs/swe_*.py
40+
tests/gym/envs/test_swe_*.py
41+
- name: Test - PR - Fast
42+
if: github.event_name == 'pull_request' && steps.changed-files-specific.outputs.any_changed != 'true'
43+
run: |
44+
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=600
45+
- name: Test - PR - Slow
46+
if: github.event_name == 'pull_request' && steps.changed-files-specific.outputs.any_changed == 'true'
3547
run: |
36-
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
37-
- name: Test with pytest
48+
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 --cov=debug_gym --cov-report=term-missing --cov-fail-under=85 --timeout=600
49+
- name: Test - main
3850
if: github.event_name != 'pull_request'
3951
run: |
4052
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 --cov=debug_gym --cov-report=term-missing --cov-fail-under=85 --timeout=600
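The gating introduced in this workflow can be sketched in Python as follows. This is only an illustration of the decision logic, not code from the repository: `select_test_step` is a hypothetical helper, and `fnmatchcase` only approximates the glob matching performed by the `tj-actions/changed-files` action.

```python
from fnmatch import fnmatchcase

# Patterns copied from the `files:` input of the changed-files step.
SWE_PATTERNS = [
    "debug_gym/gym/envs/swe_*.py",
    "tests/gym/envs/test_swe_*.py",
]


def select_test_step(event_name, changed_files):
    """Return the name of the workflow step that would run (hypothetical helper)."""
    if event_name != "pull_request":
        # Pushes to main always run the full suite.
        return "Test - main"
    any_changed = any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in SWE_PATTERNS
    )
    # SWE-Bench / SWE-Smith changes trigger the slow, full test run on PRs;
    # otherwise those tests are skipped via the `-k` filter.
    return "Test - PR - Slow" if any_changed else "Test - PR - Fast"


print(select_test_step("pull_request", ["debug_gym/gym/envs/swe_bench.py"]))
print(select_test_step("pull_request", ["README.md"]))
```

Exactly one of the three test steps runs for any event, since the two PR conditions are complementary.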

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -184,3 +184,4 @@ vscode/out
184184
vscode/node_modules
185185
vscode/package-lock.json
186186
.vscode
187+
debug_gym/llms/copilot.py

CHANGELOG.md

Lines changed: 16 additions & 11 deletions
@@ -1,25 +1,30 @@
11
# Changelog
22

3+
### 2025-06-13
4+
5+
Added scripts to filter trajectories generated by `debug-gym`.
6+
7+
8+
### 2025-06-11
9+
10+
Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
11+
312
### 2025-06-03
413
* Refactored `debug_gym/agents/llm_api.py` into separate modules in `debug_gym/llms/` for OpenAI, AzureOpenAI, Anthropic APIs, and human mode, allowing for easier extension to other LLM providers in the future.
514
* Improved the Human mode to support better prompt completion and error handling.
615

7-
### 2025-05-20
8-
9-
Changed the tool-calling syntax to be compatible with the [OpenAI](https://platform.openai.com/docs/guides/function-calling) and [Anthropic](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use) function-calling formats.
16+
### 2025-05-28
1017

11-
* Switched tools (view, rewrite, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
12-
* Overhauled LLM interfaces to define, parse, and format function calls, and updated agents to consume `ToolCall` objects.
13-
* Removed the old conversational-prompt flag from configs.
18+
Improved the View tool by adding the `start` and `end` arguments, so the agent can specify a particular chunk of code to view.
1419

1520
### 2025-05-22
1621

1722
Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysis/json_log_viewer) folder a Flask app to view `.jsonl` log files in the browser.
1823

19-
### 2025-05-28
20-
21-
Improved the View tool by adding the `start` and `end` arguments, so the agent can specify a particular chunk of code to view.
24+
### 2025-05-20
2225

23-
### 2025-06-11
26+
Changed the tool-calling syntax to be compatible with the [OpenAI](https://platform.openai.com/docs/guides/function-calling) and [Anthropic](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use) function-calling formats.
2427

25-
Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
28+
* Switched tools (view, rewrite, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
29+
* Overhauled LLM interfaces to define, parse, and format function calls, and updated agents to consume `ToolCall` objects.
30+
* Removed the old conversational-prompt flag from configs.

MANIFEST.in

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
1-
include debug_gym/envs/configs/*.yaml
1+
include debug_gym/envs/configs/*.yaml
2+
include scripts/templates/*.jinja

README.md

Lines changed: 70 additions & 6 deletions
@@ -67,6 +67,9 @@ debug_gym
6767

6868
`debug_gym.llms` are the different LLM backends that can be used to instantiate agents. Currently, we support OpenAI, Azure OpenAI, and Anthropic.
6969

70+
> [!WARNING]
71+
> `debug-gym` has limited support on non-Linux platforms. Interactive terminal sessions using PTY (pseudo-terminal) in Docker are not fully supported on macOS or Windows. As a result, the `pdb` tool (see [2.1. Environment and Tools](#21-environment-and-tools)) only works on Linux.
72+
7073
---
7174

7275
#### 2.1. Environment and Tools
@@ -137,28 +140,89 @@ We provide a human mode that enables developers to manually interact with `debug
137140

138141
#### 3.3. Overriding Values in Config
139142

140-
`-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).
143+
The `-p` flag is a handy way to override values defined in the config file. For example, the command below runs `debug_agent` on Aider in human mode (even if the config file specifies gpt-4o). It also overrides the default system prompt (see below for more information).
144+
145+
python scripts/run.py scripts/config_aider.yaml \
146+
--agent debug_agent \
147+
-v \
148+
-p debug_agent.llm_name="human" \
149+
-p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"
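As a rough illustration of what a dotted-key override does, the sketch below applies one `key.path=value` assignment to a nested dictionary. It is an assumption about the general mechanism, not the parser used by `scripts/run.py`; `apply_override` is a hypothetical helper, and values are kept as plain strings.

```python
def apply_override(config, assignment):
    """Apply one `dotted.key=value` override to a nested dict in place.

    Hypothetical helper: the real `-p` handling in debug-gym may differ.
    """
    dotted_key, _, raw_value = assignment.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate sections that do not exist yet.
        node = node.setdefault(name, {})
    # Drop surrounding quotes that survive shell quoting, if any.
    node[leaf] = raw_value.strip("\"'")
    return config


config = {"debug_agent": {"llm_name": "gpt-4o"}}
apply_override(config, 'debug_agent.llm_name="human"')
apply_override(
    config,
    "debug_agent.system_prompt_template_file="
    "scripts/templates/human_friendly_system_prompt.jinja",
)
print(config["debug_agent"]["llm_name"])
```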
150+
151+
152+
#### 3.4. Customizing the System Prompt with Jinja Templates
153+
154+
`debug-gym` allows you to fully customize the system prompt by providing a [Jinja](https://jinja.palletsprojects.com/) template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.
155+
156+
To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under `system_prompt_template_file`. For example:
157+
158+
```yaml
159+
debug_agent:
160+
system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
161+
```
162+
163+
Alternatively, you can provide a custom template from the command line with `-p <agent>.system_prompt_template_file="<path/to/template.jinja>"` (see above).
164+
165+
Within your Jinja template, you have access to the `agent` and `info` objects, which provide all relevant context about the current environment and agent state.
166+
167+
#### Custom Jinja Filters
168+
169+
In addition to all [built-in Jinja filters](https://jinja.palletsprojects.com/en/stable/templates/#list-of-builtin-filters), two custom filters are available for use in your template:
170+
171+
- **`to_pretty_json`**: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.
172+
```jinja
173+
{{ info.tools | to_pretty_json }}
174+
```
175+
176+
- **`trim_message`**: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The `trim_message` filter accepts the following arguments to control how messages are trimmed:
177+
- **`max_length`**: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
178+
- **`max_length_percentage`**: Instead of specifying an absolute number, you can provide a percentage (e.g., `0.1` for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
179+
- **`where`**: Specifies where to trim the message if it exceeds the limit. The default is `"middle"`, which trims from the middle of the message. Other options are `"start"` and `"end"`.
180+
181+
```jinja
182+
{{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
183+
```
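To make the behavior of the two filters concrete, here is a simplified stand-in written in plain Python. It is a sketch under stated assumptions, not the `debug-gym` implementation: whitespace-separated words stand in for LLM tokens, `where` is read as the part of the message that gets removed, and `max_length_percentage` and non-UTF8 filtering are not modeled.

```python
import json


def to_pretty_json(value):
    """Pretty-print a Python object as indented JSON (simplified stand-in)."""
    return json.dumps(value, indent=2)


def trim_message(message, max_length, where="middle"):
    """Trim `message` to at most `max_length` words (words stand in for tokens)."""
    tokens = message.split()
    if len(tokens) <= max_length:
        return message
    if where == "end":
        # Drop the end, keep the beginning.
        kept = tokens[:max_length] + ["..."]
    elif where == "start":
        # Drop the beginning, keep the end.
        kept = ["..."] + tokens[len(tokens) - max_length:]
    elif where == "middle":
        # Keep both ends and drop what lies between them.
        head = max_length // 2
        tail = max_length - head
        kept = tokens[:head] + ["..."] + tokens[len(tokens) - tail:]
    else:
        raise ValueError(f"unknown trim position: {where!r}")
    return " ".join(kept)


print(trim_message("a b c d e f", 4))               # a b ... e f
print(trim_message("a b c d e f", 4, where="end"))  # a b c d ...
print(to_pretty_json({"name": "view"}))
```

With Jinja itself, functions like these are made available to templates by adding them to the `filters` dictionary of a `jinja2.Environment`.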
184+
185+
#### Example Template
186+
187+
```jinja
188+
System Prompt for Debug-Gym
189+
190+
Task: {{ agent.system_prompt }}
191+
192+
Instructions:
193+
{{ info.instructions }}
194+
195+
Directory Tree:
196+
{{ info.dir_tree | trim_message(max_length=1000) }}
197+
198+
Current Breakpoints:
199+
{{ info.current_breakpoints | to_pretty_json }}
200+
201+
{% if agent.shortcut_features() %}
202+
Shortcut Features:
203+
{{ agent.shortcut_features() | to_pretty_json }}
204+
{% endif %}
205+
```
141206

142-
python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"
143207

144-
#### 3.4. Debugging a Custom Repository
208+
#### 3.5. Debugging a Custom Repository
145209

146210
Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file, which list the files/folders that are hidden from the agent and those that are not editable, respectively.
147211

148212
As an example, we provide a buggy pytorch code repository in `data/pytorch`.
149213

150214
python scripts/run.py scripts/config.yaml --agent <agent name>
151215

152-
#### 3.5. Debugging a Custom SWE-Smith Instance
216+
#### 3.6. Debugging a Custom SWE-Smith Instance
153217

154218
[SWE-Smith](https://github.com/SWE-bench/SWE-smith) makes it possible to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
155219

156220
python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
157221

158-
#### 3.6. Design Your Own Tool
222+
#### 3.7. Design Your Own Tool
159223
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).
160224

161-
#### 3.7. Analysis and Visualization
225+
#### 3.8. Analysis and Visualization
162226

163227
We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
164228
- In the `analysis` folder, we provide the scripts that were used to generate the corresponding figures in our technical report.
