Added scripts to filter trajectories generated by `debug-gym`.
### 2025-06-11
Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or custom tasks generated using SWE-smith.
### 2025-06-03
* Refactored `debug_gym/agents/llm_api.py` into separate modules in `debug_gym/llms/` for OpenAI, AzureOpenAI, Anthropic APIs, and human mode, allowing for easier extension to other LLM providers in the future.
* Improved the Human mode to support better prompt completion and error handling.
### 2025-05-28
Improved the View tool by adding `start` and `end` arguments, so the agent can specify a particular chunk of code to view.
### 2025-05-22
Added a Flask app in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysis/json_log_viewer) folder to view `.jsonl` log files in the browser.
### 2025-05-20
Changed the tool-calling syntax to be compatible with the [OpenAI](https://platform.openai.com/docs/guides/function-calling) and [Anthropic](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use) function-calling formats.
* Switched tools (view, rewrite, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
* Overhauled LLM interfaces to define, parse, and format function calls, and updated agents to consume `ToolCall` objects.
* Removed the old conversational-prompt flag from configs.
`debug_gym.llms` are the different LLM backends that can be used to instantiate agents. Currently, we support OpenAI, Azure OpenAI, and Anthropic.
> [!WARNING]
> `debug-gym` has limited support on non-Linux platforms. Interactive terminal sessions using PTY (pseudo-terminal) in Docker are not fully supported on macOS or Windows. As a result, the `pdb` tool (see [2.1. Environment and Tools](#21-environment-and-tools)) only works on Linux.
---
#### 2.1. Environment and Tools
#### 3.3. Overriding Values in Config
The `-p` flag is a handy way to override values defined in the config file. For example, the command below runs the `rewrite_agent` agent on Aider in human mode (even though the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).
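As a rough mental model, a dotted override such as `-p rewrite_agent.llm_name="human"` walks the nested config by its dotted key and replaces the leaf value. The sketch below illustrates that mapping; the function and variable names are illustrative, not debug-gym's actual implementation.

```python
def apply_override(config, dotted_key, value):
    # Split "rewrite_agent.llm_name" into parent keys and a leaf key,
    # descend into the nested dict, then set the leaf value.
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config

config = {"rewrite_agent": {"llm_name": "gpt-4o"}}
apply_override(config, "rewrite_agent.llm_name", "human")
```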
#### 3.4. Customizing the System Prompt with Jinja Templates
`debug-gym` allows you to fully customize the system prompt by providing a [Jinja](https://jinja.palletsprojects.com/) template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.
To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under `system_prompt_template_file`. For example:
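A minimal sketch of such a configuration (the agent name and `llm_name` value are illustrative; only the `system_prompt_template_file` key is documented here):

```yaml
rewrite_agent:
  llm_name: gpt-4o
  system_prompt_template_file: "templates/system_prompt.jinja"
```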
Alternatively, you can provide a custom template from the command line with `-p <agent>.system_prompt_template_file="<path/to/template.jinja>"` (see above).
Within your Jinja template, you have access to the `agent` and `info` objects, which provide all relevant context about the current environment and agent state.
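For instance, a template might combine static instructions with state from these objects (a sketch; `info.tools` appears in the filter examples below, but the full schema of `agent` and `info` is not spelled out here):

```jinja
You are a debugging agent working inside debug-gym.

Available tools:
{{ info.tools | to_pretty_json }}
```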
#### Custom Jinja Filters
In addition to all [built-in Jinja filters](https://jinja.palletsprojects.com/en/stable/templates/#list-of-builtin-filters), two custom filters are available for use in your template:
- **`to_pretty_json`**: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.
```jinja
{{ info.tools | to_pretty_json }}
```
- **`trim_message`**: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The `trim_message` filter accepts the following arguments to control how messages are trimmed:
- **`max_length`**: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
- **`max_length_percentage`**: Instead of specifying an absolute number, you can provide a percentage (e.g., `0.1` for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
- **`where`**: Specifies where to trim the message if it exceeds the limit. The default is `"middle"`, which trims from the middle of the message. Other options are `"start"` and `"end"`.
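As a rough mental model, the two filters behave like the following Python. This is a simplified sketch: the real `trim_message` counts tokens and filters out non-UTF8 characters, which this character-based version omits.

```python
import json

def to_pretty_json(obj):
    # Pretty-print a Python object as indented JSON.
    return json.dumps(obj, indent=2)

def trim_message(text, max_length, where="middle", marker="..."):
    # Keep `text` within `max_length` characters, cutting at `where`
    # and inserting a marker where content was removed.
    if len(text) <= max_length:
        return text
    keep = max_length - len(marker)
    if where == "start":
        return marker + text[-keep:]
    if where == "end":
        return text[:keep] + marker
    # default: trim from the middle
    head, tail = keep // 2, keep - keep // 2
    return text[:head] + marker + text[-tail:]
```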
Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file that label files/folders that are not visible or not editable, respectively.
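For instance, a `.debugignore` might look like the following (assuming gitignore-style patterns; the entries shown are purely illustrative):

```
.git/
__pycache__/
*.log
tests/
```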
As an example, we provide a buggy pytorch code repository in `data/pytorch`.
[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).
#### 3.8. Analysis and Visualization
We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
In the `analysis` folder, we provide scripts used to generate the corresponding figures in our technical report.