Commits (58)
4e30e04
terminate the episode when encountering non-recoverable errors.
xingdi-eric-yuan Dec 10, 2025
d0215f3
Update docker.py
xingdi-eric-yuan Dec 10, 2025
b043f97
test cases
xingdi-eric-yuan Dec 10, 2025
1d48dc4
Fix edge cases in unrecoverable terminal error handling
xingdi-eric-yuan Dec 11, 2025
b886678
Update run.py
xingdi-eric-yuan Dec 11, 2025
54f6d30
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
7f75f2f
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
70cc39e
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
d63e32a
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
0fa81e9
Update test_experiment.py
xingdi-eric-yuan Dec 11, 2025
6aacd22
Update run.py
xingdi-eric-yuan Dec 11, 2025
841024a
Update test_experiment.py
xingdi-eric-yuan Dec 11, 2025
8bfa68d
more tests
xingdi-eric-yuan Dec 11, 2025
8cf0cab
Update mini_nightmare.py
xingdi-eric-yuan Dec 11, 2025
586a991
multiple processes share the same temp file path
xingdi-eric-yuan Dec 11, 2025
c8da26b
Update run.py
xingdi-eric-yuan Dec 11, 2025
4e20ae0
Fix pickle error when logging exceptions in worker processes
xingdi-eric-yuan Dec 11, 2025
863a294
Merge branch 'main' into error_handling
xingdi-eric-yuan Dec 11, 2025
974b237
Update test_experiment.py
xingdi-eric-yuan Dec 11, 2025
4d12f88
Update env.py
xingdi-eric-yuan Dec 12, 2025
1dee2bf
minor
xingdi-eric-yuan Dec 12, 2025
94f3043
remove listdir
xingdi-eric-yuan Dec 12, 2025
6809736
add test
xingdi-eric-yuan Dec 12, 2025
021b240
Update test_env.py
xingdi-eric-yuan Dec 12, 2025
139959d
Update run.py
xingdi-eric-yuan Dec 12, 2025
8eedeb6
update readme
xingdi-eric-yuan Dec 12, 2025
1e855b2
add back listdir, introducing tool dependencies (when defining tools)
xingdi-eric-yuan Dec 12, 2025
7c00f94
Update test_experiment.py
xingdi-eric-yuan Dec 12, 2025
5652c99
Update README.md
xingdi-eric-yuan Dec 12, 2025
a56ca76
move setup_commands to tool.py so it's more generic
xingdi-eric-yuan Dec 12, 2025
4015c74
Update tool.py
xingdi-eric-yuan Dec 12, 2025
64ebedf
Move tool setup_commands to base EnvironmentTool class
xingdi-eric-yuan Dec 12, 2025
eb2e59b
Simplify tool's setup_commands
MarcCote Dec 13, 2025
2edd5fc
Better LLM.instantiate (#313)
sordonia Dec 13, 2025
805a303
Update submit.py
xingdi-eric-yuan Dec 13, 2025
c03a36a
Fix formatting
MarcCote Dec 13, 2025
49a0e4d
fix copy
Dec 13, 2025
de97b55
step / init
Dec 13, 2025
912040f
rename .tool -> .action
Dec 13, 2025
d726f6d
merge and rename
Dec 13, 2025
5bf4c6b
merge
Dec 13, 2025
55f8940
agents
Dec 13, 2025
56e1017
precommit
Dec 13, 2025
1526908
finish removing
Dec 13, 2025
3a93442
revert back renaming, too much mess
Dec 13, 2025
f050928
llm mock
Dec 13, 2025
9133c8c
fix tests
Dec 13, 2025
81e374f
accept llm as None otw all hell breaks loose
Dec 13, 2025
8f1813a
remove llm from run
Dec 13, 2025
7225a6a
value error
Dec 13, 2025
01fdbde
sys prompt for froggy
Dec 13, 2025
ff76f2d
fix order of args
Dec 13, 2025
2bbb7fd
dont require llm
Dec 13, 2025
801e238
info rename
Dec 13, 2025
22334fe
return env info
Dec 13, 2025
ab41a84
Refactor step and execute_action methods
sordonia Dec 13, 2025
44633f7
remove env in config
Dec 13, 2025
e5101b1
Merge branch 'refactor_step_init' of github.com:microsoft/debug-gym i…
Dec 13, 2025
48 changes: 29 additions & 19 deletions README.md
@@ -81,16 +81,18 @@ One of the core designs of `debug-gym` is the notion of tools. Users can dynamic
| Tool name | Description |
| :-: | :----- |
| `bash` | Run commands in a bash shell. You have access to common Linux and Python packages via pip. State is persistent across command calls within the same session. |
| `listdir` | It returns the directory tree at a given subdirectory. This is particularly useful when dealing with a repository with multiple files. |
| `view` | It is used to change an agent's focus to a particular source code file. This is particularly useful when dealing with a repository with multiple files. |
| `eval` | It runs the current code repository using the provided entrypoint (e.g., pytest), and returns the terminal's output (e.g., error message). |
| `pdb` | Interactive debugger wrapping the [Python pdb tool](https://docs.python.org/3/library/pdb.html). In addition, users can choose to maintain a set of persistent breakpoints (as in some programming IDEs), which are not reset after every eval. With this feature, a new pdb debugging session is activated automatically, with all the breakpoints restored. Note that such breakpoints can be cleared by pdb commands such as `cl`. |
| `grep` | Search for patterns in files within the repository. Supports both literal string matching and regular expressions. Can search in specific files, directories, or the entire repository. Useful for finding code patterns, function definitions, variable usage, or identifying files containing specific text. |
| `listdir` | List the file and folder contents of a directory within the working directory, up to a specified depth. Useful for exploring the repository structure. |
| `edit` | It can be used to edit a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
| `submit` | Submit your changes once the task is complete. By default, it runs evaluation before terminating the session, but this can be disabled via `eval_on_submit: false`. |

Upon importing a tool, its action space and observation space will be automatically merged into `debug-gym`'s action space and observation space; its instruction will also be merged into the overall instruction provided to the agent (e.g., as system prompt).

**Tool Dependencies:** Some tools require additional packages to be installed in the terminal environment. When a tool is added to the configuration, its required dependencies are automatically installed during terminal setup. For example, the `listdir` tool requires the `tree` package, which is automatically installed when the tool is used. This ensures that tools work out of the box without manual configuration.
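
A minimal sketch of what declaring such a dependency could look like, assuming the `setup_commands` hook on the `EnvironmentTool` base class introduced in this PR (the import path and exact attribute name are assumptions — check the tool module in the source tree):

```python
# Illustrative sketch only: a tool declaring the shell commands it needs at setup time.
from debug_gym.gym.tools.tool import EnvironmentTool  # import path is an assumption


class ListdirTool(EnvironmentTool):
    name = "listdir"
    # Run once during terminal setup, before the agent starts interacting.
    # Installs `tree`, which this tool relies on at runtime.
    setup_commands = ["apt-get update", "apt-get install -y tree"]
```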

Users can include a `.debugignore` file in the repository to specify files and directories that are not visible to `debug-gym`, similarly, they can include a `.debugreadonly` to specify files and directories that are read only by the agents (e.g., the test files). Both files share the same syntax as `.gitignore`.

---
@@ -119,7 +121,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
| `mini_nightmare` | A set of 10 hand-crafted, minimal buggy code snippets that edit-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

> [!NOTE]
> Since debug-gym focuses on debugging tasks with the use of a debugger, we provide a customized version of `swebench`, called `swebench-debug`, where each problem's codebase already has the gold test patch applied. This allows us to better simulate real-world debugging scenarios where the buggy code is expected to have failing tests and we can set the debugger's entrypoint accordingly. To use `swebench-debug`, use `configs/swebench_debug.yaml` or set `task_data.dataset_type: swebench-debug` in your config file.
> Since debug-gym focuses on debugging tasks with the use of a debugger, we provide a customized version of `swebench`, called `swebench-debug`, where each problem's codebase already has the gold test patch applied. This allows us to better simulate real-world debugging scenarios where the buggy code is expected to have failing tests and we can set the debugger's entrypoint accordingly. To use `swebench-debug`, use `configs/swebench_debug.yaml` or set `dataset.type: swebench-debug` in your config file.

---

@@ -151,31 +153,38 @@ Terminal selection is configured through the `terminal_config` in your script co
## 3. Running Baselines
We use `.yaml` files to specify configurations. Example config files can be found in `configs/`. To run an agent:

python scripts/run.py configs/<benchmark name>.yaml
python scripts/run.py --config configs/<benchmark_name>.yaml

Common options:
- `-v` or `-vv`: Verbose or very verbose logging
- `--debug`: Enter debug mode (press `c` to continue after each step)
- `-n <num>`: Number of parallel workers (default: 1)
- `-p key=value`: Override config values (use `.` for nested keys, e.g., `-p llm.name=gpt-4o`)
- `--force-all`: Re-run all problems even if already completed
- `--force-failed`: Re-run only failed problems

Add `-v`, `--debug` to be verbose, or to enter debug mode.
> [!WARNING]
> When using --debug, you will need to press `c` to continue after each reasoning step.

#### 3.1 Sanity Checks

You can use the `solution_agent` to validate that your `swebench`, `swesmith`, and `r2egym` instances work as expected. This agent applies the gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterwards. It also checks that the `pdb` tool can be used as expected (if available).

python scripts/run.py configs/swebench.yaml -p agent.type=solution_agent
python scripts/run.py configs/swesmith.yaml -p agent.type=solution_agent
python scripts/run.py configs/r2egym.yaml -p agent.type=solution_agent
python scripts/run.py --config configs/swebench.yaml -p agent.type=solution_agent
python scripts/run.py --config configs/swesmith.yaml -p agent.type=solution_agent
python scripts/run.py --config configs/r2egym.yaml -p agent.type=solution_agent

#### 3.2 Human Mode

We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in your config YAML to `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.
We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, set `llm.name` to `"human"` in your config YAML (or use `-p llm.name=human`). Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

#### 3.3. Overriding Values in Config

The `-p` flag is a handy way to override values defined in the config file. For example, the command below will run on Aider with human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).
The `-p` flag is a handy way to override values defined in the config file. Use `.` notation for nested keys. For example, the command below will run on Aider with human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).

python scripts/run.py configs/aider.yaml \
python scripts/run.py --config configs/aider.yaml \
-v \
-p llm_name="human" \
-p llm.name="human" \
-p agent.system_prompt="scripts/templates/human_friendly_system_prompt.jinja"
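
Under the hood, each `-p key=value` pair amounts to a dotted-path update on the loaded config dictionary. The sketch below shows the general idea; it is illustrative and not necessarily the exact logic used in `scripts/run.py`:

```python
def apply_override(config: dict, key: str, value) -> None:
    """Set a possibly nested key such as 'llm.name' in a config dict, in place."""
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})  # create intermediate dicts as needed
    node[leaf] = value


config = {"llm": {"name": "gpt-4o"}, "agent": {}}
apply_override(config, "llm.name", "human")
apply_override(config, "agent.system_prompt", "scripts/templates/human_friendly_system_prompt.jinja")
print(config["llm"]["name"])  # -> human
```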


@@ -235,17 +244,17 @@ Shortcut Features:

#### 3.5. Debugging a Custom Repository

Modify `configs/config.yaml`, especially the `task_data` section to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` within the repository that labels files/folders that are not seen or not editable, respectively.

As an example, we provide a buggy pytorch code repository in `data/pytorch`.
You can debug a custom repository by using `configs/local.yaml` and modifying the `task_data` section to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` file within the repository that label the files/folders that are not visible or not editable, respectively.

python scripts/run.py configs/config.yaml
python scripts/run.py --config configs/local.yaml \
-p task_data.path="/path/to/your/repo" \
-p task_data.entrypoint="pytest tests/"

#### 3.6. Debugging a Custom SWE-Smith Instance

[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) that has a similar structure as [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override the `-p task_data.dataset_id=<dataset_id>` in the command line to run the agent on that dataset. For example, to run on a local dataset:
[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override `dataset.dataset_id` on the command line to run the agent on that dataset. For example, to run on a local dataset:

python scripts/run.py configs/swesmith.yaml -p task_data.dataset_id="path/to/local/dataset"
python scripts/run.py --config configs/swesmith.yaml -p dataset.dataset_id="path/to/local/dataset"
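
As an illustration, a small local dataset could be produced with the `datasets` library and then passed as `dataset.dataset_id`; the column names below are placeholders — mirror the actual SWE-bench/SWE-smith schema, and check how the loader expects a local dataset to be stored:

```python
# Hypothetical sketch: build a tiny local dataset with Hugging Face `datasets`.
# Column names are placeholders; copy the schema of SWE-bench/SWE-smith for real instances.
from datasets import Dataset

records = [
    {
        "instance_id": "myrepo__example-0001",         # placeholder
        "repo": "org/myrepo",                          # placeholder
        "patch": "diff --git a/foo.py b/foo.py\n...",  # placeholder gold patch
    }
]
Dataset.from_list(records).save_to_disk("path/to/local/dataset")
```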

#### 3.7. Design Your Own Tool
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).
@@ -272,7 +281,8 @@ While `debug-gym` was designed for debugging tasks, the `FreeEnv` environment en
task_name: free-session
output_path: exps/free_env

llm_name: gpt-4o
llm:
name: gpt-4o

tools:
- edit
Expand All @@ -299,7 +309,7 @@ agent:

Run with:

python scripts/run.py configs/free_env.yaml
python scripts/run.py --config configs/free_env.yaml

This provides a sandbox for developing and evaluating coding agents on arbitrary tasks, making `debug-gym` useful for general agent research beyond debugging.

3 changes: 2 additions & 1 deletion configs/aider.yaml
@@ -2,7 +2,8 @@
task_name: aider
output_path: exps/aider

llm_name: gpt-4o
llm:
name: gpt-4o

# Tools to load into the environment toolbox.
tools:
2 changes: 2 additions & 0 deletions configs/swebench.yaml
@@ -4,6 +4,8 @@ output_path: exps/swebench-verified

llm:
name: gpt-4o
# temperature: 0.7 # optional, overrides llm.yaml default
# max_tokens: 4096 # optional, overrides llm.yaml default

# Tools to load into the environment toolbox.
tools:
198 changes: 127 additions & 71 deletions debug_gym/agents/base_agent.py
@@ -238,90 +238,64 @@ def should_stop(self, step: int, info: EnvInfo):
reason = "max_steps reached"
return should_stop, reason

def run(self, env: RepoEnv, llm: LLM, debug=False):
def init(
self, env: RepoEnv, llm: LLM, reset_env: bool = True
) -> EnvInfo:
"""Initialize the agent with environment and LLM.

Args:
env: The environment to interact with.
llm: The language model to use for decision making.
reset_env: Whether to reset the environment (default True).

Returns:
The initial EnvInfo after setup.
"""
self.env = env
self.llm = llm
info = None
step = 0

try:
if reset_env:
info = self.env.reset()
self.history.init(
self.build_system_prompt(info), self.build_instance_prompt(info), info
)
else:
info = self.env.info

if info.resolved:
self.logger.report_progress(
problem_id=env.task_name,
step=0,
total_steps=self.args.max_steps,
score=info.score,
max_score=info.max_score,
status="resolved",
)
return self._build_trajectory()
self.history.init(
self.build_system_prompt(info), self.build_instance_prompt(info), info
)

self.logger.info(
"Available tools (in LLM's tool calling format):\n"
f"{json.dumps(self.llm.define_tools(info.tools), indent=4)}\n"
)
self.logger.info(
"Available tools (in LLM's tool calling format):\n"
f"{json.dumps(self.llm.define_tools(info.tools), indent=4)}\n"
)

highscore = info.score
should_stop = False
step = 1
return info

while not should_stop:
self.logger.info(f"\n{'='*20} STEP {step} {'='*20}\n")
def step(self, info: EnvInfo, debug: bool = False) -> EnvInfo:
"""Execute a single agent step.

messages = self.build_prompt(info)
llm_response = self.llm(messages, info.tools)
Args:
info: Current environment info.
debug: Whether to drop into debugger after LLM call.

if debug:
breakpoint()
Returns:
New EnvInfo after executing the action.
"""
messages = self.build_prompt(info)
llm_response = self.llm(messages, info.tools)

info = self.env.step(
llm_response.tool,
llm_response.response,
llm_response.reasoning_response,
)
self.history.step(info, llm_response)
should_stop, reason = self.should_stop(step + 1, info)
status = (
"resolved"
if info.resolved
else ("unresolved" if should_stop else "running")
)
if debug:
breakpoint()

highscore = max(highscore, info.score)
msg = f"[{env.task_name[:10]:<10}] Step {step} | Score: {info.score}/{info.max_score or '-'} [Best: {highscore}]"
if should_stop:
msg += f" | Stopping Reason: {reason}"
self.logger.info(msg)
step += 1

# keep progress bar running until max_steps is reached
self.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=self.args.max_steps,
score=info.score,
max_score=info.max_score,
status=status,
)
return self._build_trajectory()
except Exception as e:
# report any error that happens during the run
self.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=step,
score=getattr(info, "score", 0),
max_score=getattr(info, "max_score", None),
status="error",
)
raise e
new_info = self.env.step(
llm_response.tool,
llm_response.response,
llm_response.reasoning_response,
)
self.history.step(new_info, llm_response)

def _build_trajectory(self) -> Dict[str, Any]:
return new_info

def build_trajectory(self) -> Dict[str, Any]:
"""Return the trajectory as a JSON-serializable dict without writing it."""
tools = [f"{tool.name}({tool.arguments})" for tool in self.env.tools]
json_output = {
@@ -361,3 +335,85 @@ def create_agent(config: Dict[str, Any], **kwargs) -> BaseAgent:

agent = agent_class(agent_args=config, **kwargs)
return agent


def run_agent(
agent: BaseAgent,
env: RepoEnv,
llm: LLM,
debug: bool = False,
reset_env: bool = True,
) -> Dict[str, Any]:
"""Run the agent loop until termination or max steps.

Args:
agent: The agent to run.
env: The environment to interact with.
llm: The language model to use for decision making.
debug: Whether to drop into debugger after each LLM call.
reset_env: Whether to reset the environment (default True).

Returns:
The trajectory as a JSON-serializable dict.
"""
info = None
step = 0

try:
info = agent.init(env, llm, reset_env=reset_env)

if info.resolved:
agent.logger.report_progress(
problem_id=env.task_name,
step=0,
total_steps=agent.args.max_steps,
score=info.score,
max_score=info.max_score,
status="resolved",
)
return agent.build_trajectory()

highscore = info.score
should_stop = False
step = 1

while not should_stop:
agent.logger.info(f"\n{'='*20} STEP {step} {'='*20}\n")

info = agent.step(info, debug=debug)

should_stop, reason = agent.should_stop(step + 1, info)
status = (
"resolved"
if info.resolved
else ("unresolved" if should_stop else "running")
)

highscore = max(highscore, info.score)
msg = f"[{env.task_name[:10]:<10}] Step {step} | Score: {info.score}/{info.max_score or '-'} [Best: {highscore}]"
if should_stop:
msg += f" | Stopping Reason: {reason}"
agent.logger.info(msg)
step += 1

# keep progress bar running until max_steps is reached
agent.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=agent.args.max_steps,
score=info.score,
max_score=info.max_score,
status=status,
)
return agent.build_trajectory()
except Exception as e:
# report any error that happens during the run
agent.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=step,
score=getattr(info, "score", 0),
max_score=getattr(info, "max_score", None),
status="error",
)
raise e
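
With the loop factored out of `BaseAgent.run`, callers can either use `run_agent` for the full loop or drive `init`/`step` themselves. A hedged usage sketch based on the signatures in this diff (the environment, LLM, and config construction are elided and assumed to come from your existing setup):

```python
from debug_gym.agents.base_agent import create_agent, run_agent

# Assumptions: `config` is the agent config dict loaded from YAML, and
# `env` (RepoEnv) / `llm` (LLM) are already constructed elsewhere.
agent = create_agent(config)
trajectory = run_agent(agent, env, llm, debug=False, reset_env=True)  # full loop

# Or drive the loop manually for finer-grained control:
info = agent.init(env, llm, reset_env=True)
step = 1
while True:
    info = agent.step(info)
    should_stop, reason = agent.should_stop(step + 1, info)
    if should_stop:
        break
    step += 1
trajectory = agent.build_trajectory()  # JSON-serializable dict
```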