Commits (58)
4e30e04
terminate the episode when encountering non-recoverable errors.
xingdi-eric-yuan Dec 10, 2025
d0215f3
Update docker.py
xingdi-eric-yuan Dec 10, 2025
b043f97
test cases
xingdi-eric-yuan Dec 10, 2025
1d48dc4
Fix edge cases in unrecoverable terminal error handling
xingdi-eric-yuan Dec 11, 2025
b886678
Update run.py
xingdi-eric-yuan Dec 11, 2025
54f6d30
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
7f75f2f
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
70cc39e
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
d63e32a
Update experiment.py
xingdi-eric-yuan Dec 11, 2025
0fa81e9
Update test_experiment.py
xingdi-eric-yuan Dec 11, 2025
6aacd22
Update run.py
xingdi-eric-yuan Dec 11, 2025
841024a
Update test_experiment.py
xingdi-eric-yuan Dec 11, 2025
8bfa68d
more tests
xingdi-eric-yuan Dec 11, 2025
8cf0cab
Update mini_nightmare.py
xingdi-eric-yuan Dec 11, 2025
586a991
multiple processes share the same temp file path
xingdi-eric-yuan Dec 11, 2025
c8da26b
Update run.py
xingdi-eric-yuan Dec 11, 2025
4e20ae0
Fix pickle error when logging exceptions in worker processes
xingdi-eric-yuan Dec 11, 2025
863a294
Merge branch 'main' into error_handling
xingdi-eric-yuan Dec 11, 2025
974b237
Update test_experiment.py
xingdi-eric-yuan Dec 11, 2025
4d12f88
Update env.py
xingdi-eric-yuan Dec 12, 2025
1dee2bf
minor
xingdi-eric-yuan Dec 12, 2025
94f3043
remove listdir
xingdi-eric-yuan Dec 12, 2025
6809736
add test
xingdi-eric-yuan Dec 12, 2025
021b240
Update test_env.py
xingdi-eric-yuan Dec 12, 2025
139959d
Update run.py
xingdi-eric-yuan Dec 12, 2025
8eedeb6
update readme
xingdi-eric-yuan Dec 12, 2025
1e855b2
add back listdir, introducing tool dependencies (when defining tools)
xingdi-eric-yuan Dec 12, 2025
7c00f94
Update test_experiment.py
xingdi-eric-yuan Dec 12, 2025
5652c99
Update README.md
xingdi-eric-yuan Dec 12, 2025
a56ca76
move setup_commands to tool.py so it's more generic
xingdi-eric-yuan Dec 12, 2025
4015c74
Update tool.py
xingdi-eric-yuan Dec 12, 2025
64ebedf
Move tool setup_commands to base EnvironmentTool class
xingdi-eric-yuan Dec 12, 2025
eb2e59b
Simplify tool's setup_commands
MarcCote Dec 13, 2025
2edd5fc
Better LLM.instantiate (#313)
sordonia Dec 13, 2025
805a303
Update submit.py
xingdi-eric-yuan Dec 13, 2025
c03a36a
Fix formatting
MarcCote Dec 13, 2025
49a0e4d
fix copy
Dec 13, 2025
de97b55
step / init
Dec 13, 2025
912040f
rename .tool -> .action
Dec 13, 2025
d726f6d
merge and rename
Dec 13, 2025
5bf4c6b
merge
Dec 13, 2025
55f8940
agents
Dec 13, 2025
56e1017
precommit
Dec 13, 2025
1526908
finish removing
Dec 13, 2025
3a93442
revert back renaming, too much mess
Dec 13, 2025
f050928
llm mock
Dec 13, 2025
9133c8c
fix tests
Dec 13, 2025
81e374f
accept llm as None otw all hell breaks loose
Dec 13, 2025
8f1813a
remove llm from run
Dec 13, 2025
7225a6a
value error
Dec 13, 2025
01fdbde
sys prompt for froggy
Dec 13, 2025
ff76f2d
fix order of args
Dec 13, 2025
2bbb7fd
dont require llm
Dec 13, 2025
801e238
info rename
Dec 13, 2025
22334fe
return env info
Dec 13, 2025
ab41a84
Refactor step and execute_action methods
sordonia Dec 13, 2025
44633f7
remove env in config
Dec 13, 2025
e5101b1
Merge branch 'refactor_step_init' of github.com:microsoft/debug-gym i…
Dec 13, 2025
48 changes: 29 additions & 19 deletions README.md
@@ -81,16 +81,18 @@ One of the core designs of `debug-gym` is the notion of tools. Users can dynamic
| Tool name | Description |
| :-: | :----- |
| `bash` | Run commands in a bash shell. You have access to common Linux and Python packages via pip. State is persistent across command calls within the same session. |
| `listdir` | It returns the directory tree at a given subdirectory. This is particularly useful when dealing with a repository with multiple files. |
| `view` | It is used to change an agent's focus to a particular source code file. This is particularly useful when dealing with a repository with multiple files. |
| `eval` | It runs the current code repository using the provided entrypoint (e.g., pytest), and returns the terminal's output (e.g., error message). |
| `pdb` | Interactive debugger wrapping the [Python pdb tool](https://docs.python.org/3/library/pdb.html). In addition, users can choose to maintain a set of persistent breakpoints (as in some programming IDEs), which are not reset after every eval. With this feature, a new pdb debugging session is activated automatically, with all the breakpoints restored. Note that such breakpoints can be cleared by pdb commands such as `cl`. |
| `grep` | Search for patterns in files within the repository. Supports both literal string matching and regular expressions. Can search in specific files, directories, or the entire repository. Useful for finding code patterns, function definitions, variable usage, or identifying files containing specific text. |
| `listdir` | List the file and folder contents of a directory within the working directory, up to a specified depth. Useful for exploring the repository structure. |
| `edit` | It can be used to edit a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
| `submit` | Submit your changes once the task is complete. By default, it runs evaluation before terminating the session, but this can be disabled via `eval_on_submit: false`. |

Upon importing a tool, its action space and observation space will be automatically merged into `debug-gym`'s action space and observation space; its instruction will also be merged into the overall instruction provided to the agent (e.g., as system prompt).

**Tool Dependencies:** Some tools require additional packages to be installed in the terminal environment. When a tool is added to the configuration, its required dependencies are automatically installed during terminal setup. For example, the `listdir` tool requires the `tree` package, which is automatically installed when the tool is used. This ensures that tools work out of the box without manual configuration.
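
A minimal sketch of what declaring such a dependency could look like, assuming the `setup_commands` hook on the `EnvironmentTool` base class introduced in this PR (the import path and exact attribute name are assumptions — check the tool module in the source tree):

```python
# Illustrative sketch only: a tool declaring the shell commands it needs at setup time.
from debug_gym.gym.tools.tool import EnvironmentTool  # import path is an assumption


class ListdirTool(EnvironmentTool):
    name = "listdir"
    # Run once during terminal setup, before the agent starts interacting.
    # Installs `tree`, which this tool relies on at runtime.
    setup_commands = ["apt-get update", "apt-get install -y tree"]
```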

Users can include a `.debugignore` file in the repository to specify files and directories that are not visible to `debug-gym`, similarly, they can include a `.debugreadonly` to specify files and directories that are read only by the agents (e.g., the test files). Both files share the same syntax as `.gitignore`.

---
@@ -119,7 +121,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
| `mini_nightmare` | A set of 10 hand-crafted, minimal buggy code snippets that edit-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

> [!NOTE]
> Since debug-gym focuses on debugging tasks with the use of a debugger, we provide a customized version of `swebench`, called `swebench-debug`, where each problem's codebase already has the gold test patch applied. This allows us to better simulate real-world debugging scenarios where the buggy code is expected to have failing tests and we can set the debugger's entrypoint accordingly. To use `swebench-debug`, use `configs/swebench_debug.yaml` or set `task_data.dataset_type: swebench-debug` in your config file.
> Since debug-gym focuses on debugging tasks with the use of a debugger, we provide a customized version of `swebench`, called `swebench-debug`, where each problem's codebase already has the gold test patch applied. This allows us to better simulate real-world debugging scenarios where the buggy code is expected to have failing tests and we can set the debugger's entrypoint accordingly. To use `swebench-debug`, use `configs/swebench_debug.yaml` or set `dataset.type: swebench-debug` in your config file.

---

@@ -151,31 +153,38 @@ Terminal selection is configured through the `terminal_config` in your script co
## 3. Running Baselines
We use `.yaml` files to specify configurations. Example config files can be found in `configs/`. To run an agent:

python scripts/run.py configs/<benchmark name>.yaml
python scripts/run.py --config configs/<benchmark_name>.yaml

Common options:
- `-v` or `-vv`: Verbose or very verbose logging
- `--debug`: Enter debug mode (press `c` to continue after each step)
- `-n <num>`: Number of parallel workers (default: 1)
- `-p key=value`: Override config values (use `.` for nested keys, e.g., `-p llm.name=gpt-4o`)
- `--force-all`: Re-run all problems even if already completed
- `--force-failed`: Re-run only failed problems

Add `-v`, `--debug` to be verbose, or to enter debug mode.
> [!WARNING]
> When using --debug, you will need to press `c` to continue after each reasoning step.

#### 3.1 Sanity Checks

You can use the `solution_agent` to validate that your `swebench`, `swesmith`, and `r2egym` instances work as expected. This agent applies the gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterwards. It also checks that the `pdb` tool can be used as expected (if available).

python scripts/run.py configs/swebench.yaml -p agent.type=solution_agent
python scripts/run.py configs/swesmith.yaml -p agent.type=solution_agent
python scripts/run.py configs/r2egym.yaml -p agent.type=solution_agent
python scripts/run.py --config configs/swebench.yaml -p agent.type=solution_agent
python scripts/run.py --config configs/swesmith.yaml -p agent.type=solution_agent
python scripts/run.py --config configs/r2egym.yaml -p agent.type=solution_agent

#### 3.2 Human Mode

We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in your config YAML to `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.
We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, set `llm.name` to `"human"` in your config YAML (or use `-p llm.name=human`). Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

#### 3.3. Overriding Values in Config

The `-p` flag is a handy way to override values defined in the config file. For example, the command below will run on Aider with human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).
The `-p` flag is a handy way to override values defined in the config file. Use `.` notation for nested keys. For example, the command below will run on Aider with human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).

python scripts/run.py configs/aider.yaml \
python scripts/run.py --config configs/aider.yaml \
-v \
-p llm_name="human" \
-p llm.name="human" \
-p agent.system_prompt="scripts/templates/human_friendly_system_prompt.jinja"
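
Under the hood, each `-p key=value` pair amounts to a dotted-path update on the loaded config dictionary. The sketch below shows the general idea; it is illustrative and not necessarily the exact logic used in `scripts/run.py`:

```python
def apply_override(config: dict, key: str, value) -> None:
    """Set a possibly nested key such as 'llm.name' in a config dict, in place."""
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})  # create intermediate dicts as needed
    node[leaf] = value


config = {"llm": {"name": "gpt-4o"}, "agent": {}}
apply_override(config, "llm.name", "human")
apply_override(config, "agent.system_prompt", "scripts/templates/human_friendly_system_prompt.jinja")
print(config["llm"]["name"])  # -> human
```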


@@ -235,17 +244,17 @@ Shortcut Features:

#### 3.5. Debugging a Custom Repository

Modify `configs/config.yaml`, especially the `task_data` section to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` within the repository that labels files/folders that are not seen or not editable, respectively.

As an example, we provide a buggy pytorch code repository in `data/pytorch`.
You can debug a custom repository by using `configs/local.yaml` and modifying the `task_data` section to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` file within the repository that label the files/folders that are not visible or not editable, respectively.

python scripts/run.py configs/config.yaml
python scripts/run.py --config configs/local.yaml \
-p task_data.path="/path/to/your/repo" \
-p task_data.entrypoint="pytest tests/"

#### 3.6. Debugging a Custom SWE-Smith Instance

[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) that has a similar structure as [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override the `-p task_data.dataset_id=<dataset_id>` in the command line to run the agent on that dataset. For example, to run on a local dataset:
[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override `dataset.dataset_id` on the command line to run the agent on that dataset. For example, to run on a local dataset:

python scripts/run.py configs/swesmith.yaml -p task_data.dataset_id="path/to/local/dataset"
python scripts/run.py --config configs/swesmith.yaml -p dataset.dataset_id="path/to/local/dataset"
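
As an illustration, a small local dataset could be produced with the `datasets` library and then passed as `dataset.dataset_id`; the column names below are placeholders — mirror the actual SWE-bench/SWE-smith schema, and check how the loader expects a local dataset to be stored:

```python
# Hypothetical sketch: build a tiny local dataset with Hugging Face `datasets`.
# Column names are placeholders; copy the schema of SWE-bench/SWE-smith for real instances.
from datasets import Dataset

records = [
    {
        "instance_id": "myrepo__example-0001",         # placeholder
        "repo": "org/myrepo",                          # placeholder
        "patch": "diff --git a/foo.py b/foo.py\n...",  # placeholder gold patch
    }
]
Dataset.from_list(records).save_to_disk("path/to/local/dataset")
```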

#### 3.7. Design Your Own Tool
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).
@@ -272,7 +281,8 @@ While `debug-gym` was designed for debugging tasks, the `FreeEnv` environment en
task_name: free-session
output_path: exps/free_env

llm_name: gpt-4o
llm:
name: gpt-4o

tools:
- edit
Expand All @@ -299,7 +309,7 @@ agent:

Run with:

python scripts/run.py configs/free_env.yaml
python scripts/run.py --config configs/free_env.yaml

This provides a sandbox for developing and evaluating coding agents on arbitrary tasks, making `debug-gym` useful for general agent research beyond debugging.

3 changes: 2 additions & 1 deletion configs/aider.yaml
@@ -2,7 +2,8 @@
task_name: aider
output_path: exps/aider

llm_name: gpt-4o
llm:
name: gpt-4o

# Tools to load into the environment toolbox.
tools:
2 changes: 2 additions & 0 deletions configs/swebench.yaml
@@ -4,6 +4,8 @@ output_path: exps/swebench-verified

llm:
name: gpt-4o
# temperature: 0.7 # optional, overrides llm.yaml default
# max_tokens: 4096 # optional, overrides llm.yaml default

# Tools to load into the environment toolbox.
tools:
198 changes: 127 additions & 71 deletions debug_gym/agents/base_agent.py
@@ -238,90 +238,64 @@ def should_stop(self, step: int, info: EnvInfo):
reason = "max_steps reached"
return should_stop, reason

def run(self, env: RepoEnv, llm: LLM, debug=False):
def init(
self, env: RepoEnv, llm: LLM, reset_env: bool = True
) -> EnvInfo:
"""Initialize the agent with environment and LLM.

Args:
env: The environment to interact with.
llm: The language model to use for decision making.
reset_env: Whether to reset the environment (default True).

Returns:
The initial EnvInfo after setup.
"""
self.env = env
self.llm = llm
info = None
step = 0

try:
if reset_env:
info = self.env.reset()
self.history.init(
self.build_system_prompt(info), self.build_instance_prompt(info), info
)
else:
info = self.env.info

if info.resolved:
self.logger.report_progress(
problem_id=env.task_name,
step=0,
total_steps=self.args.max_steps,
score=info.score,
max_score=info.max_score,
status="resolved",
)
return self._build_trajectory()
self.history.init(
self.build_system_prompt(info), self.build_instance_prompt(info), info
)

self.logger.info(
"Available tools (in LLM's tool calling format):\n"
f"{json.dumps(self.llm.define_tools(info.tools), indent=4)}\n"
)
self.logger.info(
"Available tools (in LLM's tool calling format):\n"
f"{json.dumps(self.llm.define_tools(info.tools), indent=4)}\n"
)

highscore = info.score
should_stop = False
step = 1
return info

while not should_stop:
self.logger.info(f"\n{'='*20} STEP {step} {'='*20}\n")
def step(self, info: EnvInfo, debug: bool = False) -> EnvInfo:
"""Execute a single agent step.

messages = self.build_prompt(info)
llm_response = self.llm(messages, info.tools)
Args:
info: Current environment info.
debug: Whether to drop into debugger after LLM call.

if debug:
breakpoint()
Returns:
New EnvInfo after executing the action.
"""
messages = self.build_prompt(info)
llm_response = self.llm(messages, info.tools)

info = self.env.step(
llm_response.tool,
llm_response.response,
llm_response.reasoning_response,
)
self.history.step(info, llm_response)
should_stop, reason = self.should_stop(step + 1, info)
status = (
"resolved"
if info.resolved
else ("unresolved" if should_stop else "running")
)
if debug:
breakpoint()

highscore = max(highscore, info.score)
msg = f"[{env.task_name[:10]:<10}] Step {step} | Score: {info.score}/{info.max_score or '-'} [Best: {highscore}]"
if should_stop:
msg += f" | Stopping Reason: {reason}"
self.logger.info(msg)
step += 1

# keep progress bar running until max_steps is reached
self.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=self.args.max_steps,
score=info.score,
max_score=info.max_score,
status=status,
)
return self._build_trajectory()
except Exception as e:
# report any error that happens during the run
self.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=step,
score=getattr(info, "score", 0),
max_score=getattr(info, "max_score", None),
status="error",
)
raise e
new_info = self.env.step(
llm_response.tool,
llm_response.response,
llm_response.reasoning_response,
)
self.history.step(new_info, llm_response)

def _build_trajectory(self) -> Dict[str, Any]:
return new_info

def build_trajectory(self) -> Dict[str, Any]:
"""Return the trajectory as a JSON-serializable dict without writing it."""
tools = [f"{tool.name}({tool.arguments})" for tool in self.env.tools]
json_output = {
@@ -361,3 +335,85 @@ def create_agent(config: Dict[str, Any], **kwargs) -> BaseAgent:

agent = agent_class(agent_args=config, **kwargs)
return agent


def run_agent(
agent: BaseAgent,
env: RepoEnv,
llm: LLM,
debug: bool = False,
reset_env: bool = True,
) -> Dict[str, Any]:
"""Run the agent loop until termination or max steps.

Args:
agent: The agent to run.
env: The environment to interact with.
llm: The language model to use for decision making.
debug: Whether to drop into debugger after each LLM call.
reset_env: Whether to reset the environment (default True).

Returns:
The trajectory as a JSON-serializable dict.
"""
info = None
step = 0

try:
info = agent.init(env, llm, reset_env=reset_env)

if info.resolved:
agent.logger.report_progress(
problem_id=env.task_name,
step=0,
total_steps=agent.args.max_steps,
score=info.score,
max_score=info.max_score,
status="resolved",
)
return agent.build_trajectory()

highscore = info.score
should_stop = False
step = 1

while not should_stop:
agent.logger.info(f"\n{'='*20} STEP {step} {'='*20}\n")

info = agent.step(info, debug=debug)

should_stop, reason = agent.should_stop(step + 1, info)
status = (
"resolved"
if info.resolved
else ("unresolved" if should_stop else "running")
)

highscore = max(highscore, info.score)
msg = f"[{env.task_name[:10]:<10}] Step {step} | Score: {info.score}/{info.max_score or '-'} [Best: {highscore}]"
if should_stop:
msg += f" | Stopping Reason: {reason}"
agent.logger.info(msg)
step += 1

# keep progress bar running until max_steps is reached
agent.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=agent.args.max_steps,
score=info.score,
max_score=info.max_score,
status=status,
)
return agent.build_trajectory()
except Exception as e:
# report any error that happens during the run
agent.logger.report_progress(
problem_id=env.task_name,
step=step,
total_steps=step,
score=getattr(info, "score", 0),
max_score=getattr(info, "max_score", None),
status="error",
)
raise e
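
With the loop factored out of `BaseAgent.run`, callers can either use `run_agent` for the full loop or drive `init`/`step` themselves. A hedged usage sketch based on the signatures in this diff (the environment, LLM, and config construction are elided and assumed to come from your existing setup):

```python
from debug_gym.agents.base_agent import create_agent, run_agent

# Assumptions: `config` is the agent config dict loaded from YAML, and
# `env` (RepoEnv) / `llm` (LLM) are already constructed elsewhere.
agent = create_agent(config)
trajectory = run_agent(agent, env, llm, debug=False, reset_env=True)  # full loop

# Or drive the loop manually for finer-grained control:
info = agent.init(env, llm, reset_env=True)
step = 1
while True:
    info = agent.step(info)
    should_stop, reason = agent.should_stop(step + 1, info)
    if should_stop:
        break
    step += 1
trajectory = agent.build_trajectory()  # JSON-serializable dict
```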