diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
Lines changed: 17 additions & 5 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 16 additions & 11 deletions
diff --git a/MANIFEST.in b/MANIFEST.in
Lines changed: 2 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 70 additions & 6 deletions
@@ -12,11 +12,12 @@ on:
 jobs:
   tests:
     name: Test
-    runs-on: ubuntu-latest
+    runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false
       matrix:
         language: [ "python" ]
+        os: [ubuntu-latest]
 
     steps:
       - name: Checkout repository
@@ -30,11 +31,22 @@ jobs:
         run: |
           pip install --upgrade pip
           pip install -e '.[dev]'
-      - name: Test with pytest - PR
-        if: github.event_name == 'pull_request'
+      - name: Get changed files related to SWE-Bench or SWE-Smith
+        id: changed-files-specific
+        uses: tj-actions/changed-files@v46.0.5
+        with:
+          files: |
+            debug_gym/gym/envs/swe_*.py
+            tests/gym/envs/test_swe_*.py
+      - name: Test - PR - Fast
+        if: github.event_name == 'pull_request' && steps.changed-files-specific.outputs.any_changed != 'true'
+        run: |
+          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=600
+      - name: Test - PR - Slow
+        if: github.event_name == 'pull_request' && steps.changed-files-specific.outputs.any_changed == 'true'
         run: |
-          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
-      - name: Test with pytest
+          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 --cov=debug_gym --cov-report=term-missing --cov-fail-under=85 --timeout=600
+      - name: Test - main
         if: github.event_name != 'pull_request'
         run: |
           DEBUG_GYM_DEBUG=1 pytest -vv -n 16 --cov=debug_gym --cov-report=term-missing --cov-fail-under=85 --timeout=600
@@ -184,3 +184,4 @@ vscode/out
 vscode/node_modules
 vscode/package-lock.json
 .vscode
+debug_gym/llms/copilot.py
@@ -1,25 +1,30 @@
 # Changelog
 
+### 2025-06-13
+
+Added scripts to filter trajectories generated by `debug-gym`.
+
+
+### 2025-06-11
+
+Added support to [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
+
 ### 2025-06-03
 * Refactored `debug_gym/agents/llm_api.py` into separate modules in `debug_gym/llms/` for OpenAI, AzureOpenAI, Anthropic APIs, and human mode, allowing for easier extension to other LLM providers in the future.
 * Improved the Human mode to support better prompt completion and error handling.
 
-### 2025-05-20
-
-Changed the tool-calling syntax to be compatible with the [OpenAI](https://platform.openai.com/docs/guides/function-calling) and [Anthropic](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use) function-calling formats.
+### 2025-05-28
 
-* Switched tools (view, rewrite, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
-* Overhauled LLM interfaces to define, parse, and format function calls, and updated agents to consume `ToolCall` objects.
-* Removed the old conversational-prompt flag from configs.
+Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
 
 ### 2025-05-22
 
 Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysis/json_log_viewer) folder a Flask app to view `.jsonl` log files in the browser.
 
-### 2025-05-28
-
-Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+### 2025-05-20
 
-### 2025-06-11
+Changed the tool-calling syntax to be compatible with the [OpenAI](https://platform.openai.com/docs/guides/function-calling) and [Anthropic](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use) function-calling formats.
 
-Added support to [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
+* Switched tools (view, rewrite, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
+* Overhauled LLM interfaces to define, parse, and format function calls, and updated agents to consume `ToolCall` objects.
+* Removed the old conversational-prompt flag from configs.
@@ -1 +1,2 @@
-include debug_gym/envs/configs/*.yaml
+include debug_gym/envs/configs/*.yaml
+include scripts/templates/*.jinja
@@ -67,6 +67,9 @@ debug_gym
 
 `debug_gym.llms` are the different LLM backends that can be used to instantiate agents. Currently, we support OpenAI, Azure OpenAI, and Anthropic.
 
+> [!WARNING]
+> `debug-gym` has limited support on non-Linux platforms. Interactive terminal sessions using PTY (pseudo-terminal) in Docker are not fully supported on macOS or Windows. As a result, the `pdb` tool (see [2.1. Environment and Tools](#21-environment-and-tools)) only works on Linux.
+
 ---
 
 #### 2.1. Environment and Tools
@@ -137,28 +140,89 @@ We provide a human mode that enables developers to manually interact with `debug
 
 #### 3.3. Overriding Values in Config
 
-`-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).
+The `-p` flag is a handy way to override values defined in the config file. For example, the command below runs the `debug_agent` agent on Aider in human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).
+
+    python scripts/run.py scripts/config_aider.yaml \
+        --agent debug_agent \
+        -v \
+        -p debug_agent.llm_name="human" \
+        -p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"
+
+
+#### 3.4. Customizing the System Prompt with Jinja Templates
+
+`debug-gym` allows you to fully customize the system prompt by providing a [Jinja](https://jinja.palletsprojects.com/) template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.
+
+To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under `system_prompt_template_file`. For example:
+
+```yaml
+debug_agent:
+  system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
+```
+
+Alternatively, you can provide a custom template from the command line with `-p <agent>.system_prompt_template_file="<path/to/template.jinja>"` (see above).
+
+Within your Jinja template, you have access to the `agent` and `info` objects, which provide all relevant context about the current environment and agent state.
+
+#### Custom Jinja Filters
+
+In addition to all [built-in Jinja filters](https://jinja.palletsprojects.com/en/stable/templates/#list-of-builtin-filters), two custom filters are available for use in your template:
+
+- **`to_pretty_json`**: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.
+    ```jinja
+    {{ info.tools | to_pretty_json }}
+    ```
+
+- **`trim_message`**: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The `trim_message` filter accepts the following arguments to control how messages are trimmed:
+    - **`max_length`**: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
+    - **`max_length_percentage`**: Instead of specifying an absolute number, you can provide a percentage (e.g., `0.1` for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
+    - **`where`**: Specifies where to trim the message if it exceeds the limit. The default is `"middle"`, which trims from the middle of the message. Other options are `start` or `end`.
+
+    ```jinja
+    {{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
+    ```
+
+#### Example Template
+
+```jinja
+System Prompt for Debug-Gym
+
+Task: {{ agent.system_prompt }}
+
+Instructions:
+{{ info.instructions }}
+
+Directory Tree:
+{{ info.dir_tree | trim_message(max_length=1000) }}
+
+Current Breakpoints:
+{{ info.current_breakpoints | to_pretty_json }}
+
+{% if agent.shortcut_features() %}
+Shortcut Features:
+{{ agent.shortcut_features() | to_pretty_json }}
+{% endif %}
+```
 
-    python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"
 
-#### 3.4. Debugging a Custom Repository
+#### 3.5. Debugging a Custom Repository
 
 Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file that mark files/folders as hidden or read-only, respectively.
 
 As an example, we provide a buggy pytorch code repository in `data/pytorch`.
 
     python scripts/run.py scripts/config.yaml --agent <agent name>
 
-#### 3.5. Debugging a Custom SWE-Smith Instance
+#### 3.6. Debugging a Custom SWE-Smith Instance
 
 [SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows you to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) that has a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), you can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
 
     python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
 
-#### 3.6. Design Your Own Tool
+#### 3.7. Design Your Own Tool
 `debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).
 
-#### 3.7. Analysis and Visualization
+#### 3.8. Analysis and Visualization
 
 We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
 - In the `analysis` folder, we provide the scripts used to generate the corresponding figures in our technical report.
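
Reviewer sketch (not part of this diff): the README hunk for section 3.4 describes two custom Jinja filters, `to_pretty_json` and `trim_message`, without showing how they behave. The Python below is a minimal illustration of the documented behavior, not the implementation shipped in `debug-gym`. It makes two loud assumptions: lengths are counted in characters rather than tokens, and `context_length` is a hypothetical parameter standing in for the model's context window.

```python
import json


def to_pretty_json(value):
    """Serialize a Python object as indented JSON."""
    return json.dumps(value, indent=2, sort_keys=True)


def trim_message(message, max_length=None, max_length_percentage=None,
                 where="middle", context_length=8192, ellipsis="..."):
    """Trim `message` so it fits within a length budget.

    ASSUMPTION: lengths are measured in characters; the README describes
    token-based limits. `context_length` is a hypothetical stand-in for
    the LLM's context window size.
    """
    # Drop characters that cannot be encoded as UTF-8 (e.g. lone surrogates).
    message = message.encode("utf-8", errors="ignore").decode("utf-8")

    # A percentage budget is converted to an absolute one.
    if max_length_percentage is not None:
        max_length = int(context_length * max_length_percentage)

    if max_length is None or len(message) <= max_length:
        return message
    if max_length <= len(ellipsis):
        return ellipsis[:max_length]

    keep = max_length - len(ellipsis)  # characters of the original to keep
    if where == "end":
        # Cut the end, keep the beginning.
        return message[:keep] + ellipsis
    if where == "start":
        # Cut the beginning, keep the end.
        return ellipsis + message[-keep:]
    if where == "middle":
        # Cut the middle, keep both ends.
        head = keep // 2
        tail = keep - head
        return message[:head] + ellipsis + message[len(message) - tail:]
    raise ValueError(f"unknown value for where: {where!r}")


if __name__ == "__main__":
    print(to_pretty_json({"tool": "pdb", "args": ["b 10"]}))
    print(trim_message("abcdefghij", max_length=7, where="middle"))
```

With Jinja itself, functions like these are typically exposed to templates by assigning them into the environment's `filters` mapping, for example `env.filters["trim_message"] = trim_message`, after which `{{ info.dir_tree | trim_message(max_length=1000) }}` resolves to a call with the piped value as the first argument.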