4 changes: 4 additions & 0 deletions .gitmodules
@@ -4,3 +4,7 @@
[submodule "prime-rl"]
	path = prime-rl
	url = https://github.com/PrimeIntellect-ai/prime-rl
[submodule "benchmarks"]
	path = benchmarks
	url = https://github.com/adityasoni9998/benchmarks.git
	branch = agentic_code_search
1 change: 1 addition & 0 deletions benchmarks
Submodule benchmarks added at 160f52
6 changes: 6 additions & 0 deletions configs/eval_llm_config_example.json
@@ -0,0 +1,6 @@
{
"model": "openai/Qwen/Qwen3-4B",
"api_key": "dummy",
"base_url": "http://localhost:8000/v1",
"temperature": 0.0
}
239 changes: 239 additions & 0 deletions docs/EVAL_INTEGRATION.md
@@ -0,0 +1,239 @@
# Evaluation Integration Documentation

This document explains how to run evaluations for code localization agents using the integrated benchmarks system.

## Quick Start

### 1. Start a Local Model with vLLM

Start vLLM with tool calling enabled:

```bash
# For a small model (quick testing)
uv run vllm serve Qwen/Qwen3-4B \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
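
Once the server is running, you can sanity-check the endpoint before launching an evaluation. This is a minimal sketch assuming the `openai` Python package is installed; vLLM serves the model under the name `Qwen/Qwen3-4B`:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Should include "Qwen/Qwen3-4B" if the server started correctly.
print([m.id for m in client.models.list()])
```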

### 2. Create LLM Config

```bash
mkdir -p configs
cat > configs/llm_config.json << 'EOF'
{
"model": "openai/Qwen/Qwen3-4B",
"api_key": "dummy",
"base_url": "http://localhost:8000/v1",
"temperature": 0.0
}
EOF
```

**Important:** The model name must be prefixed with `openai/` to tell litellm it's an OpenAI-compatible endpoint.
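
For example, a direct litellm call against the local server looks roughly like this (a sketch only, assuming the `litellm` package is installed and the vLLM server from step 1 is running):

```python
import litellm

# The "openai/" prefix makes litellm route the request to an
# OpenAI-compatible endpoint; api_base points at the local vLLM server.
response = litellm.completion(
    model="openai/Qwen/Qwen3-4B",
    api_base="http://localhost:8000/v1",
    api_key="dummy",
    temperature=0.0,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```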

### 3. Run Evaluation

```bash
./scripts/run_eval.sh \
  --dataset_file benchmarks/gt_location.jsonl \
  --llm-config-path configs/llm_config.json \
  --system_prompt_file benchmarks/benchmarks/agentic_code_search/prompts/system_prompt.j2 \
  --user_prompt_file benchmarks/benchmarks/agentic_code_search/prompts/file_module_short.j2 \
  --tools terminal \
  --max-iterations 10 \
  --num-workers 1 \
  --output-dir ./agentic_code_search_outputs \
  --n-limit 1 \
  --workspace_base_dir /tmp/testbed/
```

**Key options:**
- `--n-limit 1` - Run on 1 instance (remove for full dataset)
- `--num-workers 1` - Parallel workers (increase for faster eval)
- `--max-iterations 10` - Max agent steps per instance

### 4. Check Results

```bash
# View full output
cat ./agentic_code_search_outputs/agentic_code_search_gt_location/openai/Qwen/Qwen3-4B_sdk_*/output.jsonl | jq .

# View just the reward scores
cat ./agentic_code_search_outputs/agentic_code_search_gt_location/openai/Qwen/Qwen3-4B_sdk_*/output.jsonl | jq '.test_result.reward'
```

### Example Output

```json
{
"file_reward": 0.5,
"module_reward": 0.5,
"entity_reward": 0.4,
"prediction": {
"files": ["sklearn/calibration.py", "sklearn/_config.py", "sklearn/isotonic.py"],
"modules": ["sklearn/calibration.py:_CalibratedClassifier", "sklearn/_config.py:set_config", "sklearn/isotonic.py:IsotonicRegression"],
"entities": ["sklearn/isotonic.py:IsotonicRegression.predict", "sklearn/_config.py:set_config", "sklearn/calibration.py:_CalibratedClassifier.predict_proba"]
},
"ground_truth": {
"files": ["sklearn/isotonic.py"],
"modules": ["sklearn/isotonic.py:IsotonicRegression"],
"entities": ["sklearn/isotonic.py:IsotonicRegression.predict", "sklearn/isotonic.py:IsotonicRegression.transform"]
}
}
```

**Metrics explained:**
- **file_reward** - F1 score for file-level localization
- **module_reward** - F1 score for class-level localization
- **entity_reward** - F1 score for function/method-level localization
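
Each reward is a set-level F1 score between the predicted and ground-truth locations at that granularity. Below is a minimal sketch of the computation (not the benchmark's exact implementation, but it reproduces the numbers in the example output above):

```python
def f1_reward(predicted: list[str], ground_truth: list[str]) -> float:
    """Set-based F1 between predicted and ground-truth locations."""
    pred, gt = set(predicted), set(ground_truth)
    true_positives = len(pred & gt)
    if not pred or not gt or true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    return 2 * precision * recall / (precision + recall)

# File level from the example output: 1 of 3 predicted files is correct and
# the single ground-truth file is found -> precision 1/3, recall 1 -> F1 = 0.5
print(f1_reward(
    ["sklearn/calibration.py", "sklearn/_config.py", "sklearn/isotonic.py"],
    ["sklearn/isotonic.py"],
))
```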

---

## Implementation Details

### Goal

Integrate evaluation code from the [benchmarks repo](https://github.com/adityasoni9998/benchmarks/tree/agentic_code_search) into this repository to enable end-to-end training AND evaluation of code localization agents.

**Key requirements:**

- Run trained models on SWE-Bench Pro/Verified benchmarks
- Use the same `software-agent-sdk` for both training and evaluation
- No dependency conflicts with existing SkyRL training setup

### The Problem

The benchmarks repo is designed as a standalone project with its own uv workspace, which expects the SDK to live at `vendor/software-agent-sdk/`. Adding it directly as a workspace member caused:

1. **Nested workspace error** - uv doesn't support workspaces inside workspaces
2. **Dependency conflicts** - `commit0` requires `datasets==3.0.1`, while we need `datasets>=4.0.0`

### The Solution: Runtime sys.path Manipulation

Instead of making benchmarks a proper package in our workspace, we use Python's `sys.path` to import it at runtime:

```python
import sys
sys.path.insert(0, "/path/to/benchmarks")

# Now imports work - and they use OUR installed SDK
from benchmarks.agentic_code_search.run_infer import main
```

**Why this works:**

- When benchmarks code imports `openhands.sdk`, Python searches `sys.path`
- Our SDK packages are already installed via uv workspace
- Python finds our SDK first, not benchmarks' vendor/ (which doesn't exist anyway)
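
A quick way to confirm which SDK actually gets imported is to check the resolved module's origin after inserting the submodule path (a verification sketch, using the same placeholder path as above):

```python
import importlib.util
import sys

sys.path.insert(0, "/path/to/benchmarks")

# The resolved path should point at the SDK installed from this repo's
# software-agent-sdk workspace, not at anything under benchmarks/vendor/.
spec = importlib.util.find_spec("openhands.sdk")
print(spec.origin)
```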

### Version Module Patching

The benchmarks code has a `version.py` that tries to read the SDK's git SHA from `vendor/software-agent-sdk` (which doesn't exist in our setup). The `eval_runner.py` script pre-creates this module with the SHA from our repo's `software-agent-sdk` submodule:

```python
# Pre-create the version module with our SDK SHA before benchmarks imports it
_sdk_sha = _get_sdk_sha_from_parent_repo()
_version_module = ModuleType("benchmarks.utils.version")
_version_module.SDK_SHA = _sdk_sha
_version_module.SDK_SHORT_SHA = _sdk_sha[:7]
sys.modules["benchmarks.utils.version"] = _version_module
```

### Files Added/Modified

| File | Description |
| ------------------------ | ----------------------------------------------------------------------- |
| `benchmarks/` | Git submodule pointing to adityasoni9998/benchmarks@agentic_code_search |
| `.gitmodules` | Submodule configuration |
| `pyproject.toml` | Added jinja2, pandas, tqdm, lmnr dependencies |
| `scripts/eval_runner.py` | Python wrapper that sets up sys.path and runs eval |
| `scripts/run_eval.sh` | Shell wrapper for `uv run` |

### Architecture

```
agentic-code-search-oss/
├── software-agent-sdk/              # Our SDK (used for training AND eval)
│   ├── openhands-sdk/
│   ├── openhands-tools/
│   └── ...
├── benchmarks/                      # Submodule (NOT in workspace)
│   └── benchmarks/
│       └── agentic_code_search/
│           ├── run_infer.py         # Main eval script
│           ├── eval_infer.py        # Results aggregator
│           └── prompts/             # Jinja2 templates
├── scripts/
│   ├── eval_runner.py               # sys.path wrapper
│   └── run_eval.sh                  # Shell wrapper
└── src/                             # Training code (unchanged)
```

### How Evaluation Works

```
┌─────────────────┐
│   run_eval.sh   │
└────────┬────────┘
         │ uv run
┌─────────────────┐
│ eval_runner.py  │
│                 │
│ sys.path.insert │
│  (benchmarks/)  │
└────────┬────────┘
         │ import
┌─────────────────────────────────┐
│ benchmarks.agentic_code_search  │
│                                 │
│  from openhands.sdk import ...  │──► Uses OUR SDK
└─────────────────────────────────┘
```

### Learnings

1. **uv workspaces don't nest** - Can't add a package with its own workspace as a member
2. **sys.path manipulation is clean** - Keeps submodule pristine, easy to update
3. **Python import resolution** - First match in sys.path wins, so our installed SDK is used
4. **Dependency isolation** - We only add deps we actually need, avoiding conflicts
5. **Version module patching** - Pre-create the version module to use our repo's SDK SHA
6. **litellm provider prefix** - Local vLLM endpoints need `openai/` prefix in model name
7. **vLLM tool calling** - Requires `--enable-auto-tool-choice --tool-call-parser hermes` flags

---

## Troubleshooting

### "LLM Provider NOT provided"

Add `openai/` prefix to your model name in `llm_config.json`:
```json
{"model": "openai/Qwen/Qwen3-4B", ...}
```

### "auto tool choice requires --enable-auto-tool-choice"

Restart vLLM with tool calling flags:
```bash
uv run vllm serve Qwen/Qwen3-4B \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

### "Processing 0 instances"

This usually means a previous failed run left stale output behind. Delete the output directory and rerun:
```bash
rm -rf ./agentic_code_search_outputs/
```

### Import errors from benchmarks

Ensure the submodule is initialized:
```bash
git submodule update --init --recursive
```
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -25,6 +25,10 @@ dependencies = [
"seaborn>=0.13.2",
"gcsfs>=2025.3.0",
"lmcache",
"jinja2",
"pandas",
"tqdm",
"lmnr>=0.7.24",
# "flashinfer-python",
# "flashinfer-jit-cache",
]
63 changes: 63 additions & 0 deletions scripts/eval_runner.py
@@ -0,0 +1,63 @@
#!/usr/bin/env python3
"""
Evaluation runner for agentic code search benchmark.

This script adds the benchmarks submodule to sys.path and runs the
agentic_code_search evaluation from the benchmarks package.

Usage:
    python scripts/eval_runner.py --dataset_file <path> --llm-config-path <path> [options]

Example:
    python scripts/eval_runner.py \
        --dataset_file ./data/test.jsonl \
        --llm-config-path ./configs/llm.json \
        --output-dir ./outputs \
        --max-iterations 25 \
        --num-workers 4

For all available options, run:
    python scripts/eval_runner.py --help
"""

import subprocess
import sys
from pathlib import Path
from types import ModuleType

# Add the benchmarks submodule to sys.path so we can import from it
_benchmarks_path = Path(__file__).parent.parent / "benchmarks"
_project_root = Path(__file__).parent.parent
sys.path.insert(0, str(_benchmarks_path.resolve()))


def _get_sdk_sha_from_parent_repo() -> str:
    """Get SDK SHA from the parent repo's software-agent-sdk submodule."""
    sdk_path = _project_root / "software-agent-sdk"
    try:
        result = subprocess.run(
            ["git", "submodule", "status", str(sdk_path)],
            capture_output=True,
            text=True,
            check=True,
            cwd=str(_project_root),
        )
        sha = result.stdout.strip().split()[0].lstrip("+-")
        return sha
    except Exception:
        # Fallback if git command fails
        return "unknown"


# Pre-create the version module with our SDK SHA before benchmarks imports it
_sdk_sha = _get_sdk_sha_from_parent_repo()
_version_module = ModuleType("benchmarks.utils.version")
_version_module.SDK_SHA = _sdk_sha
_version_module.SDK_SHORT_SHA = _sdk_sha[:7] if _sdk_sha != "unknown" else "unknown"
_version_module.PROJECT_ROOT = _benchmarks_path
sys.modules["benchmarks.utils.version"] = _version_module

from benchmarks.agentic_code_search.run_infer import main

if __name__ == "__main__":
    main()
21 changes: 21 additions & 0 deletions scripts/run_eval.sh
@@ -0,0 +1,21 @@
#!/bin/bash
#
# run_eval.sh - Wrapper script to run the evaluation runner with uv
#
# Usage:
# ./scripts/run_eval.sh [OPTIONS]
#
# Example usage:
#   ./scripts/run_eval.sh \
#     --dataset_file benchmarks/gt_location.jsonl \
#     --llm-config-path configs/llm_config.json \
#     --max-iterations 10 \
#     --num-workers 1 \
#     --tools terminal
#
# Options are passed through to scripts/eval_runner.py
# Run with --help to see all available options:
# ./scripts/run_eval.sh --help
#

uv run python scripts/eval_runner.py "$@"