Enable configurable context condensation in all benchmarks #429

juanmichelini wants to merge 4 commits into main from
Conversation
This change enables context condensation in all benchmarks and makes it configurable via `config.py` files and command-line arguments. The default condenser from software-agent-sdk is now used by default with `max_size=80` and `keep_first=4`.

Changes:
- Add condenser configuration fields to EvalMetadata
- Add CONDENSER_DEFAULTS to `config.py` files in swebench, swtbench, and swebenchmultimodal
- Add command-line arguments for controlling the condenser (`--enable-condenser`, `--disable-condenser`, `--condenser-max-size`, `--condenser-keep-first`)
- Update agent creation in all benchmarks to use LLMSummarizingCondenser when enabled
- Add comprehensive tests for condenser configuration

Fixes #407

Co-authored-by: openhands <openhands@all-hands.dev>
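The CLI flags described above, including the documented precedence of `--disable-condenser` over `--enable-condenser`, could be wired roughly as in this sketch (a hypothetical stand-alone `argparse` version; the actual parsing lives in each benchmark's run script and may differ):

```python
import argparse

def parse_condenser_args(argv: list[str]) -> argparse.Namespace:
    """Hypothetical sketch of the condenser CLI flags added by this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-condenser", action="store_true")
    parser.add_argument("--disable-condenser", action="store_true")
    parser.add_argument("--condenser-max-size", type=int, default=80)
    parser.add_argument("--condenser-keep-first", type=int, default=4)
    args = parser.parse_args(argv)
    # The condenser defaults to enabled; --disable-condenser takes
    # precedence over --enable-condenser so it can always be turned off.
    args.condenser_enabled = not args.disable_condenser
    return args
```

For example, `parse_condenser_args(["--disable-condenser", "--enable-condenser"])` yields `condenser_enabled=False`, while an empty argument list keeps the condenser on with the defaults.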
@OpenHands `condenser_max_size` should be 240 and `condenser_keep_first` should be 2. Change that for all benchmarks.
I'm on it! juanmichelini can track my progress at all-hands.dev
…and Commit0 support

- Update `condenser_max_size` from 80 to 240 and `condenser_keep_first` from 4 to 2 across all benchmarks
- Update EvalMetadata defaults in models.py to use the new values (240/2)
- Add condenser configuration to the GAIA and Commit0 benchmarks:
  - Add CONDENSER_DEFAULTS to `config.py` files
  - Add the LLMSummarizingCondenser import and agent-creation logic to `run_infer.py` files
- Update tests to reflect the new condenser defaults and add coverage for GAIA and Commit0
- All tests pass and pre-commit checks pass

Co-authored-by: openhands <openhands@all-hands.dev>
Summary of Changes

I've updated the condenser configuration across all benchmarks and added condenser support to GAIA and Commit0 as requested. Here's what was completed:

✅ Checklist - All Requirements Met
Verification
The PR is now updated with all requested changes. All benchmarks (SWE-bench, SWT-bench, SWE-bench Multimodal, GAIA, and Commit0) now use the updated condenser defaults (`condenser_max_size=240`, `condenser_keep_first=2`).
@OpenHands address enyst's concerns please
I'm on it! juanmichelini can track my progress at all-hands.dev
Address review feedback from enyst: The parameter name `service_id` is deprecated and has been replaced with `usage_id` in the current SDK version. Updated all 6 benchmark files to use the correct parameter name:

- benchmarks/gaia/run_infer.py
- benchmarks/swebenchmultimodal/run_infer.py
- benchmarks/swebench/run_infer.py
- benchmarks/swtbench/run_infer.py
- benchmarks/commit0/run_infer.py
- benchmarks/multiswebench/run_infer.py

Co-authored-by: openhands <openhands@all-hands.dev>
Summary of Work

I successfully addressed enyst's review concerns on PR #429 "Enable configurable context condensation in all benchmarks".

Issue Addressed

Enyst identified that the parameter name `service_id` is deprecated in the current SDK version and has been replaced with `usage_id`.

Changes Made

✅ Updated all 6 benchmark files to use the correct parameter name
Each file had one line changed:

```python
# Before:
llm=self.metadata.llm.model_copy(update={"service_id": "condenser"})
# After:
llm=self.metadata.llm.model_copy(update={"usage_id": "condenser"})
```

Verification

✅ All pre-commit checks passed (Ruff format, Ruff lint, pycodestyle, Pyright)

The PR has been updated and enyst's concerns have been fully addressed. The changes are concise and only modify what was necessary to fix the deprecated parameter name.
@enyst @csmith49 I'm running an integration test here https://github.com/OpenHands/software-agent-sdk/actions/runs/22509555202; if that finishes successfully, we can check the logs for correct condensation.
Summary
This PR enables context condensation in all benchmarks and makes it configurable via `config.py` files and command-line arguments. The default condenser from software-agent-sdk (LLMSummarizingCondenser) is now used by default with `max_size=80` and `keep_first=4`.

Fixes #407
Changes
Configuration
EvalMetadata: Added three new fields to support condenser configuration:
- `enable_condenser` (bool, default: True): Enable/disable the context condenser
- `condenser_max_size` (int, default: 80): Maximum number of events before condensing
- `condenser_keep_first` (int, default: 4): Number of initial events to always keep

Benchmark configs: Added `CONDENSER_DEFAULTS` to:

- benchmarks/swebench/config.py
- benchmarks/swtbench/config.py
- benchmarks/swebenchmultimodal/config.py

Command-Line Arguments
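The three EvalMetadata fields might look like the following sketch (a plain-dataclass stand-in with the field names and defaults from this PR description; the real EvalMetadata model in the repo may be a pydantic model with more fields):

```python
from dataclasses import dataclass

@dataclass
class EvalMetadataCondenserFields:
    """Sketch of the condenser fields added to EvalMetadata.

    Defaults match this PR description; later commits in the
    conversation change them to 240/2.
    """
    enable_condenser: bool = True     # enable/disable the context condenser
    condenser_max_size: int = 80      # max number of events before condensing
    condenser_keep_first: int = 4     # initial events that are always kept
```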
Added new CLI arguments to control condenser behavior:
- `--enable-condenser`: Explicitly enable the condenser
- `--disable-condenser`: Disable the condenser (takes precedence over enable)
- `--condenser-max-size N`: Set the maximum number of events before condensing
- `--condenser-keep-first N`: Set the number of initial events to always keep

Agent Creation
Updated agent creation in all benchmark evaluation classes to use LLMSummarizingCondenser when enabled:

- benchmarks/swebench/run_infer.py
- benchmarks/swtbench/run_infer.py
- benchmarks/swebenchmultimodal/run_infer.py
- benchmarks/multiswebench/run_infer.py

Testing
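The agent-creation logic might be sketched as below. The `usage_id` copy and the `LLMSummarizingCondenser(llm=…, max_size=…, keep_first=…)` shape come from this PR; the `LLMConfig` stub and `build_condenser` helper are stand-ins so the sketch runs without the SDK:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LLMConfig:
    """Stand-in for the SDK's LLM config (hypothetical)."""
    model: str
    usage_id: str = "agent"

    def model_copy(self, update: dict) -> "LLMConfig":
        # Mimics pydantic's model_copy(update=...): a new object, original untouched.
        return replace(self, **update)

@dataclass
class LLMSummarizingCondenser:
    """Stand-in for the SDK's condenser class."""
    llm: LLMConfig
    max_size: int
    keep_first: int

def build_condenser(llm: LLMConfig, enable: bool,
                    max_size: int, keep_first: int):
    """Sketch of the per-benchmark agent-creation logic described above."""
    if not enable:
        return None
    return LLMSummarizingCondenser(
        # A copy of the agent LLM with a separate usage ID, so condenser
        # token usage is tracked apart from the main agent.
        llm=llm.model_copy(update={"usage_id": "condenser"}),
        max_size=max_size,
        keep_first=keep_first,
    )
```

The `model_copy(update={"usage_id": "condenser"})` step leaves the agent's own LLM config unchanged while giving the condenser its own usage ID.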
Added comprehensive test coverage in `tests/test_condenser_config.py`.

All tests pass and pre-commit checks (ruff, pycodestyle, pyright) pass.
Usage
Default behavior (condenser enabled)
Disable condenser
Custom condenser settings
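The three usage modes listed above might look like the following invocations (hypothetical command lines: the flags are from this PR, but the exact entry point and the other required arguments of each benchmark's `run_infer.py` are not shown here):

```shell
# Default behavior: condenser enabled with the configured defaults
python -m benchmarks.swebench.run_infer

# Disable the condenser (takes precedence over --enable-condenser)
python -m benchmarks.swebench.run_infer --disable-condenser

# Custom condenser settings
python -m benchmarks.swebench.run_infer \
    --condenser-max-size 240 --condenser-keep-first 2
```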
Notes
- The `--disable-condenser` flag takes precedence over `--enable-condenser` to allow explicit disabling
- The condenser's LLM is given a separate usage ID (`"condenser"`) to track token usage separately from the main agent