
DRAFT: Upgrade to Python 3.13 with libtmux race condition fix #1978

Closed

neubig wants to merge 11 commits into main from upgrade-python313-with-libtmux-fix

Conversation

@neubig
Contributor

@neubig neubig commented Feb 9, 2026

Summary

This PR upgrades the project to Python 3.13 and pins libtmux to neubig/libtmux@fix/new-session-race-condition, which contains the fix for the race condition reported in libtmux#624.

Context

See upstream PR: tmux-python/libtmux#625

The issue was that new_session() in libtmux would:

  1. Run tmux new-session -P -F '#{session_id}' to create the session
  2. Immediately run tmux list-sessions to fetch full session data

This created a race condition in Python 3.13 environments (especially with PyInstaller + Docker) where list-sessions might not see the newly created session yet, causing TmuxObjectDoesNotExist errors.

The fix expands the -F format string to include all Session fields and parses the output directly, eliminating the separate list-sessions query entirely.
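
For illustration, a minimal sketch of that approach (the field list, separator, and parsing below are assumptions for demonstration, not libtmux's actual code):

import subprocess

# Ask new-session to print every field we need up front (-P -F), so no
# follow-up list-sessions query is required. Real libtmux uses its full
# Session field list; these three fields and the separator are examples.
FIELDS = ["session_id", "session_name", "session_created"]
FMT = "\x1f".join("#{%s}" % f for f in FIELDS)  # unit-separator-delimited

out = subprocess.run(
    ["tmux", "new-session", "-d", "-P", "-F", FMT],
    capture_output=True, text=True, check=True,
).stdout.strip()

session = dict(zip(FIELDS, out.split("\x1f")))
print(session["session_id"])  # e.g. "$3" -- no second tmux call needed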

Changes

  • Update target-version from py312 to py313 in root pyproject.toml (ruff)
  • Update pythonVersion from 3.12 to 3.13 in root pyproject.toml (pyright)
  • Update Python version in server.yml build matrix from 3.12 to 3.13
  • Update Python version in pypi-release.yml from 3.12 to 3.13
  • Update Python version in pr-review action from 3.12 to 3.13
  • Pin libtmux to neubig's branch: libtmux @ git+https://github.com/neubig/libtmux.git@fix/new-session-race-condition

Testing

This PR needs integration tests to verify the libtmux fix works correctly in our CI environment. The integration-test label should trigger those tests.

Note

This is a draft PR to test the libtmux fix. Once the upstream PR is merged and released to PyPI, we should update the dependency to the released version.




Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:40e712f-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-40e712f-python \
  ghcr.io/openhands/agent-server:40e712f-python

All tags pushed for this build

ghcr.io/openhands/agent-server:40e712f-golang-amd64
ghcr.io/openhands/agent-server:40e712f-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:40e712f-golang-arm64
ghcr.io/openhands/agent-server:40e712f-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:40e712f-java-amd64
ghcr.io/openhands/agent-server:40e712f-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:40e712f-java-arm64
ghcr.io/openhands/agent-server:40e712f-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:40e712f-python-amd64
ghcr.io/openhands/agent-server:40e712f-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:40e712f-python-arm64
ghcr.io/openhands/agent-server:40e712f-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:40e712f-golang
ghcr.io/openhands/agent-server:40e712f-java
ghcr.io/openhands/agent-server:40e712f-python

About Multi-Architecture Support

  • Each variant tag (e.g., 40e712f-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 40e712f-python-amd64) are also available if needed

- Update target-version and pythonVersion to 3.13 in root pyproject.toml
- Update Python version in server.yml build matrix to 3.13
- Update Python version in pypi-release.yml to 3.13
- Update Python version in pr-review action to 3.13
- Pin libtmux to neubig/libtmux#fix/new-session-race-condition branch
  which fixes the race condition in new_session() that causes
  TmuxObjectDoesNotExist errors in Python 3.13 environments

The libtmux fix avoids the race condition by eliminating the separate
list-sessions query after session creation, instead parsing the session
data directly from the -P output of new-session.

Fixes the Python 3.13 + PyInstaller + Docker compatibility issue
reported in libtmux#624.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the integration-test Runs the integration tests and comments the results label Feb 9, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 9, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.90
Models Tested: 4
Timestamp: 2026-02-09 21:08:59 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.03 586,389
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.15 225,951
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.31 230,266
litellm_proxy_claude_sonnet_4_5_20250929 100.0% 8/8 0 8 $0.41 236,012

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 571,330, completion: 15,059, cache_read: 529,280, reasoning: 6,086
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_122e797_deepseek_v3_2_reasoner_run_N8_20260209_210520
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.15
  • Token Usage: prompt: 220,509, completion: 5,442, cache_read: 172,544
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_122e797_kimi_k2_thinking_run_N8_20260209_210544
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.31
  • Token Usage: prompt: 223,711, completion: 6,555, cache_read: 120,260, reasoning: 4,076
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_122e797_gemini_3_pro_run_N8_20260209_210520

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 229,436, completion: 6,576, cache_read: 158,218, cache_write: 70,849, reasoning: 1,996
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_122e797_claude_sonnet_4_5_20250929_run_N8_20260209_210519

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

Coverage

Coverage Report

File  Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/agent
   base.py  189  19  89%  200, 289, 293–297, 345–347, 357, 367, 375–376, 486, 523–524, 534–535
openhands-tools/openhands/tools/terminal/terminal
   tmux_terminal.py  139  66  52%  29–30, 41–43, 45–56, 83, 90, 119, 123–126, 130, 133–136, 140, 142–143, 146–152, 155, 159–162, 199, 203, 207, 218, 229, 243, 255–262, 270–271, 273–274, 276, 278–280
TOTAL  18782  5244  72%

@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 9, 2026
@github-actions
Contributor

github-actions bot commented Feb 9, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-09 21:45:19 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 28.7s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 21.6s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.9s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 48.3s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 19.9s $0.02
01_standalone_sdk/10_persistence.py ✅ PASS 27.5s $0.02
01_standalone_sdk/11_async.py ✅ PASS 34.3s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 16.5s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.5s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 25s $0.53
01_standalone_sdk/17_image_input.py ✅ PASS 18.7s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 27.9s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.8s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 22.3s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.7s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 22.2s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 27s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 5s $0.41
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 16s $0.18
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.6s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 38.4s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 36.6s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 11.2s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 7s $0.21
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.8s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.0s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 58.3s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 4s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.8s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 1s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 31.8s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.1s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 18.7s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 44.4s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.8s $0.01

❌ Some tests failed

Total: 38 | Passed: 33 | Failed: 5 | Total Cost: $1.94

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

…HUB_SHA

GitHub Actions sets GITHUB_SHA to the merge commit by default, which
differs from the PR head commit. Use a custom variable AGENT_SERVER_SHA
to explicitly pass the PR head SHA to example scripts for Docker image
selection.

Co-authored-by: openhands <openhands@all-hands.dev>
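
For illustration, a hedged sketch of how an example script could consume that variable (the GITHUB_SHA fallback and the tag format here are assumptions, not the PR's exact code):

import os

# AGENT_SERVER_SHA is the custom variable described in the commit above;
# falling back to GITHUB_SHA and the "<short-sha>-python" tag format are
# illustrative.
sha = os.environ.get("AGENT_SERVER_SHA") or os.environ.get("GITHUB_SHA", "")
image = f"ghcr.io/openhands/agent-server:{sha[:7]}-python"
print(image)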
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 10, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-10 09:37:15 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.3s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.9s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.1s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 41.2s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 13.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 27.9s $0.02
01_standalone_sdk/11_async.py ✅ PASS 28.7s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 11.2s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.8s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 22s $0.54
01_standalone_sdk/17_image_input.py ✅ PASS 17.2s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 25.1s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.2s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 19.5s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 14.8s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 18.1s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 10s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 2m 54s $0.22
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 4s $0.17
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 18.6s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 28.3s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 38.6s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 12.2s $0.01
01_standalone_sdk/31_iterative_refinement.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 23.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.9s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 21.3s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.1s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 1m 5s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 5.7s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 11s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.3s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 31.8s $0.04
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 10s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 10.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.5s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.52

Failed examples:

  • examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

- Regenerate uv.lock with pinned libtmux git dependency
- Simplify Generator[T, None, None] to Generator[T] in test files

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 10, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-10 09:54:23 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.7s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 20.0s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 14.0s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 30.2s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 18.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 35.4s $0.03
01_standalone_sdk/11_async.py ✅ PASS 31.4s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 20.0s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.1s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 6m 29s $0.84
01_standalone_sdk/17_image_input.py ✅ PASS 16.4s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 23.3s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.6s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.8s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 11.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 16.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 57.8s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 51.3s $0.05
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 44s $0.19
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 22.6s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 35.2s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 44.6s $0.04
01_standalone_sdk/30_tom_agent.py ✅ PASS 10.6s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 7s $0.22
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.6s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 13.1s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.1s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 59.4s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.8s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.9s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 11s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.6s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 5.6s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.8s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 57.7s $0.07
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 14.0s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.6s $0.01

❌ Some tests failed

Total: 38 | Passed: 33 | Failed: 5 | Total Cost: $1.90

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

The SHORT_SHA extraction was using github.sha (merge commit) instead of
the GITHUB_SHA env var (which is set to PR head SHA for pull_request events).
This caused Docker images to be tagged with the wrong SHA, making them
unfindable by run-examples.yml, which uses the PR head SHA.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@neubig neubig removed the integration-test Runs the integration tests and comments the results label Feb 11, 2026
@neubig neubig added the integration-test Runs the integration tests and comments the results label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.91
Models Tested: 4
Timestamp: 2026-02-11 00:48:22 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.03 590,886
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.18 276,821
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.29 210,710
litellm_proxy_claude_sonnet_4_5_20250929 100.0% 8/8 0 8 $0.41 233,488

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 577,382, completion: 13,504, cache_read: 535,168, reasoning: 5,367
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_252274c_deepseek_v3_2_reasoner_run_N8_20260211_004507
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.18
  • Token Usage: prompt: 271,307, completion: 5,514, cache_read: 214,528
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_252274c_kimi_k2_thinking_run_N8_20260211_004500
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.29
  • Token Usage: prompt: 204,992, completion: 5,718, cache_read: 104,221, reasoning: 3,287
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_252274c_gemini_3_pro_run_N8_20260211_004458

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 226,859, completion: 6,629, cache_read: 155,499, cache_write: 70,991, reasoning: 1,808
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_252274c_claude_sonnet_4_5_20250929_run_N8_20260211_004500

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@neubig neubig removed the integration-test Runs the integration tests and comments the results label Feb 11, 2026
@neubig neubig added the integration-test Runs the integration tests and comments the results label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

GITHUB_SHA is a reserved GitHub Actions environment variable that cannot
be overridden at job level. Use SDK_SHA which is already supported by
build.py and takes precedence over GITHUB_SHA.

This ensures Docker images are tagged with the PR head SHA (not the merge
commit SHA) so run-examples.yml can find the correct images.

Co-authored-by: openhands <openhands@all-hands.dev>
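
A minimal sketch of the precedence described above, assuming build.py resolves the SHA roughly like this (names beyond SDK_SHA/GITHUB_SHA are illustrative):

import os

def resolve_sha() -> str:
    # SDK_SHA wins; GITHUB_SHA is reserved and points at the merge commit
    # for pull_request events, not the PR head.
    return os.environ.get("SDK_SHA") or os.environ.get("GITHUB_SHA", "unknown")

short_sha = resolve_sha()[:7]  # tags like 40e712f-python use the short form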
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 01:17:01 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.4s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.3s $0.01
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.3s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 35.1s $0.01
01_standalone_sdk/09_pause_example.py ✅ PASS 14.9s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.0s $0.01
01_standalone_sdk/11_async.py ✅ PASS 30.0s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 19.5s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 18.0s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 34s $0.57
01_standalone_sdk/17_image_input.py ✅ PASS 16.9s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 25.5s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 12.0s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.6s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.1s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 15.3s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 22s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 46s $0.30
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.21
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.4s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 31.7s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 37.9s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 17.4s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 49s $0.26
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.6s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.3s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 8.8s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.4s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 57.7s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 55.3s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.7s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 9s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 26.6s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 3.5s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.5s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 50.1s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 11.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 6.8s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.83

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 01:17:41 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 26.4s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.6s $0.01
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 10.9s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 48.7s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 14.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 27.5s $0.01
01_standalone_sdk/11_async.py ✅ PASS 32.9s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 18.2s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 24.0s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 7m 0s $0.86
01_standalone_sdk/17_image_input.py ✅ PASS 17.6s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.7s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.6s $0.01
01_standalone_sdk/20_stuck_detector.py ✅ PASS 19.7s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.3s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 14.6s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 30s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 26s $0.27
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.14
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.3s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 38.9s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 41.4s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 16.3s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 47s $0.32
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 18.3s $0.01
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.3s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.9s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 59.1s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.8s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 10s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 32.3s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 23.9s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 43.7s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.8s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 6.8s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $2.07

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.92
Models Tested: 4
Timestamp: 2026-02-11 01:05:18 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.05 855,411
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.15 226,454
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.32 231,543
litellm_proxy_claude_sonnet_4_5_20250929 100.0% 8/8 0 8 $0.41 239,056

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.05
  • Token Usage: prompt: 833,861, completion: 21,550, cache_read: 782,592, reasoning: 9,288
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_dc0581a_deepseek_v3_2_reasoner_run_N8_20260211_005116
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.15
  • Token Usage: prompt: 220,675, completion: 5,779, cache_read: 174,848
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_dc0581a_kimi_k2_thinking_run_N8_20260211_005117
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.32
  • Token Usage: prompt: 226,438, completion: 5,105, cache_read: 109,169, reasoning: 3,289
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_dc0581a_gemini_3_pro_run_N8_20260211_005117

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 232,404, completion: 6,652, cache_read: 161,444, cache_write: 70,580, reasoning: 1,984
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_dc0581a_claude_sonnet_4_5_20250929_run_N8_20260211_005117

The libtmux race condition fix (libtmux#625) requires Python 3.13
environment throughout the build process, not just in the base image.
Previously, the builder stage was using Python 3.12, which meant
dependencies were compiled for Python 3.12 even though the runtime
image used Python 3.13.

Changes:
- Update builder FROM to python:3.13-bookworm
- Update uv python install/venv from 3.12 to 3.13
- Update ARG BASE_IMAGE default to python3.13-nodejs22
- Update comment to reflect libtmux fix status

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 01:50:17 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 23.8s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 18.9s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.4s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.4s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 13.0s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.7s $0.02
01_standalone_sdk/11_async.py ✅ PASS 36.2s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 18.6s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.9s $0.02
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/17_image_input.py ✅ PASS 16.6s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.3s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.5s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 21.0s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 20.6s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 3s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 4m 13s $0.31
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 50s $0.16
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.4s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 46.1s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 33.4s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 8.9s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 5m 2s $0.38
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 22.0s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.6s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.8s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 51.2s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 0s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.2s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 40s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.3s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 3m 3s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.1s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 1s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.8s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.2s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.47

Failed examples:

  • examples/01_standalone_sdk/14_context_condenser.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

When libtmux's new_session() returns a Session with session_id=None
(due to a bug in parse_output() not handling fewer output values),
we now retry fetching the session from server.sessions.

This is a workaround for the issue where neubig's fix PR #625 doesn't
fully handle cases where tmux returns fewer than the expected 125
format field values, causing the zip() to truncate before reaching
session_id at index 92.

The workaround:
1. Check if session_id is None after new_session()
2. If so, retry up to 3 times with increasing delays
3. Fetch the session by name from server.sessions
4. Raise a clear error if all retries fail

See: tmux-python/libtmux#624

Co-authored-by: openhands <openhands@all-hands.dev>
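
A rough sketch of this workaround (the function name, retry count, and delays are illustrative, not the exact code in this PR):

import time
import libtmux

def new_session_with_retry(server: libtmux.Server, name: str):
    session = server.new_session(session_name=name)
    for attempt in range(3):
        if session.session_id is not None:
            return session
        time.sleep(0.1 * (attempt + 1))  # increasing delay before re-checking
        # The session exists in tmux even when parsing failed, so re-fetch
        # it by name from the server's session list.
        matches = [s for s in server.sessions if s.session_name == name]
        if matches:
            session = matches[0]
    if session.session_id is None:
        raise RuntimeError(f"Failed to get session_id for session {name!r}")
    return session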
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 02:13:32 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 29.9s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 19.5s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.4s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 42.8s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 17.1s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.7s $0.02
01_standalone_sdk/11_async.py ✅ PASS 38.8s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 19.0s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 23.1s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 58s $0.38
01_standalone_sdk/17_image_input.py ✅ PASS 14.5s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 15.9s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.0s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 18.4s $0.03
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 11s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 33s $0.26
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 23s $0.19
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.2s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 33.3s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 41.2s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 21.9s $0.02
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 54s $0.28
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 24.9s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.9s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 10.0s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.0s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 57.8s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 5s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 17.1s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 11s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 39.3s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 59s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 23.6s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 3s $0.10
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 13.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 12.6s $0.01

❌ Some tests failed

Total: 38 | Passed: 33 | Failed: 5 | Total Cost: $1.75

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

Improved the workaround for the libtmux race condition where
new_session() returns a Session with session_id=None.

Changes:
- Wrap entire session creation in retry loop (up to 5 attempts)
- Try to get session from server.sessions if session_id is None
- Add increasing delays between retries
- Clean up orphan sessions before retry
- Better error messages with last error included

The root issue is that neubig's PR #625 fix tries to parse 125+
format fields from tmux output, but tmux may not output all fields
in some environments (especially Python 3.13 + PyInstaller + Docker),
causing session_id (at index 92) to be missing from the parsed output.

Co-authored-by: openhands <openhands@all-hands.dev>
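
Sketch of the revised workaround (attempt counts, delays, and cleanup details are assumptions based on the description above, not the PR's exact code):

import time
import libtmux

def create_session(server: libtmux.Server, name: str, attempts: int = 5):
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            session = server.new_session(session_name=name)
            if session.session_id is not None:
                return session
            # Parsing failed but the session exists in tmux: kill the
            # orphan before retrying the whole creation.
            server.cmd("kill-session", "-t", name)
        except Exception as exc:
            last_error = exc
        time.sleep(0.2 * (attempt + 1))  # increasing delay between attempts
    raise RuntimeError(f"Could not create session {name!r}; last error: {last_error}")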
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 02:38:19 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 26.0s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.3s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.5s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.0s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 15.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 29.4s $0.02
01_standalone_sdk/11_async.py ✅ PASS 35.9s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 20.5s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.8s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 48s $0.56
01_standalone_sdk/17_image_input.py ✅ PASS 15.3s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.4s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.4s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 12.8s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.6s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 19.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 31s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 12s $0.39
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 58s $0.17
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 22.8s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 34.8s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 40.9s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 17.4s $0.02
01_standalone_sdk/31_iterative_refinement.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 4.1s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.9s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.9s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 46.6s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 2s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 17.8s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 22s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.0s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 48s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 21.1s $0.01
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 40.6s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 11.1s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.1s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.69

Failed examples:

  • examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@neubig
Contributor Author

neubig commented Feb 11, 2026

@OpenHands merge main. Then carefully read all the surrounding context and debug why 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py is failing despite our best efforts to fix it. Gather information from the GitHub Actions logs, reproduce locally, and then try to push a fix. You can re-trigger test-examples and see what happens to debug, but do that sparingly, and only after you've debugged as much as you can locally.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

I'm on it! neubig can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
The libtmux fix (neubig's PR #625) uses a format string with 125+ fields,
but tmux may not output all fields correctly in some environments
(Python 3.13 + PyInstaller + Docker), causing session_id (at index 92)
to be missing from the parsed output.

This adds a fallback that directly queries tmux using a simple format
string ('#{session_id}:#{session_name}') to get the session_id when
libtmux's complex format parsing fails.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Feb 12, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Examples Scripts

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1978 at branch `upgrade-python313-with-libtmux-fix`

Feel free to include any additional details that might help me get this PR into a better state.


@openhands-ai

openhands-ai bot commented Feb 12, 2026

Summary

I successfully debugged and fixed the 02_convo_with_docker_sandboxed_server.py test failure.

Root Cause Analysis

The test was failing with:

RuntimeError: Failed to get session_id for session 'openhands-None-...' after 5 attempts.
See https://github.com/tmux-python/libtmux/issues/624

The issue was that the libtmux fix (neubig's PR #625) uses a format string with 125+ fields to parse tmux output. The session_id field is at index 92. In certain environments (Python 3.13 + PyInstaller + Docker), tmux may not output all fields correctly, causing session_id to be missing from the parsed output.

The existing retry logic tried to get the session from server.sessions.filter(), but this uses the same broken parsing logic, so it also failed.

Fix Implemented

Added a new fallback function _get_session_id_directly() that:

  1. Runs a simple tmux list-sessions -F '#{session_id}:#{session_name}' command
  2. Parses the output to find the session_id for the given session name
  3. Bypasses libtmux's complex 125-field format parsing entirely

This fallback is tried before the existing server.sessions.filter() fallback, providing a more reliable way to get the session_id.
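
A sketch of what such a fallback can look like, built around the tmux command quoted above (the function body is illustrative, not the exact code in the PR):

import subprocess

def get_session_id_directly(session_name: str) -> str | None:
    # Two-field format string instead of libtmux's 125-field one.
    out = subprocess.run(
        ["tmux", "list-sessions", "-F", "#{session_id}:#{session_name}"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        session_id, _, name = line.partition(":")
        if name == session_name:
            return session_id  # e.g. "$3"
    return None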

Changes Made

  1. Merged main into the branch to get latest changes
  2. Added _get_session_id_directly() function in openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py
  3. Updated the retry logic to use the new fallback as the first option

Verification

  • ✅ Local testing passed (created 3 tmux terminals successfully)
  • ✅ CI test-examples workflow passed for the Docker-related tests:
    • 02_convo_with_docker_sandboxed_server.py - PASS (was failing)
    • 03_browser_use_with_docker_sandboxed_server.py - PASS (was failing)
    • 04_convo_with_api_sandboxed_server.py - PASS (was failing)

The remaining 3 failures are unrelated to the libtmux issue:

  • 34_critic_example.py - Missing CRITIC_SERVER_URL env var
  • 38_browser_session_recording.py - Timeout (new test)
  • 08_convo_with_apptainer_sandboxed_server.py - Uses latest-python image instead of PR-specific image

Commits Pushed

  1. 7dd2427e - Merge main into upgrade-python313-with-libtmux-fix
  2. 40e712f6 - fix: add direct tmux fallback for session_id resolution

View full conversation

@neubig neubig added the review-this This label triggers a PR review by OpenHands label Feb 12, 2026 — with OpenHands AI
Collaborator

@all-hands-bot all-hands-bot left a comment


🔴 Critical Design Issues - Do Not Merge Yet

Taste Rating: 🔴 Needs fundamental rework

This PR attempts to fix a real problem (libtmux race condition in Python 3.13) but shows signs that the upstream fix is incomplete. The 80+ lines of workaround/retry logic suggest we're building a band-aid on top of a band-aid.

Core Issue: If the upstream libtmux fix works as intended, why do we need extensive retry logic, fallback mechanisms, and manual session_id extraction? The answer is in your own comments: "tmux may not output all fields" - the upstream fix is incomplete.

Key Insight: Pinning to an unmerged git branch with extensive workarounds is the wrong approach. Either (1) wait for a proper upstream release, or (2) fix the root cause in libtmux properly, or (3) implement a simpler workaround that doesn't require 80 lines of retry spaghetti.

See detailed inline comments below.

Contributor Author

neubig commented Feb 15, 2026

Thanks for the thorough review! I agree with the assessment:

  1. Keeping as DRAFT - This PR will remain as a draft until the upstream libtmux fix is merged and released to PyPI. The git branch dependency is indeed fragile.

  2. Over-engineering concern - The multiple fallback mechanisms were added defensively because the upstream fix was unstable. Once the upstream is properly released, we can simplify to a single approach.

  3. Splitting changes - Will consider splitting into separate PRs:

    • Type hint modernization (Python 3.9+ style)
    • CI variable rename (SDK_SHA/SDK_REF)
    • Python 3.13 + libtmux upgrade (blocked on upstream release)
  4. Debian upgrade - The bullseye → bookworm change was intentional to match Python 3.13's default image. Will verify compatibility before merging.

Leaving the review threads unresolved as reminders for when we revisit this PR after the upstream is ready.

@neubig neubig requested a review from all-hands-bot February 15, 2026 23:23
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable for Draft/Testing

Taste Rating: 🟡 Pragmatic for its purpose

This draft PR achieves its stated goal: testing the Python 3.13 upgrade with the libtmux race condition fix in CI. Previous architectural concerns (git branch pin, retry complexity) were discussed and marked resolved - the maintainer understands this is temporary until upstream releases to PyPI.


Key Observations

1. SDK_SHA infrastructure already exists

  • build.py was updated in Nov 2025 (commit 204d3a4) to read SDK_SHA with fallback to GITHUB_SHA
  • The workflow changes in this PR correctly use this existing mechanism
  • No changes to build.py needed

2. Variable naming is intentional

  • SDK_SHA (in server.yml) = build context for tagging Docker images
  • AGENT_SERVER_SHA (in run-examples.yml) = runtime context for examples to pull images
  • Different names reflect their different purposes in separate workflows

3. Comprehensive comments

  • The added comments explaining SHA handling make the confusing GitHub Actions variable behavior clearer
  • Debug echo statements (server.yml:242-244) help verify SHA handling during testing
  • Good engineering for maintainability

4. Type hints are clean

  • Generator[T] syntax is valid as of Python 3.13, where the send and return type parameters default to None
  • Changes are consistent across codebase

For Production Merge (when upstream is ready)

Before merging to production:

  • ✅ Wait for upstream libtmux PR to be merged and released to PyPI (already noted in PR description)
  • 🟢 Consider removing debug echo statements or gate behind a debug flag
  • 🟡 Evaluate whether the extensive retry logic indicates the upstream fix needs more work

Verdict

Approved for draft/testing purposes

The code is pragmatic for validating the libtmux fix in your CI environment. The previous review comprehensively covered the blockers for production merge, and those concerns remain valid - but this is explicitly a draft PR for testing, not for immediate production use.

The approach is sound for its stated purpose. Good luck with the testing! 🚀

Contributor Author

neubig commented Feb 16, 2026

Closing in favor of #2092, which has a much simpler fix.

The root cause was discovered to be a locale issue, not a race condition in libtmux. When LC_ALL/LANG are not set in PyInstaller builds, tmux converts UTF-8 separator characters to underscores, breaking libtmux's format parsing.

The fix is simply adding ENV LC_ALL=C.UTF-8 and ENV LANG=C.UTF-8 to the Dockerfile - no code changes or custom libtmux forks needed.
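
For illustration, the same idea applied when launching tmux from Python (the Dockerfile ENV lines are the actual fix; this sketch is an assumption about the process-level equivalent):

import os
import subprocess

# Ensure a UTF-8 locale so tmux does not rewrite UTF-8 separator
# characters as underscores, which breaks libtmux's format parsing.
env = {**os.environ, "LC_ALL": "C.UTF-8", "LANG": "C.UTF-8"}
subprocess.run(["tmux", "new-session", "-d", "-s", "demo"], env=env, check=True)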

@neubig neubig closed this Feb 16, 2026