
DRAFT: Upgrade to Python 3.13 with libtmux race condition fix #1978

Closed

neubig wants to merge 11 commits into main from upgrade-python313-with-libtmux-fix

Conversation

@neubig
Contributor

@neubig neubig commented Feb 9, 2026

Summary

This PR upgrades the project to Python 3.13 and pins libtmux to neubig/libtmux@fix/new-session-race-condition, which contains the fix for the race condition reported in libtmux#624.

Context

See upstream PR: tmux-python/libtmux#625

The issue was that new_session() in libtmux would:

  1. Run tmux new-session -P -F '#{session_id}' to create the session
  2. Immediately run tmux list-sessions to fetch full session data

This created a race condition in Python 3.13 environments (especially with PyInstaller + Docker) where list-sessions might not see the newly created session yet, causing TmuxObjectDoesNotExist errors.

The fix expands the -F format string to include all Session fields and parses the output directly, eliminating the separate list-sessions query entirely.
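
For illustration, a minimal sketch of that approach (the field list, separator, and parsing below are assumptions for demonstration, not libtmux's actual code):

import subprocess

# Ask new-session to print every field we need up front (-P -F), so no
# follow-up list-sessions query is required. Real libtmux uses its full
# Session field list; these three fields and the separator are examples.
FIELDS = ["session_id", "session_name", "session_created"]
FMT = "\x1f".join("#{%s}" % f for f in FIELDS)  # unit-separator-delimited

out = subprocess.run(
    ["tmux", "new-session", "-d", "-P", "-F", FMT],
    capture_output=True, text=True, check=True,
).stdout.strip()

session = dict(zip(FIELDS, out.split("\x1f")))
print(session["session_id"])  # e.g. "$3" -- no second tmux call needed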

Changes

  • Update target-version from py312 to py313 in root pyproject.toml (ruff)
  • Update pythonVersion from 3.12 to 3.13 in root pyproject.toml (pyright)
  • Update Python version in server.yml build matrix from 3.12 to 3.13
  • Update Python version in pypi-release.yml from 3.12 to 3.13
  • Update Python version in pr-review action from 3.12 to 3.13
  • Pin libtmux to neubig's branch: libtmux @ git+https://github.com/neubig/libtmux.git@fix/new-session-race-condition

Testing

This PR needs integration tests to verify the libtmux fix works correctly in our CI environment. The integration-test label should trigger those tests.

Note

This is a draft PR to test the libtmux fix. Once the upstream PR is merged and released to PyPI, we should update the dependency to the released version.




Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:40e712f-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-40e712f-python \
  ghcr.io/openhands/agent-server:40e712f-python

All tags pushed for this build

ghcr.io/openhands/agent-server:40e712f-golang-amd64
ghcr.io/openhands/agent-server:40e712f-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:40e712f-golang-arm64
ghcr.io/openhands/agent-server:40e712f-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:40e712f-java-amd64
ghcr.io/openhands/agent-server:40e712f-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:40e712f-java-arm64
ghcr.io/openhands/agent-server:40e712f-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:40e712f-python-amd64
ghcr.io/openhands/agent-server:40e712f-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:40e712f-python-arm64
ghcr.io/openhands/agent-server:40e712f-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:40e712f-golang
ghcr.io/openhands/agent-server:40e712f-java
ghcr.io/openhands/agent-server:40e712f-python

About Multi-Architecture Support

  • Each variant tag (e.g., 40e712f-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 40e712f-python-amd64) are also available if needed

- Update target-version and pythonVersion to 3.13 in root pyproject.toml
- Update Python version in server.yml build matrix to 3.13
- Update Python version in pypi-release.yml to 3.13
- Update Python version in pr-review action to 3.13
- Pin libtmux to neubig/libtmux#fix/new-session-race-condition branch
  which fixes the race condition in new_session() that causes
  TmuxObjectDoesNotExist errors in Python 3.13 environments

The libtmux fix avoids the race condition by eliminating the separate
list-sessions query after session creation, instead parsing the session
data directly from the -P output of new-session.

Fixes the Python 3.13 + PyInstaller + Docker compatibility issue
reported in libtmux#624.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the integration-test Runs the integration tests and comments the results label Feb 9, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 9, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.90
Models Tested: 4
Timestamp: 2026-02-09 21:08:59 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.03 586,389
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.15 225,951
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.31 230,266
litellm_proxy_claude_sonnet_4_5_20250929 100.0% 8/8 0 8 $0.41 236,012

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 571,330, completion: 15,059, cache_read: 529,280, reasoning: 6,086
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_122e797_deepseek_v3_2_reasoner_run_N8_20260209_210520
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.15
  • Token Usage: prompt: 220,509, completion: 5,442, cache_read: 172,544
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_122e797_kimi_k2_thinking_run_N8_20260209_210544
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.31
  • Token Usage: prompt: 223,711, completion: 6,555, cache_read: 120,260, reasoning: 4,076
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_122e797_gemini_3_pro_run_N8_20260209_210520

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 229,436, completion: 6,576, cache_read: 158,218, cache_write: 70,849, reasoning: 1,996
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_122e797_claude_sonnet_4_5_20250929_run_N8_20260209_210519

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

Coverage

Coverage Report

File  Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/agent
   base.py  189  19  89%  200, 289, 293–297, 345–347, 357, 367, 375–376, 486, 523–524, 534–535
openhands-tools/openhands/tools/terminal/terminal
   tmux_terminal.py  139  66  52%  29–30, 41–43, 45–56, 83, 90, 119, 123–126, 130, 133–136, 140, 142–143, 146–152, 155, 159–162, 199, 203, 207, 218, 229, 243, 255–262, 270–271, 273–274, 276, 278–280
TOTAL  18782  5244  72%

@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 9, 2026
@github-actions
Contributor

github-actions bot commented Feb 9, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-09 21:45:19 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 28.7s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 21.6s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.9s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 48.3s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 19.9s $0.02
01_standalone_sdk/10_persistence.py ✅ PASS 27.5s $0.02
01_standalone_sdk/11_async.py ✅ PASS 34.3s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 16.5s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.5s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 25s $0.53
01_standalone_sdk/17_image_input.py ✅ PASS 18.7s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 27.9s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.8s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 22.3s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.7s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 22.2s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 27s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 5s $0.41
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 16s $0.18
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.6s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 38.4s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 36.6s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 11.2s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 7s $0.21
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.8s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.0s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 58.3s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 4s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.8s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 1s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 31.8s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.1s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 18.7s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 44.4s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.8s $0.01

❌ Some tests failed

Total: 38 | Passed: 33 | Failed: 5 | Total Cost: $1.94

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

…HUB_SHA

GitHub Actions sets GITHUB_SHA to the merge commit by default, which
differs from the PR head commit. Use a custom variable AGENT_SERVER_SHA
to explicitly pass the PR head SHA to example scripts for Docker image
selection.

Co-authored-by: openhands <openhands@all-hands.dev>
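
For illustration, a hedged sketch of how an example script could consume that variable (the GITHUB_SHA fallback and the tag format here are assumptions, not the PR's exact code):

import os

# AGENT_SERVER_SHA is the custom variable described in the commit above;
# falling back to GITHUB_SHA and the "<short-sha>-python" tag format are
# illustrative.
sha = os.environ.get("AGENT_SERVER_SHA") or os.environ.get("GITHUB_SHA", "")
image = f"ghcr.io/openhands/agent-server:{sha[:7]}-python"
print(image)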
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 10, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-10 09:37:15 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.3s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.9s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.1s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 41.2s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 13.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 27.9s $0.02
01_standalone_sdk/11_async.py ✅ PASS 28.7s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 11.2s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.8s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 22s $0.54
01_standalone_sdk/17_image_input.py ✅ PASS 17.2s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 25.1s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.2s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 19.5s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 14.8s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 18.1s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 10s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 2m 54s $0.22
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 4s $0.17
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 18.6s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 28.3s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 38.6s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 12.2s $0.01
01_standalone_sdk/31_iterative_refinement.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 23.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.9s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 21.3s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.1s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 1m 5s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 5.7s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 11s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.3s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 31.8s $0.04
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 10s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 10.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.5s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.52

Failed examples:

  • examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

- Regenerate uv.lock with pinned libtmux git dependency
- Simplify Generator[T, None, None] to Generator[T] in test files

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 10, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 10, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-10 09:54:23 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.7s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 20.0s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 14.0s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 30.2s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 18.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 35.4s $0.03
01_standalone_sdk/11_async.py ✅ PASS 31.4s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 20.0s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.1s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 6m 29s $0.84
01_standalone_sdk/17_image_input.py ✅ PASS 16.4s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 23.3s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.6s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.8s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 11.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 16.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 57.8s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 51.3s $0.05
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 44s $0.19
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 22.6s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 35.2s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 44.6s $0.04
01_standalone_sdk/30_tom_agent.py ✅ PASS 10.6s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 7s $0.22
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.6s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 13.1s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.1s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 59.4s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.8s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.9s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 11s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.6s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 5.6s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.8s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 57.7s $0.07
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 14.0s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.6s $0.01

❌ Some tests failed

Total: 38 | Passed: 33 | Failed: 5 | Total Cost: $1.90

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

The SHORT_SHA extraction was using github.sha (merge commit) instead of
the GITHUB_SHA env var (which is set to PR head SHA for pull_request events).
This caused Docker images to be tagged with the wrong SHA, making them
unfindable by run-examples.yml, which uses the PR head SHA.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@neubig neubig removed the integration-test Runs the integration tests and comments the results label Feb 11, 2026
@neubig neubig added the integration-test Runs the integration tests and comments the results label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.91
Models Tested: 4
Timestamp: 2026-02-11 00:48:22 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.03 590,886
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.18 276,821
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.29 210,710
litellm_proxy_claude_sonnet_4_5_20250929 100.0% 8/8 0 8 $0.41 233,488

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 577,382, completion: 13,504, cache_read: 535,168, reasoning: 5,367
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_252274c_deepseek_v3_2_reasoner_run_N8_20260211_004507
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.18
  • Token Usage: prompt: 271,307, completion: 5,514, cache_read: 214,528
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_252274c_kimi_k2_thinking_run_N8_20260211_004500
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.29
  • Token Usage: prompt: 204,992, completion: 5,718, cache_read: 104,221, reasoning: 3,287
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_252274c_gemini_3_pro_run_N8_20260211_004458

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 226,859, completion: 6,629, cache_read: 155,499, cache_write: 70,991, reasoning: 1,808
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_252274c_claude_sonnet_4_5_20250929_run_N8_20260211_004500

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@neubig neubig removed the integration-test Runs the integration tests and comments the results label Feb 11, 2026
@neubig neubig added the integration-test Runs the integration tests and comments the results label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

GITHUB_SHA is a reserved GitHub Actions environment variable that cannot
be overridden at job level. Use SDK_SHA which is already supported by
build.py and takes precedence over GITHUB_SHA.

This ensures Docker images are tagged with the PR head SHA (not the merge
commit SHA) so run-examples.yml can find the correct images.

Co-authored-by: openhands <openhands@all-hands.dev>
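
A minimal sketch of the precedence described above, assuming build.py resolves the SHA roughly like this (names beyond SDK_SHA/GITHUB_SHA are illustrative):

import os

def resolve_sha() -> str:
    # SDK_SHA wins; GITHUB_SHA is reserved and points at the merge commit
    # for pull_request events, not the PR head.
    return os.environ.get("SDK_SHA") or os.environ.get("GITHUB_SHA", "unknown")

short_sha = resolve_sha()[:7]  # tags like 40e712f-python use the short form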
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 01:17:01 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.4s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.3s $0.01
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.3s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 35.1s $0.01
01_standalone_sdk/09_pause_example.py ✅ PASS 14.9s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.0s $0.01
01_standalone_sdk/11_async.py ✅ PASS 30.0s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 19.5s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 18.0s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 34s $0.57
01_standalone_sdk/17_image_input.py ✅ PASS 16.9s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 25.5s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 12.0s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.6s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.1s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 15.3s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 22s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 46s $0.30
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.21
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.4s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 31.7s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 37.9s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 17.4s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 49s $0.26
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.6s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.3s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 8.8s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.4s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 57.7s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 55.3s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.7s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 9s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 26.6s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 3.5s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.5s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 50.1s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 11.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 6.8s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.83

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 01:17:41 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 26.4s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.6s $0.01
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 10.9s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 48.7s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 14.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 27.5s $0.01
01_standalone_sdk/11_async.py ✅ PASS 32.9s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 18.2s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 24.0s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 7m 0s $0.86
01_standalone_sdk/17_image_input.py ✅ PASS 17.6s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.7s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.6s $0.01
01_standalone_sdk/20_stuck_detector.py ✅ PASS 19.7s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.3s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 14.6s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 30s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 26s $0.27
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.14
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.3s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 38.9s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 41.4s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 16.3s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 47s $0.32
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 18.3s $0.01
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.3s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.9s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 59.1s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 4.8s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 10s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 32.3s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.7s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 23.9s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 43.7s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.8s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 6.8s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $2.07

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.92
Models Tested: 4
Timestamp: 2026-02-11 01:05:18 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.05 855,411
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.15 226,454
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.32 231,543
litellm_proxy_claude_sonnet_4_5_20250929 100.0% 8/8 0 8 $0.41 239,056

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.05
  • Token Usage: prompt: 833,861, completion: 21,550, cache_read: 782,592, reasoning: 9,288
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_dc0581a_deepseek_v3_2_reasoner_run_N8_20260211_005116
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.15
  • Token Usage: prompt: 220,675, completion: 5,779, cache_read: 174,848
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_dc0581a_kimi_k2_thinking_run_N8_20260211_005117
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.32
  • Token Usage: prompt: 226,438, completion: 5,105, cache_read: 109,169, reasoning: 3,289
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_dc0581a_gemini_3_pro_run_N8_20260211_005117

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.41
  • Token Usage: prompt: 232,404, completion: 6,652, cache_read: 161,444, cache_write: 70,580, reasoning: 1,984
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_dc0581a_claude_sonnet_4_5_20250929_run_N8_20260211_005117

The libtmux race condition fix (libtmux#625) requires Python 3.13
environment throughout the build process, not just in the base image.
Previously, the builder stage was using Python 3.12, which meant
dependencies were compiled for Python 3.12 even though the runtime
image used Python 3.13.

Changes:
- Update builder FROM to python:3.13-bookworm
- Update uv python install/venv from 3.12 to 3.13
- Update ARG BASE_IMAGE default to python3.13-nodejs22
- Update comment to reflect libtmux fix status

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 01:50:17 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 23.8s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 18.9s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.4s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.4s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 13.0s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.7s $0.02
01_standalone_sdk/11_async.py ✅ PASS 36.2s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 18.6s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.9s $0.02
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/17_image_input.py ✅ PASS 16.6s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.3s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.5s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 21.0s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 20.6s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 3s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 4m 13s $0.31
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 50s $0.16
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.4s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 46.1s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 33.4s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 8.9s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 5m 2s $0.38
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 22.0s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.8s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.6s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.8s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 51.2s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 0s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.2s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 40s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.3s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 3m 3s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.1s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 1s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.8s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.2s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.47

Failed examples:

  • examples/01_standalone_sdk/14_context_condenser.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

When libtmux's new_session() returns a Session with session_id=None
(due to a bug in parse_output() not handling fewer output values),
we now retry fetching the session from server.sessions.

This is a workaround for the issue where neubig's fix PR #625 doesn't
fully handle cases where tmux returns fewer than the expected 125
format field values, causing the zip() to truncate before reaching
session_id at index 92.

The workaround:
1. Check if session_id is None after new_session()
2. If so, retry up to 3 times with increasing delays
3. Fetch the session by name from server.sessions
4. Raise a clear error if all retries fail

See: tmux-python/libtmux#624

Co-authored-by: openhands <openhands@all-hands.dev>
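
A rough sketch of this workaround (the function name, retry count, and delays are illustrative, not the exact code in this PR):

import time
import libtmux

def new_session_with_retry(server: libtmux.Server, name: str):
    session = server.new_session(session_name=name)
    for attempt in range(3):
        if session.session_id is not None:
            return session
        time.sleep(0.1 * (attempt + 1))  # increasing delay before re-checking
        # The session exists in tmux even when parsing failed, so re-fetch
        # it by name from the server's session list.
        matches = [s for s in server.sessions if s.session_name == name]
        if matches:
            session = matches[0]
    if session.session_id is None:
        raise RuntimeError(f"Failed to get session_id for session {name!r}")
    return session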
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 02:13:32 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 29.9s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 19.5s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.4s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 42.8s $0.04
01_standalone_sdk/09_pause_example.py ✅ PASS 17.1s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 26.7s $0.02
01_standalone_sdk/11_async.py ✅ PASS 38.8s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 19.0s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 23.1s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 58s $0.38
01_standalone_sdk/17_image_input.py ✅ PASS 14.5s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 15.9s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.0s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 18.4s $0.03
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 11s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 33s $0.26
01_standalone_sdk/25_agent_delegation.py ✅ PASS 2m 23s $0.19
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 20.2s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 33.3s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 41.2s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 21.9s $0.02
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 54s $0.28
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 24.9s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 3.9s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 10.0s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.0s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 57.8s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 5s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 17.1s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 11s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 39.3s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 59s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 23.6s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 3s $0.10
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 13.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 12.6s $0.01

❌ Some tests failed

Total: 38 | Passed: 33 | Failed: 5 | Total Cost: $1.75

Failed examples:

  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

Improved the workaround for the libtmux race condition where
new_session() returns a Session with session_id=None.

Changes:
- Wrap entire session creation in retry loop (up to 5 attempts)
- Try to get session from server.sessions if session_id is None
- Add increasing delays between retries
- Clean up orphan sessions before retry
- Better error messages with last error included

The root issue is that neubig's PR #625 fix tries to parse 125+
format fields from tmux output, but tmux may not output all fields
in some environments (especially Python 3.13 + PyInstaller + Docker),
causing session_id (at index 92) to be missing from the parsed output.

Co-authored-by: openhands <openhands@all-hands.dev>
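
Sketch of the revised workaround (attempt counts, delays, and cleanup details are assumptions based on the description above, not the PR's exact code):

import time
import libtmux

def create_session(server: libtmux.Server, name: str, attempts: int = 5):
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            session = server.new_session(session_name=name)
            if session.session_id is not None:
                return session
            # Parsing failed but the session exists in tmux: kill the
            # orphan before retrying the whole creation.
            server.cmd("kill-session", "-t", name)
        except Exception as exc:
            last_error = exc
        time.sleep(0.2 * (attempt + 1))  # increasing delay between attempts
    raise RuntimeError(f"Could not create session {name!r}; last error: {last_error}")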
@neubig neubig removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026
@neubig neubig added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 11, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 11, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-11 02:38:19 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 26.0s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 17.3s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.5s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.0s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 15.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 29.4s $0.02
01_standalone_sdk/11_async.py ✅ PASS 35.9s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 20.5s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.8s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 48s $0.56
01_standalone_sdk/17_image_input.py ✅ PASS 15.3s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.4s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.4s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 12.8s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.6s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 19.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 31s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 12s $0.39
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 58s $0.17
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 22.8s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 34.8s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 40.9s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 17.4s $0.02
01_standalone_sdk/31_iterative_refinement.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 16.7s $0.02
01_standalone_sdk/34_critic_example.py ❌ FAIL (Exit code 1) 4.1s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.9s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 3.9s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 46.6s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 2s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 17.8s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 22s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.0s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 48s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 21.1s $0.01
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 40.6s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 11.1s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 7.1s $0.01

❌ Some tests failed

Total: 38 | Passed: 32 | Failed: 6 | Total Cost: $1.69

Failed examples:

  • examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/34_critic_example.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@neubig
Contributor Author

neubig commented Feb 11, 2026

@OpenHands merge main. Then carefully read all the surrounding context and debug why 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py is failing despite our best efforts to fix it. Gather information from the GitHub Actions logs, reproduce locally, and then try to push a fix. You can re-trigger test-examples and see what happens to debug, but do that sparingly, and only after you've debugged as much as you can locally.

@openhands-ai

openhands-ai bot commented Feb 11, 2026

I'm on it! neubig can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
The libtmux fix (neubig's PR #625) uses a format string with 125+ fields,
but tmux may not output all fields correctly in some environments
(Python 3.13 + PyInstaller + Docker), causing session_id (at index 92)
to be missing from the parsed output.

This adds a fallback that directly queries tmux using a simple format
string ('#{session_id}:#{session_name}') to get the session_id when
libtmux's complex format parsing fails.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Feb 12, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Examples Scripts

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1978 at branch `upgrade-python313-with-libtmux-fix`

Feel free to include any additional details that might help me get this PR into a better state.


@openhands-ai

openhands-ai bot commented Feb 12, 2026

Summary

I successfully debugged and fixed the 02_convo_with_docker_sandboxed_server.py test failure.

Root Cause Analysis

The test was failing with:

RuntimeError: Failed to get session_id for session 'openhands-None-...' after 5 attempts.
See https://github.com/tmux-python/libtmux/issues/624

The issue was that the libtmux fix (neubig's PR #625) uses a format string with 125+ fields to parse tmux output. The session_id field is at index 92. In certain environments (Python 3.13 + PyInstaller + Docker), tmux may not output all fields correctly, causing session_id to be missing from the parsed output.

The existing retry logic tried to get the session from server.sessions.filter(), but this uses the same broken parsing logic, so it also failed.

Fix Implemented

Added a new fallback function _get_session_id_directly() that:

  1. Runs a simple tmux list-sessions -F '#{session_id}:#{session_name}' command
  2. Parses the output to find the session_id for the given session name
  3. Bypasses libtmux's complex 125-field format parsing entirely

This fallback is tried before the existing server.sessions.filter() fallback, providing a more reliable way to get the session_id.
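
A sketch of what such a fallback can look like, built around the tmux command quoted above (the function body is illustrative, not the exact code in the PR):

import subprocess

def get_session_id_directly(session_name: str) -> str | None:
    # Two-field format string instead of libtmux's 125-field one.
    out = subprocess.run(
        ["tmux", "list-sessions", "-F", "#{session_id}:#{session_name}"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        session_id, _, name = line.partition(":")
        if name == session_name:
            return session_id  # e.g. "$3"
    return None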

Changes Made

  1. Merged main into the branch to get latest changes
  2. Added _get_session_id_directly() function in openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py
  3. Updated the retry logic to use the new fallback as the first option

Verification

  • ✅ Local testing passed (created 3 tmux terminals successfully)
  • ✅ CI test-examples workflow passed for the Docker-related tests:
    • 02_convo_with_docker_sandboxed_server.py - PASS (was failing)
    • 03_browser_use_with_docker_sandboxed_server.py - PASS (was failing)
    • 04_convo_with_api_sandboxed_server.py - PASS (was failing)

The remaining 3 failures are unrelated to the libtmux issue:

  • 34_critic_example.py - Missing CRITIC_SERVER_URL env var
  • 38_browser_session_recording.py - Timeout (new test)
  • 08_convo_with_apptainer_sandboxed_server.py - Uses latest-python image instead of PR-specific image

Commits Pushed

  1. 7dd2427e - Merge main into upgrade-python313-with-libtmux-fix
  2. 40e712f6 - fix: add direct tmux fallback for session_id resolution

View full conversation

@neubig neubig added the review-this This label triggers a PR review by OpenHands label Feb 12, 2026 — with OpenHands AI
Collaborator

@all-hands-bot all-hands-bot left a comment


🔴 Critical Design Issues - Do Not Merge Yet

Taste Rating: 🔴 Needs fundamental rework

This PR attempts to fix a real problem (libtmux race condition in Python 3.13) but shows signs that the upstream fix is incomplete. The 80+ lines of workaround/retry logic suggest we're building a band-aid on top of a band-aid.

Core Issue: If the upstream libtmux fix works as intended, why do we need extensive retry logic, fallback mechanisms, and manual session_id extraction? The answer is in your own comments: "tmux may not output all fields" - the upstream fix is incomplete.

Key Insight: Pinning to an unmerged git branch with extensive workarounds is the wrong approach. Either (1) wait for a proper upstream release, or (2) fix the root cause in libtmux properly, or (3) implement a simpler workaround that doesn't require 80 lines of retry spaghetti.

See detailed inline comments below.

Contributor Author

neubig commented Feb 15, 2026

Thanks for the thorough review! I agree with the assessment:

  1. Keeping as DRAFT - This PR will remain as a draft until the upstream libtmux fix is merged and released to PyPI. The git branch dependency is indeed fragile.

  2. Over-engineering concern - The multiple fallback mechanisms were added defensively because the upstream fix was unstable. Once the upstream is properly released, we can simplify to a single approach.

  3. Splitting changes - Will consider splitting into separate PRs:

    • Type hint modernization (Python 3.9+ style)
    • CI variable rename (SDK_SHA/SDK_REF)
    • Python 3.13 + libtmux upgrade (blocked on upstream release)
  4. Debian upgrade - The bullseye → bookworm change was intentional to match Python 3.13's default image. Will verify compatibility before merging.

Leaving the review threads unresolved as reminders for when we revisit this PR after the upstream is ready.

@neubig neubig requested a review from all-hands-bot February 15, 2026 23:23
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable for Draft/Testing

Taste Rating: 🟡 Pragmatic for its purpose

This draft PR achieves its stated goal: testing the Python 3.13 upgrade with the libtmux race condition fix in CI. Previous architectural concerns (git branch pin, retry complexity) were discussed and marked resolved - the maintainer understands this is temporary until upstream releases to PyPI.


Key Observations

1. SDK_SHA infrastructure already exists

  • build.py was updated in Nov 2025 (commit 204d3a4) to read SDK_SHA with fallback to GITHUB_SHA
  • The workflow changes in this PR correctly use this existing mechanism
  • No changes to build.py needed

2. Variable naming is intentional

  • SDK_SHA (in server.yml) = build context for tagging Docker images
  • AGENT_SERVER_SHA (in run-examples.yml) = runtime context for examples to pull images
  • Different names reflect their different purposes in separate workflows

3. Comprehensive comments

  • The added comments explaining SHA handling make the confusing GitHub Actions variable behavior clearer
  • Debug echo statements (server.yml:242-244) help verify SHA handling during testing
  • Good engineering for maintainability

4. Type hints are clean

  • Generator[T] syntax is valid as of Python 3.13, where the send and return type parameters default to None
  • Changes are consistent across codebase

For Production Merge (when upstream is ready)

Before merging to production:

  • ✅ Wait for upstream libtmux PR to be merged and released to PyPI (already noted in PR description)
  • 🟢 Consider removing debug echo statements or gate behind a debug flag
  • 🟡 Evaluate whether the extensive retry logic indicates the upstream fix needs more work

Verdict

Approved for draft/testing purposes

The code is pragmatic for validating the libtmux fix in your CI environment. The previous review comprehensively covered the blockers for production merge, and those concerns remain valid - but this is explicitly a draft PR for testing, not for immediate production use.

The approach is sound for its stated purpose. Good luck with the testing! 🚀

Contributor Author

neubig commented Feb 16, 2026

Closing in favor of #2092, which has a much simpler fix.

The root cause was discovered to be a locale issue, not a race condition in libtmux. When LC_ALL/LANG are not set in PyInstaller builds, tmux converts UTF-8 separator characters to underscores, breaking libtmux's format parsing.

The fix is simply adding ENV LC_ALL=C.UTF-8 and ENV LANG=C.UTF-8 to the Dockerfile - no code changes or custom libtmux forks needed.
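
For illustration, the same idea applied when launching tmux from Python (the Dockerfile ENV lines are the actual fix; this sketch is an assumption about the process-level equivalent):

import os
import subprocess

# Ensure a UTF-8 locale so tmux does not rewrite UTF-8 separator
# characters as underscores, which breaks libtmux's format parsing.
env = {**os.environ, "LC_ALL": "C.UTF-8", "LANG": "C.UTF-8"}
subprocess.run(["tmux", "new-session", "-d", "-s", "demo"], env=env, check=True)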

@neubig neubig closed this Feb 16, 2026