feat: Add multi-model PR review with debate mechanism #2009
Conversation
This adds an experimental workflow for PR review using multiple AI models (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash) that debate to produce a consolidated review.

Key features:
- Parallel initial reviews from multiple models
- Inter-agent communication tools for debate
- Turn-based synchronization mechanism
- Configurable debate rounds
- Final consolidated review synthesis

New files:
- github_utils.py: Refactored GitHub API utilities
- models.py: Data models for reviews and debate
- prompt.py: Prompt templates for reviews and debate
- debate_tools.py: Inter-agent communication tools
- review_runner.py: Multi-model review execution
- debate_orchestrator.py: Debate coordination
- main.py: Entry point
- README.md: Documentation

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot left a comment
🟡 Acceptable - Interesting experiment, but has fundamental design issues.
VERDICT: ❌ Needs rework - Fix the fake MessageQueue, add tests, and reconsider the threading model.
KEY INSIGHT: You've built a synchronous RPC system but called it "message queue" and "debate" - the complexity doesn't match what it actually does.
[CRITICAL ISSUES]
1. MessageQueue is Not a Queue (debate_tools.py:295-355)
Problem: MessageQueue is a misleading name. Looking at the code:
- The _response_queues dict is created but never used (line 327)
- send_and_wait() just calls _response_handler with a lock (lines 335-350)
- No actual queuing, no async communication, no message buffering
This is a callback wrapper pretending to be a queue. The entire class could be replaced with:
class ResponseCoordinator:
    def __init__(self, handler):
        self._handler = handler
        self._lock = threading.Lock()

    def call_and_wait(self, sender, recipient, message):
        with self._lock:
            return self._handler(sender, recipient, message)

Impact: Misleading abstraction that will confuse maintainers. Remove the unused _response_queues or implement a real message queue.
2. No Tests for Complex Orchestration (main.py:1-214)
Problem: This PR adds:
- Complex threading and synchronization
- Inter-agent communication
- Multi-model orchestration
- Message routing and debate state
Zero tests. The repo has test infrastructure - use it.
Required tests:
- Unit tests for MessageQueue behavior
- Integration tests for debate orchestration
- Error handling tests (agent timeout, model failure, network errors)
- State management tests (message ordering, round tracking)
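As a starting point, unit tests for the coordinator behavior could look like this (pytest-style; the ResponseCoordinator shape is the suggested replacement above, repeated so the tests are self-contained):

```python
import threading

class ResponseCoordinator:
    """Suggested replacement, repeated here so the tests stand alone."""

    def __init__(self, handler):
        self._handler = handler
        self._lock = threading.Lock()

    def call_and_wait(self, sender, recipient, message):
        with self._lock:
            return self._handler(sender, recipient, message)

def test_returns_handler_result():
    coord = ResponseCoordinator(lambda s, r, m: f"{r}<-{s}: {m}")
    assert coord.call_and_wait("gpt", "claude", "hi") == "claude<-gpt: hi"

def test_calls_are_serialized():
    # The lock must prevent the handler from being re-entered concurrently.
    depth = {"n": 0, "max": 0}

    def handler(s, r, m):
        depth["n"] += 1
        depth["max"] = max(depth["max"], depth["n"])
        depth["n"] -= 1
        return m

    coord = ResponseCoordinator(handler)
    threads = [
        threading.Thread(target=coord.call_and_wait, args=("a", "b", str(i)))
        for i in range(16)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert depth["max"] == 1  # never more than one caller inside the handler
```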
3. Agent Response Can Hang Forever (debate_orchestrator.py:155-180)
Problem: _get_agent_response() calls conversation.run() with no timeout. If the agent fails to respond or gets stuck, this blocks indefinitely.
Fix: Add timeout to conversation.run() and handle timeout explicitly:
try:
    conversation.run(timeout=60)  # or configurable
except TimeoutError:
    return "Agent did not respond in time."

[IMPROVEMENT OPPORTUNITIES]
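If conversation.run() turns out not to accept a timeout argument (its signature isn't shown in this PR), the wait can be bounded externally instead — a sketch using a worker thread, where run_fn stands in for the blocking call:

```python
import concurrent.futures

def run_with_timeout(run_fn, timeout_s: float = 60.0) -> str:
    """Bound the wait on a blocking agent call.

    Note this bounds the *wait*, not the work: a hung worker thread is
    orphaned, not killed.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "Agent did not respond in time."
    finally:
        # wait=False so a stuck agent can't also hang the caller here.
        pool.shutdown(wait=False)
```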
4. Threading Pattern is Wrong (debate_orchestrator.py:300-320)
Problem: You spawn threads to run agents in parallel, but each agent immediately blocks in send_and_wait() when trying to communicate. This defeats the purpose of threading.
You're getting the complexity of threading without the benefits of parallelism.
Better approaches:
- Sequential debate (simplest, clearest)
- True async with asyncio (if you need parallelism)
- Actor model (if you need real concurrency)
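The asyncio option can be sketched briefly — the agents here are hypothetical async callables, not the SDK's actual API:

```python
import asyncio

async def debate_round(agents: dict, transcript: list[str]) -> None:
    """One round: every agent reads the shared transcript, replies concurrently."""
    prompt = "\n".join(transcript)
    # All agents "think" in parallel; no threads, no blocking RPC chains.
    replies = await asyncio.gather(*(agent(prompt) for agent in agents.values()))
    for name, reply in zip(agents, replies):
        transcript.append(f"{name}: {reply}")

async def run_debate(agents: dict, rounds: int = 3) -> list[str]:
    transcript: list[str] = []
    for _ in range(rounds):
        await debate_round(agents, transcript)
    return transcript
```

Each round is a genuine turn for the whole group: agents respond in parallel within a round, and the transcript advances between rounds — concurrency where it helps, sequencing where the protocol needs it.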
5. Blocking != Turn-Based (README.md:129-136)
Problem: The README claims "turn-based synchronization" but the implementation is just blocking RPC calls.
Turn-based means agents take turns speaking to the group. Your implementation makes Agent A block until Agent B responds, which could then block on Agent C.
The behavior works, but the description is misleading.
6. Hardcoded Model Names (models.py:14-18)
Problem: Model names like openai/gpt-5.2 will break when providers deprecate them.
Suggestions:
- Make models configurable via env vars
- Add fallback mechanism
- Document model lifecycle
For an experimental workflow, maybe acceptable, but will cause maintenance pain.
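A minimal sketch of the env-var approach — the REVIEW_MODELS variable name and the fallback helper are suggestions, not part of this PR:

```python
import os

# Defaults mirroring the PR's hardcoded names (models.py:14-18).
DEFAULT_MODELS = "openai/gpt-5.2,anthropic/claude-sonnet-4.5,gemini/gemini-3-flash"

def get_review_models() -> list[str]:
    """Read the model list from the environment, falling back to defaults."""
    raw = os.environ.get("REVIEW_MODELS", DEFAULT_MODELS)
    return [m.strip() for m in raw.split(",") if m.strip()]

def pick_model(candidates: list[str], available: set[str]) -> str:
    """Return the first candidate a provider still serves (simple fallback)."""
    for model in candidates:
        if model in available:
            return model
    raise RuntimeError(f"No available model among: {candidates}")
```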
7. ThreadPoolExecutor Could Be Simpler (review_runner.py:154-190)
Problem: You're wrapping SDK Conversation/Agent (which has async internals) in threads. Why not use asyncio directly?
Not critical for an experiment, but consider for better resource usage.
[GOOD PARTS]
✅ github_utils.py - Solid error handling, pragmatic truncation, reasonable GraphQL pagination
✅ Documentation - Thorough README with clear examples
✅ Error handling - Generally decent throughout
✅ Configuration - Environment variables are appropriate
✅ Module structure - Reasonable separation of concerns
Cost/Benefit Question
Pragmatic concern: Does multi-model debate solve a real problem?
- Cost: 3-5x more expensive than single review (3 models + debate rounds)
- Benefit: Maybe slightly better review quality?
- Complexity: Significant (threading, communication, orchestration)
This feels like solving an imaginary problem. If the goal is better reviews, consider:
- Better prompts for a single model
- Simple consolidation without "debate"
- Human review of AI output
Is the debate actually improving reviews enough to justify 5x cost + complexity?
Summary
This PR adds an experimental workflow for PR review using multiple AI models (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash) that debate to produce a consolidated, well-reasoned final review.
Key Features
Phase 1: Parallel Reviews
Phase 2: Debate
Architecture
New Files
- github_utils.py
- models.py
- prompt.py
- debate_tools.py
- review_runner.py
- debate_orchestrator.py
- main.py
- README.md

Debate Tools
Each reviewer agent has access to:
Configuration
Environment variables:
- MAX_DEBATE_ROUNDS: Maximum debate rounds (default: 3)
- REVIEW_STYLE: 'standard' or 'roasted'

This is an experimental workflow for exploring multi-model collaboration patterns. Use the standard 02_pr_review workflow for production use cases.
Testing
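The environment variables documented above could be read with a small helper — a sketch; the 'standard' default for REVIEW_STYLE is an assumption, only the MAX_DEBATE_ROUNDS default of 3 is documented:

```python
import os

def load_debate_config() -> dict:
    """Read the workflow's configuration from the environment."""
    rounds = int(os.environ.get("MAX_DEBATE_ROUNDS", "3"))
    style = os.environ.get("REVIEW_STYLE", "standard")  # default assumed
    if style not in ("standard", "roasted"):
        raise ValueError(f"REVIEW_STYLE must be 'standard' or 'roasted', got {style!r}")
    return {"max_debate_rounds": rounds, "review_style": style}
```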
Co-authored-by: openhands <openhands@all-hands.dev>
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
    # Each variant is a multi-arch manifest supporting both amd64 and arm64
    docker pull ghcr.io/openhands/agent-server:8fc785c-python

Run
All tags pushed for this build
About Multi-Architecture Support
- The 8fc785c-python tag is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (8fc785c-python-amd64) are also available if needed