
feat: Add multi-model PR review with debate mechanism #2009

Draft

xingyaoww wants to merge 1 commit into main from feature/multi-model-debate-pr-review

Conversation

xingyaoww (Collaborator) commented Feb 11, 2026

Summary

This PR adds an experimental workflow for PR review using multiple AI models (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash) that debate to produce a consolidated, well-reasoned final review.

Key Features

Phase 1: Parallel Reviews

  • Three AI models independently review the PR
  • Each model provides its own code review with findings, suggestions, and concerns

Phase 2: Debate

  • Reviewers are given each other's reviews
  • Inter-agent communication tools allow models to discuss disagreements
  • Turn-based synchronization prevents race conditions
  • Final consolidated review synthesis (the overall flow is sketched below)
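
A minimal sketch of that two-phase shape. The function names, placeholder bodies, and short model labels are illustrative assumptions, not the PR's actual API:

import os
from concurrent.futures import ThreadPoolExecutor

MODELS = ["gpt", "claude", "gemini"]

def run_review(model: str, diff: str) -> str:
    """Phase 1 placeholder: one model's independent review of the diff."""
    ...

def debate(reviews: dict[str, str], max_rounds: int) -> str:
    """Phase 2 placeholder: share reviews, exchange messages, consolidate."""
    ...

def review_pr(diff: str) -> str:
    # Fan out: each model reviews the PR independently and in parallel.
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        reviews = dict(zip(MODELS, pool.map(lambda m: run_review(m, diff), MODELS)))
    # Then debate for a bounded number of rounds (default 3, see Configuration).
    return debate(reviews, max_rounds=int(os.environ.get("MAX_DEBATE_ROUNDS", "3")))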

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        main.py                              │
│                   (Entry Point)                             │
└─────────────────────────┬───────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
┌─────────────────┐ ┌───────────────┐ ┌─────────────────────┐
│  github_utils   │ │ review_runner │ │ debate_orchestrator │
│  (GitHub API)   │ │ (Multi-Model) │ │   (Coordination)    │
└─────────────────┘ └───────────────┘ └──────────┬──────────┘
                                                 │
                                    ┌────────────┼────────────┐
                                    ▼            ▼            ▼
                              ┌──────────┐ ┌──────────┐ ┌──────────┐
                              │  GPT-5.2 │ │  Claude  │ │  Gemini  │
                              │  Agent   │ │  Agent   │ │  Agent   │
                              └────┬─────┘ └────┬─────┘ └────┬─────┘
                                   │            │            │
                                   └────────────┼────────────┘
                                                │
                                    ┌───────────▼───────────┐
                                    │    debate_tools.py    │
                                    │  - SendToReviewer     │
                                    │  - ConcludeDebate     │
                                    │  - MessageQueue       │
                                    └───────────────────────┘

New Files

File                    Description
github_utils.py         Refactored GitHub API utilities
models.py               Data models for reviews and debate state
prompt.py               Prompt templates for reviews and debate
debate_tools.py         Inter-agent communication tools (SendToReviewer, ConcludeDebate)
review_runner.py        Multi-model parallel review execution
debate_orchestrator.py  Debate coordination and synchronization
main.py                 Entry point for the workflow
README.md               Comprehensive documentation

Debate Tools

Each reviewer agent has access to two tools (argument shapes sketched after the list):

  • SendToReviewer: Send messages to other reviewers (claude, gpt, gemini, or all)
  • ConcludeDebate: Conclude participation with final position, consensus points, and remaining disagreements
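
A sketch of what the two tools' argument schemas might look like; the field names beyond those described above, and the use of plain dataclasses rather than the SDK's own schema types, are assumptions:

from dataclasses import dataclass, field
from typing import Literal

@dataclass
class SendToReviewerArgs:
    recipient: Literal["claude", "gpt", "gemini", "all"]  # recipients listed above
    message: str

@dataclass
class ConcludeDebateArgs:
    final_position: str
    consensus_points: list[str] = field(default_factory=list)
    remaining_disagreements: list[str] = field(default_factory=list)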

Configuration

Environment variables (a loading sketch follows the list):

  • MAX_DEBATE_ROUNDS: Maximum debate rounds (default: 3)
  • REVIEW_STYLE: 'standard' or 'roasted'
  • All standard PR review environment variables
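
How the workflow might load this configuration; the validation step and the "standard" default are assumptions beyond what the list above states:

import os

MAX_DEBATE_ROUNDS = int(os.environ.get("MAX_DEBATE_ROUNDS", "3"))
REVIEW_STYLE = os.environ.get("REVIEW_STYLE", "standard")  # assumed default
if REVIEW_STYLE not in ("standard", "roasted"):
    raise ValueError(f"Unknown REVIEW_STYLE: {REVIEW_STYLE!r}")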

⚠️ Experimental Status

This is an experimental workflow for exploring multi-model collaboration patterns. Use the standard 02_pr_review workflow for production use cases.

Testing

  • All files pass pre-commit hooks (ruff lint/format, pyright, pycodestyle)
  • Manual testing recommended with actual LLM API keys

Co-authored-by: openhands <openhands@all-hands.dev>



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image
java     amd64, arm64   eclipse-temurin:17-jdk
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22
golang   amd64, arm64   golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:8fc785c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-8fc785c-python \
  ghcr.io/openhands/agent-server:8fc785c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:8fc785c-golang-amd64
ghcr.io/openhands/agent-server:8fc785c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:8fc785c-golang-arm64
ghcr.io/openhands/agent-server:8fc785c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:8fc785c-java-amd64
ghcr.io/openhands/agent-server:8fc785c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:8fc785c-java-arm64
ghcr.io/openhands/agent-server:8fc785c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:8fc785c-python-amd64
ghcr.io/openhands/agent-server:8fc785c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:8fc785c-python-arm64
ghcr.io/openhands/agent-server:8fc785c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:8fc785c-golang
ghcr.io/openhands/agent-server:8fc785c-java
ghcr.io/openhands/agent-server:8fc785c-python

About Multi-Architecture Support

  • Each variant tag (e.g., 8fc785c-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 8fc785c-python-amd64) are also available if needed

@xingyaoww added the review-this label Feb 11, 2026 — with OpenHands AI
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Interesting experiment, but has fundamental design issues.

VERDICT: Needs rework - Fix the fake MessageQueue, add tests, and reconsider the threading model.

KEY INSIGHT: You've built a synchronous RPC system but called it "message queue" and "debate" - the complexity doesn't match what it actually does.


[CRITICAL ISSUES]

1. MessageQueue is Not a Queue (debate_tools.py:295-355)

Problem: MessageQueue is a misleading name. Looking at the code:

  • _response_queues dict is created but never used (line 327)
  • send_and_wait() just calls _response_handler with a lock (line 335-350)
  • No actual queuing, no async communication, no message buffering

This is a callback wrapper pretending to be a queue. The entire class could be replaced with:

import threading

class ResponseCoordinator:
    def __init__(self, handler):
        self._handler = handler
        self._lock = threading.Lock()

    def call_and_wait(self, sender, recipient, message):
        # Serialize calls to the handler; this is all the current MessageQueue does.
        with self._lock:
            return self._handler(sender, recipient, message)

Impact: Misleading abstraction that will confuse maintainers. Remove the unused _response_queues or implement a real message queue.
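
Alternatively, if asynchronous messaging is the actual goal, a real queue is barely more code. A minimal sketch, assuming each reviewer only needs a mailbox (class and method names are illustrative):

import queue

class Mailboxes:
    """One inbox per reviewer: non-blocking sends, blocking receives."""

    def __init__(self, reviewers: list[str]):
        self._boxes = {name: queue.Queue() for name in reviewers}

    def send(self, sender: str, recipient: str, message: str) -> None:
        self._boxes[recipient].put((sender, message))  # returns immediately

    def receive(self, recipient: str, timeout: float = 60.0) -> tuple[str, str]:
        # Blocks until a message arrives; raises queue.Empty on timeout.
        return self._boxes[recipient].get(timeout=timeout)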

2. No Tests for Complex Orchestration (main.py:1-214)

Problem: This PR adds:

  • Complex threading and synchronization
  • Inter-agent communication
  • Multi-model orchestration
  • Message routing and debate state

Zero tests. The repo has test infrastructure - use it.

Required tests (the first is sketched after the list):

  1. Unit tests for MessageQueue behavior
  2. Integration tests for debate orchestration
  3. Error handling tests (agent timeout, model failure, network errors)
  4. State management tests (message ordering, round tracking)
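
A pytest-style sketch of the first item; the MessageQueue constructor and send_and_wait signature are inferred from the code references above, so treat them as assumptions:

import threading

from debate_tools import MessageQueue  # assumed import path

def test_send_and_wait_serializes_calls():
    calls = []

    def handler(sender, recipient, message):
        calls.append((sender, recipient, message))
        return f"ack:{message}"

    mq = MessageQueue(handler)  # constructor shape is an assumption
    results = []
    threads = [
        threading.Thread(
            target=lambda i=i: results.append(mq.send_and_wait("gpt", "claude", f"m{i}"))
        )
        for i in range(5)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # No call should be lost or duplicated, and every caller should get its own ack.
    assert sorted(results) == sorted(f"ack:m{i}" for i in range(5))
    assert len(calls) == 5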

3. Agent Response Can Hang Forever (debate_orchestrator.py:155-180)

Problem: _get_agent_response() calls conversation.run() with no timeout. If the agent fails to respond or gets stuck, this blocks indefinitely.

Fix: Add timeout to conversation.run() and handle timeout explicitly:

try:
    conversation.run(timeout=60)  # or configurable
except TimeoutError:
    return "Agent did not respond in time."

[IMPROVEMENT OPPORTUNITIES]

4. Threading Pattern is Wrong (debate_orchestrator.py:300-320)

Problem: You spawn threads to run agents in parallel, but each agent immediately blocks in send_and_wait() when trying to communicate. This defeats the purpose of threading.

You're getting the complexity of threading without the benefits of parallelism.

Better approaches (option 1 is sketched below):

  1. Sequential debate (simplest, clearest)
  2. True async with asyncio (if you need parallelism)
  3. Actor model (if you need real concurrency)

Current approach: complexity of threading + none of the benefits.
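
For comparison, option 1 in full; the agents dict and the respond() method are illustrative placeholders, not the PR's API:

def run_debate(agents: dict, max_rounds: int) -> list[tuple[str, str]]:
    transcript: list[tuple[str, str]] = []
    for _ in range(max_rounds):
        for name, agent in agents.items():
            # One speaker at a time: each agent sees the full transcript so far
            # and speaks once per round. No threads, no locks, same semantics.
            reply = agent.respond(transcript)  # assumed method
            transcript.append((name, reply))
    return transcript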

5. Blocking != Turn-Based (README.md:129-136)

Problem: The README claims "turn-based synchronization" but the implementation is just blocking RPC calls.

Turn-based means agents take turns speaking to the group. Your implementation makes Agent A block until Agent B responds, which could then block on Agent C.

The behavior works, but the description is misleading.

6. Hardcoded Model Names (models.py:14-18)

Problem: Model names like openai/gpt-5.2 will break when providers deprecate them.

Suggestions (a sketch of the first follows):

  1. Make models configurable via env vars
  2. Add fallback mechanism
  3. Document model lifecycle

For an experimental workflow this may be acceptable, but it will cause maintenance pain.
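
Suggestion 1 sketched; the env var names and the Claude/Gemini fallback identifier strings are assumptions (only openai/gpt-5.2 appears in the code under review):

import os

REVIEWER_MODELS = {
    "gpt": os.environ.get("REVIEWER_MODEL_GPT", "openai/gpt-5.2"),
    "claude": os.environ.get("REVIEWER_MODEL_CLAUDE", "anthropic/claude-sonnet-4.5"),
    "gemini": os.environ.get("REVIEWER_MODEL_GEMINI", "gemini/gemini-3-flash"),
}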

7. ThreadPoolExecutor Could Be Simpler (review_runner.py:154-190)

Problem: You're wrapping SDK Conversation/Agent (which has async internals) in threads. Why not use asyncio directly?

Not critical for an experiment, but consider for better resource usage.


[GOOD PARTS]

  • github_utils.py: Solid error handling, pragmatic truncation, reasonable GraphQL pagination
  • Documentation: Thorough README with clear examples
  • Error handling: Generally decent throughout
  • Configuration: Environment variables are appropriate
  • Module structure: Reasonable separation of concerns


Cost/Benefit Question

Pragmatic concern: Does multi-model debate solve a real problem?

  • Cost: 3-5x more expensive than single review (3 models + debate rounds)
  • Benefit: Maybe slightly better review quality?
  • Complexity: Significant (threading, communication, orchestration)

This feels like solving an imaginary problem. If the goal is better reviews, consider:

  1. Better prompts for a single model
  2. Simple consolidation without "debate"
  3. Human review of AI output

Is the debate actually improving reviews enough to justify 3-5x the cost and the added complexity?

@xingyaoww xingyaoww marked this pull request as draft February 11, 2026 16:07
openhands-ai bot commented Feb 11, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Check duplicate example numbers

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2009 at branch `feature/multi-model-debate-pr-review`

Feel free to include any additional details that might help me get this PR into a better state.


@enyst added the behavior-initiative label Feb 14, 2026

Labels

  • behavior-initiative: This is related to the system prompt sections and LLM steering.
  • review-this: This label triggers a PR review by OpenHands.
