feat: Add multi-model PR review with debate mechanism #2009
Conversation
This adds an experimental workflow for PR review using multiple AI models (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash) that debate to produce a consolidated review.

Key features:
- Parallel initial reviews from multiple models
- Inter-agent communication tools for debate
- Turn-based synchronization mechanism
- Configurable debate rounds
- Final consolidated review synthesis

New files:
- github_utils.py: Refactored GitHub API utilities
- models.py: Data models for reviews and debate
- prompt.py: Prompt templates for reviews and debate
- debate_tools.py: Inter-agent communication tools
- review_runner.py: Multi-model review execution
- debate_orchestrator.py: Debate coordination
- main.py: Entry point
- README.md: Documentation

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot left a comment
🟡 Acceptable - Interesting experiment, but has fundamental design issues.
VERDICT: ❌ Needs rework - Fix the fake MessageQueue, add tests, and reconsider the threading model.
KEY INSIGHT: You've built a synchronous RPC system but called it "message queue" and "debate" - the complexity doesn't match what it actually does.
[CRITICAL ISSUES]
1. MessageQueue is Not a Queue (debate_tools.py:295-355)
Problem: MessageQueue is a misleading name. Looking at the code:
- The _response_queues dict is created but never used (line 327)
- send_and_wait() just calls _response_handler with a lock (lines 335-350)
- No actual queuing, no async communication, no message buffering
This is a callback wrapper pretending to be a queue. The entire class could be replaced with:
class ResponseCoordinator:
    def __init__(self, handler):
        self._handler = handler
        self._lock = threading.Lock()

    def call_and_wait(self, sender, recipient, message):
        with self._lock:
            return self._handler(sender, recipient, message)

Impact: Misleading abstraction that will confuse maintainers. Remove the unused _response_queues or implement a real message queue.
2. No Tests for Complex Orchestration (main.py:1-214)
Problem: This PR adds:
- Complex threading and synchronization
- Inter-agent communication
- Multi-model orchestration
- Message routing and debate state
Zero tests. The repo has test infrastructure - use it.
Required tests:
- Unit tests for MessageQueue behavior
- Integration tests for debate orchestration
- Error handling tests (agent timeout, model failure, network errors)
- State management tests (message ordering, round tracking)
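As a starting point, unit tests for the coordinator behavior could look like this (pytest-style; the ResponseCoordinator shape is the suggested replacement above, repeated so the tests are self-contained):

```python
import threading

class ResponseCoordinator:
    """Suggested replacement, repeated here so the tests stand alone."""

    def __init__(self, handler):
        self._handler = handler
        self._lock = threading.Lock()

    def call_and_wait(self, sender, recipient, message):
        with self._lock:
            return self._handler(sender, recipient, message)

def test_returns_handler_result():
    coord = ResponseCoordinator(lambda s, r, m: f"{r}<-{s}: {m}")
    assert coord.call_and_wait("gpt", "claude", "hi") == "claude<-gpt: hi"

def test_calls_are_serialized():
    # The lock must prevent the handler from being re-entered concurrently.
    depth = {"n": 0, "max": 0}

    def handler(s, r, m):
        depth["n"] += 1
        depth["max"] = max(depth["max"], depth["n"])
        depth["n"] -= 1
        return m

    coord = ResponseCoordinator(handler)
    threads = [
        threading.Thread(target=coord.call_and_wait, args=("a", "b", str(i)))
        for i in range(16)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert depth["max"] == 1  # never more than one caller inside the handler
```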
3. Agent Response Can Hang Forever (debate_orchestrator.py:155-180)
Problem: _get_agent_response() calls conversation.run() with no timeout. If the agent fails to respond or gets stuck, this blocks indefinitely.
Fix: Add timeout to conversation.run() and handle timeout explicitly:
try:
    conversation.run(timeout=60)  # or configurable
except TimeoutError:
    return "Agent did not respond in time."

[IMPROVEMENT OPPORTUNITIES]
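If conversation.run() turns out not to accept a timeout argument (its signature isn't shown in this PR), the wait can be bounded externally instead — a sketch using a worker thread, where run_fn stands in for the blocking call:

```python
import concurrent.futures

def run_with_timeout(run_fn, timeout_s: float = 60.0) -> str:
    """Bound the wait on a blocking agent call.

    Note this bounds the *wait*, not the work: a hung worker thread is
    orphaned, not killed.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "Agent did not respond in time."
    finally:
        # wait=False so a stuck agent can't also hang the caller here.
        pool.shutdown(wait=False)
```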
4. Threading Pattern is Wrong (debate_orchestrator.py:300-320)
Problem: You spawn threads to run agents in parallel, but each agent immediately blocks in send_and_wait() when trying to communicate. This defeats the purpose of threading.
You're getting the complexity of threading without the benefits of parallelism.
Better approaches:
- Sequential debate (simplest, clearest)
- True async with asyncio (if you need parallelism)
- Actor model (if you need real concurrency)
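The asyncio option can be sketched briefly — the agents here are hypothetical async callables, not the SDK's actual API:

```python
import asyncio

async def debate_round(agents: dict, transcript: list[str]) -> None:
    """One round: every agent reads the shared transcript, replies concurrently."""
    prompt = "\n".join(transcript)
    # All agents "think" in parallel; no threads, no blocking RPC chains.
    replies = await asyncio.gather(*(agent(prompt) for agent in agents.values()))
    for name, reply in zip(agents, replies):
        transcript.append(f"{name}: {reply}")

async def run_debate(agents: dict, rounds: int = 3) -> list[str]:
    transcript: list[str] = []
    for _ in range(rounds):
        await debate_round(agents, transcript)
    return transcript
```

Each round is a genuine turn for the whole group: agents respond in parallel within a round, and the transcript advances between rounds — concurrency where it helps, sequencing where the protocol needs it.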
5. Blocking != Turn-Based (README.md:129-136)
Problem: The README claims "turn-based synchronization" but the implementation is just blocking RPC calls.
Turn-based means agents take turns speaking to the group. Your implementation makes Agent A block until Agent B responds, which could then block on Agent C.
The behavior works, but the description is misleading.
6. Hardcoded Model Names (models.py:14-18)
Problem: Model names like openai/gpt-5.2 will break when providers deprecate them.
Suggestions:
- Make models configurable via env vars
- Add fallback mechanism
- Document model lifecycle
For an experimental workflow, maybe acceptable, but will cause maintenance pain.
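A minimal sketch of the env-var approach — the REVIEW_MODELS variable name and the fallback helper are suggestions, not part of this PR:

```python
import os

# Defaults mirroring the PR's hardcoded names (models.py:14-18).
DEFAULT_MODELS = "openai/gpt-5.2,anthropic/claude-sonnet-4.5,gemini/gemini-3-flash"

def get_review_models() -> list[str]:
    """Read the model list from the environment, falling back to defaults."""
    raw = os.environ.get("REVIEW_MODELS", DEFAULT_MODELS)
    return [m.strip() for m in raw.split(",") if m.strip()]

def pick_model(candidates: list[str], available: set[str]) -> str:
    """Return the first candidate a provider still serves (simple fallback)."""
    for model in candidates:
        if model in available:
            return model
    raise RuntimeError(f"No available model among: {candidates}")
```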
7. ThreadPoolExecutor Could Be Simpler (review_runner.py:154-190)
Problem: You're wrapping SDK Conversation/Agent (which has async internals) in threads. Why not use asyncio directly?
Not critical for an experiment, but consider for better resource usage.
[GOOD PARTS]
✅ github_utils.py - Solid error handling, pragmatic truncation, reasonable GraphQL pagination
✅ Documentation - Thorough README with clear examples
✅ Error handling - Generally decent throughout
✅ Configuration - Environment variables are appropriate
✅ Module structure - Reasonable separation of concerns
Cost/Benefit Question
Pragmatic concern: Does multi-model debate solve a real problem?
- Cost: 3-5x more expensive than single review (3 models + debate rounds)
- Benefit: Maybe slightly better review quality?
- Complexity: Significant (threading, communication, orchestration)
This feels like solving an imaginary problem. If the goal is better reviews, consider:
- Better prompts for a single model
- Simple consolidation without "debate"
- Human review of AI output
Is the debate actually improving reviews enough to justify 5x cost + complexity?
Summary
This PR adds an experimental workflow for PR review using multiple AI models (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash) that debate to produce a consolidated, well-reasoned final review.
Key Features
Phase 1: Parallel Reviews
Phase 2: Debate
Architecture
New Files
- github_utils.py
- models.py
- prompt.py
- debate_tools.py
- review_runner.py
- debate_orchestrator.py
- main.py
- README.md

Debate Tools
Each reviewer agent has access to:
Configuration
Environment variables:
- MAX_DEBATE_ROUNDS: Maximum debate rounds (default: 3)
- REVIEW_STYLE: 'standard' or 'roasted'

This is an experimental workflow for exploring multi-model collaboration patterns. Use the standard 02_pr_review workflow for production use cases.
Testing
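The environment variables documented above could be read with a small helper — a sketch; the 'standard' default for REVIEW_STYLE is an assumption, only the MAX_DEBATE_ROUNDS default of 3 is documented:

```python
import os

def load_debate_config() -> dict:
    """Read the workflow's configuration from the environment."""
    rounds = int(os.environ.get("MAX_DEBATE_ROUNDS", "3"))
    style = os.environ.get("REVIEW_STYLE", "standard")  # default assumed
    if style not in ("standard", "roasted"):
        raise ValueError(f"REVIEW_STYLE must be 'standard' or 'roasted', got {style!r}")
    return {"max_debate_rounds": rounds, "review_style": style}
```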
Co-authored-by: openhands <openhands@all-hands.dev>
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
    # Each variant is a multi-arch manifest supporting both amd64 and arm64
    docker pull ghcr.io/openhands/agent-server:8fc785c-python

Run
All tags pushed for this build
About Multi-Architecture Support
- The 8fc785c-python tag is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (8fc785c-python-amd64) are also available if needed