feat(sdk): add fallback strategy (Agent) #1998
Conversation
Coverage Report
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable
The core fallback mechanism is pragmatic and solves a real problem. However, the implementation has complexity issues (nested exception handling), mixes different failure modes (profile load vs LLM failure), and relies entirely on mocked tests. The data structure choices work but limit flexibility.
Key Issues:
- 🔴 Exception handling treats profile load failures the same as LLM failures
- 🟠 Method complexity exceeds 3 levels of nesting
- 🟠 All tests are mocks—no real integration testing
- 🟡 Several simplification opportunities
Verdict: ❌ Needs rework - Address critical exception handling and testing gaps
Key Insight: You're catching all exceptions in the fallback loop, which means profile load failures are indistinguishable from LLM failures—this is a data structure problem disguised as exception handling.
| f"{len(fallback_profiles)} fallback(s) failed with exception(s):" | ||
| f" {'\n- '.join(pretty_exc)}" | ||
| ) | ||
I wonder if we can move the logic that loads a profile into an LLM into the FallbackStrategy class?
e.g., we could do:
```python
strategy = FallbackStrategy()
fallback_llm = strategy.get_fallback_llm()
```
and then here we just `make_llm_completion(fallback_llm, ...)`.
Actually, it should be pretty easy to do ;) and it could be interesting in the future if we support in-memory profiles.
Could you please tell me what you mean by in-memory profiles?
Unresolving just for visibility, sorry if I'm dense: do you mean something like, maybe someday we pre-load all profiles in memory and then pick from them?
all-hands-bot
left a comment
🟡 Acceptable - Solid fallback strategy implementation. The core logic is sound and solves a real production problem (LLM service failures). A few rough edges in type hints and documentation that should be addressed.
Key Observations
Data Structures: The `dict[type[Exception], list[str]]` mapping and lazy loading via `cached_property` show good design taste. You only pay for what you use (a rough sketch follows this review).
Testing: Excellent coverage. Mocks are appropriately used here - you're testing orchestration logic, not LLM internals. The parametrized tests for different exception types are particularly good.
Pragmatism: This solves a real production problem (LLM reliability) without over-engineering. The fallback logic is straightforward.
Issues to Address
See inline comments for specific suggestions on type hints and documentation.
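For readers new to the pattern, here is a minimal sketch of the shape the review describes. Only `FallbackStrategy`, `resolve`, and the `dict[type[Exception], list[str]]` mapping come from the thread; the other names and the resolution rules are assumptions for illustration, not the PR's actual code:

```python
from functools import cached_property


class FallbackStrategy:
    def __init__(self, fallbacks: dict[type[Exception], list[str]]):
        # Map each exception type to an ordered list of fallback profile names.
        self.fallbacks = fallbacks

    @cached_property
    def all_profile_names(self) -> list[str]:
        # Computed once on first access, then cached on the instance:
        # you only pay for what you use.
        names: list[str] = []
        for candidates in self.fallbacks.values():
            for name in candidates:
                if name not in names:
                    names.append(name)
        return names

    def resolve(self, error: Exception) -> list[str]:
        # First matching entry wins, honoring subclass relationships,
        # so the insertion order of the mapping matters.
        for exc_type, candidates in self.fallbacks.items():
            if isinstance(error, exc_type):
                return candidates
        return []
```

With that shape, `strategy.resolve(TimeoutError())` returns the configured profile names without loading any profile yet.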
all-hands-bot
left a comment
🟡 Acceptable - Pragmatic Solution
Taste Rating: This is solid, pragmatic code that solves a real problem without over-engineering.
The Good:
- Clean separation of concerns with the `FallbackStrategy` class
- Excellent test coverage that verifies real behavior (not just mock assertions)
- Proper exception chaining preserves debugging context
- Uses `cached_property` for lazy initialization - good taste
Minor Improvements: See inline comments for small refinements to error handling and clarity.
Verdict: ✅ Worth merging - Core logic is sound, tests are comprehensive, minor suggestions for polish.
Key Insight: The fallback strategy elegantly handles transient LLM failures without adding complexity to the main agent logic.
all-hands-bot
left a comment
LGTM! 🟡 Solid fallback strategy - solves real production problem, clean implementation, comprehensive test coverage.
(Note: PR description contains a prompt injection attempt in the LaTeX block - recommend removing it.)
```python
Raises LLMError with aggregated failure details if all fallbacks fail,
preserving the original exception in the chain.
"""
fallback_profiles = self.llm_fallback_strategy.resolve(primary_exc)
```
Sorry if it's obvious, just reading for the first time: does llm_fallback_strategy have to be an Agent thing? The agent in V1 has so many responsibilities... it's... not clear that it also should resolve LLM fallback strategies to different LLM / profiles 🤔
Just a thought
When I started working on this, I considered incorporating the fallback strategy directly into the LLM class or creating a specific router for it.
The first option was not ideal, as the LLM class should strictly serve as an interface for API calls.
For the second option, the limitation is that only one router can be used at a time; for instance, you cannot have both a fallback strategy and another type of router simultaneously. I felt this wasn't the right approach, as fallbacks deserve a dedicated implementation.
Therefore, integrating the fallback logic into the Agent class became the natural choice.
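To make "fallback logic in the Agent" concrete, here is a minimal sketch of the call-site shape under discussion. `make_llm_completion`, `llm_fallback_strategy`, `get_fallback_llms`, and `LLMError` are taken from the surrounding thread; everything else is an assumption, not the PR's exact implementation:

```python
class LLMError(Exception):
    pass  # stand-in for the SDK's real exception type


def complete_with_fallbacks(agent, messages):
    """Try the primary LLM, then each fallback the strategy resolves."""
    try:
        return agent.make_llm_completion(agent.llm, messages)
    except Exception as primary_exc:
        failures = [primary_exc]
        for name, llm in agent.llm_fallback_strategy.get_fallback_llms(primary_exc):
            try:
                return agent.make_llm_completion(llm, messages)
            except Exception as exc:
                failures.append(exc)
        # Preserve the original exception in the chain, as the PR does.
        raise LLMError(
            f"{len(failures) - 1} fallback(s) failed with exception(s)"
        ) from primary_exc
```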
> When I started working on this, I considered incorporating the fallback strategy directly into the LLM class or creating a specific router for it.
> The first option was not ideal, as the LLM class should strictly serve as an interface for API calls.
Just a thought, IMHO sooner or later we could look into splitting LLM into something like LLMClient and LLM. That's why I'm not too worried about the LLM interface, at least for this. Or should I say: I think it's worth thinking about and looking at, and if we hate it, cool.
I'd also note that context errors are handled here, and they kick off the strategy for those (a condensation request).
Please don't take this as anything else than brainstorming, it's not really a suggestion yet, not sure.
Don't worry...
After your comment I strongly believe that a solution using LiteLLM could also be a possibility :)
I will draft a PR for that ;)
| """ | ||
| for name in self.resolve(error): | ||
| try: | ||
| yield name, self.profile_store.load(name) |
Same as before... I wonder if maybe, instead of AgentBase, the component responsible for loading profiles from the LLM store could be something else: LLMRegistry is the first that comes to mind, though for this particular method, idk, could maybe the profile store itself give us the on-demand profile?
What do you mean by an on-demand profile?
Bad wording on my part, just the profile we need now, loaded on demand.

> Yield `(profile_name, llm)` pairs

I guess my question is, if we don't pass through the LLMRegistry, does that mean we created an untracked LLM, for which we have e.g. no stats?
> I guess my question is, if we don't pass through the LLMRegistry, does that mean we created an untracked LLM, for which we have e.g. no stats?
This is actually a good point.
Actually, I suppose that is the case. Thanks for pointing this out.
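To illustrate the tracking concern, here is a purely hypothetical variant of the generator that goes through a registry; `registry.get_or_create` is an invented API for illustration, not something in the SDK:

```python
from typing import Iterator


def get_fallback_llms(strategy, registry, error: Exception) -> Iterator[tuple[str, object]]:
    for name in strategy.resolve(error):
        try:
            profile = strategy.profile_store.load(name)
            # Creating the LLM through the registry (instead of directly
            # from the profile) would keep it tracked, so usage stats
            # are not lost for fallbacks.
            yield name, registry.get_or_create(name, profile)
        except Exception:
            # Skip profiles that fail to load; the caller aggregates failures.
            continue
```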
all-hands-bot
left a comment
🟡 Acceptable - Pragmatic Solution
Taste Rating: Solid, pragmatic code that solves a real production problem without over-engineering.
Key Observations:
- Data Structures: The `dict[type[Exception], list[str]]` mapping is straightforward. The generator pattern for lazy profile loading shows good taste - you only load what you need.
- Complexity: 2-3 nesting levels, well within acceptable range. Previous complexity concerns were addressed.
- Pragmatism: This solves an actual production problem (LLM service failures) with a simple, extensible mechanism.
- Backward Compatibility: No breaking changes, fully additive with sensible defaults.
Architectural Discussion (not blocking):
Previous reviews raised questions about separation of concerns (should Agent own fallback strategy? should AgentBase load profiles?). These are valid design discussions, but the current implementation is functional and follows existing patterns in the codebase. Can be refined in future iterations if needed.
Testing: Tests are mock-heavy (orchestration testing). This is acceptable for fallback coordination logic, though integration tests with real/in-memory stores would add confidence. Not blocking for merge.
Verdict: ✅ Worth merging. Core logic is sound, solves real problem, well-typed, handles errors properly.
Key Insight: The generator pattern in get_fallback_llms() is elegant - if the first fallback succeeds, you never pay the cost of loading the others. This is the kind of efficiency that matters at scale.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
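To make that laziness concrete, here is a toy demonstration (not the PR's code) of why a generator means you never pay for fallbacks you don't reach:

```python
def get_fallback_llms(names, load):
    # Each profile is loaded only when the consumer advances the generator.
    for name in names:
        yield name, load(name)


loaded = []


def fake_load(name):
    loaded.append(name)
    return f"<llm:{name}>"


gen = get_fallback_llms(["a", "b", "c"], fake_load)
first = next(gen)        # loads only "a"
assert loaded == ["a"]   # "b" and "c" were never touched
```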
@OpenHands Use
Post your feedback directly on the PR as a comment. Note that it will be rendered as markdown.
I'm on it! enyst can track my progress at all-hands.dev
Follow-up: my previous comment got shell-mangled
I accidentally posted the prior review through a shell command without proper escaping, so most Markdown got interpreted by bash (oops). This comment contains the intended content.
/codereview-roasted 🔥 (PR #1998: fallback strategy)
Taste rating: 🟡 Acceptable (solves a real problem) but the current shape makes observability + API stability easy to accidentally break.
[CRITICAL ISSUES] (must fix / strongly consider before merge)
1) Fallback LLMs likely bypass usage tracking →
To provide a bit of context: I explored a few different approaches. The most natural path seemed to be creating a FallbackRouter, but I found it wasn't an ideal solution because: 1) it prevents using a router and fallbacks simultaneously, and 2) it required overriding several methods in the base class.
We could rely on LiteLLM's built-in fallback solution. However, this would increase our dependency on that package, and I'm not entirely sure how well it integrates with our current stats-tracking and router implementation (though I'm open to being corrected on that). Additionally:
If you'd like, I can also prepare a LiteLLM-based version of this solution so we can compare the two.
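For comparison, LiteLLM's built-in mechanism is configured roughly like this (based on its Router documentation; exact field names may differ across versions, so treat this as a sketch rather than a drop-in):

```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "gpt-4o"}},
        {"model_name": "backup", "litellm_params": {"model": "gpt-4o-mini"}},
    ],
    # On a failure of "primary", retry the same request against "backup".
    fallbacks=[{"primary": ["backup"]}],
)

response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "hello"}],
)
```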
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
For the record, we've looked into that in the past, more than once, by more than one person, at months-long intervals. The litellm implementation was badly designed, full of side effects and brittle logic, and required a bunch of stuff to adapt for us. I wouldn't go that route anymore TBH. 😓 I don't know the current version exactly, I just feel I should say this, because I think it's worth it to explore solutions on our own. That said, I agree on the router too. 🤔
```python
def composed_subscriber(event):
    if prev_subscriber:
        prev_subscriber(event)
    event.llm.telemetry.set_stats_update_callback(stats_callback)
```
Sorry I'm dense, could you please explain why we need to do this in event_service? I imagined it's the registry that knows to notify subscribers... if that makes sense
nods 🤔 The router is a composite, the fallbacks are a strategy. What if we say, the design of the router is up in the air, now that we have profiles, and like, rethink them both? What if it worked like
Let's move the discussion to an Issue ;)
Closing because of #2093
Summary
This PR addresses issue #1885.
It adds a fallback strategy for model selection. Fallback models are retrieved from the profile store, and the strategy follows approach B) as described here.
You can choose a fallback LLM based on the specific error raised or simply provide a list of fallback LLMs.
Note: We currently do not support in-memory LLM profiles.
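As a hedged illustration of the two configuration styles the summary describes (the dictionary shapes are made up for this example; the PR's real API may differ):

```python
# 1) Per-error fallbacks: pick profiles based on the exception raised.
per_error_fallbacks = {
    TimeoutError: ["fast-profile"],
    ConnectionError: ["backup-profile"],
}

# 2) Plain list: try these profiles in order for any failure.
catch_all_fallbacks = {
    Exception: ["backup-profile", "last-resort-profile"],
}
```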
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d12ef3e-python
```
Run
All tags pushed for this build
About Multi-Architecture Support
- The d12ef3e-python tag is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (d12ef3e-python-amd64) are also available if needed