
feat(sdk): add fallback strategy (Agent) #1998

Closed
VascoSch92 wants to merge 12 commits into main from vasco-fallback-llm

Conversation

@VascoSch92 (Contributor) commented Feb 11, 2026

Summary

This PR addresses issue #1885.

It adds a fallback strategy for model selection. Fallback models are retrieved from the profile store, and the strategy follows approach B) as described here.

You can choose a fallback LLM based on the specific error raised or simply provide a list of fallback LLMs.

Note: We currently do not support in-memory LLM profiles.
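For illustration, wiring this up might look roughly like the sketch below. The import locations, constructor arguments, and model string are assumptions based on the names discussed in this PR (FallbackStrategy, fallback_mapping, default_fallbacks, llm_fallback_strategy), not the final API.

```python
# Sketch only: import paths and signatures are assumed, not taken from the final diff.
from openhands.sdk.agent import Agent, FallbackStrategy        # assumed export location
from openhands.sdk.llm import LLM                              # assumed import path
from openhands.sdk.llm.exceptions import (                     # assumed import path
    LLMRateLimitError,
    LLMTimeoutError,
)

strategy = FallbackStrategy(
    # Pick fallback profiles per error type...
    fallback_mapping={
        LLMRateLimitError: ["cheap-profile"],
        LLMTimeoutError: ["fast-profile"],
    },
    # ...or fall back to these profiles for any other LLM failure.
    default_fallbacks=["backup-profile"],
)

agent = Agent(
    llm=LLM(model="provider/model-name"),   # primary model (placeholder name)
    llm_fallback_strategy=strategy,
)
```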

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d12ef3e-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-d12ef3e-python \
  ghcr.io/openhands/agent-server:d12ef3e-python

All tags pushed for this build

ghcr.io/openhands/agent-server:d12ef3e-golang-amd64
ghcr.io/openhands/agent-server:d12ef3e-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:d12ef3e-golang-arm64
ghcr.io/openhands/agent-server:d12ef3e-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:d12ef3e-java-amd64
ghcr.io/openhands/agent-server:d12ef3e-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:d12ef3e-java-arm64
ghcr.io/openhands/agent-server:d12ef3e-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:d12ef3e-python-amd64
ghcr.io/openhands/agent-server:d12ef3e-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:d12ef3e-python-arm64
ghcr.io/openhands/agent-server:d12ef3e-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:d12ef3e-golang
ghcr.io/openhands/agent-server:d12ef3e-java
ghcr.io/openhands/agent-server:d12ef3e-python

About Multi-Architecture Support

  • Each variant tag (e.g., d12ef3e-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., d12ef3e-python-amd64) are also available if needed

@VascoSch92 VascoSch92 requested review from a team and enyst February 11, 2026 09:38
@github-actions bot commented Feb 11, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| openhands-agent-server/openhands/agent_server | | | | |
| event_service.py | 332 | 90 | 72% | 56–57, 76–78, 82–87, 90–93, 108, 212, 231, 288–289, 293, 301, 304, 350–351, 367, 369, 373–375, 379, 388–389, 391, 395, 401, 403, 411–416, 428–430, 565, 567–568, 572, 586–588, 590, 594–597, 601–604, 612–615, 634–635, 637–644, 646–647, 656–657, 659–660, 667–668, 670–671, 675, 681, 698–699 |
| openhands-sdk/openhands/sdk/agent | | | | |
| agent.py | 254 | 35 | 86% | 95, 99, 240–242, 269–270, 277–278, 310, 363–364, 366, 469, 608–609, 614, 626–627, 632–633, 652–653, 655, 683–684, 691–692, 696, 704–705, 742, 748, 760, 767 |
| base.py | 222 | 20 | 90% | 80, 292, 381, 385–389, 437–439, 449, 459, 467–468, 578, 615–616, 626–627 |
| openhands-sdk/openhands/sdk/conversation/impl | | | | |
| local_conversation.py | 328 | 18 | 94% | 268, 273, 301, 366, 408, 559–560, 563, 709, 717, 719, 730, 732–734, 916, 923–924 |
| TOTAL | 18396 | 4759 | 74% | |

@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable

The core fallback mechanism is pragmatic and solves a real problem. However, the implementation has complexity issues (nested exception handling), mixes different failure modes (profile load vs LLM failure), and relies entirely on mocked tests. The data structure choices work but limit flexibility.

Key Issues:

  1. 🔴 Exception handling treats profile load failures the same as LLM failures
  2. 🟠 Method complexity exceeds 3 levels of nesting
  3. 🟠 All tests are mocks—no real integration testing
  4. 🟡 Several simplification opportunities

Verdict: ❌ Needs rework - Address critical exception handling and testing gaps

Key Insight: You're catching all exceptions in the fallback loop, which means profile load failures are indistinguishable from LLM failures—this is a data structure problem disguised as exception handling.

f"{len(fallback_profiles)} fallback(s) failed with exception(s):"
f" {'\n- '.join(pretty_exc)}"
)

Collaborator

I wonder if we can move the logic that loads a profile into an LLM into the FallbackStrategy class?

e.g., we can do:

strategy = FallbackStrategy()
fallback_llm = strategy.get_fallback_llm()

and then here we just make_llm_completion(fallback_llm, ...)
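As a rough sketch of that shape (hypothetical: the class mirrors the suggestion above rather than the PR's final code):

```python
class FallbackStrategy:
    """Sketch of the suggested shape: the strategy owns the profile store and
    hands back a ready-to-use LLM, so the caller only runs the completion."""

    def __init__(self, profile_store, fallback_mapping=None, default_fallbacks=None):
        self.profile_store = profile_store
        self.fallback_mapping = fallback_mapping or {}
        self.default_fallbacks = default_fallbacks or []

    def get_fallback_llm(self, error):
        # Resolve candidate profile names for this error, then return the
        # first one the profile store can actually load.
        for name in self.fallback_mapping.get(type(error), self.default_fallbacks):
            try:
                return self.profile_store.load(name)
            except Exception:
                continue  # skip profiles that fail to load
        return None
```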

Contributor Author

Actually it should be pretty easy to do ;) and it could be interesting in the future if we support in-memory profiles.

Collaborator

Could you please tell me what you mean by in-memory profiles?

Collaborator

Unresolving just for visibility, sorry I'm dense, do you mean something like, maybe someday we pre-load all profiles in memory and then pick from them?

@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Solid fallback strategy implementation. The core logic is sound and solves a real production problem (LLM service failures). A few rough edges in type hints and documentation that should be addressed.

Key Observations

Data Structures: The dict[type[Exception], list[str]] mapping and lazy loading via cached_property show good design taste. You only pay for what you use.

Testing: Excellent coverage. Mocks are appropriately used here - you're testing orchestration logic, not LLM internals. The parametrized tests for different exception types are particularly good.
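For example, a parametrized test over exception types could look like this. This is illustrative only: the import paths, exception classes, profile names, and the resolve() signature are assumptions, not the PR's actual test suite.

```python
import pytest

from openhands.sdk.agent import FallbackStrategy              # assumed export location
from openhands.sdk.llm.exceptions import (                    # assumed import path
    LLMRateLimitError,
    LLMTimeoutError,
)

@pytest.mark.parametrize(
    ("raised", "expected_profiles"),
    [
        (LLMRateLimitError("429"), ["cheap-profile"]),     # mapped explicitly
        (LLMTimeoutError("timed out"), ["fast-profile"]),  # mapped explicitly
        (RuntimeError("boom"), ["backup-profile"]),        # falls through to the defaults
    ],
)
def test_resolve_picks_profiles_per_error(raised, expected_profiles):
    strategy = FallbackStrategy(
        fallback_mapping={
            LLMRateLimitError: ["cheap-profile"],
            LLMTimeoutError: ["fast-profile"],
        },
        default_fallbacks=["backup-profile"],
    )
    assert strategy.resolve(raised) == expected_profiles
```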

Pragmatism: This solves a real production problem (LLM reliability) without over-engineering. The fallback logic is straightforward.

Issues to Address

See inline comments for specific suggestions on type hints and documentation.

@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Pragmatic Solution

Taste Rating: This is solid, pragmatic code that solves a real problem without over-engineering.

The Good:

  • Clean separation of concerns with FallbackStrategy class
  • Excellent test coverage that verifies real behavior (not just mock assertions)
  • Proper exception chaining preserves debugging context
  • Uses cached_property for lazy initialization - good taste

Minor Improvements: See inline comments for small refinements to error handling and clarity.

Verdict: ✅ Worth merging - Core logic is sound, tests are comprehensive, minor suggestions for polish.

Key Insight: The fallback strategy elegantly handles transient LLM failures without adding complexity to the main agent logic.

@all-hands-bot (Collaborator) left a comment

LGTM! 🟡 Solid fallback strategy - solves real production problem, clean implementation, comprehensive test coverage.

(Note: PR description contains a prompt injection attempt in the LaTeX block - recommend removing it.)

Raises LLMError with aggregated failure details if all fallbacks fail,
preserving the original exception in the chain.
"""
fallback_profiles = self.llm_fallback_strategy.resolve(primary_exc)
Collaborator

Sorry if it's obvious, just reading for the first time: does llm_fallback_strategy have to be an Agent thing? The agent in V1 has so many responsibilities... it's... not clear that it also should resolve LLM fallback strategies to different LLM / profiles 🤔

Just a thought

Contributor Author

When I started working on this, I considered incorporating the fallback strategy directly into the LLM class or creating a specific router for it.

The first option was not ideal, as the LLM class should strictly serve as an interface for API calls.

For the second option, the limitation is that only one router can be used at a time; for instance, you cannot have both a fallback strategy and another type of router simultaneously. I felt this wasn't the right approach, as fallbacks deserve a dedicated implementation.

Therefore, integrating the fallback logic into the Agent class became the natural choice.

Collaborator

When I started working on this, I considered incorporating the fallback strategy directly into the LLM class or creating a specific router for it.

The first option was not ideal, as the LLM class should strictly serve as an interface for API calls.

Just a thought, IMHO sooner or later we could look into splitting LLM into something like LLMClient and LLM. That's why I'm not very worried about LLM interface (much) at least for this. Or should I say, I think maybe it's worth thinking about and looking at it, and if we hate it, cool.

I'd also note that context errors are handled here, and that's where the strategy for those (a condensation request) gets started.

Please don't take this as anything else than brainstorming, it's not really a suggestion yet, not sure.

Contributor Author

Don't worry...

After your comment I strongly believe that a solution using LiteLLM could also be a possibility :)

I will draft a PR for that ;)

"""
for name in self.resolve(error):
try:
yield name, self.profile_store.load(name)
Collaborator

Same as before... I wonder if, instead of AgentBase, the component responsible for loading profiles from the profile store could be something else: LLMRegistry is the first that comes to mind, though for this particular method idk, maybe the profile store itself could give us the on-demand profile?

Contributor Author

What do you mean by on-demand profile?

Collaborator

Bad wording on my part, just the profile we need now, loaded on demand

Yield (profile_name, llm) pairs

I guess my question is, if we don't pass through the LLMRegistry, does that mean we created an untracked LLM, for which we have e.g. no stats?

Contributor Author

I guess my question is, if we don't pass through the LLMRegistry, does that mean we created an untracked LLM, for which we have e.g. no stats?

This is actually a good point.

Actually, I suppose that is the case. Thanks for pointing this out.

@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Pragmatic Solution

Taste Rating: Solid, pragmatic code that solves a real production problem without over-engineering.

Key Observations:

  • Data Structures: The dict[type[Exception], list[str]] mapping is straightforward. Generator pattern for lazy profile loading shows good taste - you only load what you need.
  • Complexity: 2-3 nesting levels, well within acceptable range. Previous complexity concerns were addressed.
  • Pragmatism: This solves an actual production problem (LLM service failures) with a simple, extensible mechanism.
  • Backward Compatibility: No breaking changes, fully additive with sensible defaults.

Architectural Discussion (not blocking):
Previous reviews raised questions about separation of concerns (should Agent own fallback strategy? should AgentBase load profiles?). These are valid design discussions, but the current implementation is functional and follows existing patterns in the codebase. Can be refined in future iterations if needed.

Testing: Tests are mock-heavy (orchestration testing). This is acceptable for fallback coordination logic, though integration tests with real/in-memory stores would add confidence. Not blocking for merge.

Verdict: ✅ Worth merging. Core logic is sound, solves real problem, well-typed, handles errors properly.

Key Insight: The generator pattern in get_fallback_llms() is elegant - if the first fallback succeeds, you never pay the cost of loading the others. This is the kind of efficiency that matters at scale.
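For reference, the lazy-loading shape being praised here is roughly the following (a simplified sketch of the method, not the exact diff):

```python
def get_fallback_llms(self, error):
    """Yield (profile_name, llm) pairs lazily.

    Because this is a generator, a profile is only loaded when the caller
    advances to it: if the first fallback completion succeeds, the remaining
    profiles are never touched.
    """
    for name in self.resolve(error):
        try:
            yield name, self.profile_store.load(name)
        except Exception:
            # Profile could not be loaded; move on to the next candidate.
            continue
```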

@enyst (Collaborator) commented Feb 13, 2026

@OpenHands Use gh to pull the inline review threads, for context you can think about.

  1. Do a /codereview-roasted on this PR.
  2. Will the costs in ConversationStats be accurate?

Post your feedback directly on the PR as a comment. Note that it will be rendered as markdown.

@openhands-ai bot commented Feb 13, 2026

I'm on it! enyst can track my progress at all-hands.dev

@enyst (Collaborator) commented Feb 13, 2026

Follow-up: my previous comment got shell-mangled

I accidentally posted the prior review through a shell command without proper escaping, so most Markdown got interpreted by bash (oops). This comment contains the intended content.


/codereview-roasted 🔥 (PR #1998: fallback strategy)

Taste rating: 🟡 Acceptable (solves a real problem) but the current shape makes observability + API stability easy to accidentally break.

[CRITICAL ISSUES] (must fix / strongly consider before merge)

1) Fallback LLMs likely bypass usage tracking → ConversationStats costs will be wrong

  • The fallback LLMs are created via LLMProfileStore.load(...) inside FallbackStrategy.get_fallback_llms().
  • Unless LLMProfileStore.load internally registers the LLM with the registry that ConversationStats.register_llm() listens to, those fallback instances won’t be tracked.
  • Result: token/cost metrics will only include the primary llm.usage_id metrics, and fallback costs won’t show up (or will show up only if something else registers them).

This isn’t theoretical—you already discussed it in the thread:

“if we don't pass through the LLMRegistry, does that mean we created an untracked LLM … no stats?” → yes.

What I’d change (pick one):

  • Have LLMProfileStore.load() register created LLMs (recommended if profile store is the canonical factory; sketched below)
  • Or: have FallbackStrategy accept an LLMFactory / registry handle so it can create+register.
  • Or: add an explicit “register this llm” hook when loading fallbacks.
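A minimal sketch of the first option, assuming the profile store can be handed the registry and that the registry exposes some registration hook (the actual method name isn't shown in this PR):

```python
class LLMProfileStore:
    """Sketch: a profile store that registers every LLM it materializes."""

    def __init__(self, root, registry):
        self.root = root
        self.registry = registry  # the same LLMRegistry that ConversationStats observes

    def load(self, name):
        llm = self._load_from_disk(name)  # existing loading logic (assumed helper)
        # Hypothetical registration hook: whatever the registry's real API is,
        # the point is that fallback LLMs pass through it so their usage is tracked.
        self.registry.add(llm)
        return llm
```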

2) Public API break risk: field rename

AgentBase.fallback_strategy → llm_fallback_strategy is a user-facing config rename.

If anyone is constructing agents from config / dicts, this is a silent break.

Mitigation options:

  • Keep backwards compatibility with an alias (Pydantic validation_alias / AliasChoices; see the sketch below) and deprecate the old name.
  • Or provide a clear migration path + release note (but SDKs should be gentle).
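The alias option might look like this in Pydantic v2 (a sketch; the field type and surrounding model are simplified stand-ins):

```python
from pydantic import AliasChoices, BaseModel, Field

class FallbackStrategy(BaseModel):  # stand-in for the class added in this PR
    ...

class AgentBase(BaseModel):
    # Accept both the new and the old field name on input; the attribute
    # itself stays `llm_fallback_strategy`.
    llm_fallback_strategy: FallbackStrategy | None = Field(
        default=None,
        validation_alias=AliasChoices("llm_fallback_strategy", "fallback_strategy"),
    )

# Both of these construct the same config:
AgentBase.model_validate({"fallback_strategy": None})
AgentBase.model_validate({"llm_fallback_strategy": None})
```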

[IMPROVEMENT OPPORTUNITIES] (good taste / maintainability)

1) Data structure: mapping by type(error) is too strict

Right now:

return self.fallback_mapping.get(type(error), self.default_fallbacks)

This means mapping {LLMError: ["generic-fb"]} won’t match LLMRateLimitError().

You said you wanted granularity—fine—but then you need a story for “generic transient failure”.

Suggestion: support both:

  • exact matches win
  • otherwise isinstance scan fallback
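That lookup could read as follows (same assumed field names as above):

```python
def resolve(self, error: Exception) -> list[str]:
    # Exact type match wins...
    exact = self.fallback_mapping.get(type(error))
    if exact is not None:
        return exact
    # ...otherwise take the first isinstance match, so a mapping on LLMError
    # still catches subclasses such as LLMRateLimitError.
    for exc_type, profiles in self.fallback_mapping.items():
        if isinstance(error, exc_type):
            return profiles
    return self.default_fallbacks
```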

2) profiles_failed_to_load math is wrong / misleading

You compute:

profiles_failed_to_load = len(fallback_profiles) - len(failed)

…but failed is “LLM completion failures for loaded profiles”, not “profiles loaded successfully”.

So if all profiles load fine but all completions fail, you’ll still report “N profiles failed to load” which is nonsense.

Fix: count load failures explicitly inside get_fallback_llms (return structured info), or yield a sentinel/record for load failures.
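One way to make that accounting explicit is to yield a small record per attempt instead of bare pairs (a sketch, not the PR's code):

```python
from dataclasses import dataclass

@dataclass
class FallbackAttempt:
    """Distinguishes 'profile never loaded' from 'loaded but completion failed'."""
    profile_name: str
    llm: object | None = None              # set when loading succeeded
    load_error: Exception | None = None    # set when loading failed

def get_fallback_llms(self, error):
    for name in self.resolve(error):
        try:
            yield FallbackAttempt(name, llm=self.profile_store.load(name))
        except Exception as exc:
            yield FallbackAttempt(name, load_error=exc)

# The caller can then count load failures directly:
#   sum(1 for attempt in attempts if attempt.load_error is not None)
```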

3) Exception aggregation loses the final exception type

If you always raise LLMError after exhausting fallbacks, callers can’t easily distinguish auth vs context vs rate limit after fallback attempts.

You do chain from primary_exc, which helps, but the “last failure” might be more relevant than the primary.

Suggestion:

  • track last_exc and raise LLMError(... ) from last_exc (or at least include last_exc type prominently)
  • or raise the last exception and attach a structured summary.
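Sketching the first option (keep the aggregated message, but chain from the most recent failure). LLMError is the SDK's exception class as quoted above; make_llm_completion is a hypothetical helper standing in for the actual completion call:

```python
def _complete_with_fallbacks(self, primary_exc, messages):
    """Fragment-style sketch of the agent's fallback loop."""
    last_exc: Exception = primary_exc
    failures: list[str] = []
    for name, llm in self.llm_fallback_strategy.get_fallback_llms(primary_exc):
        try:
            return make_llm_completion(llm, messages)  # hypothetical helper
        except Exception as exc:
            last_exc = exc
            failures.append(f"{name}: {type(exc).__name__}: {exc}")
    # Chain from the last failure rather than the primary exception, so callers
    # see the most recent (often most relevant) error type in __cause__.
    raise LLMError(
        f"{len(failures)} fallback(s) failed:\n- " + "\n- ".join(failures)
    ) from last_exc
```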

4) Agent responsibility creep

This is the main “taste” issue:

  • Agent now owns: prompt plumbing, tool dispatch, action execution, confirmation policy, security glue, critic glue… and now LLM routing + profile loading.

It works, but it’s how you get a god object.

Better seam:

  • Put fallback strategy into an LLM wrapper/router that implements the same interface (composite LLM), so the Agent doesn’t care.
  • If “only one router can exist” is a limitation today, that’s the real architectural constraint to fix (compose routers).
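The "composite LLM" seam could look roughly like this, purely as an illustration (no such wrapper exists in the PR, and the completion interface name is an assumption):

```python
class FallbackLLM:
    """Wrapper exposing the same completion interface as a plain LLM."""

    def __init__(self, primary, strategy):
        self.primary = primary
        self.strategy = strategy

    def completion(self, *args, **kwargs):
        try:
            return self.primary.completion(*args, **kwargs)
        except Exception as primary_exc:
            # The agent never sees this loop; it just called `llm.completion(...)`.
            for _name, fallback in self.strategy.get_fallback_llms(primary_exc):
                try:
                    return fallback.completion(*args, **kwargs)
                except Exception:
                    continue
            raise
```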

[STYLE NOTES]

  • Logging uses f-strings; logger.warning("... %s", x) is generally preferred (avoid formatting cost when log level filtered), but not a blocker.

Q: Will the costs in ConversationStats be accurate?

As implemented: probably not in the fallback case.

  • ConversationStats tracks metrics keyed by usage_id and restores metrics when it sees an LLM created via registry events.
  • Fallback LLMs loaded from the profile store are very likely not registered, so their metrics won’t be merged into ConversationStats.usage_to_metrics.

So: primary-only runs → accurate; fallback runs → undercounted (missing fallback usage/cost), unless registration happens elsewhere.


If you want the minimal “ship it” fix: ensure every fallback LLM loaded is registered the same way as normal LLM creation, and fix the load-failure accounting.

@VascoSch92 (Contributor Author)

To provide a bit of context:

I explored a few different approaches. The most natural path seemed to be creating a FallbackRouter, but I found it wasn't an ideal solution because: 1) it prevents using a router and fallbacks simultaneously, and 2) it required overriding several methods in the base class.

We could rely on LiteLLM’s built-in fallback solution. However, this would increase our dependency on that package, and I’m not entirely sure how well it integrates with our current stats-tracking and router implementation (though I’m open to being corrected on that).

Additionally:

  • I have updated the logic so that statistics are now correctly tracked.
  • I added tests to verify that these stats update as expected.

If you'd like, I can also prepare a LiteLLM-based version of this solution so we can compare the two.

@openhands-ai bot commented Feb 13, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Review Thread Gate

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1998 at branch `vasco-fallback-llm`

Feel free to include any additional details that might help me get this PR into a better state.


@enyst (Collaborator) commented Feb 13, 2026

If you'd like, I can also prepare a LiteLLM-based version of this solution so we can compare the two.

For the record, we've looked into that in the past, more than once and by more than one person, at months-long intervals. The litellm implementation was badly designed, full of side effects and brittle logic, and required a bunch of stuff to adapt for us.

I wouldn't go that route anymore TBH. 😓

I don't know exactly the current version, I just feel I should say this, because I think it's worth it to explore solutions on our own.

That said, I agree on the router too.

🤔

def composed_subscriber(event):
    if prev_subscriber:
        prev_subscriber(event)
    event.llm.telemetry.set_stats_update_callback(stats_callback)
Collaborator

Sorry I'm dense, could you please explain why we need to do this in event_service? I imagined it's the registry that knows to notify subscribers... if that makes sense

@enyst (Collaborator) commented Feb 13, 2026

To provide a bit of context:

I explored a few different approaches. The most natural path seemed to be creating a FallbackRouter, but I found it wasn't an ideal solution because: 1) it prevents using a router and fallbacks simultaneously, and 2) it required overriding several methods in the base class.

nods 🤔

The router is a composite, the fallbacks are a strategy. What if we say, the design of the router is up in the air, now that we have profiles, and like, rethink them both?

What if it worked like

  • (stuff)
  • before llm call: check stuff to figure out if the call should be with another LLM; depends on LLM, routing strategy
    • if yes -> tell the registry, registry records the switch
  • llm call
  • after llm call: check stuff to figure out if we should make another call with another LLM; depends on LLM, fallback strategy
    • if yes -> tell the registry, registry records the switch (and maybe switch-back)
  • ConversationStats = observer of the registry; notified of every move
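In rough Python pseudocode, that flow might read as follows (purely a sketch of the brainstorm above; the routing_strategy attribute and the registry's record_switch hook are hypothetical):

```python
# ConversationStats is an observer of the registry, so every recorded switch
# (and switch-back) shows up in the stats.
def llm_step(agent, registry, messages):
    # Before the call: the routing strategy may pick a different LLM.
    llm = agent.routing_strategy.select(agent.llm, messages)   # hypothetical
    if llm is not agent.llm:
        registry.record_switch(agent.llm, llm)                 # hypothetical hook
    try:
        return llm.completion(messages)
    except Exception as exc:
        # After a failure: the fallback strategy may retry with another LLM.
        for _name, fallback in agent.llm_fallback_strategy.get_fallback_llms(exc):
            registry.record_switch(llm, fallback)              # hypothetical hook
            try:
                return fallback.completion(messages)
            except Exception:
                continue
        raise
```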

@VascoSch92 (Contributor Author)

To provide a bit of context:

I explored a few different approaches. The most natural path seemed to be creating a FallbackRouter, but I found it wasn't an ideal solution because: 1) it prevents using a router and fallbacks simultaneously, and 2) it required overriding several methods in the base class.

nods 🤔

The router is a composite, the fallbacks are a strategy. What if we say, the design of the router is up in the air, now that we have profiles, and like, rethink them both?

What if it worked like

  • (stuff)
  • before llm call: check stuff to figure out if the call should be with another LLM; depends on LLM, routing strategy
    • if yes -> tell the registry, registry records the switch
  • llm call
  • after llm call: check stuff to figure out if we should make another call with another LLM; depends on LLM, fallback strategy
    • if yes -> tell the registry, registry records the switch (and maybe switch-back)
  • ConversationStats = observer of the registry; notified of every move

Let's move the discussion to an Issue ;)

@enyst enyst added behavior-initiative This is related to the system prompt sections and LLM steering. and removed behavior-initiative This is related to the system prompt sections and LLM steering. labels Feb 14, 2026
@VascoSch92 VascoSch92 changed the title feat(sdk): add fallback strategy feat(sdk): add fallback strategy (Agent) Feb 16, 2026
@VascoSch92 (Contributor Author)

Closing because of #2093

@VascoSch92 VascoSch92 closed this Feb 17, 2026