
Building Reliable LLM Apps: A Failover Architecture Guide

How to design LLM applications that survive provider outages. Circuit breakers, fallback chains, health checks, and real-world failure patterns explained.

9 min read · 2026-02-13 · LLMWise Team
failover · reliability · circuit-breaker · mesh-routing · architecture

Why LLM reliability is a production problem

If your application depends on a single LLM provider, you have a single point of failure. And LLM providers fail more often than you might expect.

OpenAI has reported multiple high-profile outages over the past two years. Anthropic, Google, and every other major provider have had their own incidents -- rate limit storms during peak traffic, region-specific capacity exhaustion, and silent degradations where latency triples before the service finally returns errors. When your provider goes down, your product goes down with it.

The irony is that most teams build sophisticated retry logic for their databases and microservices but treat their LLM integration as a simple HTTP call. That approach works in development. It breaks in production at the worst possible moment.

LLM failover is not a nice-to-have. For any application where AI features are core to the user experience, it is table stakes for AI reliability.

Common failure modes

Before designing a failover architecture, you need to understand what actually goes wrong. These are the failure modes we see most often across LLMWise traffic:

Rate limiting (HTTP 429) is the most frequent failure. Every provider enforces per-minute and per-day token limits. During traffic spikes, you will hit them. The problem compounds when retries from multiple clients create a thundering herd effect.
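
One way to avoid adding to that herd is to retry with exponential backoff plus jitter, so a fleet of clients does not retry in lockstep. Here is a minimal Python sketch; the call_provider callable and the RateLimitError class are placeholders for whatever client and 429 exception your stack actually uses.

import random
import time

class RateLimitError(Exception):
    """Stand-in for your HTTP client's 429 exception."""

def call_with_backoff(call_provider, max_retries=4, base_delay=1.0):
    """Retry rate-limited calls with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call_provider()
        except RateLimitError:
            if attempt == max_retries:
                raise
            # Full jitter: sleep a random amount up to the exponential cap so
            # retries from many clients spread out instead of arriving together.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))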

Timeouts happen when a provider is technically up but responding too slowly. A model that usually responds in 2 seconds suddenly takes 30. Your users will not wait that long.

Server errors (500/502/503) indicate genuine provider-side failures. These are usually transient but can last minutes to hours during major incidents.

Capacity exhaustion is subtler. Some providers return success but queue your request for so long that the response is stale by the time it arrives. Others reject requests outright when GPU capacity is full.

Model deprecation is the slow-motion failure. A model you depend on gets sunset with 30 days' notice. If your architecture is hardcoded to a single model, you have a migration project on your hands. Our migration guide covers how to handle this gracefully.

Pattern 1: Circuit breaker

The circuit breaker pattern, borrowed from electrical engineering, prevents your system from repeatedly hammering a failing provider. It has three states:

  • Closed (normal operation): Requests flow through. Failures are counted.
  • Open (provider failing): After a failure threshold is reached, the circuit "opens" and all requests are immediately redirected to fallbacks. No traffic is sent to the failing provider.
  • Half-open (testing recovery): After a cooldown period, one test request is sent to the provider. If it succeeds, the circuit closes. If it fails, the circuit stays open.

LLMWise implements this with a threshold of 3 consecutive failures, which triggers a 30-second open window before a half-open retry. This is aggressive by design -- in LLM workflows, even a few seconds of failed requests can cascade into visible user-facing errors. For OpenRouter-routed traffic specifically, we use a separate threshold: 6 consecutive 429 errors trigger a 20-second open window, since rate limits are more transient than server errors.

The key insight with an LLM circuit breaker is that the threshold should be low. Unlike traditional microservices where a single failed request might be noise, consecutive LLM failures almost always indicate a real problem. Three failures in a row is a strong signal.
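
To make the state machine concrete, here is a minimal Python sketch of the three states with the thresholds described above (3 consecutive failures, 30-second cooldown). It is illustrative only, not LLMWise's internal implementation.

import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True   # closed: traffic flows, failures are counted
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            return True   # half-open: let a trial request through
        return False      # open: route this request to a fallback instead

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None  # trial request succeeded: close the circuit

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open (or re-open) the circuit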

Pattern 2: Fallback chains

A fallback chain defines an ordered list of models to try when your primary model is unavailable. The ordering matters -- you want models that are functionally similar but hosted on different providers.

A strong fallback chain for general-purpose text generation might look like:

  1. GPT-5.2 (OpenAI) -- primary, best quality for your use case
  2. Claude Sonnet 4.5 (Anthropic) -- different provider, comparable quality
  3. Gemini 3 Flash (Google) -- different provider again, lower latency, slightly different strengths

The critical design principle: cross provider boundaries. Having GPT-5.2 fall back to GPT-4o gives you zero protection against an OpenAI-wide outage. You need models from at least two, ideally three, different providers in your chain.
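
In code, a fallback chain is just an ordered walk across providers, skipping any whose circuit is open. A sketch, assuming a complete(model, messages) callable for issuing requests and a dict of the CircuitBreaker objects from the sketch above keyed by model; both are assumptions, not a specific SDK.

def complete_with_fallback(chain, breakers, complete, messages):
    """Walk the chain in order, skipping providers whose circuit is open."""
    last_error = None
    for model in chain:
        breaker = breakers[model]
        if not breaker.allow_request():
            continue  # circuit open for this provider; move down the chain
        try:
            response = complete(model, messages)
            breaker.record_success()
            return model, response
        except Exception as exc:  # 429, timeout, 5xx, ...
            breaker.record_failure()
            last_error = exc
    raise RuntimeError("every model in the fallback chain failed") from last_error

# A chain that crosses three provider boundaries:
chain = ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"]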

For a deeper look at how to design these chains, see our failover routing overview and step-by-step failover setup guide.

Pattern 3: Health-aware routing

Circuit breakers are reactive -- they trip after failures happen. Health-aware routing is proactive. By tracking per-provider success rates and latency over a rolling window, you can route traffic away from a degrading provider before it fully fails.

Metrics to track per model:

  • Success rate over the last 5 minutes
  • P95 latency (a sudden spike is an early warning sign)
  • Exponentially weighted moving average (EWMA) latency to smooth out noise while still reacting to trends

When a provider's success rate drops below a threshold (say, 95%) or its P95 latency exceeds 2x its baseline, start shifting traffic to healthier alternatives. This approach catches the "slow degradation" failure mode that circuit breakers miss entirely.
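
A sketch of the bookkeeping behind this in Python: a rolling success rate and an EWMA latency per model, plus a health check against the thresholds above. The window size, weighting, and baseline values here are illustrative defaults, not LLMWise's actual configuration.

from collections import deque
import time

class ModelHealth:
    """Per-model success rate over a rolling window plus EWMA latency."""

    def __init__(self, window_seconds=300, ewma_alpha=0.2, baseline_p95_ms=2000):
        self.window_seconds = window_seconds
        self.ewma_alpha = ewma_alpha          # 0.2 new sample / 0.8 history
        self.baseline_p95_ms = baseline_p95_ms
        self.samples = deque()                # (timestamp, succeeded, latency_ms)
        self.ewma_latency_ms = None

    def record(self, succeeded, latency_ms):
        now = time.monotonic()
        self.samples.append((now, succeeded, latency_ms))
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()
        if self.ewma_latency_ms is None:
            self.ewma_latency_ms = latency_ms
        else:
            self.ewma_latency_ms = (
                (1 - self.ewma_alpha) * self.ewma_latency_ms
                + self.ewma_alpha * latency_ms
            )

    def success_rate(self):
        if not self.samples:
            return 1.0
        return sum(ok for _, ok, _ in self.samples) / len(self.samples)

    def p95_latency_ms(self):
        if not self.samples:
            return 0.0
        latencies = sorted(ms for _, _, ms in self.samples)
        return latencies[int(0.95 * (len(latencies) - 1))]

    def is_healthy(self):
        # Shift traffic away when success rate < 95% or P95 exceeds 2x baseline.
        return (self.success_rate() >= 0.95
                and self.p95_latency_ms() <= 2 * self.baseline_p95_ms)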

LLMWise tracks a running average latency per model (weighted 80/20 toward historical data) and exposes this through route tracing events so you can observe provider health in real time.

Pattern 4: Cost-bounded failover

Naive failover can blow up your bill. If your primary model costs $0.50 per million input tokens and your fallback costs $15.00, a sustained outage on the primary could multiply your spend by 30x overnight.

Design your fallback chains with cost tiers in mind:

  • Tier 1 (primary): Best quality model for your use case
  • Tier 2 (comparable fallback): Similar capability, similar cost, different provider
  • Tier 3 (budget fallback): Acceptable quality, significantly cheaper -- used when both Tier 1 and 2 are down

Set cost alerts on your failover rate. If fallback activations exceed a threshold (say, 10% of requests over a 5-minute window), trigger an alert. This catches both provider issues and configuration mistakes.
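
A sketch of that alert in Python: record whether each request was served by a fallback and fire when activations exceed 10% of traffic over a rolling 5-minute window. The minimum sample size and the alert hook are placeholders you would adapt to your own alerting stack.

from collections import deque
import time

class FailoverRateAlert:
    """Fires when fallback activations exceed a share of recent traffic."""

    def __init__(self, threshold=0.10, window_seconds=300, min_requests=50):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.min_requests = min_requests   # avoid alerting on tiny samples
        self.events = deque()              # (timestamp, used_fallback)

    def record(self, used_fallback):
        now = time.monotonic()
        self.events.append((now, used_fallback))
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()
        if len(self.events) >= self.min_requests:
            rate = sum(f for _, f in self.events) / len(self.events)
            if rate > self.threshold:
                self.alert(rate)

    def alert(self, rate):
        # Placeholder: route this to PagerDuty, Slack, or your pager of choice.
        print(f"ALERT: {rate:.1%} of requests served by fallbacks in the last 5 minutes")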

LLMWise's credit settlement system handles this automatically: each request reconciles the reserved credit cost against the actual provider cost, so failover to a cheaper model refunds the difference rather than charging you the primary model's rate.

Implementing failover with LLMWise Mesh mode

Mesh mode wraps all four patterns -- circuit breaker, fallback chains, health tracking, and cost settlement -- into a single API call. You define your primary model and fallback chain, and LLMWise handles the routing:

curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.2",
    "routing": {
      "strategy": "rate-limit",
      "fallback": ["claude-sonnet-4.5", "gemini-3-flash", "deepseek-v3"]
    },
    "messages": [
      {
        "role": "user",
        "content": "Analyze the Q4 revenue trends and flag anomalies."
      }
    ],
    "stream": true
  }'

The response is a stream of Server-Sent Events. Route events tell you what is happening behind the scenes:

{"event": "route", "model": "gpt-5.2", "status": "trying"}
{"event": "route", "model": "gpt-5.2", "status": "failed", "status_code": 429, "latency_ms": 230}
{"event": "route", "model": "claude-sonnet-4.5", "status": "trying"}
{"event": "chunk", "model": "claude-sonnet-4.5", "delta": "Based on the Q4 data", "done": false}
{"event": "chunk", "model": "claude-sonnet-4.5", "delta": "...", "done": true, "latency_ms": 1842}
{"event": "trace", "final_model": "claude-sonnet-4.5", "attempts": 2, "saved_ms": 230, "total_ms": 2072}

The trace event at the end gives you a complete audit trail: which models were tried, which failed, the final model that served the response, and how much time was spent on failed attempts. This data feeds directly into monitoring. For a deeper exploration of intelligent routing, see our routing explainer.
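
As a rough illustration of consuming that stream client-side, here is a Python sketch using requests. It assumes each event arrives on a standard SSE data: line containing exactly the JSON shown above; the exact framing may differ in practice, and this is not an official SDK.

import json
import requests

def stream_with_tracing(payload, api_key):
    """Stream a Mesh request, yield text chunks, and log route/trace events."""
    response = requests.post(
        "https://llmwise.ai/api/v1/chat",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        stream=True,
    )
    response.raise_for_status()
    for raw in response.iter_lines():
        if not raw:
            continue
        line = raw.decode("utf-8")
        if line.startswith("data: "):      # standard SSE framing
            line = line[len("data: "):]
        if not line.startswith("{"):
            continue                       # skip SSE comments/keepalives
        event = json.loads(line)
        if event["event"] == "route":
            # Feed these into your health tracker and dashboards.
            print("route:", event["model"], event["status"])
        elif event["event"] == "chunk":
            yield event["delta"]
        elif event["event"] == "trace":
            # Full audit trail: final model, attempt count, end-to-end time.
            print("trace:", event["final_model"], event["attempts"], event["total_ms"])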

Monitoring and alerting

Failover infrastructure is only as good as your observability. Track these metrics:

  • Error rate per provider: Percentage of requests returning 4xx/5xx. Alert at 5%.
  • Failover rate: Percentage of requests served by a fallback model instead of the primary. Alert at 10% over a 5-minute window -- this means something is wrong with your primary.
  • Latency P95: Track per-provider and end-to-end (including failover time). A spike in end-to-end P95 without a per-provider spike means your failover itself is slow.
  • Cost deviation: Compare actual spend against expected spend. If failover is routing to more expensive models, your daily cost will drift upward.
  • Circuit breaker state changes: Log every open/close transition. A circuit that is flapping (rapidly opening and closing) indicates an intermittent issue that needs investigation.

The Mesh trace events provide all of this data per-request. Aggregate them into your observability stack (Datadog, Grafana, or even a simple time-series database) and set alerts on the thresholds above.

Conclusion

LLM providers will have outages. Rate limits will be hit. Models will be deprecated. The question is whether your application handles these failures gracefully or passes them directly to your users.

Start with the circuit breaker pattern and a two-model fallback chain across different providers. That alone will handle the majority of real-world failures. Then layer on health-aware routing and cost controls as your traffic grows. Or use LLMWise Mesh mode to get all four patterns in a single API call with zero infrastructure to manage.
