The Outage You Didn't Plan For
At 11:47pm on a Thursday, your primary LLM provider started returning 503s. Not all requests - about 30%. Enough to degrade your system visibly. Not enough to trigger your simple error rate alert.
By midnight, it was 60%. Your application was timing out on every other request. Support tickets were piling up. Your on-call engineer was staring at a graph that looked like a saw blade.
At 12:23am the provider posted an incident report. Estimated resolution: 2-4 hours.
Your system had three options: wait for the provider to recover, fail entirely, or fall back to something that kept working. You'd built options one and two. Option three didn't exist.
This is the failure mode the Graceful Degradation layer exists to prevent. Not model errors - infrastructure failures. The LLM endpoint going slow, rate-limited, or offline entirely. And since you treat the LLM as the application rather than a dependency, you have no fallback when it fails.
This is Part 6 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. This article covers Layer 6 - Graceful Degradation - including retry strategies, fallback model selection, and circuit breaker patterns for LLM infrastructure.
The LLM Is an External Dependency
This reframe matters more than any specific pattern.
Your LLM provider is an external service. It has SLAs it may not meet. It has rate limits. It has incident history. It behaves exactly like any other third-party API in your infrastructure - except most teams treat it as if it were a local function call.
Every pattern that distributed systems engineers use for external service reliability applies to LLMs:
- Retry with backoff for transient failures
- Circuit breakers to stop hammering degraded endpoints
- Fallback paths for when the primary is unavailable
- Timeouts to prevent cascading latency
- Bulkheads to isolate LLM failures from the rest of the system
None of this is novel. It's all standard distributed systems practice. The novelty is applying it to an LLM - which most teams don't, because they built the LLM call as if it were a reliable internal function.
Treat your LLM like you treat your payment processor. It will fail. Design for it.
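The bulkhead item from the list above is the one teams most often skip. A minimal sketch, assuming an asyncio application: a semaphore caps concurrent in-flight LLM calls, so a slow provider exhausts its own slot pool instead of the server's connections and workers. The class and error message here are illustrative, not a specific library.

```python
import asyncio

class LLMBulkhead:
    """Caps concurrent LLM calls so a slow provider can't exhaust
    the server's connection pool or worker tasks."""

    def __init__(self, max_concurrent: int = 20, acquire_timeout: float = 5.0):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._acquire_timeout = acquire_timeout

    async def run(self, coro_factory):
        # Fail fast if all slots are busy instead of queueing forever.
        try:
            await asyncio.wait_for(self._sem.acquire(), timeout=self._acquire_timeout)
        except asyncio.TimeoutError:
            raise RuntimeError("bulkhead full: LLM dependency saturated")
        try:
            return await coro_factory()
        finally:
            self._sem.release()
```

Rejecting fast when the pool is saturated is the point: a request that fails in 5 seconds is recoverable; a request that queues behind a hung provider for 30 seconds is a cascading failure.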
I call the deliberate allocation of reliability work across these patterns Failure Budget Allocation: your system has a maximum tolerable error rate, and each resilience layer absorbs a portion of that budget. Retry absorbs transient noise. The circuit breaker absorbs degraded endpoints. The fallback absorbs extended outages. If any layer is missing, the failures it should have absorbed become user-facing incidents.
The Three Failure Modes of LLM Infrastructure
Failure Mode 1: Transient Errors
Rate limit hits. Temporary 503s. Network timeouts. Connection resets. These are brief, recoverable, and common. They affect individual requests, not the entire endpoint.
The right response is retry with exponential backoff. Wait, try again, wait longer, try again. Most transient failures resolve within seconds.
What teams get wrong: retrying immediately and repeatedly, which hammers an already-stressed endpoint and worsens the problem. Or not retrying at all, which turns a transient 1-second failure into a user-facing error.
Failure Mode 2: Degraded Performance
The endpoint is responding but slowly. P50 latency goes from 800ms to 8 seconds. Some requests succeed, some time out. This is the hardest failure mode to detect and respond to.
The right response is timeout enforcement plus fallback. Set hard timeouts on LLM calls. When the timeout rate exceeds a threshold, treat it as a degraded endpoint and route to a fallback.
What teams get wrong: no timeouts on LLM calls (requests hang for 30+ seconds), no threshold monitoring (don't know it's degraded until users complain), no fallback (when they detect it, there's nowhere to go).
Failure Mode 3: Extended Outage
The endpoint is down or unresponsive for minutes to hours. This is rare but happens. Provider incidents, maintenance windows, regional outages.
The right response is a circuit breaker that stops attempting the failed endpoint, routes all traffic to a fallback, and periodically probes for recovery.
What teams get wrong: no circuit breaker (keep hammering the dead endpoint until it recovers), no fallback (system is fully down for the duration), no recovery detection (don't know when it's safe to resume primary).
Building the Retry Layer
Retry logic for LLM calls has specific requirements that differ from standard HTTP retry:
```python
import asyncio
import random

# Provider SDKs raise their own exception types; these stubs stand in
# for them so the example is self-contained.
class RateLimitError(Exception):
    def __init__(self, message: str, retry_after: float | None = None):
        super().__init__(message)
        self.retry_after = retry_after

class APIError(Exception):
    def __init__(self, message: str, status_code: int):
        super().__init__(message)
        self.status_code = status_code

class MaxRetriesExceeded(Exception):
    pass

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404}

async def llm_call_with_retry(
    client,
    request: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    timeout: float = 30.0,
) -> dict:
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(client.call(request), timeout=timeout)
        except asyncio.TimeoutError:
            last_error = "Request timed out"  # Timeout is always retryable
        except RateLimitError as e:
            last_error = str(e)
            # Respect Retry-After header if present
            retry_after = e.retry_after or base_delay * (2 ** attempt)
            if attempt < max_retries:
                await asyncio.sleep(min(retry_after, max_delay))
            continue
        except APIError as e:
            if e.status_code in NON_RETRYABLE_STATUS_CODES:
                raise  # Don't retry auth errors, bad requests, etc.
            last_error = str(e)
        if attempt < max_retries:
            # Exponential backoff with full jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            await asyncio.sleep(random.uniform(0, delay))
    raise MaxRetriesExceeded(
        f"Failed after {max_retries} retries. Last error: {last_error}"
    )
```

Without retry: every transient 503 is a user-facing failure. A one-second blip becomes a broken request.
Without jitter: all clients retry simultaneously after a rate limit, causing a thundering herd that worsens the outage they're recovering from.
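The full-jitter schedule referenced above amounts to sampling uniformly from zero up to the capped exponential delay, following the approach described in the AWS Architecture Blog post on backoff and jitter:

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-indexed): uniform in
    [0, min(cap, base * 2**attempt)], so concurrent clients spread
    out instead of retrying in lockstep."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With fixed backoff, every client that hit the same 429 sleeps the same duration and retries in the same instant; full jitter decorrelates them, which is exactly what a recovering endpoint needs.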
Building the Circuit Breaker
The circuit breaker pattern stops attempting a failing endpoint before you've exhausted all retries on every single request. It tracks the health of the endpoint globally and opens when the failure rate crosses a threshold.
```python
import threading
from collections import deque
from datetime import datetime, timedelta
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing - reject requests immediately
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: float = 0.5,  # Open at 50% failure rate
        window_seconds: int = 60,        # Over a 60-second window
        min_requests: int = 10,          # Minimum requests before evaluating
        recovery_timeout: int = 60,      # Seconds before probing recovery
    ):
        self.failure_threshold = failure_threshold
        self.window = timedelta(seconds=window_seconds)
        self.min_requests = min_requests
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.requests: deque = deque()  # (timestamp, success: bool)
        self.opened_at: datetime | None = None
        # Re-entrant: on_success/on_failure call record() while holding it
        self._lock = threading.RLock()

    def record(self, success: bool):
        with self._lock:
            now = datetime.utcnow()
            self.requests.append((now, success))
            # Prune old requests outside window
            cutoff = now - self.window
            while self.requests and self.requests[0][0] < cutoff:
                self.requests.popleft()
            self._evaluate_state(now)

    def _evaluate_state(self, now: datetime):
        if self.state == CircuitState.OPEN:
            if (now - self.opened_at).total_seconds() >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            return
        if len(self.requests) < self.min_requests:
            return
        failures = sum(1 for _, success in self.requests if not success)
        failure_rate = failures / len(self.requests)
        if failure_rate >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.opened_at = now

    def allow_request(self) -> bool:
        with self._lock:
            if self.state == CircuitState.OPEN:
                # Rejected requests never reach record(), so check the
                # recovery timeout here or the circuit stays open forever
                self._evaluate_state(datetime.utcnow())
            if self.state == CircuitState.OPEN:
                return False
            return True  # CLOSED, or HALF_OPEN probe

    def on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.requests.clear()
            self.record(True)

    def on_failure(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
                self.opened_at = datetime.utcnow()
            self.record(False)
```

The circuit breaker has three states:
Closed: normal operation. All requests go through. Failures are tracked.
Open: endpoint is failing. Requests are rejected immediately without attempting the LLM call. This is the key behavior - you stop paying retry costs on a dead endpoint and stop making your users wait for timeouts.
Half-open: recovery probe. One request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens and the recovery timeout resets.
Without a circuit breaker: every request during an outage burns a full retry budget before failing. At scale, this means thousands of requests waiting 30 seconds each, consuming connection pools and cascading the failure to the rest of your system.
Building the Fallback Layer
A circuit breaker that rejects requests is better than hammering a dead endpoint. A fallback that serves degraded-but-functional responses is better than rejecting requests.
Fallback strategies in order of decreasing preference:
1. Secondary model provider. Route to a different model (or the same model via a different provider) when the primary is unavailable.
```python
class AllProvidersUnavailable(Exception):
    pass

class FallbackRouter:
    def __init__(self, primary, secondary, circuit_breaker: CircuitBreaker):
        self.primary = primary
        self.secondary = secondary
        self.breaker = circuit_breaker

    async def call(self, request: dict) -> dict:
        if self.breaker.allow_request():
            try:
                result = await self.primary.call(request)
                self.breaker.on_success()
                return {**result, "provider": "primary"}
            except Exception:
                self.breaker.on_failure()
                # Fall through to secondary

        # Primary unavailable - use secondary
        try:
            result = await self.secondary.call(request)
            return {**result, "provider": "secondary", "degraded": True}
        except Exception:
            raise AllProvidersUnavailable(
                "Both primary and secondary LLM providers failed"
            )
```

2. Smaller/faster model on same provider. If GPT-4o is rate-limited, fall back to GPT-4o-mini. If Claude Sonnet is unavailable, fall back to Claude Haiku. Lower quality, but functional.
3. Cached responses. For common queries, serve cached responses from previous successful calls. Not appropriate for dynamic tasks, but acceptable for FAQ-style or classification tasks where recency doesn't matter.
4. Rule-based fallback. For classification and routing tasks, implement a deterministic fallback that uses keyword matching or simple heuristics. Much lower quality than the LLM, but infinitely better than a 503.
5. Graceful degradation message. The last resort. Tell the user the AI feature is temporarily unavailable and offer a manual path. Honest, recoverable, and far better than a spinner that never resolves.
The Named Pattern: Failure Budget Allocation
Borrowing from Google's SRE practices, I call this Failure Budget Allocation for LLM infrastructure.
Your system has a reliability budget - a maximum tolerable error rate. Allocate that budget across your failure modes deliberately:
```
Total error budget: 0.1% (99.9% availability target)
├── Transient errors (absorbed by retry): 0.01%
├── Degraded performance (absorbed by timeout + fallback): 0.05%
└── Extended outage (absorbed by circuit breaker + fallback): 0.04%
```

Each layer in the resilience stack absorbs a portion of the failure budget. Retry absorbs transient failures. The circuit breaker absorbs extended outages. The fallback absorbs what both miss.
If your fallback is unavailable, you have no failure budget for extended outages. This is the most common gap - teams build retry but not fallback, leaving extended outages as full-budget events.
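The allocation is easy to sanity-check numerically. Assuming, purely for illustration, one million requests per month, each slice of the 0.1% budget translates into a concrete count of tolerable failed requests:

```python
def budget_requests(total_requests: int, slice_pct: float) -> int:
    """How many failed requests a budget slice (in percent of all
    requests) allows over the period."""
    return int(total_requests * slice_pct / 100)

MONTHLY_REQUESTS = 1_000_000  # illustrative traffic assumption

# Allocation from the tree above (percentages of all requests)
allocation = {
    "transient (retry)": 0.01,
    "degraded (timeout + fallback)": 0.05,
    "outage (breaker + fallback)": 0.04,
}

# Slices must sum to the total 0.1% budget
assert abs(sum(allocation.values()) - 0.1) < 1e-9

for name, pct in allocation.items():
    print(f"{name}: {budget_requests(MONTHLY_REQUESTS, pct)} failed requests/month")
```

At this traffic level, the transient slice allows 100 failed requests a month, the degraded slice 500, and the outage slice 400 - which makes it obvious that a missing fallback layer doesn't shave the budget, it blows through it.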
What Observability Looks Like for This Layer
Circuit breaker state changes - log every CLOSED→OPEN and OPEN→CLOSED transition with timestamp, failure rate at trigger, and duration in OPEN state. This is your LLM infrastructure incident log.
Retry attempt distribution - what fraction of requests succeed on first attempt vs. require 1, 2, or 3 retries? A shifting distribution (more requests requiring multiple retries) is an early signal of endpoint degradation before the circuit breaker opens.
Fallback activation rate - what fraction of requests are served by the secondary provider or cached response? A rising fallback rate signals primary provider degradation even when the circuit breaker hasn't opened yet.
Provider latency percentiles - track P50, P95, and P99 latency separately for primary and secondary. A rising P99 while P50 is stable is a classic sign of partial degradation - some requests are hanging while most succeed.
Timeout rate - requests that hit the hard timeout before returning. A rising timeout rate is often the first observable signal of provider degradation, appearing before error rates increase.
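A minimal in-process version of these signals can be sketched with plain counters; in production you would export them to your metrics backend, and the names here are illustrative:

```python
from collections import Counter

class ResilienceMetrics:
    """In-process counters for the signals above; a production system
    would export these to a metrics backend instead."""

    def __init__(self):
        self.retry_histogram = Counter()  # attempts needed -> count
        self.fallback_count = 0
        self.timeout_count = 0
        self.total_requests = 0

    def record_request(self, attempts: int, used_fallback: bool, timed_out: bool):
        self.total_requests += 1
        self.retry_histogram[attempts] += 1
        if used_fallback:
            self.fallback_count += 1
        if timed_out:
            self.timeout_count += 1

    def fallback_rate(self) -> float:
        return self.fallback_count / self.total_requests if self.total_requests else 0.0

    def first_try_rate(self) -> float:
        """Fraction of requests succeeding without any retry; a drop
        here is an early warning before the circuit breaker opens."""
        return self.retry_histogram[0] / self.total_requests if self.total_requests else 0.0
```

The derived rates, not the raw counts, are what you alert on: a falling `first_try_rate` or rising `fallback_rate` is degradation you can see before users do.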
What to Build First
First: Hard timeouts on all LLM calls. If your LLM calls have no timeout, a slow provider turns into hanging requests that consume connections and memory until your server runs out of both. Add a 30-second hard timeout immediately.
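The hard timeout itself is a one-line wrapper around `asyncio.wait_for`; mapping the timeout to a domain-specific error (the exception name here is an illustrative choice) keeps call sites honest about what failed:

```python
import asyncio

class LLMTimeout(Exception):
    """Raised when the provider exceeds the hard deadline."""

async def call_with_deadline(coro, timeout_s: float = 30.0):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise LLMTimeout(f"LLM call exceeded {timeout_s}s hard timeout") from exc
```

`asyncio.wait_for` cancels the underlying call on timeout, which is the property you want: the slow request stops consuming a connection the moment you give up on it.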
Second: Retry with exponential backoff and jitter. Handle 429s and 5xx errors with retries. Respect Retry-After headers. Use full jitter. Three retries maximum.
Third: Circuit breaker. Wire up a circuit breaker on your primary LLM client. Start with conservative thresholds (50% failure rate over 60 seconds, 10 request minimum). Tighten as you observe production behavior.
Fourth: Secondary provider. Configure a secondary LLM client (different provider or smaller model on same provider). Wire it to activate when the circuit breaker opens.
Fifth: Degradation messaging. Implement a user-facing message for when all providers are unavailable. "Our AI assistant is temporarily unavailable. You can still [manual path]." Better than a spinner.
Sixth: Observability. Add metrics for all of the above. Circuit state, retry counts, fallback activations, latency percentiles. You cannot manage what you cannot measure.
The Principle
Your LLM provider will have an incident. The question is not whether but when - and whether your system is designed to handle it.
Retry absorbs the noise. The circuit breaker absorbs the storm. The fallback keeps the lights on.
An LLM system without a circuit breaker is a system that fails loudly and completely when its provider has an incident. An LLM system with one fails gracefully, partially, and recovers automatically.
Your users shouldn't know your provider had an incident. That's the standard. Build to it.
What's Next in This Series
- Part 1: Harness Engineering - The Missing Layer - The full seven-layer Harness Architecture overview
- Part 2: Normalization and Input Defense - Prompt injection, input sanitization, and multi-surface consistency
- Part 3: Context Engineering - Memory architectures, retrieval strategies, and context compression
- Part 4: Gated Execution - Policy engines, human-in-the-loop design, and dry-run patterns
- Part 5: Validation Layer Design - Schema validators, semantic checks, and repair prompt patterns
- Part 7: State Management for Agentic Systems - Checkpoint-resume strategies, cross-session memory, and durable state for long-running agents
- Part 8: Deterministic Constraint Systems - Building tool registries and action manifests that prevent hallucinated actions in agentic systems
References
- Fowler, M. (2018). Circuit Breaker. martinfowler.com. https://martinfowler.com/bliki/CircuitBreaker.html
- Nygard, M. (2007). Release It! Design and Deploy Production-Ready Software. Pragmatic Bookshelf.
- Google SRE Team. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. https://sre.google/sre-book/table-of-contents/
- Brooker, M. (2022). Exponential Backoff And Jitter. AWS Architecture Blog. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2018). The Site Reliability Workbook. O'Reilly Media. https://sre.google/workbook/table-of-contents/
- Netflix Tech Blog. (2012). Fault Tolerance in a High Volume, Distributed System. https://netflixtechblog.com/fault-tolerance-in-a-high-volume-distributed-system-91ab4faae74a
- Anthropic. (2024). Claude API Error Handling. https://docs.anthropic.com/en/api/errors
Related Articles
- Normalization and Input Defense: Hardening the Entry Point of Your LLM System
- Harness Engineering: The Missing Layer Between LLMs and Production Systems
- Validation Layer Design: Building the Reflex That Catches What the Model Gets Wrong
- Context Engineering: What the Model Sees Is What the Model Does