The Outage You Didn't Plan For
At 11:47pm on a Thursday, your primary LLM provider started returning 503s. Not all requests - about 30%. Enough to degrade your system visibly. Not enough to trigger your simple error rate alert.
By midnight, it was 60%. Your application was timing out on every other request. Support tickets were piling up. Your on-call engineer was staring at a graph that looked like a saw blade.
At 12:23am the provider posted an incident report. Estimated resolution: 2-4 hours.
Your system had three options: wait for the provider to recover, fail entirely, or fall back to something that kept working. You'd built options one and two. Option three didn't exist.
This is the failure mode the Graceful Degradation layer exists to prevent. Not model errors - infrastructure failures. The LLM endpoint going slow, rate-limited, or offline entirely. And since you treat the LLM as the application rather than a dependency, you have no fallback when it fails.
This is Part 6 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. This article covers Layer 6 - Graceful Degradation - including retry strategies, fallback model selection, and circuit breaker patterns for LLM infrastructure.
The LLM Is an External Dependency
This reframe matters more than any specific pattern.
Your LLM provider is an external service. It has SLAs it may not meet. It has rate limits. It has incident history. It behaves exactly like any other third-party API in your infrastructure - except most teams treat it as if it were a local function call.
Every pattern that distributed systems engineers use for external service reliability applies to LLMs:
- Retry with backoff for transient failures
- Circuit breakers to stop hammering degraded endpoints
- Fallback paths for when the primary is unavailable
- Timeouts to prevent cascading latency
- Bulkheads to isolate LLM failures from the rest of the system
None of this is novel. It's all standard distributed systems practice. The novelty is applying it to an LLM - which most teams don't, because they built the LLM call as if it were a reliable internal function.
Treat your LLM like you treat your payment processor. It will fail. Design for it.
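The bulkhead item from the list above is the one teams most often skip. A minimal sketch, assuming an asyncio application: a semaphore caps concurrent in-flight LLM calls, so a slow provider exhausts its own slot pool instead of the server's connections and workers. The class and error message here are illustrative, not a specific library.

```python
import asyncio

class LLMBulkhead:
    """Caps concurrent LLM calls so a slow provider can't exhaust
    the server's connection pool or worker tasks."""

    def __init__(self, max_concurrent: int = 20, acquire_timeout: float = 5.0):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._acquire_timeout = acquire_timeout

    async def run(self, coro_factory):
        # Fail fast if all slots are busy instead of queueing forever.
        try:
            await asyncio.wait_for(self._sem.acquire(), timeout=self._acquire_timeout)
        except asyncio.TimeoutError:
            raise RuntimeError("bulkhead full: LLM dependency saturated")
        try:
            return await coro_factory()
        finally:
            self._sem.release()
```

Rejecting fast when the pool is saturated is the point: a request that fails in 5 seconds is recoverable; a request that queues behind a hung provider for 30 seconds is a cascading failure.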
I call the deliberate allocation of reliability work across these patterns Failure Budget Allocation: your system has a maximum tolerable error rate, and each resilience layer absorbs a portion of that budget. Retry absorbs transient noise. The circuit breaker absorbs degraded endpoints. The fallback absorbs extended outages. If any layer is missing, the failures it should have absorbed become user-facing incidents.
The Three Failure Modes of LLM Infrastructure
Failure Mode 1: Transient Errors
Rate limit hits. Temporary 503s. Network timeouts. Connection resets. These are brief, recoverable, and common. They affect individual requests, not the entire endpoint.
The right response is retry with exponential backoff. Wait, try again, wait longer, try again. Most transient failures resolve within seconds.
What teams get wrong: retrying immediately and repeatedly, which hammers an already-stressed endpoint and worsens the problem. Or not retrying at all, which turns a transient 1-second failure into a user-facing error.
Failure Mode 2: Degraded Performance
The endpoint is responding but slowly. P50 latency goes from 800ms to 8 seconds. Some requests succeed, some time out. This is the hardest failure mode to detect and respond to.
The right response is timeout enforcement plus fallback. Set hard timeouts on LLM calls. When the timeout rate exceeds a threshold, treat it as a degraded endpoint and route to a fallback.
What teams get wrong: no timeouts on LLM calls (requests hang for 30+ seconds), no threshold monitoring (don't know it's degraded until users complain), no fallback (when they detect it, there's nowhere to go).
Failure Mode 3: Extended Outage
The endpoint is down or unresponsive for minutes to hours. This is rare but happens. Provider incidents, maintenance windows, regional outages.
The right response is a circuit breaker that stops attempting the failed endpoint, routes all traffic to a fallback, and periodically probes for recovery.
What teams get wrong: no circuit breaker (keep hammering the dead endpoint until it recovers), no fallback (system is fully down for the duration), no recovery detection (don't know when it's safe to resume primary).
Building the Retry Layer
Retry logic for LLM calls has specific requirements that differ from standard HTTP retry:
```python
import asyncio
import random

# Provider SDKs raise their own exception types; these stubs stand in
# for them so the example is self-contained.
class RateLimitError(Exception):
    def __init__(self, message: str, retry_after: float | None = None):
        super().__init__(message)
        self.retry_after = retry_after

class APIError(Exception):
    def __init__(self, message: str, status_code: int):
        super().__init__(message)
        self.status_code = status_code

class MaxRetriesExceeded(Exception):
    pass

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404}

async def llm_call_with_retry(
    client,
    request: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    timeout: float = 30.0,
) -> dict:
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(client.call(request), timeout=timeout)
        except asyncio.TimeoutError:
            last_error = "Request timed out"  # Timeout is always retryable
        except RateLimitError as e:
            last_error = str(e)
            # Respect Retry-After header if present
            retry_after = e.retry_after or base_delay * (2 ** attempt)
            if attempt < max_retries:
                await asyncio.sleep(min(retry_after, max_delay))
            continue
        except APIError as e:
            if e.status_code in NON_RETRYABLE_STATUS_CODES:
                raise  # Don't retry auth errors, bad requests, etc.
            last_error = str(e)
        if attempt < max_retries:
            # Exponential backoff with full jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            await asyncio.sleep(random.uniform(0, delay))
    raise MaxRetriesExceeded(
        f"Failed after {max_retries} retries. Last error: {last_error}"
    )
```

Without retry: every transient 503 is a user-facing failure. A one-second blip becomes a broken request.
Without jitter: all clients retry simultaneously after a rate limit, causing a thundering herd that worsens the outage they're recovering from.
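The full-jitter schedule referenced above amounts to sampling uniformly from zero up to the capped exponential delay, following the approach described in the AWS Architecture Blog post on backoff and jitter:

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-indexed): uniform in
    [0, min(cap, base * 2**attempt)], so concurrent clients spread
    out instead of retrying in lockstep."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With fixed backoff, every client that hit the same 429 sleeps the same duration and retries in the same instant; full jitter decorrelates them, which is exactly what a recovering endpoint needs.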
Building the Circuit Breaker
The circuit breaker pattern stops attempting a failing endpoint before you've exhausted all retries on every single request. It tracks the health of the endpoint globally and opens when the failure rate crosses a threshold.
```python
import threading
from collections import deque
from datetime import datetime, timedelta
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing - reject requests immediately
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: float = 0.5,  # Open at 50% failure rate
        window_seconds: int = 60,        # Over a 60-second window
        min_requests: int = 10,          # Minimum requests before evaluating
        recovery_timeout: int = 60,      # Seconds before probing recovery
    ):
        self.failure_threshold = failure_threshold
        self.window = timedelta(seconds=window_seconds)
        self.min_requests = min_requests
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.requests: deque = deque()  # (timestamp, success: bool)
        self.opened_at: datetime | None = None
        # Re-entrant: on_success/on_failure call record() while holding it
        self._lock = threading.RLock()

    def record(self, success: bool):
        with self._lock:
            now = datetime.utcnow()
            self.requests.append((now, success))
            # Prune old requests outside window
            cutoff = now - self.window
            while self.requests and self.requests[0][0] < cutoff:
                self.requests.popleft()
            self._evaluate_state(now)

    def _evaluate_state(self, now: datetime):
        if self.state == CircuitState.OPEN:
            if (now - self.opened_at).total_seconds() >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            return
        if len(self.requests) < self.min_requests:
            return
        failures = sum(1 for _, success in self.requests if not success)
        failure_rate = failures / len(self.requests)
        if failure_rate >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.opened_at = now

    def allow_request(self) -> bool:
        with self._lock:
            if self.state == CircuitState.OPEN:
                # Rejected requests never reach record(), so check the
                # recovery timeout here or the circuit stays open forever
                self._evaluate_state(datetime.utcnow())
            if self.state == CircuitState.OPEN:
                return False
            return True  # CLOSED, or HALF_OPEN probe

    def on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.requests.clear()
            self.record(True)

    def on_failure(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
                self.opened_at = datetime.utcnow()
            self.record(False)
```

The circuit breaker has three states:
Closed: normal operation. All requests go through. Failures are tracked.
Open: endpoint is failing. Requests are rejected immediately without attempting the LLM call. This is the key behavior - you stop paying retry costs on a dead endpoint and stop making your users wait for timeouts.
Half-open: recovery probe. One request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens and the recovery timeout resets.
Without a circuit breaker: every request during an outage burns a full retry budget before failing. At scale, this means thousands of requests waiting 30 seconds each, consuming connection pools and cascading the failure to the rest of your system.
Building the Fallback Layer
A circuit breaker that rejects requests is better than hammering a dead endpoint. A fallback that serves degraded-but-functional responses is better than rejecting requests.
Fallback strategies in order of decreasing preference:
1. Secondary model provider. Route to a different model (or the same model via a different provider) when the primary is unavailable.
```python
class AllProvidersUnavailable(Exception):
    pass

class FallbackRouter:
    def __init__(self, primary, secondary, circuit_breaker: CircuitBreaker):
        self.primary = primary
        self.secondary = secondary
        self.breaker = circuit_breaker

    async def call(self, request: dict) -> dict:
        if self.breaker.allow_request():
            try:
                result = await self.primary.call(request)
                self.breaker.on_success()
                return {**result, "provider": "primary"}
            except Exception:
                self.breaker.on_failure()
                # Fall through to secondary

        # Primary unavailable - use secondary
        try:
            result = await self.secondary.call(request)
            return {**result, "provider": "secondary", "degraded": True}
        except Exception:
            raise AllProvidersUnavailable(
                "Both primary and secondary LLM providers failed"
            )
```

2. Smaller/faster model on same provider. If GPT-4o is rate-limited, fall back to GPT-4o-mini. If Claude Sonnet is unavailable, fall back to Claude Haiku. Lower quality, but functional.
3. Cached responses. For common queries, serve cached responses from previous successful calls. Not appropriate for dynamic tasks, but acceptable for FAQ-style or classification tasks where recency doesn't matter.
4. Rule-based fallback. For classification and routing tasks, implement a deterministic fallback that uses keyword matching or simple heuristics. Much lower quality than the LLM, but infinitely better than a 503.
5. Graceful degradation message. The last resort. Tell the user the AI feature is temporarily unavailable and offer a manual path. Honest, recoverable, and far better than a spinner that never resolves.
The Named Pattern: Failure Budget Allocation
Borrowing from Google's SRE practices, I call this Failure Budget Allocation for LLM infrastructure.
Your system has a reliability budget - a maximum tolerable error rate. Allocate that budget across your failure modes deliberately:
```
Total error budget: 0.1% (99.9% availability target)
├── Transient errors (absorbed by retry): 0.01%
├── Degraded performance (absorbed by timeout + fallback): 0.05%
└── Extended outage (absorbed by circuit breaker + fallback): 0.04%
```

Each layer in the resilience stack absorbs a portion of the failure budget. Retry absorbs transient failures. The circuit breaker absorbs extended outages. The fallback absorbs what both miss.
If your fallback is unavailable, you have no failure budget for extended outages. This is the most common gap - teams build retry but not fallback, leaving extended outages as full-budget events.
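The allocation is easy to sanity-check numerically. Assuming, purely for illustration, one million requests per month, each slice of the 0.1% budget translates into a concrete count of tolerable failed requests:

```python
def budget_requests(total_requests: int, slice_pct: float) -> int:
    """How many failed requests a budget slice (in percent of all
    requests) allows over the period."""
    return int(total_requests * slice_pct / 100)

MONTHLY_REQUESTS = 1_000_000  # illustrative traffic assumption

# Allocation from the tree above (percentages of all requests)
allocation = {
    "transient (retry)": 0.01,
    "degraded (timeout + fallback)": 0.05,
    "outage (breaker + fallback)": 0.04,
}

# Slices must sum to the total 0.1% budget
assert abs(sum(allocation.values()) - 0.1) < 1e-9

for name, pct in allocation.items():
    print(f"{name}: {budget_requests(MONTHLY_REQUESTS, pct)} failed requests/month")
```

At this traffic level, the transient slice allows 100 failed requests a month, the degraded slice 500, and the outage slice 400 - which makes it obvious that a missing fallback layer doesn't shave the budget, it blows through it.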
What Observability Looks Like for This Layer
Circuit breaker state changes - log every CLOSED→OPEN and OPEN→CLOSED transition with timestamp, failure rate at trigger, and duration in OPEN state. This is your LLM infrastructure incident log.
Retry attempt distribution - what fraction of requests succeed on first attempt vs. require 1, 2, or 3 retries? A shifting distribution (more requests requiring multiple retries) is an early signal of endpoint degradation before the circuit breaker opens.
Fallback activation rate - what fraction of requests are served by the secondary provider or cached response? A rising fallback rate signals primary provider degradation even when the circuit breaker hasn't opened yet.
Provider latency percentiles - track P50, P95, and P99 latency separately for primary and secondary. A rising P99 while P50 is stable is a classic sign of partial degradation - some requests are hanging while most succeed.
Timeout rate - requests that hit the hard timeout before returning. A rising timeout rate is often the first observable signal of provider degradation, appearing before error rates increase.
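A minimal in-process version of these signals can be sketched with plain counters; in production you would export them to your metrics backend, and the names here are illustrative:

```python
from collections import Counter

class ResilienceMetrics:
    """In-process counters for the signals above; a production system
    would export these to a metrics backend instead."""

    def __init__(self):
        self.retry_histogram = Counter()  # attempts needed -> count
        self.fallback_count = 0
        self.timeout_count = 0
        self.total_requests = 0

    def record_request(self, attempts: int, used_fallback: bool, timed_out: bool):
        self.total_requests += 1
        self.retry_histogram[attempts] += 1
        if used_fallback:
            self.fallback_count += 1
        if timed_out:
            self.timeout_count += 1

    def fallback_rate(self) -> float:
        return self.fallback_count / self.total_requests if self.total_requests else 0.0

    def first_try_rate(self) -> float:
        """Fraction of requests succeeding without any retry; a drop
        here is an early warning before the circuit breaker opens."""
        return self.retry_histogram[0] / self.total_requests if self.total_requests else 0.0
```

The derived rates, not the raw counts, are what you alert on: a falling `first_try_rate` or rising `fallback_rate` is degradation you can see before users do.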
What to Build First
First: Hard timeouts on all LLM calls. If your LLM calls have no timeout, a slow provider turns into hanging requests that consume connections and memory until your server runs out of both. Add a 30-second hard timeout immediately.
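The hard timeout itself is a one-line wrapper around `asyncio.wait_for`; mapping the timeout to a domain-specific error (the exception name here is an illustrative choice) keeps call sites honest about what failed:

```python
import asyncio

class LLMTimeout(Exception):
    """Raised when the provider exceeds the hard deadline."""

async def call_with_deadline(coro, timeout_s: float = 30.0):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise LLMTimeout(f"LLM call exceeded {timeout_s}s hard timeout") from exc
```

`asyncio.wait_for` cancels the underlying call on timeout, which is the property you want: the slow request stops consuming a connection the moment you give up on it.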
Second: Retry with exponential backoff and jitter. Handle 429s and 5xx errors with retries. Respect Retry-After headers. Use full jitter. Three retries maximum.
Third: Circuit breaker. Wire up a circuit breaker on your primary LLM client. Start with conservative thresholds (50% failure rate over 60 seconds, 10 request minimum). Tighten as you observe production behavior.
Fourth: Secondary provider. Configure a secondary LLM client (different provider or smaller model on same provider). Wire it to activate when the circuit breaker opens.
Fifth: Degradation messaging. Implement a user-facing message for when all providers are unavailable. "Our AI assistant is temporarily unavailable. You can still [manual path]." Better than a spinner.
Sixth: Observability. Add metrics for all of the above. Circuit state, retry counts, fallback activations, latency percentiles. You cannot manage what you cannot measure.
The Principle
Your LLM provider will have an incident. The question is not whether but when - and whether your system is designed to handle it.
Retry absorbs the noise. The circuit breaker absorbs the storm. The fallback keeps the lights on.
An LLM system without a circuit breaker is a system that fails loudly and completely when its provider has an incident. An LLM system with one fails gracefully, partially, and recovers automatically.
Your users shouldn't know your provider had an incident. That's the standard. Build to it.
What's Next in This Series
- Part 1: Harness Engineering - The Missing Layer - The full seven-layer Harness Architecture overview
- Part 2: Normalization and Input Defense - Prompt injection, input sanitization, and multi-surface consistency
- Part 3: Context Engineering - Memory architectures, retrieval strategies, and context compression
- Part 4: Gated Execution - Policy engines, human-in-the-loop design, and dry-run patterns
- Part 5: Validation Layer Design - Schema validators, semantic checks, and repair prompt patterns
- Part 7: State Management for Agentic Systems - Checkpoint-resume strategies, cross-session memory, and durable state for long-running agents
- Part 8: Deterministic Constraint Systems - Building tool registries and action manifests that prevent hallucinated actions in agentic systems
References
- Fowler, M. (2018). Circuit Breaker. martinfowler.com. https://martinfowler.com/bliki/CircuitBreaker.html
- Nygard, M. (2007). Release It! Design and Deploy Production-Ready Software. Pragmatic Bookshelf.
- Google SRE Team. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. https://sre.google/sre-book/table-of-contents/
- Brooker, M. (2022). Exponential Backoff And Jitter. AWS Architecture Blog. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2018). The Site Reliability Workbook. O'Reilly Media. https://sre.google/workbook/table-of-contents/
- Netflix Tech Blog. (2012). Fault Tolerance in a High Volume, Distributed System. https://netflixtechblog.com/fault-tolerance-in-a-high-volume-distributed-system-91ab4faae74a
- Anthropic. (2024). Claude API Error Handling. https://docs.anthropic.com/en/api/errors
Related Articles
- Normalization and Input Defense: Hardening the Entry Point of Your LLM System
- Harness Engineering: The Missing Layer Between LLMs and Production Systems
- Validation Layer Design: Building the Reflex That Catches What the Model Gets Wrong
- Context Engineering: What the Model Sees Is What the Model Does