The Bug That Wasn't a Bug
A team ran a nightly pipeline that extracted structured data from legal documents using an LLM. It had been running reliably for two months.
Then, silently, it failed. No exception thrown, no alert fired. For six weeks the output database had been accepting records with the damage_cap field stored as a string: "$500,000". Not a number. A string.
Downstream analytics, which assumed an integer, had been silently coercing it to zero. Every risk calculation for those six weeks had a damage cap of $0.
Nobody noticed until a lawyer asked why all their high-value contracts showed zero liability exposure.
The LLM had been returning "$500,000" instead of 500000 for six weeks. Every single record. The application code had accepted it. The database had stored it as VARCHAR. The analytics had silently broken.
There was no bug in the code. There was no validation layer.
This is Part 5 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. This article goes deep on Layer 5 - the Validation and Repair Layer - covering schema validators, semantic checks, repair prompt patterns, and the fail-fast vs. repair decision.
What the Validation Layer Actually Does
In Part 1 I described the Validation and Repair Layer as "the difference between a flaky demo and a production system." Let me be more specific.
The model will produce malformed output. This is not a model quality problem - it is a property of probabilistic systems operating at scale. Even a highly capable model running a well-designed prompt will produce type errors, missing fields, invalid enum values, and structurally broken JSON at some non-trivial rate. Under load, at edge cases, after prompt drift, after context window pressure - the rate increases.
The Validation Layer has two jobs:
Detection - catch every output that doesn't meet the contract your downstream systems expect. Not occasionally. Every time. Before a single malformed record reaches your database, your API, or your user.
Recovery - when validation fails, either repair the output automatically or fail in a controlled, observable, recoverable way. No silent failures. No downstream corruption. No $0 damage caps propagating for six weeks.
I structure this as a Validation Cascade: three tiers of checks applied in sequence, each catching what the previous cannot. Schema first (structure), semantic second (meaning), business rules third (domain constraints). The cascade runs cheapest-first - which is also why it works at production scale.
The model doesn't need to be perfect. It needs to be correctable. The Validation Layer is what makes correction possible.
The Three Validation Tiers
Validation is not a single check. It's a hierarchy of checks, applied in order, each tier catching failures the previous one cannot.
Tier 1: Schema Validation
The first check is structural. Does the output conform to the expected shape? Are required fields present? Are types correct? Is the JSON well-formed?
This is where Pydantic earns its place. Define your expected output as a Pydantic model and let it do the structural validation for you:
```python
from pydantic import BaseModel, field_validator, model_validator
from typing import Literal

class LiabilityClause(BaseModel):
    clause_text: str
    risk_type: str
    severity: Literal["low", "medium", "high", "critical"]
    damage_cap: int | None  # Must be integer, not string
    jurisdiction: str
    effective_date: str | None

    @field_validator("damage_cap", mode="before")
    @classmethod
    def coerce_damage_cap(cls, v):
        if v is None:
            return None
        if isinstance(v, str):
            # Strip currency symbols and commas before failing
            cleaned = v.replace("$", "").replace(",", "").strip()
            try:
                return int(float(cleaned))
            except ValueError:
                raise ValueError(f"damage_cap must be a number, got: {v!r}")
        return v

    @field_validator("severity", mode="before")
    @classmethod
    def validate_severity(cls, v):
        # mode="before" so this message, not Pydantic's Literal error, is raised
        if v not in ("low", "medium", "high", "critical"):
            raise ValueError(f"severity must be one of low/medium/high/critical, got: {v!r}")
        return v

class ExtractionResult(BaseModel):
    clauses: list[LiabilityClause]
    document_type: str
    jurisdiction_detected: str

    @model_validator(mode="after")
    def at_least_one_clause(self):
        if not self.clauses:
            raise ValueError("Extraction must return at least one clause")
        return self
```

Notice the coerce_damage_cap validator. It doesn't just reject "$500,000" - it strips the currency symbol and converts it. This is defensive coercion: attempt to recover the intent before raising an error. The model meant 500000. Get 500000. Log that a coercion happened. Move on.
Defensive coercion handles the long tail of formatting variations without triggering a repair loop for every minor difference. Reserve repair loops for failures that genuinely can't be coerced.
Tier 2: Semantic Validation
Schema validation checks structure. Semantic validation checks meaning.
A response can be structurally valid and semantically wrong. severity: "low" for a clause that says "contractor bears unlimited liability for all damages" is structurally fine and semantically broken.
```python
class SemanticValidator:
    def __init__(self, llm):
        self.llm = llm

    def validate(self, result: ExtractionResult, source_document: str) -> list[str]:
        issues = []
        for clause in result.clauses:
            issues.extend(self._check_severity_consistency(clause))
            issues.extend(self._check_damage_cap_plausibility(clause, source_document))
        return issues

    def _check_severity_consistency(self, clause: LiabilityClause) -> list[str]:
        # Rule-based semantic check: unlimited liability should never be "low"
        unlimited_indicators = [
            "unlimited", "all damages", "any and all", "without limit"
        ]
        text_lower = clause.clause_text.lower()
        if any(ind in text_lower for ind in unlimited_indicators):
            if clause.severity in ("low", "medium"):
                return [
                    f"Clause mentions unlimited liability but severity is '{clause.severity}'. "
                    f"Expected 'high' or 'critical'."
                ]
        return []

    def _check_damage_cap_plausibility(
        self, clause: LiabilityClause, source: str
    ) -> list[str]:
        # Use LLM as judge for complex semantic checks
        if clause.damage_cap is not None and clause.damage_cap > 0:
            prompt = (
                f"Source clause: {clause.clause_text}\n\n"
                f"Extracted damage_cap: ${clause.damage_cap:,}\n\n"
                "Does the extracted damage cap accurately reflect the source clause? "
                "Reply with only 'yes' or 'no'."
            )
            response = self.llm.call(prompt).strip().lower()
            if response == "no":
                return [
                    f"Damage cap ${clause.damage_cap:,} appears inconsistent "
                    f"with source clause text."
                ]
        return []
```

Semantic validation has two modes: rule-based (fast, deterministic, covers known failure patterns) and LLM-as-judge (flexible, catches novel failures, slower and more expensive). Use rule-based for common patterns you can enumerate. Use LLM-as-judge only for complex semantic consistency checks that can't be rule-encoded, and only when the cost of a false negative justifies the additional LLM call.
Tier 3: Business Rule Validation
The third tier enforces constraints that are specific to your domain and invisible to the model.
```python
class BusinessRuleValidator:
    def __init__(self, db):
        self.db = db

    def validate(self, result: ExtractionResult, document_id: str) -> list[str]:
        issues = []

        # Cross-reference: jurisdiction must be in our supported set
        supported = self.db.get_supported_jurisdictions()
        if result.jurisdiction_detected not in supported:
            issues.append(
                f"Jurisdiction '{result.jurisdiction_detected}' not in supported set. "
                f"Supported: {supported}"
            )

        # Uniqueness: no duplicate clause_text within same document
        seen_texts = set()
        for clause in result.clauses:
            if clause.clause_text in seen_texts:
                issues.append(f"Duplicate clause text detected: {clause.clause_text[:50]}...")
            seen_texts.add(clause.clause_text)

        # Referential integrity: document_type must match DB record
        doc_type = self.db.get_document_type(document_id)
        if doc_type and doc_type != result.document_type:
            issues.append(
                f"Extracted document_type '{result.document_type}' "
                f"doesn't match DB record '{doc_type}'"
            )

        return issues
```

Business rule validation is where domain knowledge lives - the constraints your model cannot know because they exist in your systems, not in the document being processed.
The Repair Loop Pattern
When validation fails and defensive coercion can't recover the output, you have a choice: fail fast or repair.
Fail fast when:
- The failure is unrecoverable (the model returned HTML instead of JSON)
- The failure has cascaded past the point where a retry would be coherent
- You've already retried and the same failure recurs
- The cost of retry exceeds the value of the result
Repair when:
- The failure is specific and correctable (wrong type on a known field)
- The schema violation can be described precisely in a prompt
- The model is likely to succeed on a targeted correction request
- The downstream cost of failure exceeds the cost of an additional LLM call
The repair loop is simple in concept and critical in implementation:
```python
import json
from pydantic import BaseModel, ValidationError

class MaxRetriesExceeded(Exception):
    pass

def validated_extraction(
    prompt: str,
    document: str,
    schema: type[BaseModel],
    llm,
    max_retries: int = 3,
) -> BaseModel:
    current_prompt = prompt
    last_error = None

    for attempt in range(max_retries):
        raw = llm.call(current_prompt + f"\n\nDocument:\n{document}")

        # Parse JSON
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = f"Invalid JSON: {e}"
            current_prompt = build_json_repair_prompt(prompt, raw, last_error)
            continue

        # Schema validation
        try:
            return schema.model_validate(parsed)
        except ValidationError as e:
            last_error = format_validation_errors(e)
            current_prompt = build_schema_repair_prompt(prompt, raw, last_error)
            continue

    raise MaxRetriesExceeded(
        f"Validation failed after {max_retries} attempts. Last error: {last_error}"
    )

def build_json_repair_prompt(original_prompt: str, bad_output: str, error: str) -> str:
    return (
        f"{original_prompt}\n\n"
        f"Your previous response was not valid JSON ({error}).\n"
        f"Your previous response was:\n{bad_output}\n\n"
        "Return the same content as well-formed JSON only, with no surrounding text."
    )

def build_schema_repair_prompt(original_prompt: str, bad_output: str, errors: str) -> str:
    return (
        f"{original_prompt}\n\n"
        f"Your previous response contained validation errors:\n{errors}\n\n"
        f"Your previous response was:\n{bad_output}\n\n"
        "Please correct these specific errors and return valid JSON. "
        "Do not change any fields that were correct."
    )

def format_validation_errors(e: ValidationError) -> str:
    lines = []
    for error in e.errors():
        field = " -> ".join(str(loc) for loc in error["loc"])
        lines.append(f"- Field '{field}': {error['msg']} (got: {error.get('input', 'unknown')})")
    return "\n".join(lines)
```

Two implementation details that matter enormously:
"Do not change any fields that were correct." Without this instruction, the model often fixes the errored field but introduces new errors in previously correct fields. Constrain the repair to the failing fields only.
Exponential backoff between retries. If the model fails once, it's unlikely to succeed with the same prompt immediately. Add a small delay between attempts. On the third attempt, consider widening the prompt with additional examples of the correct format.
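The backoff itself is a few lines. A minimal sketch - the base delay and cap values here are illustrative choices, not prescriptions from measurement:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff with full jitter: ~0.5s, ~1s, ~2s, ... capped at 8s.

    Jitter spreads retries out so a burst of failures doesn't
    hammer the provider in lockstep.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Inside the retry loop, before re-prompting:
#     time.sleep(backoff_delay(attempt))
```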
Instructor: The Production Standard
In Part 1, I named Pydantic and Instructor as the industry-standard tools for this layer. Here's why Instructor specifically earns that designation.
Instructor wraps LLM calls with automatic Pydantic-based validation and retry, turning the repair loop above into a single function call:
```python
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2000,
    messages=[{"role": "user", "content": f"Extract liability clauses:\n{document}"}],
    response_model=ExtractionResult,  # Your Pydantic model
    max_retries=3,                    # Automatic repair loop
)

# result is already a validated ExtractionResult instance
```

Instructor handles JSON parsing, schema validation, and retry with error feedback automatically. For schema validation, it is the right default. Build on top of it for semantic and business rule validation - those remain your responsibility.
If you are writing your own JSON parsing loop without Instructor, you are solving a solved problem. Stop and use Instructor.
The Named Pattern: Validation Cascade
I call the three-tier structure the Validation Cascade: schema first, semantic second, business rules third.
The cascade runs in this order for a reason. Schema validation is cheap and fast - it runs entirely in memory with no LLM calls. Semantic validation is more expensive - it may involve an LLM-as-judge call. Business rule validation requires database access.
Run the cheapest check first. Gate the expensive checks behind the cheap ones.
A schema failure short-circuits the cascade - no point running semantic validation on output that can't even be parsed. A schema pass but semantic failure short-circuits business rule validation. Only output that passes all upstream tiers reaches the final tier.
This cascade structure means your expensive checks only run on output that has already cleared the cheap ones - dramatically reducing the cost of comprehensive validation.
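Wired together, tiers 2 and 3 reduce to a short orchestration function (tier 1 has already passed by the time you hold a validated Pydantic instance). A sketch, assuming the validator classes shown earlier - the function name and return shape are mine:

```python
def run_cascade(result, source_document, document_id,
                semantic_validator, business_validator) -> tuple[str, list[str]]:
    """Run validation tiers cheapest-first, short-circuiting on failure.

    Returns (tier_name, issues): the first tier that reported issues,
    or ("passed", []) if all tiers cleared.
    """
    # Tier 2: semantic checks (may involve an LLM-as-judge call)
    issues = semantic_validator.validate(result, source_document)
    if issues:
        return ("semantic", issues)  # skip DB-backed checks entirely

    # Tier 3: business rules (requires database access)
    issues = business_validator.validate(result, document_id)
    if issues:
        return ("business", issues)

    return ("passed", [])
```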
What Observability Looks Like for This Layer
Validation failure rate by tier - what fraction of outputs fail at schema vs. semantic vs. business rule? A high schema failure rate signals prompt drift or model degradation. A high semantic failure rate signals the model is misunderstanding the task. A high business rule failure rate signals your domain constraints are tighter than the model's outputs.
Repair success rate - of outputs that fail validation and enter the repair loop, what fraction succeed on retry? Below 70% suggests your repair prompts need improvement. Above 95% suggests you could add more aggressive validation - the model corrects reliably enough to warrant stricter checks.
Defensive coercion rate - how often does Pydantic coerce a value rather than reject it? A rising coercion rate on a specific field signals the model is developing a systematic formatting habit that differs from your schema. Address it in the prompt before it becomes a repair loop dependency.
Retry distribution - what fraction of successes required 1 retry vs. 2 vs. 3? If most successes require 3 retries, your base prompt needs work. If most succeed on first attempt, your validation is healthy.
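All four signals reduce to a handful of counters. A minimal in-process sketch - class and method names are mine, and a real deployment would emit these to a metrics backend (Prometheus, StatsD) rather than hold them in memory:

```python
from collections import Counter

class ValidationMetrics:
    """Counters for the validation-layer signals described above."""

    def __init__(self):
        self.failures = Counter()   # validation failures, keyed by tier
        self.retries = Counter()    # successes, keyed by retries needed
        self.coercions = Counter()  # defensive coercions, keyed by field
        self.exhausted = 0          # records that failed all retries

    def record_failure(self, tier: str):
        self.failures[tier] += 1

    def record_success(self, retries_used: int):
        self.retries[retries_used] += 1

    def record_coercion(self, field: str):
        self.coercions[field] += 1

    def record_exhausted(self):
        self.exhausted += 1

    def repair_success_rate(self) -> float:
        # Of records that entered the repair loop, how many recovered?
        repaired = sum(n for r, n in self.retries.items() if r > 0)
        entered = repaired + self.exhausted
        return repaired / entered if entered else 1.0
```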
What to Build First
First: Pydantic models for all structured outputs. If you're calling json.loads() without Pydantic validation on the result, you have unvalidated LLM output in production. Add Pydantic models today.
Second: Install Instructor. Replace manual retry loops with Instructor's automatic validation and retry. Immediately reduces boilerplate and improves repair consistency.
Third: Defensive coercion for common type mismatches. Add field validators for the types the model consistently gets wrong: currency strings to integers, date strings to date objects, percentage strings to floats.
Fourth: Schema repair prompts. Write targeted repair prompts for your most common validation failures. Don't rely on Instructor's default error messages - customize them to explain the correct format explicitly.
Fifth: Rule-based semantic validators. Add checks for the semantic failures you can enumerate from your domain knowledge. Unlimited liability with low severity. Negative damage caps. Contradictory date ranges.
Sixth: LLM-as-judge for complex semantic validation. Add LLM-as-judge checks only for semantic failures you cannot encode as rules. Be selective - each LLM-as-judge call doubles your inference cost for that record.
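The coercion step (third in the list above) can be kept as plain helper functions and wired into models via mode="before" field validators. A sketch - the helper names and the exact formats handled are illustrative assumptions, not an exhaustive list:

```python
from datetime import date, datetime

def coerce_date(v):
    """Accept '2024-03-15' or '03/15/2024' strings; pass date objects through."""
    if v is None or isinstance(v, date):
        return v
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(v.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {v!r}")

def coerce_percentage(v):
    """Accept '12.5%' strings or bare numbers; return a float fraction."""
    if isinstance(v, str) and v.strip().endswith("%"):
        return float(v.strip().rstrip("%")) / 100
    return float(v)
```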
The Principle
The model is a probabilistic system. Probabilistic systems produce incorrect outputs at some non-zero rate. That rate is not zero even for the best models on the best prompts.
Your validation layer is what transforms a probabilistic system into a reliable one.
Not by making the model perfect - that's not the goal and not achievable. By catching every deviation from the contract your downstream systems depend on, and either correcting it automatically or failing in a way that is observable, recoverable, and auditable.
The $0 damage caps ran for six weeks because there was no validation layer. There was no mechanism to catch the deviation, no alert when it started, no visibility into how long it had been happening.
Silent failures are the most expensive kind. The Validation Cascade makes LLM failures loud, caught, and correctable - before they reach your database, your users, or your lawyers.
The Rest of the Series
- Part 1: Harness Engineering - The Missing Layer - The full seven-layer Harness Architecture overview
- Part 2: Normalization and Input Defense - Prompt injection, input sanitization, and multi-surface consistency
- Part 3: Context Engineering - Memory architectures, retrieval strategies, and context compression
- Part 4: Gated Execution - Policy engines, human-in-the-loop design, and dry-run patterns
- Part 6: Retry, Fallback, and Circuit Breaking - Building resilient LLM infrastructure that survives model outages and latency spikes
- Part 7: State Management for Agentic Systems - Checkpoint-resume strategies, cross-session memory, and durable state for long-running agents
- Part 8: Deterministic Constraint Systems - Building tool registries and action manifests that prevent hallucinated actions in agentic systems
References
- Liu, J. (2023). Instructor: Structured outputs for LLMs. https://github.com/instructor-ai/instructor
- Pydantic. (2024). Pydantic V2 Documentation. https://docs.pydantic.dev/latest/
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
- Shinn, N., Cassano, F., Labash, B., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366
- Zheng, L., Chiang, W. L., Sheng, Y., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. https://arxiv.org/abs/2306.05685
- Chase, H. (2022). LangChain: Building applications with LLMs through composability. https://github.com/langchain-ai/langchain
- Anthropic. (2024). Tool use with Claude. https://docs.anthropic.com/en/docs/build-with-claude/tool-use
Related Articles
- Normalization and Input Defense: Hardening the Entry Point of Your LLM System
- Harness Engineering: The Missing Layer Between LLMs and Production Systems
- Retry, Fallback, and Circuit Breaking: Building LLM Infrastructure That Survives Outages
- Context Engineering: What the Model Sees Is What the Model Does