Stop Pasting Screenshots: How AI Engineers Document Systems with Mermaid

Introduction

Six months into your LLM project, someone asks: “How does our RAG pipeline actually work?” You dig through Slack. Check Notion. Find three different architecture diagrams—each contradicting the others. None match what’s actually deployed.

Sound familiar? This is the documentation debt that kills AI projects. Not because teams don’t document, but because traditional diagramming tools can’t keep up with how fast AI systems evolve.

I’ve watched this play out dozens of times. A team spends hours crafting beautiful architecture diagrams in Lucidchart or draw.io. Two sprints later, they’ve added a semantic router, switched vector databases, and introduced a reflection loop. The diagrams? Still showing the old design, locked in someone’s Google Drive. The fix isn’t better discipline. It’s better tools.

The Real Cost of Screenshot-Driven Documentation

When I started building production AI systems, I followed the standard playbook: design in Figma, export to PNG, paste into docs. The results were predictably bad.

Here’s what actually happens with static diagrams:

They diverge immediately. You add a cross-encoder reranking stage to your RAG pipeline. The diagram still shows simple vector similarity. Nobody updates it because that requires opening another tool, finding the original file, making edits, re-exporting, and re-uploading.

They’re invisible to code review. Your agent architecture changes during PR review—maybe you split one tool into two, or modified the state transition logic. The code diff shows this. Your diagram? Still wrong, and nobody notices because it’s not in the diff.

They break the development flow. Good documentation happens in context. When you’re deep in implementing a multi-agent workflow, the last thing you want is to switch to a visual editor, recreate your mental model, and then switch back.

I hit this wall hard while writing production-ready agentic systems. The architecture was evolving daily. Keeping diagrams synchronized was either impossible or consumed hours I needed for actual engineering.

Enter Diagram-as-Code

The solution isn’t working harder at diagram maintenance. It’s treating diagrams like we treat code: version-controlled, reviewable, and living alongside the implementation.

This is where Mermaid becomes essential infrastructure.

Instead of drawing boxes and arrows, you describe your system’s structure in plain text. The rendering happens automatically, everywhere your documentation lives—GitHub READMEs, technical blogs, internal wikis, even Jupyter notebooks.

Here’s a simple example. This code:

graph LR
    A[User Query] --> B[Semantic Router]
    B -->|factual| C[Vector DB]
    B -->|conversational| D[LLM Direct]
    C --> E[Reranker]
    E --> F[Context Builder]
    F --> G[LLM Generation]
    D --> G

Renders as a clean flowchart showing how queries route through different paths in your RAG system. No exports, no image hosting, no version drift.

The real power emerges when this diagram lives in your repository’s docs/ folder. Now when someone modifies the routing logic, they update both code and diagram in the same commit. Code review catches documentation drift before it happens.

Five Essential Mermaid Patterns for AI Engineers

Let me show you the diagram patterns I use constantly. These aren’t toy examples—they’re templates I’ve refined while building production systems that handle millions of queries.

1. LLM Agent Architecture with Tool Orchestration

Most agent tutorials show you a simple loop. Production agents are messier. They need memory systems, error handling, and complex tool orchestration.

flowchart TD
    Start([User Input]) --> Router{Intent Router}
    Router -->|search needed| ToolSelect[Tool Selection]
    Router -->|direct answer| Memory[Check Memory]
    
    ToolSelect --> Search[Web Search]
    ToolSelect --> DB[Database Query]
    ToolSelect --> Calc[Calculator]
    
    Search --> Validate{Result Valid?}
    DB --> Validate
    Calc --> Validate
    
    Validate -->|yes| Memory
    Validate -->|no| Retry{Retry Count}
    Retry -->|< 3| ToolSelect
    Retry -->|>= 3| Fallback[Fallback Response]
    
    Memory --> Context[Build Context]
    Fallback --> Context
    Context --> LLM[LLM Generation]
    LLM --> Update[Update Memory]
    Update --> End([Response])

This pattern captures what actually happens: tool failures, retry logic, and memory updates. When you’re debugging why your agent keeps hitting API limits, having this documented makes the problem obvious.

2. Multi-Stage RAG Pipeline

Basic RAG is “embed query, search vectors, generate response.” Production RAG has stages for query rewriting, hybrid search, reranking, and context filtering.

graph TB
    Query[User Query] --> Rewrite[Query Rewriter]
    Rewrite --> Parallel{Parallel Search}
    
    Parallel --> Dense[Dense Retrieval<br/>Vector DB]
    Parallel --> Sparse[Sparse Retrieval<br/>BM25/Keyword]
    
    Dense --> Fusion[Reciprocal Rank Fusion]
    Sparse --> Fusion
    
    Fusion --> Rerank[Cross-Encoder Reranking]
    Rerank --> Filter[Context Window Filter]
    
    Filter --> Prompt[Prompt Construction]
    Prompt --> LLM[LLM Generation]
    LLM --> Cite[Citation Extraction]
    Cite --> Response[Final Response]

When your retrieval quality drops, this diagram tells you exactly which stage to investigate. Is the query rewriter over-generalizing? Is fusion weighting wrong? Is the reranker actually improving results?

3. Multi-Agent Research System

Research agents need more than simple tool calls. They plan, execute, reflect, and revise. This is LangGraph territory.

stateDiagram-v2
    [*] --> Planning
    Planning --> Research: Plan Created
    
    Research --> ToolExecution: Query Generated
    ToolExecution --> ResultEval: Results Retrieved
    
    ResultEval --> Research: More Info Needed
    ResultEval --> Synthesis: Sufficient Info
    
    Synthesis --> Reflection: Draft Created
    Reflection --> Revision: Gaps Found
    Reflection --> Final: Quality Threshold Met
    
    Revision --> Research: New Questions
    Final --> [*]

State machines are perfect for agent workflows. You can see the loops (research → tool → eval → research) and the exit conditions (quality threshold met). This maps directly to LangGraph’s state management.

4. LLM Inference Pipeline with Fallbacks

Production systems need graceful degradation. When your primary model is down or rate-limited, what happens?

sequenceDiagram
    participant Client
    participant Gateway
    participant Primary as GPT-4
    participant Secondary as Claude
    participant Fallback as Local Model
    participant Cache
    
    Client->>Gateway: Request
    Gateway->>Cache: Check Cache
    
    alt Cache Hit
        Cache-->>Gateway: Cached Response
        Gateway-->>Client: Response (5ms)
    else Cache Miss
        Gateway->>Primary: Generate
        
        alt Primary Success
            Primary-->>Gateway: Response
            Gateway->>Cache: Store
            Gateway-->>Client: Response (800ms)
        else Primary Error
            Gateway->>Secondary: Fallback Request
            
            alt Secondary Success
                Secondary-->>Gateway: Response
                Gateway-->>Client: Response (1200ms)
            else All Failed
                Gateway->>Fallback: Local Generation
                Fallback-->>Gateway: Degraded Response
                Gateway-->>Client: Response (400ms)
            end
        end
    end

Sequence diagrams excel at showing timing, fallback chains, and interaction patterns. This one shows exactly how your system degrades under load—critical for reliability planning.

5. Agent State Transitions with Error Handling

Real agents don’t just flow forward. They handle errors, timeouts, and invalid states.

stateDiagram-v2
    [*] --> Idle
    
    Idle --> Processing: New Task
    Processing --> ToolCall: Action Required
    
    ToolCall --> Success: Result OK
    ToolCall --> Timeout: No Response
    ToolCall --> Error: API Error
    
    Timeout --> Retry: Attempt < 3
    Error --> Retry: Retriable Error
    Error --> Failed: Fatal Error
    
    Retry --> ToolCall: Backoff Complete
    Success --> Processing: Continue
    
    Processing --> Complete: Task Done
    Complete --> Idle: Reset
    
    Failed --> Idle: Manual Reset

This is the diagram I wish I’d had when debugging why agents were getting stuck. You can trace any execution path and see exactly where state transitions should happen.

Making Mermaid Work in Your Stack

The diagrams are useful, but only if they integrate seamlessly into your workflow. Here’s how I’ve set this up across different contexts.

GitHub Integration

Mermaid renders natively in GitHub. Drop the code in any .md file:

```mermaid
graph LR
    A[Component A] --> B[Component B]
```

That’s it. Your README, PR descriptions, and documentation all render diagrams automatically. No image hosting, no broken links.

The killer feature: diagrams in PR descriptions. When you’re proposing architecture changes, include a Mermaid diagram showing the new flow. Reviewers see the change visually before diving into code.

Documentation Sites

I use Quarto for technical writing, but the pattern works for MkDocs, Docusaurus, and most static site generators.

For Quarto:

format:
  html:
    mermaid:
      theme: neutral

Then diagrams just work in your .qmd files. The theme setting keeps them readable in both light and dark modes.

Jupyter Notebooks

When prototyping AI systems, I document the architecture right in the notebook:

from IPython.display import display, Markdown

mermaid_code = """
```mermaid
graph TD
    A[Data] --> B[Preprocess]
    B --> C[Embed]
    C --> D[Index]
```
"""

display(Markdown(mermaid_code))

This keeps exploration and documentation together. When the experiment becomes production code, the diagram moves with it.
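
If your notebook frontend doesn't render Mermaid fences in Markdown output, one common workaround is to render the diagram as an image via the public mermaid.ink service. This is a sketch that assumes that service and its base64-in-the-URL scheme:

import base64
from IPython.display import Image, display

def render_mermaid(graph: str) -> None:
    # mermaid.ink returns a rendered PNG for a base64-encoded diagram definition
    encoded = base64.b64encode(graph.encode("utf-8")).decode("ascii")
    display(Image(url=f"https://mermaid.ink/img/{encoded}"))

render_mermaid("""
graph TD
    A[Data] --> B[Preprocess]
    B --> C[Embed]
    C --> D[Index]
""")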

VS Code

The Mermaid Preview extension lets you see diagrams as you write them. Edit your architecture doc, see the diagram update live. This tight feedback loop makes documentation actually enjoyable.

Advanced Patterns I’ve Found Useful

Once you’re comfortable with basic diagrams, these techniques will level up your documentation game.

Custom Styling for Component Types

Different components deserve different visual treatment:

graph LR
    A[User Input]:::input --> B[LLM]:::model
    B --> C[(Vector DB)]:::storage
    C --> D[Results]:::output
    
    classDef input fill:#e1f5ff,stroke:#01579b
    classDef model fill:#fff9c4,stroke:#f57f17
    classDef storage fill:#f3e5f5,stroke:#4a148c
    classDef output fill:#e8f5e9,stroke:#1b5e20

Color coding makes complex diagrams scannable. Blue for inputs, yellow for models, purple for storage, green for outputs. Your brain pattern-matches instantly.

Subgraphs for System Boundaries

When documenting microservices or multi-container deployments:

graph TB
    subgraph "API Layer"
        A[FastAPI] --> B[Auth Middleware]
    end
    
    subgraph "Processing Layer"
        C[Agent Orchestrator]
        D[Tool Manager]
        E[Memory Store]
    end
    
    subgraph "Infrastructure"
        F[(PostgreSQL)]
        G[(Redis)]
        H[Vector DB]
    end
    
    B --> C
    C --> D
    C --> E
    E --> F
    D --> G
    C --> H

Subgraphs make system boundaries explicit. You can see what’s stateful versus stateless, what scales horizontally, where your bottlenecks are.

Links to Code

This is borderline magical. You can make diagram nodes clickable:

graph LR
    A[Agent Router] --> B[Search Tool]
    click A "https://github.com/yourorg/repo/blob/main/agent/router.py"
    click B "https://github.com/yourorg/repo/blob/main/tools/search.py"

Your architecture diagram becomes a navigable map of your codebase. Click a component, jump to its implementation.

When Mermaid Isn’t Enough

I’m bullish on diagram-as-code, but it’s not universal. Know the limits.

Complex visual design. If you’re creating marketing materials or presentation slides with custom branding, use proper design tools. Mermaid is for technical documentation, not visual design.

Extremely large graphs. Once you hit 50+ nodes, Mermaid diagrams become hard to read. At that scale, consider breaking into multiple diagrams or using specialized graph visualization tools.

Real-time monitoring. Mermaid is static. If you need live system visualization—metrics flowing through your pipeline, real-time dependency graphs—you want something like Grafana or custom dashboards.

The sweet spot is architectural documentation, system design, and workflow explanation. That covers 90% of what AI engineers need to document.

Making This Stick

Here’s how I’ve built this into my development workflow so it actually happens:

Diagram-first design. When planning a new feature, I sketch it in Mermaid before writing code. The act of documenting the design forces me to think through edge cases and dependencies.

PR templates with diagram prompts. Our PR template asks: “Does this change affect system architecture? If yes, update or add Mermaid diagrams.” Makes documentation part of the review process.

Living architecture docs. We maintain a docs/architecture/ folder with Mermaid diagrams for each major subsystem. When the system changes, the diff shows both code and diagram updates.

Blog post diagrams as code. When I write technical posts, diagrams are Mermaid by default. This means I can update them easily, and readers can fork the code to customize for their needs.

The Bigger Picture

This isn’t really about Mermaid. It’s about treating documentation as code.

When I look at successful AI engineering teams, they share a pattern: their documentation lives close to the implementation. Design docs in the repo. Architecture diagrams version-controlled. API specs generated from code.

The teams struggling with documentation debt? Their diagrams live in Google Slides. Their architecture docs are in Confluence, last updated six months ago. There’s friction between writing code and updating docs, so docs don’t get updated.

Mermaid removes that friction. Your diagram is a text file in your repo. Updating it is as natural as updating a comment. Code review catches documentation drift. Your architecture is always in sync because the alternative is harder.

For AI systems—where complexity grows fast, and architectures evolve constantly—this matters more than most domains. The difference between a team that can onboard new engineers in days versus weeks often comes down to documentation quality.

And documentation quality comes down to whether updating it is painful or painless.

Getting Started Today

If you’re convinced but not sure where to start:

Pick one system to document. Don’t boil the ocean. Choose one complex workflow—maybe your RAG pipeline or agent orchestration logic—and diagram it in Mermaid.

Put it in your repo. Create a docs/architecture.md file. Diagram goes there. Commit it.

Link from your README. Make the documentation discoverable. “See architecture docs for system design.”

Update it in your next PR. When you modify that system, update the diagram in the same commit. Feel how much easier this is than updating a PowerPoint.

Expand gradually. As you see the value, add more diagrams. Sequence diagrams for complex interactions. State machines for agent workflows. Flowcharts for decision logic.

The goal isn’t comprehensive documentation on day one. It’s building a habit where documentation updates are as natural as code updates.

Resources and Templates

I’ve already provided production-ready Mermaid templates for common AI system patterns above. Customize them for your needs.

Useful Mermaid resources: the official documentation is surprisingly good, and when you need specific syntax, the live editor's auto-complete helps.

Final Thoughts

Your AI system is going to change. New techniques will emerge. Your architecture will evolve. That’s the nature of working in a fast-moving field.

The question is whether your documentation will keep up.

Static diagrams won’t. Screenshot-driven workflows can’t. The friction is too high.

Diagram-as-code can. When updating documentation is as easy as updating code, it actually happens.

I’ve seen this transform how teams work. Less time in meetings explaining architecture. Faster onboarding. Fewer “wait, how does this actually work?” moments.

The switch isn’t hard. Pick one diagram you currently maintain in a visual tool. Recreate it in Mermaid. Put it in your repo. Update it once. You’ll feel the difference.

That’s when you’ll know this isn’t just another documentation fad. It’s the infrastructure for how modern AI systems should be documented.

Building Production-Ready AI Agents with LangGraph: A Developer’s Guide to Deterministic Workflows

Introduction

If you’ve built AI agents before, you know the frustration: they work great in demos, then fall apart in production. The agent crashes on step 8 of 10, and you start over from scratch. The LLM decides to do something completely different today than yesterday. You can’t figure out why the agent failed because state is hidden somewhere in conversation history.

I spent months wrestling with these problems before discovering LangGraph. Here’s what I learned about building agents that actually work in production.

The Chain Problem: Why Your Agents Keep Breaking

Most developers start with chains—simple sequential workflows where each step runs in order. They look clean:

chain = prompt_template | llm | output_parser | tool_executor
result = chain.invoke(user_input)

But chains have a fatal flaw: no conditional logic. Every step runs regardless of what happened before. If step 3 fails, you can’t retry just that step. If validation fails, you can’t loop back. If you need human approval, you’re stuck.

Figure: Graph vs. Chains. Graphs give you conditional routing—the ability to make decisions based on what actually happened.

Production systems need:

  • Conditional routing based on results
  • Retry logic for transient failures
  • Checkpointing to resume from crashes
  • Observable state you can inspect
  • Error handling that doesn’t blow up your entire workflow

That’s where graphs come in.

What LangGraph Actually Gives You

LangGraph isn’t just “chains with extra steps.” It’s a fundamentally different approach built around five core concepts:

Figure: LangGraph Core Concepts

1. Explicit State Management

Instead of hiding state in conversation history, you define exactly what your agent tracks:

from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    current_stage: str
    retry_count: int
    search_results: list[dict]
    status: str

Now you can inspect state at any point. Debug based on facts, not guesses.

2. Conditional Routing

The killer feature. Your agent can make decisions:

def route_next(state):
    if state["retry_count"] >= 3:
        return "fail"
    elif state["error"]:
        return "retry"  
    else:
        return "continue"

This simple function enables retry loops, error handling, and multi-stage workflows.

3. Checkpointing

Save state after every step. If execution crashes on step 8, you resume from the last checkpoint instead of starting over:

checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
app = workflow.compile(checkpointer=checkpointer)

# Crashes? Just resume
result = app.invoke(None, config={"configurable": {"thread_id": "123"}})

4. Cycles and Loops

Unlike chains, graphs can loop back. Validation failed? Retry. Output quality low? Refine and try again.
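
For instance, a refine loop is nothing more than a conditional edge that can point back at an earlier node. A minimal sketch, assuming a StateGraph builder named workflow and hypothetical node names:

from langgraph.graph import END

def quality_gate(state):
    # Loop back for another pass until the draft clears the bar
    return "done" if state["quality_score"] >= 0.8 else "refine"

workflow.add_conditional_edges(
    "review",
    quality_gate,
    {"refine": "generate", "done": END},
)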

5. Full Observability

Stream execution to see exactly what’s happening:

for step in app.stream(state, config):
    node_name = list(step.keys())[0]
    print(f"Node: {node_name}, Stage: {step[node_name]['current_stage']}")

No more black boxes.

Building a Real Agent: Research Agent Walkthrough

Let me show you how these concepts work in practice. We’ll build a research agent that:

  1. Plans search queries
  2. Executes searches
  3. Validates results (retries if insufficient)
  4. Extracts key findings
  5. Generates a final report

Here’s the complete flow:

Figure: Research Agent Flow. The agent handles retries automatically; if a search fails, it loops back without starting over.

Step 1: Define Your State

State is your agent’s memory. Everything it knows goes here:

class ResearchAgentState(TypedDict):
    # Conversation
    messages: Annotated[list[BaseMessage], add_messages]
    
    # Task
    research_query: str
    search_queries: list[str]
    
    # Results  
    search_results: list[dict]
    key_findings: list[str]
    report: str
    
    # Control flow
    current_stage: Literal["planning", "searching", "validating", ...]
    retry_count: int
    max_retries: int

Figure: Agent State Structure. Group related fields logically. Use reducers to control how updates merge.

Step 2: Create Nodes

Nodes are functions that transform state. Each does one thing well:

def plan_research(state: ResearchAgentState) -> dict:
    """Generate search queries from research question."""
    query = state["research_query"]
    
    response = llm.invoke([
        SystemMessage(content="You are a research planner."),
        HumanMessage(content=f"Create 3-5 search queries for: {query}")
    ])
    
    queries = parse_queries(response.content)
    
    return {
        "search_queries": queries,
        "current_stage": "searching"
    }

Figure: Node Anatomy. A node receives state, does work, and returns updates. Keep nodes focused.
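
These node functions get registered on a graph builder before anything runs. Here is a minimal sketch of that wiring, assuming LangGraph's StateGraph and node functions named to match the edges in the next step (execute_searches, validate_results, extract_findings, and handle_error are hypothetical stand-ins):

from langgraph.graph import StateGraph

workflow = StateGraph(ResearchAgentState)

# Register each function under the name the edges will refer to
workflow.add_node("plan", plan_research)
workflow.add_node("search", execute_searches)
workflow.add_node("validate", validate_results)
workflow.add_node("process", extract_findings)
workflow.add_node("handle_error", handle_error)

# Execution starts at the planning node
workflow.set_entry_point("plan")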

Step 3: Connect with Edges

Edges define flow. Static edges always go to the same node. Conditional edges make decisions:

# Always go from plan to search
workflow.add_edge("plan", "search")

# After validation, decide based on results
def route_validation(state):
    if state["current_stage"] == "processing":
        return "process"
    return "handle_error"

workflow.add_conditional_edges(
    "validate",
    route_validation,
    {"process": "process", "handle_error": "handle_error"}
)

This pattern handles validation failures, retries, and graceful degradation.

Step 4: Add Checkpointing

Production agents need checkpointing. Period.

from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("agent.db")
app = workflow.compile(checkpointer=checkpointer)

Now state saves after every node. Crash recovery is automatic.

Step 5: Execute with Observability

Stream execution to see what’s happening:

config = {"configurable": {"thread_id": "research-001"}}

for step in app.stream(initial_state, config=config):
    node_name = list(step.keys())[0]
    print(f"Executing: {node_name}")
    print(f"Stage: {step[node_name]['current_stage']}")

Here’s real output from a production run:

14:54:36 - Creating research agent
14:57:30 - Planning: Generated 5 search queries  
14:57:41 - Searching: 3/3 successful
14:57:41 - Validating: 3 valid results
15:03:26 - Processing: Extracted 5 key findings
15:07:32 - Generating: Report complete

Full visibility into what happened, when, and why.

The Power of State Reducers

One subtle but critical concept: reducers. They control how state updates merge.

Figure: Reducer Types

Default behavior is replace: new value overwrites old. But for lists and counters, you need different logic:

# Replace (default)
status: str  # New status replaces old

# Accumulate  
total_tokens: Annotated[int, add]  # Adds to running total

# Append
messages: Annotated[list, add_messages]  # Appends to history

# Custom
urls: Annotated[list, lambda old, new: list(set(old + new))]  # Dedupes

Getting reducers wrong causes subtle bugs. Two nodes both update messages? Without add_messages, only the last one’s messages survive.
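
Anything beyond the built-ins is just a function of (old value, new value). A small sketch with a hypothetical state class that deduplicates URLs:

from typing import Annotated, TypedDict

def merge_unique_urls(old: list[str], new: list[str]) -> list[str]:
    # Keep every URL seen so far, in order, without duplicates
    seen: set[str] = set()
    merged: list[str] = []
    for url in old + new:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged

class CrawlState(TypedDict):
    urls: Annotated[list[str], merge_unique_urls]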

Production Patterns That Actually Work

After building several production agents, here are patterns that saved me:

Pattern 1: Retry with Backoff

Don’t just retry immediately. Use exponential backoff:

import time

def agent_with_backoff(state):
    # Wait before retrying: 1s, 2s, 4s, ... capped at 60s
    if state["retry_count"] > 0:
        time.sleep(min(2 ** (state["retry_count"] - 1), 60))
    
    try:
        result = risky_operation()
        return {"result": result, "retry_count": 0}
    except Exception:
        return {"retry_count": state["retry_count"] + 1}

First retry: wait 1s. Second: 2s. Third: 4s. Prevents hammering rate-limited APIs.

Pattern 2: Error-Type Routing

Different errors need different handling:

def route_error(state):
    error = state["error_message"]
    
    if "rate_limit" in error:
        return "backoff"  # Wait longer
    elif "auth" in error:  
        return "refresh_credentials"
    elif "not_found" in error:
        return "try_fallback"
    else:
        return "retry"

A 404 error needs a different strategy than a rate limit.

Pattern 3: Validation Loops

Build quality in:

def route_validation(state):
    if validate(state["output"]):
        return "success"
    elif state["retry_count"] >= 3:
        return "fail"  
    else:
        return "improve"  # Loop back with feedback

Code doesn’t compile? Loop back and fix it. Output quality low? Try again with better context.

Common Pitfalls (And How to Avoid Them)

Pitfall 1: Infinite Loops

Always have an exit condition:

# BAD - loops forever if error persists
def route(state):
    if state["error"]:
        return "retry"
    return "continue"

# GOOD - circuit breaker
def route(state):
    if state["retry_count"] >= 5:
        return "fail"
    elif state["error"]:
        return "retry"  
    return "continue"

Pitfall 2: No Error Handling

Wrap risky operations:

def safe_node(state):
    try:
        result = api_call()
        return {"result": result, "status": "success"}
    except Exception as e:
        return {
            "status": "error",
            "error_message": str(e),
            "retry_count": state["retry_count"] + 1
        }

One unhandled exception crashes your entire graph.

Pitfall 3: Forgetting Checkpointing

Development without checkpointing is fine. Production without checkpointing is disaster. Always compile with a checkpointer:

from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver

# Development  
app = workflow.compile(checkpointer=MemorySaver())

# Production
app = workflow.compile(
    checkpointer=SqliteSaver.from_conn_string("agent.db")
)

Pitfall 4: Ignoring State Reducers

Default behavior loses data:

# BAD - second node overwrites first node's messages
messages: list[BaseMessage]

# GOOD - accumulates messages  
messages: Annotated[list[BaseMessage], add_messages]

Test your reducers. Make sure state updates as expected.

Pitfall 5: State Bloat

Don’t store large documents in state:

# BAD - checkpointing writes MBs to disk
documents: list[str]  # Entire documents

# GOOD - store references, fetch on demand  
document_ids: list[str]  # Just IDs

Keep state under 100KB for fast checkpointing.
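
In practice that means nodes resolve references on demand instead of carrying payloads around. A sketch, where doc_store is a stand-in for whatever document store you actually use:

# doc_store is a placeholder; swap in your real document store client
doc_store: dict[str, str] = {"doc-1": "…full text…", "doc-2": "…full text…"}

def build_context(state: dict) -> dict:
    # State carries only IDs; full text is fetched inside the node
    documents = [doc_store[doc_id] for doc_id in state["document_ids"]]
    return {"context": "\n\n".join(documents)}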

Visualizing Your Graph

LangGraph generates diagrams automatically:

from IPython.display import Image, display
display(Image(app.get_graph().draw_mermaid_png()))

Figure: Workflow Visualization. See exactly how your agent flows, including retry loops and error paths.

This catches design flaws before you deploy. Missing edge? Unreachable node? You’ll see it immediately.

Real-World Performance Numbers

Here’s what happened when I moved a research agent from chains to graphs:

Before (chains):

  • Network timeout on step 8 → restart from step 1
  • Cost: $0.50 per failure (7 wasted LLM calls)
  • Debugging time: 2 hours (no observability)
  • Success rate: 60% (failures compounded)

After (LangGraph):

  • Network timeout on step 8 → resume from step 8
  • Cost: $0.05 per retry (1 retried call)
  • Debugging time: 10 minutes (full logs)
  • Success rate: 95% (retries work)

The retry logic alone paid for the migration in a week.

Testing Production Agents

Unit test your nodes:

def test_plan_research():
    state = {"research_query": "AI trends"}
    result = plan_research(state)
    
    assert "search_queries" in result
    assert len(result["search_queries"]) > 0

Test your routers:

def test_retry_routing():
    # Should retry
    state = {"retry_count": 1, "max_retries": 3}
    assert route_retry(state) == "retry"
    
    # Should give up
    state = {"retry_count": 3, "max_retries": 3}
    assert route_retry(state) == "fail"

Integration test the full graph:

def test_agent_end_to_end():
    result = app.invoke(initial_state, config)
    
    assert result["current_stage"] == "complete"
    assert result["report"] != ""
    assert result["retry_count"] <= result["max_retries"]

These tests saved me hours of production debugging.

When to Use Graphs vs Chains

Use chains when:

  • Simple sequential workflow
  • No conditional logic needed
  • Single LLM call
  • Prototyping quickly

Use graphs when:

  • Conditional routing required
  • Need retry logic
  • Long-running workflows
  • Production deployment
  • Error handling critical

Rule of thumb: If your agent has more than 3 steps or any branching, use a graph.

Getting Started: Complete Working Example

I’ve packaged everything into a downloadable project:

GitHub: LangGraph Research Agent

The repo includes:

  • Complete source code
  • 3 working examples (basic, streaming, checkpointing)
  • Unit tests
  • Production-ready configuration
  • Comprehensive documentation

Quick start: follow the instructions in the GitHub repository linked above.

You’ll see the agent plan, search, validate, process, and generate a report—with full observability and automatic retries.

Key Takeaways

Building production agents isn’t about fancy prompts. It’s about engineering reliability into the system:

  1. Explicit state makes agents debuggable
  2. Conditional routing handles real-world complexity
  3. Checkpointing prevents wasted work
  4. Retry logic turns transient failures into eventual success
  5. Observability shows you exactly what happened

LangGraph gives you all of these. The learning curve is worth it.

Start with the research agent example. Modify it for your use case. Add nodes, adjust routing, customize state. The patterns scale from 3-node prototypes to 20-node production systems.

What’s Next

This covers deterministic workflows—agents that follow explicit paths. The next step is self-correction: agents that reason about their own execution and fix mistakes.

That’s Plan → Execute → Reflect → Refine loops, which we’ll cover in Module 4.

But master graphs first. You can’t build agents that improve themselves if you can’t build agents that execute reliably.

About This Series

This post is part of Building Real-World Agentic AI Systems with LangGraph, a comprehensive guide to production-ready AI agents.

Building agents that actually work in production is hard. But with the right patterns, it’s definitely achievable. LangGraph gives you those patterns.

Now go build something real.

Choosing the Right LLM Inference Framework: A Practical Guide

Performance benchmarks, cost analysis, and decision framework for developers worldwide


Here’s something nobody tells you about “open source” AI: the model weights might be free, but running them isn’t.

A developer in San Francisco downloads LLaMA-2 70B. A developer in Bangalore downloads the same model. They both have “open access.” But the San Francisco developer spins up an A100 GPU on AWS for $2.50/hour and starts building. The Bangalore developer looks at their budget, does the math on ₹2 lakhs per month for cloud GPUs, and realizes that “open” doesn’t mean “accessible.”

This is where LLM inference frameworks come in. They’re not just about making models run faster—though they do that. They’re about making the difference between an idea that costs $50,000 a month to run and one that runs on your laptop. Between building something in Singapore that requires data to stay in-region and something that phones home to Virginia with every request. Between a prototype that takes two hours to set up and one that takes two weeks.

The framework you choose determines whether you can actually build what you’re imagining, or whether you’re locked out by hardware requirements you can’t meet. So let’s talk about how to choose one.

What This Guide Covers (And What It Doesn’t)

This guide focuses exclusively on inference and serving constraints for deploying LLMs in production or development environments. It compares frameworks based on performance, cost, setup complexity, and real-world deployment scenarios.

What this guide does NOT cover:

  • Model quality, alignment, or training techniques
  • Fine-tuning or model customization approaches
  • Prompt engineering or application-level optimization
  • Specific model recommendations (LLaMA vs GPT vs others)

If you’re looking for help choosing which model to use, this isn’t the right guide. This is about deploying whatever model you’ve already chosen.

What You Need to Know

Quick Answer: Choose vLLM if you’re deploying at production scale (100+ concurrent users) and need consistently low latency. Choose TensorRT-LLM if you’re on NVIDIA hardware and can invest 1-2 weeks in setup for maximum throughput efficiency. Choose Ollama if you’re prototyping and want something running in 10 minutes. Choose llama.cpp if you don’t have access to GPUs or need to deploy on edge devices.

The Real Question: This isn’t actually about which framework is “best.” It’s about which constraints you’re operating under. A bootstrapped startup in Pune and a funded company in Singapore are solving fundamentally different problems, even if they’re deploying the same model. The “best” framework is the one you can actually use.

Understanding LLM Inference Frameworks

What is an LLM Inference Framework?

An LLM inference framework is specialized software that handles everything involved in getting predictions out of a trained language model. Think of it as the engine that sits between your model weights and your users.

When someone asks your chatbot a question, the framework manages: loading the model into memory, batching requests from multiple users efficiently, managing the key-value cache that speeds up generation, scheduling GPU computation, handling the token-by-token generation process, and streaming responses back to users.

Without an inference framework, you’d need to write all of this yourself. With one, you get years of optimization work from teams at UC Berkeley, NVIDIA, Hugging Face, and others who’ve solved these problems at scale.
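
One practical consequence: several of the frameworks covered below (vLLM and Ollama among them) expose an OpenAI-compatible HTTP API, so your application code can stay largely unchanged while you swap serving backends. A hedged sketch using the standard openai client against a locally hosted server; the URL, port, and model name are assumptions:

from openai import OpenAI

# Point the standard OpenAI client at a local, OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model your server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
)
print(response.choices[0].message.content)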

Why This Choice Actually Matters

The framework you choose determines three things that directly impact whether your project succeeds:

Cost. A framework that delivers 120 requests per second versus 180 requests per second means the difference between renting 5 GPUs or 3 GPUs. At $2,500 per GPU per month, that’s $5,000 monthly—$60,000 annually. For a startup, that’s hiring another engineer. For a bootstrapped founder, that’s the difference between profitable and broke.

Time. Ollama takes an hour to set up. TensorRT-LLM can take two weeks of expert time. If you’re a solo developer, two weeks is an eternity. If you’re a funded team with ML engineers, it might be worth it for the performance gains. Time-to-market often matters more than theoretical optimization.

What you can build. Some frameworks need GPUs. Others run on CPUs. Some work on any hardware; others are locked to NVIDIA. If you’re in a region where A100s cost 3x what they cost in Virginia, or if your data can’t leave Singapore, these constraints determine what’s possible before you write a single line of code.

The Six Frameworks You Should Know

Let’s cut through the noise. There are dozens of inference frameworks, but six dominate the landscape in 2025. Each makes different trade-offs, and understanding those trade-offs is how you choose.

vLLM: The Production Workhorse

What it is: Open-source inference engine from UC Berkeley’s Sky Computing Lab, now a PyTorch Foundation project. Built for high-throughput serving with two key innovations—PagedAttention and continuous batching.

Performance: In published benchmarks and production deployments, vLLM typically delivers throughput in the 120-160 requests per second range with 50-80ms time to first token. What makes vLLM special isn’t raw speed—TensorRT-LLM can achieve higher peak throughput—but how well it handles concurrency. It maintains consistently low latency even as you scale from 10 users to 100 users.

Setup complexity: 1-2 days for someone comfortable with Python and CUDA. The documentation is solid, the community is active, and it plays nicely with Hugging Face models out of the box.
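
For a feel of the developer experience, here is a minimal offline-inference sketch with vLLM's Python API; the model name and sampling settings are illustrative:

from vllm import LLM, SamplingParams

# Loads the model from the Hugging Face Hub on first run
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)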

Best for: Production APIs serving multiple concurrent users. Interactive applications where time-to-first-token matters. Teams that want flexibility without weeks of setup time.

Real-world example: A Bangalore-based SaaS company with Series A funding uses vLLM to power their customer support chatbot. They handle 50-100 concurrent users during business hours, running on 2x A100 GPUs in AWS Mumbai region. Monthly cost: ₹4 lakhs ($4,800). They chose vLLM over TensorRT-LLM because their ML engineer could get it production-ready in a week versus a month.

TensorRT-LLM: Maximum Performance, Maximum Complexity

What it is: NVIDIA’s specialized inference library built on TensorRT. Not a general-purpose tool—this is specifically engineered to extract every possible bit of performance from NVIDIA GPUs through CUDA graph optimizations, fused kernels, and Tensor Core acceleration.

Performance: When properly configured on supported NVIDIA hardware, TensorRT-LLM can achieve throughput in the 180-220 requests per second range with 35-50ms time to first token at lower concurrency levels. Published benchmarks from BentoML show it delivering up to 700 tokens per second when serving 100 concurrent users with LLaMA-3 70B quantized to 4-bit. However, under certain batching configurations or high concurrency patterns, time-to-first-token can degrade significantly—in some deployments, TTFT can exceed several seconds when not properly tuned.

Setup complexity: 1-2 weeks of expert time. You need to convert model checkpoints, build TensorRT engines, configure Triton Inference Server, and tune parameters. The documentation exists but assumes you know what you’re doing. For teams without dedicated ML infrastructure engineers, this can feel like climbing a mountain.

Best for: Organizations deep in the NVIDIA ecosystem willing to invest setup time for maximum efficiency. Enterprise deployments where squeezing 20-30% more throughput from the same hardware justifies weeks of engineering work.

Real-world example: A Singapore fintech company processing legal documents uses TensorRT-LLM on H100 GPUs. They handle 200+ concurrent users and need data to stay in the Singapore region for compliance. The two-week setup time was worth it because the performance gains let them use 3 GPUs instead of 5, saving S$8,000 monthly.

Ollama: Developer-Friendly, Production-Limited

What it is: Built on llama.cpp but wrapped in a polished, developer-friendly interface. Think of it as the Docker of LLM inference—you can get a model running with a single command.

Performance: In typical development scenarios, Ollama handles 1-3 requests per second in concurrent situations. This isn’t a production serving framework—it’s optimized for single-user development environments. But for that use case, it’s exceptionally good.

Setup complexity: 1-2 hours. Install Ollama, run ‘ollama pull llama2’, and you’re running a 7B model on your laptop. It handles model downloads, quantization, and serving automatically.
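
Once a model is pulled, Ollama serves a local HTTP API on port 11434. A quick sketch of calling it from Python (the model name assumes you pulled llama2):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
print(resp.json()["response"])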

Best for: Rapid prototyping. Learning how LLMs work without cloud bills. Individual developers building tools for themselves. Any situation where ease of use matters more than serving many concurrent users.

Real-world example: A solo developer in Austin building a personal research assistant uses Ollama on a MacBook Pro. Zero cloud costs. Zero setup complexity. When they’re ready to scale, they’ll migrate to vLLM, but for prototyping, Ollama gets them building immediately instead of fighting infrastructure.

llama.cpp: The CPU Enabler

What it is: Pure C/C++ implementation with no external dependencies, designed to run LLMs on consumer hardware. This is the framework that makes “I don’t have a GPU” stop being a blocker.

Performance: CPU-bound, meaning it depends heavily on your hardware. But with aggressive quantization (down to 2-bit), you can run a 7B model at usable speeds on a decent CPU. Not fast enough for 100 concurrent users, but fast enough for real applications serving moderate traffic.

Setup complexity: Hours to days, depending on your comfort with C++ compilation and quantization. More involved than Ollama, less involved than TensorRT-LLM.
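
If you would rather drive it from Python than compile the CLI yourself, the llama-cpp-python bindings wrap the same engine. A sketch, assuming you have already downloaded a quantized GGUF file (the path and quantization level are placeholders):

from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, n_threads=8)

out = llm("Q: What is retrieval-augmented generation? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])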

Best for: Edge deployment. Resource-constrained environments. Any scenario where GPU access is impossible or prohibitively expensive. Developers who need maximum control over every optimization.

Real-world example: An ed-tech startup in Pune runs llama.cpp on CPU servers, serving 50,000 queries daily for their AI tutor product. Monthly infrastructure cost: ₹15,000 ($180). They tried GPU options first, but ₹2 lakhs per month wasn’t sustainable at their revenue. CPU inference is slower, but their users don’t notice the difference between 200ms and 800ms response times.

Hugging Face TGI: Ecosystem Integration

What it is: Text Generation Inference from Hugging Face, built for teams already using the HF ecosystem. It’s not the fastest framework, but the integration with Hugging Face’s model hub and tooling makes it valuable for certain workflows.

Performance: In practice, TGI delivers throughput in the 100-140 requests per second range with 60-90ms time to first token. Competitive but not leading-edge.

Best for: Teams already standardized on Hugging Face tooling. Organizations that value comprehensive documentation and established patterns over cutting-edge performance.

SGLang: Structured Generation Specialist

What it is: Framework built around structured generation with a dedicated scripting language for chaining operations. RadixAttention enables efficient cache reuse for sequences with similar prefixes.

Performance: SGLang shows remarkably stable per-token latency (4-21ms) across different load patterns. Not the highest throughput, but notably consistent.

Best for: Multi-step reasoning tasks, agentic applications, integration with vision and retrieval models. Teams building complex LLM workflows beyond simple chat.

Understanding Performance Metrics

When people talk about inference performance, they’re usually talking about three numbers. Understanding what they actually mean helps you choose the right framework.

Performance Benchmark Caveat

Performance metrics vary significantly based on:

  • Model size and quantization level
  • Prompt length and output length
  • Batch size and concurrency patterns
  • GPU memory configuration and hardware specs
  • Framework version and configuration tuning

The figures cited in this guide represent observed ranges from published benchmarks (BentoML, SqueezeBits, Clarifai) and real-world deployment reports from 2024-2025. They are not guarantees and should be validated against your specific workload before making infrastructure decisions.

Time to First Token (TTFT)

This is the delay between when a user sends a request and when they see the first word of the response. For interactive applications—chatbots, coding assistants, anything with humans waiting—this is what determines whether your app feels fast or sluggish.

Think about asking ChatGPT a question. That pause before the first word appears? That’s TTFT. Below 100ms feels instant. Above 500ms starts feeling slow. Above 1 second, users notice and get frustrated.
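
TTFT is also easy to measure yourself against any OpenAI-compatible endpoint by timing the first streamed chunk. A rough sketch; the endpoint and model name are assumptions:

import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {(time.perf_counter() - start) * 1000:.0f} ms")
        break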

In published benchmarks, vLLM excels here, maintaining 50-80ms TTFT even with 100 concurrent users. TensorRT-LLM achieves faster times at low concurrency (35-50ms) but can degrade under certain high-load configurations.

Throughput (Requests per Second)

This measures how many requests your system can handle simultaneously. Higher throughput means you need fewer servers to handle the same traffic, which directly translates to lower costs.

In optimized deployments, TensorRT-LLM can achieve 180-220 req/sec, vLLM typically delivers 120-160 req/sec, and TGI manages 100-140 req/sec. At scale, these differences matter. Going from 120 to 180 req/sec means you can serve 50% more users on the same hardware.

But here’s the catch: throughput measured in isolation can be misleading. What matters is sustained throughput while maintaining acceptable latency. A framework that delivers 200 req/sec but with 2-second TTFT isn’t actually useful for interactive applications.

Tokens Per Second (Decoding Speed)

After that first token appears, this measures how fast the model generates the rest of the response. This is what makes responses feel fluid when they’re streaming.

Most modern frameworks deliver 40-60 tokens per second on decent hardware. The differences here are smaller than TTFT or throughput, and honestly, most users don’t notice the difference between 45 and 55 tokens per second when watching a response stream in.

The Real Cost Analysis

Let’s talk about what it actually costs to run these frameworks. The numbers vary wildly depending on where you are and what you’re building.

Pricing Disclaimer

Cloud provider pricing fluctuates based on region, commitment level, and market conditions. The figures below reflect typical 2024-2025 ranges from AWS, GCP, and Azure. Always check current pricing for your specific region and usage pattern before making budget decisions.

Hardware Costs

Purchasing an A100 GPU:

  • United States: $10,000-$15,000
  • Singapore: S$13,500-S$20,000
  • India: ₹8-12 lakhs

Cloud GPU rental (monthly):

  • AWS/GCP US regions: $2,000-3,000/month per A100
  • AWS Singapore: S$2,700-4,000/month per A100
  • AWS Mumbai: ₹1.5-2.5 lakhs/month per A100

That’s just the GPU. You also need CPU, memory, storage, and bandwidth. A realistic production setup costs 20-30% more than just the GPU rental.

The Setup Cost Nobody Talks About

Engineering time is real money, even if it doesn’t show up on your AWS bill.

Ollama: 1-2 hours of developer time. At ₹5,000/hour for a senior developer in India, that’s ₹10,000. At $150/hour in the US, that’s $300. Basically free.

vLLM: 1-2 days of ML engineer time. In India, maybe ₹80,000. In the US, $2,400. In Singapore, S$1,600. Not trivial, but manageable.

TensorRT-LLM: 1-2 weeks of expert time. In India, ₹4-5 lakhs. In the US, $12,000-15,000. In Singapore, S$8,000-10,000. Now you’re talking about real money.

For a bootstrapped startup, that TensorRT-LLM setup cost might be more than their entire monthly runway. For a funded company with dedicated ML infrastructure engineers, it’s a rounding error worth paying for the performance gains.

Regional Considerations

The framework choice looks different depending on where you’re building. Not because the technology is different, but because the constraints are different.

For Developers in India

Primary challenge: Limited GPU access and import costs that make hardware 3x more expensive than in the US.

The llama.cpp advantage: When cloud GPUs cost ₹2 lakhs per month and that’s your entire team’s salary budget, CPU-only inference stops being a compromise and starts being the only viable path. Advanced quantization techniques in llama.cpp can compress models down to 2-4 bits, making a 7B model run acceptably on a ₹15,000/month CPU server.

Real scenario: You’re building a SaaS product for Indian SMEs. Your customers won’t pay enterprise prices, so your margins are tight. Spending ₹2 lakhs monthly on infrastructure when your MRR is ₹8 lakhs doesn’t work. But ₹15,000 on CPU servers? That’s sustainable. You’re not trying to serve Google-scale traffic anyway—you’re trying to build a profitable business.

For Developers in Singapore and Southeast Asia

Primary challenge: Data sovereignty requirements and regional latency constraints.

The deployment reality: Financial services, healthcare, and government sectors in Singapore often require data to stay in-region. That means you can’t just use the cheapest cloud region—you need Singapore infrastructure. AWS Singapore costs about 10% more than US regions, but that’s the cost of compliance.

Framework choice: vLLM or TGI on AWS Singapore or Google Cloud Singapore. The emphasis is less on absolute cheapest and more on reliable, compliant, production-ready serving. Teams here often have the budget for proper GPU infrastructure but need frameworks with enterprise support and proven reliability.

For Developers in the United States

Primary challenge: Competitive pressure for maximum performance and scale.

The optimization game: US companies often compete on features and scale where milliseconds matter and serving 10,000 concurrent users is table stakes. The cost of cloud infrastructure is high, but the cost of being slow or unable to scale is higher. Losing users to a faster competitor hurts more than spending an extra $10,000 monthly on GPUs.

Framework choice: Funded startups tend toward vLLM for the balance of performance and deployment speed. Enterprises with dedicated ML infrastructure teams often invest in TensorRT-LLM for that last 20% of performance optimization. The two-week setup time is justified because the ongoing savings from better GPU utilization pays for itself.

Quick Decision Matrix

Use this table as a starting point for framework selection based on your primary constraint:

| Your Primary Constraint | Recommended Framework | Why |
| --- | --- | --- |
| No GPU access | llama.cpp | CPU-only inference with aggressive quantization |
| Prototyping / learning | Ollama | Zero-config, runs on laptops |
| 10-100 concurrent users | vLLM | Best balance of performance and setup complexity |
| 100+ users, NVIDIA GPUs | TensorRT-LLM | Maximum throughput when properly configured |
| Hugging Face ecosystem | TGI | Seamless integration with HF tools |
| Agentic/multi-step workflows | SGLang | Structured generation and cache optimization |
| Tight budget, moderate traffic | llama.cpp | Lowest infrastructure cost |
| Data sovereignty requirements | vLLM or TGI | Regional deployment flexibility |

How to Actually Choose

Stop looking for the “best” framework. Start asking which constraints matter most to your situation.

Question 1: What’s Your Budget Reality?

Can’t afford GPUs at all: llama.cpp is your path. It’s not a compromise; it’s how you build something rather than nothing. Many successful products run on CPU inference because their users care about reliability and features, not sub-100ms response times.

Can afford 1-2 GPUs: vLLM or TGI. Both will get you production-ready inference serving reasonable traffic. vLLM probably has the edge on performance; TGI has the edge on ecosystem integration if you’re already using Hugging Face.

Can afford a GPU cluster: Now TensorRT-LLM becomes interesting. When you’re running 5+ GPUs, that 20-30% efficiency gain from TensorRT means you might only need 4 GPUs instead of 5. The setup complexity is still painful, but the ongoing savings justify it.

Question 2: How Much Time Do You Have?

Need something running today: Ollama. Install it, pull a model, start building. You’ll migrate to something else later when you need production scale, but Ollama gets you from zero to functional in an afternoon.

Have a week: vLLM or TGI. Both are production-capable and well-documented enough that a competent engineer can get them running in a few days.

Have dedicated ML infrastructure engineers: TensorRT-LLM becomes viable. The complexity only makes sense when you have people whose job is dealing with complexity.

Question 3: What Scale Are You Actually Targeting?

Personal project or internal tool (1-10 users): Ollama or llama.cpp. The overhead of vLLM’s production serving capabilities doesn’t make sense when you have 3 users.

Small SaaS (10-100 concurrent users): vLLM or TGI. You’re in the sweet spot where their optimizations actually matter but you don’t need absolute maximum performance.

Enterprise scale (100+ concurrent users): vLLM or TensorRT-LLM depending on whether you optimize for deployment speed or runtime efficiency. At this scale, the performance differences translate to actual money.

Question 4: What’s Your Hardware Situation?

NVIDIA GPUs available: All options are on the table. If it’s specifically A100/H100 hardware and you have time, TensorRT-LLM will give you the best performance when properly configured.

AMD GPUs or non-NVIDIA hardware: vLLM has broader hardware support. TensorRT-LLM is NVIDIA-only.

CPU only: llama.cpp is your only real option. But it’s a good option—don’t treat it as second-class.

Real-World Deployment Scenarios

Let’s look at how actual teams made these choices.

Scenario 1: Bootstrapped Startup in Bangalore

Company: Ed-tech platform, 5 person team, 50,000 daily users

Budget constraint: ₹8 lakhs monthly revenue, can’t spend ₹2 lakhs on infrastructure

Technical requirement: AI-powered personalized learning recommendations

Choice: llama.cpp on 32-core CPU servers

Outcome: ₹15,000/month infrastructure cost. Response times are 400-800ms, which their users don’t complain about because the recommendations are actually useful. The business is profitable, which wouldn’t be true with GPU costs.

Scenario 2: Series A SaaS Company in Singapore

Company: Financial services platform, 40 employees, serving banks and fintech companies

Regulatory constraint: Data must stay in Singapore region for compliance

Technical requirement: Process 10M financial documents monthly, 200+ concurrent users during business hours

Choice: TensorRT-LLM on 3x H100 GPUs in AWS Singapore region

Outcome: S$12,000/month infrastructure cost. The two-week setup time was painful, but the performance optimization meant they could handle their load on 3 GPUs instead of the 5 GPUs vLLM would have required. Monthly savings of S$8,000 justified the initial investment.

Scenario 3: AI Startup in San Francisco

Company: Developer tools company, 25 employees, $8M Series A

Market constraint: Competing with well-funded incumbents on performance

Technical requirement: Code completion with sub-100ms latency, 500+ concurrent developers

Choice: vLLM on 8x A100 GPUs

Outcome: $20,000/month infrastructure cost. They prioritized getting to market fast over squeezing out maximum performance. vLLM gave them production-quality serving in one week versus the month TensorRT-LLM would have taken. At their stage, speed to market mattered more than 20% better GPU efficiency.

The Uncomfortable Truth About Framework Choice

Here’s what nobody wants to say: for most developers, the framework choice is constrained by things that have nothing to do with the technology.

A developer in San Francisco and a developer in Bangalore might both download the same LLaMA-2 weights. They both have “open access” to the model. But they don’t have the same access to the infrastructure needed to run it at scale. The San Francisco developer can spin up A100 GPUs without thinking about it. The Bangalore developer does the math and realizes it would consume their entire salary budget.

This is why llama.cpp matters so much. Not because it’s the fastest or the most elegant solution, but because it’s the solution that works when GPUs aren’t an option. It’s the difference between building something and building nothing.

We talk about “democratizing AI” by releasing model weights. But if running those models costs $5,000 per month and your monthly income is $1,000, those weights aren’t democratized—they’re just decorative. The framework you can actually use determines whether you can build at all.

This isn’t a technical problem. It’s a structural one. And it’s why framework comparisons that only look at benchmarks miss the point. The “best” framework isn’t the one with the highest throughput. It’s the one that lets you build what you’re trying to build with the constraints you actually face.

Practical Recommendations

Based on everything we’ve covered, here’s how I’d think about the choice:

Start with Ollama for Prototyping

Unless you have unusual constraints, begin with Ollama. Get your idea working, validate that it’s useful, prove to yourself that LLM inference solves your problem. You’ll learn what performance characteristics actually matter to your users.

Don’t optimize prematurely. Don’t spend two weeks setting up TensorRT-LLM before you know if anyone wants what you’re building.

Graduate to vLLM for Production

When you have actual users and actual scale requirements, vLLM is probably your best bet. It’s the sweet spot between performance and deployment complexity. You can get it running in a few days, it handles production loads well, and the community is active if you run into issues.

vLLM’s superpower isn’t being the absolute fastest—it’s being fast enough while remaining deployable by teams without dedicated ML infrastructure engineers.

Consider TensorRT-LLM When Scale Justifies Complexity

If you’re running 5+ GPUs and burning $15,000+ monthly on infrastructure, the two-week setup time for TensorRT-LLM starts to make sense. A 25% performance improvement means you might only need 4 GPUs instead of 5, saving $3,000 monthly. That pays for the setup time in a few months.

But be honest about whether you’re at that scale. Most projects aren’t.

Don’t Dismiss llama.cpp

If your budget is tight or you need edge deployment, llama.cpp isn’t a fallback option—it’s the primary option. Many successful products run on CPU inference. Your users care about whether the product works, not whether it uses GPUs.

A working product on CPU infrastructure beats a hypothetical perfect product that you can’t afford to build.

Frequently Asked Questions

Which LLM inference framework should I choose?

It depends on your constraints. Choose vLLM for production scale (100+ concurrent users) with balanced setup complexity. Choose TensorRT-LLM if you’re on NVIDIA hardware and can invest 1-2 weeks for maximum performance. Choose Ollama for rapid prototyping and getting started quickly. Choose llama.cpp if you don’t have GPU access or need edge deployment.

Can I run LLM inference without a GPU?

Yes. llama.cpp enables CPU-only LLM inference with advanced quantization techniques that reduce memory requirements by up to 75%. While slower than GPU inference, it’s fast enough for many real-world applications, especially those serving moderate traffic rather than thousands of concurrent users. Many successful products run entirely on CPU infrastructure.

How much does LLM inference actually cost?

Cloud GPU rental varies by region: $2,000-3,000/month per A100 in the US, S$2,700-4,000/month in Singapore, ₹1.5-2.5 lakhs/month in India. CPU-only deployment with llama.cpp can cost as little as ₹10-15K/month ($120-180) for moderate workloads. The total cost includes setup time: Ollama takes hours, vLLM takes 1-2 days, TensorRT-LLM takes 1-2 weeks of expert engineering time.

Is vLLM better than TensorRT-LLM?

They optimize for different things. vLLM prioritizes ease of deployment and consistent low latency across varying loads. TensorRT-LLM prioritizes maximum throughput on NVIDIA hardware but requires significantly more setup effort. vLLM is better for teams that need production-ready serving quickly. TensorRT-LLM is better for teams running at massive scale where spending weeks on optimization saves thousands monthly in infrastructure costs.

What’s the difference between Ollama and llama.cpp?

Ollama is built on top of llama.cpp but adds a user-friendly layer with automatic model management, one-command installation, and simplified configuration. llama.cpp is the underlying inference engine that gives you more control but requires more manual setup. Think of Ollama as the Docker of LLM inference—optimized for developer experience. Use Ollama for quick prototyping, use llama.cpp directly when you need fine-grained control or CPU-optimized production deployment.

Which framework is fastest for LLM inference?

TensorRT-LLM can deliver the highest throughput (180-220 req/sec range) and lowest time-to-first-token (35-50ms) on supported NVIDIA hardware when properly configured. However, vLLM maintains better performance consistency under high concurrent load, keeping 50-80ms TTFT even with 100+ users. “Fastest” depends on your workload pattern—peak performance versus sustained performance under load—and proper configuration.

Do I need different frameworks for different regions?

No, the framework choice is the same globally, but regional constraints affect which framework is practical. Data sovereignty requirements in Singapore might push you toward regional cloud deployment. Hardware costs in India might make CPU-only inference with llama.cpp the only viable option. US companies often have easier access to GPU infrastructure but face competitive pressure for maximum performance. The technology is the same; the constraints differ.

How do I choose between cloud and on-premise deployment?

Cloud deployment (AWS, GCP, Azure) offers flexibility and faster scaling but with ongoing costs of $2,000-3,000 per GPU monthly. On-premise makes sense when you have sustained high load that justifies the $10,000-15,000 upfront GPU cost, or when regulatory requirements mandate keeping data in specific locations. Break-even is typically around 4-6 months of sustained usage. For startups and variable workloads, cloud is usually better. For established companies with predictable load, on-premise can be cheaper long-term.

What about quantization—do I need it?

Quantization (reducing model precision from 16-bit to 8-bit, 4-bit, or even 2-bit) is essential for running larger models on limited hardware. It can reduce memory requirements by 50-75% with minimal quality degradation. All modern frameworks support quantization, but llama.cpp has the most aggressive quantization options, making it possible to run 7B models on consumer CPUs. For GPU deployment, 4-bit or 8-bit quantization is standard practice for balancing performance and resource usage.

The Bottom Line

The framework landscape in 2025 is mature enough that you have real choices. vLLM for production serving, TensorRT-LLM for maximum performance, Ollama for prototyping, llama.cpp for resource-constrained deployment—each is legitimately good at what it does.

But the choice isn’t just technical. It’s about which constraints you’re operating under. A developer in Bangalore trying to build something profitable on a tight budget faces different constraints than a funded startup in San Francisco optimizing for scale. The “open” models are the same, but the paths to actually deploying them look completely different.

Here’s what I wish someone had told me when I started: don’t optimize for the perfect framework. Optimize for shipping something that works. Start with Ollama, prove your idea has value, then migrate to whatever framework makes sense for your scale and constraints. The best framework is the one that doesn’t stop you from building.

And if you’re choosing between a framework that requires GPUs you can’t afford versus llama.cpp on hardware you already have—choose llama.cpp. A working product beats a hypothetical perfect one every time.

The weights might be open, but the infrastructure isn’t equal. Choose the framework that works with your reality, not the one that works in someone else’s benchmarks.

References & Further Reading

Benchmark Studies & Performance Analysis

  1. BentoML Team. “Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI.” BentoML Blog. Retrieved from: https://www.bentoml.com/blog/benchmarking-llm-inference-backends
  2. SqueezeBits Team. “[vLLM vs TensorRT-LLM] #1. An Overall Evaluation.” SqueezeBits Blog, October 2024. Retrieved from: https://blog.squeezebits.com/vllm-vs-tensorrtllm-1-an-overall-evaluation-30703
  3. Clarifai. “Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B.” Clarifai Blog, September 2025. Retrieved from: https://www.clarifai.com/blog/comparing-sglang-vllm-and-tensorrt-llm-with-gpt-oss-120b
  4. Clarifai. “LLM Inference Optimization Techniques.” Clarifai Guide, October 2025. Retrieved from: https://www.clarifai.com/blog/llm-inference-optimization/
  5. ITECS Online. “vLLM vs Ollama vs llama.cpp vs TGI vs TensorRT-LLM: 2025 Guide.” October 2025. Retrieved from: https://itecsonline.com/post/vllm-vs-ollama-vs-llama.cpp-vs-tgi-vs-tensort

Framework Documentation & Official Sources

  1. vLLM Project. “vLLM: High-throughput and memory-efficient inference and serving engine for LLMs.” GitHub Repository. Retrieved from: https://github.com/vllm-project/vllm
  2. NVIDIA. “TensorRT-LLM Documentation.” NVIDIA Developer Documentation. Retrieved from: https://github.com/NVIDIA/TensorRT-LLM
  3. Ollama Project. “Get up and running with large language models, locally.” Official Documentation. Retrieved from: https://ollama.ai/
  4. llama.cpp Project. “LLM inference in C/C++.” GitHub Repository. Retrieved from: https://github.com/ggml-org/llama.cpp
  5. Hugging Face. “Text Generation Inference Documentation.” Hugging Face Docs. Retrieved from: https://huggingface.co/docs/text-generation-inference/
  6. SGLang Project. “SGLang: Efficient Execution of Structured Language Model Programs.” GitHub Repository. Retrieved from: https://github.com/sgl-project/sglang

Technical Analysis & Comparisons

  1. Northflank. “vLLM vs TensorRT-LLM: Key differences, performance, and how to run them.” Northflank Blog. Retrieved from: https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them
  2. Inferless. “vLLM vs. TensorRT-LLM: In-Depth Comparison for Optimizing Large Language Model Inference.” Inferless Learn. Retrieved from: https://www.inferless.com/learn/vllm-vs-tensorrt-llm-which-inference-library-is-best-for-your-llm-needs
  3. Neural Bits (Substack). “The AI Engineer’s Guide to Inference Engines and Frameworks.” August 2025. Retrieved from: https://multimodalai.substack.com/p/the-ai-engineers-guide-to-inference
  4. The New Stack. “Six Frameworks for Efficient LLM Inferencing.” September 2025. Retrieved from: https://thenewstack.io/six-frameworks-for-efficient-llm-inferencing/
  5. Zilliz Blog. “10 Open-Source LLM Frameworks Developers Can’t Ignore in 2025.” January 2025. Retrieved from: https://zilliz.com/blog/10-open-source-llm-frameworks-developers-cannot-ignore-in-2025

Regional Deployment & Cost Analysis

  1. House of FOSS. “Ollama vs llama.cpp vs vLLM: Local LLM Deployment in 2025.” July 2025. Retrieved from: https://www.houseoffoss.com/post/ollama-vs-llama-cpp-vs-vllm-local-llm-deployment-in-2025
  2. Picovoice. “llama.cpp vs. ollama: Running LLMs Locally for Enterprises.” July 2024. Retrieved from: https://picovoice.ai/blog/local-llms-llamacpp-ollama/
  3. AWS Pricing. “Amazon EC2 P4d Instances (A100 GPU).” Retrieved Q4 2024 from: https://aws.amazon.com/ec2/instance-types/p4/
  4. Google Cloud Pricing. “A2 VMs and GPUs pricing.” Retrieved Q4 2024 from: https://cloud.google.com/compute/gpus-pricing

Research Papers & Academic Sources

  1. Kwon, Woosuk et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. arXiv:2309.06180
  2. Yu, Gyeong-In et al. “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022.
  3. NVIDIA Research. “TensorRT: High Performance Deep Learning Inference.” NVIDIA Technical Blog.

Community Resources & Tools

  1. Awesome LLM Inference (GitHub). “A curated list of Awesome LLM Inference Papers with Codes.” Retrieved from: https://github.com/xlite-dev/Awesome-LLM-Inference
  2. Hugging Face. “Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference.” Hugging Face Blog. Retrieved from: https://huggingface.co/blog/tgi-multi-backend
  3. Sebastian Raschka. “Noteworthy LLM Research Papers of 2024.” January 2025. Retrieved from: https://sebastianraschka.com/blog/2025/llm-research-2024.html

Additional Technical Resources

  1. Label Your Data. “LLM Inference: Techniques for Optimized Deployment in 2025.” December 2024. Retrieved from: https://labelyourdata.com/articles/llm-inference
  2. Medium (Zain ul Abideen). “Best LLM Inference Engine? TensorRT vs vLLM vs LMDeploy vs MLC-LLM.” July 2024. Retrieved from: https://medium.com/@zaiinn440/best-llm-inference-engine-tensorrt-vs-vllm-vs-lmdeploy-vs-mlc-llm-e8ff033d7615
  3. Rafay Documentation. “Choosing Your Engine for LLM Inference: The Ultimate vLLM vs. TensorRT LLM Guide.” April 2025. Retrieved from: https://docs.rafay.co/blog/2025/04/28/choosing-your-engine-for-llm-inference-the-ultimate-vllm-vs-tensorrt-llm-guide/
  4. Hivenet Compute. “vLLM vs TGI vs TensorRT‑LLM vs Ollama.” Retrieved from: https://compute.hivenet.com/post/vllm-vs-tgi-vs-tensorrt-llm-vs-ollama

Survey Papers & Comprehensive Guides

  1. Heisler, Morgan Lindsay et al. “LLM Inference Scheduling: A Survey of Techniques, Frameworks, and Trade-offs.” Huawei Technologies, 2025. Retrieved from: https://www.techrxiv.org/

Note on Sources

All benchmark figures, performance metrics, and pricing data cited in this guide were retrieved during Q4 2024 and early 2025. Framework capabilities, cloud pricing, and performance characteristics evolve rapidly in the LLM infrastructure space.

For the most current information:

  • Check official framework documentation for latest features
  • Verify cloud provider pricing in your specific region
  • Run your own benchmarks with your specific workload
  • Consult community forums (Reddit r/LocalLLaMA, Hacker News) for recent real-world experiences

Benchmark Reproducibility Note: Performance varies significantly based on:

  • Exact framework versions used
  • Model architecture and size
  • Quantization settings
  • Hardware configuration
  • Batch size and concurrency patterns
  • Prompt and completion lengths

The figures in this guide represent typical ranges observed across multiple independent benchmark studies. Your mileage will vary.

Acknowledgments

This guide benefited from:

  • Public benchmark studies from BentoML, SqueezeBits, and Clarifai teams
  • Open discussions in the vLLM, llama.cpp, and broader LLM communities
  • Real-world deployment experiences shared by developers in India, Singapore, and US tech communities
  • Technical documentation from framework maintainers and NVIDIA research

Special thanks to the open-source maintainers of vLLM, llama.cpp, Ollama, SGLang, and related projects who make this ecosystem possible.

Agent Building Blocks: Build Production-Ready AI Agents with LangChain | Complete Developer Guide

From understanding concepts to building systems – this comprehensive guide takes you through every component needed to build reliable, production-ready AI agents.

Introduction: From Theory to Implementation

Building an AI agent isn’t about chaining a few LLM calls together and hoping for the best. It’s about understanding the fundamental mechanics that make agents actually work in production environments.

If you’ve experimented with agent frameworks like LangChain or AutoGPT, you’ve probably noticed something: they make agents look simple on the surface, but when things break (and they will), you’re left debugging a black box. The agent gets stuck in loops, picks wrong tools, forgets context, or hallucinates operations that don’t exist.

The problem? Most developers treat agents as magical systems without understanding what’s happening under the hood.

This guide changes that. We’re deconstructing agents into their core building blocks – the execution loop, tool interfaces, memory architecture, and state transitions. By the end, you’ll not only understand how agents work, but you’ll be able to build robust, debuggable systems that handle real-world tasks.

What makes this different from other agent tutorials?

Instead of showing you how to call agent.run() and praying it works, we’re breaking down each component with production-grade implementations. You’ll see working code for every pattern, understand why each piece matters, and learn where systems typically fail.

Who is this guide for?

AI engineers and software developers who want to move beyond toy examples. If you’ve built demos that work 70% of the time but can’t figure out why they fail the other 30%, this is for you. If you need agents that handle errors gracefully, maintain context across conversations, and execute tools reliably, keep reading.

The Fundamental Truth About Agents

Here’s what most tutorials won’t tell you upfront: An agent is not a monolith – it’s a loop with state, tools, and memory.

Every agent system, regardless of complexity, follows the same pattern:

Figure 1: The canonical agent execution loop showing Observe → Think → Decide → Act → Update State → Termination Check cycle

This six-step pattern (five actions plus termination check) appears everywhere:

  • Simple chatbots implement it minimally
  • Complex multi-agent systems run multiple instances simultaneously
  • Production systems add error handling and recovery to each step

The sophistication varies, but the structure stays constant.

Why this matters for production systems:

When you call agent.run() in LangChain or set up a workflow in LangGraph, this loop executes behind the scenes. When something breaks – the agent loops infinitely, selects wrong tools, or loses context – you need to know which step failed:

  • Observe: Did it lack necessary context?
  • Think: Was the prompt unclear or misleading?
  • Decide: Were tool descriptions ambiguous?
  • Act: Did the tool crash or return unexpected data?
  • Update State: Did memory overflow or lose information?

Without understanding the loop, you’re debugging blind.

Anatomy of the Agent Execution Loop

Let’s examine the agent loop with precision. This isn’t pseudocode – this is the actual pattern every agent implements:

def agent_loop(task: str, max_iterations: int = 10) -> str:
    """
    The canonical agent execution loop.
    This foundation appears in every agent system.
    """
    # Initialize state
    state = {
        "task": task,
        "conversation_history": [],
        "iteration": 0,
        "completed": False
    }
    
    while not state["completed"] and state["iteration"] < max_iterations:
        # STEP 1: OBSERVE
        # Gather current context: task, history, available tools, memory
        observation = observe(state)
        
        # STEP 2: THINK
        # LLM reasons about what to do next
        reasoning = llm_think(observation)
        
        # STEP 3: DECIDE
        # Choose an action based on reasoning
        action = decide_action(reasoning)
        
        # STEP 4: ACT
        # Execute the chosen action (tool call or final answer)
        result = execute_action(action)
        
        # STEP 5: UPDATE STATE
        # Store the outcome and update memory
        state = update_state(state, action, result)
        
        # TERMINATION CHECK
        if is_task_complete(state):
            state["completed"] = True
        
        state["iteration"] += 1
    
    return extract_final_answer(state)

The state dictionary is the agent’s working memory. It tracks everything: the original task, conversation history, current iteration, and completion status. This state persists across iterations, accumulating context as the agent progresses.

The while condition has two critical parts:

  1. not state["completed"] – checks if the task is finished
  2. state["iteration"] < max_iterations – safety valve preventing infinite loops

Without the second condition, a logic error or unclear task makes your agent run forever, burning through API credits and compute resources.

The five steps must execute in order:

  • You can’t decide without observing
  • You can’t act without deciding
  • You can’t update state without seeing results

This sequence is fundamental, not arbitrary.

Step 1: Observe – Information Gathering

Purpose: Assemble all relevant information for decision-making

def observe(state: dict) -> dict:
    """
    Observation packages everything the LLM needs:
    - Original task/goal
    - Conversation history
    - Available tools
    - Current memory/context
    - Previous action outcomes
    """
    return {
        "task": state["task"],
        "history": state["conversation_history"][-5:],  # Last 5 turns
        "available_tools": get_available_tools(),
        "iteration": state["iteration"],
        "previous_result": state.get("last_result")
    }

Why observation matters:

The LLM can’t see your entire system state – you must explicitly package relevant information. Think of it as preparing a briefing document before a meeting. Miss critical context, and decisions suffer.

Key considerations:

  • Context window management: You can’t include unlimited history. The code above keeps the last 5 conversation turns. This prevents token overflow while maintaining recent context. For longer conversations, implement summarization or semantic filtering.
  • Tool visibility: The agent needs to know what actions it can take. This seems obvious until you’re debugging why an agent doesn’t use a tool you just added. Make tool descriptions visible in every observation.
  • Iteration tracking: Including the current iteration helps the LLM understand how long it’s been working. After iteration 8 of 10, it might change strategy or provide intermediate results.
  • Previous results: The outcome of the last action directly influences the next decision. Did the API call succeed? What data came back? This feedback is essential.

Common failures:

  • Token limit exceeded because you included entire conversation history
  • Missing tool descriptions causing the agent to ignore available functions
  • No previous result context making the agent repeat failed actions
  • Task description missing causing goal drift over multiple iterations

Step 2: Think – LLM Reasoning

Purpose: Process observations and reason about next steps

def llm_think(observation: dict) -> str:
    """
    The LLM receives context and generates reasoning.
    This is where the intelligence happens.
    """
    prompt = f"""
    Task: {observation['task']}
    
    Previous conversation:
    {format_history(observation['history'])}
    
    Available tools:
    {format_tools(observation['available_tools'])}
    
    Previous result: {observation.get('previous_result', 'None')}
    
    Based on this context, what should you do next?
    Think step-by-step about:
    1. What information do you have?
    2. What information do you need?
    3. Which tool (if any) should you use?
    4. Can you provide a final answer?
    """
    
    return llm.generate(prompt)

This is where reasoning happens. The LLM analyzes the current situation and determines the next action. Quality of thinking depends entirely on prompt design.

Prompt engineering for agents:

  • Structure matters: Notice the prompt breaks down reasoning into steps. “What should you do next?” is too vague. “Think step-by-step about information you have, information you need, tools to use, and whether you can answer” guides better reasoning.
  • Context ordering: Put the most important information first. Task description comes before history. Tool descriptions come before previous results. LLMs perform better with well-structured input.
  • Tool descriptions in reasoning: The agent needs clear descriptions of each tool’s purpose, inputs, and outputs. Ambiguous descriptions lead to wrong tool selection.
  • ReAct pattern: Many production systems use “Reason + Act” prompting. The LLM explicitly writes its reasoning (“I need weather data, so I’ll use the weather tool”) before selecting actions. This improves decision quality and debuggability.
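
To make the ReAct pattern concrete, here is a minimal sketch of how the llm_think prompt above could be restructured into explicit Thought/Action sections. The section labels are a common convention, not a requirement of any particular library:

def build_react_prompt(observation: dict) -> str:
    """Sketch of a ReAct-style prompt: the model must write its reasoning
    (Thought) before committing to an Action, which makes decisions auditable."""
    return f"""
    Task: {observation['task']}

    Available tools:
    {format_tools(observation['available_tools'])}

    Previous result: {observation.get('previous_result', 'None')}

    Respond using exactly this format:
    Thought: your reasoning about what to do next
    Action: the tool name to call, or "Final Answer"
    Action Input: the arguments for the tool, or the final answer text
    """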

Common reasoning failures:

  • Generic prompts that don’t guide step-by-step thinking
  • Missing tool descriptions causing the agent to hallucinate functions
  • Unclear task specifications leading to goal confusion
  • No explicit reasoning step making decisions opaque

Step 3: Decide – Action Selection

Purpose: Convert reasoning into a specific, executable action

def decide_action(reasoning: str) -> dict:
    """
    Parse the LLM's reasoning and extract a structured action.
    This bridges thinking and execution.
    """
    # Parse LLM output for tool calls or final answers
    if "Tool:" in reasoning:
        tool_name = extract_tool_name(reasoning)
        tool_args = extract_tool_arguments(reasoning)
        return {
            "type": "tool_call",
            "tool": tool_name,
            "arguments": tool_args
        }
    elif "Final Answer:" in reasoning:
        answer = extract_final_answer(reasoning)
        return {
            "type": "final_answer",
            "content": answer
        }
    else:
        # Unclear reasoning - request clarification
        return {
            "type": "continue",
            "message": "Need more information"
        }

Decision making converts reasoning to structure. The LLM output is text. Execution requires structured data. This step parses reasoning into actionable commands.

Structured output formats:

Modern LLMs support structured outputs via function calling or JSON mode. Instead of parsing text, you can get typed responses:

# Using OpenAI function calling
action = llm.generate(
    messages=messages,
    tools=[
        {
            "type": "function",
            "function": {
                "name": "calculator",
                "description": "Perform mathematical calculations",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {"type": "string"}
                    },
                    "required": ["expression"]
                }
            }
        }
    ],
    tool_choice="auto"
)

This approach eliminates parsing errors and guarantees valid tool calls.
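
If you call the OpenAI Python SDK directly, you still need to map the structured response onto the internal action dictionary the loop expects. A minimal sketch, assuming the standard Chat Completions response shape (choices[0].message.tool_calls); adapt it to whatever client wrapper you actually use:

import json

def to_action(response) -> dict:
    """Convert an OpenAI-style tool-calling response into the action dict
    consumed by execute_action."""
    message = response.choices[0].message
    if message.tool_calls:
        call = message.tool_calls[0]
        return {
            "type": "tool_call",
            "tool": call.function.name,
            # arguments arrive as a JSON string and must be parsed
            "arguments": json.loads(call.function.arguments),
        }
    return {"type": "final_answer", "content": message.content}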

Decision validation:

Before executing, validate the decision:

  • Does the requested tool exist?
  • Are all required arguments provided?
  • Do argument types match the schema?
  • Are argument values reasonable?

Failure handling:

What happens when the LLM generates invalid output? You need fallback logic:

def decide_action_safe(reasoning: str) -> dict:
    try:
        action = decide_action(reasoning)
        validate_action(action)
        return action
    except ParseError:
        return {
            "type": "error",
            "message": "Could not parse action from reasoning"
        }
    except ValidationError as e:
        return {
            "type": "error",
            "message": f"Invalid action: {str(e)}"
        }
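
decide_action_safe relies on a validate_action helper that isn’t defined in this guide. A minimal sketch of what it might look like, assuming the get_tool registry lookup used in the execution step below and the Tool wrapper (with its schema attribute) introduced later:

def validate_action(action: dict) -> None:
    """Check a parsed action against the tool registry before execution.
    Raises ValidationError on any problem."""
    if action["type"] != "tool_call":
        return  # final answers and clarification requests need no tool checks

    tool = get_tool(action["tool"])  # assumed to return None for unknown tools
    if tool is None:
        raise ValidationError(f"Unknown tool: {action['tool']}")

    schema = tool.schema["parameters"]
    provided = action.get("arguments", {})

    # Required arguments present?
    for param in schema.get("required", []):
        if param not in provided:
            raise ValidationError(f"Missing required argument: {param}")

    # Types roughly match the schema?
    type_map = {"string": str, "integer": int, "number": (int, float), "object": dict}
    for name, value in provided.items():
        expected = schema["properties"].get(name, {}).get("type")
        if expected in type_map and not isinstance(value, type_map[expected]):
            raise ValidationError(f"Argument '{name}' should be of type {expected}")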

Common decision failures:

  • LLM hallucinates non-existent tools
  • Missing required arguments in tool calls
  • Type mismatches between provided and expected arguments
  • No validation before execution causing downstream crashes

Step 4: Act – Execution

Purpose: Execute the decided action and return results

def execute_action(action: dict) -> dict:
    """
    Execute tool calls or generate final answers.
    This is where the agent interacts with the world.
    """
    if action["type"] == "tool_call":
        tool = get_tool(action["tool"])
        try:
            result = tool.execute(**action["arguments"])
            return {
                "success": True,
                "result": result,
                "tool": action["tool"]
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "tool": action["tool"]
            }
    
    elif action["type"] == "final_answer":
        return {
            "success": True,
            "result": action["content"],
            "final": True
        }
    
    elif action["type"] == "error":
        return {
            "success": False,
            "error": action["message"]
        }

This is where theory meets reality. Tools interact with external systems: APIs, databases, file systems, calculators. External systems fail, timeout, return unexpected data, or change their interfaces.

Production execution considerations:

Error handling is mandatory: Every external call can fail. Network issues, API rate limits, authentication failures, malformed responses – expect everything.

def safe_tool_execution(tool, arguments, timeout=30):
    """Production-grade tool execution with comprehensive error handling"""
    try:
        # Set timeout to prevent hanging
        with time_limit(timeout):
            result = tool.execute(**arguments)
        
        # Validate result format
        validate_result_schema(result)
        
        return {"success": True, "result": result}
    
    except TimeoutError:
        return {"success": False, "error": "Tool execution timeout"}
    except ValidationError as e:
        return {"success": False, "error": f"Invalid result format: {e}"}
    except APIError as e:
        return {"success": False, "error": f"API error: {e}"}
    except Exception as e:
        # Log unexpected errors for debugging
        logger.exception(f"Unexpected error in {tool.name}")
        return {"success": False, "error": "Tool execution failed"}

Retry logic: Transient failures (network issues, temporary API problems) should trigger retries with exponential backoff:

def execute_with_retry(tool, arguments, max_retries=3):
    for attempt in range(max_retries):
        result = tool.execute(**arguments)
        if result["success"]:
            return result
        
        if not is_retryable_error(result["error"]):
            return result
        
        # Exponential backoff: 1s, 2s, 4s
        time.sleep(2 ** attempt)
    
    return result

Result formatting: Tools should return consistent result structures. Standardize on success/error patterns:

# Good: Consistent structure
{
    "success": True,
    "result": "42",
    "metadata": {"tool": "calculator", "execution_time": 0.01}
}

# Bad: Inconsistent structure
"42"  # Just a string - no error information

Common execution failures:

  • Missing timeout handling causing agents to hang
  • No retry logic for transient failures
  • Poor error messages making debugging impossible
  • Inconsistent result formats breaking downstream processing

Step 5: Update State – Memory Management

Purpose: Incorporate action results into agent state

def update_state(state: dict, action: dict, result: dict) -> dict:
    """
    Update state with action outcomes.
    This is how the agent learns and remembers.
    """
    # Add to conversation history
    state["conversation_history"].append({
        "iteration": state["iteration"],
        "action": action,
        "result": result,
        "timestamp": datetime.now()
    })
    
    # Update last result for next observation
    state["last_result"] = result
    
    # Check for completion
    if result.get("final"):
        state["completed"] = True
        state["final_answer"] = result["result"]
    
    # Trim history if too long
    if len(state["conversation_history"]) > 20:
        state["conversation_history"] = state["conversation_history"][-20:]
    
    return state

State management is how agents remember. Without proper updates, agents repeat actions, forget results, and lose context.

What to store:

  • Conversation history: Every action and result. This creates the narrative of what happened. Essential for debugging and providing context in future observations.
  • Last result: The most recent outcome directly influences the next decision. Store it separately for easy access.
  • Metadata: Timestamps, iteration numbers, execution times. Useful for debugging and performance analysis.

State trimming strategies:

States grow indefinitely if not managed. Implement strategies:

  • Fixed window: Keep last N interactions (shown above)
  • Summarization: Use an LLM to summarize old history into concise context
  • Semantic filtering: Keep only relevant interactions based on similarity to current task
  • Hierarchical storage: Recent items in full detail, older items summarized, ancient items removed
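
Here is a minimal sketch of the summarization strategy, assuming the same llm.generate interface used in llm_think; the shape of the summary entry is illustrative:

def trim_history_with_summary(state: dict, keep_last: int = 10) -> dict:
    """Compress older history entries into one summary item and keep
    the most recent entries in full detail."""
    history = state["conversation_history"]
    if len(history) <= keep_last:
        return state

    old, recent = history[:-keep_last], history[-keep_last:]
    summary = llm.generate(
        "Summarize these agent actions and results in a few sentences, "
        "preserving any facts needed for future steps:\n"
        + "\n".join(str(entry) for entry in old)
    )

    state["conversation_history"] = [{"summary": summary}] + recent
    return state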

Memory types explained:

Figure 2: Three types of agent memory – Short-term (conversation), Long-term (persistent), and Episodic (learning from past interactions)

Short-term memory:

  • Current conversation context
  • Lasts for a single session
  • Stored in the state dictionary
  • Used for maintaining coherence within a task

Long-term memory:

  • Persistent information across sessions
  • User preferences, learned facts, configuration
  • Stored in databases or vector stores
  • Requires explicit saving and loading

Episodic memory:

  • Past successful/failed strategies
  • Patterns of what works in specific situations
  • Used for learning and improvement
  • Stored as embeddings of past interactions

Common state management failures:

  • Unbounded state growth causing memory issues
  • Not trimming history leading to token limit errors
  • Missing metadata making debugging impossible
  • No persistent storage losing context between sessions

Termination Check – Knowing When to Stop

Purpose: Determine if the agent should continue or finish

def is_task_complete(state: dict) -> bool:
    """
    Multiple termination conditions for safety and correctness.
    Never rely on a single condition.
    """
    # Success: Explicit completion
    if state.get("completed"):
        return True
    
    # Safety: Maximum iterations
    if state["iteration"] >= MAX_ITERATIONS:
        logger.warning("Max iterations reached")
        return True
    
    # Safety: Cost limits
    if calculate_cost(state) >= MAX_COST:
        logger.warning("Cost budget exceeded")
        return True
    
    # Safety: Time limits
    if time_elapsed(state) >= MAX_TIME:
        logger.warning("Time limit exceeded")
        return True
    
    # Detection: Loop/stuck state
    if detect_loop(state):
        logger.warning("Loop detected")
        return True
    
    return False

Termination is critical and complex. A single condition isn’t enough. You need multiple safety valves.

Termination conditions explained:

  • Task completion (success): The agent explicitly generated a final answer and marked itself complete. This is the happy path.
  • Max iterations (safety): After N iterations, stop regardless. Prevents infinite loops from logic errors or unclear tasks. Set this based on task complexity – simple tasks might need 5 iterations, complex ones might need 20.
  • Cost limits (budget): Each LLM call costs money. Set a budget (in dollars or tokens) and stop when exceeded. Protects against runaway costs.
  • Time limits (performance): User-facing agents need responsiveness. If execution exceeds time budget, return partial results rather than making users wait indefinitely.
  • Loop detection (stuck states): If the agent repeats the same action multiple times or cycles through the same states, it’s stuck. Detect this and terminate.
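
The is_task_complete check above calls calculate_cost and time_elapsed, which aren’t shown. A minimal sketch, assuming the state records a start_time (as the ProductionAgent later in this guide does) and that each stored result also records the provider’s token usage under a usage key, which the earlier snippets don’t show; the per-token price is a placeholder:

from datetime import datetime

def calculate_cost(state: dict, usd_per_1k_tokens: float = 0.002) -> float:
    """Rough spend estimate from token counts recorded in history entries.
    Assumes each result dict carries a provider-reported usage field."""
    total_tokens = sum(
        entry.get("result", {}).get("usage", {}).get("total_tokens", 0)
        for entry in state["conversation_history"]
    )
    return total_tokens / 1000 * usd_per_1k_tokens

def time_elapsed(state: dict) -> float:
    """Seconds since the loop started (start_time set when state is initialized)."""
    return (datetime.now() - state["start_time"]).total_seconds()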

Loop detection implementation:

def detect_loop(state: dict, window=3) -> bool:
    """
    Detect if agent is repeating actions.
    Compares last N actions for similarity.
    """
    if len(state["conversation_history"]) < window:
        return False
    
    recent_actions = [
        h["action"] for h in state["conversation_history"][-window:]
    ]
    
    # Check if all recent actions are identical
    if all(a == recent_actions[0] for a in recent_actions):
        return True
    
    # Check if cycling through same set of actions
    if len(set(str(a) for a in recent_actions)) < window / 2:
        return True
    
    return False

Graceful degradation:

When terminating due to safety conditions, provide useful output:

def extract_final_answer(state: dict) -> str:
    """
    Extract final answer, handling different termination reasons.
    """
    if state.get("final_answer"):
        return state["final_answer"]
    
    # Terminated due to safety condition
    if state["iteration"] >= MAX_ITERATIONS:
        return "Could not complete task within iteration limit. " + \
               summarize_progress(state)
    
    if detect_loop(state):
        return "Task appears stuck. Last attempted: " + \
               describe_last_action(state)
    
    # Fallback
    return "Task incomplete. Progress: " + summarize_progress(state)

Common termination failures:

  • Single termination condition causing infinite loops
  • No cost limits burning through API budgets
  • Missing timeout making user-facing agents unresponsive
  • Poor loop detection allowing stuck states to continue

Tool Calling: The Action Interface

Tools are how agents interact with the world. Without properly designed tools, agents are just chatbots. With them, agents can query databases, call APIs, perform calculations, and manipulate systems.

The three-part tool structure:

Every production tool needs three components:

1. Function Implementation:

def search_web(query: str, num_results: int = 5) -> str:
    """
    Search the web and return results.
    
    Args:
        query: Search query string
        num_results: Number of results to return (default: 5)
    
    Returns:
        Formatted search results
    """
    try:
        # Implementation
        results = web_search_api.search(query, num_results)
        return format_results(results)
    except Exception as e:
        return f"Search failed: {str(e)}"

2. Schema Definition:

search_tool_schema = {
    "name": "search_web",
    "description": "Search the web for current information. Use this when you need recent data, news, or information not in your training data.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query"
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results (1-10)",
                "default": 5
            }
        },
        "required": ["query"]
    }
}

3. Wrapper Class:

class Tool:
    """Base tool interface"""
    def __init__(self, name: str, description: str, function: callable, schema: dict):
        self.name = name
        self.description = description
        self.function = function
        self.schema = schema
    
    def execute(self, **kwargs) -> dict:
        """Execute tool with validation and error handling"""
        # Validate inputs against schema
        self._validate_inputs(kwargs)
        
        # Execute with error handling
        try:
            result = self.function(**kwargs)
            return {"success": True, "result": result}
        except Exception as e:
            return {"success": False, "error": str(e)}
    
    def _validate_inputs(self, kwargs: dict):
        """Validate inputs match schema"""
        required = self.schema["parameters"].get("required", [])
        for param in required:
            if param not in kwargs:
                raise ValueError(f"Missing required parameter: {param}")

Why all three components matter:

  • Function implementation does the actual work. This is where you integrate with external systems.
  • Schema definition tells the LLM how to use the tool. Clear descriptions and parameter documentation are essential. The LLM decides which tool to use based entirely on this information.
  • Wrapper class provides standardization. All tools follow the same interface, simplifying agent logic and error handling.
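
Wiring the three pieces together looks like this; the search query is just an illustration:

# Wrap the function and schema in the standard Tool interface,
# then call it the same way the agent's execution step would.
search_tool = Tool(
    name="search_web",
    description=search_tool_schema["description"],
    function=search_web,
    schema=search_tool_schema,
)

result = search_tool.execute(query="latest vLLM release", num_results=3)
# -> {"success": True, "result": "..."} or {"success": False, "error": "..."}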

Tool description best practices:

# Bad description
"search_web: Searches the web"

# Good description
"search_web: Search the internet for current information, news, and recent events. Use this when you need information published after your knowledge cutoff or want to verify current facts. Returns the top search results with titles and snippets."

Good descriptions answer:

  • What does it do?
  • When should you use it?
  • What does it return?

Figure 3: Tool calling flow – LLM generates tool call → Schema validation → Function execution → Result formatting → State update

Real-world tool examples:

Calculator tool:

def calculator(expression: str) -> str:
    """
    Evaluate mathematical expressions safely.
    Supports: +, -, *, /, **, (), and common functions.
    """
    try:
        # Evaluate through a restricted expression evaluator instead of
        # eval/exec, which would allow arbitrary code execution
        # (one possible implementation is sketched after the schema below)
        result = eval_expression_safe(expression)
        return f"Result: {result}"
    except Exception as e:
        return f"Error: {str(e)}"

calculator_schema = {
    "name": "calculator",
    "description": "Perform mathematical calculations. Supports arithmetic, exponents, and parentheses. Use for any computation.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "Mathematical expression (e.g., '2 + 2', '(10 * 5) / 2')"
            }
        },
        "required": ["expression"]
    }
}
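
The calculator delegates to an eval_expression_safe helper that isn’t shown above. One possible implementation, using Python’s ast module to allow only numeric literals and basic arithmetic operators:

import ast
import operator

# Whitelist of AST operator nodes mapped to their Python implementations
_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def eval_expression_safe(expression: str) -> float:
    """Evaluate a basic arithmetic expression without eval/exec."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported element in expression: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))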

Database query tool:

def query_database(query: str, table: str) -> str:
    """
    Execute SQL query on specified table.
    Supports: SELECT statements only (read-only).
    """
    # Validate query is SELECT only
    if not query.strip().upper().startswith("SELECT"):
        return "Error: Only SELECT queries allowed"
    
    try:
        results = db.execute(query, table)
        return format_db_results(results)
    except Exception as e:
        return f"Query error: {str(e)}"

database_schema = {
    "name": "query_database",
    "description": "Query the database for stored information. Use this to retrieve user data, preferences, past orders, or historical records. Read-only access.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "SQL SELECT query"
            },
            "table": {
                "type": "string",
                "description": "Table name to query",
                "enum": ["users", "orders", "products", "preferences"]
            }
        },
        "required": ["query", "table"]
    }
}

API call tool:

def api_call(endpoint: str, method: str = "GET", data: dict = None) -> str:
    """
    Make API requests to external services.
    Handles authentication and error responses.
    """
    try:
        response = requests.request(
            method=method,
            url=f"{API_BASE_URL}/{endpoint}",
            json=data,
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30
        )
        response.raise_for_status()
        # Return the body as text so the return type matches the error paths
        return response.text
    except requests.Timeout:
        return "Error: Request timeout"
    except requests.RequestException as e:
        return f"Error: {str(e)}"

api_call_schema = {
    "name": "api_call",
    "description": "Call external APIs for real-time data. Use for weather, stock prices, exchange rates, or other external services.",
    "parameters": {
        "type": "object",
        "properties": {
            "endpoint": {
                "type": "string",
                "description": "API endpoint path (e.g., 'weather', 'stocks/AAPL')"
            },
            "method": {
                "type": "string",
                "enum": ["GET", "POST"],
                "default": "GET"
            },
            "data": {
                "type": "object",
                "description": "Request body for POST requests"
            }
        },
        "required": ["endpoint"]
    }
}

Tool error handling patterns:

class ToolExecutionError(Exception):
    """Base exception for tool errors"""
    pass

class ToolTimeoutError(ToolExecutionError):
    """Tool execution exceeded timeout"""
    pass

class ToolValidationError(ToolExecutionError):
    """Tool inputs failed validation"""
    pass

def execute_tool_safe(tool: Tool, arguments: dict) -> dict:
    """
    Production-grade tool execution with comprehensive error handling.
    """
    try:
        # Validate inputs
        tool._validate_inputs(arguments)
        
        # Execute with timeout
        with timeout(30):
            result = tool.execute(**arguments)
        
        # Validate output
        validate_tool_output(result)
        
        return result
    
    except ToolValidationError as e:
        logger.error(f"Tool validation failed: {e}")
        return {
            "success": False,
            "error": f"Invalid input: {str(e)}",
            "recoverable": True
        }
    
    except ToolTimeoutError:
        logger.error(f"Tool timeout: {tool.name}")
        return {
            "success": False,
            "error": "Tool execution timeout",
            "recoverable": True
        }
    
    except Exception as e:
        logger.exception(f"Tool error: {tool.name}")
        return {
            "success": False,
            "error": f"Execution failed: {str(e)}",
            "recoverable": False
        }

Common tool implementation mistakes:

  • Vague descriptions causing the LLM to misuse tools
  • Missing input validation allowing invalid data through
  • No timeout handling causing hung executions
  • Poor error messages making debugging impossible
  • Inconsistent return formats breaking state updates

Memory Architecture: Short-term, Long-term, and Episodic

Memory separates toy demos from production systems. Conversation without memory frustrates users. But not all memory is the same – different types serve different purposes.

Figure 4: Three-tier memory architecture showing Short-term memory (current session), Long-term memory (persistent storage), and Episodic memory (past interaction patterns)

Short-term Memory: Conversation Context

Purpose: Maintain coherence within a single conversation

Implementation:

class ShortTermMemory:
    """
    Manages conversation context for current session.
    Stored in-memory, not persisted.
    """
    def __init__(self, max_messages: int = 20):
        self.messages = []
        self.max_messages = max_messages
    
    def add_message(self, role: str, content: str):
        """Add message to history"""
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now()
        })
        
        # Trim if too long
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]
    
    def get_context(self) -> list:
        """Get recent conversation context"""
        return self.messages
    
    def clear(self):
        """Clear conversation history"""
        self.messages = []

Use cases:

  • Current conversation flow
  • Immediate context for next response
  • Temporary task state
  • Within-session coherence

Limitations:

  • Lost when session ends
  • Grows unbounded without trimming
  • Token limits force summarization

Long-term Memory: Persistent Storage

Purpose: Remember information across sessions

Implementation:

class LongTermMemory:
    """
    Persistent storage for facts and preferences.
    Uses database or key-value store.
    """
    def __init__(self, user_id: str, db_connection):
        self.user_id = user_id
        self.db = db_connection
    
    def store_fact(self, key: str, value: str, category: str = "general"):
        """Store a fact about the user"""
        self.db.upsert(
            table="user_facts",
            data={
                "user_id": self.user_id,
                "key": key,
                "value": value,
                "category": category,
                "updated_at": datetime.now()
            }
        )
    
    def retrieve_fact(self, key: str) -> str:
        """Retrieve a stored fact"""
        result = self.db.query(
            f"SELECT value FROM user_facts WHERE user_id = ? AND key = ?",
            (self.user_id, key)
        )
        return result["value"] if result else None
    
    def get_all_facts(self, category: str = None) -> dict:
        """Get all facts, optionally filtered by category"""
        query = "SELECT key, value FROM user_facts WHERE user_id = ?"
        params = [self.user_id]
        
        if category:
            query += " AND category = ?"
            params.append(category)
        
        results = self.db.query(query, params)
        return {r["key"]: r["value"] for r in results}

Use cases:

  • User preferences (communication style, format preferences)
  • Personal information (name, location, job title)
  • Learned facts (favorite tools, common tasks)
  • Configuration (default parameters, enabled features)

Storage considerations:

Database: Structured facts work well in relational databases

-- Schema
CREATE TABLE user_facts (
    user_id TEXT,
    key TEXT,
    value TEXT,
    category TEXT,
    updated_at TIMESTAMP,
    PRIMARY KEY (user_id, key)
);

Vector database: Semantic retrieval for unstructured information

class VectorMemory:
    """Store and retrieve memories by semantic similarity"""
    def __init__(self, user_id: str, vector_db):
        self.user_id = user_id
        self.vector_db = vector_db
    
    def store(self, content: str, metadata: dict = None):
        """Store content with embeddings"""
        embedding = generate_embedding(content)
        self.vector_db.upsert(
            user_id=self.user_id,
            embedding=embedding,
            content=content,
            metadata=metadata or {}
        )
    
    def search(self, query: str, top_k: int = 5) -> list:
        """Find similar memories"""
        query_embedding = generate_embedding(query)
        return self.vector_db.search(
            user_id=self.user_id,
            embedding=query_embedding,
            top_k=top_k
        )

Episodic Memory: Learning from Past Interactions

Purpose: Remember and learn from past episodes (complete task sequences)

Implementation:

class EpisodicMemory:
    """
    Stores complete interaction episodes for learning.
    Captures successful strategies and failure patterns.
    """
    def __init__(self, user_id: str, vector_db):
        self.user_id = user_id
        self.vector_db = vector_db
    
    def store_episode(self, task: str, actions: list, outcome: dict):
        """Store a complete task episode"""
        episode = {
            "task": task,
            "actions": actions,
            "outcome": outcome,
            "success": outcome.get("success", False),
            "timestamp": datetime.now()
        }
        
        # Create embeddings for semantic search
        episode_text = f"{task} | {format_actions(actions)}"
        embedding = generate_embedding(episode_text)
        
        self.vector_db.upsert(
            collection="episodes",
            user_id=self.user_id,
            embedding=embedding,
            data=episode
        )
    
    def retrieve_similar_episodes(self, task: str, top_k: int = 3) -> list:
        """Find similar past episodes"""
        query_embedding = generate_embedding(task)
        return self.vector_db.search(
            collection="episodes",
            user_id=self.user_id,
            embedding=query_embedding,
            top_k=top_k
        )
    
    def get_successful_strategies(self, task_type: str) -> list:
        """Get successful strategies for similar tasks"""
        episodes = self.retrieve_similar_episodes(task_type, top_k=10)
        successful = [e for e in episodes if e["success"]]
        return [e["actions"] for e in successful]

Use cases:

  • Learning which approaches work for specific task types
  • Avoiding previously failed strategies
  • Transferring successful patterns to similar tasks
  • Building user-specific behavior models

Episode structure:

episode = {
    "task": "Find weather in San Francisco",
    "actions": [
        {
            "type": "tool_call",
            "tool": "weather_api",
            "arguments": {"city": "San Francisco"},
            "result": {"success": True, "temp": 68}
        }
    ],
    "outcome": {
        "success": True,
        "user_satisfied": True,
        "execution_time": 1.2
    },
    "metadata": {
        "context": "user planning trip",
        "tools_available": ["weather_api", "search_web"],
        "strategy": "direct_api_call"
    }
}

Hybrid Memory System

Production systems combine all three types:

class HybridMemory:
    """
    Complete memory system combining short-term, long-term, and episodic.
    """
    def __init__(self, user_id: str):
        self.short_term = ShortTermMemory()
        self.long_term = LongTermMemory(user_id, get_db())
        self.episodic = EpisodicMemory(user_id, get_vector_db())
    
    def prepare_context(self, task: str) -> dict:
        """Prepare complete context for agent"""
        return {
            # Current conversation
            "recent_messages": self.short_term.get_context(),
            
            # User facts and preferences
            "user_facts": self.long_term.get_all_facts(),
            
            # Similar past successes
            "similar_episodes": self.episodic.retrieve_similar_episodes(task),
            
            # Learned strategies
            "successful_strategies": self.episodic.get_successful_strategies(task)
        }
    
    def update(self, role: str, content: str, metadata: dict = None):
        """Update all memory types"""
        # Update short-term
        self.short_term.add_message(role, content)
        
        # Extract and store facts
        if facts := extract_facts(content):
            for key, value in facts.items():
                self.long_term.store_fact(key, value)
    
    def finalize_episode(self, task: str, outcome: dict):
        """Store complete episode after task completion"""
        actions = self.short_term.get_context()
        self.episodic.store_episode(task, actions, outcome)
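
A typical per-task flow with the hybrid memory looks like this; the user ID and task text are illustrative:

memory = HybridMemory(user_id="user-123")

# Before the agent loop: assemble context from all three memory tiers
context = memory.prepare_context("Plan a weekend trip to San Francisco")

# During the loop: record each turn; extracted facts land in long-term memory
memory.update("user", "I prefer short, bullet-point answers")

# After the task: persist the whole episode for future strategy retrieval
memory.finalize_episode(
    "Plan a weekend trip to San Francisco",
    {"success": True, "user_satisfied": True}
)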

Memory selection guide:

Need | Memory Type
Current conversation | Short-term
User preferences | Long-term
Past successful strategies | Episodic
Temporary task state | Short-term
Learned behaviors | Long-term + Episodic
Session-specific context | Short-term
Cross-session facts | Long-term
Strategy learning | Episodic

Observations vs Actions: The Critical Distinction

This seems obvious until you’re debugging a broken agent. Did it fail because it didn’t observe the right information, or because it took the wrong action based on correct observations?

The distinction:

Observations are information inputs:

  • Current task description
  • Conversation history
  • Available tools
  • Previous results
  • Memory context
  • System state

Actions are operations:

  • Tool calls
  • Final answer generation
  • Follow-up questions
  • State updates
  • Memory writes

Why this matters for debugging:

# Example debugging scenario
task = "Find weather in San Francisco and convert temperature to Celsius"

# Agent fails - but where?

# Possibility 1: Observation failure
# - Task not in context
# - Tool description missing
# - Previous result not included

# Possibility 2: Action failure
# - Selected wrong tool
# - Provided invalid arguments
# - Didn't chain actions properly

Systematic debugging approach:

1. Check observations:

def debug_observations(state: dict):
    """Verify observation quality"""
    observation = observe(state)
    
    checks = {
        "task_present": "task" in observation,
        "tools_described": len(observation.get("available_tools", [])) > 0,
        "history_included": len(observation.get("history", [])) > 0,
        "previous_result": "previous_result" in observation
    }
    
    print("Observation Quality:")
    for check, passed in checks.items():
        status = "âś“" if passed else "âś—"
        print(f"  {status} {check}")
    
    return observation

2. Check reasoning:

def debug_reasoning(observation: dict, reasoning: str):
    """Verify reasoning quality"""
    checks = {
        "task_referenced": observation["task"] in reasoning,
        "tools_considered": any(tool["name"] in reasoning 
                                for tool in observation["available_tools"]),
        "explicit_decision": any(marker in reasoning 
                                 for marker in ["I will", "I should", "Next step"]),
        "reasoning_present": len(reasoning) > 100
    }
    
    print("Reasoning Quality:")
    for check, passed in checks.items():
        status = "âś“" if passed else "âś—"
        print(f"  {status} {check}")

3. Check actions:

def debug_action(action: dict):
    """Verify action validity"""
    checks = {
        "valid_type": action["type"] in ["tool_call", "final_answer"],
        "tool_exists": action.get("tool") in get_available_tools(),
        "has_arguments": "arguments" in action if action["type"] == "tool_call" else True,
        "arguments_valid": validate_arguments(action) if action["type"] == "tool_call" else True
    }
    
    print("Action Quality:")
    for check, passed in checks.items():
        status = "âś“" if passed else "âś—"
        print(f"  {status} {check}")

Common failure patterns:

Observation failures:

  • Missing tool descriptions → Agent doesn’t know what’s available
  • Truncated history → Lost context from earlier conversation
  • No previous result → Repeats failed actions
  • Task not included → Goal drift

Reasoning failures:

  • Generic thinking → No specific strategy
  • Ignores tools → Tries to answer without external data
  • No step-by-step breakdown → Jumps to conclusions
  • Contradictory logic → Internal inconsistency

Action failures:

  • Hallucinated tools → Tries to call non-existent functions
  • Invalid arguments → Wrong types or missing required parameters
  • Wrong tool selection → Has right tools but picks wrong one
  • No action → Gets stuck in analysis paralysis

The debugging workflow:

Figure 5: Agent Failure Debugging Flow
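
In code form, the same workflow is just the three debug helpers above chained in order:

def debug_agent_step(state: dict, reasoning: str, action: dict) -> None:
    """Walk one agent iteration through the three checks, in order.

    Read the output top-down: a failed observation check usually explains
    a weak reasoning check, which in turn explains a bad action.
    """
    observation = debug_observations(state)   # Stage 1: did the agent see what it needed?
    debug_reasoning(observation, reasoning)   # Stage 2: did it reason about what it saw?
    debug_action(action)                      # Stage 3: did it act on what it reasoned?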

Building a Production Agent: Complete Implementation

Let’s tie everything together with a complete, production-ready agent implementation:

import logging
from datetime import datetime
from typing import Dict, List, Any
import json

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductionAgent:
    """
    Complete agent implementation with:
    - Multiple tools
    - Conversation memory
    - Error handling
    - Execution tracking
    - Debug capabilities
    """
    
    def __init__(
        self,
        llm,
        tools: List[Tool],
        max_iterations: int = 10,
        max_cost: float = 1.0
    ):
        self.llm = llm
        self.tools = {tool.name: tool for tool in tools}
        self.max_iterations = max_iterations
        self.max_cost = max_cost
        self.memory = ShortTermMemory()
        
        # Execution tracking
        self.stats = {
            "total_iterations": 0,
            "successful_completions": 0,
            "tool_calls": 0,
            "errors": 0,
            "total_cost": 0.0
        }
    
    def run(self, task: str, debug: bool = False) -> Dict[str, Any]:
        """
        Execute agent loop for given task.
        
        Args:
            task: The task to accomplish
            debug: Enable debug output
            
        Returns:
            Result dictionary with answer and metadata
        """
        # Initialize state
        state = {
            "task": task,
            "iteration": 0,
            "completed": False,
            "start_time": datetime.now()
        }
        
        logger.info(f"Starting task: {task}")
        
        try:
            # Main agent loop
            while not self._should_terminate(state):
                if debug:
                    print(f"\n=== Iteration {state['iteration']} ===")
                
                # OBSERVE
                observation = self._observe(state)
                if debug:
                    print(f"Observation: {json.dumps(observation, indent=2)}")
                
                # THINK
                reasoning = self._think(observation)
                if debug:
                    print(f"Reasoning: {reasoning[:200]}...")
                
                # DECIDE
                action = self._decide(reasoning)
                if debug:
                    print(f"Action: {action}")
                
                # ACT
                result = self._act(action)
                if debug:
                    print(f"Result: {result}")
                
                # UPDATE STATE
                state = self._update_state(state, action, result)
                
                # Check completion
                if result.get("final"):
                    state["completed"] = True
                    state["final_answer"] = result["result"]
                
                state["iteration"] += 1
                self.stats["total_iterations"] += 1
            
            # Extract final answer
            answer = self._extract_answer(state)
            
            if state["completed"]:
                self.stats["successful_completions"] += 1
            
            return {
                "success": True,
                "answer": answer,
                "iterations": state["iteration"],
                "execution_time": (datetime.now() - state["start_time"]).total_seconds(),
                "termination_reason": self._get_termination_reason(state)
            }
        
        except Exception as e:
            logger.exception("Agent execution failed")
            self.stats["errors"] += 1
            return {
                "success": False,
                "error": str(e),
                "iterations": state["iteration"]
            }
    
    def _observe(self, state: dict) -> dict:
        """Gather context for decision making"""
        return {
            "task": state["task"],
            "conversation": self.memory.get_context(),
            "available_tools": [
                {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.schema["parameters"]
                }
                for tool in self.tools.values()
            ],
            "iteration": state["iteration"],
            "max_iterations": self.max_iterations,
            "previous_result": state.get("last_result")
        }
    
    def _think(self, observation: dict) -> str:
        """LLM reasoning step"""
        prompt = self._build_prompt(observation)
        
        # Track cost
        response = self.llm.generate(prompt)
        self.stats["total_cost"] += estimate_cost(response)
        
        return response
    
    def _build_prompt(self, observation: dict) -> str:
        """Construct prompt for LLM"""
        tools_desc = "\n".join([
            f"- {t['name']}: {t['description']}"
            for t in observation["available_tools"]
        ])
        
        history = "\n".join([
            f"{msg['role']}: {msg['content']}"
            for msg in observation["conversation"][-5:]
        ])
        
        return f"""You are a helpful agent that can use tools to accomplish tasks.

Task: {observation['task']}

Available tools:
{tools_desc}

Conversation history:
{history}

Previous result: {observation.get('previous_result', 'None')}

You are on iteration {observation['iteration']} of {observation['max_iterations']}.

Think step by step:
1. What is the current situation?
2. What information do I have?
3. What information do I need?
4. Should I use a tool or provide a final answer?

If using a tool, respond with:
Tool: <tool_name>
Arguments: <arguments_as_json>

If providing final answer, respond with:
Final Answer: <your_answer>

Your reasoning:"""
    
    def _decide(self, reasoning: str) -> dict:
        """Parse reasoning into structured action"""
        try:
            if "Tool:" in reasoning:
                # Extract tool call
                tool_line = [l for l in reasoning.split("\n") if l.startswith("Tool:")][0]
                tool_name = tool_line.split("Tool:")[1].strip()
                
                args_line = [l for l in reasoning.split("\n") if l.startswith("Arguments:")][0]
                args_json = args_line.split("Arguments:")[1].strip()
                arguments = json.loads(args_json)
                
                return {
                    "type": "tool_call",
                    "tool": tool_name,
                    "arguments": arguments
                }
            
            elif "Final Answer:" in reasoning:
                # Extract final answer
                answer = reasoning.split("Final Answer:")[1].strip()
                return {
                    "type": "final_answer",
                    "content": answer
                }
            
            else:
                return {
                    "type": "continue",
                    "message": "No clear action determined"
                }
        
        except Exception as e:
            logger.error(f"Failed to parse action: {e}")
            return {
                "type": "error",
                "message": f"Could not parse action: {str(e)}"
            }
    
    def _act(self, action: dict) -> dict:
        """Execute action"""
        try:
            if action["type"] == "tool_call":
                # Validate tool exists
                if action["tool"] not in self.tools:
                    return {
                        "success": False,
                        "error": f"Tool '{action['tool']}' not found"
                    }
                
                # Execute tool
                tool = self.tools[action["tool"]]
                result = tool.execute(**action["arguments"])
                
                self.stats["tool_calls"] += 1
                
                return result
            
            elif action["type"] == "final_answer":
                return {
                    "success": True,
                    "result": action["content"],
                    "final": True
                }
            
            elif action["type"] == "continue":
                return {
                    "success": False,
                    "error": "No action taken - agent is uncertain"
                }
            
            elif action["type"] == "error":
                return {
                    "success": False,
                    "error": action["message"]
                }
        
        except Exception as e:
            logger.exception("Action execution failed")
            return {
                "success": False,
                "error": str(e)
            }
    
    def _update_state(self, state: dict, action: dict, result: dict) -> dict:
        """Update state with action outcome"""
        # Add to memory
        self.memory.add_message(
            role="assistant",
            content=f"Action: {action['type']} | Result: {result.get('result', result.get('error'))}"
        )
        
        # Store last result
        state["last_result"] = result
        
        return state
    
    def _should_terminate(self, state: dict) -> bool:
        """Check termination conditions"""
        # Success
        if state.get("completed"):
            return True
        
        # Max iterations
        if state["iteration"] >= self.max_iterations:
            logger.warning("Max iterations reached")
            return True
        
        # Cost limit
        if self.stats["total_cost"] >= self.max_cost:
            logger.warning("Cost limit exceeded")
            return True
        
        # Time limit (5 minutes)
        elapsed = (datetime.now() - state["start_time"]).total_seconds()
        if elapsed > 300:
            logger.warning("Time limit exceeded")
            return True
        
        return False
    
    def _extract_answer(self, state: dict) -> str:
        """Extract final answer from state"""
        if "final_answer" in state:
            return state["final_answer"]
        
        # Fallback for incomplete tasks
        last_result = state.get("last_result", {})
        if last_result.get("success"):
            return f"Task incomplete. Last result: {last_result['result']}"
        else:
            return f"Task incomplete. Last error: {last_result.get('error', 'Unknown')}"
    
    def _get_termination_reason(self, state: dict) -> str:
        """Determine why execution terminated"""
        if state.get("completed"):
            return "task_completed"
        elif state["iteration"] >= self.max_iterations:
            return "max_iterations"
        elif self.stats["total_cost"] >= self.max_cost:
            return "cost_limit"
        else:
            return "unknown"
    
    def get_stats(self) -> dict:
        """Get execution statistics"""
        return self.stats.copy()
    
    def reset_stats(self):
        """Reset execution statistics"""
        for key in self.stats:
            self.stats[key] = 0.0 if isinstance(self.stats[key], float) else 0

Usage example:

# Define tools
calculator = Tool(
    name="calculator",
    description="Perform mathematical calculations",
    function=calculator_function,
    schema=calculator_schema
)

weather = Tool(
    name="weather",
    description="Get current weather for a location",
    function=weather_function,
    schema=weather_schema
)

search = Tool(
    name="search_web",
    description="Search the web for information",
    function=search_function,
    schema=search_schema
)

# Create agent
agent = ProductionAgent(
    llm=get_llm(),
    tools=[calculator, weather, search],
    max_iterations=10,
    max_cost=0.50
)

# Run task
result = agent.run(
    task="What's the weather in San Francisco? Convert the temperature to Celsius.",
    debug=True
)

print(f"Answer: {result['answer']}")
print(f"Iterations: {result['iterations']}")
print(f"Time: {result['execution_time']:.2f}s")
print(f"Reason: {result['termination_reason']}")

# Check stats
print("\nExecution Statistics:")
print(json.dumps(agent.get_stats(), indent=2))

This implementation includes:

  • ✅ Complete agent loop
  • ✅ Multiple tools with validation
  • ✅ Conversation memory
  • ✅ Error handling at every step
  • ✅ Execution tracking and statistics
  • ✅ Debug mode for development
  • ✅ Multiple termination conditions
  • ✅ Cost tracking
  • ✅ Comprehensive logging

Testing and Debugging Strategies

Production agents require systematic testing. Here’s how to validate each component:

Unit Tests

Test individual functions:

def test_observation():
    """Test observation gathering"""
    state = {
        "task": "Test task",
        "conversation_history": [
            {"role": "user", "content": "Hello"}
        ],
        "iteration": 1
    }
    
    observation = observe(state)
    
    assert "task" in observation
    assert observation["task"] == "Test task"
    assert len(observation["history"]) == 1
    assert "available_tools" in observation

def test_tool_execution():
    """Test tool execution"""
    tool = calculator_tool
    result = tool.execute(expression="2 + 2")
    
    assert result["success"] == True
    assert "4" in result["result"]

def test_memory():
    """Test memory operations"""
    memory = ShortTermMemory()
    memory.add_message("user", "My name is Alice")
    
    context = memory.get_context()
    assert len(context) == 1
    assert "Alice" in str(context)

Integration Tests

Test component interactions:

def test_agent_with_calculator():
    """Test agent executing calculator tool"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[calculator_tool],
        max_iterations=5
    )
    
    result = agent.run("What is 15 * 23?")
    
    assert result["success"] == True
    assert "345" in result["answer"]
    assert result["iterations"] <= 3

def test_agent_multi_step():
    """Test multi-step reasoning"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[calculator_tool, weather_tool],
        max_iterations=10
    )
    
    result = agent.run(
        "Get weather in Boston. If temperature is above 20C, calculate 20 * 3."
    )
    
    assert result["success"] == True
    stats = agent.get_stats()
    assert stats["tool_calls"] >= 2  # Weather + calculator

End-to-End Tests

Test complete user flows:

def test_conversation_memory():
    """Test memory across multiple turns"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[],
        max_iterations=5
    )
    
    # First turn
    result1 = agent.run("My name is Alice")
    assert result1["success"] == True
    
    # Second turn - should remember name
    result2 = agent.run("What's my name?")
    assert result2["success"] == True
    assert "Alice" in result2["answer"]

def test_error_recovery():
    """Test agent handling tool errors"""
    def always_fail(**kwargs):
        """Simulated tool failure for testing"""
        raise RuntimeError("Simulated tool failure")

    faulty_tool = Tool(
        name="faulty",
        description="A tool that fails",
        function=always_fail,
        schema={"parameters": {"properties": {}}}
    )
    
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[faulty_tool, calculator_tool],
        max_iterations=10
    )
    
    result = agent.run("Try the faulty tool, then calculate 2+2")
    
    assert result["success"] == True  # Should recover and complete
    assert "4" in result["answer"]

Performance Tests

Test under load and edge cases:

def test_max_iterations():
    """Test iteration limit enforcement"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[],
        max_iterations=3
    )
    
    result = agent.run("Keep thinking forever")
    
    assert result["iterations"] == 3
    assert result["termination_reason"] == "max_iterations"

def test_cost_limit():
    """Test cost limit enforcement"""
    agent = ProductionAgent(
        llm=get_expensive_test_llm(),
        tools=[],
        max_iterations=100,
        max_cost=0.01
    )
    
    result = agent.run("Complex task")
    
    assert result["termination_reason"] == "cost_limit"
    assert agent.get_stats()["total_cost"] <= 0.01

from concurrent.futures import ThreadPoolExecutor

def test_concurrent_execution():
    """Test multiple agents running simultaneously"""
    agent1 = ProductionAgent(llm=get_test_llm(), tools=[calculator_tool])
    agent2 = ProductionAgent(llm=get_test_llm(), tools=[weather_tool])
    
    with ThreadPoolExecutor(max_workers=2) as executor:
        future1 = executor.submit(agent1.run, "Calculate 5 * 5")
        future2 = executor.submit(agent2.run, "Weather in NYC")
        
        result1 = future1.result()
        result2 = future2.result()
    
    assert result1["success"] == True
    assert result2["success"] == True

Common Pitfalls and Solutions

Pitfall 1: Infinite loops

Problem: Agent repeats same action indefinitely

Solution:

def detect_loop(state: dict, window: int = 3) -> bool:
    """Detect repeated actions"""
    if len(state["history"]) < window:
        return False
    
    recent = state["history"][-window:]
    actions = [h["action"] for h in recent]
    
    # All identical
    if all(a == actions[0] for a in actions):
        return True
    
    return False
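
One way to wire this in is to extend the earlier _should_terminate check. A minimal sketch, assuming state carries a history list of {"action": ...} entries (which _update_state would need to append):

# Inside ProductionAgent
def _should_terminate(self, state: dict) -> bool:
    """Termination check extended with loop detection"""
    # New: bail out when the agent keeps repeating the same action
    if detect_loop(state):
        logger.warning("Repeated action detected - stopping to avoid an infinite loop")
        return True

    # ...existing checks: completion, max iterations, cost limit, time limit...
    return False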

Pitfall 2: Context window overflow

Problem: Too much history exceeds token limits

Solution:

def manage_context(history: list, max_tokens: int = 4000) -> list:
    """Keep context within token limits"""
    while estimate_tokens(history) > max_tokens:
        if len(history) <= 2:  # Keep minimum context
            break
        
        # Remove oldest message
        history = history[1:]
    
    return history
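
If you don’t already have an estimate_tokens helper in your codebase, a rough character-based approximation (about four characters per token for English text) is good enough for this guard; swap in a real tokenizer such as tiktoken when precision matters:

def estimate_tokens(history: list) -> int:
    """Rough token estimate: ~4 characters per token for English text"""
    total_chars = sum(len(str(msg.get("content", ""))) for msg in history)
    return total_chars // 4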

Pitfall 3: Tool hallucination

Problem: LLM invents non-existent tools

Solution:

def validate_tool_call(tool_name: str, available_tools: list) -> bool:
    """Validate tool exists before execution"""
    if tool_name not in [t.name for t in available_tools]:
        logger.warning(f"Attempted to call non-existent tool: {tool_name}")
        return False
    return True

Pitfall 4: Poor error messages

Problem: Generic errors make debugging impossible

Solution:

class ToolError(Exception):
    """Rich error with context"""
    def __init__(self, tool_name: str, error: str, context: dict):
        self.tool_name = tool_name
        self.error = error
        self.context = context
        super().__init__(f"Tool '{tool_name}' failed: {error}")
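
At the call site, that looks something like this sketch, which wraps tool execution so the error carries enough context to reconstruct what happened:

def execute_tool_safely(tool, arguments: dict, iteration: int) -> dict:
    """Execute a tool and convert failures into rich, debuggable errors"""
    try:
        return tool.execute(**arguments)
    except Exception as e:
        raise ToolError(
            tool_name=tool.name,
            error=str(e),
            context={"arguments": arguments, "iteration": iteration}
        ) from e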

Production Deployment Checklist

Before deploying agents to production:

Code Quality:

  • [ ] All functions have type hints
  • [ ] Comprehensive error handling
  • [ ] Logging at appropriate levels
  • [ ] Unit tests for all components
  • [ ] Integration tests for workflows
  • [ ] Code review completed

Performance:

  • [ ] Token usage optimized
  • [ ] Cost limits configured
  • [ ] Timeout handling implemented
  • [ ] Concurrent execution tested
  • [ ] Load testing completed

Reliability:

  • [ ] Retry logic for transient failures
  • [ ] Circuit breakers for external services
  • [ ] Graceful degradation strategies
  • [ ] Monitoring and alerting configured
  • [ ] Incident response procedures documented

Security:

  • [ ] Input validation on all tools
  • [ ] SQL injection prevention
  • [ ] API key management
  • [ ] Rate limiting implemented
  • [ ] Audit logging enabled

Observability:

  • [ ] Structured logging
  • [ ] Metrics collection
  • [ ] Distributed tracing
  • [ ] Debug mode for development
  • [ ] Performance profiling

Key Takeaways

The agent loop is fundamental: Every agent implements observe → think → decide → act → update state. Understanding this pattern helps you work with any framework.

Tools enable action: Without properly designed tools, agents are just chatbots. Invest time in clear descriptions, robust schemas, and comprehensive error handling.

Memory separates demos from production: Short-term memory maintains conversations. Long-term memory persists facts. Episodic memory enables learning.

Observations ≠ Actions: When debugging, distinguish between information gathering failures and execution failures. They require different fixes.

Production requires robustness: Max iterations, cost limits, timeouts, error handling, and logging aren’t optional – they’re essential.

Start simple, add complexity: Build single-loop agents first. Master the basics before moving to multi-agent systems and complex workflows.

What’s Next: LangGraph and Deterministic Flows

You now understand agent building blocks. But there’s a problem: the loop we built is still somewhat opaque.

Questions remain:

  • How do you guarantee certain steps happen in order?
  • How do you create branches (if-then logic)?
  • How do you make agent behavior deterministic and testable?
  • How do you visualize complex workflows?

The next blog will introduce LangGraph – a framework for building agents as explicit state machines. You’ll learn:

  • Why graphs beat loops for complex agents
  • How to define states, nodes, and edges
  • Conditional routing and branching logic
  • Checkpointing and retry mechanisms
  • Building deterministic, debuggable workflows

The key shift: From implicit loops to explicit state graphs

Instead of a while loop where logic is hidden in functions, you’ll define explicit graphs showing exactly how the agent moves through states. This makes complex behaviors clear, testable, and debuggable.

Conclusion: From Components to Systems

Building production-ready agents isn’t about calling agent.run() and hoping for the best. It’s about understanding each component – the execution loop, tool interfaces, memory architecture, and state management – and how they work together.

This guide gave you working implementations of every pattern. You’ve seen:

  • The canonical agent loop with all five steps
  • Tool design with schemas, validation, and error handling
  • Memory systems for short-term, long-term, and episodic storage
  • The observation-action distinction for systematic debugging
  • A complete production agent with tracking and statistics

The code isn’t pseudocode or simplified examples. It’s production-grade implementation you can adapt for real systems.

Start building: Take the patterns here and apply them to your problems. Build tools for your APIs. Implement memory for your users. Create agents that handle real tasks reliably.

The fundamentals transfer across frameworks. Whether you use LangChain, LangGraph, or custom solutions, you’ll recognize these patterns. More importantly, you’ll know how to debug them when they break.

Next up: LangGraph for deterministic, visual workflows. But first, implement the patterns here. Build a single-loop agent. Add tools. Test memory. Experience the challenges firsthand.

That’s how you master agent development.

Additional Resources

Research Papers:

  • ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) – The foundational paper on ReAct prompting
  • Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) – How LLMs learn tool usage

Code Repository: Full working code that you can extend: https://github.com/ranjankumar-gh/building-real-world-agentic-ai-systems-with-langgraph-codebase/tree/main/module-02

About This Series

This post is part of Building Real-World Agentic AI Systems with LangGraph, a comprehensive guide to production-ready AI agents.

Now go build something real.

When Your Chatbot Needs to Actually Do Something: Understanding AI Agents

Introduction

I’ve been building AI systems for production. The shift from LLMs to agents seemed small at first. Then I hit an infinite loop bug and watched costs spike. These aren’t the same thing at all.

Here’s what nobody tells you upfront: an LLM responds to you. An agent decides what to do. That difference is everything.

The Weather Test

Look at these two conversations:

System A:

You: "What's the weather in San Francisco?"
Bot: "I don't have access to real-time weather data, but San Francisco 
typically has mild temperatures year-round..."

System B:

You: "What's the weather in San Francisco?"
Agent: "It's currently 62°F and cloudy in San Francisco."

What happened differently? System B looked at your question, decided it needed weather data, called an API, got the result, and gave you an answer. Four distinct steps that System A couldn’t do.

That’s the line between generation and action. Between completing text and making decisions.

What LLMs Actually Do (And Don’t)

When you call GPT-4 or Claude, you’re using a completion engine. Feed it text, get text back. That’s it.

LLMs are genuinely impressive at pattern completion, synthesizing knowledge from training data, and understanding context. But they can’t remember your last conversation, access current information, execute code, or fix their own mistakes. Each API call is independent. No state. No feedback loop.

This isn’t a flaw. It’s what they are. A calculator doesn’t fail at making coffee. LLMs don’t fail at taking actions. They were built for a different job.

Five Things That Make Something an Agent

After building enough systems, you start seeing patterns. Real agents have these five capabilities:

Figure 1: Agentic Capabilities

Autonomy – The system figures out what to do, not just how to phrase things. You say “find last year’s sales data” and it determines which database to query, what filters to apply, and how to format results.

Planning – Breaking “analyze this dataset” into actual executable steps. Find the file, check the schema, run calculations, generate visualizations. Multi-step reasoning that adapts based on what it discovers.

Tool Use – APIs, databases, code execution. Ways to actually do things in the world beyond generating tokens.

Memory – Remembering the conversation two messages ago. Keeping track of what worked and what failed. Building context across interactions.

Feedback Loops – When something breaks, the agent sees the error and tries a different approach. Observation, action, observation, adaptation.

Strip away any of these and you’re back to an LLM with extra steps.

How Agents Actually Work

The core mechanism is simpler than you’d expect. It’s an observe-plan-act-observe loop:

  1. Observe – Process the user’s request and current state
  2. Plan – Decide what actions to take
  3. Act – Execute those actions (call tools, run code)
  4. Observe again – See what happened, decide next step

Let’s trace a real interaction:

User: "Book me a flight to NYC next Tuesday and add it to my calendar"

OBSERVATION:
- Two tasks: book flight + calendar entry
- Need: destination (NYC), date (next Tuesday), available tools

PLANNING:
1. Search flights to NYC for next Tuesday
2. Present options to user
3. Wait for user selection
4. Book selected flight
5. Add to calendar with flight details

ACTION:
- Execute: flight_search(destination="NYC", date="2025-12-17")

OBSERVATION (Result):
- Received 3 flight options with prices
- Status: Success

DECISION:
- Present options, wait for selection
- Update state: awaiting_user_selection

The agent isn’t just completing text. It’s making a decision at each step about what to do next based on what it observes.

The Spectrum of Agency

Not everything needs full autonomy. There’s a spectrum:

Figure 2: The Spectrum of Agency

Chatbots (low autonomy) – No tools, no state. Pure conversation. This is fine for FAQ bots where all you need is text generation.

Tool-using assistants – Fixed set of tools, simple state. The assistant can call your CRM API or check documentation, but it’s not planning multi-step operations.

Planning agents – Dynamic plans, complex state management. These can break down “analyze Q3 sales and generate a presentation” into actual steps that adapt based on intermediate results.

Multi-agent systems – Multiple agents coordinating, shared state. One agent handles research, another writes, another fact-checks. They communicate and negotiate task division.

Fully autonomous systems – Long-running, open-ended goals. These operate for extended periods with minimal supervision.

Most production systems land somewhere in the middle. You rarely need full autonomy. Often, you just need tools and basic memory.

Where Agents Break in Production

These six failure modes show up constantly in production:

Figure 3: Agent Failures in Production

Infinite loops – Agent calls web_search, doesn’t find what it needs, calls web_search again with slightly different parameters, repeats forever. Solution: set max iterations.

Tool hallucination – Agent tries to call send_email_to_team() which doesn’t exist. The LLM confidently invents plausible-sounding tool names. Solution: strict tool validation.

Context overflow – After 50 messages, the conversation history exceeds the context window. Agent forgets the original goal. Solution: smart context management and summarization.

Cost explosion – No cost caps, agent makes 200 API calls trying to debug something. Your bill hits $10,000 before you notice. Solution: per-request budget limits.

Non-deterministic failures – Same input, different outputs. Sometimes it works, sometimes it doesn’t. Hard to debug. Solution: extensive logging and trace analysis.

Silent failures – Tool call fails, agent doesn’t handle the error, just continues. User gets incorrect results with no indication that something went wrong. Solution: explicit error handling everywhere.

The common thread? These all happen because the agent is making decisions, and decisions can be wrong. With pure LLMs, you can validate outputs. With agents, you need to validate the entire decision-making process.

Memory: Short-term, Long-term, and Procedural

Memory turns out to be more nuanced than “remember the conversation.”

Figure 4: Agent Memory

Short-term memory (working memory) – Holds the current conversation and immediate context. This is what keeps the agent from asking “what’s your name?” after you just told it.

Long-term memory (episodic) – Stores information across sessions. “Last time we talked, you mentioned you preferred Python over JavaScript.” This is harder to implement but crucial for personalized experiences.

Procedural memory – Learned patterns and behaviors. “When the user asks about sales data, they usually want year-over-year comparisons, not raw numbers.” This often comes from fine-tuning or RLHF (Reinforcement Learning from Human Feedback).

Most systems implement short-term memory (conversation history) and skip the rest. That’s often fine. Long-term memory adds complexity quickly, especially around retrieval and relevance.

Tools: The Actual Interface to the World

Tool calling is how agents affect reality. The LLM generates structured output that your code executes:

# LLM generates this structured decision
{
  "tool": "send_email",
  "arguments": {
    "to": "team@company.com",
    "subject": "Q3 Results",
    "body": "Attached are the Q3 metrics we discussed."
  }
}

# Your code executes it
result = tools["send_email"](**arguments)

# Agent sees the result and decides what to do next

The critical part is validation. Before executing any tool call, check that the tool exists, the parameters are valid, and you have permission to run it. Tool hallucination is common and dangerous.

Also, most tools fail sometimes. APIs timeout, databases lock, network connections drop. Your agent needs explicit error handling for every tool call. Assume failure. Build retry logic. Log everything.
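
A minimal sketch of what that looks like in practice (the tools registry, retry budget, and backoff policy here are illustrative choices, not any particular framework’s API):

import logging
import time

logger = logging.getLogger("agent.tools")

def run_tool_call(tools: dict, call: dict, max_retries: int = 2) -> dict:
    """Validate a tool call, execute it, and retry transient failures."""
    name = call.get("tool")
    if name not in tools:                      # Guard against hallucinated tools
        logger.warning("Unknown tool requested: %s", name)
        return {"success": False, "error": f"Tool '{name}' does not exist"}

    for attempt in range(max_retries + 1):
        try:
            result = tools[name](**call.get("arguments", {}))
            return {"success": True, "result": result}
        except Exception as e:                 # APIs time out, databases lock, networks drop
            logger.warning("Tool %s failed (attempt %d): %s", name, attempt + 1, e)
            time.sleep(2 ** attempt)           # Simple exponential backoff
    return {"success": False, "error": f"Tool '{name}' failed after {max_retries + 1} attempts"}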

The Planning Problem

“Book a flight and add it to my calendar” sounds simple until you break it down:

  1. Extract destination and date from natural language
  2. Check if you have enough context (do you know which airport? which calendar?)
  3. Search for flights
  4. Evaluate options based on unstated preferences
  5. Present choices without overwhelming the user
  6. Wait for selection (this is a state transition)
  7. Execute booking (this might fail)
  8. Extract flight details from booking confirmation
  9. Format for calendar API
  10. Add to calendar (this might also fail)
  11. Confirm success to user

That’s 11 steps with multiple decision points and error states. An LLM can’t do this. It can generate text that looks like it did this, but it can’t actually execute the steps and adapt based on outcomes.

Planning means breaking fuzzy goals into executable steps and handling the inevitable failures along the way.

When You Actually Need an Agent

Not every problem needs an agent. Most don’t. Here’s a rough guide:

Use an LLM directly when:

  • You just need text generation (summaries, rewrites, explanations)
  • The task is single-shot (one input, one output)
  • You don’t need current data or external actions
  • Latency matters (agents add overhead)

Use an agent when:

  • You need to call multiple APIs based on conditions
  • The task requires multi-step reasoning
  • You need error recovery and retry logic
  • Users expect the system to “figure it out” rather than follow explicit instructions

The deciding factor is usually decision-making under uncertainty. If you can write a script with if-statements that handles all cases, use the script. If you need the system to figure out what to do based on context, that’s when agents make sense.

Three Real Examples

Customer support bot – Most don’t need to be agents. They’re fine at looking up articles and answering questions. But if you want them to check order status, process refunds, and escalate to humans when needed? Now you need autonomy, tools, and decision-making.

Research assistant – A system that searches papers, extracts key findings, and generates summaries is perfect for agents. It needs to decide which papers are relevant, adapt search strategies based on initial results, and synthesize information from multiple sources.

Code reviewer – Analyzing pull requests, running tests, checking style guides, and posting comments. This needs tools (Git API, test runners), multi-step planning, and error handling. Classic agent territory.

Starting Simple

When you build your first agent, resist the temptation to add everything at once. Start with:

  1. One tool (maybe a web search or database query)
  2. Basic conversation memory (just track the last few messages)
  3. Simple decision logic (if user asks about X, call tool Y)
  4. Explicit error handling (what happens when the tool fails?)

Get that working reliably before adding planning, reflection, or multi-agent coordination. The complexity multiplies fast.

I learned this the hard way. Built a “research agent” with 12 tools, complex planning logic, and multi-step reasoning. Spent three weeks debugging edge cases. Rebuilt it with 3 tools and simple logic. Worked better and shipped in two days.

Production Realities

Running agents in production means dealing with issues you don’t face with static LLM calls:

Observability – You need to see what the agent is doing. Log every LLM call, every tool invocation, every decision point. When something breaks (and it will), you need to reconstruct exactly what happened.

Cost control – Set maximum token budgets per request. Cap the number of tool calls. Use caching aggressively for repeated operations. An agent can burn through thousands of tokens if it gets stuck in a loop.

Safety guardrails – Which tools can execute automatically vs requiring human approval? What actions are never allowed? How do you handle sensitive data in tool arguments?

Graceful degradation – When a tool fails, can the agent accomplish the goal another way? Or should it just tell the user it can’t help? Design for partial success, not just all-or-nothing.

These aren’t optional. They’re the difference between a demo and a production system.

The Mental Model Shift

The hardest part isn’t the code. It’s changing how you think about the system.

With LLMs, you’re optimizing prompts to get better completions. With agents, you’re designing decision spaces and constraining behavior. It’s closer to building an API than writing a prompt.

You stop asking “how do I get it to say this?” and start asking “what decisions does it need to make?” and “how do I prevent bad decisions?”

This shift took me longer than learning the technical pieces. I kept trying to solve agent problems with better prompts when I needed better architecture.

What I Wish I’d Known

Before building my first production agent, I wish someone had told me:

Logging is not optional. You will spend hours debugging. Good logs make the difference between “I have no idea what happened” and “oh, it’s calling the wrong tool on step 7.”

Start with deterministic baselines. Before building the agent, write a script that solves the problem with if-statements. This gives you something to compare against and helps you understand the decision logic.

Most complexity is not AI complexity. It’s error handling, state management, API retries, and data validation. The LLM is often the simplest part.

Users don’t care about your architecture. They care whether it works. A simple agent that reliably solves their problem beats a sophisticated agent that’s impressive but breaks often.

Building Your First Agent

If you’re ready to try this, here’s what I’d recommend:

Start with a weather agent. It’s simple enough to finish but complex enough to teach you the core concepts:

Tools:

  • get_weather(location) – fetches current weather
  • geocode(city_name) – converts city names to coordinates

Decision logic:

  • Does user query include a location?
  • If yes, call get_weather directly
  • If no, ask for location or use default
  • Handle API failures gracefully

Memory:

  • Remember the user’s last location
  • Don’t ask again if they query weather multiple times

Build this and you’ll hit most of the core challenges. Tool calling, error handling, state management, and decision logic. It’s a good litmus test for whether you understand the fundamentals.
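
Here is a minimal sketch of that decision logic. The extract_location heuristic is deliberately naive, and get_weather and geocode stand in for whatever tools you actually wire up:

def extract_location(query: str):
    """Naive location extraction: grab whatever follows the word 'in'."""
    words = query.lower().split()
    if "in" in words:
        return " ".join(query.split()[words.index("in") + 1:]).strip("?.! ") or None
    return None

def weather_agent(query: str, memory: dict, get_weather, geocode) -> str:
    """Tiny weather agent: resolve a location, call the tool, handle failure."""
    location = extract_location(query) or memory.get("last_location")
    if not location:
        return "Which city would you like the weather for?"

    memory["last_location"] = location          # Don't ask again on the next query
    try:
        lat, lon = geocode(location)            # Your geocoding tool
        report = get_weather(lat, lon)          # Your weather tool
        return f"It's currently {report['temp']}°F and {report['conditions']} in {location}."
    except Exception as e:
        return f"Sorry, I couldn't fetch the weather for {location} right now ({e})."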

Where This Goes Next

Once you have a basic agent working, the natural progression is:

  • Better planning algorithms (ReAct, Tree of Thoughts, etc.)
  • More sophisticated memory (vector databases, episodic storage)
  • Multi-agent coordination (specialized agents working together)
  • Evaluation frameworks (how do you know if it’s working?)
  • Production infrastructure (monitoring, cost controls, safety)

But all of that builds on the core loop: observe, plan, act, observe. Master that first. Everything else is elaboration.

The Real Difference

The shift from LLMs to agents isn’t about better models or fancier prompts. It’s about giving language models the ability to do things.

Text generation is powerful. But generation plus action? That’s when things get genuinely useful. When your system can not just tell you the answer but actually execute the steps to get there.

That’s the promise of agents. And also why they’re harder to build than you expect.


Have you built any AI agents? What surprised you most about the difference from working with LLMs directly? Let me know what patterns you’ve discovered.


Code and Resources

If you want to dive deeper, I’ve put together a complete codebase with working examples of everything discussed here:

Building Real-World Agentic AI Systems with LangGraph – GitHub

The repository includes baseline chatbots, tool-calling agents, weather agents, and all the production patterns we covered. Start with module-01 for the fundamentals.

Further Reading

  • ReAct: Synergizing Reasoning and Acting (Yao et al., 2023) – The foundation paper for modern agent architectures. Shows how interleaving reasoning and acting improves agent performance.
  • Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) – Explores how agents can learn from mistakes through self-reflection.
  • Toolformer (Schick et al., 2023) – Deep dive into how models learn to use tools effectively.

About This Series

This post is part of Building Real-World Agentic AI Systems with LangGraph, a comprehensive guide to production-ready AI agents.

How Google’s SynthID Actually Works: A Visual Breakdown

1. Introduction

I spent the last few days trying to understand how Google’s text watermarking works, and honestly, most explanations I found were either too technical or too vague. So I built a visual explainer to make sense of it for myself—and hopefully for you too.

Let me walk you through what I learned.

Visual explainer: How Google’s SynthID Actually Works (HTML version)

2. The Basic Problem

We’re generating billions of words with AI every day. ChatGPT writes essays, Gemini drafts emails, Claude helps with code. The question everyone’s asking is simple: how do we tell which text comes from an AI and which comes from a person?

You can’t just look at writing quality anymore. AI-generated text sounds natural, flows well, makes sense. Sometimes it’s better than what humans write. So we need something invisible, something embedded in the text itself that only computers can detect.

That’s what SynthID does.

3. Starting With How Language Models Think

Before we get to watermarking, you need to understand how these models actually generate text. They don’t just pick the “best” word for each position. They work with probabilities.

Think about this sentence: “My favorite tropical fruits are mango and ___”

What comes next? Probably “bananas” or “papaya” or “pineapple,” right? The model assigns each possible word a probability score. Bananas might get 85%, papaya gets 10%, pineapple gets 3%, and something completely random like “airplanes” gets 0.001%.

Then it picks from these options, usually choosing high-probability words but occasionally throwing in something less likely to keep things interesting. This randomness is why you get different responses when you ask the same question twice.

Here’s the key insight that makes watermarking possible: when multiple words have similar probabilities, the model has flexibility in which one to choose. And that’s where Google hides the watermark.

4. The Secret Ingredient: A Cryptographic Key

Google generates a secret key—basically a very long random number that only they know. This key determines everything about how the watermark gets embedded.

Think of it like a recipe. The key tells the system exactly which words to favor slightly and which ones to avoid. Without this key, you can’t create the watermark pattern, and you definitely can’t detect it.

This is important for security. If anyone could detect watermarks without the key, they could also forge them or remove them easily. The cryptographic approach makes both much harder.

5. Green Lists and Red Lists

Using the secret key, SynthID splits the entire vocabulary into two groups for each position in the text. Some words go on the “green list” and get a slight boost. Others go on the “red list” and get slightly suppressed.

Let’s say you’re writing about weather. For a particular spot in a sentence, the word “perfect” might be on the green list while “ideal” is on the red list. Both words mean roughly the same thing and both sound natural. But SynthID will nudge the model toward “perfect” just a tiny bit.

How tiny? We’re talking about 2-3% probability adjustments. If “perfect” and “ideal” both had 30% probability, SynthID might bump “perfect” up to 32% and drop “ideal” to 28%. Small enough that it doesn’t change how the text reads, but consistent enough to create a pattern.

And here’s the clever part: these lists change based on the words that came before. The same word might be green in one context and red in another. The pattern looks completely random unless you have the secret key.
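
To make that concrete, here is a toy sketch of the simplified green/red-list scheme described above (not Google’s actual implementation, which layers tournament sampling on top of this idea). A hash of the secret key and the recent context decides each word’s color, and green words get a small probability boost:

import hashlib

def is_green(word: str, context: tuple, secret_key: str) -> bool:
    """Deterministically color a word green or red for this context."""
    digest = hashlib.sha256(f"{secret_key}|{context}|{word}".encode()).digest()
    return digest[0] % 2 == 0                 # Roughly half the vocabulary is green here

def nudge_probabilities(probs: dict, context: tuple, secret_key: str,
                        boost: float = 0.02) -> dict:
    """Shift a little probability mass from red words toward green words."""
    adjusted = {
        w: p + boost if is_green(w, context, secret_key) else max(p - boost, 0.0)
        for w, p in probs.items()
    }
    total = sum(adjusted.values())            # Renormalize back into a distribution
    return {w: p / total for w, p in adjusted.items()}

# "perfect" vs "ideal": both plausible, but only one gets the nudge
probs = {"perfect": 0.30, "ideal": 0.30, "great": 0.25, "lovely": 0.15}
watermarked = nudge_probabilities(probs, context=("the", "weather", "is"), secret_key="demo-key")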

6. Building the Statistical Pattern

As the model generates more and more text, it keeps favoring green list words. Not always—that would be obvious—but more often than chance would predict.

If you’re flipping a coin, you expect roughly 50% heads and 50% tails. With SynthID, you might see 65% green words and 35% red words. That 15% difference is your watermark.

But you need enough text for this pattern to become statistically significant. Google found that 200 words is about the minimum. With shorter text, there isn’t enough data to separate the watermark signal from random noise.

Think of it like this: if you flip a coin three times and get three heads, that’s not surprising. But if you flip it 200 times and get 130 heads, something’s definitely up with that coin.

7. Detection: Finding the Fingerprint

When you want to check if text is watermarked, you need access to Google’s secret key. Then you reconstruct what the green and red lists would have been for that text and count how many green words actually appear.

If the percentage is significantly above 50%, you’ve found a watermark. The more words you analyze, the more confident you can be. Google’s system outputs a score that tells you how likely it is that the text came from their watermarked model.

This is why watermarking isn’t perfect for short text. A tweet or a caption doesn’t have enough words to build up a clear pattern. You might see 60% green words just by chance. But a full essay? That 65% green word rate across 500 words is virtually impossible to happen randomly.
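
Continuing the toy sketch above (same is_green function, same three-word context window), detection recolors every word with the secret key and asks how unlikely the observed green fraction would be by chance, essentially a one-sided z-test against 50%:

import math

def detect_watermark(words: list, secret_key: str) -> dict:
    """Score a text: how surprising is this many green words with no watermark?"""
    n = len(words)
    if n == 0:
        return {"green_fraction": 0.0, "z_score": 0.0, "likely_watermarked": False}

    green = sum(
        is_green(word, tuple(words[max(0, i - 3):i]), secret_key)
        for i, word in enumerate(words)
    )
    # Under the null hypothesis each word is green with probability 0.5
    z = (green - 0.5 * n) / math.sqrt(0.25 * n)
    return {"green_fraction": green / n, "z_score": z, "likely_watermarked": z > 4.0}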

8. Why Humans Can’t See It

The adjustments are so small that they don’t change which words the model would naturally choose. Both “perfect” and “ideal” sound fine in most contexts. Both “delicious” and “tasty” work for describing food. The model is just picking between equally good options.

To a human reader, watermarked and unwatermarked text are indistinguishable. Google tested this with 20 million actual Gemini responses. They let users rate responses with thumbs up or thumbs down. Users showed absolutely no preference between watermarked and regular text.

The quality is identical. The style is identical. The meaning is identical. The only difference is a statistical bias that emerges when you analyze hundreds of words with the secret key.

9. What Actually Works and What Doesn’t

Google’s been pretty honest about SynthID’s limitations, which I appreciate.

It works great for:

  • Long-form creative writing
  • Essays and articles
  • Stories and scripts
  • Open-ended generation where many word choices are possible

It struggles with:

  • Factual questions with one right answer (What’s the capital of France? It’s Paris—no flexibility there)
  • Short text under 200 words
  • Code generation (syntax is too rigid)
  • Text that gets heavily edited or translated

The watermark can survive light editing. If you change a few words here and there, the overall pattern still holds up. But if you rewrite everything or run it through Google Translate, the pattern breaks down.

And here’s the uncomfortable truth: determined attackers can remove the watermark. Researchers showed you can do it for about $50 worth of API calls. You query the watermarked model thousands of times, figure out the pattern statistically, and then use that knowledge to either remove watermarks or forge them.

10. The Bigger Context

SynthID isn’t just a technical demo. It’s the first large-scale deployment of text watermarking that actually works in production. Millions of people use Gemini every day, and most of that text is now watermarked. They just don’t know it.

Google open-sourced the code in October 2024, which was a smart move. It lets researchers study the approach, find weaknesses, and build better systems. It also gives other companies a working example if they want to implement something similar.

The EU AI Act is starting to require “machine-readable markings” for AI content. Other jurisdictions are considering similar rules. SynthID gives everyone something concrete to point to when discussing what’s actually possible with current technology.

11. My Takeaway After Building This

The more I learned about watermarking, the more I realized it’s not the complete solution everyone wants it to be. It’s more like one tool in a toolkit.

You can’t watermark everything. You can’t make it unremovable. You can’t prove something wasn’t AI-generated just because you don’t find a watermark. And it only works if major AI providers actually implement it, which many won’t.

But for what it does—allowing companies to verify that text came from their models when it matters—it works remarkably well. The fact that it adds almost no overhead and doesn’t affect quality is genuinely impressive engineering.

What struck me most is the elegance of the approach. Using the natural randomness in language model generation to hide a detectable pattern is clever. It doesn’t require changing the model architecture or training process. It just tweaks the final step where words get selected.

12. If You Want to Try It Yourself

Google released the SynthID code on GitHub. If you’re comfortable with Python and have access to a language model, you can experiment with it. The repository includes examples using Gemma and GPT-2.

Fair warning: it’s not plug-and-play. You need to understand how to modify model output distributions, and you need a way to run the model locally or through an API that gives you token-level access. But it’s all there if you want to dig into the details.

The Nature paper is also worth reading if you want the full technical treatment. They go into the mathematical foundations, describe the tournament sampling approach, and share detailed performance metrics across different scenarios.

13. Where This Goes Next

Watermarking is just getting started. Google proved it can work at scale, but there’s still a lot to figure out.

Researchers are working on making watermarks more robust against attacks. They’re exploring ways to watermark shorter text. They’re trying to handle code and factual content better. They’re designing systems that work across multiple languages and survive translation.

There’s also the question of whether we need universal standards. Right now, each company could implement their own watermarking scheme with their own secret keys. That fragments the ecosystem and makes detection harder. But getting competitors to coordinate on technical standards is always tricky.

And of course, there’s the bigger question of whether watermarking is even the right approach for AI governance. It helps with attribution and accountability, but it doesn’t prevent misuse. It doesn’t stop bad actors from using unwatermarked models. It doesn’t solve the fundamental problem of AI-generated misinformation.

Those are harder problems that probably need policy solutions alongside technical ones.

14. Final Thoughts

I worked on this visual explainer because I wanted to understand how SynthID actually works, beyond the marketing language and vague descriptions. Building the visual explainer forced me to understand every detail—you can’t visualize something you don’t really get.

What I came away with is respect for how well-engineered the system is, combined with realism about its limitations. It’s impressive technical work that solves a real problem within specific constraints. It’s also not magic and won’t fix everything people hope it will.

If you’re interested in AI safety, content authenticity, or just how these systems work under the hood, it’s worth understanding. Not because watermarking is the answer, but because it shows what’s actually possible with current approaches and where the hard limits are.

And sometimes those limits tell you more than the capabilities do.

Open Source AI’s Original Sin: The Illusion of Democratization

When Meta released LLaMA as “open source” in February 2023, the AI community celebrated. Finally, the democratization of AI we’d been promised. No more gatekeeping by OpenAI and Google. Anyone could now build, modify, and deploy state-of-the-art language models.

Except that’s not what happened. A year and a half later, the concentration of AI power hasn’t decreased—it’s just shifted. The models are “open,” but the ability to actually use them remains locked behind the same economic barriers that closed models had. We traded one form of gatekeeping for another, more insidious one.

The Promise vs. The Reality

The open source AI narrative goes something like this: releasing model weights levels the playing field. Small startups can compete with tech giants. Researchers in developing countries can access cutting-edge technology. Independent developers can build without permission. Power gets distributed.

But look at who’s actually deploying these “open” models at scale. It’s the same handful of well-funded companies and research institutions that dominated before. The illusion of access masks the reality of a new kind of concentration—one that’s harder to see and therefore harder to challenge.

The Compute Barrier

Running Base Models

LLaMA-2 70B requires approximately 140GB of VRAM just to load into memory. A single NVIDIA A100 GPU (80GB) costs around $10,000 and you need at least two for inference. That’s $20,000 in hardware before you serve a single request.

Most developers can’t afford this. So they turn to cloud providers. AWS charges roughly $4-5 per GPU-hour for A100s, so an instance with 8x A100 GPUs runs $30-40 per hour. Running 24/7, that’s well over $20,000 per month. For a single model. Before any users.

Compare this to GPT-4’s API: $0.03 per 1,000 tokens. You can build an application serving thousands of users for hundreds of dollars. The “closed” model is more economically accessible than the “open” one for anyone without serious capital.

The Quantization Trap

“Just quantize it,” they say. Run it on consumer hardware. And yes, you can compress LLaMA-2 70B down to 4-bit precision and squeeze it onto a high-end gaming PC with 48GB of RAM. But now your inference speed is 2-3 tokens per second. GPT-4 through the API serves 40-60 tokens per second.

You’ve traded capability for access. The model runs, but it’s unusable for real applications. Your users won’t wait 30 seconds for a response. So you either scale up to expensive infrastructure or accept that your “open source” model is a toy.

The Fine-Tuning Fortress

Training Costs

Base models are rarely production-ready. They need fine-tuning for specific tasks. Full fine-tuning of LLaMA-2 70B for a specialized domain costs $50,000-$100,000 in compute. That’s training for maybe a week on 32-64 GPUs.

LoRA and other parameter-efficient methods reduce this, but you still need $5,000-$10,000 for serious fine-tuning. OpenAI’s fine-tuning API? $8 per million tokens for training, then standard inference pricing. For most use cases, it’s an order of magnitude cheaper than self-hosting an open model.

Data Moats

But money is only part of the barrier. Fine-tuning requires high-quality training data. Thousands of examples, carefully curated, often hand-labeled. Building this dataset costs more than the compute—you need domain experts, data labelers, quality control infrastructure.

Large companies already have this data from their existing products. Startups don’t. The open weights are theoretically available to everyone, but the data needed to make them useful is concentrated in the same hands that controlled closed models.

Who Actually Benefits

Cloud Providers

Amazon, Microsoft, and Google are the real winners of open source AI. Every developer who can’t afford hardware becomes a cloud customer. AWS now offers “SageMaker JumpStart” with pre-configured LLaMA deployments. Microsoft has “Azure ML” with one-click open model hosting. They’ve turned the open source movement into a customer acquisition funnel.

The more compute-intensive open models become, the more revenue flows to cloud providers. They don’t need to own the models—they own the infrastructure required to run them. It’s a better business model than building proprietary AI because they capture value from everyone’s models.

Well-Funded Startups

Companies that raised $10M+ can afford to fine-tune and deploy open models. They get the benefits of customization without the transparency costs of closed APIs. Your fine-tuned LLaMA doesn’t send data to OpenAI for training. This is valuable.

But this creates a new divide. Funded startups can compete using open models. Bootstrapped founders can’t. The barrier isn’t access to weights anymore—it’s access to capital. We’ve replaced technical gatekeeping with economic gatekeeping.

Research Institutions

Universities with GPU clusters benefit enormously. They can experiment, publish papers, train students. This is genuinely valuable for advancing the field. But it doesn’t democratize AI deployment—it democratizes AI research. Those are different things.

A researcher at Stanford can fine-tune LLaMA and publish results. A developer in Lagos trying to build a business cannot. The knowledge diffuses, but the economic power doesn’t.

The Developer Experience Gap

OpenAI’s API takes 10 minutes to integrate. Three lines of code and you’re generating text. LLaMA requires setting up infrastructure, managing deployments, monitoring GPU utilization, handling model updates, implementing rate limiting, and building evaluation pipelines. It’s weeks of engineering work before you write your first line of application code.

Yes, there are platforms like Hugging Face Inference Endpoints and Replicate that simplify this. But now you’re paying them instead of OpenAI, often at comparable prices. The “open” model stopped being open the moment you need it to actually work.

The Regulatory Capture

Here’s where it gets really interesting. As governments start regulating AI, compute requirements become a regulatory moat. The EU AI Act, for instance, has different tiers based on model capabilities and risk. High-risk models face stringent requirements.

Who can afford compliance infrastructure? Companies with capital. Who benefits from regulations that require extensive testing, monitoring, and safety measures? Companies that can amortize these costs across large user bases. Open source was supposed to prevent regulatory capture, but compute requirements ensure it anyway.

We might end up with a future where model weights are technically open, but only licensed entities can legally deploy them at scale. Same outcome as closed models, just with extra steps.

The Geographic Divide

NVIDIA GPUs are concentrated in North America, Europe, and parts of Asia. A developer in San Francisco can buy or rent A100s easily. A developer in Nairobi faces import restrictions, limited cloud availability, and 3-5x markup on hardware.

Open source was supposed to help developers in emerging markets. Instead, it created a new form of digital colonialism: we’ll give you the recipe, but the kitchen costs $100,000. The weights are free, but the compute isn’t. Same power concentration, new mechanism.

The Environmental Cost

Every startup running its own LLaMA instance is replicating infrastructure that could be shared. If a thousand companies each deploy their own 70B model, that’s thousands of GPUs running 24/7 instead of one shared cluster serving everyone through an API.

Ironically, centralized APIs are more energy-efficient. OpenAI’s shared infrastructure has better utilization than thousands of individually deployed models. We’re burning extra carbon for the ideology of openness without achieving actual decentralization.

What Real Democratization Would Look Like

If we’re serious about democratizing AI, we need to address the compute bottleneck directly.

Public compute infrastructure. Government-funded GPU clusters accessible to researchers and small businesses. Like public libraries for AI. The EU could build this for a fraction of what they’re spending on AI regulation.

Efficient model architectures. Research into models that actually run on consumer hardware without quality degradation. We’ve been scaling up compute instead of optimizing efficiency. The incentives are wrong—bigger models generate more cloud revenue.

Federated fine-tuning. Techniques that let multiple parties contribute to fine-tuning without centralizing compute or data. This is technically possible but underdeveloped because it doesn’t serve cloud providers’ interests.

Compute co-ops. Developer collectives that pool resources to share inference clusters. Like how small farmers form cooperatives to share expensive equipment. This exists in limited forms but needs better tooling and organization.

Transparent pricing. If you’re charging for “open source” model hosting, you’re not democratizing—you’re arbitraging. True democratization means commodity pricing on inference, not vendor lock-in disguised as open source.

The Uncomfortable Truth

Open source AI benefits the same people that closed AI benefits, just through different mechanisms. It’s better for researchers and well-funded companies. It’s not better for individual developers, small businesses in emerging markets, or people without access to capital.

We convinced ourselves that releasing weights was democratization. It’s not. It’s shifting the bottleneck from model access to compute access. For most developers, that’s a distinction without a difference.

The original sin isn’t releasing open models—that’s genuinely valuable. The sin is calling it democratization while ignoring the economic barriers that matter more than technical ones. We’re building cathedrals and wondering why only the wealthy enter, forgetting that doors without ramps aren’t really open.

Real democratization would mean a developer in any country can fine-tune and deploy a state-of-the-art model for $100 and an afternoon of work. We’re nowhere close. Until we address that, open source AI remains an aspiration, not a reality.

The weights are open. The power isn’t.

The Tyranny of the Mean: Population-Based Optimization in Healthcare and AI

Modern healthcare and artificial intelligence face a common challenge in how they handle individual variation. Both systems rely on population-level statistics to guide optimization, which can inadvertently push individuals toward averages that may not serve them well. More interesting still, both fields are independently discovering similar solutions—a shift from standardized targets to personalized approaches that preserve beneficial diversity.

Population Averages as Universal Targets

Healthcare’s Reference Ranges

Traditional medical practice establishes “normal” ranges by measuring population distributions. Blood pressure guidelines from the American Heart Association define 120/80 mmHg as optimal. The World Health Organization sets body mass index between 18.5 and 24.9 as the normal range. The American Diabetes Association considers fasting glucose optimal when it falls between 70 and 100 mg/dL. These ranges serve an essential function in identifying pathology, but their origin as population statistics rather than individual optima creates tension in clinical practice.

Elite endurance athletes routinely maintain resting heart rates between 40 and 50 beats per minute, well below the standard range of 60 to 100 bpm. This bradycardia reflects cardiac adaptation rather than dysfunction—their hearts pump more efficiently per beat, requiring fewer beats to maintain circulation. Treating such athletes to “normalize” their heart rates would be counterproductive, yet this scenario illustrates how population-derived ranges can mislead when applied universally.

The feedback mechanism compounds over time. When clinicians routinely intervene to move patients toward reference ranges, the population distribution narrows. Subsequent range calculations derive from this more homogeneous population, potentially tightening targets further. Natural variation that was once common becomes statistically rare, then clinically suspicious.

Language Models and Statistical Patterns

Large language models demonstrate a parallel phenomenon in their optimization behavior. These systems learn probability distributions over sequences of text, effectively encoding which expressions are most common for conveying particular meanings. When users request improvements to their writing, the model suggests revisions that shift the text toward higher-probability regions of this learned distribution—toward the statistical mode of how millions of other people have expressed similar ideas.

This process systematically replaces less common stylistic choices with more typical alternatives. Unusual metaphors get smoothed into familiar comparisons. Regional variations in vocabulary and grammar get normalized to a global standard. Deliberate syntactic choices that create specific rhetorical effects get “corrected” to conventional structures. The model isn’t making errors in this behavior—it’s doing exactly what training optimizes it to do: maximize the probability of generating text that resembles its training distribution.
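
A deliberately crude caricature of that objective makes the effect visible: if “improve” means “move to the phrasing the corpus considers most probable,” the most common variant wins no matter what the writer originally chose. The phrase counts below are invented:

from collections import Counter

corpus_counts = Counter({
    "at the end of the day": 9_000,
    "when all is said and done": 700,
    "once the dust settles on the ledger": 3,   # a writer's distinctive choice
})

def suggest_edit(phrase: str) -> str:
    # "Improve" by selecting the highest-probability variant in the corpus
    return corpus_counts.most_common(1)[0][0]

print(suggest_edit("once the dust settles on the ledger"))
# -> "at the end of the day", regardless of the input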

Similar feedback dynamics appear here. Models train on diverse human writing and learn statistical patterns. People use these models to refine their prose, shifting it toward common patterns. That AI-influenced writing becomes training data for subsequent models. With each iteration, the style space that models learn contracts around increasingly dominant modes.

Precision Medicine’s Response

The healthcare industry has recognized that population averages make poor universal targets and developed precision medicine as an alternative framework. Rather than asking whether a patient’s metrics match population norms, precision medicine asks whether those metrics are optimal given that individual’s genetic makeup, microbiome composition, environmental context, and lifestyle factors.

Commercial genetic testing services like 23andMe and AncestryDNA have made personal genomic data accessible to millions of people. This genetic information reveals how individuals metabolize medications differently, process nutrients through distinct biochemical pathways, and carry polymorphisms that alter their baseline risk profiles. A cholesterol level that predicts cardiovascular risk in one genetic context may carry different implications in another.

Microbiome analysis adds another layer of personalization. Research published by Zeevi et al. in Cell (2015) demonstrated that individuals show dramatically different glycemic responses to identical foods based on their gut bacterial composition. Companies like Viome and DayTwo now offer commercial services that analyze personal microbiomes to generate nutrition recommendations tailored to individual metabolic responses rather than population averages.

Continuous monitoring technologies shift the focus from population comparison to individual trend analysis. Continuous glucose monitors from Dexcom and Abbott’s FreeStyle Libre track glucose dynamics throughout the day. Smartwatches monitor heart rate variability as an indicator of autonomic nervous system function. These devices establish personal baselines and detect deviations from an individual’s normal patterns rather than measuring deviation from population norms.

Applying Precision Concepts to Language Models

The techniques that enable precision medicine suggest analogous approaches for language models. Current systems could be modified to learn and preserve individual stylistic signatures while still improving clarity and correctness. The technical foundations already exist in various forms across the machine learning literature.

Fine-tuning methodology, now standard for adapting models to specific domains, could be applied at the individual level. A model fine-tuned on a person’s past writing would learn their characteristic sentence rhythms, vocabulary preferences, and stylistic patterns. Rather than suggesting edits that move text toward a global statistical mode, such a model would optimize toward patterns characteristic of that individual writer.
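
A minimal sketch of what person-level adaptation could look like with LoRA adapters, assuming a Hugging Face causal language model; the model name, hyperparameters, and data handling below are placeholders rather than a vetted recipe:

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import Dataset

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder: any Llama-style causal LM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small adapter matrices; the base model's general ability stays frozen.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

author_texts = ["..."]   # this one writer's past documents
ds = Dataset.from_dict({"text": author_texts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="author-lora", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
# trainer.train()   # the resulting adapter nudges suggestions toward this author's patterns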

Research on style transfer, including work by Lample et al. (2019) on multiple-attribute text rewriting, has shown that writing style can be represented as vectors in latent space. Conditioning text generation on these style vectors enables controlled variation in output characteristics. A system that extracted style embeddings from an author’s corpus could use those embeddings to preserve stylistic consistency while making other improvements.
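
One rough way to approximate this without training a dedicated style model: average sentence embeddings over an author’s corpus as a stand-in “style vector,” then score candidate rewrites by how far they drift from it. The encoder named below is an assumption, and a general semantic encoder is only a crude proxy for true stylistic features:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumption: any sentence encoder would do

author_corpus = ["Sentences this author has written before.", "More of their past writing."]
style_vector = encoder.encode(author_corpus).mean(axis=0)   # crude "style" signature

def voice_preservation_score(candidate_edit: str) -> float:
    # Higher = closer to the author's established signature
    return float(util.cos_sim(encoder.encode(candidate_edit), style_vector))

# An editing system could accept the clearest rewrite whose score stays above a
# threshold, rather than the rewrite the global model finds most probable.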

Constrained generation techniques allow models to optimize for multiple objectives simultaneously. Constraints could maintain statistical properties of an individual’s writing—their typical vocabulary distribution, sentence length patterns, or syntactic structures—while still optimizing for clarity within those boundaries. This approach parallels precision medicine’s goal of optimizing health outcomes within the constraints of an individual’s genetic and metabolic context.
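
A sketch of what such constraints might look like at the level of surface statistics, with arbitrary placeholder thresholds:

import re
from statistics import mean

def style_stats(text: str):
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = text.lower().split()
    return {
        "avg_sentence_len": mean(len(s.split()) for s in sentences),
        "vocab": set(words),
    }

def within_constraints(original: str, rewrite: str,
                       len_tolerance: float = 0.25, vocab_overlap_min: float = 0.6) -> bool:
    o, r = style_stats(original), style_stats(rewrite)
    len_ok = abs(r["avg_sentence_len"] - o["avg_sentence_len"]) <= len_tolerance * o["avg_sentence_len"]
    overlap = len(o["vocab"] & r["vocab"]) / max(len(o["vocab"]), 1)
    return len_ok and overlap >= vocab_overlap_min

# A generation loop could discard candidate edits that fail within_constraints(),
# optimizing clarity only inside the writer's own stylistic envelope.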

Reinforcement learning from human feedback, as described by Ouyang et al. (2022), currently aggregates preferences across users to train generally applicable models. Implementing RLHF at the individual level would allow models to learn person-specific preferences about which edits preserve voice and which introduce unwanted homogenization. The system would learn not just what makes text “better” in general, but what makes this particular person’s writing more effective without losing its distinctive character.
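
The core reward-modeling objective in that line of work is a pairwise preference loss; scoping it to a single user is largely a matter of whose comparisons feed it. A minimal PyTorch rendering:

import torch
import torch.nn.functional as F

def per_user_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over this user's comparisons
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Three comparisons from one user: scores a per-user reward head gave to the
# edit they kept (chosen) vs. the edit they rejected
loss = per_user_preference_loss(torch.tensor([1.2, 0.4, 0.9]),
                                torch.tensor([0.3, 0.6, 0.1]))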

Training objectives could explicitly reward stylistic diversity rather than purely minimizing loss against a training distribution. Instead of convergence toward a single mode, such objectives would encourage models to maintain facility with a broad range of stylistic choices. This mirrors precision medicine’s recognition that healthy human variation spans a range rather than clustering around a single optimum.
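
One simple form such an objective could take: standard cross-entropy minus an entropy bonus on the output distribution, so the model is not rewarded for collapsing onto a single dominant phrasing. The weighting is a made-up hyperparameter and this is a sketch, not a tested recipe:

import torch
import torch.nn.functional as F

def diversity_regularized_loss(logits: torch.Tensor, targets: torch.Tensor,
                               entropy_weight: float = 0.01) -> torch.Tensor:
    # Standard next-token cross-entropy...
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # ...minus a bonus for keeping probability mass spread over alternatives
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
    return ce - entropy_weight * entropy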

Implementation Challenges

Precision medicine didn’t emerge from purely technical innovation. It developed through sustained institutional commitment, including recognition that population-based approaches were failing certain patients, substantial investment in genomic infrastructure and data systems, regulatory frameworks for handling personal genetic data, and cultural shifts in how clinicians think about treatment targets. Building precision language systems faces analogous challenges beyond the purely technical.

Data requirements differ significantly from current practice. Personalized models need sufficient examples of an individual’s writing to learn their patterns, raising questions about privacy and data ownership. Training infrastructure would need to support many distinct model variants rather than a single universal system. Evaluation metrics would need to measure style preservation alongside traditional measures of fluency and correctness.

More fundamentally, building such systems demands a shift from treating diversity as noise to be averaged away toward treating it as signal to be preserved. This parallels the conceptual shift in medicine from viewing outliers as problems requiring correction toward understanding them as potentially healthy variations. The technical capabilities exist, but deploying them intentionally requires first recognizing that convergence toward statistical modes, while appearing optimal locally, may be problematic globally.

Both healthcare and AI have built optimization systems that push toward population averages. Healthcare recognized the limitations of this approach and developed precision medicine as an alternative. AI can learn from that trajectory, building systems that help individuals optimize for their own patterns rather than converging everyone toward a statistical mean.

References

  • American Heart Association. Blood pressure guidelines. https://www.heart.org
  • World Health Organization. BMI Classification. https://www.who.int
  • American Diabetes Association. Standards of Medical Care in Diabetes.
  • Zeevi, D., Korem, T., Zmora, N., et al. (2015). Personalized Nutrition by Prediction of Glycemic Responses. Cell, 163(5), 1079-1094. DOI: 10.1016/j.cell.2015.11.001
  • Lample, G., Subramanian, S., Smith, E., Denoyer, L., Ranzato, M., & Boureau, Y-L. (2019). Multiple-Attribute Text Rewriting. International Conference on Learning Representations.
  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155

The Splintered Web: India 2025

Think about how you use the internet today. You Google something, or you ask ChatGPT. You scroll through Twitter or Instagram. You read the news on your phone. Simple, right?

But something big is changing. The internet is splitting into three different worlds. They’ll all exist on your phone, but they’ll be completely different experiences. And most people won’t even know which one they’re using.

Layer 1: The Premium Internet (Only for Those Who Can Pay)

Imagine The Hindu or Indian Express, but they charge you ₹5,000 per month. Why so much? Because they promise that no AI has touched their content. Every article is written by real journalists, edited by real editors, and meant to be read completely—not just summarized by ChatGPT.

This isn’t just about paywalls. It’s about the full experience. Like reading a well-written book versus reading chapter summaries on Wikipedia. You pay for the writing style, the depth, and the way the story is told.

Think about this: A Bloomberg Terminal costs lakhs per year. Why? Because traders need real, unfiltered information. Now imagine that becoming normal for all good content.

Here’s the problem for India: If quality information becomes expensive, only the rich get the full story. Everyone else gets summaries, shortcuts, and AI-filtered versions. This isn’t just unfair—it’s dangerous for democracy.

Layer 2: The AI Internet (Where Bots Read for You)

This is where most Indians will spend their time. It’s free, but there’s a catch.

You don’t read articles anymore—your AI does. You ask ChatGPT or Google’s Bard: “What happened in the Parliament session today?” The AI reads 50 news articles and gives you a 3-paragraph summary.

Sounds convenient, right? But think about what you’re missing:

  • The reporter’s perspective and context
  • The details that didn’t fit the summary
  • The minority opinions that the AI filtered out
  • The emotional weight of the story

Now add another problem: Most content will be written by AI, too. AI writing for AI readers. News websites will generate hundreds of articles daily because that’s what gets picked up by ChatGPT and Google.

Think about how WhatsApp forwards spread misinformation in India. Now imagine that happening at internet scale, with AI systems copying from each other. One wrong fact gets repeated by 10 AI systems, and suddenly it becomes “truth” because everyone agrees.

Layer 3: The Dark Forest (Where Real People Hide)

This is the most interesting part. When the internet becomes full of AI-generated content and surveillance, real human conversation goes underground.

This is like how crypto communities use private Discord servers. Or how some journalists now share real stories only in closed WhatsApp groups.

These spaces are:

  • Invite-only (you need to know someone to get in)
  • Hard to find (no Google search will show them)
  • High-trust (everyone vouches for everyone else)
  • Small and slow (quality over quantity)

Here’s what happens in these hidden spaces: Real discussions. People actually listening to each other. Long conversations over days and weeks. Experts sharing knowledge freely. Communities solving problems together.

But there’s a big problem: to stay hidden from AI and algorithms, you have to stay hidden from everyone. Great ideas get trapped in small circles. The smartest people become the hardest to find.

Why This Matters for India

India has 750 million internet users. Most are on free platforms—YouTube, Instagram, WhatsApp. Very few pay for premium content.

So what happens when the internet splits?

Rich Indians will pay for premium content. They’ll read full articles, get complete context, and make informed decisions.

Middle-class and poor Indians will use AI summaries. They’ll get the quick version, filtered by algorithms, missing important details.

Tech-savvy Indians will find the dark forest communities. But most people won’t even know these exist.

This creates a new kind of digital divide. Not about who has internet access, but about who has access to real information.

The Election Problem

Imagine the 2029 elections. Different people are getting their news from different layers:

  • Premium readers get in-depth analysis
  • AI-layer users get simplified summaries (maybe biased, maybe incomplete)
  • Dark forest people get unfiltered discussions, but only within their small groups

How do we have a fair election when people aren’t even seeing the same information? And how does fact-checking even work across layers that never see each other’s content?

The Education Problem

Students from rich families will pay for premium learning resources. Clear explanations, quality content, and verified information.

Students from middle-class families will use free AI tools. They’ll get answers, but not always the full understanding. Copy-paste education.

The gap between haves and have-nots becomes a gap between those who understand deeply and those who only know summaries.

Can We Stop This?

Maybe, if we act now. Here’s what could help:

Government-funded quality content: Like Doordarshan once provided free TV, we need free, high-quality internet content. Not AI-generated. Real journalism, real education, accessible to everyone.

AI transparency rules: AI should show its sources. When ChatGPT gives you a summary, you should see which articles it read and what it left out.

Digital literacy programs: People need to understand which layer they’re using and what its limits are. Like how we teach people to spot fake news on WhatsApp, we need to teach them about AI filtering.

Public internet infrastructure: Community spaces that aren’t controlled by big tech. Like public libraries, but for the internet age.

But honestly? The market doesn’t want this. Premium content companies want to charge more. AI companies want to collect more data. Tech platforms want to keep people in their ecosystem.

What You Can Do Right Now

While we can’t stop the internet from splitting, we can be smart about it:

  • Read actual articles sometimes, not just summaries. Your brain works differently when you read the full story.
  • Pay for at least one good news source if you can afford it. Support real journalism.
  • When using AI, ask for sources. Don’t just trust the summary.
  • Join or create small, trusted communities. WhatsApp groups with real discussions, not just forwards.
  • Teach your kids to think critically. To question summaries. To seek original sources.

The Bottom Line

The internet is changing fast. In a few years, we’ll have three different internets:

  • The expensive one with real content
  • The free one where AI does your reading
  • The hidden one where real people connect

Most Indians will end up in the middle layer—the AI layer. Getting quick answers, but missing the full picture. This isn’t just about technology. It’s about who gets to know the truth, who gets to make informed decisions, and who gets to participate fully in democracy.

We need to talk about this now, while we still have a common internet to have this conversation on.

The question is not whether the internet will split. It’s already happening. The question is: Will we let it create a new class divide in India, or will we fight to keep quality information accessible to everyone?

Which internet are you using right now? Do you even know?

The AI Ouroboros: How Gen AI is Eating Its Own Tail

Imagine a photocopier making copies of copies. Each generation gets a little blurrier, a little more degraded. That’s essentially what’s happening with Gen AI models today, and this diagram maps out exactly how.

The Cycle Begins

It starts innocently enough. An AI model (Generation N) creates content—articles, images, code, whatever. This content gets posted online, where it mingles with everything else on the web. So far, so good.

The Contamination Point

Here’s where things get interesting. Web scrapers come along, hoovering up data to build training datasets for the next generation of AI. They can’t always tell what’s human-made and what’s AI-generated. So both get scooped up together.

The diagram highlights this as the critical “Dataset Composition” decision point—that purple node where synthetic and human data merge. With each cycle, the ratio shifts. More AI content, less human content. The dataset is slowly being poisoned by its own output.

The Degradation Cascade

Train a new model (Generation N+1) on this contaminated data, and four things happen:

  • Accuracy drops: The model makes more mistakes
  • Creativity diminishes: It produces more generic, derivative work
  • Biases amplify: Whatever quirks existed get exaggerated
  • Reliability tanks: You can’t trust the outputs as much

The Vicious Circle Closes

Now here’s the kicker: this degraded Generation N+1 model goes out into the world and creates more content, which gets scraped again, which trains Generation N+2, which is even worse. Round and round it goes, each loop adding another layer of synthetic blur.
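
A toy simulation makes the dynamic concrete. The per-cycle volumes below are invented; the point is only that the synthetic share rises monotonically unless something breaks the loop:

human_per_cycle = 1.0     # units of new human-written content added each cycle (invented)
ai_per_cycle = 3.0        # units of new AI-generated content added each cycle (invented)
corpus_human, corpus_ai = 100.0, 0.0    # starting web corpus: all human

for generation in range(1, 6):
    corpus_human += human_per_cycle
    corpus_ai += ai_per_cycle
    synthetic_share = corpus_ai / (corpus_human + corpus_ai)
    print(f"Gen N+{generation}: {synthetic_share:.0%} of scraped data is synthetic")
# The share only rises unless filtering, labeling, or paid human data breaks the loop.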

The Human Data Squeeze

Meanwhile, clean human-generated data becomes the gold standard—and increasingly rare. The blue pathway in the diagram shows this economic reality. As AI floods the web with synthetic content, finding authentic human data becomes harder and more expensive. It’s basic supply and demand, except the supply is being drowned in synthetic noise.

Why This Matters

This isn’t just a theoretical problem. We’re watching it happen in real-time. The diagram shows a self-reinforcing cycle with no natural brake. Unless we actively intervene—by filtering training data, marking AI content, or preserving human data sources—each generation of AI models will be trained on an increasingly polluted dataset.

The arrows loop back on themselves for a reason. This is a feedback system, and feedback systems can spiral. Understanding this flow is the first step to breaking it.