
From LLMs to Agents: The Mindset Shift Nobody Talks About

#llm-vs-agent #agentic-ai #tool-use #agent-memory #agent-planning #production-ai #langgraph #ollama #agent-failure-modes #ai-engineering

An LLM responds. An agent decides.

The Question Most Engineers Get Wrong

Here's a question I ask engineers who are just getting into agents: "What's the difference between an LLM and an agent?"

The most common answer: "An agent has tools."

That's not wrong — but it's not the answer either. It's like saying the difference between a car and a self-driving car is that one has cameras. Technically true, completely misses the point.

The actual difference: an LLM generates text based on what you give it. An agent observes a situation, decides what to do, acts on that decision, and then adjusts based on what happened.

One is a very sophisticated autocomplete. The other is a decision-making system. They share a component — the LLM — but they are fundamentally different things to build, operate, and debug.

This distinction isn't academic. It changes how you architect your system, where it breaks in production, and what it costs to run. If you're treating your agent like a chatbot with extra steps, you'll hit the same walls every team hits — and you'll hit them in production, not in development.

This article is about that mindset shift. What actually makes a system agentic, where agents fail in the real world, and how to think about the spectrum of autonomy before you start building.


What LLMs Actually Do (and Can't Do)

Let's be precise about what a raw LLM call is. You send text in. You get text out. That's it.

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]
)
print(response.content[0].text)
# "I don't have access to real-time weather data..."

Single input. Single output. Stateless. No memory of what you asked five minutes ago. No ability to call an API. No way to observe the result of an action and try something different.

The LLM is doing one thing: predicting the most probable continuation of your text, given everything it was trained on. It's extraordinarily good at this. But prediction is not action.

What LLMs genuinely cannot do alone:

  • Remember across calls — every invocation starts from zero, context window included
  • Access real-time state — training data has a cutoff; the model has no connection to live systems
  • Execute actions — it can describe calling an API, it cannot call one
  • Self-correct from outcomes — there's no feedback loop; it can't observe that its answer was wrong and try again
  • Plan across steps — it tries to answer in one shot; there's no native mechanism to decompose, execute, observe, and continue

This isn't a criticism of LLMs. They're not designed to do these things. The problem is when engineers assume the LLM will figure out the rest. It won't.


The Five Capabilities That Make a System Agentic

A system is agentic when it has five specific capabilities. Not three, not four. All five matter in production.

flowchart LR
    A[🤖 LLM Core] --> B[Autonomy\nDecides without\nstep-by-step instruction]
    A --> C[Planning\nBreaks goals into\nexecutable steps]
    A --> D[Tool Use\nInteracts with\nexternal systems]
    A --> E[Memory\nRetains state\nacross interactions]
    A --> F[Feedback Loops\nObserves outcomes\nand adjusts]

    style A fill:#4A90E2,color:#fff
    style B fill:#6BCF7F,color:#333
    style C fill:#FFD93D,color:#333
    style D fill:#FFA07A,color:#333
    style E fill:#98D8C8,color:#333
    style F fill:#9B59B6,color:#fff

Autonomy

The system makes decisions without you spelling out every step. You say "find last year's sales data" — the agent decides what query to run, which table to look in, what filters to apply. You don't specify any of it.

Important distinction: autonomy doesn't mean unconstrained. Production agents operate within explicit bounds — approved tools, allowed actions, cost caps. Autonomy within guardrails is the goal. Unconstrained autonomy is a production incident waiting to happen.

Planning

Complex requests require decomposition. "Research our competitors and write a comparison report" is not one action — it's a sequence: identify competitors, research each one, gather structured data, synthesize, write. An agent with planning capability breaks this down into executable steps before acting.

Without planning, the agent tries to do everything in one prompt. It fails at anything non-trivial, and the failures are often silent — you get a response that looks reasonable but skipped half the work.
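Concretely, a plan is just structured data the agent produces before any action is taken. A minimal sketch, with illustrative step and tool names:

```python
# A decomposed plan for "research competitors and write a comparison report".
# Step and tool names are illustrative, not a fixed schema.
plan = [
    {"step": "identify_competitors", "tool": "web_search"},
    {"step": "research_each_competitor", "tool": "web_search"},
    {"step": "gather_structured_data", "tool": "extract_data"},
    {"step": "synthesize_findings", "tool": None},  # pure LLM step, no tool
    {"step": "write_report", "tool": None},
]

def tools_needed(plan: list) -> set:
    """Which external tools does this plan require?"""
    return {s["tool"] for s in plan if s["tool"] is not None}
```

Having the plan as data, rather than implicit in a single prompt, is what lets you execute, observe, and resume step by step.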

Tool Use

Tools are how agents act on the world. A tool is just a function the LLM can decide to call:

import requests

# WEATHER_API_URL and API_KEY are assumed to be configured elsewhere.
def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    response = requests.get(f"{WEATHER_API_URL}?q={city}&appid={API_KEY}")
    data = response.json()
    return {
        "temperature": round(data["main"]["temp"] - 273.15, 1),  # Kelvin → Celsius
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"]
    }

The LLM sees the function name, description, and parameter schema. When a user asks about weather, it decides to call this function, constructs the right arguments, and uses the result to formulate its response. The key word is decides — it's not hardcoded to call the weather tool on weather questions. It infers when it's needed.
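For illustration, here is roughly what the model is shown for a weather function like the one above — a schema in the common Ollama/OpenAI tool format (the exact shape varies by provider):

```python
# The tool schema the LLM sees — name, description, parameter types.
# It never sees the Python body. Field layout follows the common
# Ollama/OpenAI function-calling format; providers differ in details.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"}
            },
            "required": ["city"],
        },
    },
}
```

The description and parameter names are doing real work here: they are the only evidence the model has for deciding when and how to call the tool.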

Memory

Without memory, every interaction starts from zero. The agent doesn't know what you said two messages ago, can't reference earlier context, and treats every turn as independent.

There are three types in practice:

Working memory — the current conversation context. The message history you pass into each LLM call. Limited by context window size.

Long-term memory — facts and summaries persisted to a vector database or key-value store. "User is planning a trip to Tokyo." "User prefers metric units." Retrieved and injected into context as needed.

Procedural memory — learned behaviors encoded in the agent's configuration. How it handles a data analysis request vs. a customer query. Often expressed as routing logic or specialized sub-agents.

Memory is not optional for production agents. Without it, your agent is a stateless API — useful, but not agentic.
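The first two memory types can be sketched in a few lines. Class and method names here are illustrative; a real system would back long-term memory with a vector database rather than an in-process dict:

```python
# Minimal sketch of working vs. long-term memory (names are illustrative).
class AgentMemory:
    def __init__(self, max_working_messages: int = 20):
        self.working = []    # current conversation turns (working memory)
        self.long_term = {}  # persisted facts/summaries (long-term memory)
        self.max_working = max_working_messages

    def add_turn(self, role: str, content: str) -> None:
        self.working.append({"role": role, "content": content})
        # When working memory overflows, compress the oldest turn into
        # long-term storage instead of silently dropping it.
        if len(self.working) > self.max_working:
            oldest = self.working.pop(0)
            self.long_term.setdefault("summary", []).append(oldest["content"])

    def context(self) -> list:
        # Inject long-term facts ahead of the live conversation.
        facts = "; ".join(self.long_term.get("summary", []))
        prefix = [{"role": "system", "content": f"Known context: {facts}"}] if facts else []
        return prefix + self.working
```

The key design point: overflow is handled by compression into long-term memory, not by deletion — the information survives even when the verbatim turn doesn't.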

Feedback Loops

The ability to observe the outcome of an action and adjust. This is what separates true agency from scripted tool-calling.

# Attempt 1
result = agent.call_tool("query_db", sql="SELECT * FROM sales")

# Observation: table doesn't exist
if result.error and "relation does not exist" in result.error:
    # Adjust: try to discover the right table name
    schema = agent.call_tool("list_tables")
    correct_table = agent.infer_table(schema, "sales data")
    result = agent.call_tool("query_db", sql=f"SELECT * FROM {correct_table}")

LLMs hallucinate. APIs fail. Table names are wrong. Feedback loops are what allow the agent to recover from these situations rather than returning a confident but incorrect answer.


What an Agentic Interaction Actually Looks Like

Take a request like "Book me a flight to NYC next Tuesday and add it to my calendar."

A chatbot generates text about how it would do that.

An agent does this:

flowchart TD
    U([User Request]) --> P[Parse intent:\nflight + calendar]
    P --> F[Search flights\nfor next Tuesday]
    F --> R{Results found?}
    R -->|Yes| S[Select best option\nbased on preferences]
    R -->|No| E[Inform user\nno flights available]
    S --> C[Confirm with user\nbefore booking]
    C -->|Confirmed| B[Execute booking\nvia flight API]
    B --> CAL[Add to calendar\nvia calendar API]
    CAL --> D([Done: Confirm\nto user])

    style U fill:#4A90E2,color:#fff
    style D fill:#6BCF7F,color:#333
    style C fill:#FFD93D,color:#333
    style R fill:#FFA07A,color:#333

Notice what's happening: multiple decision points, tool calls based on need rather than hardcoded sequence, user-in-the-loop for the consequential step (actually booking), and state tracked across the whole flow. This is what agents are for.


Where Agents Actually Fail in Production

This is the section that matters. The capabilities above are relatively well-documented. The failure modes are what you learn the hard way.

The Infinite Loop Problem

The agent calls a tool, gets a result, decides it needs more information, calls the tool again, gets a similar result, decides it still needs more information, and keeps going until you hit a cost limit or timeout.

Why it happens: there's no convergence criterion. The LLM doesn't have an innate sense of "enough." It will keep gathering information if the task description implies thoroughness without setting a bound.

The fix is explicit: set maximum iteration counts, define termination conditions, and track whether meaningful progress was made in the last N steps. An agent with no termination strategy is not a production agent — it's a money-burning loop.

def run_agent(agent, current_state):
    MAX_STEPS = 10
    MAX_COST_USD = 1.00
    for step in range(MAX_STEPS):
        if estimated_cost() > MAX_COST_USD:
            return AgentResult(status="budget_exceeded")
        if goal_achieved(current_state):
            return AgentResult(status="success")
        action = agent.decide_next_action(current_state)
        current_state = execute(action)
    # The loop exhausting its step budget is itself a termination condition.
    return AgentResult(status="max_steps_exceeded")

Tool Hallucination

The agent calls a tool that doesn't exist. It invents a plausible function name — search_internal_wiki(), get_customer_history() — because it was trained on code where similar functions existed.

This fails silently if your tool execution layer doesn't validate against a known schema. The agent receives an error, sometimes hallucinates a response anyway, and your user gets a confidently wrong answer.

Fix: validate every tool call against your registered tool registry before execution. Return a structured error for unknown tools that the agent can reason about. Never let an unknown function name silently pass.

Context Overflow

Long-running conversations or multi-step research tasks accumulate context faster than you'd expect. At 70-80% context utilization, the agent starts losing track of the original goal. Instructions from the beginning of the conversation get pushed out or deprioritized.

The symptom: the agent answers a different question than the one asked, or forgets constraints it was given in the system prompt.

Fix: backend token tracking (don't rely on the frontend to estimate this), summarization triggers at usage thresholds, and hierarchical memory that compresses old context without discarding the information it contained.
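A sketch of a threshold-based summarization trigger — the token estimate, limits, and keep-the-last-four-turns policy are all illustrative:

```python
# Illustrative limits; use your model's real context window and a real
# tokenizer in production.
CONTEXT_LIMIT_TOKENS = 8192
SUMMARIZE_AT = 0.7  # compress once ~70% of the window is used

def estimate_tokens(messages: list) -> int:
    # Crude backend estimate: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compress(messages: list, summarize) -> list:
    """Compress the middle of the conversation once usage crosses the threshold."""
    if estimate_tokens(messages) < CONTEXT_LIMIT_TOKENS * SUMMARIZE_AT:
        return messages
    # Keep the system prompt and the most recent turns; summarize the middle
    # so old information is compressed rather than discarded.
    head, middle, tail = messages[:1], messages[1:-4], messages[-4:]
    summary = {"role": "system", "content": summarize(middle)}
    return head + [summary] + tail
```

The `summarize` callback would itself be an LLM call in practice; injecting the summary as a system message keeps it from being pushed out again immediately.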

Cost Explosion

Each LLM call costs money. Each tool call may cost money. Multi-step agents chain these together. Without explicit cost controls, a single complex request can generate dozens of LLM calls and rack up costs that are invisible until your billing dashboard updates.

A research agent that calls web_search 100 times because it has no max-iteration guard is not a hypothetical — it's what happens the first time you deploy without cost controls.

Fix: token counting before expensive operations, per-session cost caps, caching for repeated identical tool calls, and cheaper models for routing and simple reasoning decisions.
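Caching identical tool calls is only a few lines; this sketch keys on the tool name plus canonicalized arguments:

```python
import json

# In-process cache; production systems would use Redis or similar,
# with a TTL for tools whose results go stale.
_tool_cache: dict = {}

def cached_tool_call(name: str, args: dict, execute) -> dict:
    """Execute a tool call once per unique (name, args); reuse the result after."""
    key = (name, json.dumps(args, sort_keys=True))  # canonical argument order
    if key not in _tool_cache:
        _tool_cache[key] = execute(name, args)
    return _tool_cache[key]
```

Sorting the keys matters: `{"city": "Tokyo"}` and the same dict built in a different order must hash to the same cache entry.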

Non-Deterministic Behavior

Same input, different output. This is by design for LLMs (temperature > 0 introduces randomness) but it's a debugging nightmare for agents. You can't reproduce a failure, you can't write deterministic tests, and you can't explain to a user why the same query gave different results on two consecutive days.

Fix: deterministic control flow for critical paths (this is exactly what LangGraph's state machine model solves), explicit step logging, and testing frameworks that validate behavior across multiple runs rather than asserting exact outputs.
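One way to test across runs is a behavioral predicate rather than an exact-output assertion — a sketch:

```python
def behaves_consistently(agent_fn, query: str, check, runs: int = 5) -> bool:
    """Run the agent several times and require every run to pass a
    behavioral predicate (e.g. 'mentions a temperature', 'called the
    weather tool'), rather than matching an exact string."""
    return all(check(agent_fn(query)) for _ in range(runs))
```

The predicate is where the engineering judgment lives: assert on the properties that matter (correct tool called, constraint respected, number in a plausible range), not on the phrasing the model happened to choose.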

Silent Failures

The agent returns a response. The response looks reasonable. It's wrong.

This is the most dangerous failure mode because you can't catch it without observability. The agent encountered an error partway through, the LLM filled in the gap with plausible-sounding generated content, and the result is confident nonsense.

Fix: structured error propagation at every tool boundary, validation layers that distinguish retrieved data from generated data, and logging that traces exactly which data came from tools vs. which was generated by the model.

⚠️ Silent failures are the failure mode that production incidents are made of. An infinite loop is visible immediately. A silent failure passes QA and gets discovered by a user six weeks after launch.


The Agent Spectrum: Where Your Use Case Actually Falls

Not every use case needs a fully autonomous agent. There's a spectrum, and most production systems live in the middle.

flowchart LR
    P1[Position 1\nChatbot\nNo agency] --> P2[Position 2\nTool-Using\nAssistant]
    P2 --> P3[Position 3\nPlanning\nAgent]
    P3 --> P4[Position 4\nMulti-Agent\nSystem]
    P4 --> P5[Position 5\nFully Autonomous\nAgent]

    style P1 fill:#95A5A6,color:#fff
    style P2 fill:#6BCF7F,color:#333
    style P3 fill:#4A90E2,color:#fff
    style P4 fill:#FFD93D,color:#333
    style P5 fill:#E74C3C,color:#fff

Position 1 — Chatbot: Pure LLM, no tools, no state. Fine for Q&A against known information. Not an agent.

Position 2 — Tool-Using Assistant: Can call specific tools on demand. No planning, responds to immediate requests. Most "AI features" in SaaS products live here. Sufficient for a lot of use cases.

Position 3 — Planning Agent: Decomposes tasks into steps, uses tools strategically, maintains state across steps. This is where most practical agentic systems should operate. Complex enough to be genuinely useful, constrained enough to be debuggable.

Position 4 — Multi-Agent System: Multiple specialized agents coordinating. A research agent, a writing agent, a fact-checker agent. Higher capability ceiling, exponentially harder to debug. Justified when you have genuinely distinct specialized domains that shouldn't share context.

Position 5 — Fully Autonomous Agent: Open-ended goal execution, minimal human oversight, runs for hours or days. Rarely practical in production today. Cost, reliability, and safety constraints make this a research problem more than an engineering one.

This is the practical guidance: start at Position 2, build to Position 3 when your use case requires planning across multiple steps, consider Position 4 only when you've fully mastered Position 3 and have a specific reason why specialization is needed.

The engineers who jump straight to multi-agent systems because they sound sophisticated are the ones who email me six months later asking why their system is impossible to debug.


Building the Baseline: Chatbot to Agent

Here's the progression in code. Starting with a chatbot that only generates text:

# baseline_chatbot.py
import ollama

def chatbot(user_message: str) -> str:
    response = ollama.chat(
        model="qwen3:8b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message}
        ]
    )
    return response['message']['content']

print(chatbot("What's the weather in Tokyo?"))
# "I don't have access to real-time weather data..."

This is Position 1. It answers based on training data. Ask it about the weather — it guesses or declines. Ask it to send an email — it describes what an email would look like.

Now add tools and memory:

# agent_v2.py — with tools and conversation memory
import json
import ollama

# `tools` (the schemas shown to the model) and `available_functions`
# (name → Python callable) are defined alongside the registered tools.

def intelligent_agent(user_message: str, conversation_history: list = None) -> dict:
    if conversation_history is None:
        conversation_history = []
    messages = [
        {
            "role": "system",
            "content": """You are an intelligent assistant that can:
1. Use tools when needed
2. Explain your reasoning before acting
3. Remember previous context
If you cannot complete a task, explain what's missing."""
        }
    ]
    messages.extend(conversation_history)
    messages.append({"role": "user", "content": user_message})

    response = ollama.chat(model="qwen3:8b", messages=messages, tools=tools)
    response_message = response['message']
    messages.append(response_message)

    tool_results = []
    if response_message.get('tool_calls'):
        for tool_call in response_message['tool_calls']:
            function_name = tool_call['function']['name']
            function_args = tool_call['function']['arguments']
            # Validate against registered tools before executing
            if function_name not in available_functions:
                raise ValueError(f"Unknown tool: {function_name}")
            function_response = available_functions[function_name](**function_args)
            tool_results.append({
                "tool": function_name,
                "input": function_args,
                "output": function_response
            })
            messages.append({
                "role": "tool",
                "content": json.dumps(function_response)
                           if not isinstance(function_response, str)
                           else function_response
            })
        final_response = ollama.chat(model="qwen3:8b", messages=messages)
        return {
            "response": final_response['message']['content'],
            "actions": tool_results,
            "conversation_history": messages
        }
    return {
        "response": response_message.get('content', ''),
        "actions": [],
        "conversation_history": messages
    }

What changed: the agent now maintains conversation_history across turns, validates tool calls before execution, and returns a structured result that includes what tools were called and why. Ask it about Tokyo weather — it calls get_weather("Tokyo"), gets real data, responds with real numbers. Ask it "what was my name again?" three turns later — it finds it in the conversation history.

That's the shift from Position 1 to Position 2–3. The LLM is still doing the thinking. The infrastructure around it gives those thoughts the ability to act and persist.


Production Checklist Before You Ship

These aren't optional. They're the things you build before users find the failure modes for you.

Observability — can you see what the agent is doing?

Every LLM call should log: model, input tokens, output tokens, latency, tool calls made, cost. Every tool call should log: function name, arguments, result, latency, whether it succeeded. Without this, debugging a production incident is archaeology.
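A sketch of such a logging wrapper for tool calls — the logged field names are illustrative:

```python
import json
import logging
import time

logger = logging.getLogger("agent.tools")

def logged_tool_call(name: str, args: dict, execute) -> dict:
    """Execute a tool call and emit a structured log line either way."""
    start = time.monotonic()
    try:
        result = execute(name, args)
        ok = True
        return result
    except Exception:
        ok = False
        raise  # surface the error; never swallow it here
    finally:
        logger.info(json.dumps({
            "tool": name,
            "args": args,
            "latency_ms": round((time.monotonic() - start) * 1000, 1),
            "success": ok,
        }))
```

The `finally` block is the point: the log line is written whether the call succeeded or raised, so a production trace never has gaps exactly where the failure happened.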

Cost controls — what's the maximum cost per request?

Token counting before expensive multi-step operations. Per-session cost caps with graceful early termination. Caching for repeated identical tool calls. Smaller models for routing and simple decisions — you don't need GPT-4 to decide whether a query is about weather or calendar.

Termination strategy — what stops the agent?

Maximum step count. Timeout. Cost cap. Goal-achievement detection. All four. Any agent without explicit termination conditions is a liability.

Error propagation — are errors surfaced or swallowed?

Structured error types at every tool boundary. Clear distinction between data retrieved from tools and content generated by the model. Validation layers that catch the cases where the LLM fills in missing information with plausible nonsense.

Safety bounds — what can the agent not do?

Explicit tool permissions. Human-in-the-loop gates for consequential actions (sending emails, modifying databases, making purchases). Rollback mechanisms for reversible actions. The narrower the tool surface, the smaller the blast radius when something goes wrong.
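A human-in-the-loop gate can be as simple as an allowlist check before execution — a sketch with illustrative action names:

```python
# Actions that must never run without explicit approval (list is illustrative).
CONSEQUENTIAL = {"send_email", "execute_purchase", "delete_record"}

def requires_approval(action: str) -> bool:
    return action in CONSEQUENTIAL

def run_action(action: str, args: dict, execute, approve) -> dict:
    """Gate consequential actions behind an approval callback; run the rest."""
    if requires_approval(action) and not approve(action, args):
        return {"status": "rejected", "action": action}
    return {"status": "done", "result": execute(action, args)}
```

In production, `approve` would pause the agent and surface a confirmation to a human (the flight-booking flow earlier does exactly this before the booking step); in tests it's just a callback.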


The Real Mindset Shift

Here's what this whole module comes down to.

When you use an LLM, you're asking a question. When you build an agent, you're designing a decision-making system. Those require completely different mental models.

Designing a decision-making system means thinking about: what decisions need to be made, what information those decisions require, what actions result from those decisions, what happens when information is missing or actions fail, and how the system knows when it's done.

Most production failures in agentic systems come from engineers who designed a good LLM prompt but didn't design the decision system around it. They thought about the happy path (user asks, agent answers correctly) and didn't think about the failure paths (API down, context overflowed, tool hallucinated, loop didn't terminate).

The five capabilities aren't a checklist to implement — they're a framework for thinking about where your system is incomplete. If it doesn't have memory, every session starts from zero. If it doesn't have feedback loops, it can't recover from errors. If it doesn't have explicit termination, it will loop until it's stopped.

Build with all five in mind from the start. Add production controls before you ship. Start at Position 2–3 on the autonomy spectrum and move right only when you have specific reasons to.

That's the mindset shift. Everything else is implementation.


What's Next

This article covers the foundations — the mental model for what makes a system agentic and where it breaks in production.

The next layer is the technical architecture: how the agent loop actually works under the hood, how tool calling is implemented in LangChain and LangGraph, how to design memory systems for production use cases, and how to build deterministic control flow with LangGraph's state machine model.

That's Module 2 — and the shift from conceptual understanding to actual implementation.

