When Your Chatbot Needs to Actually Do Something: Understanding AI Agents

Introduction

I’ve been building AI systems for production. The shift from LLMs to agents seemed small at first. Then I hit an infinite loop bug and watched costs spike. These aren’t the same thing at all.

Here’s what nobody tells you upfront: an LLM responds to you. An agent decides what to do. That difference is everything.

The Weather Test

Look at these two conversations:

System A:

You: "What's the weather in San Francisco?"
Bot: "I don't have access to real-time weather data, but San Francisco 
typically has mild temperatures year-round..."

System B:

You: "What's the weather in San Francisco?"
Agent: "It's currently 62°F and cloudy in San Francisco."

What happened differently? System B looked at your question, decided it needed weather data, called an API, got the result, and gave you an answer. Four distinct steps that System A couldn’t do.

That’s the line between generation and action. Between completing text and making decisions.

What LLMs Actually Do (And Don’t)

When you call GPT-4 or Claude, you’re using a completion engine. Feed it text, get text back. That’s it.

LLMs are genuinely impressive at pattern completion, synthesizing knowledge from training data, and understanding context. But they can’t remember your last conversation, access current information, execute code, or fix their own mistakes. Each API call is independent. No state. No feedback loop.

This isn’t a flaw. It’s what they are. A calculator isn’t failing when it can’t make coffee, and an LLM isn’t failing when it can’t take actions. They were built for a different job.

Five Things That Make Something an Agent

After building enough systems, you start seeing patterns. Real agents have these five capabilities:

Figure 1: Agentic Capabilities

Autonomy – The system figures out what to do, not just how to phrase things. You say “find last year’s sales data” and it determines which database to query, what filters to apply, and how to format results.

Planning – Breaking “analyze this dataset” into actual executable steps. Find the file, check the schema, run calculations, generate visualizations. Multi-step reasoning that adapts based on what it discovers.

Tool Use – APIs, databases, code execution. Ways to actually do things in the world beyond generating tokens.

Memory – Remembering the conversation two messages ago. Keeping track of what worked and what failed. Building context across interactions.

Feedback Loops – When something breaks, the agent sees the error and tries a different approach. Observation, action, observation, adaptation.

Strip away any of these and you’re back to an LLM with extra steps.

How Agents Actually Work

The core mechanism is simpler than you’d expect. It’s an observe-plan-act-observe loop:

  1. Observe – Process the user’s request and current state
  2. Plan – Decide what actions to take
  3. Act – Execute those actions (call tools, run code)
  4. Observe again – See what happened, decide next step

Let’s trace a real interaction:

User: "Book me a flight to NYC next Tuesday and add it to my calendar"

OBSERVATION:
- Two tasks: book flight + calendar entry
- Need: destination (NYC), date (next Tuesday), available tools

PLANNING:
1. Search flights to NYC for next Tuesday
2. Present options to user
3. Wait for user selection
4. Book selected flight
5. Add to calendar with flight details

ACTION:
- Execute: flight_search(destination="NYC", date="2025-12-17")

OBSERVATION (Result):
- Received 3 flight options with prices
- Status: Success

DECISION:
- Present options, wait for selection
- Update state: awaiting_user_selection

The agent isn’t just completing text. It’s making a decision at each step about what to do next based on what it observes.
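
In code, that loop is surprisingly compact. Here’s a minimal sketch in Python, assuming hypothetical plan_next_step and execute_tool helpers rather than any particular framework:

# Minimal observe-plan-act loop (sketch; plan_next_step and execute_tool are hypothetical)
def run_agent(user_request, tools, max_iterations=10):
    state = {"goal": user_request, "history": []}

    for _ in range(max_iterations):
        # Plan: ask the LLM what to do next, given the goal and everything observed so far
        step = plan_next_step(state)      # e.g. {"tool": "flight_search", "arguments": {...}}

        if step.get("done"):              # the model decided it can answer directly
            return step["answer"]

        # Act: execute the chosen tool
        result = execute_tool(tools, step)

        # Observe: record what happened so the next plan can adapt
        state["history"].append({"step": step, "result": result})

    return "Stopped: reached the iteration limit without finishing."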

The Spectrum of Agency

Not everything needs full autonomy. There’s a spectrum:

Figure 2: The Spectrum of Agency

Chatbots (low autonomy) – No tools, no state. Pure conversation. This is fine for FAQ bots where all you need is text generation.

Tool-using assistants – Fixed set of tools, simple state. The assistant can call your CRM API or check documentation, but it’s not planning multi-step operations.

Planning agents – Dynamic plans, complex state management. These can break down “analyze Q3 sales and generate a presentation” into actual steps that adapt based on intermediate results.

Multi-agent systems – Multiple agents coordinating, shared state. One agent handles research, another writes, another fact-checks. They communicate and negotiate task division.

Fully autonomous systems – Long-running, open-ended goals. These operate for extended periods with minimal supervision.

Most production systems land somewhere in the middle. You rarely need full autonomy. Often, you just need tools and basic memory.

Where Agents Break in Production

These six failure modes show up constantly in production:

Figure 3: Agent Failures in Production

Infinite loops – Agent calls web_search, doesn’t find what it needs, calls web_search again with slightly different parameters, repeats forever. Solution: set max iterations.

Tool hallucination – Agent tries to call send_email_to_team() which doesn’t exist. The LLM confidently invents plausible-sounding tool names. Solution: strict tool validation.

Context overflow – After 50 messages, the conversation history exceeds the context window. Agent forgets the original goal. Solution: smart context management and summarization.

Cost explosion – No cost caps, agent makes 200 API calls trying to debug something. Your bill hits $10,000 before you notice. Solution: per-request budget limits.

Non-deterministic failures – Same input, different outputs. Sometimes it works, sometimes it doesn’t. Hard to debug. Solution: extensive logging and trace analysis.

Silent failures – Tool call fails, agent doesn’t handle the error, just continues. User gets incorrect results with no indication that something went wrong. Solution: explicit error handling everywhere.

The common thread? These all happen because the agent is making decisions, and decisions can be wrong. With pure LLMs, you can validate outputs. With agents, you need to validate the entire decision-making process.
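
Most of the mitigations are cheap and boring, and they belong in the loop itself. Here’s a rough sketch of the first two guards (an iteration cap and a per-request budget); the limits are made-up numbers you’d tune for your own system, and step_fn and estimate_cost are hypothetical:

# Sketch of loop guards: max iterations and a per-request cost budget (numbers are illustrative)
MAX_ITERATIONS = 8
MAX_COST_USD = 0.50

def guarded_loop(state, step_fn, estimate_cost):
    spent = 0.0
    for iteration in range(MAX_ITERATIONS):
        step = step_fn(state)              # hypothetical: one plan/act/observe cycle
        spent += estimate_cost(step)       # hypothetical: tokens used * price for this step

        if spent > MAX_COST_USD:
            raise RuntimeError(f"Budget exceeded after {iteration + 1} iterations (${spent:.2f})")

        if step.get("done"):
            return step

    raise RuntimeError("Iteration limit reached; refusing to loop forever")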

Memory: Short-term, Long-term, and Procedural

Memory turns out to be more nuanced than “remember the conversation.”

Figure 4: Agent Memory

Short-term memory (working memory) – Holds the current conversation and immediate context. This is what keeps the agent from asking “what’s your name?” after you just told it.

Long-term memory (episodic) – Stores information across sessions. “Last time we talked, you mentioned you preferred Python over JavaScript.” This is harder to implement but crucial for personalized experiences.

Procedural memory – Learned patterns and behaviors. “When the user asks about sales data, they usually want year-over-year comparisons, not raw numbers.” This often comes from fine-tuning or RLHF (Reinforcement Learning from Human Feedback).

Most systems implement short-term memory (conversation history) and skip the rest. That’s often fine. Long-term memory adds complexity quickly, especially around retrieval and relevance.
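
A minimal way to see the difference is in the data structures. This sketch keeps short-term memory as a bounded message list and long-term memory as a per-user store; the shape is illustrative, not a recommendation of any particular library:

from collections import deque

# Short-term: the current conversation, bounded so it can't overflow the context window
short_term = deque(maxlen=20)
short_term.append({"role": "user", "content": "What's the weather in San Francisco?"})

# Long-term: facts that survive across sessions, keyed by user
long_term = {
    "user_123": {"preferred_language": "Python", "last_location": "San Francisco"}
}

# Procedural memory usually isn't a data structure at all; it lives in the model's
# weights (fine-tuning, RLHF) or in carefully written system prompts.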

Tools: The Actual Interface to the World

Tool calling is how agents affect reality. The LLM generates structured output that your code executes:

# The LLM generates a structured decision like this (shown here as the parsed JSON)
tool_call = {
    "tool": "send_email",
    "arguments": {
        "to": "team@company.com",
        "subject": "Q3 Results",
        "body": "Attached are the Q3 metrics we discussed."
    }
}

# Your code looks up the tool and executes it
result = tools[tool_call["tool"]](**tool_call["arguments"])

# Agent sees the result and decides what to do next

The critical part is validation. Before executing any tool call, check that the tool exists, the parameters are valid, and you have permission to run it. Tool hallucination is common and dangerous.
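
That check can be as simple as comparing the tool name against a registry and the arguments against the real function’s signature before anything runs. A sketch, assuming tools is a plain dict of callables:

import inspect

def validate_tool_call(tools, tool_call):
    name = tool_call.get("tool")
    args = tool_call.get("arguments", {})

    # Reject hallucinated tools outright
    if name not in tools:
        raise ValueError(f"Unknown tool: {name!r}")

    # Reject arguments the real function doesn't accept
    expected = set(inspect.signature(tools[name]).parameters)
    unexpected = set(args) - expected
    if unexpected:
        raise ValueError(f"Unexpected arguments for {name}: {sorted(unexpected)}")

    return name, args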

Also, most tools fail sometimes. APIs timeout, databases lock, network connections drop. Your agent needs explicit error handling for every tool call. Assume failure. Build retry logic. Log everything.
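
The same goes for execution: wrap every call in retries and logging. A rough sketch with exponential backoff; the retry count and delays are arbitrary:

import logging
import time

logger = logging.getLogger("agent.tools")

def call_with_retries(fn, args, retries=3, base_delay=1.0):
    for attempt in range(1, retries + 1):
        try:
            result = fn(**args)
            logger.info("tool succeeded on attempt %d", attempt)
            return result
        except Exception:
            logger.exception("tool failed on attempt %d", attempt)
            if attempt == retries:
                raise                                        # surface the failure instead of failing silently
            time.sleep(base_delay * 2 ** (attempt - 1))      # back off: 1s, 2s, 4s, ...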

The Planning Problem

“Book a flight and add it to my calendar” sounds simple until you break it down:

  1. Extract destination and date from natural language
  2. Check if you have enough context (do you know which airport? which calendar?)
  3. Search for flights
  4. Evaluate options based on unstated preferences
  5. Present choices without overwhelming the user
  6. Wait for selection (this is a state transition)
  7. Execute booking (this might fail)
  8. Extract flight details from booking confirmation
  9. Format for calendar API
  10. Add to calendar (this might also fail)
  11. Confirm success to user

That’s 11 steps with multiple decision points and error states. An LLM can’t do this. It can generate text that looks like it did this, but it can’t actually execute the steps and adapt based on outcomes.

Planning means breaking fuzzy goals into executable steps and handling the inevitable failures along the way.
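
One practical consequence: the plan needs to be a data structure the agent can inspect and update, not just text. A sketch of the flight example as explicit steps with status (the step names are illustrative), so a failure at step 7 is visible instead of silently skipped:

plan = [
    {"id": 1, "action": "search_flights", "status": "done"},
    {"id": 2, "action": "present_options", "status": "done"},
    {"id": 3, "action": "await_user_selection", "status": "in_progress"},
    {"id": 4, "action": "book_flight", "status": "pending"},
    {"id": 5, "action": "add_to_calendar", "status": "pending"},
]

def next_step(plan):
    # Resume from the first step that isn't finished; failed steps get marked
    # "failed" and replanned rather than quietly dropped.
    return next((s for s in plan if s["status"] != "done"), None)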

When You Actually Need an Agent

Not every problem needs an agent. Most don’t. Here’s a rough guide:

Use an LLM directly when:

  • You just need text generation (summaries, rewrites, explanations)
  • The task is single-shot (one input, one output)
  • You don’t need current data or external actions
  • Latency matters (agents add overhead)

Use an agent when:

  • You need to call multiple APIs based on conditions
  • The task requires multi-step reasoning
  • You need error recovery and retry logic
  • Users expect the system to “figure it out” rather than follow explicit instructions

The deciding factor is usually decision-making under uncertainty. If you can write a script with if-statements that handles all cases, use the script. If you need the system to figure out what to do based on context, that’s when agents make sense.

Three Real Examples

Customer support bot – Most don’t need to be agents. They’re fine at looking up articles and answering questions. But if you want them to check order status, process refunds, and escalate to humans when needed? Now you need autonomy, tools, and decision-making.

Research assistant – A system that searches papers, extracts key findings, and generates summaries is perfect for agents. It needs to decide which papers are relevant, adapt search strategies based on initial results, and synthesize information from multiple sources.

Code reviewer – Analyzing pull requests, running tests, checking style guides, and posting comments. This needs tools (Git API, test runners), multi-step planning, and error handling. Classic agent territory.

Starting Simple

When you build your first agent, resist the temptation to add everything at once. Start with:

  1. One tool (maybe a web search or database query)
  2. Basic conversation memory (just track the last few messages)
  3. Simple decision logic (if user asks about X, call tool Y)
  4. Explicit error handling (what happens when the tool fails?)

Get that working reliably before adding planning, reflection, or multi-agent coordination. The complexity multiplies fast.

I learned this the hard way. Built a “research agent” with 12 tools, complex planning logic, and multi-step reasoning. Spent three weeks debugging edge cases. Rebuilt it with 3 tools and simple logic. Worked better and shipped in two days.

Production Realities

Running agents in production means dealing with issues you don’t face with static LLM calls:

Observability – You need to see what the agent is doing. Log every LLM call, every tool invocation, every decision point. When something breaks (and it will), you need to reconstruct exactly what happened.
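
In practice that usually means structured trace logs: one record per LLM call, tool invocation, and decision, written somewhere you can query later. A minimal sketch using JSON lines; the field names are just illustrative:

import json
import time
import uuid

def log_event(trace_id, event_type, payload, path="agent_trace.jsonl"):
    record = {
        "trace_id": trace_id,       # one id per user request, so you can reconstruct the whole run
        "timestamp": time.time(),
        "type": event_type,         # e.g. "llm_call", "tool_call", "decision"
        "payload": payload,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

trace_id = str(uuid.uuid4())
log_event(trace_id, "tool_call", {"tool": "flight_search", "arguments": {"destination": "NYC"}})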

Cost control – Set maximum token budgets per request. Cap the number of tool calls. Use caching aggressively for repeated operations. An agent can burn through thousands of tokens if it gets stuck in a loop.

Safety guardrails – Which tools can execute automatically vs requiring human approval? What actions are never allowed? How do you handle sensitive data in tool arguments?

Graceful degradation – When a tool fails, can the agent accomplish the goal another way? Or should it just tell the user it can’t help? Design for partial success, not just all-or-nothing.

These aren’t optional. They’re the difference between a demo and a production system.

The Mental Model Shift

The hardest part isn’t the code. It’s changing how you think about the system.

With LLMs, you’re optimizing prompts to get better completions. With agents, you’re designing decision spaces and constraining behavior. It’s closer to building an API than writing a prompt.

You stop asking “how do I get it to say this?” and start asking “what decisions does it need to make?” and “how do I prevent bad decisions?”

This shift took me longer than learning the technical pieces. I kept trying to solve agent problems with better prompts when I needed better architecture.

What I Wish I’d Known

Before building my first production agent, I wish someone had told me:

Logging is not optional. You will spend hours debugging. Good logs make the difference between “I have no idea what happened” and “oh, it’s calling the wrong tool on step 7.”

Start with deterministic baselines. Before building the agent, write a script that solves the problem with if-statements. This gives you something to compare against and helps you understand the decision logic.

Most complexity is not AI complexity. It’s error handling, state management, API retries, and data validation. The LLM is often the simplest part.

Users don’t care about your architecture. They care whether it works. A simple agent that reliably solves their problem beats a sophisticated agent that’s impressive but breaks often.

Building Your First Agent

If you’re ready to try this, here’s what I’d recommend:

Start with a weather agent. It’s simple enough to finish but complex enough to teach you the core concepts:

Tools:

  • get_weather(location) – fetches current weather
  • geocode(city_name) – converts city names to coordinates

Decision logic:

  • Does user query include a location?
  • If yes, call get_weather directly
  • If no, ask for location or use default
  • Handle API failures gracefully

Memory:

  • Remember the user’s last location
  • Don’t ask again if they query weather multiple times

Build this and you’ll hit most of the core challenges. Tool calling, error handling, state management, and decision logic. It’s a good litmus test for whether you understand the fundamentals.
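
Here’s roughly what it looks like when the pieces come together. Everything below is a sketch: the two helpers are stand-ins for your real LLM call and weather API client.

# Sketch of a first weather agent. The two helpers are stand-ins:
# in a real agent, one is an LLM extraction call and the other wraps a weather API.

def ask_llm_for_location(query, default=None):
    # Stand-in for an LLM call that extracts a city name from the query
    if "san francisco" in query.lower():
        return "San Francisco"
    return default

def get_weather(location):
    # Stand-in for a real weather API; values echo the example from earlier
    return {"temp_f": 62, "conditions": "cloudy"}

def weather_agent(query, memory):
    location = ask_llm_for_location(query, default=memory.get("last_location"))
    if location is None:
        return "Which city should I check the weather for?"

    memory["last_location"] = location        # remember it so we don't ask again next time
    try:
        weather = get_weather(location)
    except Exception:
        return f"Sorry, I couldn't reach the weather service for {location}."

    return f"It's currently {weather['temp_f']}°F and {weather['conditions']} in {location}."

memory = {}
print(weather_agent("What's the weather in San Francisco?", memory))
print(weather_agent("What about right now?", memory))    # reuses the remembered location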

Where This Goes Next

Once you have a basic agent working, the natural progression is:

  • Better planning algorithms (ReAct, Tree of Thoughts, etc.)
  • More sophisticated memory (vector databases, episodic storage)
  • Multi-agent coordination (specialized agents working together)
  • Evaluation frameworks (how do you know if it’s working?)
  • Production infrastructure (monitoring, cost controls, safety)

But all of that builds on the core loop: observe, plan, act, observe. Master that first. Everything else is elaboration.

The Real Difference

The shift from LLMs to agents isn’t about better models or fancier prompts. It’s about giving language models the ability to do things.

Text generation is powerful. But generation plus action? That’s when things get genuinely useful. When your system can not just tell you the answer but actually execute the steps to get there.

That’s the promise of agents. And also why they’re harder to build than you expect.


Have you built any AI agents? What surprised you most about the difference from working with LLMs directly? Let me know what patterns you’ve discovered.


Code and Resources

If you want to dive deeper, I’ve put together a complete codebase with working examples of everything discussed here:

Building Real-World Agentic AI Systems with LangGraph – GitHub

The repository includes baseline chatbots, tool-calling agents, weather agents, and all the production patterns we covered. Start with module-01 for the fundamentals.

Further Reading

  • ReAct: Synergizing Reasoning and Acting (Yao et al., 2023) – The foundation paper for modern agent architectures. Shows how interleaving reasoning and acting improves agent performance.
  • Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) – Explores how agents can learn from mistakes through self-reflection.
  • Toolformer (Schick et al., 2023) – Deep dive into how models learn to use tools effectively.
