From understanding concepts to building systems – this comprehensive guide takes you through every component needed to build reliable, production-ready AI agents.
Introduction: From Theory to Implementation
Building an AI agent isn’t about chaining a few LLM calls together and hoping for the best. It’s about understanding the fundamental mechanics that make agents actually work in production environments.
If you’ve experimented with agent frameworks like LangChain or AutoGPT, you’ve probably noticed something: they make agents look simple on the surface, but when things break (and they will), you’re left debugging a black box. The agent gets stuck in loops, picks wrong tools, forgets context, or hallucinates operations that don’t exist.
The problem? Most developers treat agents as magical systems without understanding what’s happening under the hood.
This guide changes that. We’re deconstructing agents into their core building blocks – the execution loop, tool interfaces, memory architecture, and state transitions. By the end, you’ll not only understand how agents work, but you’ll be able to build robust, debuggable systems that handle real-world tasks.
What makes this different from other agent tutorials?
Instead of showing you how to call agent.run() and praying it works, we’re breaking down each component with production-grade implementations. You’ll see working code for every pattern, understand why each piece matters, and learn where systems typically fail.
Who is this guide for?
AI engineers and software developers who want to move beyond toy examples. If you’ve built demos that work 70% of the time but can’t figure out why they fail the other 30%, this is for you. If you need agents that handle errors gracefully, maintain context across conversations, and execute tools reliably, keep reading.
The Fundamental Truth About Agents
Here’s what most tutorials won’t tell you upfront: An agent is not a monolith – it’s a loop with state, tools, and memory.
Every agent system, regardless of complexity, follows the same pattern:

Figure 1: The canonical agent execution loop showing Observe → Think → Decide → Act → Update State → Termination Check cycle
This six-step pattern (five actions plus termination check) appears everywhere:
- Simple chatbots implement it minimally
- Complex multi-agent systems run multiple instances simultaneously
- Production systems add error handling and recovery to each step
The sophistication varies, but the structure stays constant.
Why this matters for production systems:
When you call agent.run() in LangChain or set up a workflow in LangGraph, this loop executes behind the scenes. When something breaks – the agent loops infinitely, selects wrong tools, or loses context – you need to know which step failed:
- Observe: Did it lack necessary context?
- Think: Was the prompt unclear or misleading?
- Decide: Were tool descriptions ambiguous?
- Act: Did the tool crash or return unexpected data?
- Update State: Did memory overflow or lose information?
Without understanding the loop, you’re debugging blind.
Anatomy of the Agent Execution Loop
Let’s examine the agent loop with precision. This isn’t pseudocode – this is the actual pattern every agent implements:
def agent_loop(task: str, max_iterations: int = 10) -> str:
"""
The canonical agent execution loop.
This foundation appears in every agent system.
"""
# Initialize state
state = {
"task": task,
"conversation_history": [],
"iteration": 0,
"completed": False
}
while not state["completed"] and state["iteration"] < max_iterations:
# STEP 1: OBSERVE
# Gather current context: task, history, available tools, memory
observation = observe(state)
# STEP 2: THINK
# LLM reasons about what to do next
reasoning = llm_think(observation)
# STEP 3: DECIDE
# Choose an action based on reasoning
action = decide_action(reasoning)
# STEP 4: ACT
# Execute the chosen action (tool call or final answer)
result = execute_action(action)
# STEP 5: UPDATE STATE
# Store the outcome and update memory
state = update_state(state, action, result)
# TERMINATION CHECK
if is_task_complete(state):
state["completed"] = True
state["iteration"] += 1
    return extract_final_answer(state)
The state dictionary is the agent’s working memory. It tracks everything: the original task, conversation history, current iteration, and completion status. This state persists across iterations, accumulating context as the agent progresses.
The while condition has two critical parts:
- not state["completed"] – checks if the task is finished
- state["iteration"] < max_iterations – safety valve preventing infinite loops
Without the second condition, a logic error or unclear task makes your agent run forever, burning through API credits and compute resources.
The five steps must execute in order:
- You can’t decide without observing
- You can’t act without deciding
- You can’t update state without seeing results
This sequence is fundamental, not arbitrary.
Step 1: Observe – Information Gathering
Purpose: Assemble all relevant information for decision-making
def observe(state: dict) -> dict:
"""
Observation packages everything the LLM needs:
- Original task/goal
- Conversation history
- Available tools
- Current memory/context
- Previous action outcomes
"""
return {
"task": state["task"],
"history": state["conversation_history"][-5:], # Last 5 turns
"available_tools": get_available_tools(),
"iteration": state["iteration"],
"previous_result": state.get("last_result")
    }
Why observation matters:
The LLM can’t see your entire system state – you must explicitly package relevant information. Think of it as preparing a briefing document before a meeting. Miss critical context, and decisions suffer.
Key considerations:
- Context window management: You can’t include unlimited history. The code above keeps the last 5 conversation turns. This prevents token overflow while maintaining recent context. For longer conversations, implement summarization or semantic filtering. A budget-aware selection sketch follows this list.
- Tool visibility: The agent needs to know what actions it can take. This seems obvious until you’re debugging why an agent doesn’t use a tool you just added. Make tool descriptions visible in every observation.
- Iteration tracking: Including the current iteration helps the LLM understand how long it’s been working. After iteration 8 of 10, it might change strategy or provide intermediate results.
- Previous results: The outcome of the last action directly influences the next decision. Did the API call succeed? What data came back? This feedback is essential.
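For illustration, here is a minimal sketch of budget-aware history selection. The characters-divided-by-4 token estimate and the helper names are assumptions, not part of the code above; swap in your model’s real tokenizer for accuracy:
def estimate_tokens(messages: list) -> int:
    """Very rough token estimate: roughly 4 characters per token."""
    return sum(len(str(m)) for m in messages) // 4

def select_history(history: list, max_tokens: int = 2000) -> list:
    """Keep the most recent turns that still fit within the token budget."""
    selected = []
    for message in reversed(history):  # walk from newest to oldest
        if estimate_tokens(selected + [message]) > max_tokens:
            break
        selected.insert(0, message)  # preserve chronological order
    return selected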
Common failures:
- Token limit exceeded because you included entire conversation history
- Missing tool descriptions causing the agent to ignore available functions
- No previous result context making the agent repeat failed actions
- Task description missing causing goal drift over multiple iterations
Step 2: Think – LLM Reasoning
Purpose: Process observations and reason about next steps
def llm_think(observation: dict) -> str:
"""
The LLM receives context and generates reasoning.
This is where the intelligence happens.
"""
prompt = f"""
Task: {observation['task']}
Previous conversation:
{format_history(observation['history'])}
Available tools:
{format_tools(observation['available_tools'])}
Previous result: {observation.get('previous_result', 'None')}
Based on this context, what should you do next?
Think step-by-step about:
1. What information do you have?
2. What information do you need?
3. Which tool (if any) should you use?
4. Can you provide a final answer?
"""
    return llm.generate(prompt)
This is where reasoning happens. The LLM analyzes the current situation and determines the next action. Quality of thinking depends entirely on prompt design.
Prompt engineering for agents:
- Structure matters: Notice the prompt breaks down reasoning into steps. “What should you do next?” is too vague. “Think step-by-step about information you have, information you need, tools to use, and whether you can answer” guides better reasoning.
- Context ordering: Put the most important information first. Task description comes before history. Tool descriptions come before previous results. LLMs perform better with well-structured input.
- Tool descriptions in reasoning: The agent needs clear descriptions of each tool’s purpose, inputs, and outputs. Ambiguous descriptions lead to wrong tool selection.
- ReAct pattern: Many production systems use “Reason + Act” prompting. The LLM explicitly writes its reasoning (“I need weather data, so I’ll use the weather tool”) before selecting actions. This improves decision quality and debuggability. A short example trace follows this list.
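For illustration, a hypothetical ReAct-style trace for a weather question might look like the following (the exact format and the tool name are assumptions, not output from any specific model):
Thought: The user wants current weather, which I cannot know without a tool.
Action: weather_api
Action Input: {"city": "San Francisco"}
Observation: {"temp_f": 68, "conditions": "sunny"}
Thought: I now have the temperature, so I can answer directly.
Final Answer: It is currently 68°F and sunny in San Francisco.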
Common reasoning failures:
- Generic prompts that don’t guide step-by-step thinking
- Missing tool descriptions causing the agent to hallucinate functions
- Unclear task specifications leading to goal confusion
- No explicit reasoning step making decisions opaque
Step 3: Decide – Action Selection
Purpose: Convert reasoning into a specific, executable action
def decide_action(reasoning: str) -> dict:
"""
Parse the LLM's reasoning and extract a structured action.
This bridges thinking and execution.
"""
# Parse LLM output for tool calls or final answers
if "Tool:" in reasoning:
tool_name = extract_tool_name(reasoning)
tool_args = extract_tool_arguments(reasoning)
return {
"type": "tool_call",
"tool": tool_name,
"arguments": tool_args
}
elif "Final Answer:" in reasoning:
answer = extract_final_answer(reasoning)
return {
"type": "final_answer",
"content": answer
}
else:
# Unclear reasoning - request clarification
return {
"type": "continue",
"message": "Need more information"
        }
Decision making converts reasoning to structure. The LLM output is text. Execution requires structured data. This step parses reasoning into actionable commands.
Structured output formats:
Modern LLMs support structured outputs via function calling or JSON mode. Instead of parsing text, you can get typed responses:
# Using OpenAI function calling
action = llm.generate(
messages=messages,
tools=[
{
"type": "function",
"function": {
"name": "calculator",
"description": "Perform mathematical calculations",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string"}
},
"required": ["expression"]
}
}
}
],
tool_choice="auto"
)
This approach eliminates parsing errors and guarantees valid tool calls.
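The structured response still has to be mapped into the action dictionary used by decide_action. A minimal sketch, assuming an OpenAI-style Chat Completions response object (adjust the attribute access for your client library):
import json

def action_from_response(response) -> dict:
    """Convert an OpenAI-style tool-calling response into an action dict."""
    message = response.choices[0].message
    if message.tool_calls:  # the model chose a tool
        call = message.tool_calls[0]
        return {
            "type": "tool_call",
            "tool": call.function.name,
            "arguments": json.loads(call.function.arguments),  # JSON string -> dict
        }
    # No tool call: treat the text content as the final answer
    return {"type": "final_answer", "content": message.content}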
Decision validation:
Before executing, validate the decision; a minimal validator sketch follows this checklist:
- Does the requested tool exist?
- Are all required arguments provided?
- Do argument types match the schema?
- Are argument values reasonable?
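The validate_action helper used in the next snippet is not defined above. Here is a minimal sketch covering these four checks; the global TOOL_REGISTRY mapping tool names to the Tool wrapper objects (introduced in the tool-calling section below) is an assumption:
class ValidationError(Exception):
    """Raised when a proposed action fails validation."""

def validate_action(action: dict, tools: dict = None):
    """Check a tool_call action against the registered tool schemas."""
    if action["type"] != "tool_call":
        return  # only tool calls need schema validation
    tools = tools or TOOL_REGISTRY  # assumed global {name: Tool} registry
    if action["tool"] not in tools:
        raise ValidationError(f"Unknown tool: {action['tool']}")
    schema = tools[action["tool"]].schema["parameters"]
    for param in schema.get("required", []):
        if param not in action["arguments"]:
            raise ValidationError(f"Missing required argument: {param}")
    # Light type check against the JSON Schema property types
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    for name, value in action["arguments"].items():
        expected = schema.get("properties", {}).get(name, {}).get("type")
        if expected in type_map and not isinstance(value, type_map[expected]):
            raise ValidationError(f"Argument '{name}' should be of type {expected}")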
Failure handling:
What happens when the LLM generates invalid output? You need fallback logic:
def decide_action_safe(reasoning: str) -> dict:
try:
action = decide_action(reasoning)
validate_action(action)
return action
except ParseError:
return {
"type": "error",
"message": "Could not parse action from reasoning"
}
except ValidationError as e:
return {
"type": "error",
"message": f"Invalid action: {str(e)}"
        }
Common decision failures:
- LLM hallucinates non-existent tools
- Missing required arguments in tool calls
- Type mismatches between provided and expected arguments
- No validation before execution causing downstream crashes
Step 4: Act – Execution
Purpose: Execute the decided action and return results
def execute_action(action: dict) -> dict:
"""
Execute tool calls or generate final answers.
This is where the agent interacts with the world.
"""
if action["type"] == "tool_call":
tool = get_tool(action["tool"])
try:
result = tool.execute(**action["arguments"])
return {
"success": True,
"result": result,
"tool": action["tool"]
}
except Exception as e:
return {
"success": False,
"error": str(e),
"tool": action["tool"]
}
elif action["type"] == "final_answer":
return {
"success": True,
"result": action["content"],
"final": True
}
elif action["type"] == "error":
return {
"success": False,
"error": action["message"]
        }
This is where theory meets reality. Tools interact with external systems: APIs, databases, file systems, calculators. External systems fail, timeout, return unexpected data, or change their interfaces.
Production execution considerations:
Error handling is mandatory: Every external call can fail. Network issues, API rate limits, authentication failures, malformed responses – expect everything.
def safe_tool_execution(tool, arguments, timeout=30):
"""Production-grade tool execution with comprehensive error handling"""
try:
# Set timeout to prevent hanging
with time_limit(timeout):
result = tool.execute(**arguments)
# Validate result format
validate_result_schema(result)
return {"success": True, "result": result}
except TimeoutError:
return {"success": False, "error": "Tool execution timeout"}
except ValidationError as e:
return {"success": False, "error": f"Invalid result format: {e}"}
except APIError as e:
return {"success": False, "error": f"API error: {e}"}
except Exception as e:
# Log unexpected errors for debugging
logger.exception(f"Unexpected error in {tool.name}")
return {"success": False, "error": "Tool execution failed"}Retry logic: Transient failures (network issues, temporary API problems) should trigger retries with exponential backoff:
def execute_with_retry(tool, arguments, max_retries=3):
for attempt in range(max_retries):
result = tool.execute(**arguments)
if result["success"]:
return result
if not is_retryable_error(result["error"]):
return result
# Exponential backoff: 1s, 2s, 4s
time.sleep(2 ** attempt)
    return result
Result formatting: Tools should return consistent result structures. Standardize on success/error patterns:
# Good: Consistent structure
{
"success": True,
"result": "42",
"metadata": {"tool": "calculator", "execution_time": 0.01}
}
# Bad: Inconsistent structure
"42" # Just a string - no error informationCommon execution failures:
- Missing timeout handling causing agents to hang
- No retry logic for transient failures
- Poor error messages making debugging impossible
- Inconsistent result formats breaking downstream processing
Step 5: Update State – Memory Management
Purpose: Incorporate action results into agent state
def update_state(state: dict, action: dict, result: dict) -> dict:
"""
Update state with action outcomes.
This is how the agent learns and remembers.
"""
# Add to conversation history
state["conversation_history"].append({
"iteration": state["iteration"],
"action": action,
"result": result,
"timestamp": datetime.now()
})
# Update last result for next observation
state["last_result"] = result
# Check for completion
if result.get("final"):
state["completed"] = True
state["final_answer"] = result["result"]
# Trim history if too long
if len(state["conversation_history"]) > 20:
state["conversation_history"] = state["conversation_history"][-20:]
    return state
State management is how agents remember. Without proper updates, agents repeat actions, forget results, and lose context.
What to store:
- Conversation history: Every action and result. This creates the narrative of what happened. Essential for debugging and providing context in future observations.
- Last result: The most recent outcome directly influences the next decision. Store it separately for easy access.
- Metadata: Timestamps, iteration numbers, execution times. Useful for debugging and performance analysis.
State trimming strategies:
States grow indefinitely if not managed. Implement strategies:
- Fixed window: Keep last N interactions (shown above)
- Summarization: Use an LLM to summarize old history into concise context (see the sketch after this list)
- Semantic filtering: Keep only relevant interactions based on similarity to current task
- Hierarchical storage: Recent items in full detail, older items summarized, ancient items removed
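A minimal sketch of the summarization strategy, reusing the llm.generate interface from earlier; the prompt wording and the keep_recent split are illustrative choices, not fixed requirements:
def compress_history(history: list, keep_recent: int = 6) -> list:
    """Summarize older turns into one message and keep recent turns verbatim."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = llm.generate(
        "Summarize this agent conversation in a few sentences, preserving "
        "decisions, tool results, and open questions:\n"
        + "\n".join(str(turn) for turn in old)
    )
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent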
Memory types explained:

Figure 2: Three types of agent memory – Short-term (conversation), Long-term (persistent), and Episodic (learning from past interactions)
Short-term memory:
- Current conversation context
- Lasts for a single session
- Stored in the state dictionary
- Used for maintaining coherence within a task
Long-term memory:
- Persistent information across sessions
- User preferences, learned facts, configuration
- Stored in databases or vector stores
- Requires explicit saving and loading
Episodic memory:
- Past successful/failed strategies
- Patterns of what works in specific situations
- Used for learning and improvement
- Stored as embeddings of past interactions
Common state management failures:
- Unbounded state growth causing memory issues
- Not trimming history leading to token limit errors
- Missing metadata making debugging impossible
- No persistent storage losing context between sessions
Termination Check – Knowing When to Stop
Purpose: Determine if the agent should continue or finish
def is_task_complete(state: dict) -> bool:
"""
Multiple termination conditions for safety and correctness.
Never rely on a single condition.
"""
# Success: Explicit completion
if state.get("completed"):
return True
# Safety: Maximum iterations
if state["iteration"] >= MAX_ITERATIONS:
logger.warning("Max iterations reached")
return True
# Safety: Cost limits
if calculate_cost(state) >= MAX_COST:
logger.warning("Cost budget exceeded")
return True
# Safety: Time limits
if time_elapsed(state) >= MAX_TIME:
logger.warning("Time limit exceeded")
return True
# Detection: Loop/stuck state
if detect_loop(state):
logger.warning("Loop detected")
return True
    return False
Termination is critical and complex. A single condition isn’t enough. You need multiple safety valves.
Termination conditions explained:
- Task completion (success): The agent explicitly generated a final answer and marked itself complete. This is the happy path.
- Max iterations (safety): After N iterations, stop regardless. Prevents infinite loops from logic errors or unclear tasks. Set this based on task complexity – simple tasks might need 5 iterations, complex ones might need 20.
- Cost limits (budget): Each LLM call costs money. Set a budget (in dollars or tokens) and stop when exceeded. Protects against runaway costs. A simple cost-tracking sketch follows this list.
- Time limits (performance): User-facing agents need responsiveness. If execution exceeds time budget, return partial results rather than making users wait indefinitely.
- Loop detection (stuck states): If the agent repeats the same action multiple times or cycles through the same states, it’s stuck. Detect this and terminate.
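The calculate_cost check above needs a cost model. A minimal sketch that accumulates the token usage reported by the LLM API and prices it at flat per-1K rates; the rates here are placeholders, so substitute your provider’s actual pricing:
# Placeholder per-1K-token rates - replace with your provider's pricing.
PRICE_PER_1K_PROMPT = 0.0005
PRICE_PER_1K_COMPLETION = 0.0015

def record_usage(state: dict, prompt_tokens: int, completion_tokens: int):
    """Accumulate token counts returned by the LLM API after each call."""
    usage = state.setdefault("usage", {"prompt": 0, "completion": 0})
    usage["prompt"] += prompt_tokens
    usage["completion"] += completion_tokens

def calculate_cost(state: dict) -> float:
    """Convert accumulated token usage into an approximate dollar cost."""
    usage = state.get("usage", {"prompt": 0, "completion": 0})
    return (usage["prompt"] / 1000) * PRICE_PER_1K_PROMPT + \
           (usage["completion"] / 1000) * PRICE_PER_1K_COMPLETION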
Loop detection implementation:
def detect_loop(state: dict, window=3) -> bool:
"""
Detect if agent is repeating actions.
Compares last N actions for similarity.
"""
if len(state["conversation_history"]) < window:
return False
recent_actions = [
h["action"] for h in state["conversation_history"][-window:]
]
# Check if all recent actions are identical
if all(a == recent_actions[0] for a in recent_actions):
return True
# Check if cycling through same set of actions
if len(set(str(a) for a in recent_actions)) < window / 2:
return True
    return False
Graceful degradation:
When terminating due to safety conditions, provide useful output:
def extract_final_answer(state: dict) -> str:
"""
Extract final answer, handling different termination reasons.
"""
if state.get("final_answer"):
return state["final_answer"]
# Terminated due to safety condition
if state["iteration"] >= MAX_ITERATIONS:
return "Could not complete task within iteration limit. " + \
summarize_progress(state)
if detect_loop(state):
return "Task appears stuck. Last attempted: " + \
describe_last_action(state)
# Fallback
return "Task incomplete. Progress: " + summarize_progress(state)Common termination failures:
- Single termination condition causing infinite loops
- No cost limits burning through API budgets
- Missing timeout making user-facing agents unresponsive
- Poor loop detection allowing stuck states to continue
Tool Calling: The Action Interface
Tools are how agents interact with the world. Without properly designed tools, agents are just chatbots. With them, agents can query databases, call APIs, perform calculations, and manipulate systems.
The three-part tool structure:
Every production tool needs three components:
1. Function Implementation:
def search_web(query: str, num_results: int = 5) -> str:
"""
Search the web and return results.
Args:
query: Search query string
num_results: Number of results to return (default: 5)
Returns:
Formatted search results
"""
try:
# Implementation
results = web_search_api.search(query, num_results)
return format_results(results)
except Exception as e:
return f"Search failed: {str(e)}"2. Schema Definition:
search_tool_schema = {
"name": "search_web",
"description": "Search the web for current information. Use this when you need recent data, news, or information not in your training data.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
},
"num_results": {
"type": "integer",
"description": "Number of results (1-10)",
"default": 5
}
},
"required": ["query"]
}
}
3. Wrapper Class:
class Tool:
"""Base tool interface"""
def __init__(self, name: str, description: str, function: callable, schema: dict):
self.name = name
self.description = description
self.function = function
self.schema = schema
def execute(self, **kwargs) -> dict:
"""Execute tool with validation and error handling"""
# Validate inputs against schema
self._validate_inputs(kwargs)
# Execute with error handling
try:
result = self.function(**kwargs)
return {"success": True, "result": result}
except Exception as e:
return {"success": False, "error": str(e)}
def _validate_inputs(self, kwargs: dict):
"""Validate inputs match schema"""
required = self.schema["parameters"].get("required", [])
for param in required:
if param not in kwargs:
raise ValueError(f"Missing required parameter: {param}")Why all three components matter:
- Function implementation does the actual work. This is where you integrate with external systems.
- Schema definition tells the LLM how to use the tool. Clear descriptions and parameter documentation are essential. The LLM decides which tool to use based entirely on this information.
- Wrapper class provides standardization. All tools follow the same interface, simplifying agent logic and error handling.
Tool description best practices:
# Bad description
"search_web: Searches the web"
# Good description
"search_web: Search the internet for current information, news, and recent events. Use this when you need information published after your knowledge cutoff or want to verify current facts. Returns the top search results with titles and snippets."Good descriptions answer:
- What does it do?
- When should you use it?
- What does it return?

Figure 3: Tool calling flow – LLM generates tool call → Schema validation → Function execution → Result formatting → State update
Real-world tool examples:
Calculator tool:
def calculator(expression: str) -> str:
"""
Evaluate mathematical expressions safely.
Supports: +, -, *, /, **, (), and common functions.
"""
try:
        # Safe evaluation via a restricted expression evaluator (no exec/eval)
        result = eval_expression_safe(expression)
return f"Result: {result}"
except Exception as e:
return f"Error: {str(e)}"
calculator_schema = {
"name": "calculator",
"description": "Perform mathematical calculations. Supports arithmetic, exponents, and parentheses. Use for any computation.",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "Mathematical expression (e.g., '2 + 2', '(10 * 5) / 2')"
}
},
"required": ["expression"]
}
}
Database query tool:
def query_database(query: str, table: str) -> str:
"""
Execute SQL query on specified table.
Supports: SELECT statements only (read-only).
"""
# Validate query is SELECT only
if not query.strip().upper().startswith("SELECT"):
return "Error: Only SELECT queries allowed"
try:
results = db.execute(query, table)
return format_db_results(results)
except Exception as e:
return f"Query error: {str(e)}"
database_schema = {
"name": "query_database",
"description": "Query the database for stored information. Use this to retrieve user data, preferences, past orders, or historical records. Read-only access.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "SQL SELECT query"
},
"table": {
"type": "string",
"description": "Table name to query",
"enum": ["users", "orders", "products", "preferences"]
}
},
"required": ["query", "table"]
}
}
API call tool:
def api_call(endpoint: str, method: str = "GET", data: dict = None) -> str:
"""
Make API requests to external services.
Handles authentication and error responses.
"""
try:
response = requests.request(
method=method,
url=f"{API_BASE_URL}/{endpoint}",
json=data,
headers={"Authorization": f"Bearer {API_KEY}"},
timeout=30
)
response.raise_for_status()
return response.json()
except requests.Timeout:
return "Error: Request timeout"
except requests.RequestException as e:
return f"Error: {str(e)}"
api_call_schema = {
"name": "api_call",
"description": "Call external APIs for real-time data. Use for weather, stock prices, exchange rates, or other external services.",
"parameters": {
"type": "object",
"properties": {
"endpoint": {
"type": "string",
"description": "API endpoint path (e.g., 'weather', 'stocks/AAPL')"
},
"method": {
"type": "string",
"enum": ["GET", "POST"],
"default": "GET"
},
"data": {
"type": "object",
"description": "Request body for POST requests"
}
},
"required": ["endpoint"]
}
}
Tool error handling patterns:
class ToolExecutionError(Exception):
"""Base exception for tool errors"""
pass
class ToolTimeoutError(ToolExecutionError):
"""Tool execution exceeded timeout"""
pass
class ToolValidationError(ToolExecutionError):
"""Tool inputs failed validation"""
pass
def execute_tool_safe(tool: Tool, arguments: dict) -> dict:
"""
Production-grade tool execution with comprehensive error handling.
"""
try:
# Validate inputs
tool._validate_inputs(arguments)
# Execute with timeout
with timeout(30):
result = tool.execute(**arguments)
# Validate output
validate_tool_output(result)
return result
except ToolValidationError as e:
logger.error(f"Tool validation failed: {e}")
return {
"success": False,
"error": f"Invalid input: {str(e)}",
"recoverable": True
}
except ToolTimeoutError:
logger.error(f"Tool timeout: {tool.name}")
return {
"success": False,
"error": "Tool execution timeout",
"recoverable": True
}
except Exception as e:
logger.exception(f"Tool error: {tool.name}")
return {
"success": False,
"error": f"Execution failed: {str(e)}",
"recoverable": False
        }
Common tool implementation mistakes:
- Vague descriptions causing the LLM to misuse tools
- Missing input validation allowing invalid data through
- No timeout handling causing hung executions
- Poor error messages making debugging impossible
- Inconsistent return formats breaking state updates
Memory Architecture: Short-term, Long-term, and Episodic
Memory separates toy demos from production systems. Conversation without memory frustrates users. But not all memory is the same – different types serve different purposes.

Figure 4: Three-tier memory architecture showing Short-term memory (current session), Long-term memory (persistent storage), and Episodic memory (past interaction patterns)
Short-term Memory: Conversation Context
Purpose: Maintain coherence within a single conversation
Implementation:
class ShortTermMemory:
"""
Manages conversation context for current session.
Stored in-memory, not persisted.
"""
def __init__(self, max_messages: int = 20):
self.messages = []
self.max_messages = max_messages
def add_message(self, role: str, content: str):
"""Add message to history"""
self.messages.append({
"role": role,
"content": content,
"timestamp": datetime.now()
})
# Trim if too long
if len(self.messages) > self.max_messages:
self.messages = self.messages[-self.max_messages:]
def get_context(self) -> list:
"""Get recent conversation context"""
return self.messages
def clear(self):
"""Clear conversation history"""
        self.messages = []
Use cases:
- Current conversation flow
- Immediate context for next response
- Temporary task state
- Within-session coherence
Limitations:
- Lost when session ends
- Grows unbounded without trimming
- Token limits force summarization
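A quick usage sketch of the class above:
memory = ShortTermMemory(max_messages=10)
memory.add_message("user", "What's the weather in Boston?")
memory.add_message("assistant", "Calling the weather tool...")
print(len(memory.get_context()))  # 2
memory.clear()  # start a fresh session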
Long-term Memory: Persistent Storage
Purpose: Remember information across sessions
Implementation:
class LongTermMemory:
"""
Persistent storage for facts and preferences.
Uses database or key-value store.
"""
def __init__(self, user_id: str, db_connection):
self.user_id = user_id
self.db = db_connection
def store_fact(self, key: str, value: str, category: str = "general"):
"""Store a fact about the user"""
self.db.upsert(
table="user_facts",
data={
"user_id": self.user_id,
"key": key,
"value": value,
"category": category,
"updated_at": datetime.now()
}
)
def retrieve_fact(self, key: str) -> str:
"""Retrieve a stored fact"""
result = self.db.query(
f"SELECT value FROM user_facts WHERE user_id = ? AND key = ?",
(self.user_id, key)
)
return result["value"] if result else None
def get_all_facts(self, category: str = None) -> dict:
"""Get all facts, optionally filtered by category"""
query = "SELECT key, value FROM user_facts WHERE user_id = ?"
params = [self.user_id]
if category:
query += " AND category = ?"
params.append(category)
results = self.db.query(query, params)
return {r["key"]: r["value"] for r in results}Use cases:
- User preferences (communication style, format preferences)
- Personal information (name, location, job title)
- Learned facts (favorite tools, common tasks)
- Configuration (default parameters, enabled features)
Storage considerations:
Database: Structured facts work well in relational databases
# Schema
CREATE TABLE user_facts (
user_id TEXT,
key TEXT,
value TEXT,
category TEXT,
updated_at TIMESTAMP,
PRIMARY KEY (user_id, key)
);
Vector database: Semantic retrieval for unstructured information
class VectorMemory:
"""Store and retrieve memories by semantic similarity"""
def __init__(self, user_id: str, vector_db):
self.user_id = user_id
self.vector_db = vector_db
def store(self, content: str, metadata: dict = None):
"""Store content with embeddings"""
embedding = generate_embedding(content)
self.vector_db.upsert(
user_id=self.user_id,
embedding=embedding,
content=content,
metadata=metadata or {}
)
def search(self, query: str, top_k: int = 5) -> list:
"""Find similar memories"""
query_embedding = generate_embedding(query)
return self.vector_db.search(
user_id=self.user_id,
embedding=query_embedding,
top_k=top_k
        )
Episodic Memory: Learning from Past Interactions
Purpose: Remember and learn from past episodes (complete task sequences)
Implementation:
class EpisodicMemory:
"""
Stores complete interaction episodes for learning.
Captures successful strategies and failure patterns.
"""
def __init__(self, user_id: str, vector_db):
self.user_id = user_id
self.vector_db = vector_db
def store_episode(self, task: str, actions: list, outcome: dict):
"""Store a complete task episode"""
episode = {
"task": task,
"actions": actions,
"outcome": outcome,
"success": outcome.get("success", False),
"timestamp": datetime.now()
}
# Create embeddings for semantic search
episode_text = f"{task} | {format_actions(actions)}"
embedding = generate_embedding(episode_text)
self.vector_db.upsert(
collection="episodes",
user_id=self.user_id,
embedding=embedding,
data=episode
)
def retrieve_similar_episodes(self, task: str, top_k: int = 3) -> list:
"""Find similar past episodes"""
query_embedding = generate_embedding(task)
return self.vector_db.search(
collection="episodes",
user_id=self.user_id,
embedding=query_embedding,
top_k=top_k
)
def get_successful_strategies(self, task_type: str) -> list:
"""Get successful strategies for similar tasks"""
episodes = self.retrieve_similar_episodes(task_type, top_k=10)
successful = [e for e in episodes if e["success"]]
return [e["actions"] for e in successful]Use cases:
- Learning which approaches work for specific task types
- Avoiding previously failed strategies
- Transferring successful patterns to similar tasks
- Building user-specific behavior models
Episode structure:
episode = {
"task": "Find weather in San Francisco",
"actions": [
{
"type": "tool_call",
"tool": "weather_api",
"arguments": {"city": "San Francisco"},
"result": {"success": True, "temp": 68}
}
],
"outcome": {
"success": True,
"user_satisfied": True,
"execution_time": 1.2
},
"metadata": {
"context": "user planning trip",
"tools_available": ["weather_api", "search_web"],
"strategy": "direct_api_call"
}
}
Hybrid Memory System
Production systems combine all three types:
class HybridMemory:
"""
Complete memory system combining short-term, long-term, and episodic.
"""
def __init__(self, user_id: str):
self.short_term = ShortTermMemory()
self.long_term = LongTermMemory(user_id, get_db())
self.episodic = EpisodicMemory(user_id, get_vector_db())
def prepare_context(self, task: str) -> dict:
"""Prepare complete context for agent"""
return {
# Current conversation
"recent_messages": self.short_term.get_context(),
# User facts and preferences
"user_facts": self.long_term.get_all_facts(),
# Similar past successes
"similar_episodes": self.episodic.retrieve_similar_episodes(task),
# Learned strategies
"successful_strategies": self.episodic.get_successful_strategies(task)
}
def update(self, role: str, content: str, metadata: dict = None):
"""Update all memory types"""
# Update short-term
self.short_term.add_message(role, content)
# Extract and store facts
if facts := extract_facts(content):
for key, value in facts.items():
self.long_term.store_fact(key, value)
def finalize_episode(self, task: str, outcome: dict):
"""Store complete episode after task completion"""
actions = self.short_term.get_context()
        self.episodic.store_episode(task, actions, outcome)
Memory selection guide:
| Need | Memory Type |
|---|---|
| Current conversation | Short-term |
| User preferences | Long-term |
| Past successful strategies | Episodic |
| Temporary task state | Short-term |
| Learned behaviors | Long-term + Episodic |
| Session-specific context | Short-term |
| Cross-session facts | Long-term |
| Strategy learning | Episodic |
Observations vs Actions: The Critical Distinction
This seems obvious until you’re debugging a broken agent. Did it fail because it didn’t observe the right information, or because it took the wrong action based on correct observations?
The distinction:
Observations are information inputs:
- Current task description
- Conversation history
- Available tools
- Previous results
- Memory context
- System state
Actions are operations:
- Tool calls
- Final answer generation
- Follow-up questions
- State updates
- Memory writes
Why this matters for debugging:
# Example debugging scenario
task = "Find weather in San Francisco and convert temperature to Celsius"
# Agent fails - but where?
# Possibility 1: Observation failure
# - Task not in context
# - Tool description missing
# - Previous result not included
# Possibility 2: Action failure
# - Selected wrong tool
# - Provided invalid arguments
# - Didn’t chain actions properly
Systematic debugging approach:
1. Check observations:
def debug_observations(state: dict):
"""Verify observation quality"""
observation = observe(state)
checks = {
"task_present": "task" in observation,
"tools_described": len(observation.get("available_tools", [])) > 0,
"history_included": len(observation.get("history", [])) > 0,
"previous_result": "previous_result" in observation
}
print("Observation Quality:")
for check, passed in checks.items():
status = "✓" if passed else "✗"
print(f" {status} {check}")
    return observation
2. Check reasoning:
def debug_reasoning(observation: dict, reasoning: str):
"""Verify reasoning quality"""
checks = {
"task_referenced": observation["task"] in reasoning,
"tools_considered": any(tool["name"] in reasoning
for tool in observation["available_tools"]),
"explicit_decision": any(marker in reasoning
for marker in ["I will", "I should", "Next step"]),
"reasoning_present": len(reasoning) > 100
}
print("Reasoning Quality:")
for check, passed in checks.items():
status = "✓" if passed else "✗"
print(f" {status} {check}")3. Check actions:
def debug_action(action: dict):
"""Verify action validity"""
checks = {
"valid_type": action["type"] in ["tool_call", "final_answer"],
"tool_exists": action.get("tool") in get_available_tools(),
"has_arguments": "arguments" in action if action["type"] == "tool_call" else True,
"arguments_valid": validate_arguments(action) if action["type"] == "tool_call" else True
}
print("Action Quality:")
for check, passed in checks.items():
status = "✓" if passed else "✗"
print(f" {status} {check}")Common failure patterns:
Observation failures:
- Missing tool descriptions → Agent doesn’t know what’s available
- Truncated history → Lost context from earlier conversation
- No previous result → Repeats failed actions
- Task not included → Goal drift
Reasoning failures:
- Generic thinking → No specific strategy
- Ignores tools → Tries to answer without external data
- No step-by-step breakdown → Jumps to conclusions
- Contradictory logic → Internal inconsistency
Action failures:
- Hallucinated tools → Tries to call non-existent functions
- Invalid arguments → Wrong types or missing required parameters
- Wrong tool selection → Has right tools but picks wrong one
- No action → Gets stuck in analysis paralysis
The debugging workflow:

Figure 5: Agent Failure Debugging Flow
Building a Production Agent: Complete Implementation
Let’s tie everything together with a complete, production-ready agent implementation:
import logging
from datetime import datetime
from typing import Dict, List, Any
import json
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ProductionAgent:
"""
Complete agent implementation with:
- Multiple tools
- Conversation memory
- Error handling
- Execution tracking
- Debug capabilities
"""
def __init__(
self,
llm,
tools: List[Tool],
max_iterations: int = 10,
max_cost: float = 1.0
):
self.llm = llm
self.tools = {tool.name: tool for tool in tools}
self.max_iterations = max_iterations
self.max_cost = max_cost
self.memory = ShortTermMemory()
# Execution tracking
self.stats = {
"total_iterations": 0,
"successful_completions": 0,
"tool_calls": 0,
"errors": 0,
"total_cost": 0.0
}
def run(self, task: str, debug: bool = False) -> Dict[str, Any]:
"""
Execute agent loop for given task.
Args:
task: The task to accomplish
debug: Enable debug output
Returns:
Result dictionary with answer and metadata
"""
# Initialize state
state = {
"task": task,
"iteration": 0,
"completed": False,
"start_time": datetime.now()
}
logger.info(f"Starting task: {task}")
try:
# Main agent loop
while not self._should_terminate(state):
if debug:
print(f"\n=== Iteration {state['iteration']} ===")
# OBSERVE
observation = self._observe(state)
if debug:
print(f"Observation: {json.dumps(observation, indent=2)}")
# THINK
reasoning = self._think(observation)
if debug:
print(f"Reasoning: {reasoning[:200]}...")
# DECIDE
action = self._decide(reasoning)
if debug:
print(f"Action: {action}")
# ACT
result = self._act(action)
if debug:
print(f"Result: {result}")
# UPDATE STATE
state = self._update_state(state, action, result)
# Check completion
if result.get("final"):
state["completed"] = True
state["final_answer"] = result["result"]
state["iteration"] += 1
self.stats["total_iterations"] += 1
# Extract final answer
answer = self._extract_answer(state)
if state["completed"]:
self.stats["successful_completions"] += 1
return {
"success": True,
"answer": answer,
"iterations": state["iteration"],
"execution_time": (datetime.now() - state["start_time"]).total_seconds(),
"termination_reason": self._get_termination_reason(state)
}
except Exception as e:
logger.exception("Agent execution failed")
self.stats["errors"] += 1
return {
"success": False,
"error": str(e),
"iterations": state["iteration"]
}
def _observe(self, state: dict) -> dict:
"""Gather context for decision making"""
return {
"task": state["task"],
"conversation": self.memory.get_context(),
"available_tools": [
{
"name": tool.name,
"description": tool.description,
"parameters": tool.schema["parameters"]
}
for tool in self.tools.values()
],
"iteration": state["iteration"],
"max_iterations": self.max_iterations,
"previous_result": state.get("last_result")
}
def _think(self, observation: dict) -> str:
"""LLM reasoning step"""
prompt = self._build_prompt(observation)
# Track cost
response = self.llm.generate(prompt)
self.stats["total_cost"] += estimate_cost(response)
return response
def _build_prompt(self, observation: dict) -> str:
"""Construct prompt for LLM"""
tools_desc = "\n".join([
f"- {t['name']}: {t['description']}"
for t in observation["available_tools"]
])
history = "\n".join([
f"{msg['role']}: {msg['content']}"
for msg in observation["conversation"][-5:]
])
return f"""You are a helpful agent that can use tools to accomplish tasks.
Task: {observation['task']}
Available tools:
{tools_desc}
Conversation history:
{history}
Previous result: {observation.get('previous_result', 'None')}
You are on iteration {observation['iteration']} of {observation['max_iterations']}.
Think step by step:
1. What is the current situation?
2. What information do I have?
3. What information do I need?
4. Should I use a tool or provide a final answer?
If using a tool, respond with:
Tool: <tool_name>
Arguments: <arguments_as_json>
If providing final answer, respond with:
Final Answer: <your_answer>
Your reasoning:"""
def _decide(self, reasoning: str) -> dict:
"""Parse reasoning into structured action"""
try:
if "Tool:" in reasoning:
# Extract tool call
tool_line = [l for l in reasoning.split("\n") if l.startswith("Tool:")][0]
tool_name = tool_line.split("Tool:")[1].strip()
args_line = [l for l in reasoning.split("\n") if l.startswith("Arguments:")][0]
args_json = args_line.split("Arguments:")[1].strip()
arguments = json.loads(args_json)
return {
"type": "tool_call",
"tool": tool_name,
"arguments": arguments
}
elif "Final Answer:" in reasoning:
# Extract final answer
answer = reasoning.split("Final Answer:")[1].strip()
return {
"type": "final_answer",
"content": answer
}
else:
return {
"type": "continue",
"message": "No clear action determined"
}
except Exception as e:
logger.error(f"Failed to parse action: {e}")
return {
"type": "error",
"message": f"Could not parse action: {str(e)}"
}
def _act(self, action: dict) -> dict:
"""Execute action"""
try:
if action["type"] == "tool_call":
# Validate tool exists
if action["tool"] not in self.tools:
return {
"success": False,
"error": f"Tool '{action['tool']}' not found"
}
# Execute tool
tool = self.tools[action["tool"]]
result = tool.execute(**action["arguments"])
self.stats["tool_calls"] += 1
return result
elif action["type"] == "final_answer":
return {
"success": True,
"result": action["content"],
"final": True
}
elif action["type"] == "continue":
return {
"success": False,
"error": "No action taken - agent is uncertain"
}
elif action["type"] == "error":
return {
"success": False,
"error": action["message"]
}
except Exception as e:
logger.exception("Action execution failed")
return {
"success": False,
"error": str(e)
}
def _update_state(self, state: dict, action: dict, result: dict) -> dict:
"""Update state with action outcome"""
# Add to memory
self.memory.add_message(
role="assistant",
content=f"Action: {action['type']} | Result: {result.get('result', result.get('error'))}"
)
# Store last result
state["last_result"] = result
return state
def _should_terminate(self, state: dict) -> bool:
"""Check termination conditions"""
# Success
if state.get("completed"):
return True
# Max iterations
if state["iteration"] >= self.max_iterations:
logger.warning("Max iterations reached")
return True
# Cost limit
if self.stats["total_cost"] >= self.max_cost:
logger.warning("Cost limit exceeded")
return True
# Time limit (5 minutes)
elapsed = (datetime.now() - state["start_time"]).total_seconds()
if elapsed > 300:
logger.warning("Time limit exceeded")
return True
return False
def _extract_answer(self, state: dict) -> str:
"""Extract final answer from state"""
if "final_answer" in state:
return state["final_answer"]
# Fallback for incomplete tasks
last_result = state.get("last_result", {})
if last_result.get("success"):
return f"Task incomplete. Last result: {last_result['result']}"
else:
return f"Task incomplete. Last error: {last_result.get('error', 'Unknown')}"
def _get_termination_reason(self, state: dict) -> str:
"""Determine why execution terminated"""
if state.get("completed"):
return "task_completed"
elif state["iteration"] >= self.max_iterations:
return "max_iterations"
elif self.stats["total_cost"] >= self.max_cost:
return "cost_limit"
else:
return "unknown"
def get_stats(self) -> dict:
"""Get execution statistics"""
return self.stats.copy()
def reset_stats(self):
"""Reset execution statistics"""
for key in self.stats:
            self.stats[key] = 0.0 if isinstance(self.stats[key], float) else 0
Usage example:
# Define tools
calculator = Tool(
name="calculator",
description="Perform mathematical calculations",
function=calculator_function,
schema=calculator_schema
)
weather = Tool(
name="weather",
description="Get current weather for a location",
function=weather_function,
schema=weather_schema
)
search = Tool(
name="search_web",
description="Search the web for information",
function=search_function,
schema=search_schema
)
# Create agent
agent = ProductionAgent(
llm=get_llm(),
tools=[calculator, weather, search],
max_iterations=10,
max_cost=0.50
)
# Run task
result = agent.run(
task="What's the weather in San Francisco? Convert the temperature to Celsius.",
debug=True
)
print(f"Answer: {result['answer']}")
print(f"Iterations: {result['iterations']}")
print(f"Time: {result['execution_time']:.2f}s")
print(f"Reason: {result['termination_reason']}")
# Check stats
print("\nExecution Statistics:")
print(json.dumps(agent.get_stats(), indent=2))
This implementation includes:
- ✅ Complete agent loop
- ✅ Multiple tools with validation
- ✅ Conversation memory
- ✅ Error handling at every step
- ✅ Execution tracking and statistics
- ✅ Debug mode for development
- ✅ Multiple termination conditions
- ✅ Cost tracking
- ✅ Comprehensive logging
Testing and Debugging Strategies
Production agents require systematic testing. Here’s how to validate each component:
Unit Tests
Test individual functions:
def test_observation():
"""Test observation gathering"""
state = {
"task": "Test task",
"conversation_history": [
{"role": "user", "content": "Hello"}
],
"iteration": 1
}
observation = observe(state)
assert "task" in observation
assert observation["task"] == "Test task"
assert len(observation["history"]) == 1
assert "available_tools" in observation
def test_tool_execution():
"""Test tool execution"""
tool = calculator_tool
result = tool.execute(expression="2 + 2")
assert result["success"] == True
assert "4" in result["result"]
def test_memory():
"""Test memory operations"""
memory = ShortTermMemory()
memory.add_message("user", "My name is Alice")
context = memory.get_context()
assert len(context) == 1
assert "Alice" in str(context)Integration Tests
Test component interactions:
def test_agent_with_calculator():
"""Test agent executing calculator tool"""
agent = ProductionAgent(
llm=get_test_llm(),
tools=[calculator_tool],
max_iterations=5
)
result = agent.run("What is 15 * 23?")
assert result["success"] == True
assert "345" in result["answer"]
assert result["iterations"] <= 3
def test_agent_multi_step():
"""Test multi-step reasoning"""
agent = ProductionAgent(
llm=get_test_llm(),
tools=[calculator_tool, weather_tool],
max_iterations=10
)
result = agent.run(
"Get weather in Boston. If temperature is above 20C, calculate 20 * 3."
)
assert result["success"] == True
stats = agent.get_stats()
assert stats["tool_calls"] >= 2 # Weather + calculatorEnd-to-End Tests
Test complete user flows:
def test_conversation_memory():
"""Test memory across multiple turns"""
agent = ProductionAgent(
llm=get_test_llm(),
tools=[],
max_iterations=5
)
# First turn
result1 = agent.run("My name is Alice")
assert result1["success"] == True
# Second turn - should remember name
result2 = agent.run("What's my name?")
assert result2["success"] == True
assert "Alice" in result2["answer"]
def test_error_recovery():
"""Test agent handling tool errors"""
faulty_tool = Tool(
name="faulty",
description="A tool that fails",
function=lambda x: raise_exception(),
schema={"parameters": {"properties": {}}}
)
agent = ProductionAgent(
llm=get_test_llm(),
tools=[faulty_tool, calculator_tool],
max_iterations=10
)
result = agent.run("Try the faulty tool, then calculate 2+2")
assert result["success"] == True # Should recover and complete
assert "4" in result["answer"]Performance Tests
Test under load and edge cases:
def test_max_iterations():
"""Test iteration limit enforcement"""
agent = ProductionAgent(
llm=get_test_llm(),
tools=[],
max_iterations=3
)
result = agent.run("Keep thinking forever")
assert result["iterations"] == 3
assert result["termination_reason"] == "max_iterations"
def test_cost_limit():
"""Test cost limit enforcement"""
agent = ProductionAgent(
llm=get_expensive_test_llm(),
tools=[],
max_iterations=100,
max_cost=0.01
)
result = agent.run("Complex task")
assert result["termination_reason"] == "cost_limit"
assert agent.get_stats()["total_cost"] <= 0.01
def test_concurrent_execution():
"""Test multiple agents running simultaneously"""
agent1 = ProductionAgent(llm=get_test_llm(), tools=[calculator_tool])
agent2 = ProductionAgent(llm=get_test_llm(), tools=[weather_tool])
with ThreadPoolExecutor(max_workers=2) as executor:
future1 = executor.submit(agent1.run, "Calculate 5 * 5")
future2 = executor.submit(agent2.run, "Weather in NYC")
result1 = future1.result()
result2 = future2.result()
assert result1["success"] == True
assert result2["success"] == TrueCommon Pitfalls and Solutions
Pitfall 1: Infinite loops
Problem: Agent repeats same action indefinitely
Solution:
def detect_loop(state: dict, window: int = 3) -> bool:
"""Detect repeated actions"""
if len(state["history"]) < window:
return False
recent = state["history"][-window:]
actions = [h["action"] for h in recent]
# All identical
if all(a == actions[0] for a in actions):
return True
    return False
Pitfall 2: Context window overflow
Problem: Too much history exceeds token limits
Solution:
def manage_context(history: list, max_tokens: int = 4000) -> list:
"""Keep context within token limits"""
while estimate_tokens(history) > max_tokens:
if len(history) <= 2: # Keep minimum context
break
# Remove oldest message
history = history[1:]
    return history
Pitfall 3: Tool hallucination
Problem: LLM invents non-existent tools
Solution:
def validate_tool_call(tool_name: str, available_tools: list) -> bool:
"""Validate tool exists before execution"""
if tool_name not in [t.name for t in available_tools]:
logger.warning(f"Attempted to call non-existent tool: {tool_name}")
return False
    return True
Pitfall 4: Poor error messages
Problem: Generic errors make debugging impossible
Solution:
class ToolError(Exception):
"""Rich error with context"""
def __init__(self, tool_name: str, error: str, context: dict):
self.tool_name = tool_name
self.error = error
self.context = context
super().__init__(f"Tool '{tool_name}' failed: {error}")Production Deployment Checklist
Before deploying agents to production:
Code Quality:
- [ ] All functions have type hints
- [ ] Comprehensive error handling
- [ ] Logging at appropriate levels
- [ ] Unit tests for all components
- [ ] Integration tests for workflows
- [ ] Code review completed
Performance:
- [ ] Token usage optimized
- [ ] Cost limits configured
- [ ] Timeout handling implemented
- [ ] Concurrent execution tested
- [ ] Load testing completed
Reliability:
- [ ] Retry logic for transient failures
- [ ] Circuit breakers for external services (see the sketch after this checklist)
- [ ] Graceful degradation strategies
- [ ] Monitoring and alerting configured
- [ ] Incident response procedures documented
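For the circuit-breaker item, a minimal sketch of the pattern; the threshold and cooldown values are illustrative, and libraries such as pybreaker provide hardened implementations:
import time

class CircuitBreaker:
    """Stop calling a failing tool for a cooldown period after repeated errors."""
    def __init__(self, failure_threshold: int = 5, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if calls may proceed (circuit closed or cooldown elapsed)."""
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: allow a trial call
            self.failures = 0
            return True
        return False

    def record(self, success: bool):
        """Update the failure count; open the circuit when the threshold is hit."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()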
Security:
- [ ] Input validation on all tools
- [ ] SQL injection prevention
- [ ] API key management
- [ ] Rate limiting implemented
- [ ] Audit logging enabled
Observability:
- [ ] Structured logging
- [ ] Metrics collection
- [ ] Distributed tracing
- [ ] Debug mode for development
- [ ] Performance profiling
Key Takeaways
The agent loop is fundamental: Every agent implements observe → think → decide → act → update state. Understanding this pattern helps you work with any framework.
Tools enable action: Without properly designed tools, agents are just chatbots. Invest time in clear descriptions, robust schemas, and comprehensive error handling.
Memory separates demos from production: Short-term memory maintains conversations. Long-term memory persists facts. Episodic memory enables learning.
Observations ≠ Actions: When debugging, distinguish between information gathering failures and execution failures. They require different fixes.
Production requires robustness: Max iterations, cost limits, timeouts, error handling, and logging aren’t optional – they’re essential.
Start simple, add complexity: Build single-loop agents first. Master the basics before moving to multi-agent systems and complex workflows.
What’s Next: LangGraph and Deterministic Flows
You now understand agent building blocks. But there’s a problem: the loop we built is still somewhat opaque.
Questions remain:
- How do you guarantee certain steps happen in order?
- How do you create branches (if-then logic)?
- How do you make agent behavior deterministic and testable?
- How do you visualize complex workflows?
The next blog will introduce LangGraph – a framework for building agents as explicit state machines. You’ll learn:
- Why graphs beat loops for complex agents
- How to define states, nodes, and edges
- Conditional routing and branching logic
- Checkpointing and retry mechanisms
- Building deterministic, debuggable workflows
The key shift: From implicit loops to explicit state graphs
Instead of a while loop where logic is hidden in functions, you’ll define explicit graphs showing exactly how the agent moves through states. This makes complex behaviors clear, testable, and debuggable.
Conclusion: From Components to Systems
Building production-ready agents isn’t about calling agent.run() and hoping for the best. It’s about understanding each component – the execution loop, tool interfaces, memory architecture, and state management – and how they work together.
This guide gave you working implementations of every pattern. You’ve seen:
- The canonical agent loop with all five steps
- Tool design with schemas, validation, and error handling
- Memory systems for short-term, long-term, and episodic storage
- The observation-action distinction for systematic debugging
- A complete production agent with tracking and statistics
The code isn’t pseudocode or simplified examples. It’s a production-grade implementation you can adapt for real systems.
Start building: Take the patterns here and apply them to your problems. Build tools for your APIs. Implement memory for your users. Create agents that handle real tasks reliably.
The fundamentals transfer across frameworks. Whether you use LangChain, LangGraph, or custom solutions, you’ll recognize these patterns. More importantly, you’ll know how to debug them when they break.
Next up: LangGraph for deterministic, visual workflows. But first, implement the patterns here. Build a single-loop agent. Add tools. Test memory. Experience the challenges firsthand.
That’s how you master agent development.
Additional Resources
Research Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) – The foundational paper on ReAct prompting
- Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) – How LLMs learn tool usage
Code Repository: Full working code that you can extend: https://github.com/ranjankumar-gh/building-real-world-agentic-ai-systems-with-langgraph-codebase/tree/main/module-02
Building real-world agentic AI systems requires understanding fundamentals. This guide gave you those fundamentals. Now go build something remarkable.