The Problem: Your Monitoring Stack Was Built for a Different World
Here is the situation most platform teams find themselves in around six months after deploying their first agentic system. Latency is up. Costs are unpredictable. Some requests succeed in three tool calls; others spiral into fifteen before either completing or crashing. Users report wrong answers. Engineers open dashboards and see... nothing useful. P95 latency is 12 seconds. Error rate is 2.3%. CPU and memory are fine.
The system is misbehaving, and the monitors are all green.
This is not a tooling gap. It is a conceptual mismatch. Traditional monitoring was designed for deterministic systems: a request arrives, code executes a fixed path, a response is returned. The interesting variables are latency, throughput, and error rate. You instrument function calls, count exceptions, and measure queue depths. This works because the behavior of the system is fully determined by the code you wrote.
Agents break every assumption in that model.
An agent does not execute a fixed path. It reasons about what to do next at each step. It calls tools, evaluates results, decides whether to retry or pivot, and accumulates context across multiple LLM invocations. The same input can produce wildly different execution paths on different runs—not because of bugs, but because the model's reasoning led somewhere different. A subtly different phrasing in a retrieved document can cause the agent to take a completely different planning branch. A tool returning a slightly slower response might push the model toward a fallback that produces a worse result.
You cannot monitor behavior you cannot observe. And you cannot observe agentic behavior with request-level metrics.
The deeper problem is that failures in agentic systems are often not errors in the traditional sense. The system does not throw an exception. It completes successfully. It just does the wrong thing—uses a deprecated tool, hallucinates a fact the tool never returned, reasons itself into a loop and exits early, or produces an answer that is technically coherent but grounded in incorrect intermediate conclusions. These failures are invisible to log aggregators, Application Performance Monitoring (APM) tools, and error tracking systems because they look like successful requests.
What you actually need is visibility into reasoning: what the agent decided, why it decided it, what evidence it used, and where the chain of logic broke down. That is a fundamentally different data model than what existing observability tools provide.
Mental Model: Agents Are State Machines with a Probabilistic Transition Function
The right way to think about an agent is as a state machine where the transition function is a language model. The state is the accumulated context—conversation history, tool results, intermediate outputs, scratchpad reasoning. The transitions are the model's decisions: which tool to call, what to ask, whether to continue or terminate.
The critical difference from a standard state machine is that the transitions are not deterministic. The same state can produce different transitions on different runs. This is not a bug; it is the feature. The model generalizes across situations it has never seen. But it means the space of possible execution paths is not enumerable. You cannot write tests that cover all behaviors. You cannot predict what the agent will do in a novel situation. And you cannot assume that a path that worked yesterday will work the same way today after a model update.
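The state-machine framing can be made concrete with a toy sketch. Everything here is hypothetical: `AgentState`, `toy_policy`, and the random choice that stands in for the language model. It is only meant to show that the same starting state can yield different execution paths.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Accumulated context: everything the agent has seen so far."""
    history: list = field(default_factory=list)


def toy_policy(state: AgentState, rng: random.Random) -> str:
    """Stand-in for the LLM: samples the next action instead of computing it."""
    if len(state.history) >= 5:
        return "terminate"  # hard stop so the toy run always ends
    return rng.choice(["call_tool", "reason", "terminate"])


def run(seed: int) -> list:
    rng = random.Random(seed)
    state = AgentState()
    while True:
        action = toy_policy(state, rng)
        state.history.append(action)  # each transition extends the state
        if action == "terminate":
            return state.history


# Same (empty) input, different seeds: paths can differ in length and shape.
paths = {seed: run(seed) for seed in range(5)}
for seed, path in paths.items():
    print(seed, path)
```

Replaying a seed reproduces a path exactly, which is the property a real system lacks unless it records every step.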
This has three concrete implications for observability.
Reasoning traces are the primary artifact. In a deterministic service, the code is the ground truth: you can reconstruct what happened from the execution path alone. In an agent, the model's reasoning is the closest thing you have to ground truth. If you do not capture it explicitly (chain-of-thought, scratchpad, tool call rationale), you cannot diagnose why the agent made the choices it did. The log entry "tool_call: search_database" tells you what happened but not why. The reasoning "I need to find the user's account balance before I can determine eligibility" tells you what the agent was trying to achieve, which lets you judge whether the call made sense.
Steps, not requests, are the unit of measurement. Latency and cost for an agentic system are not request-level properties—they are the aggregate of a variable number of steps, each with its own LLM call, tool invocation, and context window. A request that takes 45 seconds and costs $0.12 might have involved 8 planning steps, 4 tool calls, and 2 self-correction loops. You need per-step visibility to understand what drove that cost and where the time went.
Execution graphs, not linear traces. Traditional distributed tracing assumes a DAG of service calls that terminates in a response. Agents can loop. They can branch conditionally. They can revisit steps. They can spawn sub-agents. The trace structure needs to represent this—a flat list of spans sorted by timestamp will not help you understand why the agent looped back to document retrieval three times before answering.
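To illustrate the second implication (steps as the unit of measurement), here is a minimal sketch of a per-step breakdown for a single request. The `StepCost` type and the latency and cost figures are invented so that the totals match the 45-second, $0.12 example above; they are not measurements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepCost:
    step_type: str   # e.g. "planning", "tool_call", "self_correction"
    latency_ms: int
    cost_usd: float


def breakdown(steps):
    """Sum count, latency, and cost per step type for one request."""
    totals = defaultdict(lambda: {"count": 0, "latency_ms": 0, "cost_usd": 0.0})
    for s in steps:
        t = totals[s.step_type]
        t["count"] += 1
        t["latency_ms"] += s.latency_ms
        t["cost_usd"] += s.cost_usd
    return dict(totals)


# Illustrative numbers only.
steps = (
    [StepCost("planning", 3000, 0.010)] * 8
    + [StepCost("tool_call", 4000, 0.0)] * 4
    + [StepCost("self_correction", 2500, 0.020)] * 2
)
report = breakdown(steps)
total_latency_s = sum(t["latency_ms"] for t in report.values()) / 1000
total_cost = sum(t["cost_usd"] for t in report.values())
for step_type, t in report.items():
    print(step_type, t)
```

With this view, the request-level numbers decompose into which step types consumed the time and which consumed the money.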
What you are building, functionally, is a flight data recorder for decision-making processes. Every frame of the reasoning must be captured, structured, and queryable. The goal is not just to know that a request failed—it is to reconstruct the exact decision sequence that led to the outcome.
Architecture: Observability Layer for Agentic Systems
The architecture below reflects what a production observability layer for agents needs to look like. It is organized around the concept of a trace tree: a hierarchical structure where each node represents one step in the agent's execution, with its input state, output state, reasoning, and metadata.
[Figure: Architecture: Observability Layer for Agentic Systems]
Layer 1 — Agent Runtime (Violet): What the Agent Does
This is your agent doing its job. A user request comes in, the Planning Loop takes over, and at each step the agent asks itself: what should I do next? It has three options — call a tool, do another round of LLM reasoning, or decide it has enough information and terminate.
Notice the arrows going back into the Planning Loop from both Tool Executor and LLM Gateway. That is the loop. The agent does something, gets a result, thinks about it, and decides what to do next. This can repeat five times or fifty times depending on the complexity of the request. The loop only exits when the agent hits the Terminate branch and the Response Generator sends back the final answer.
This looping, non-linear behaviour is exactly what makes traditional monitoring useless here. A single user request is not one event — it is a sequence of decisions and actions, each influencing the next.
Layer 2 — Instrumentation Layer (Blue): The Observer Sitting Inside the Loop
The Instrumentation Layer is a silent observer attached to everything happening inside the Agent Runtime. It does not change what the agent decides; it watches and records. (Interrupting a runaway trace is a separate job, handled by the Anomaly Detector in Layer 4.) Think of it as a flight recorder that is always running in the background.
It has four jobs, each handled by a separate component:
Step Interceptor fires at the start of every iteration of the Planning Loop. Before the agent decides what to do next, the Step Interceptor takes a snapshot: what step number is this, what time is it, what is the current state of the context. This is the backbone of the entire trace — every other component hangs off the step records this one creates.
Context Snapshotter saves a copy of the entire context window — everything the agent currently knows and remembers — at key moments during the run. This is the most storage-expensive component because context windows can be large, often tens of thousands of tokens. You do not need a snapshot at every step. Capture at the start of the run, after each tool call returns a result, and at any branching decision point. That is enough to reconstruct exactly what the agent knew at any moment in time, which is essential when you are trying to understand a decision that went wrong.
Reasoning Extractor grabs the agent's chain-of-thought output (the internal monologue the model produces before it commits to an action) before it gets thrown away. Most agent frameworks discard this text after the action is chosen, treating it as a temporary scratchpad. That is a mistake. The reasoning text is the only record of why the agent made the decision it did. Without it, you can see what the agent did but not why. The Reasoning Extractor intercepts this text and saves it before it disappears. If your model exposes reasoning in a dedicated field, pull from that field directly: Claude's extended thinking returns thinking blocks, while OpenAI's o-series models keep their raw reasoning tokens hidden and expose at most a summary. Otherwise, prompt explicitly for structured reasoning and parse it out.
Tool Call Logger records the full conversation between the agent and every external tool it calls. Not just "the agent called the database" — but the exact query parameters the agent constructed, the exact response the tool returned, and how long it took. This is where the most common class of agent failure becomes visible in hindsight: the agent constructed a slightly malformed query, got back an ambiguous or empty result, misinterpreted it, and went down the wrong reasoning path from that point on. Without the Tool Call Logger you would never know the root cause was a bad query parameter in step 3.
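For models without a dedicated reasoning field, the Reasoning Extractor's "prompt for structured reasoning and parse it out" path might look like the sketch below. The `<reasoning>` and `<action>` tag format and the `extract_reasoning` function are assumptions for illustration; use whatever structure your own prompt asks the model to produce.

```python
import re

# Hypothetical output format requested in the system prompt.
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def extract_reasoning(model_output: str) -> dict:
    """Split raw model output into reasoning text and chosen action."""
    reasoning = REASONING_RE.search(model_output)
    action = ACTION_RE.search(model_output)
    return {
        "reasoning_text": reasoning.group(1).strip() if reasoning else "",
        "action": action.group(1).strip() if action else "",
        # False means the model ignored the format; worth flagging on the step.
        "parse_ok": bool(reasoning and action),
    }


raw = """<reasoning>
I need to find the user's account balance before I can determine eligibility.
</reasoning>
<action>search_database</action>"""
parsed = extract_reasoning(raw)
print(parsed["action"], "|", parsed["reasoning_text"])
```

Recording `parse_ok` alongside the step is a cheap way to see how often the model drifts away from the requested format.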
Layer 3 — Trace Store (Green): Where Everything Gets Saved
The Instrumentation Layer produces a continuous stream of data — step records, reasoning text, tool call logs, context snapshots. The Trace Store is where all of it lands and gets organised for later retrieval. It has three distinct storage components because the data serves three different purposes and needs three different query patterns.
The Trace Store (PostgreSQL) holds the full structured record of every step, organised as a tree that reflects the actual branching execution path of the agent. The Step Metrics Store holds the numbers — tokens, latency, cost — in a flat table optimised for fast aggregation. The Reasoning Index holds vector embeddings of every reasoning trace, enabling semantic search across what the agent was thinking. Each of these is covered in detail in the sections below.
Layer 4 — Observability Plane (Orange): Tools That Make the Data Useful
Raw data in a database is not observability. The Observability Plane is the set of tools that sit on top of the Trace Store and turn stored data into actionable insight.
Trace Explorer UI lets engineers navigate individual traces visually — see the full tree of steps for a given request, expand any node to read the reasoning text, inspect tool call inputs and outputs, and follow the exact path the agent took from user input to final response. This is what you open when a user reports a wrong answer and you need to understand what went wrong.
Anomaly Detector runs continuously against incoming step records, looking for patterns that indicate something is going wrong — a planning loop that has repeated the same action three times in a row, a context window approaching the model's limit, tool errors on consecutive steps, or cost for a single trace blowing past the expected budget. It feeds its findings to the Alert Engine. Critically, this runs during the agent's execution, not after — so it can interrupt a runaway trace before the damage compounds.
Cost Analyzer aggregates the Step Metrics Store to give you the financial picture: cost per agent type, cost per user, cost trend over time, which tools are the most expensive to call, which request patterns drive the longest traces. Without this you are flying blind on infrastructure spend — agent costs are notoriously unpredictable because the number of steps per request varies so much.
Alert Engine is the notification layer. When the Anomaly Detector flags a pattern — or when Cost Analyzer detects a budget threshold being crossed — Alert Engine fires a notification to wherever your team is watching: PagerDuty, Slack, email. This is the difference between catching a problem during a trace and finding out about it from an angry user two hours later.
Trace Store: Saving What the Agent Did and Why
Think of the Trace Store like a detailed diary of everything your agent did during a single run. Not just "it searched the database" — but when it searched, what it was thinking before it decided to search, what the database returned, and what it decided to do next based on that result. Every single step, in order, with full context.
The reason this needs to be a tree rather than a flat list comes down to how agents actually behave. An agent is not a straight line — it branches, retries, and sometimes circles back. Imagine a single user request where the agent first tries to fetch account data, gets an empty result, decides to try a different query, gets partial data, then calls two tools in parallel to fill the gaps. That is not a list of five events. It is a tree: one root decision that spawned two branches, one of which retried once.
The way you store this in a database is simple: every step row has a parent_step_id column that points back to the step that triggered it. The root planning step has no parent. Every tool call it spawned points to that root. Every retry points to the failed attempt it is replacing. This gives you the full picture of what happened and in what order, without any special graph database — plain PostgreSQL handles this well.
[Figure: Example Trace Tree]
When you need to query this — say, "show me all traces where the agent retried more than twice" — PostgreSQL's WITH RECURSIVE lets you walk the tree in a single SQL query. You do not need to write loops in application code or pull the entire trace into memory to analyse it.
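Here is a minimal, runnable sketch of the parent-pointer layout and a recursive walk. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; the `WITH RECURSIVE` query has the same shape in PostgreSQL. The table layout and the `is_retry` column are assumptions, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE steps (
        step_id        TEXT PRIMARY KEY,
        trace_id       TEXT NOT NULL,
        parent_step_id TEXT REFERENCES steps(step_id),
        step_type      TEXT NOT NULL,
        is_retry       INTEGER NOT NULL DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO steps VALUES (?, ?, ?, ?, ?)",
    [
        ("s1", "t1", None, "planning", 0),   # root planning step
        ("s2", "t1", "s1", "tool_call", 0),  # fetch account data, empty result
        ("s3", "t1", "s2", "tool_call", 1),  # retry points at the failed attempt
        ("s4", "t1", "s1", "tool_call", 0),  # parallel tool call A
        ("s5", "t1", "s1", "tool_call", 0),  # parallel tool call B
    ],
)

WALK = """
WITH RECURSIVE tree(step_id, parent_step_id, step_type, is_retry, depth) AS (
    SELECT step_id, parent_step_id, step_type, is_retry, 0
    FROM steps
    WHERE trace_id = ? AND parent_step_id IS NULL
    UNION ALL
    SELECT s.step_id, s.parent_step_id, s.step_type, s.is_retry, tree.depth + 1
    FROM steps s JOIN tree ON s.parent_step_id = tree.step_id
)
SELECT step_id, depth, is_retry FROM tree ORDER BY depth, step_id
"""
walked = conn.execute(WALK, ("t1",)).fetchall()
retry_count = sum(is_retry for _, _, is_retry in walked)
max_depth = max(depth for _, depth, _ in walked)
print(walked, retry_count, max_depth)
```

Filtering for traces that retried more than twice then becomes an aggregate over this walk, grouped by `trace_id`.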
When to use a managed platform instead: If you are already using LangSmith or Arize Phoenix, they handle this tree storage for you out of the box. Every LLM call, tool call, and reasoning step is stored with full parent-child lineage, and you get a visual timeline on top of it. The tradeoff is that your trace data — which may contain user conversations, retrieved documents, and intermediate reasoning — leaves your infrastructure. For most product teams this is fine. For healthcare, finance, or any regulated environment where user data is sensitive, self-hosting on PostgreSQL is the safer default.
Step Metrics Store: Fast Numbers Without Digging Through the Tree
The Trace Store is great for understanding what happened. But it is a poor fit for answering operational questions like "what is our average cost per agent run this week?" or "which tool is the slowest across all traces?". To answer those you would have to load and traverse thousands of trees — expensive and slow.
The Step Metrics Store solves this by maintaining a separate, flat table that contains only the numbers, written at the same time each step is saved. No joins, no tree traversal — just straight aggregation queries.
Think of it like the difference between a detailed journal and a spreadsheet summary. The journal (Trace Store) tells you the full story. The spreadsheet (Step Metrics Store) tells you the totals at a glance.
Each row in this table is simple: trace_id, step_type, tool_name, tokens_used, latency_ms, cost_usd, timestamp. That is enough to power dashboards showing cost trends, latency percentiles by tool, and token consumption by agent type — the numbers your engineering and product teams will check every morning.
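The two operational questions from above can be answered with plain aggregation over that flat table. The sketch below again uses `sqlite3` as a stand-in for PostgreSQL; the table name `step_metrics`, the tool names, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE step_metrics (
        trace_id TEXT, step_type TEXT, tool_name TEXT,
        tokens_used INTEGER, latency_ms INTEGER, cost_usd REAL, timestamp TEXT
    )
""")
conn.executemany(
    "INSERT INTO step_metrics VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("t1", "planning", None, 1200, 900, 0.004, "2026-02-01T10:00:00"),
        ("t1", "tool_call", "search_database", 0, 300, 0.0, "2026-02-01T10:00:01"),
        ("t2", "planning", None, 2000, 1500, 0.006, "2026-02-01T11:00:00"),
        ("t2", "tool_call", "fetch_account", 0, 1200, 0.0, "2026-02-01T11:00:02"),
        ("t2", "tool_call", "search_database", 0, 500, 0.0, "2026-02-01T11:00:03"),
    ],
)

# Average cost per agent run: sum per trace first, then average the sums.
avg_cost_per_run = conn.execute("""
    SELECT AVG(trace_cost) FROM (
        SELECT SUM(cost_usd) AS trace_cost FROM step_metrics GROUP BY trace_id
    )
""").fetchone()[0]

# Slowest tool by mean latency across all traces.
slowest_tool = conn.execute("""
    SELECT tool_name, AVG(latency_ms) AS mean_ms
    FROM step_metrics
    WHERE step_type = 'tool_call'
    GROUP BY tool_name
    ORDER BY mean_ms DESC
    LIMIT 1
""").fetchone()
print(avg_cost_per_run, slowest_tool)
```

Neither query touches the trace tree, which is the point of keeping the numbers in their own table.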
TimescaleDB is worth mentioning here if you are already on PostgreSQL. It is an extension that adds time-series compression and rollups (continuous aggregates) on top of regular Postgres tables, without requiring you to operate a separate database. Practically speaking: once you configure a compression policy and define continuous aggregates for hourly and daily summaries, your metrics queries stay fast even when the table grows to hundreds of millions of rows.
Reasoning Index: Searching What the Agent Was Thinking
This is the most powerful — and most underused — component of the observability stack.
Here is the problem it solves. After an incident, your team wants to know: "Did the agent ever decide on its own to skip the verification step?" With traditional logs you are stuck. There is no field called skipped_verification. You cannot grep for it because the agent expresses this decision in natural language — sometimes as "I'll proceed without confirming since the data looks complete," sometimes as "verification seems unnecessary here," sometimes as "moving to the next step directly." Same decision, completely different words every time.
The Reasoning Index solves this by turning the agent's reasoning text into numbers — specifically, vector embeddings. An embedding is a way of representing meaning as a list of numbers, so that text with similar meaning ends up with similar numbers, regardless of exact wording. You store these embeddings in a vector database alongside the step records.
At query time, you take your question — "steps where the agent skipped verification" — convert it to an embedding the same way, and ask the database: "find me stored embeddings that are numerically close to this one." The database returns the steps whose reasoning semantically matches your question, even if none of them use the word "verification" at all.
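The mechanics of that lookup (embed, store, rank by similarity) can be sketched without any external service. Note the big caveat: `toy_embed` below is a hypothetical stand-in that only counts words, so it matches shared vocabulary rather than meaning. A real embedding model is what makes paraphrases land close together; the index and the cosine ranking work the same way either way.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


index = {}  # step_id -> vector (stand-in for pgvector or Qdrant)


def add_step(step_id: str, reasoning_text: str) -> None:
    index[step_id] = toy_embed(reasoning_text)


def search(query: str, top_k: int = 2) -> list:
    q = toy_embed(query)
    scored = sorted(((cosine(q, v), sid) for sid, v in index.items()), reverse=True)
    return [(sid, round(score, 3)) for score, sid in scored[:top_k]]


add_step("s1", "verification seems unnecessary here, skipping it")
add_step("s2", "I need to fetch the account balance first")
add_step("s3", "the user asked for a refund so I will look up the order")

hits = search("agent skipped verification")
print(hits)
```

Swapping `toy_embed` for a call to a real embedding model and `index` for a vector store keeps the rest of the flow unchanged.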
In practice the setup is straightforward. When saving each planning step, call an embedding model and store the resulting vector. OpenAI's text-embedding-3-small costs $0.02 per million tokens on the standard API, or $0.01 per million tokens via the Batch API. Since embedding reasoning traces is a background job with no latency requirement, the Batch API is the natural default and halves the price. Use pgvector if you are already on PostgreSQL (it adds vector search as a simple extension), or Qdrant if you want a dedicated vector store with more filtering options.
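As a back-of-the-envelope check on what this costs, the arithmetic is simple enough to script. The workload figures (one million steps per day, 200 reasoning tokens per step) are hypothetical; the per-token prices are the ones quoted above.

```python
def embedding_cost_usd(steps_per_day: int, avg_reasoning_tokens: int,
                       price_per_million_tokens: float) -> float:
    """Daily embedding spend for a given step volume and token price."""
    tokens = steps_per_day * avg_reasoning_tokens
    return tokens * price_per_million_tokens / 1_000_000


# Hypothetical workload: 1M planning steps/day, ~200 reasoning tokens each.
daily_standard = embedding_cost_usd(1_000_000, 200, 0.02)
daily_batch = embedding_cost_usd(1_000_000, 200, 0.01)
print(f"standard: ${daily_standard:.2f}/day, batch: ${daily_batch:.2f}/day")
```

Under those assumptions the index costs a few dollars a day to populate, which is small next to the LLM calls that produced the reasoning in the first place.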
The payoff goes beyond debugging. This is what enables behavioral audits: "show me every decision in the last 90 days where the agent chose to proceed without user confirmation." Your security team will ask exactly this question after the first incident. Having the Reasoning Index means you can answer it in seconds rather than spending a week manually reviewing logs.
Implementation: Building the Observability Layer
Step Record Schema
Start with a schema that captures everything you need for post-hoc diagnosis.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Any, Optional
    import uuid


    @dataclass
    class AgentStep:
        # Identity and position in the trace tree
        trace_id: str
        step_id: str = field(default_factory=lambda: str(uuid.uuid4()))
        parent_step_id: Optional[str] = None
        agent_id: str = ""
        step_type: str = ""  # "planning", "tool_call", "llm_reasoning", "terminate"
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
        timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

        # Input state
        context_window_hash: str = ""
        context_token_count: int = 0
        input_summary: dict = field(default_factory=dict)

        # Decision
        reasoning_text: str = ""  # Raw chain-of-thought
        action_chosen: str = ""
        action_parameters: dict = field(default_factory=dict)

        # Output
        action_result: Any = None
        output_token_count: int = 0
        latency_ms: int = 0

        # Cost tracking
        model_name: str = ""
        input_tokens: int = 0
        output_tokens: int = 0
        estimated_cost_usd: float = 0.0

        # Diagnostics
        error: Optional[str] = None
        retry_count: int = 0
        anomaly_flags: list = field(default_factory=list)
This schema is deliberately verbose. In production, you will compress the fields you query rarely and store raw context snapshots in a blob store with references. But you need all of this to debug effectively.
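One way to keep snapshots out of the step table is to store each context once under its content hash and keep only the hash on the step record. The sketch below assumes the three snapshot points named earlier (start of run, after a tool result, at a branching decision); the class name, event names, and the in-memory dictionary standing in for a blob store are all hypothetical.

```python
import hashlib
import json

# Events worth a full snapshot; every other step is skipped.
SNAPSHOT_EVENTS = {"run_start", "tool_result", "branch_decision"}


class ContextSnapshotter:
    """Stores full context only at key events, deduplicated by content hash."""

    def __init__(self):
        self.blobs = {}       # hash -> serialized context (stand-in for a blob store)
        self.references = []  # (step_number, event, hash) rows kept with the trace

    def maybe_snapshot(self, step_number: int, event: str, context: dict):
        if event not in SNAPSHOT_EVENTS:
            return None
        payload = json.dumps(context, sort_keys=True, default=str)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.blobs.setdefault(digest, payload)  # identical contexts are stored once
        self.references.append((step_number, event, digest))
        return digest


snap = ContextSnapshotter()
ctx = {"messages": ["What is my balance?"]}
first = snap.maybe_snapshot(0, "run_start", ctx)
skipped = snap.maybe_snapshot(1, "planning", ctx)
ctx["messages"].append("tool: balance=42.10")
second = snap.maybe_snapshot(2, "tool_result", ctx)
print(len(snap.blobs), snap.references)
```

The hash doubles as the `context_window_hash` field on the step, so a step row can always be joined back to the exact context the agent saw.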
Instrumentation Wrapper
The cleanest way to instrument an agent without rewriting it is an execution wrapper that intercepts the planning loop.
    import functools
    import hashlib
    import json
    import time
    import uuid
    from typing import Any, Callable

    # AgentStep is the dataclass from the schema above.


    class AgentObservabilityWrapper:
        def __init__(self, agent, trace_store, step_store):
            self.agent = agent
            self.trace_store = trace_store
            self.step_store = step_store
            self._current_trace_id = None

        def run(self, user_input: str, **kwargs) -> Any:
            trace_id = str(uuid.uuid4())
            self._current_trace_id = trace_id

            # Monkey-patch the agent's internal hooks. This assumes the agent
            # exposes _execute_planning_step and _call_tool, and that the agent
            # instance is not shared across concurrent runs.
            original_plan = self.agent._execute_planning_step
            original_tool = self.agent._call_tool
            self.agent._execute_planning_step = self._wrap_planning(original_plan, trace_id)
            self.agent._call_tool = self._wrap_tool_call(original_tool, trace_id)

            try:
                result = self.agent.run(user_input, **kwargs)
                self.trace_store.mark_complete(trace_id, result)
                return result
            except Exception as e:
                self.trace_store.mark_failed(trace_id, str(e))
                raise
            finally:
                # Restore originals
                self.agent._execute_planning_step = original_plan
                self.agent._call_tool = original_tool

        def _wrap_planning(self, original_fn: Callable, trace_id: str):
            @functools.wraps(original_fn)
            def instrumented(*args, **kwargs):
                step = AgentStep(trace_id=trace_id, step_type="planning")
                start = time.monotonic()

                # Capture context state before reasoning
                if hasattr(self.agent, "context"):
                    ctx_str = json.dumps(self.agent.context, default=str)
                    step.context_window_hash = hashlib.sha256(
                        ctx_str.encode()
                    ).hexdigest()[:16]
                    # Rough word-based estimate; use the model's tokenizer for accuracy
                    step.context_token_count = int(len(ctx_str.split()) * 1.3)

                result = original_fn(*args, **kwargs)
                step.latency_ms = int((time.monotonic() - start) * 1000)

                # Extract reasoning if present
                if hasattr(result, "reasoning"):
                    step.reasoning_text = result.reasoning
                elif hasattr(result, "scratchpad"):
                    step.reasoning_text = result.scratchpad

                step.action_chosen = getattr(result, "action", str(result))
                step.action_parameters = getattr(result, "action_input", {})

                self.step_store.save(step)
                return result

            return instrumented

        def _wrap_tool_call(self, original_fn: Callable, trace_id: str):
            @functools.wraps(original_fn)
            def instrumented(tool_name: str, tool_input: dict, **kwargs):
                step = AgentStep(
                    trace_id=trace_id,
                    step_type="tool_call",
                    action_chosen=tool_name,
                    action_parameters=tool_input,
                )
                start = time.monotonic()
                try:
                    result = original_fn(tool_name, tool_input, **kwargs)
                    step.action_result = result
                    return result
                except Exception as e:
                    step.error = str(e)
                    step.anomaly_flags.append("tool_error")
                    raise
                finally:
                    # Runs on success and failure, so every call is recorded
                    step.latency_ms = int((time.monotonic() - start) * 1000)
                    self.step_store.save(step)

            return instrumented
This is a simplified example. In practice you will use hooks or callbacks built into your agent framework: LangChain has BaseCallbackHandler, LangGraph accepts the same callback handlers through the config passed at invocation, and CrewAI has task callbacks. The principle is the same: intercept at the planning step boundary and at every tool invocation.
Anomaly Detection on Traces
Once you have structured step records, you can write detectors that fire on patterns impossible to catch in traditional monitoring.
    class AgentAnomalyDetector:
        def detect_planning_loop(self, steps: list[AgentStep], threshold: int = 3) -> bool:
            """Agent is looping: same action chosen `threshold`+ times consecutively."""
            # Ignore steps that recorded no action (e.g. pure LLM reasoning steps),
            # otherwise a run of empty strings would look like a loop.
            actions = [s.action_chosen for s in steps if s.action_chosen]
            if len(actions) < threshold:
                return False
            return len(set(actions[-threshold:])) == 1

        def detect_context_explosion(self, steps: list[AgentStep], max_tokens: int = 80000) -> bool:
            """Context window growing unbounded: will hit the limit and fail."""
            return any(s.context_token_count > max_tokens for s in steps)

        def detect_tool_result_ignored(self, steps: list[AgentStep]) -> bool:
            """Agent called a tool but reasoning in the next step doesn't reference it."""
            for i in range(len(steps) - 1):
                if steps[i].step_type != "tool_call" or not steps[i].action_result:
                    continue
                next_reasoning = steps[i + 1].reasoning_text.lower()
                if not next_reasoning:
                    continue  # next step recorded no reasoning; nothing to compare
                # Key terms: words longer than 4 characters among the first 20 words
                result_words = str(steps[i].action_result).lower().split()[:20]
                result_terms = {t for t in result_words if len(t) > 4}
                # Only flag when there were terms to look for and none appeared
                if result_terms and not any(t in next_reasoning for t in result_terms):
                    return True
            return False

        def detect_cost_runaway(self, steps: list[AgentStep], budget_usd: float = 0.50) -> bool:
            """Cumulative cost for this trace exceeds budget."""
            total = sum(s.estimated_cost_usd for s in steps)
            return total > budget_usd

        def analyze_trace(self, trace_id: str, steps: list[AgentStep]) -> list[str]:
            flags = []
            if self.detect_planning_loop(steps):
                flags.append("PLANNING_LOOP_DETECTED")
            if self.detect_context_explosion(steps):
                flags.append("CONTEXT_EXPLOSION_RISK")
            if self.detect_tool_result_ignored(steps):
                flags.append("TOOL_RESULT_POTENTIALLY_IGNORED")
            if self.detect_cost_runaway(steps):
                flags.append("COST_BUDGET_EXCEEDED")
            return flags
These detectors run against the trace in near-real-time, either in a sidecar process reading from the step store or inline after each step. The planning loop detector alone will catch a category of failure that is completely invisible to traditional monitoring.
LangGraph Integration Example
If you are using LangGraph, the callback mechanism is cleaner than monkey-patching.
    import time
    import uuid

    from langchain_core.callbacks import BaseCallbackHandler
    from langchain_core.outputs import LLMResult

    # AgentStep is the dataclass from the schema above.


    class LangGraphObservabilityHandler(BaseCallbackHandler):
        def __init__(self, trace_id: str, step_store):
            self.trace_id = trace_id
            self.step_store = step_store
            self._step_start_times = {}
            self._tool_names = {}

        def _elapsed_ms(self, run_id: str) -> int:
            # If the start event was missed, report 0 ms instead of a bogus
            # latency measured from the monotonic clock's origin.
            start = self._step_start_times.pop(run_id, None)
            return int((time.monotonic() - start) * 1000) if start is not None else 0

        def on_llm_start(self, serialized, prompts, **kwargs):
            run_id = str(kwargs.get("run_id", uuid.uuid4()))
            self._step_start_times[run_id] = time.monotonic()

        def on_llm_end(self, response: LLMResult, **kwargs):
            run_id = str(kwargs.get("run_id", ""))
            # llm_output can be None, and its keys vary by provider
            llm_output = response.llm_output or {}
            usage = llm_output.get("token_usage") or {}
            step = AgentStep(
                trace_id=self.trace_id,
                step_type="llm_reasoning",
                latency_ms=self._elapsed_ms(run_id),
                model_name=llm_output.get("model_name", "unknown"),
                input_tokens=usage.get("prompt_tokens", 0),
                output_tokens=usage.get("completion_tokens", 0),
                output_token_count=usage.get("completion_tokens", 0),
            )
            # Extract reasoning from response
            if response.generations and response.generations[0]:
                gen = response.generations[0][0]
                step.reasoning_text = getattr(gen, "text", "")
            self.step_store.save(step)

        def on_tool_start(self, serialized, input_str, **kwargs):
            run_id = str(kwargs.get("run_id", uuid.uuid4()))
            self._step_start_times[run_id] = time.monotonic()
            # Remember the tool name so on_tool_end can record it
            self._tool_names[run_id] = (serialized or {}).get("name", "")

        def on_tool_end(self, output, **kwargs):
            run_id = str(kwargs.get("run_id", ""))
            step = AgentStep(
                trace_id=self.trace_id,
                step_type="tool_call",
                action_chosen=self._tool_names.pop(run_id, ""),
                action_result=output,
                latency_ms=self._elapsed_ms(run_id),
            )
            self.step_store.save(step)
Pass this handler to your LangGraph graph invocation via the config parameter, for example config={"callbacks": [handler]}. Every LLM call and tool invocation that goes through the callback system will be captured without modifying agent logic.
Pitfalls and Failure Modes
Logging Tokens Without Logging Reasoning
This is the most common mistake and the most expensive. Teams instrument LLM calls and capture token counts, latency, and cost—but discard the actual content of the reasoning. You end up with perfect cost visibility and zero diagnostic capability. You know a trace cost $0.23 and took 18 seconds. You have no idea why.
The fix is mandatory reasoning capture. Store the full text of every model output, not just the final action. Yes, this increases storage costs. It is worth it. If you are on a tight budget, store reasoning for a sampled percentage of traces and all traces that trigger anomaly flags.
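The budget-constrained variant (keep a sample, plus everything that was flagged) can be sketched as a single decision function. The name `keep_reasoning` and the 10% default are assumptions. Hashing the trace ID rather than calling a random number generator means every step of a trace gets the same decision.

```python
import hashlib


def keep_reasoning(trace_id: str, anomaly_flags: list, sample_rate: float = 0.10) -> bool:
    """Decide whether to persist full reasoning text for a trace."""
    if anomaly_flags:
        return True  # flagged traces are always kept
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < sample_rate


kept = sum(keep_reasoning(f"trace-{i}", []) for i in range(10_000))
print(f"kept {kept} of 10000 unflagged traces")
print(keep_reasoning("trace-1", ["PLANNING_LOOP_DETECTED"], sample_rate=0.0))
```

Because anomaly flags are usually known only late in a run, this assumes reasoning is buffered until the trace ends and then either persisted or dropped.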
Treating Agent Failures as Request Failures
An agent can complete a 15-step trace, return a 200 OK, and have produced a completely wrong answer. If your alerting is based on HTTP error rates, you will not catch this. Agentic systems require outcome-based evaluation, not just completion-based monitoring.
This means building evaluation pipelines that run against sampled traces: LLM judges that assess answer correctness, heuristic checks against known-good patterns, human review queues for flagged traces. The observability layer provides the data; evaluation provides the signal.
Missing the Loop Detection Window
Planning loop detectors that run only at the end of a trace are nearly useless—by the time you detect the loop, the agent has already spent the budget and timed out. Detectors need to run inline, after each step, with the ability to interrupt the agent's planning loop before it compounds. This requires your observability layer to be in the critical path, not just a passive observer.
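A minimal sketch of an inline guard follows. It is called after every step and raises as soon as the same action with the same parameters repeats too often. `InlineLoopGuard` and `PlanningLoopInterrupt` are hypothetical names, and how the exception is surfaced to the user is left to your agent runtime.

```python
import json


class PlanningLoopInterrupt(RuntimeError):
    """Raised to stop the agent when it keeps repeating the same step."""


class InlineLoopGuard:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._last_key = None
        self._run_length = 0

    def after_step(self, action: str, parameters: dict) -> None:
        # Same action with the same parameters counts as a repeat.
        key = (action, json.dumps(parameters, sort_keys=True, default=str))
        if key == self._last_key:
            self._run_length += 1
        else:
            self._last_key, self._run_length = key, 1
        if self._run_length >= self.threshold:
            raise PlanningLoopInterrupt(
                f"{action!r} repeated {self._run_length} times in a row"
            )


guard = InlineLoopGuard(threshold=3)
interrupted_at = None
for step_number in range(1, 11):
    try:
        guard.after_step("retrieve_documents", {"query": "refund policy"})
    except PlanningLoopInterrupt:
        interrupted_at = step_number
        break
print("interrupted at step", interrupted_at)
```

The loop is cut at the third repetition instead of running to the tenth, which is the budget difference the inline placement buys.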
Context Window Creep
Agents that retrieve documents, call tools, and accumulate history will silently grow their context windows across a long conversation until they hit the model's limit and the request fails, or until latency and cost climb because self-attention compute grows quadratically with context length. You need per-step context token monitoring with alerts well before the limit (60-70% of the limit is a reasonable threshold) so you can trigger context compression before it becomes a problem.
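The threshold check itself is small. In this sketch the function name, the returned labels, and the 128,000-token limit are illustrative assumptions; the 60% and 70% cut-offs come from the range suggested above.

```python
def context_pressure(token_count: int, model_limit: int,
                     warn_at: float = 0.60, compress_at: float = 0.70) -> str:
    """Classify how close a step's context is to the model's limit."""
    ratio = token_count / model_limit
    if ratio >= compress_at:
        return "compress"  # trigger summarization or truncation now
    if ratio >= warn_at:
        return "warn"      # raise an alert, keep running
    return "ok"


MODEL_LIMIT = 128_000  # hypothetical limit; use your model's documented value
readings = [50_000, 80_000, 90_000]
statuses = [context_pressure(tokens, MODEL_LIMIT) for tokens in readings]
print(list(zip(readings, statuses)))
```

Running this on every step record, rather than once per request, is what turns a hard failure at the limit into an early warning.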
Model Version Drift
A model update on the provider's end can change agent behavior without any code change on your side. An agent that worked reliably on one model version may loop differently after a provider updates the underlying weights — even if the model name in your config hasn't changed. Your step records must capture the exact model version used on every LLM call. Without this, diagnosing behavior regressions after a provider update is nearly impossible.
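Once the model version is on every step record, a first drift check can be as simple as comparing average steps per trace across versions. The version strings and sample rows below are invented, and the sketch assumes each trace ran on a single model version.

```python
from collections import defaultdict
from statistics import mean


def steps_per_trace_by_model(step_rows):
    """step_rows: iterable of (trace_id, model_version) pairs, one per step."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace_id, model_version in step_rows:
        counts[model_version][trace_id] += 1
    return {version: mean(per_trace.values()) for version, per_trace in counts.items()}


# Hypothetical version identifiers, as reported by the provider on each call.
rows = (
    [("t1", "model-v1")] * 3
    + [("t2", "model-v1")] * 5
    + [("t3", "model-v2")] * 9
)
averages = steps_per_trace_by_model(rows)
print(averages)
```

A jump like the one in this toy data does not prove a regression, but it tells you which traces to open first in the Trace Explorer.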
The Retry Storm Pattern
An agent that hits a tool error will often retry. If the retry logic is in the planning loop rather than in the tool executor, each retry generates a full planning step with a new LLM call. A flaky external API can trigger a cascade where the agent calls the LLM ten times trying to recover from a single transient failure, burning tokens on reasoning about a problem that was never a reasoning problem. Your anomaly detector should flag consecutive tool errors on the same tool within a trace, and your tool executor should handle retries with backoff before surfacing failures to the planning loop.
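Retries with backoff inside the tool executor might look like the following sketch. `TransientToolError` and `call_with_backoff` are hypothetical names, and which exceptions count as transient depends on your tools. The `sleep` parameter is injectable so the function can be exercised without real delays.

```python
import random
import time


class TransientToolError(Exception):
    """Errors worth retrying, such as timeouts or rate limits."""


def call_with_backoff(tool_fn, *args, max_attempts=4, base_delay_s=0.5,
                      sleep=time.sleep, **kwargs):
    """Retry a tool call with exponential backoff and jitter.

    Only the final failure propagates to the planning loop, so transient
    errors never trigger extra LLM calls.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(*args, **kwargs)
        except TransientToolError:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids lockstep retries


calls = {"n": 0}


def flaky_lookup(account_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientToolError("timeout")
    return {"account_id": account_id, "balance": 42.10}


delays = []
result = call_with_backoff(flaky_lookup, "acct-1", sleep=delays.append)
print(result, calls["n"], delays)
```

The planning loop sees one successful tool result here, even though the underlying API failed twice.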
Summary and Next Steps
The central insight is simple: agents are not services. They are reasoning processes. The data you need to observe them is not metrics and logs—it is structured reasoning traces that capture what the agent decided, why, and on what evidence.
The architecture is straightforward: intercept every planning step and tool call, capture reasoning text and context state, store step records in a trace tree, and run anomaly detectors inline. The hard part is organizational—persuading your team to invest in this before something breaks badly enough in production to make the case for you.
Where to start: instrument your highest-traffic agent with step recording this week. You do not need the full anomaly detection suite. Just capturing reasoning text and tool call inputs and outputs will reveal more about your system's behavior in one day than a month of reviewing HTTP access logs.
From there, build the planning loop detector first—it catches the most expensive class of failure. Then add context token monitoring. Then build the evaluation pipeline for outcome quality. In that order.
The field is moving fast. OpenTelemetry's GenAI semantic conventions are being finalized and will provide a standard schema for LLM spans. Frameworks like LangSmith, Phoenix (Arize), and Weave (Weights & Biases) are building purpose-built trace storage for agents. The infrastructure is maturing. But none of it replaces the fundamental work of capturing reasoning—that is something you have to instrument in your own system, regardless of what observability vendor you use.
Agents that you cannot observe are agents you cannot trust in production. Build the flight recorder before you need it.
Disclaimer: Token pricing, hardware costs, and model version names referenced in this article reflect publicly available information as of February 2026. Embedding pricing for text-embedding-3-small is $0.02 per million tokens (standard) and $0.01 per million tokens (Batch API), sourced from https://platform.openai.com/docs/pricing. LLM provider pricing changes frequently; verify current rates before making infrastructure or cost projections. All token count estimates are approximate and depend on tokenizer implementation.
References and Further Reading
- OpenTelemetry GenAI Semantic Conventions — https://opentelemetry.io/docs/specs/semconv/gen-ai/
- LangSmith documentation on tracing LangChain and LangGraph agents — https://docs.smith.langchain.com
- Arize Phoenix: open-source LLM observability — https://github.com/Arize-ai/phoenix
- Weights & Biases Weave for LLM tracing — https://weave-docs.wandb.ai
- Ouyang et al. (2022), "Training language models to follow instructions with human feedback" — https://arxiv.org/abs/2203.02155
- "ReAct: Synergizing Reasoning and Acting in Language Models" — Yao et al. (2022), https://arxiv.org/abs/2210.03629
- "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" — Wei et al. (2022), https://arxiv.org/abs/2201.11903
- LangGraph documentation — https://langchain-ai.github.io/langgraph/
- "Observability Engineering" — Charity Majors, Liz Fong-Jones, George Miranda (O'Reilly 2022) — foundational text, apply the core principles to agent systems