An invoice-processing agent ran all weekend. No alerts fired. No errors surfaced. The system showed green across every dashboard. On Monday, the team discovered it had burned several hundred dollars in API credits looping on a validation error that didn't exist - the agent kept retrying a condition it had incorrectly inferred, and nothing in the monitoring stack could see the reasoning path that drove it there.
The team spent two days reconstructing what happened by hand. They had traces. They had logs. They had latency dashboards. What they didn't have was any visibility into the sequence of state transitions that explained why the agent kept going.
That is the gap this article is about.
The Observability Illusion in Agent Fleets
Here is the belief most platform teams hold: "We have LangSmith set up, so we have observability."
This is wrong. Not because LangSmith is inadequate - it's a capable tool - but because having traces is not the same as having observability. Traditional APM was designed for stateless services with deterministic control flow. You know the code paths. You can enumerate failure modes. Latency is a proxy for correctness.
Agent systems violate every assumption that traditional monitoring was built on:
- They carry state across turns and tool calls
- The same input produces different reasoning paths depending on context
- A technically successful response (200 OK, tokens returned) can be a behavioral failure
- Failures compound silently across agent-to-agent handoffs before any single agent raises an error
The thesis of this series: governing a fleet of agents in production requires a control plane - and the first component of that control plane is a three-tier metric layer that most teams are missing two thirds of.
Why Teams Keep Getting Burned
The failure pattern repeats. A team instruments their LangGraph agents with LangSmith. They wire up Prometheus to scrape infrastructure metrics. They build a Grafana dashboard with P99 latency and token usage by model. They call this "done."
Six weeks into production, an agent starts silently hallucinating tool parameters - calling real APIs with plausible-looking but incorrect payloads. The infrastructure layer shows nothing wrong. The harness layer shows token costs are stable. The behavioral layer doesn't exist, so nobody sees that the agent's tool-call accuracy has degraded 23% over the past week. By the time a user reports a problem, there are hundreds of corrupted downstream records to trace.
This is not a tooling gap. Every component needed to catch that failure existed. It's an architecture gap - the failure to organize metrics into the right conceptual tiers and then instrument all three.
The Three-Tier Control Plane Metric Model
```d2
Three Tiers: {
  vertical-gap: 10
  grid-rows: 3
  Tier3: {
    label: "Tier 3: Behavioral Metrics\n(Is the agent doing the right thing?)"
    style: {
      fill: "#f3e5f5"
      stroke: "#7b1fa2"
    }
  }
  Tier2: {
    label: "Tier 2: Harness Metrics\n(Is the agent executing correctly?)"
    style: {
      fill: "#e3f2fd"
      stroke: "#1976d2"
    }
  }
  Tier1: {
    label: "Tier 1: Infrastructure Metrics\n(Is the system running?)"
    style: {
      fill: "#e8f5e9"
      stroke: "#388e3c"
    }
  }
}
```
The control plane for an agent fleet requires three distinct tiers of metrics, each answering a different question.
Tier 1 - Infrastructure Metrics ask: Is the system running?
This is traditional APM territory. CPU, memory, GPU utilization, network I/O, error rates on HTTP services, pod health in Kubernetes. Your existing Datadog or Prometheus setup covers this. The mistake is thinking this tier is sufficient. An agent can be "running" at 100% uptime while making systematically wrong decisions at every step.
Tier 2 - Harness Metrics ask: Is the agent executing correctly?
This is what the Harness Engineering series (starting with the foundational layer article) treats as the instrumentation concern for individual agents. Token cost per node, per-step latency, tool call success rates, retry counts, gate trigger rates from the Gated Execution layer, validation failure rates from the Validation Layer, circuit breaker trip counts from Retry and Fallback infrastructure.
For a single agent, these metrics surface in LangSmith naturally. For a fleet of 10-50 agents, they need to be aggregated, normalized, and queryable across agent types in one plane.
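What "one plane" means concretely: every agent type records into the same OTel instrument, differentiated only by attributes. A minimal sketch, assuming a meter provider is already configured (the full setup appears in the Right Way section below):

```python
# Sketch: fleet-level aggregation via a shared instrument. Both agent
# types record into ONE counter; any OTel backend can then slice cost
# by agent name, model, or fleet without joins across projects.
from opentelemetry import metrics

meter = metrics.get_meter("ai.control_plane")
token_cost = meter.create_counter(
    "gen_ai.agent.token_cost",
    unit="USD",
    description="Token cost in USD per agent node execution",
)

# Two different agent types, one queryable plane:
token_cost.add(0.042, {"gen_ai.agent.name": "invoice_extractor",
                       "gen_ai.request.model": "gpt-4o"})
token_cost.add(0.013, {"gen_ai.agent.name": "approval_router",
                       "gen_ai.request.model": "gpt-4o-mini"})
```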
Tier 3 - Behavioral Metrics ask: Is the agent doing the right thing?
This tier is where most teams have a complete blind spot. Behavioral metrics track the gap between technical success and actual correctness: goal completion rate, tool-call accuracy, output schema compliance, quality drift over time, reasoning path anomalies. A hiring agent can process 10,000 resumes with zero errors and still be systematically biased. Standard dashboards show green. Behavioral metrics don't.
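To make Tier 3 concrete, here is a minimal sketch of one behavioral signal - a schema-compliance check. Pydantic and the `InvoiceFields` schema are illustrative choices, not anything the tooling prescribes:

```python
# Sketch: a Tier 3 behavioral signal. Technical success (the LLM
# returned tokens) is checked against actual correctness (the output
# parses into the expected schema), and the outcome is emitted as a
# typed metric rather than a log line.
from opentelemetry import metrics
from pydantic import BaseModel, ValidationError


class InvoiceFields(BaseModel):  # illustrative output schema
    vendor: str
    amount: float


meter = metrics.get_meter("ai.control_plane")
schema_compliance = meter.create_counter("gen_ai.agent.schema_compliance")


def record_schema_compliance(raw_output: str, agent_name: str) -> bool:
    """Validate agent output and emit pass/fail as a behavioral metric."""
    try:
        InvoiceFields.model_validate_json(raw_output)
        outcome = "pass"
    except ValidationError:
        outcome = "fail"
    schema_compliance.add(1, {"gen_ai.agent.name": agent_name,
                              "schema.outcome": outcome})
    return outcome == "pass"
```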
The three tiers together define the Telemetry Surface Gap - the coverage difference between what your current monitoring captures and the full three-tier signal surface required to govern an agent fleet. The Telemetry Surface Gap is why production incidents in agent systems take so long to diagnose: the signal that would have surfaced the failure exists in principle, but no one is collecting or querying it.
| Infrastructure | Harness | Behavioral |
|---|---|---|
| CPU/Memory | Token cost/node | Goal completion |
| HTTP error rates | Step latency | Tool accuracy |
| Pod health | Tool call rate | Output quality |
| Uptime | Gate triggers | Reasoning drift |
| GPU utilization | Retry counts | Schema compliance |
Most teams have Tier 1. Some teams have Tier 2 for their primary agent. Almost none have Tier 3, and fewer still have all three tiers unified across their full fleet. The Telemetry Surface Gap is the measurable distance between where your instrumentation ends and where it needs to be.
The Session Reconstruction Problem
Before looking at implementation, the framing matters.
The dominant mental model for observability is: collect logs, search logs when something breaks, correlate manually. This works for microservices. It fails for agents.
Here's why. In a microservice, a request has a clear start and end. The execution path is deterministic. You can reproduce failures reliably. When something breaks, you look at the logs around the timestamp and find the error.
An agent run is not a request. It's a sequence of state transitions. A tool response changes the state. A planner decision changes the state. A human approval changes the state. A retry changes the state. A delegated sub-task changes the state. To understand why an agent did what it did, you need to reconstruct the full sequence of those transitions - not just the terminal output, not just the error message at the end.
Agent observability is not a logging problem. It is a session reconstruction problem. This distinction changes how you architect the metric layer. Instead of treating each agent action as an isolated log event, you track it as a state transition in a session. The session is the unit of analysis. Debugging is replaying the session to find the transition that sent the agent down the wrong path.
For the invoice-processing agent that burned through API credits over a weekend, the failure wasn't visible in any individual log line. It was visible in the pattern across hundreds of state transitions - the agent repeatedly re-entering the same validation loop. That pattern is only visible when you can replay the session.
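A minimal sketch of what that looks like as data, assuming an append-only event log keyed by session_id; the event shape here is illustrative, not a standard:

```python
# Sketch: the session as an append-only sequence of state transitions.
# Replay is just reading the transitions back in order and looking
# for patterns - e.g. the same state fingerprint recurring.
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StateTransition:  # illustrative event shape
    session_id: str
    step: int
    trigger: str            # "tool_response" | "planner" | "retry" | ...
    state_fingerprint: str  # stable hash of reasoning-relevant state
    timestamp: float = field(default_factory=time.time)


@dataclass
class SessionLog:
    transitions: list[StateTransition] = field(default_factory=list)

    def append(self, t: StateTransition) -> None:
        self.transitions.append(t)

    def replay_loops(self) -> list[tuple[str, int]]:
        """Fingerprints seen more than once - the weekend-burning pattern."""
        counts = Counter(t.state_fingerprint for t in self.transitions)
        return [(fp, n) for fp, n in counts.items() if n > 1]
```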
Wrong Way: Framework-Siloed Metric Collection
Here is the naive instrumentation approach most teams start with.
```python
# Wrong way: per-framework, siloed metric collection.
# Each agent type is instrumented independently.
# There is no unified fleet view.
import os

from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI

# LangSmith auto-instrumentation via env var
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "invoice-agent"

# Second agent - different project, no shared context
os.environ["LANGCHAIN_PROJECT"] = "approval-agent"

# Problems with this approach:
# 1. Two separate LangSmith projects - no cross-agent trace correlation
# 2. No fleet-level token cost aggregation
# 3. No behavioral metrics anywhere
# 4. Infrastructure metrics live in a separate Datadog dashboard
# 5. To answer "what is the total cost of processing one invoice
#    end-to-end across all agents?" requires manual joins across
#    three different systems.
```

The siloed approach has a specific failure signature: when a multi-agent pipeline fails, you know something went wrong, but you cannot trace where in the agent handoff chain the failure originated without manually correlating trace IDs across systems. This is a multi-hour exercise in production incidents.
A second common mistake - using a single shared LangSmith project for all agents:
```python
# Also wrong: single project, no agent identity
os.environ["LANGCHAIN_PROJECT"] = "production-agents"

# Now all agents share one project.
# You cannot query "show me cost breakdown by agent type"
# or "which agent is responsible for the P99 latency spike?"
# Everything is one undifferentiated blob.
```

Right Way: Unified Fleet Telemetry with OTel and Agent Identity
The correct architecture separates three concerns:
- Agent identity - every span knows which agent type and version produced it
- Session correlation - all spans for one end-to-end pipeline execution share a trace ID
- Tier-aware metric emission - harness metrics and behavioral metrics are emitted as typed signals, not free-text logs
LangGraph 1.1.2 (March 2026) with LangSmith Fleet and OpenTelemetry GenAI Semantic Conventions v1.37 gives you the foundation. Here is the instrumentation pattern:
```python
# Right way: unified fleet telemetry with OTel + agent identity
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Any, Callable, TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


# --- OTel setup ---
def setup_telemetry(otlp_endpoint: str) -> tuple[trace.Tracer, metrics.Meter]:
    tracer_provider = TracerProvider()
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(tracer_provider)

    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint),
        export_interval_millis=30_000,
    )
    meter_provider = MeterProvider(metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)

    tracer = trace.get_tracer("ai.control_plane")
    meter = metrics.get_meter("ai.control_plane")
    return tracer, meter


# --- Fleet-level metric instruments ---
@dataclass
class FleetMetrics:
    """
    Typed metric instruments for all three tiers.
    One instance shared across the fleet.
    """

    # Tier 2 - Harness Metrics
    token_cost_counter: metrics.Counter          # cost per agent, per model
    node_latency_histogram: metrics.Histogram    # latency per node, per agent type
    tool_call_counter: metrics.Counter           # tool calls by name, agent, result
    gate_trigger_counter: metrics.Counter        # gate layer trips by policy
    retry_counter: metrics.Counter               # retries by agent, error type
    validation_failure_counter: metrics.Counter  # validation failures by schema
    # Tier 3 - Behavioral Metrics
    goal_completion_counter: metrics.Counter     # success/failure per goal type
    tool_accuracy_histogram: metrics.Histogram   # accuracy of tool param generation
    quality_score_histogram: metrics.Histogram   # LLM-as-judge scores by agent
    schema_compliance_counter: metrics.Counter   # output schema pass/fail
    reasoning_loop_counter: metrics.Counter      # repeated reasoning state entries

    @classmethod
    def create(cls, meter: metrics.Meter) -> "FleetMetrics":
        return cls(
            # Tier 2
            token_cost_counter=meter.create_counter(
                "gen_ai.agent.token_cost",
                description="Token cost in USD per agent node execution",
                unit="USD",
            ),
            node_latency_histogram=meter.create_histogram(
                "gen_ai.agent.node_latency",
                description="Latency per agent node in milliseconds",
                unit="ms",
            ),
            tool_call_counter=meter.create_counter(
                "gen_ai.agent.tool_calls",
                description="Tool call count by name, agent, and result",
            ),
            gate_trigger_counter=meter.create_counter(
                "gen_ai.agent.gate_triggers",
                description="Gated execution policy trips",
            ),
            retry_counter=meter.create_counter(
                "gen_ai.agent.retries",
                description="Retry count by agent and error type",
            ),
            validation_failure_counter=meter.create_counter(
                "gen_ai.agent.validation_failures",
                description="Output validation failures by schema",
            ),
            # Tier 3
            goal_completion_counter=meter.create_counter(
                "gen_ai.agent.goal_completions",
                description="Goal completion outcomes by agent and goal type",
            ),
            tool_accuracy_histogram=meter.create_histogram(
                "gen_ai.agent.tool_accuracy",
                description="Tool parameter accuracy score (0.0-1.0)",
            ),
            quality_score_histogram=meter.create_histogram(
                "gen_ai.agent.quality_score",
                description="LLM-as-judge quality score per agent run",
            ),
            schema_compliance_counter=meter.create_counter(
                "gen_ai.agent.schema_compliance",
                description="Output schema compliance pass/fail",
            ),
            reasoning_loop_counter=meter.create_counter(
                "gen_ai.agent.reasoning_loops",
                description="Repeated reasoning state entries per session",
            ),
        )


# --- Agent identity carrier ---
@dataclass
class AgentIdentity:
    """
    Carries agent type, version, and fleet position into every span.
    Maps to OTel GenAI SemConv v1.37 gen_ai.agent.* attributes.
    """

    agent_type: str     # e.g. "invoice_extractor", "approval_router"
    agent_version: str  # e.g. "1.3.0"
    fleet_id: str       # e.g. "invoice_processing_fleet"
    model: str          # e.g. "gpt-4o-2024-11-20"


# --- Instrumented agent state ---
class InvoiceState(TypedDict):
    invoice_text: str
    extracted_fields: dict[str, Any]
    validation_errors: list[str]
    approval_status: str
    session_id: str  # shared across entire pipeline execution
    agent_type: str  # current agent type for metric attribution


# --- Instrumented node wrapper ---
class InstrumentedNode:
    """
    Wraps any LangGraph node to emit Tier 2 and Tier 3 metrics
    with full agent identity and session correlation.
    """

    def __init__(
        self,
        identity: AgentIdentity,
        fleet_metrics: FleetMetrics,
        tracer: trace.Tracer,
    ) -> None:
        self.identity = identity
        self.metrics = fleet_metrics
        self.tracer = tracer

    def _base_attributes(self, session_id: str) -> dict[str, str]:
        """OTel GenAI SemConv v1.37 compliant attribute set."""
        return {
            "gen_ai.agent.name": self.identity.agent_type,
            "gen_ai.agent.version": self.identity.agent_version,
            "gen_ai.system": self.identity.fleet_id,
            "gen_ai.request.model": self.identity.model,
            "session.id": session_id,
        }

    def run(
        self,
        state: InvoiceState,
        node_fn: Callable[[InvoiceState], InvoiceState],
    ) -> InvoiceState:
        session_id = state["session_id"]
        attrs = self._base_attributes(session_id)
        with self.tracer.start_as_current_span(
            f"gen_ai.agent.{self.identity.agent_type}",
            attributes=attrs,
        ) as span:
            start_ms = time.monotonic() * 1000
            try:
                result = node_fn(state)
                span.set_attribute("gen_ai.agent.outcome", "success")
                return result
            except Exception as exc:
                span.set_attribute("gen_ai.agent.outcome", "error")
                span.set_attribute("error.type", type(exc).__name__)
                raise
            finally:
                elapsed_ms = time.monotonic() * 1000 - start_ms
                self.metrics.node_latency_histogram.record(elapsed_ms, attrs)

    def record_token_cost(self, cost_usd: float, session_id: str) -> None:
        self.metrics.token_cost_counter.add(
            cost_usd, self._base_attributes(session_id)
        )

    def record_tool_call(
        self, tool_name: str, success: bool, session_id: str
    ) -> None:
        attrs = {
            **self._base_attributes(session_id),
            "tool.name": tool_name,
            "tool.result": "success" if success else "failure",
        }
        self.metrics.tool_call_counter.add(1, attrs)

    def record_goal_completion(
        self, goal_type: str, completed: bool, session_id: str
    ) -> None:
        attrs = {
            **self._base_attributes(session_id),
            "goal.type": goal_type,
            "goal.outcome": "completed" if completed else "failed",
        }
        self.metrics.goal_completion_counter.add(1, attrs)

    def record_reasoning_loop(self, session_id: str) -> None:
        self.metrics.reasoning_loop_counter.add(
            1, self._base_attributes(session_id)
        )

    def record_quality_score(self, score: float, session_id: str) -> None:
        self.metrics.quality_score_histogram.record(
            score, self._base_attributes(session_id)
        )


# --- Example: fleet-instrumented extraction agent ---
def build_invoice_extraction_agent(
    identity: AgentIdentity,
    fleet_metrics: FleetMetrics,
    tracer: trace.Tracer,
):
    """
    Builds a compiled extraction agent with full Tier 2 + Tier 3
    instrumentation. The session_id in state ties this agent's spans
    to the full pipeline trace.
    """
    instrumented = InstrumentedNode(identity, fleet_metrics, tracer)
    llm = ChatOpenAI(model=identity.model, temperature=0)

    def extract_node(state: InvoiceState) -> InvoiceState:
        """Node function: instrumented.run wraps _run for span + latency."""

        def _run(state: InvoiceState) -> InvoiceState:
            response = llm.invoke(
                f"Extract invoice fields as JSON from:\n{state['invoice_text']}"
            )
            # Tier 2: record token cost (crude length-based estimate here;
            # read the response's token usage metadata in production)
            estimated_cost = len(response.content) * 0.000015
            instrumented.record_token_cost(estimated_cost, state["session_id"])
            # Tier 3: record goal completion
            # (hardcoded fields for illustration; parse response.content in production)
            extracted = {"vendor": "Acme Corp", "amount": 4200.00}
            instrumented.record_goal_completion(
                goal_type="field_extraction",
                completed=bool(extracted),
                session_id=state["session_id"],
            )
            return {**state, "extracted_fields": extracted}

        return instrumented.run(state, _run)

    graph = StateGraph(InvoiceState)
    graph.add_node("extract", extract_node)
    graph.set_entry_point("extract")
    graph.add_edge("extract", END)
    return graph.compile()
```

The key differences from the siloed approach:
- Every span carries `gen_ai.agent.name`, `gen_ai.agent.version`, `gen_ai.system`, and `session.id` - queryable attributes that let you aggregate across the full fleet in any OTel-compatible backend
- Harness metrics (Tier 2) and behavioral metrics (Tier 3) are separate, typed metric instruments - not log lines you have to parse
- The `session_id` propagates from the first agent in the pipeline to the last, so one query can reconstruct the full end-to-end execution across multiple agents
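To see the pieces working together, here is a usage sketch built directly on the classes above. The collector endpoint, identity values, and invoice text are illustrative assumptions, and the pipeline is trimmed to the single extraction agent:

```python
# Sketch: one session_id minted at the pipeline boundary and carried
# in state, so every agent's spans and metrics join on session.id.
import uuid

tracer, meter = setup_telemetry("http://otel-collector:4317")  # hypothetical endpoint
fleet_metrics = FleetMetrics.create(meter)

extractor = build_invoice_extraction_agent(
    AgentIdentity(
        agent_type="invoice_extractor",
        agent_version="1.3.0",
        fleet_id="invoice_processing_fleet",
        model="gpt-4o-2024-11-20",
    ),
    fleet_metrics,
    tracer,
)

result = extractor.invoke({
    "invoice_text": "Invoice #4821 from Acme Corp, total $4,200.00",
    "extracted_fields": {},
    "validation_errors": [],
    "approval_status": "pending",
    "session_id": str(uuid.uuid4()),  # created ONCE per end-to-end execution
    "agent_type": "invoice_extractor",
})
```

A second agent - the approval router, say - would receive the same `session_id` in its input state rather than minting its own. That single discipline is what makes end-to-end reconstruction a one-query operation.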
Detecting Reasoning Loops Before They Cost You
The `gen_ai.agent.reasoning_loops` metric deserves special attention because it closes the gap the invoice-processing story exposed. Here is the minimal implementation.
```python
# Reasoning loop detection: track state fingerprints per session.
# When the same reasoning state appears more than N times, emit the
# loop metric and optionally interrupt the agent.
import hashlib
from collections import defaultdict


class ReasoningLoopDetector:
    """
    Tracks state fingerprints within a session.
    Emits gen_ai.agent.reasoning_loops when a fingerprint repeats.
    """

    def __init__(
        self,
        instrumented_node: InstrumentedNode,
        loop_threshold: int = 3,
    ) -> None:
        self.node = instrumented_node
        self.threshold = loop_threshold
        self._fingerprints: dict[str, int] = defaultdict(int)

    def _fingerprint(self, state: InvoiceState) -> str:
        """Stable hash of reasoning-relevant state fields."""
        key = f"{state.get('extracted_fields')}|{state.get('validation_errors')}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def check(self, state: InvoiceState) -> bool:
        """
        Returns True if the current state has exceeded the loop threshold.
        Call this at the start of each node execution.
        """
        fp = self._fingerprint(state)
        self._fingerprints[fp] += 1
        count = self._fingerprints[fp]
        if count > 1:
            # Emit the Tier 3 behavioral metric on every repeat
            self.node.record_reasoning_loop(state["session_id"])
        if count >= self.threshold:
            # This is the alert condition.
            # In production: set span status to error, surface to on-call.
            return True  # caller should interrupt the agent
        return False
```

Wire this into the node at the start of execution. When `check()` returns True, the node should return an error state rather than calling the LLM again. The metric counter fires at every repeat - so alerting on "reasoning_loops > 3 in one session" catches the failure within a handful of iterations, not after a weekend of runaway API calls.
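A sketch of that wiring, assuming the `instrumented` node wrapper and `extract_node` function from the earlier example are in scope; the error-state shape is illustrative:

```python
# Sketch: loop check at the top of a node. When the threshold trips,
# the node short-circuits into an error state instead of calling the
# LLM again - a runaway weekend becomes a bounded, alerted failure.
detector = ReasoningLoopDetector(instrumented, loop_threshold=3)


def guarded_extract_node(state: InvoiceState) -> InvoiceState:
    if detector.check(state):
        # Interrupt: surface the loop, do NOT invoke the LLM again.
        return {
            **state,
            "validation_errors": state["validation_errors"]
            + ["reasoning_loop_threshold_exceeded"],
            "approval_status": "halted",
        }
    return extract_node(state)  # the instrumented node from above
```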
The Control Plane Telemetry Routing Layer
Single-agent instrumentation closes the Telemetry Surface Gap for one agent. A control plane operates at fleet scope. That requires a telemetry routing layer: a component that takes OTel spans from all agents, regardless of framework, and normalizes them into one queryable backend. This is what makes the control plane framework-agnostic.
```yaml
# otel-collector-config.yaml (abbreviated)
# Fleet telemetry routing: one OTel Collector config that receives
# from LangGraph agents, OpenAI Agents SDK agents, and any custom
# orchestrators, and routes everything to one backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  attributes/enrich:
    actions:
      - key: fleet.environment
        value: production
        action: insert
      - key: fleet.version
        from_attribute: gen_ai.agent.version
        action: insert
  filter/pii:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: gen_ai.input.messages
            value: ".*\\b\\d{16}\\b.*"  # strip card numbers from prompt traces

exporters:
  otlphttp/langsmith:
    endpoint: https://api.smith.langchain.com/otel
    headers:
      x-api-key: ${LANGSMITH_API_KEY}
  prometheus:
    endpoint: 0.0.0.0:8889  # Tier 2 + Tier 3 metrics to Grafana

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/enrich, filter/pii]
      exporters: [otlphttp/langsmith]
    metrics:
      receivers: [otlp]
      processors: [attributes/enrich]
      exporters: [prometheus]
```

The result: LangSmith gets trace data for debugging. Prometheus + Grafana gets metric data for dashboards and alerts. Both views are correlated by `session_id` and agent identity attributes. PII is stripped before leaving the network boundary.

This is the architecture that makes a control plane possible. The OTel Collector is the normalization layer. It doesn't care whether the span came from a LangGraph agent, an OpenAI Agents SDK agent, or a custom Python orchestrator. If the span follows OTel GenAI SemConv v1.37 conventions, the collector processes it identically.
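Framework-agnostic is worth making concrete: a custom orchestrator with no LangChain dependency lands in the same backend by emitting spans with the same attribute set. A minimal sketch, reusing the SemConv attribute names from the instrumentation above; all values are illustrative:

```python
# Sketch: a custom (non-LangGraph) agent emitting the same attributes.
# The collector pipeline enriches, filters, and routes this span
# exactly as it does a framework-produced one.
import uuid

from opentelemetry import trace

tracer = trace.get_tracer("custom.orchestrator")

# In reality, read this from the incoming pipeline state rather than
# minting it here - the ID must be shared across the whole session.
session_id = str(uuid.uuid4())

with tracer.start_as_current_span(
    "gen_ai.agent.ledger_sync",
    attributes={
        "gen_ai.agent.name": "ledger_sync",
        "gen_ai.agent.version": "1.0.5",
        "gen_ai.system": "invoice_processing_fleet",
        "gen_ai.request.model": "gpt-4o-mini",
        "session.id": session_id,
    },
) as span:
    # ... custom agent logic: plan, call tools, write results ...
    span.set_attribute("gen_ai.agent.outcome", "success")
```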
The Harness Metric Surface - What to Instrument Per Agent
Building on the execution primitives from the Harness Engineering series, every agent in a fleet should emit this minimum metric surface:
Tier 2 - Harness Metrics (per agent node)
| Metric | Description |
|---|---|
| `gen_ai.agent.token_cost` | Cost in USD per invocation |
| `gen_ai.agent.node_latency` | p50/p99 latency per node type |
| `gen_ai.agent.tool_calls` | Count by tool name + success/failure |
| `gen_ai.agent.gate_triggers` | Gated Execution policy trip count |
| `gen_ai.agent.retries` | Retry count by error class |
| `gen_ai.agent.validation_failures` | Schema/content validation failures |
| `gen_ai.agent.checkpoint_writes` | State Management checkpoint ops |
Tier 3 - Behavioral Metrics (per session)
| Metric | Description |
|---|---|
| `gen_ai.agent.goal_completions` | Success/failure by goal type |
| `gen_ai.agent.tool_accuracy` | Param accuracy score (0.0-1.0) |
| `gen_ai.agent.quality_score` | LLM-as-judge score per run |
| `gen_ai.agent.schema_compliance` | Output schema pass/fail rate |
| `gen_ai.agent.reasoning_loops` | Repeated reasoning state entries (loop detection - the invoice agent bug) |
The `gen_ai.agent.reasoning_loops` metric is the one that would have caught the invoice-processing failure within minutes instead of two days. When the counter for "agent re-entered the same reasoning state" exceeds a threshold - say, 3 occurrences in one session - it fires an alert. The session is flagged for human review before it burns through a weekend of credits.
Diagram: Three-Tier Metric Architecture
```mermaid
flowchart TD
    subgraph Fleet["Agent Fleet"]
        A1["Invoice Extractor\nAgent v1.3.0"]
        A2["Approval Router\nAgent v2.1.0"]
        A3["Ledger Sync\nAgent v1.0.5"]
    end
    subgraph OTelSDK["OTel SDK\nper agent"]
        S1["Tier 1\nInfra Spans"]
        S2["Tier 2\nHarness Metrics"]
        S3["Tier 3\nBehavioral Metrics"]
    end
    subgraph Collector["OTel Collector"]
        E["Enrich\nagent identity\n+ session_id"]
        P["Filter PII\nfrom prompts"]
        R["Route to\nbackends"]
    end
    subgraph Backends["Observability Backends"]
        LS["LangSmith Fleet\nTrace debugging\nSession replay"]
        PR["Prometheus\n+ Grafana\nMetric dashboards\nAlerts"]
    end
    A1 & A2 & A3 --> S1 & S2 & S3
    S1 & S2 & S3 --> E --> P --> R
    R -->|"traces"| LS
    R -->|"metrics"| PR
    style A1 fill:#4A90E2,color:#fff
    style A2 fill:#4A90E2,color:#fff
    style A3 fill:#4A90E2,color:#fff
    style S1 fill:#95A5A6,color:#fff
    style S2 fill:#7B68EE,color:#fff
    style S3 fill:#6BCF7F,color:#fff
    style E fill:#FFD93D,color:#333
    style P fill:#E74C3C,color:#fff
    style R fill:#98D8C8,color:#333
    style LS fill:#9B59B6,color:#fff
    style PR fill:#9B59B6,color:#fff
```
The collector is the normalization point that makes the control plane framework-agnostic. Without it, you have N observability setups for N agent frameworks. With it, you have one.
LangSmith Fleet and the Agent Identity Model
LangSmith Fleet (launched March 2026, formerly Agent Builder) introduced a formal agent identity model: agents have an identity, a version, sharing permissions, and fleet-scoped access controls. This is the platform-level answer to the identity problem described above.
The ABAC (Attribute-Based Access Control) layer in LangSmith now lets you apply tag-based allow/deny policies on top of RBAC roles - so you can control which teams can query which agent's traces. Audit Logs provide a tamper-resistant record of every administrative action across the fleet, queryable via API.
For teams already on the LangChain stack, Fleet is the right starting point for fleet-level identity management. For teams with mixed frameworks - some on LangGraph, some on OpenAI Agents SDK, some custom - the OTel Collector approach above is the framework-neutral path.
Tiered Storage for Session Replay
A common objection to full session-level observability: "If we store every state transition for every agent session, the storage bill will exceed the inference bill."
This is true if you treat every event as permanently hot data. The right model is tiered:
- Hot tier (0-7 days): Full session traces in LangSmith or your OTel backend. Queryable in seconds. Used for active incident response.
- Warm tier (7-90 days): Compressed session replay data in object storage (S3, GCS). Takes minutes to rehydrate. Used for debugging recurring issues and regression analysis.
- Cold tier (90+ days): Archived sessions in low-cost object storage. Used for compliance, audit, and long-horizon quality analysis.
The goal is replay-on-demand without paying hot-tier prices for cold data. Most platforms (LangSmith Enterprise, Langfuse self-hosted) support this pattern natively. For custom stacks, a Kafka-backed event stream with tiered consumers is the production-grade architecture.
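For a custom stack on S3, the warm and cold transitions can be expressed as a single lifecycle rule. A sketch using boto3 - the bucket name is hypothetical, and the storage-class mapping is one reasonable choice, not a prescription:

```python
# Sketch: hot data lives in the observability backend (0-7 days);
# exported session archives then age through S3 storage classes.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="agent-session-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "session-replay-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "sessions/"},
                "Transitions": [
                    # Warm tier after the 7-day hot window (Glacier
                    # Instant Retrieval keeps replay-on-demand viable)
                    {"Days": 7, "StorageClass": "GLACIER_IR"},
                    # Cold tier for compliance and long-horizon analysis
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```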
Instrumentation Checklist for Agent Fleet Observability
Before your first agent goes to production:
- OTel SDK installed and configured with OTLP exporter
- Every agent emits `gen_ai.agent.name`, `gen_ai.agent.version`, and `gen_ai.system` on every span
- `session_id` propagates from pipeline entry to final agent - one ID per end-to-end execution
- Tier 2 harness metrics (token cost, node latency, tool calls, retries, gate triggers) defined as typed OTel metric instruments, not log lines
- Tier 3 behavioral metrics (goal completion, tool accuracy, quality score, schema compliance) defined and emitted per session
- `gen_ai.agent.reasoning_loops` counter implemented with alert at threshold >3 per session
- OTel Collector deployed with PII filter processor before traces leave the network boundary
- LangSmith (or equivalent) receives trace data for session debugging
- Prometheus + Grafana (or equivalent) receives metric data for dashboards and fleet-level alerts
- Tiered storage policy defined: hot/warm/cold with retention periods
Before adding the second agent type to your fleet:
- Both agents share the same `fleet_id` attribute value
- Fleet-level dashboards query by `gen_ai.system = {fleet_id}` - not by individual project
- Cost attribution is queryable: "total cost per pipeline execution" across all contributing agents
- Alert thresholds defined at fleet scope, not per-agent
What Comes Next
The opening thesis was this: teams think they have observability because they have traces. They don't. They have logging without the behavioral and session-level signal needed to govern a fleet.
What you now have, after applying the three-tier metric model and the telemetry routing architecture, is the foundation to close the Telemetry Surface Gap - the ability to query fleet-wide harness metrics, detect behavioral failures before users do, and replay sessions to reconstruct why an agent made the decisions it made. That is observability. Traces are one input to it.
The rest of the control plane depends on this foundation. Without it, you cannot enforce global policies you can verify, you cannot detect which agent version introduced a regression, and you cannot attribute cost to the right pipeline. Every subsequent part of this series assumes you have all three tiers instrumented and the routing layer running.
Part 2 covers Global Policy Enforcement vs. Per-Agent Gate Rules - how to separate fleet-wide policy (no PII in external tool calls, hard budget caps, model version pinning) from the per-agent gate logic introduced in the Gated Execution layer. These are distinct concerns that most teams collapse into one, and the collision causes both systems to fail.
References
- LangChain. (March 2026). March 2026 Newsletter: LangSmith Fleet, LangGraph v1.1. https://blog.langchain.com/march-2026-langchain-newsletter/
- LangChain. (October 2025). View Agent Deployment Metrics in LangGraph Platform. https://changelog.langchain.com/announcements/view-deployment-metrics-in-langgraph-platform
- LangChain. (March 2026). LangGraph Changelog - v1.1. https://docs.langchain.com/oss/python/releases/changelog
- LangChain. LangSmith Cost Tracking. https://docs.langchain.com/langsmith/cost-tracking
- LangChain. LangSmith Observability. https://www.langchain.com/langsmith/observability
- LangChain. LangSmith Agent Observability - Production Monitoring Conceptual Guide. https://www.langchain.com/conceptual-guides/production-monitoring
- OpenTelemetry. (2025). AI Agent Observability - Evolving Standards and Best Practices. https://opentelemetry.io/blog/2025/ai-agent-observability/
- OpenTelemetry. Semantic Conventions for GenAI Agent and Framework Spans. https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- OpenTelemetry. GitHub Issue #2664. (August 2025). Semantic Conventions for Generative AI Agentic Systems. https://github.com/open-telemetry/semantic-conventions/issues/2664
- Datadog. (December 2025). Datadog LLM Observability Natively Supports OpenTelemetry GenAI Semantic Conventions. https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- Langfuse. (April 2026). langfuse 4.0.6 Python SDK. https://pypi.org/project/langfuse/
- Langfuse. Open Source Observability for LangGraph. https://langfuse.com/guides/cookbook/integration_langgraph
- ActiveWizards. LLM Observability: A Guide to Monitoring with LangSmith, Prometheus, and Grafana. https://activewizards.com/blog/llm-observability-a-guide-to-monitoring-with-langsmith
- Fordel Studios. (April 2026). The Future of Multi-Agent Systems in Enterprise Software. https://fordelstudios.com/research/future-multi-agent-systems-enterprise
- The New Stack. (March 2026). Why Agentic AI Stalls in Production - and How a Control Plane Fixes It. https://thenewstack.io/agentic-ai-control-plane-production/
- Novatechflow. (March 2026). Agent Observability for Multi-Agent Systems: How to Trace Agent Workflows in Production. https://www.novatechflow.com/2026/03/agent-observability-for-multi-agent.html
- QCon San Francisco. (November 2025). The Future of Agentic AI: Architecting the Global Control Plane. https://qconsf.com/presentation/nov2025/future-agentic-ai-architecting-global-control-plane
- getmaxim.ai. (December 2025). Top 5 AI Agent Observability Platforms in 2026. https://www.getmaxim.ai/articles/top-5-ai-agent-observability-platforms-in-2026/
- getmaxim.ai. (November 2025). The Role of Observability in Maintaining AI Agent Performance. https://www.getmaxim.ai/articles/the-role-of-observability-in-maintaining-ai-agent-performance/
- Ranjan Kumar. (April 2026). Harness Engineering: The Missing Layer Between LLMs and Production Systems. https://ranjankumar.in/harness-engineering-the-missing-layer-between-llms-and-production-systems
- Ranjan Kumar. (April 2026). Gated Execution: Why Your Agent Should Never Act Without Permission. https://ranjankumar.in/harness-engineering-gated-execution-llm-agents-policy-safety
- Ranjan Kumar. (April 2026). Validation Layer Design: Building the Reflex That Catches What the Model Gets Wrong. https://ranjankumar.in/harness-engineering-validation-layer-design-llm-output-repair
- Ranjan Kumar. (April 2026). Retry, Fallback, and Circuit Breaking: Building LLM Infrastructure That Survives Outages. https://ranjankumar.in/harness-engineering-retry-fallback-circuit-breaking-llm-resilience
- Ranjan Kumar. (April 2026). State Management for Agentic Systems: How to Build Agents That Don't Start Over. https://ranjankumar.in/harness-engineering-state-management-agentic-systems-checkpoint-memory
Related Articles
- Global Policy Enforcement vs. Per-Agent Gate Rules: Two Layers That Must Not Collapse Into One
- Building a Production MCP Server: Architecture, Pitfalls, and Best Practices
- Building Production-Ready AI Agent Services: FastAPI + LangGraph Template Deep Dive