An invoice-processing agent ran all weekend. No alerts fired. No errors surfaced. The system showed green across every dashboard. On Monday, the team discovered it had burned several hundred dollars in API credits looping on a validation error that didn't exist - the agent kept retrying a condition it had incorrectly inferred, and nothing in the monitoring stack could see the reasoning path that drove it there.
The team spent two days reconstructing what happened by hand. They had traces. They had logs. They had latency dashboards. What they didn't have was any visibility into the sequence of state transitions that explained why the agent kept going.
That is the gap this article is about.
The Observability Illusion in Agent Fleets
Here is the belief most platform teams hold: "We have LangSmith set up, so we have observability."
This is wrong. Not because LangSmith is inadequate - it's a capable tool - but because having traces is not the same as having observability. Traditional APM was designed for stateless services with deterministic control flow. You know the code paths. You can enumerate failure modes. Latency is a proxy for correctness.
Agent systems violate every assumption that traditional monitoring was built on:
- They carry state across turns and tool calls
- The same input produces different reasoning paths depending on context
- A technically successful response (200 OK, tokens returned) can be a behavioral failure
- Failures compound silently across agent-to-agent handoffs before any single agent raises an error
The thesis of this series: governing a fleet of agents in production requires a control plane - and the first component of that control plane is a three-tier metric layer that most teams are missing two thirds of.
Why Teams Keep Getting Burned
The failure pattern repeats. A team instruments their LangGraph agents with LangSmith. They wire up Prometheus to scrape infrastructure metrics. They build a Grafana dashboard with P99 latency and token usage by model. They call this "done."
Six weeks into production, an agent starts silently hallucinating tool parameters - calling real APIs with plausible-looking but incorrect payloads. The infrastructure layer shows nothing wrong. The harness layer shows token costs are stable. The behavioral layer doesn't exist, so nobody sees that the agent's tool-call accuracy has degraded 23% over the past week. By the time a user reports a problem, there are hundreds of corrupted downstream records to trace.
This is not a tooling gap. Every component needed to catch that failure existed. It's an architecture gap - the failure to organize metrics into the right conceptual tiers and then instrument all three.
The Three-Tier Control Plane Metric Model
```d2
Three Tiers: {
  vertical-gap: 10
  grid-rows: 3
  Tier3: {
    label: "Tier 3: Behavioral Metrics\n(Is the agent doing the right thing?)"
    style: {
      fill: "#f3e5f5"
      stroke: "#7b1fa2"
    }
  }
  Tier2: {
    label: "Tier 2: Harness Metrics\n(Is the agent executing correctly?)"
    style: {
      fill: "#e3f2fd"
      stroke: "#1976d2"
    }
  }
  Tier1: {
    label: "Tier 1: Infrastructure Metrics\n(Is the system running?)"
    style: {
      fill: "#e8f5e9"
      stroke: "#388e3c"
    }
  }
}
```
The control plane for an agent fleet requires three distinct tiers of metrics, each answering a different question.
Tier 1 - Infrastructure Metrics ask: Is the system running?
This is traditional APM territory. CPU, memory, GPU utilization, network I/O, error rates on HTTP services, pod health in Kubernetes. Your existing Datadog or Prometheus setup covers this. The mistake is thinking this tier is sufficient. An agent can be "running" at 100% uptime while making systematically wrong decisions at every step.
Tier 2 - Harness Metrics ask: Is the agent executing correctly?
This is what the Harness Engineering series (starting with the foundational layer article) treats as the instrumentation concern for individual agents. Token cost per node, per-step latency, tool call success rates, retry counts, gate trigger rates from the Gated Execution layer, validation failure rates from the Validation Layer, circuit breaker trip counts from Retry and Fallback infrastructure.
For a single agent, these metrics surface in LangSmith naturally. For a fleet of 10-50 agents, they need to be aggregated, normalized, and queryable across agent types in one plane.
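What "one plane" means concretely: every agent type records into the same OTel instrument, differentiated only by attributes. A minimal sketch, assuming a meter provider is already configured (the full setup appears in the Right Way section below):

```python
# Sketch: fleet-level aggregation via a shared instrument. Both agent
# types record into ONE counter; any OTel backend can then slice cost
# by agent name, model, or fleet without joins across projects.
from opentelemetry import metrics

meter = metrics.get_meter("ai.control_plane")
token_cost = meter.create_counter(
    "gen_ai.agent.token_cost",
    unit="USD",
    description="Token cost in USD per agent node execution",
)

# Two different agent types, one queryable plane:
token_cost.add(0.042, {"gen_ai.agent.name": "invoice_extractor",
                       "gen_ai.request.model": "gpt-4o"})
token_cost.add(0.013, {"gen_ai.agent.name": "approval_router",
                       "gen_ai.request.model": "gpt-4o-mini"})
```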
Tier 3 - Behavioral Metrics ask: Is the agent doing the right thing?
This tier is where most teams have a complete blind spot. Behavioral metrics track the gap between technical success and actual correctness: goal completion rate, tool-call accuracy, output schema compliance, quality drift over time, reasoning path anomalies. A hiring agent can process 10,000 resumes with zero errors and still be systematically biased. Standard dashboards show green. Behavioral metrics don't.
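To make Tier 3 concrete, here is a minimal sketch of one behavioral signal - a schema-compliance check. Pydantic and the `InvoiceFields` schema are illustrative choices, not anything the tooling prescribes:

```python
# Sketch: a Tier 3 behavioral signal. Technical success (the LLM
# returned tokens) is checked against actual correctness (the output
# parses into the expected schema), and the outcome is emitted as a
# typed metric rather than a log line.
from opentelemetry import metrics
from pydantic import BaseModel, ValidationError


class InvoiceFields(BaseModel):  # illustrative output schema
    vendor: str
    amount: float


meter = metrics.get_meter("ai.control_plane")
schema_compliance = meter.create_counter("gen_ai.agent.schema_compliance")


def record_schema_compliance(raw_output: str, agent_name: str) -> bool:
    """Validate agent output and emit pass/fail as a behavioral metric."""
    try:
        InvoiceFields.model_validate_json(raw_output)
        outcome = "pass"
    except ValidationError:
        outcome = "fail"
    schema_compliance.add(1, {"gen_ai.agent.name": agent_name,
                              "schema.outcome": outcome})
    return outcome == "pass"
```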
The three tiers together define the Telemetry Surface Gap - the coverage difference between what your current monitoring captures and the full three-tier signal surface required to govern an agent fleet. The Telemetry Surface Gap is why production incidents in agent systems take so long to diagnose: the signal that would have surfaced the failure exists in principle, but no one is collecting or querying it.
| Infrastructure | Harness | Behavioral |
|---|---|---|
| CPU/Memory | Token cost/node | Goal completion |
| HTTP error rates | Step latency | Tool accuracy |
| Pod health | Tool call rate | Output quality |
| Uptime | Gate triggers | Reasoning drift |
| GPU utilization | Retry counts | Schema compliance |
Most teams have Tier 1. Some teams have Tier 2 for their primary agent. Almost none have Tier 3, and fewer still have all three tiers unified across their full fleet. The Telemetry Surface Gap is the measurable distance between where your instrumentation ends and where it needs to be.
The Session Reconstruction Problem
Before looking at implementation, the framing matters.
The dominant mental model for observability is: collect logs, search logs when something breaks, correlate manually. This works for microservices. It fails for agents.
Here's why. In a microservice, a request has a clear start and end. The execution path is deterministic. You can reproduce failures reliably. When something breaks, you look at the logs around the timestamp and find the error.
An agent run is not a request. It's a sequence of state transitions. A tool response changes the state. A planner decision changes the state. A human approval changes the state. A retry changes the state. A delegated sub-task changes the state. To understand why an agent did what it did, you need to reconstruct the full sequence of those transitions - not just the terminal output, not just the error message at the end.
Agent observability is not a logging problem. It is a session reconstruction problem. This distinction changes how you architect the metric layer. Instead of treating each agent action as an isolated log event, you track it as a state transition in a session. The session is the unit of analysis. Debugging is replaying the session to find the transition that sent the agent down the wrong path.
For the invoice-processing agent that burned through API credits over a weekend, the failure wasn't visible in any individual log line. It was visible in the pattern across hundreds of state transitions - the agent repeatedly re-entering the same validation loop. That pattern is only visible when you can replay the session.
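A minimal sketch of what that looks like as data, assuming an append-only event log keyed by session_id; the event shape here is illustrative, not a standard:

```python
# Sketch: the session as an append-only sequence of state transitions.
# Replay is just reading the transitions back in order and looking
# for patterns - e.g. the same state fingerprint recurring.
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StateTransition:  # illustrative event shape
    session_id: str
    step: int
    trigger: str            # "tool_response" | "planner" | "retry" | ...
    state_fingerprint: str  # stable hash of reasoning-relevant state
    timestamp: float = field(default_factory=time.time)


@dataclass
class SessionLog:
    transitions: list[StateTransition] = field(default_factory=list)

    def append(self, t: StateTransition) -> None:
        self.transitions.append(t)

    def replay_loops(self) -> list[tuple[str, int]]:
        """Fingerprints seen more than once - the weekend-burning pattern."""
        counts = Counter(t.state_fingerprint for t in self.transitions)
        return [(fp, n) for fp, n in counts.items() if n > 1]
```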
Wrong Way: Framework-Siloed Metric Collection
Here is the naive instrumentation approach most teams start with.
```python
# Wrong way: per-framework, siloed metric collection.
# Each agent type is instrumented independently.
# There is no unified fleet view.
import os

from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI

# LangSmith auto-instrumentation via env var
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "invoice-agent"

# Second agent - different project, no shared context
os.environ["LANGCHAIN_PROJECT"] = "approval-agent"

# Problems with this approach:
# 1. Two separate LangSmith projects - no cross-agent trace correlation
# 2. No fleet-level token cost aggregation
# 3. No behavioral metrics anywhere
# 4. Infrastructure metrics live in a separate Datadog dashboard
# 5. To answer "what is the total cost of processing one invoice
#    end-to-end across all agents?" requires manual joins across
#    three different systems.
```

The siloed approach has a specific failure signature: when a multi-agent pipeline fails, you know something went wrong, but you cannot trace where in the agent handoff chain the failure originated without manually correlating trace IDs across systems. This is a multi-hour exercise in production incidents.
A second common mistake - using a single shared LangSmith project for all agents:
```python
# Also wrong: single project, no agent identity
os.environ["LANGCHAIN_PROJECT"] = "production-agents"

# Now all agents share one project.
# You cannot query "show me cost breakdown by agent type"
# or "which agent is responsible for the P99 latency spike?"
# Everything is one undifferentiated blob.
```

Right Way: Unified Fleet Telemetry with OTel and Agent Identity
The correct architecture separates three concerns:
- Agent identity - every span knows which agent type and version produced it
- Session correlation - all spans for one end-to-end pipeline execution share a trace ID
- Tier-aware metric emission - harness metrics and behavioral metrics are emitted as typed signals, not free-text logs
LangGraph 1.1.2 (March 2026) with LangSmith Fleet and OpenTelemetry GenAI Semantic Conventions v1.37 gives you the foundation. Here is the instrumentation pattern:
```python
# Right way: unified fleet telemetry with OTel + agent identity
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Any, Callable, TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


# --- OTel setup ---
def setup_telemetry(otlp_endpoint: str) -> tuple[trace.Tracer, metrics.Meter]:
    tracer_provider = TracerProvider()
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(tracer_provider)

    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=otlp_endpoint),
        export_interval_millis=30_000,
    )
    meter_provider = MeterProvider(metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)

    tracer = trace.get_tracer("ai.control_plane")
    meter = metrics.get_meter("ai.control_plane")
    return tracer, meter


# --- Fleet-level metric instruments ---
@dataclass
class FleetMetrics:
    """
    Typed metric instruments for all three tiers.
    One instance shared across the fleet.
    """

    # Tier 2 - Harness Metrics
    token_cost_counter: metrics.Counter          # cost per agent, per model
    node_latency_histogram: metrics.Histogram    # latency per node, per agent type
    tool_call_counter: metrics.Counter           # tool calls by name, agent, result
    gate_trigger_counter: metrics.Counter        # gate layer trips by policy
    retry_counter: metrics.Counter               # retries by agent, error type
    validation_failure_counter: metrics.Counter  # validation failures by schema
    # Tier 3 - Behavioral Metrics
    goal_completion_counter: metrics.Counter     # success/failure per goal type
    tool_accuracy_histogram: metrics.Histogram   # accuracy of tool param generation
    quality_score_histogram: metrics.Histogram   # LLM-as-judge scores by agent
    schema_compliance_counter: metrics.Counter   # output schema pass/fail
    reasoning_loop_counter: metrics.Counter      # repeated reasoning state entries

    @classmethod
    def create(cls, meter: metrics.Meter) -> "FleetMetrics":
        return cls(
            # Tier 2
            token_cost_counter=meter.create_counter(
                "gen_ai.agent.token_cost",
                description="Token cost in USD per agent node execution",
                unit="USD",
            ),
            node_latency_histogram=meter.create_histogram(
                "gen_ai.agent.node_latency",
                description="Latency per agent node in milliseconds",
                unit="ms",
            ),
            tool_call_counter=meter.create_counter(
                "gen_ai.agent.tool_calls",
                description="Tool call count by name, agent, and result",
            ),
            gate_trigger_counter=meter.create_counter(
                "gen_ai.agent.gate_triggers",
                description="Gated execution policy trips",
            ),
            retry_counter=meter.create_counter(
                "gen_ai.agent.retries",
                description="Retry count by agent and error type",
            ),
            validation_failure_counter=meter.create_counter(
                "gen_ai.agent.validation_failures",
                description="Output validation failures by schema",
            ),
            # Tier 3
            goal_completion_counter=meter.create_counter(
                "gen_ai.agent.goal_completions",
                description="Goal completion outcomes by agent and goal type",
            ),
            tool_accuracy_histogram=meter.create_histogram(
                "gen_ai.agent.tool_accuracy",
                description="Tool parameter accuracy score (0.0-1.0)",
            ),
            quality_score_histogram=meter.create_histogram(
                "gen_ai.agent.quality_score",
                description="LLM-as-judge quality score per agent run",
            ),
            schema_compliance_counter=meter.create_counter(
                "gen_ai.agent.schema_compliance",
                description="Output schema compliance pass/fail",
            ),
            reasoning_loop_counter=meter.create_counter(
                "gen_ai.agent.reasoning_loops",
                description="Repeated reasoning state entries per session",
            ),
        )


# --- Agent identity carrier ---
@dataclass
class AgentIdentity:
    """
    Carries agent type, version, and fleet position into every span.
    Maps to OTel GenAI SemConv v1.37 gen_ai.agent.* attributes.
    """

    agent_type: str     # e.g. "invoice_extractor", "approval_router"
    agent_version: str  # e.g. "1.3.0"
    fleet_id: str       # e.g. "invoice_processing_fleet"
    model: str          # e.g. "gpt-4o-2024-11-20"


# --- Instrumented agent state ---
class InvoiceState(TypedDict):
    invoice_text: str
    extracted_fields: dict[str, Any]
    validation_errors: list[str]
    approval_status: str
    session_id: str  # shared across entire pipeline execution
    agent_type: str  # current agent type for metric attribution


# --- Instrumented node wrapper ---
class InstrumentedNode:
    """
    Wraps any LangGraph node to emit Tier 2 and Tier 3 metrics
    with full agent identity and session correlation.
    """

    def __init__(
        self,
        identity: AgentIdentity,
        fleet_metrics: FleetMetrics,
        tracer: trace.Tracer,
    ) -> None:
        self.identity = identity
        self.metrics = fleet_metrics
        self.tracer = tracer

    def _base_attributes(self, session_id: str) -> dict[str, str]:
        """OTel GenAI SemConv v1.37 compliant attribute set."""
        return {
            "gen_ai.agent.name": self.identity.agent_type,
            "gen_ai.agent.version": self.identity.agent_version,
            "gen_ai.system": self.identity.fleet_id,
            "gen_ai.request.model": self.identity.model,
            "session.id": session_id,
        }

    def run(
        self,
        state: InvoiceState,
        node_fn: Callable[[InvoiceState], InvoiceState],
    ) -> InvoiceState:
        session_id = state["session_id"]
        attrs = self._base_attributes(session_id)
        with self.tracer.start_as_current_span(
            f"gen_ai.agent.{self.identity.agent_type}",
            attributes=attrs,
        ) as span:
            start_ms = time.monotonic() * 1000
            try:
                result = node_fn(state)
                span.set_attribute("gen_ai.agent.outcome", "success")
                return result
            except Exception as exc:
                span.set_attribute("gen_ai.agent.outcome", "error")
                span.set_attribute("error.type", type(exc).__name__)
                raise
            finally:
                elapsed_ms = time.monotonic() * 1000 - start_ms
                self.metrics.node_latency_histogram.record(elapsed_ms, attrs)

    def record_token_cost(self, cost_usd: float, session_id: str) -> None:
        self.metrics.token_cost_counter.add(
            cost_usd, self._base_attributes(session_id)
        )

    def record_tool_call(
        self, tool_name: str, success: bool, session_id: str
    ) -> None:
        attrs = {
            **self._base_attributes(session_id),
            "tool.name": tool_name,
            "tool.result": "success" if success else "failure",
        }
        self.metrics.tool_call_counter.add(1, attrs)

    def record_goal_completion(
        self, goal_type: str, completed: bool, session_id: str
    ) -> None:
        attrs = {
            **self._base_attributes(session_id),
            "goal.type": goal_type,
            "goal.outcome": "completed" if completed else "failed",
        }
        self.metrics.goal_completion_counter.add(1, attrs)

    def record_reasoning_loop(self, session_id: str) -> None:
        self.metrics.reasoning_loop_counter.add(
            1, self._base_attributes(session_id)
        )

    def record_quality_score(self, score: float, session_id: str) -> None:
        self.metrics.quality_score_histogram.record(
            score, self._base_attributes(session_id)
        )


# --- Example: fleet-instrumented extraction agent ---
def build_invoice_extraction_agent(
    identity: AgentIdentity,
    fleet_metrics: FleetMetrics,
    tracer: trace.Tracer,
):
    """
    Builds a compiled extraction agent with full Tier 2 + Tier 3
    instrumentation. The session_id in state ties this agent's spans
    to the full pipeline trace.
    """
    instrumented = InstrumentedNode(identity, fleet_metrics, tracer)
    llm = ChatOpenAI(model=identity.model, temperature=0)

    def extract_node(state: InvoiceState) -> InvoiceState:
        """Node function: instrumented.run wraps _run for span + latency."""

        def _run(state: InvoiceState) -> InvoiceState:
            response = llm.invoke(
                f"Extract invoice fields as JSON from:\n{state['invoice_text']}"
            )
            # Tier 2: record token cost (crude length-based estimate here;
            # read the response's token usage metadata in production)
            estimated_cost = len(response.content) * 0.000015
            instrumented.record_token_cost(estimated_cost, state["session_id"])
            # Tier 3: record goal completion
            # (hardcoded fields for illustration; parse response.content in production)
            extracted = {"vendor": "Acme Corp", "amount": 4200.00}
            instrumented.record_goal_completion(
                goal_type="field_extraction",
                completed=bool(extracted),
                session_id=state["session_id"],
            )
            return {**state, "extracted_fields": extracted}

        return instrumented.run(state, _run)

    graph = StateGraph(InvoiceState)
    graph.add_node("extract", extract_node)
    graph.set_entry_point("extract")
    graph.add_edge("extract", END)
    return graph.compile()
```

The key differences from the siloed approach:
- Every span carries `gen_ai.agent.name`, `gen_ai.agent.version`, `gen_ai.system`, and `session.id` - queryable attributes that let you aggregate across the full fleet in any OTel-compatible backend
- Harness metrics (Tier 2) and behavioral metrics (Tier 3) are separate, typed metric instruments - not log lines you have to parse
- The `session_id` propagates from the first agent in the pipeline to the last, so one query can reconstruct the full end-to-end execution across multiple agents
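To see the pieces working together, here is a usage sketch built directly on the classes above. The collector endpoint, identity values, and invoice text are illustrative assumptions, and the pipeline is trimmed to the single extraction agent:

```python
# Sketch: one session_id minted at the pipeline boundary and carried
# in state, so every agent's spans and metrics join on session.id.
import uuid

tracer, meter = setup_telemetry("http://otel-collector:4317")  # hypothetical endpoint
fleet_metrics = FleetMetrics.create(meter)

extractor = build_invoice_extraction_agent(
    AgentIdentity(
        agent_type="invoice_extractor",
        agent_version="1.3.0",
        fleet_id="invoice_processing_fleet",
        model="gpt-4o-2024-11-20",
    ),
    fleet_metrics,
    tracer,
)

result = extractor.invoke({
    "invoice_text": "Invoice #4821 from Acme Corp, total $4,200.00",
    "extracted_fields": {},
    "validation_errors": [],
    "approval_status": "pending",
    "session_id": str(uuid.uuid4()),  # created ONCE per end-to-end execution
    "agent_type": "invoice_extractor",
})
```

A second agent - the approval router, say - would receive the same `session_id` in its input state rather than minting its own. That single discipline is what makes end-to-end reconstruction a one-query operation.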
Detecting Reasoning Loops Before They Cost You
The `gen_ai.agent.reasoning_loops` metric deserves special attention because it closes the gap the invoice-processing story exposed. Here is the minimal implementation.
```python
# Reasoning loop detection: track state fingerprints per session.
# When the same reasoning state appears more than N times, emit the
# loop metric and optionally interrupt the agent.
import hashlib
from collections import defaultdict


class ReasoningLoopDetector:
    """
    Tracks state fingerprints within a session.
    Emits gen_ai.agent.reasoning_loops when a fingerprint repeats.
    """

    def __init__(
        self,
        instrumented_node: InstrumentedNode,
        loop_threshold: int = 3,
    ) -> None:
        self.node = instrumented_node
        self.threshold = loop_threshold
        self._fingerprints: dict[str, int] = defaultdict(int)

    def _fingerprint(self, state: InvoiceState) -> str:
        """Stable hash of reasoning-relevant state fields."""
        key = f"{state.get('extracted_fields')}|{state.get('validation_errors')}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def check(self, state: InvoiceState) -> bool:
        """
        Returns True if the current state has exceeded the loop threshold.
        Call this at the start of each node execution.
        """
        fp = self._fingerprint(state)
        self._fingerprints[fp] += 1
        count = self._fingerprints[fp]
        if count > 1:
            # Emit the Tier 3 behavioral metric on every repeat
            self.node.record_reasoning_loop(state["session_id"])
        if count >= self.threshold:
            # This is the alert condition.
            # In production: set span status to error, surface to on-call.
            return True  # caller should interrupt the agent
        return False
```

Wire this into the node at the start of execution. When `check()` returns True, the node should return an error state rather than calling the LLM again. The metric counter fires at every repeat - so alerting on "reasoning_loops > 3 in one session" catches the failure within a handful of iterations, not after a weekend of runaway API calls.
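A sketch of that wiring, assuming the `instrumented` node wrapper and `extract_node` function from the earlier example are in scope; the error-state shape is illustrative:

```python
# Sketch: loop check at the top of a node. When the threshold trips,
# the node short-circuits into an error state instead of calling the
# LLM again - a runaway weekend becomes a bounded, alerted failure.
detector = ReasoningLoopDetector(instrumented, loop_threshold=3)


def guarded_extract_node(state: InvoiceState) -> InvoiceState:
    if detector.check(state):
        # Interrupt: surface the loop, do NOT invoke the LLM again.
        return {
            **state,
            "validation_errors": state["validation_errors"]
            + ["reasoning_loop_threshold_exceeded"],
            "approval_status": "halted",
        }
    return extract_node(state)  # the instrumented node from above
```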
The Control Plane Telemetry Routing Layer
Single-agent instrumentation closes the Telemetry Surface Gap for one agent. A control plane operates at fleet scope. That requires a telemetry routing layer: a component that takes OTel spans from all agents, regardless of framework, and normalizes them into one queryable backend. This is what makes the control plane framework-agnostic.
```yaml
# otel-collector-config.yaml (abbreviated)
# Fleet telemetry routing: one OTel Collector config that receives
# from LangGraph agents, OpenAI Agents SDK agents, and any custom
# orchestrators, and routes everything to one backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  attributes/enrich:
    actions:
      - key: fleet.environment
        value: production
        action: insert
      - key: fleet.version
        from_attribute: gen_ai.agent.version
        action: insert
  filter/pii:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: gen_ai.input.messages
            value: ".*\\b\\d{16}\\b.*"  # strip card numbers from prompt traces

exporters:
  otlphttp/langsmith:
    endpoint: https://api.smith.langchain.com/otel
    headers:
      x-api-key: ${LANGSMITH_API_KEY}
  prometheus:
    endpoint: 0.0.0.0:8889  # Tier 2 + Tier 3 metrics to Grafana

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/enrich, filter/pii]
      exporters: [otlphttp/langsmith]
    metrics:
      receivers: [otlp]
      processors: [attributes/enrich]
      exporters: [prometheus]
```

The result: LangSmith gets trace data for debugging. Prometheus + Grafana gets metric data for dashboards and alerts. Both views are correlated by `session_id` and agent identity attributes. PII is stripped before leaving the network boundary.

This is the architecture that makes a control plane possible. The OTel Collector is the normalization layer. It doesn't care whether the span came from a LangGraph agent, an OpenAI Agents SDK agent, or a custom Python orchestrator. If the span follows OTel GenAI SemConv v1.37 conventions, the collector processes it identically.
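Framework-agnostic is worth making concrete: a custom orchestrator with no LangChain dependency lands in the same backend by emitting spans with the same attribute set. A minimal sketch, reusing the SemConv attribute names from the instrumentation above; all values are illustrative:

```python
# Sketch: a custom (non-LangGraph) agent emitting the same attributes.
# The collector pipeline enriches, filters, and routes this span
# exactly as it does a framework-produced one.
import uuid

from opentelemetry import trace

tracer = trace.get_tracer("custom.orchestrator")

# In reality, read this from the incoming pipeline state rather than
# minting it here - the ID must be shared across the whole session.
session_id = str(uuid.uuid4())

with tracer.start_as_current_span(
    "gen_ai.agent.ledger_sync",
    attributes={
        "gen_ai.agent.name": "ledger_sync",
        "gen_ai.agent.version": "1.0.5",
        "gen_ai.system": "invoice_processing_fleet",
        "gen_ai.request.model": "gpt-4o-mini",
        "session.id": session_id,
    },
) as span:
    # ... custom agent logic: plan, call tools, write results ...
    span.set_attribute("gen_ai.agent.outcome", "success")
```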
The Harness Metric Surface - What to Instrument Per Agent
Building on the execution primitives from the Harness Engineering series, every agent in a fleet should emit this minimum metric surface:
Tier 2 - Harness Metrics (per agent node)
| Metric | Description |
|---|---|
| `gen_ai.agent.token_cost` | Cost in USD per invocation |
| `gen_ai.agent.node_latency` | p50/p99 latency per node type |
| `gen_ai.agent.tool_calls` | Count by tool name + success/failure |
| `gen_ai.agent.gate_triggers` | Gated Execution policy trip count |
| `gen_ai.agent.retries` | Retry count by error class |
| `gen_ai.agent.validation_failures` | Schema/content validation failures |
| `gen_ai.agent.checkpoint_writes` | State Management checkpoint ops |
Tier 3 - Behavioral Metrics (per session)
| Metric | Description |
|---|---|
| `gen_ai.agent.goal_completions` | Success/failure by goal type |
| `gen_ai.agent.tool_accuracy` | Param accuracy score (0.0-1.0) |
| `gen_ai.agent.quality_score` | LLM-as-judge score per run |
| `gen_ai.agent.schema_compliance` | Output schema pass/fail rate |
| `gen_ai.agent.reasoning_loops` | Repeated reasoning state entries (loop detection - the invoice agent bug) |
The `gen_ai.agent.reasoning_loops` metric is the one that would have caught the invoice-processing failure within minutes instead of two days. When the counter for "agent re-entered the same reasoning state" exceeds a threshold - say, 3 occurrences in one session - it fires an alert. The session is flagged for human review before it burns through a weekend of credits.
Diagram: Three-Tier Metric Architecture
```mermaid
flowchart TD
    subgraph Fleet["Agent Fleet"]
        A1["Invoice Extractor\nAgent v1.3.0"]
        A2["Approval Router\nAgent v2.1.0"]
        A3["Ledger Sync\nAgent v1.0.5"]
    end
    subgraph OTelSDK["OTel SDK\nper agent"]
        S1["Tier 1\nInfra Spans"]
        S2["Tier 2\nHarness Metrics"]
        S3["Tier 3\nBehavioral Metrics"]
    end
    subgraph Collector["OTel Collector"]
        E["Enrich\nagent identity\n+ session_id"]
        P["Filter PII\nfrom prompts"]
        R["Route to\nbackends"]
    end
    subgraph Backends["Observability Backends"]
        LS["LangSmith Fleet\nTrace debugging\nSession replay"]
        PR["Prometheus\n+ Grafana\nMetric dashboards\nAlerts"]
    end
    A1 & A2 & A3 --> S1 & S2 & S3
    S1 & S2 & S3 --> E --> P --> R
    R -->|"traces"| LS
    R -->|"metrics"| PR
    style A1 fill:#4A90E2,color:#fff
    style A2 fill:#4A90E2,color:#fff
    style A3 fill:#4A90E2,color:#fff
    style S1 fill:#95A5A6,color:#fff
    style S2 fill:#7B68EE,color:#fff
    style S3 fill:#6BCF7F,color:#fff
    style E fill:#FFD93D,color:#333
    style P fill:#E74C3C,color:#fff
    style R fill:#98D8C8,color:#333
    style LS fill:#9B59B6,color:#fff
    style PR fill:#9B59B6,color:#fff
```
The collector is the normalization point that makes the control plane framework-agnostic. Without it, you have N observability setups for N agent frameworks. With it, you have one.
LangSmith Fleet and the Agent Identity Model
LangSmith Fleet (launched March 2026, formerly Agent Builder) introduced a formal agent identity model: agents have an identity, a version, sharing permissions, and fleet-scoped access controls. This is the platform-level answer to the identity problem described above.
The ABAC (Attribute-Based Access Control) layer in LangSmith now lets you apply tag-based allow/deny policies on top of RBAC roles - so you can control which teams can query which agent's traces. Audit Logs provide a tamper-resistant record of every administrative action across the fleet, queryable via API.
For teams already on the LangChain stack, Fleet is the right starting point for fleet-level identity management. For teams with mixed frameworks - some on LangGraph, some on OpenAI Agents SDK, some custom - the OTel Collector approach above is the framework-neutral path.
Tiered Storage for Session Replay
A common objection to full session-level observability: "If we store every state transition for every agent session, the storage bill will exceed the inference bill."
This is true if you treat every event as permanently hot data. The right model is tiered:
- Hot tier (0-7 days): Full session traces in LangSmith or your OTel backend. Queryable in seconds. Used for active incident response.
- Warm tier (7-90 days): Compressed session replay data in object storage (S3, GCS). Takes minutes to rehydrate. Used for debugging recurring issues and regression analysis.
- Cold tier (90+ days): Archived sessions in low-cost object storage. Used for compliance, audit, and long-horizon quality analysis.
The goal is replay-on-demand without paying hot-tier prices for cold data. Most platforms (LangSmith Enterprise, Langfuse self-hosted) support this pattern natively. For custom stacks, a Kafka-backed event stream with tiered consumers is the production-grade architecture.
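For a custom stack on S3, the warm and cold transitions can be expressed as a single lifecycle rule. A sketch using boto3 - the bucket name is hypothetical, and the storage-class mapping is one reasonable choice, not a prescription:

```python
# Sketch: hot data lives in the observability backend (0-7 days);
# exported session archives then age through S3 storage classes.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="agent-session-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "session-replay-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "sessions/"},
                "Transitions": [
                    # Warm tier after the 7-day hot window (Glacier
                    # Instant Retrieval keeps replay-on-demand viable)
                    {"Days": 7, "StorageClass": "GLACIER_IR"},
                    # Cold tier for compliance and long-horizon analysis
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```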
Instrumentation Checklist for Agent Fleet Observability
Before your first agent goes to production:
- OTel SDK installed and configured with OTLP exporter
- Every agent emits `gen_ai.agent.name`, `gen_ai.agent.version`, and `gen_ai.system` on every span
- `session_id` propagates from pipeline entry to final agent - one ID per end-to-end execution
- Tier 2 harness metrics (token cost, node latency, tool calls, retries, gate triggers) defined as typed OTel metric instruments, not log lines
- Tier 3 behavioral metrics (goal completion, tool accuracy, quality score, schema compliance) defined and emitted per session
- `gen_ai.agent.reasoning_loops` counter implemented with alert at threshold >3 per session
- OTel Collector deployed with PII filter processor before traces leave the network boundary
- LangSmith (or equivalent) receives trace data for session debugging
- Prometheus + Grafana (or equivalent) receives metric data for dashboards and fleet-level alerts
- Tiered storage policy defined: hot/warm/cold with retention periods
Before adding the second agent type to your fleet:
- Both agents share the same `fleet_id` attribute value
- Fleet-level dashboards query by `gen_ai.system = {fleet_id}` - not by individual project
- Cost attribution is queryable: "total cost per pipeline execution" across all contributing agents
- Alert thresholds defined at fleet scope, not per-agent
What Comes Next
The opening thesis was this: teams think they have observability because they have traces. They don't. They have logging without the behavioral and session-level signal needed to govern a fleet.
What you now have, after applying the three-tier metric model and the telemetry routing architecture, is the foundation to close the Telemetry Surface Gap - the ability to query fleet-wide harness metrics, detect behavioral failures before users do, and replay sessions to reconstruct why an agent made the decisions it made. That is observability. Traces are one input to it.
The rest of the control plane depends on this foundation. Without it, you cannot enforce global policies you can verify, you cannot detect which agent version introduced a regression, and you cannot attribute cost to the right pipeline. Every subsequent part of this series assumes you have all three tiers instrumented and the routing layer running.
Part 2 covers Global Policy Enforcement vs. Per-Agent Gate Rules - how to separate fleet-wide policy (no PII in external tool calls, hard budget caps, model version pinning) from the per-agent gate logic introduced in the Gated Execution layer. These are distinct concerns that most teams collapse into one, and the collision causes both systems to fail.
References
- LangChain. (March 2026). March 2026 Newsletter: LangSmith Fleet, LangGraph v1.1. https://blog.langchain.com/march-2026-langchain-newsletter/
- LangChain. (October 2025). View Agent Deployment Metrics in LangGraph Platform. https://changelog.langchain.com/announcements/view-deployment-metrics-in-langgraph-platform
- LangChain. (March 2026). LangGraph Changelog - v1.1. https://docs.langchain.com/oss/python/releases/changelog
- LangChain. LangSmith Cost Tracking. https://docs.langchain.com/langsmith/cost-tracking
- LangChain. LangSmith Observability. https://www.langchain.com/langsmith/observability
- LangChain. LangSmith Agent Observability - Production Monitoring Conceptual Guide. https://www.langchain.com/conceptual-guides/production-monitoring
- OpenTelemetry. (2025). AI Agent Observability - Evolving Standards and Best Practices. https://opentelemetry.io/blog/2025/ai-agent-observability/
- OpenTelemetry. Semantic Conventions for GenAI Agent and Framework Spans. https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- OpenTelemetry. GitHub Issue #2664. (August 2025). Semantic Conventions for Generative AI Agentic Systems. https://github.com/open-telemetry/semantic-conventions/issues/2664
- Datadog. (December 2025). Datadog LLM Observability Natively Supports OpenTelemetry GenAI Semantic Conventions. https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- Langfuse. (April 2026). langfuse 4.0.6 Python SDK. https://pypi.org/project/langfuse/
- Langfuse. Open Source Observability for LangGraph. https://langfuse.com/guides/cookbook/integration_langgraph
- ActiveWizards. LLM Observability: A Guide to Monitoring with LangSmith, Prometheus, and Grafana. https://activewizards.com/blog/llm-observability-a-guide-to-monitoring-with-langsmith
- Fordel Studios. (April 2026). The Future of Multi-Agent Systems in Enterprise Software. https://fordelstudios.com/research/future-multi-agent-systems-enterprise
- The New Stack. (March 2026). Why Agentic AI Stalls in Production - and How a Control Plane Fixes It. https://thenewstack.io/agentic-ai-control-plane-production/
- Novatechflow. (March 2026). Agent Observability for Multi-Agent Systems: How to Trace Agent Workflows in Production. https://www.novatechflow.com/2026/03/agent-observability-for-multi-agent.html
- QCon San Francisco. (November 2025). The Future of Agentic AI: Architecting the Global Control Plane. https://qconsf.com/presentation/nov2025/future-agentic-ai-architecting-global-control-plane
- getmaxim.ai. (December 2025). Top 5 AI Agent Observability Platforms in 2026. https://www.getmaxim.ai/articles/top-5-ai-agent-observability-platforms-in-2026/
- getmaxim.ai. (November 2025). The Role of Observability in Maintaining AI Agent Performance. https://www.getmaxim.ai/articles/the-role-of-observability-in-maintaining-ai-agent-performance/
- Ranjan Kumar. (April 2026). Harness Engineering: The Missing Layer Between LLMs and Production Systems. https://ranjankumar.in/harness-engineering-the-missing-layer-between-llms-and-production-systems
- Ranjan Kumar. (April 2026). Gated Execution: Why Your Agent Should Never Act Without Permission. https://ranjankumar.in/harness-engineering-gated-execution-llm-agents-policy-safety
- Ranjan Kumar. (April 2026). Validation Layer Design: Building the Reflex That Catches What the Model Gets Wrong. https://ranjankumar.in/harness-engineering-validation-layer-design-llm-output-repair
- Ranjan Kumar. (April 2026). Retry, Fallback, and Circuit Breaking: Building LLM Infrastructure That Survives Outages. https://ranjankumar.in/harness-engineering-retry-fallback-circuit-breaking-llm-resilience
- Ranjan Kumar. (April 2026). State Management for Agentic Systems: How to Build Agents That Don't Start Over. https://ranjankumar.in/harness-engineering-state-management-agentic-systems-checkpoint-memory
Related Articles
- Global Policy Enforcement vs. Per-Agent Gate Rules: Two Layers That Must Not Collapse Into One
- Building a Production MCP Server: Architecture, Pitfalls, and Best Practices
- Building Production-Ready AI Agent Services: FastAPI + LangGraph Template Deep Dive