AI Control Plane · Part 5

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Cost Governance and Budget Allocation Across Agent Types: Token Spend Is Infrastructure Spend

Most teams discover their agent fleet's true cost on the invoice. By then, three budget cycles of misconfigured pipelines have already run.

#cost-governance #token-budget #finops #agent-fleet #production-ai #control-plane

ICONIQ Capital's 2026 State of AI report found that inference costs run at 23% of revenue for scaling AI-native companies. IDC's FutureScape 2026 report warned that even organizations with dedicated FinOps teams will underestimate AI infrastructure costs by up to 30%. For teams without dedicated governance infrastructure, the underestimation gap is almost certainly higher.

The specific failure mode is this: a team deploys a fleet of agents, each with a per-request token limit that seems reasonable in isolation. The fleet runs for 30 days. The invoice arrives with a number nobody budgeted. Investigation reveals that one agent - an enrichment agent with a large context window - was running on every request when it should have been running on 20% of requests. The routing logic had a bug. The per-request limit didn't catch it because each individual request was within budget. The aggregate wasn't.

This is what Mavvrik called "AI bill shock" when they launched agent-level cost tracking in March 2026: cost volatility that arrives as a surprise on invoices rather than as a predictable line item. The solution is not tighter per-request limits. It is the Three-Layer Budget Model: budget enforcement at per-request, per-agent-type, and per-pipeline scope simultaneously.

Wrong Way: One Limit, Wrong Level

The naive approach sets a per-request token limit and calls it governance.

code
# Wrong way: single-layer per-request limit
# No per-agent-type daily tracking. No per-pipeline cap.
# Catches individual runaway calls. Misses routing bugs, loop bugs,
# and compounding costs across a multi-agent pipeline.

MAX_TOKENS_PER_REQUEST = 2000

def call_llm_naive(prompt: str, model: str) -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=MAX_TOKENS_PER_REQUEST,  # only protection
    )
    return response.choices[0].message.content

# What this misses:
# 1. Enrichment agent misconfigured to run on 100% of requests instead of 20%:
#    each individual call is under limit, daily spend is 5x budget, nobody notices
# 2. Pipeline loop bug: agent calls itself 50 times, each within the per-request limit,
#    total pipeline cost 50x expected, no alert fires
# 3. Cost attribution: no idea which team, product, or pipeline is driving spend
# 4. No forecast: spend only visible on the monthly invoice

The Three-Layer Budget Model

Token budgets need to be enforced at different granularities because different failure modes manifest at different levels.

Per-request limits catch individual runaway calls. A single LLM call that receives an unexpectedly large input and generates a 10,000-token response should be capped before it completes. This is the circuit breaker pattern applied to cost, directly analogous to the retry and circuit breaking layer from Harness Engineering Part 6.
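The per-request cap needs a cost estimate computed before the call completes, from the input token count and the output ceiling. A minimal sketch; the per-1K-token rates below are illustrative assumptions, not current list prices, and in production they would come from a config store:

```python
# Pre-call cost estimation - a sketch. Rates here are ASSUMED example values,
# not real provider pricing; load actual rates from configuration.
PRICE_PER_1K_TOKENS_USD = {
    # model: (input rate, output rate) - illustrative only
    "gpt-4o": (0.0025, 0.01),
    "gpt-4o-mini": (0.00015, 0.0006),
}

def estimate_cost_usd(model: str, input_tokens: int, max_output_tokens: int) -> float:
    """Worst-case cost estimate for a single call, computed BEFORE the call runs."""
    input_rate, output_rate = PRICE_PER_1K_TOKENS_USD[model]
    return (input_tokens / 1000) * input_rate + (max_output_tokens / 1000) * output_rate
```

Using the worst case (full `max_output_tokens`) makes the estimate conservative: the enforcer may occasionally halt a call that would have come in under budget, but it never lets a call through that could blow the cap.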

Per-agent-type limits catch routing and loop failures. If an enrichment agent that normally handles 20% of requests suddenly handles 80% due to a routing bug, the per-request limit won't fire - each request is within budget. But the per-agent-type daily limit fires at 4x expected volume.

Per-pipeline limits catch compounding costs across a full execution. A pipeline that calls 5 agents, each within their individual budget, can still exceed the total cost envelope for the business transaction it's executing. A $2 invoice extraction pipeline that costs $0.40 to run is economically sound. A $12 run that does the same extraction is a cost failure even if every individual agent stayed within limits.

code
# budget_enforcer.py
# Three-layer budget enforcement with real-time tracking.
# Integrates with the OTel fleet telemetry from Part 1.

from __future__ import annotations

import datetime
from dataclasses import dataclass
from enum import Enum
from threading import Lock

from opentelemetry import metrics


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"                   # usage above warning threshold, proceed
    HALT = "halt"                   # hard limit hit, stop execution
    ESCALATE_TO_HUMAN = "escalate"  # soft limit hit, require approval


@dataclass
class BudgetResult:
    action: BudgetAction
    reason: str
    current_spend_usd: float
    limit_usd: float
    utilization_pct: float


@dataclass
class AgentBudgetConfig:
    """Budget configuration per agent type. Set by platform team, not product teams."""
    agent_type: str
    per_request_limit_usd: float       # hard limit per single LLM call
    daily_limit_usd: float             # hard limit per agent type per day
    warn_threshold_pct: float = 0.80   # warn at 80% of any limit


@dataclass
class PipelineBudgetConfig:
    """Budget configuration per pipeline execution."""
    pipeline_type: str
    per_execution_limit_usd: float     # max cost for one end-to-end pipeline run
    warn_threshold_pct: float = 0.80


class BudgetEnforcer:
    """
    Enforces three-layer budgets: per-request, per-agent-type, per-pipeline.
    Thread-safe. Emits OTel metrics on every decision for the fleet dashboard.
    """

    def __init__(
        self,
        agent_configs: dict[str, AgentBudgetConfig],
        pipeline_configs: dict[str, PipelineBudgetConfig],
        meter: metrics.Meter,
    ) -> None:
        self._agent_configs = agent_configs
        self._pipeline_configs = pipeline_configs
        self._lock = Lock()
        # Daily spend tracking per agent type - reset at midnight UTC
        self._agent_daily_spend: dict[str, float] = {}
        self._daily_reset_timestamp: float = self._today_midnight_utc()
        # Per-pipeline-execution spend tracking: session_id -> cumulative spend
        self._pipeline_spend: dict[str, float] = {}
        # OTel metric instruments
        self._budget_decision_counter = meter.create_counter(
            "gen_ai.fleet.budget_decisions",
            description="Budget enforcement decisions by agent, action, and level",
        )
        self._spend_histogram = meter.create_histogram(
            "gen_ai.fleet.spend_usd",
            description="Token spend in USD per agent invocation",
            unit="USD",
        )

    def check_and_record(
        self,
        agent_type: str,
        session_id: str,
        pipeline_type: str,
        estimated_cost_usd: float,
    ) -> BudgetResult:
        """
        Check all three budget layers before allowing an agent to execute.
        Call BEFORE the LLM call, not after.
        """
        with self._lock:
            self._maybe_reset_daily()
            config = self._agent_configs.get(agent_type)
            pipeline_config = self._pipeline_configs.get(pipeline_type)

            # Layer 1: Per-request limit
            if config and estimated_cost_usd > config.per_request_limit_usd:
                return self._decision(
                    BudgetAction.HALT,
                    f"Request cost ${estimated_cost_usd:.4f} exceeds per-request limit ${config.per_request_limit_usd:.4f}",
                    estimated_cost_usd,
                    config.per_request_limit_usd,
                    agent_type, session_id, "per_request",
                )

            # Layer 2: Per-agent-type daily limit
            warn_result: BudgetResult | None = None
            if config:
                daily_spend = self._agent_daily_spend.get(agent_type, 0.0)
                projected_daily = daily_spend + estimated_cost_usd
                utilization = projected_daily / config.daily_limit_usd
                if projected_daily > config.daily_limit_usd:
                    return self._decision(
                        BudgetAction.HALT,
                        f"Daily limit for {agent_type} would be exceeded: ${projected_daily:.2f} > ${config.daily_limit_usd:.2f}",
                        projected_daily,
                        config.daily_limit_usd,
                        agent_type, session_id, "daily",
                    )
                if utilization > config.warn_threshold_pct:
                    # Above the warning threshold: proceed, but surface the warning
                    # after Layer 3 has also been checked (matches the flow diagram).
                    warn_result = self._decision(
                        BudgetAction.WARN,
                        f"Daily budget {utilization*100:.0f}% utilized for {agent_type}",
                        projected_daily,
                        config.daily_limit_usd,
                        agent_type, session_id, "daily",
                    )

            # Layer 3: Per-pipeline execution limit
            if pipeline_config:
                pipeline_spend = self._pipeline_spend.get(session_id, 0.0)
                projected_pipeline = pipeline_spend + estimated_cost_usd
                if projected_pipeline > pipeline_config.per_execution_limit_usd:
                    return self._decision(
                        BudgetAction.HALT,
                        f"Pipeline execution budget would be exceeded: ${projected_pipeline:.4f} > ${pipeline_config.per_execution_limit_usd:.4f}",
                        projected_pipeline,
                        pipeline_config.per_execution_limit_usd,
                        agent_type, session_id, "pipeline",
                    )

            # All layers cleared - record spend at every level, then allow (or warn)
            self._agent_daily_spend[agent_type] = self._agent_daily_spend.get(agent_type, 0.0) + estimated_cost_usd
            self._pipeline_spend[session_id] = self._pipeline_spend.get(session_id, 0.0) + estimated_cost_usd
            self._spend_histogram.record(estimated_cost_usd, {"gen_ai.agent.name": agent_type})
            if warn_result is not None:
                return warn_result
            return BudgetResult(
                action=BudgetAction.ALLOW,
                reason="Within all budget limits",
                current_spend_usd=estimated_cost_usd,
                limit_usd=config.per_request_limit_usd if config else float("inf"),
                utilization_pct=0.0,
            )

    def release_session(self, session_id: str) -> float:
        """Call when a pipeline session completes. Returns total session spend."""
        with self._lock:
            return self._pipeline_spend.pop(session_id, 0.0)

    def _decision(
        self,
        action: BudgetAction,
        reason: str,
        current: float,
        limit: float,
        agent_type: str,
        session_id: str,
        level: str,
    ) -> BudgetResult:
        utilization = current / limit if limit > 0 else 1.0
        self._budget_decision_counter.add(1, {
            "gen_ai.agent.name": agent_type,
            "budget.action": action.value,
            "budget.level": level,
            "session.id": session_id,
        })
        return BudgetResult(action, reason, current, limit, utilization)

    def _maybe_reset_daily(self) -> None:
        now_midnight = self._today_midnight_utc()
        if now_midnight > self._daily_reset_timestamp:
            self._agent_daily_spend = {}
            self._daily_reset_timestamp = now_midnight

    @staticmethod
    def _today_midnight_utc() -> float:
        today = datetime.datetime.now(datetime.timezone.utc).date()
        midnight = datetime.datetime.combine(today, datetime.time(), tzinfo=datetime.timezone.utc)
        return midnight.timestamp()

Budget Enforcement Architecture

mermaid
flowchart TD
    REQ["LLM Call Request\nfrom agent node"]

    subgraph L1["Layer 1: Per-Request"]
        R1{"Cost estimate\n> per-request limit?"}
    end

    subgraph L2["Layer 2: Per-Agent-Type Daily"]
        R2{"Projected daily spend\n> daily limit?"}
        WARN["Warn at 80% threshold\nproceed but alert"]
    end

    subgraph L3["Layer 3: Per-Pipeline Execution"]
        R3{"Pipeline total\n> execution limit?"}
    end

    ALLOW["LLM Call Executes\nSpend recorded"]
    HALT["Budget Halt\nPipeline halt signal set\nAudit record emitted"]

    REQ --> R1
    R1 -->|"yes"| HALT
    R1 -->|"no"| R2
    R2 -->|"yes"| HALT
    R2 -->|"80-100%"| WARN --> R3
    R2 -->|"< 80%"| R3
    R3 -->|"yes"| HALT
    R3 -->|"no"| ALLOW

    style REQ fill:#95A5A6,color:#fff
    style R1 fill:#7B68EE,color:#fff
    style R2 fill:#7B68EE,color:#fff
    style R3 fill:#7B68EE,color:#fff
    style WARN fill:#FFD93D,color:#333
    style ALLOW fill:#6BCF7F,color:#fff
    style HALT fill:#E74C3C,color:#fff

A budget halt integrates directly with the pipeline halt protocol from Part 3: when the budget enforcer returns BudgetAction.HALT, the calling node sets halt=True and halt_reason=HaltReason.CIRCUIT_BREAKER_TRIPPED in the pipeline state. The halt propagates downstream. The audit node emits the event. The same mechanism that catches semantic failures catches cost overruns.
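The wiring can be sketched in a few lines. `PipelineState`, `HaltReason`, and the field names below are assumptions reconstructed from the Part 3 halt protocol as described here, not a verbatim interface:

```python
# Sketch: translating a budget HALT into the Part 3 pipeline halt protocol.
# PipelineState and HaltReason are ASSUMED shapes based on the article's
# description, not the actual Part 3 code.
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class HaltReason(Enum):
    CIRCUIT_BREAKER_TRIPPED = "circuit_breaker_tripped"


@dataclass
class PipelineState:
    halt: bool = False
    halt_reason: HaltReason | None = None
    halt_detail: str = ""


def apply_budget_decision(state: PipelineState, action: str, reason: str) -> PipelineState:
    """If the enforcer returned a halt, set the pipeline halt signal so it
    propagates downstream and the audit node emits the event."""
    if action == "halt":
        state.halt = True
        state.halt_reason = HaltReason.CIRCUIT_BREAKER_TRIPPED
        state.halt_detail = reason
    return state
```

The point of reusing the halt channel is that downstream nodes and the audit trail need no cost-specific handling: a budget overrun flows through the same shutdown path as any other tripped breaker.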

Budget Allocation Across Agent Types

Not all agents should share the same budget tier. The right allocation model matches the model tier to the task value, not to the engineer's default choice.

code
# budget_configs.py
# Fleet-wide budget configuration. Owned by platform team.
# Enforced by BudgetEnforcer at runtime.

AGENT_BUDGET_CONFIGS = {
    # Frontier model agents: high value, high cost - strict limits
    "invoice_extractor": AgentBudgetConfig(
        agent_type="invoice_extractor",
        per_request_limit_usd=0.05,    # ~3,300 tokens at GPT-4o pricing
        daily_limit_usd=50.00,
        warn_threshold_pct=0.80,
    ),
    # Mid-tier agents: standard tasks, cheaper models
    "vendor_enricher": AgentBudgetConfig(
        agent_type="vendor_enricher",
        per_request_limit_usd=0.01,    # uses a smaller model
        daily_limit_usd=20.00,
        warn_threshold_pct=0.80,
    ),
    # Classification agents: should use smallest viable model
    "invoice_classifier": AgentBudgetConfig(
        agent_type="invoice_classifier",
        per_request_limit_usd=0.002,   # classification: gpt-4o-mini or equivalent
        daily_limit_usd=5.00,
        warn_threshold_pct=0.90,       # tighter warning for cheap tasks
    ),
}

PIPELINE_BUDGET_CONFIGS = {
    "invoice_processing": PipelineBudgetConfig(
        pipeline_type="invoice_processing",
        per_execution_limit_usd=0.10,  # full pipeline should cost <10 cents
        warn_threshold_pct=0.75,
    ),
}

The allocation principle: match the model to the task value. A classification decision that routes an invoice to one of three buckets does not require a frontier model. Using gpt-4o-mini or an equivalent small model for classification and reserving frontier model calls for extraction and reasoning cuts per-pipeline costs by 40-60% in most invoice processing workloads - with no measurable quality loss on the classification step.
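A back-of-envelope check makes the arithmetic concrete. The per-call dollar figures below are illustrative assumptions, not measured costs; the actual savings depend on the ratio of classification to extraction work in the pipeline:

```python
# Sketch: tiering savings arithmetic with ASSUMED per-call costs.
def pipeline_cost(classify_cost: float, extract_cost: float, calls: int) -> float:
    """Total cost for a batch where every run does one classify + one extract."""
    return calls * (classify_cost + extract_cost)

# All-frontier: classification and extraction both on the frontier model
all_frontier = pipeline_cost(classify_cost=0.03, extract_cost=0.04, calls=1000)
# Tiered: classification moved to a small model, extraction stays on frontier
tiered = pipeline_cost(classify_cost=0.002, extract_cost=0.04, calls=1000)

savings_pct = 100 * (all_frontier - tiered) / all_frontier  # ~40% with these assumptions
```

With these (assumed) numbers the savings land around 40%; pipelines where classification dominates the call volume land at the higher end of the 40-60% range.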

The Deloitte Insights report from January 2026 described this shift as "FinOps for AI" - applying the same rigor to token spend that cloud teams apply to compute: forecast demand, enforce ROI thresholds, and treat tokens as a cost category with the same discipline as EC2 hours.

Cost Attribution to Business Units

A fleet-level cost view answers "how much did our agent fleet cost this month?" Attribution answers "which team, product, or business unit drove that cost?" Without attribution, cost governance has no mechanism for accountability.

The OTel attribute model from Part 1's fleet telemetry architecture makes attribution a tagging problem, not an instrumentation problem. Add business-unit tags to every span:

code
# cost_attribution.py
# Business unit tagging for cost attribution.
# Tags flow through OTel spans to the metrics backend.
# Grafana can then answer: "cost by team, by product, by pipeline type."

from dataclasses import dataclass


@dataclass
class CostAttributionContext:
    """Carries business unit metadata into every OTel span."""
    team: str           # e.g. "accounts-payable"
    product: str        # e.g. "invoice-automation"
    cost_center: str    # e.g. "CC-4402"
    environment: str    # "production" | "staging" | "development"


def attribution_attributes(ctx: CostAttributionContext) -> dict[str, str]:
    """Returns OTel span attributes for cost attribution."""
    return {
        "business.team": ctx.team,
        "business.product": ctx.product,
        "business.cost_center": ctx.cost_center,
        "fleet.environment": ctx.environment,
    }

# Usage: merge into every span's attribute set alongside gen_ai.agent.* attributes.
# In Grafana: sum(gen_ai.fleet.spend_usd) by business.team
# Result: monthly spend broken down by team, billable to each cost center.

Vantage's FinOps team documented in 2026 that AI token spend is, in one respect, more attributable than cloud infrastructure spend: providers like Anthropic and OpenAI expose per-API-key spend natively, giving individual-developer granularity before any internal tagging is applied. The team-level and product-level attribution above extends that granularity into the fleet's operational context.
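The aggregation the Grafana query performs can be sketched as a plain fold over tagged spans. The span-dict shape here is a simplification for illustration; real spans come from the OTel SDK:

```python
# Sketch of what sum(gen_ai.fleet.spend_usd) by business.team computes.
# The span dicts are a SIMPLIFIED stand-in for real OTel span data.
from collections import defaultdict


def spend_by_team(spans: list[dict]) -> dict[str, float]:
    """Sum per-invocation spend, grouped by the business.team attribute."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span["business.team"]] += span["spend_usd"]
    return dict(totals)
```

The same fold keyed on `business.cost_center` produces the monthly finance report line items directly.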

The Escalation Gate: Expensive Model as a Fallback

The most effective cost reduction pattern is also the most underused: make the expensive model a fallback, not a default.

code
# model_escalation.py
# Tiered model selection: attempt cheap model first, escalate on failure.
# Reduces frontier model usage to cases where cheaper models genuinely fail.

from langchain_openai import ChatOpenAI

from budget_enforcer import BudgetAction, BudgetEnforcer


class TieredModelSelector:
    """
    Attempts task with a cheaper model first.
    Escalates to a frontier model only if:
    - Cheap model returns low-confidence output
    - Cheap model fails schema validation
    - Budget enforcer allows the escalation cost
    """

    def __init__(
        self,
        cheap_model: str = "gpt-4o-mini",
        frontier_model: str = "gpt-4o-2024-11-20",
        confidence_threshold: float = 0.85,
    ) -> None:
        self.cheap = ChatOpenAI(model=cheap_model, temperature=0)
        self.frontier = ChatOpenAI(model=frontier_model, temperature=0)
        self.confidence_threshold = confidence_threshold

    def invoke(self, prompt: str, budget_enforcer: BudgetEnforcer,
               agent_type: str, session_id: str, pipeline_type: str) -> tuple[str, str]:
        """
        Returns (response_content, model_used).
        Falls back to frontier model only if cheap model output is below threshold.
        """
        # Try cheap model first
        cheap_cost_estimate = 0.001  # estimate before call
        cheap_check = budget_enforcer.check_and_record(
            agent_type, session_id, pipeline_type, cheap_cost_estimate
        )
        if cheap_check.action == BudgetAction.HALT:
            raise RuntimeError(f"Budget halted cheap model call: {cheap_check.reason}")

        cheap_response = self.cheap.invoke(prompt)
        confidence = self._estimate_confidence(cheap_response.content)
        if confidence >= self.confidence_threshold:
            return cheap_response.content, "cheap"

        # Cheap model confidence too low - escalate
        frontier_cost_estimate = 0.01
        frontier_check = budget_enforcer.check_and_record(
            agent_type, session_id, pipeline_type, frontier_cost_estimate
        )
        if frontier_check.action == BudgetAction.HALT:
            # Can't escalate - return cheap model output with low-confidence flag
            return cheap_response.content + "\n[LOW_CONFIDENCE]", "cheap_unescalated"

        frontier_response = self.frontier.invoke(prompt)
        return frontier_response.content, "frontier"

    def _estimate_confidence(self, content: str) -> float:
        """
        Heuristic confidence estimation.
        In production: use logprobs if available, or a dedicated confidence model.
        """
        if "[UNCERTAIN]" in content or "I'm not sure" in content:
            return 0.5
        if len(content.strip()) < 10:
            return 0.4
        return 0.9

SoftwareSeni's FinOps framework for AI (March 2026) documented that "escalation gates before frontier model calls - require cheaper model steps to attempt the task first" is one of the three most effective cost levers, alongside token budget limits per agent call and cost caps per workflow. Teams that implement all three consistently achieve 40-60% cost reduction versus teams with only per-request limits.

Diagram: Fleet Cost Governance Architecture

The full cost governance system spans three control points: the model tier decision at the agent level, the three-layer budget enforcer at the call level, and the attribution pipeline that routes spend data to dashboards and finance.

mermaid
flowchart TD
    subgraph AgentFleet["Agent Fleet"]
        CL["Classifier\nAgent\ngpt-4o-mini"]
        EX["Extractor\nAgent\ngpt-4o"]
        AN["Analysis\nAgent\nClaude Sonnet"]
    end

    subgraph TieredModel["Tiered Model Selector"]
        CHEAP["Cheap Model\nFirst attempt"]
        CONF{"Confidence\n≥ threshold?"}
        FRONT["Frontier Model\nFallback only"]
    end

    subgraph BudgetEnforcer["Three-Layer Budget Enforcer"]
        L1["Layer 1\nPer-Request\n< $0.05"]
        L2["Layer 2\nPer-Agent Daily\n< $50/day"]
        L3["Layer 3\nPer-Pipeline\n< $0.10/run"]
        HALT["Budget Halt\nPipeline stops"]
        WARN["Warn at 80%\nAlert fires"]
    end

    subgraph Attribution["Cost Attribution"]
        OTEL["OTel Span\nteam + product\ncost_center tags"]
        PROM["Prometheus\nMetrics"]
        GRAF["Grafana\nDashboard"]
        FIN["Finance Report\nMonthly by BU"]
    end

    CL & EX & AN --> CHEAP --> CONF
    CONF -->|"yes"| L1
    CONF -->|"no"| FRONT --> L1

    L1 -->|"over limit"| HALT
    L1 -->|"ok"| L2
    L2 -->|"over limit"| HALT
    L2 -->|"80-100%"| WARN --> L3
    L2 -->|"ok"| L3
    L3 -->|"over limit"| HALT
    L3 -->|"ok"| OTEL

    OTEL --> PROM --> GRAF --> FIN

    style CL fill:#98D8C8,color:#333
    style EX fill:#4A90E2,color:#fff
    style AN fill:#9B59B6,color:#fff
    style CHEAP fill:#98D8C8,color:#333
    style CONF fill:#7B68EE,color:#fff
    style FRONT fill:#9B59B6,color:#fff
    style L1 fill:#4A90E2,color:#fff
    style L2 fill:#4A90E2,color:#fff
    style L3 fill:#4A90E2,color:#fff
    style HALT fill:#E74C3C,color:#fff
    style WARN fill:#FFD93D,color:#333
    style OTEL fill:#6BCF7F,color:#fff
    style PROM fill:#6BCF7F,color:#fff
    style GRAF fill:#6BCF7F,color:#fff
    style FIN fill:#6BCF7F,color:#fff

The teal nodes are cheap-model paths. Purple nodes are frontier-model paths. Blue nodes are the enforcement layer. Green nodes are the attribution and reporting path. The goal is to route as much volume as possible through teal while keeping the enforcement layer between every model call and production systems.

Cost Governance Checklist

  • Three-layer budget model implemented: per-request, per-agent-type daily, per-pipeline execution
  • BudgetEnforcer.check_and_record called BEFORE every LLM call, not after
  • Budget configs owned by platform team - product teams cannot raise limits without review
  • Agent type allocation: classification and routing tasks use small models; extraction and reasoning use mid-tier; complex analysis escalates to frontier
  • Tiered model selector implemented for high-volume agent types - cheap model attempts first
  • OTel gen_ai.fleet.spend_usd metric emitted per invocation with agent type, team, product, cost center tags
  • gen_ai.fleet.budget_decisions counter queryable in Grafana - budget halt rate per agent type visible
  • Daily budget utilization dashboard built - alert at 80% daily utilization for any agent type
  • Monthly attribution report: cost broken down by team and cost center, delivered to finance
  • 30-day rolling baseline established per agent type - anomaly detection triggers on 2x baseline deviation
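The last checklist item, the 2x-baseline anomaly trigger, can be implemented as a simple rolling-window comparison. A minimal sketch; window handling, persistence, and alert plumbing are assumptions left to the reader's stack:

```python
# Sketch: 30-day rolling baseline with a 2x deviation trigger.
# History storage and the alerting side are ASSUMED to live elsewhere.
def is_spend_anomaly(daily_spend_history: list[float], today_spend: float,
                     window: int = 30, multiplier: float = 2.0) -> bool:
    """Flag today's spend if it exceeds multiplier x the rolling-window mean."""
    recent = daily_spend_history[-window:]
    if not recent:
        return False  # no baseline yet - cannot judge
    baseline = sum(recent) / len(recent)
    return today_spend > multiplier * baseline
```

Run it once per agent type per day against the `gen_ai.fleet.spend_usd` daily sums; a `True` result feeds the same alert channel as the 80% utilization warning.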

What Comes Next

Part 6 covers Compliance, Audit Trails, and Regulatory Requirements for Agentic Systems - the EU AI Act enforcement deadline is August 2, 2026, and the gap between running agents and running auditable agents is larger than most platform teams expect.
