ICONIQ Capital's 2026 State of AI report found that inference costs run at 23% of revenue for scaling AI-native companies. IDC's FutureScape 2026 report warned that even organizations with dedicated FinOps teams will underestimate AI infrastructure costs by up to 30%. For teams without dedicated governance infrastructure, the underestimation gap is almost certainly higher.
The specific failure mode is this: a team deploys a fleet of agents, each with a per-request token limit that seems reasonable in isolation. The fleet runs for 30 days. The invoice arrives with a number nobody budgeted. Investigation reveals that one agent - an enrichment agent with a large context window - was running on every request when it should have been running on 20% of requests. The routing logic had a bug. The per-request limit didn't catch it because each individual request was within budget. The aggregate wasn't.
This is what Mavvrik called "AI bill shock" when they launched agent-level cost tracking in March 2026: cost volatility that arrives as a surprise on invoices rather than as a predictable line item. The solution is not tighter per-request limits. It is the Three-Layer Budget Model: budget enforcement at per-request, per-agent-type, and per-pipeline scope simultaneously.
Wrong Way: One Limit, Wrong Level
The naive approach sets a per-request token limit and calls it governance.
```python
# Wrong way: single-layer per-request limit
# No per-agent-type daily tracking. No per-pipeline cap.
# Catches individual runaway calls. Misses routing bugs, loop bugs,
# and compounding costs across a multi-agent pipeline.

MAX_TOKENS_PER_REQUEST = 2000

def call_llm_naive(prompt: str, model: str) -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=MAX_TOKENS_PER_REQUEST,  # only protection
    )
    return response.choices[0].message.content

# What this misses:
# 1. Enrichment agent misconfigured to run on 100% of requests instead of 20%:
#    each individual call is under limit, daily spend is 5x budget, nobody notices
# 2. Pipeline loop bug: agent calls itself 50 times, each within the per-request
#    limit, total pipeline cost 50x expected, no alert fires
# 3. Cost attribution: no idea which team, product, or pipeline is driving spend
# 4. No forecast: spend only visible on the monthly invoice
```

The Three-Layer Budget Model
Token budgets need to be enforced at different granularities because different failure modes manifest at different levels.
Per-request limits catch individual runaway calls. A single LLM call that receives an unexpectedly large input and generates a 10,000-token response should be capped before it completes. This is the circuit breaker pattern applied to cost, directly analogous to the retry and circuit breaking layer from Harness Engineering Part 6.
Per-agent-type limits catch routing and loop failures. If an enrichment agent that normally handles 20% of requests suddenly handles 80% due to a routing bug, the per-request limit won't fire - each request is within budget. But the per-agent-type daily limit fires at 4x expected volume.
Per-pipeline limits catch compounding costs across a full execution. A pipeline that calls 5 agents, each within their individual budget, can still exceed the total cost envelope for the business transaction it's executing. A $2 invoice extraction pipeline that costs $0.40 to run is economically sound. A $12 run that does the same extraction is a cost failure even if every individual agent stayed within limits.
```python
# budget_enforcer.py
# Three-layer budget enforcement with real-time tracking.
# Integrates with the OTel fleet telemetry from Part 1.

from __future__ import annotations

import datetime
from dataclasses import dataclass
from enum import Enum
from threading import Lock
from typing import Optional

from opentelemetry import metrics


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"                   # usage above warning threshold, proceed
    HALT = "halt"                   # hard limit hit, stop execution
    ESCALATE_TO_HUMAN = "escalate"  # soft limit hit, require approval


@dataclass
class BudgetResult:
    action: BudgetAction
    reason: str
    current_spend_usd: float
    limit_usd: float
    utilization_pct: float


@dataclass
class AgentBudgetConfig:
    """Budget configuration per agent type. Set by platform team, not product teams."""
    agent_type: str
    per_request_limit_usd: float      # hard limit per single LLM call
    daily_limit_usd: float            # hard limit per agent type per day
    warn_threshold_pct: float = 0.80  # warn at 80% of any limit


@dataclass
class PipelineBudgetConfig:
    """Budget configuration per pipeline execution."""
    pipeline_type: str
    per_execution_limit_usd: float  # max cost for one end-to-end pipeline run
    warn_threshold_pct: float = 0.80


class BudgetEnforcer:
    """
    Enforces three-layer budgets: per-request, per-agent-type, per-pipeline.
    Thread-safe. Emits OTel metrics on every decision for the fleet dashboard.
    """

    def __init__(
        self,
        agent_configs: dict[str, AgentBudgetConfig],
        pipeline_configs: dict[str, PipelineBudgetConfig],
        meter: metrics.Meter,
    ) -> None:
        self._agent_configs = agent_configs
        self._pipeline_configs = pipeline_configs
        self._lock = Lock()

        # Daily spend tracking per agent type - reset at midnight UTC
        self._agent_daily_spend: dict[str, float] = {}
        self._daily_reset_timestamp: float = self._today_midnight_utc()

        # Per-pipeline-execution spend tracking: session_id -> cumulative spend
        self._pipeline_spend: dict[str, float] = {}

        # OTel metric instruments
        self._budget_decision_counter = meter.create_counter(
            "gen_ai.fleet.budget_decisions",
            description="Budget enforcement decisions by agent, action, and level",
        )
        self._spend_histogram = meter.create_histogram(
            "gen_ai.fleet.spend_usd",
            description="Token spend in USD per agent invocation",
            unit="USD",
        )

    def check_and_record(
        self,
        agent_type: str,
        session_id: str,
        pipeline_type: str,
        estimated_cost_usd: float,
    ) -> BudgetResult:
        """
        Check all three budget layers before allowing an agent to execute.
        Call BEFORE the LLM call, not after.
        """
        with self._lock:
            self._maybe_reset_daily()
            config = self._agent_configs.get(agent_type)
            pipeline_config = self._pipeline_configs.get(pipeline_type)

            # Layer 1: Per-request limit
            if config and estimated_cost_usd > config.per_request_limit_usd:
                return self._decision(
                    BudgetAction.HALT,
                    f"Request cost ${estimated_cost_usd:.4f} exceeds "
                    f"per-request limit ${config.per_request_limit_usd:.4f}",
                    estimated_cost_usd, config.per_request_limit_usd,
                    agent_type, session_id, "per_request",
                )

            # Layer 2: Per-agent-type daily limit. The warning is deferred
            # until Layer 3 also clears, so a WARN still respects the
            # pipeline cap and the spend is recorded exactly once.
            projected_daily: Optional[float] = None
            warn_daily = False
            if config:
                projected_daily = (
                    self._agent_daily_spend.get(agent_type, 0.0) + estimated_cost_usd
                )
                if projected_daily > config.daily_limit_usd:
                    return self._decision(
                        BudgetAction.HALT,
                        f"Daily limit for {agent_type} would be exceeded: "
                        f"${projected_daily:.2f} > ${config.daily_limit_usd:.2f}",
                        projected_daily, config.daily_limit_usd,
                        agent_type, session_id, "daily",
                    )
                warn_daily = (
                    projected_daily / config.daily_limit_usd
                    > config.warn_threshold_pct
                )

            # Layer 3: Per-pipeline execution limit
            if pipeline_config:
                projected_pipeline = (
                    self._pipeline_spend.get(session_id, 0.0) + estimated_cost_usd
                )
                if projected_pipeline > pipeline_config.per_execution_limit_usd:
                    return self._decision(
                        BudgetAction.HALT,
                        f"Pipeline execution budget would be exceeded: "
                        f"${projected_pipeline:.4f} > "
                        f"${pipeline_config.per_execution_limit_usd:.4f}",
                        projected_pipeline, pipeline_config.per_execution_limit_usd,
                        agent_type, session_id, "pipeline",
                    )

            # All layers cleared - record spend, then allow (or warn)
            self._agent_daily_spend[agent_type] = (
                self._agent_daily_spend.get(agent_type, 0.0) + estimated_cost_usd
            )
            self._pipeline_spend[session_id] = (
                self._pipeline_spend.get(session_id, 0.0) + estimated_cost_usd
            )
            self._spend_histogram.record(
                estimated_cost_usd, {"gen_ai.agent.name": agent_type}
            )
            if config and warn_daily and projected_daily is not None:
                utilization = projected_daily / config.daily_limit_usd
                return self._decision(
                    BudgetAction.WARN,
                    f"Daily budget {utilization * 100:.0f}% utilized for {agent_type}",
                    projected_daily, config.daily_limit_usd,
                    agent_type, session_id, "daily",
                )
            return BudgetResult(
                action=BudgetAction.ALLOW,
                reason="Within all budget limits",
                current_spend_usd=estimated_cost_usd,
                limit_usd=config.per_request_limit_usd if config else float("inf"),
                utilization_pct=0.0,
            )

    def release_session(self, session_id: str) -> float:
        """Call when a pipeline session completes. Returns total session spend."""
        with self._lock:
            return self._pipeline_spend.pop(session_id, 0.0)

    def _decision(
        self,
        action: BudgetAction,
        reason: str,
        current: float,
        limit: float,
        agent_type: str,
        session_id: str,
        level: str,
    ) -> BudgetResult:
        utilization = current / limit if limit > 0 else 1.0
        self._budget_decision_counter.add(1, {
            "gen_ai.agent.name": agent_type,
            "budget.action": action.value,
            "budget.level": level,
            "session.id": session_id,
        })
        return BudgetResult(action, reason, current, limit, utilization)

    def _maybe_reset_daily(self) -> None:
        now_midnight = self._today_midnight_utc()
        if now_midnight > self._daily_reset_timestamp:
            self._agent_daily_spend = {}
            self._daily_reset_timestamp = now_midnight

    @staticmethod
    def _today_midnight_utc() -> float:
        today = datetime.datetime.now(datetime.timezone.utc).date()
        midnight = datetime.datetime.combine(
            today, datetime.time(), tzinfo=datetime.timezone.utc
        )
        return midnight.timestamp()
```

Budget Enforcement Architecture
```mermaid
flowchart TD
    REQ["LLM Call Request\nfrom agent node"]
    subgraph L1["Layer 1: Per-Request"]
        R1{"Cost estimate\n> per-request limit?"}
    end
    subgraph L2["Layer 2: Per-Agent-Type Daily"]
        R2{"Projected daily spend\n> daily limit?"}
        WARN["Warn at 80% threshold\nproceed but alert"]
    end
    subgraph L3["Layer 3: Per-Pipeline Execution"]
        R3{"Pipeline total\n> execution limit?"}
    end
    ALLOW["LLM Call Executes\nSpend recorded"]
    HALT["Budget Halt\nPipeline halt signal set\nAudit record emitted"]
    REQ --> R1
    R1 -->|"yes"| HALT
    R1 -->|"no"| R2
    R2 -->|"yes"| HALT
    R2 -->|"80-100%"| WARN --> R3
    R2 -->|"< 80%"| R3
    R3 -->|"yes"| HALT
    R3 -->|"no"| ALLOW
    style REQ fill:#95A5A6,color:#fff
    style R1 fill:#7B68EE,color:#fff
    style R2 fill:#7B68EE,color:#fff
    style R3 fill:#7B68EE,color:#fff
    style WARN fill:#FFD93D,color:#333
    style ALLOW fill:#6BCF7F,color:#fff
    style HALT fill:#E74C3C,color:#fff
```
A budget halt integrates directly with the pipeline halt protocol from Part 3: when the budget enforcer returns `BudgetAction.HALT`, the calling node sets `halt=True` and `halt_reason=HaltReason.CIRCUIT_BREAKER_TRIPPED` in the pipeline state. The halt propagates downstream. The audit node emits the event. The same mechanism that catches semantic failures catches cost overruns.
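Wired into a node, that translation is a few lines. A minimal sketch, assuming Part 3's dict-shaped pipeline state (`halt`, `halt_reason` keys) and re-declaring stripped-down `BudgetAction`/`BudgetResult` types so the snippet stands alone; `apply_budget_decision` is an illustrative helper, not part of the enforcer's API:

```python
# Sketch: translating a budget decision into pipeline halt state.
# BudgetAction/BudgetResult mirror the enforcer's types; the state keys
# (halt, halt_reason, warnings) are assumed from Part 3's protocol.
from dataclasses import dataclass
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"
    HALT = "halt"


@dataclass
class BudgetResult:
    action: BudgetAction
    reason: str


def apply_budget_decision(state: dict, result: BudgetResult) -> dict:
    """Map a budget decision onto pipeline state, Part 3-style."""
    if result.action == BudgetAction.HALT:
        state["halt"] = True
        state["halt_reason"] = "CIRCUIT_BREAKER_TRIPPED"
        state["halt_detail"] = result.reason  # surfaced by the audit node
    elif result.action == BudgetAction.WARN:
        state.setdefault("warnings", []).append(result.reason)
    return state
```

The node then checks `state["halt"]` before invoking the LLM, exactly as it would for a semantic-failure halt.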
Not all agents should share the same budget tier. The right allocation model matches the model tier to the task value, not to the engineer's default choice.
```python
# budget_configs.py
# Fleet-wide budget configuration. Owned by platform team.
# Enforced by BudgetEnforcer at runtime.

AGENT_BUDGET_CONFIGS = {
    # Frontier model agents: high value, high cost - strict limits
    "invoice_extractor": AgentBudgetConfig(
        agent_type="invoice_extractor",
        per_request_limit_usd=0.05,  # ~3,300 tokens at GPT-4o pricing
        daily_limit_usd=50.00,
        warn_threshold_pct=0.80,
    ),
    # Mid-tier agents: standard tasks, cheaper models
    "vendor_enricher": AgentBudgetConfig(
        agent_type="vendor_enricher",
        per_request_limit_usd=0.01,  # uses a smaller model
        daily_limit_usd=20.00,
        warn_threshold_pct=0.80,
    ),
    # Classification agents: should use smallest viable model
    "invoice_classifier": AgentBudgetConfig(
        agent_type="invoice_classifier",
        per_request_limit_usd=0.002,  # classification: gpt-4o-mini or equivalent
        daily_limit_usd=5.00,
        warn_threshold_pct=0.90,      # tighter warning for cheap tasks
    ),
}

PIPELINE_BUDGET_CONFIGS = {
    "invoice_processing": PipelineBudgetConfig(
        pipeline_type="invoice_processing",
        per_execution_limit_usd=0.10,  # full pipeline should cost <10 cents
        warn_threshold_pct=0.75,
    ),
}
```

The allocation principle: match the model to the task value. A classification decision that routes an invoice to one of three buckets does not require a frontier model. Using gpt-4o-mini or an equivalent small model for classification and reserving frontier model calls for extraction and reasoning cuts per-pipeline costs by 40-60% in most invoice processing workloads - with no measurable quality loss on the classification step.
The Deloitte Insights report from January 2026 described this shift as "FinOps for AI" - applying the same rigor to token spend that cloud teams apply to compute: forecast demand, enforce ROI thresholds, and treat tokens as a cost category with the same discipline as EC2 hours.
Cost Attribution to Business Units
A fleet-level cost view answers "how much did our agent fleet cost this month?" Attribution answers "which team, product, or business unit drove that cost?" Without attribution, cost governance has no mechanism for accountability.
The OTel attribute model from Part 1's fleet telemetry architecture makes attribution a tagging problem, not an instrumentation problem. Add business-unit tags to every span:
```python
# cost_attribution.py
# Business unit tagging for cost attribution.
# Tags flow through OTel spans to the metrics backend.
# Grafana can then answer: "cost by team, by product, by pipeline type."

from dataclasses import dataclass


@dataclass
class CostAttributionContext:
    """Carries business unit metadata into every OTel span."""
    team: str          # e.g. "accounts-payable"
    product: str       # e.g. "invoice-automation"
    cost_center: str   # e.g. "CC-4402"
    environment: str   # "production" | "staging" | "development"


def attribution_attributes(ctx: CostAttributionContext) -> dict[str, str]:
    """Returns OTel span attributes for cost attribution."""
    return {
        "business.team": ctx.team,
        "business.product": ctx.product,
        "business.cost_center": ctx.cost_center,
        "fleet.environment": ctx.environment,
    }

# Usage: merge into every span's attribute set alongside gen_ai.agent.* attributes.
# In Grafana: sum(gen_ai.fleet.spend_usd) by business.team
# Result: monthly spend broken down by team, billable to each cost center.
```

Vantage's FinOps team documented in 2026 that AI token spend is actually more attributable than cloud infrastructure spend at one level: providers like Anthropic and OpenAI expose per-API-key spend natively, giving individual-developer granularity before any internal tagging is applied. The team-level and product-level attribution above extends that granularity into the fleet's operational context.
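Mechanically, merging those attribution tags into each span's per-call attributes is a dict merge with a collision guard, so business tags can never silently override `gen_ai.*` keys. A sketch; `merge_span_attributes` is an illustrative helper, not part of the article's `cost_attribution.py`:

```python
# Sketch: merge per-call agent attributes with business attribution tags.
# Fails loudly on key collisions rather than letting one set win.

def merge_span_attributes(
    agent_attrs: dict[str, str], attribution_attrs: dict[str, str]
) -> dict[str, str]:
    """Return the combined attribute set for one span."""
    overlap = agent_attrs.keys() & attribution_attrs.keys()
    if overlap:
        raise ValueError(f"Attribute collision: {sorted(overlap)}")
    return {**agent_attrs, **attribution_attrs}


# Example: attributes for one extractor invocation
attrs = merge_span_attributes(
    {"gen_ai.agent.name": "invoice_extractor"},
    {"business.team": "accounts-payable", "business.cost_center": "CC-4402"},
)
```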
The Escalation Gate: Expensive Model as a Fallback
The most effective cost reduction pattern is also the most underused: make the expensive model a fallback, not a default.
```python
# model_escalation.py
# Tiered model selection: attempt cheap model first, escalate on failure.
# Reduces frontier model usage to cases where cheaper models genuinely fail.

from langchain_openai import ChatOpenAI


class TieredModelSelector:
    """
    Attempts task with a cheaper model first. Escalates to a frontier model only if:
    - Cheap model returns low-confidence output
    - Cheap model fails schema validation
    - Budget enforcer allows the escalation cost
    """

    def __init__(
        self,
        cheap_model: str = "gpt-4o-mini",
        frontier_model: str = "gpt-4o-2024-11-20",
        confidence_threshold: float = 0.85,
    ) -> None:
        self.cheap = ChatOpenAI(model=cheap_model, temperature=0)
        self.frontier = ChatOpenAI(model=frontier_model, temperature=0)
        self.confidence_threshold = confidence_threshold

    def invoke(
        self,
        prompt: str,
        budget_enforcer: BudgetEnforcer,
        agent_type: str,
        session_id: str,
        pipeline_type: str,
    ) -> tuple[str, str]:
        """
        Returns (response_content, model_used).
        Falls back to frontier model only if cheap model output is below threshold.
        """
        # Try cheap model first
        cheap_cost_estimate = 0.001  # estimate before call
        cheap_check = budget_enforcer.check_and_record(
            agent_type, session_id, pipeline_type, cheap_cost_estimate
        )
        if cheap_check.action == BudgetAction.HALT:
            raise RuntimeError(f"Budget halted cheap model call: {cheap_check.reason}")

        cheap_response = self.cheap.invoke(prompt)
        confidence = self._estimate_confidence(cheap_response.content)
        if confidence >= self.confidence_threshold:
            return cheap_response.content, "cheap"

        # Cheap model confidence too low - escalate
        frontier_cost_estimate = 0.01
        frontier_check = budget_enforcer.check_and_record(
            agent_type, session_id, pipeline_type, frontier_cost_estimate
        )
        if frontier_check.action == BudgetAction.HALT:
            # Can't escalate - return cheap model output with low-confidence flag
            return cheap_response.content + "\n[LOW_CONFIDENCE]", "cheap_unescalated"

        frontier_response = self.frontier.invoke(prompt)
        return frontier_response.content, "frontier"

    def _estimate_confidence(self, content: str) -> float:
        """
        Heuristic confidence estimation. In production: use logprobs if
        available, or a dedicated confidence model.
        """
        if "[UNCERTAIN]" in content or "I'm not sure" in content:
            return 0.5
        if len(content.strip()) < 10:
            return 0.4
        return 0.9
```

SoftwareSeni's FinOps framework for AI (March 2026) documented that "escalation gates before frontier model calls - require cheaper model steps to attempt the task first" is one of the three most effective cost levers, alongside token budget limits per agent call and cost caps per workflow. Teams that implement all three consistently achieve 40-60% cost reduction versus teams with only per-request limits.
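The string-matching `_estimate_confidence` heuristic is deliberately crude. When the provider returns per-token logprobs (OpenAI's chat completions API exposes these via the `logprobs` option), a geometric-mean token probability is a simple drop-in upgrade. A sketch, assuming a list of per-token log-probabilities has already been extracted from the response:

```python
import math


def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean log-probability).

    Returns 1.0 when the model was certain of every token; falls toward 0
    as per-token uncertainty grows. Empty input yields 0.0 (no evidence).
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

This scores the whole completion; for structured outputs, scoring only the decision-bearing tokens (e.g. the chosen class label) gives a sharper signal.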
Diagram: Fleet Cost Governance Architecture
The full cost governance system spans three control points: the model tier decision at the agent level, the three-layer budget enforcer at the call level, and the attribution pipeline that routes spend data to dashboards and finance.
```mermaid
flowchart TD
    subgraph AgentFleet["Agent Fleet"]
        CL["Classifier\nAgent\ngpt-4o-mini"]
        EX["Extractor\nAgent\ngpt-4o"]
        AN["Analysis\nAgent\nClaude Sonnet"]
    end
    subgraph TieredModel["Tiered Model Selector"]
        CHEAP["Cheap Model\nFirst attempt"]
        CONF{"Confidence\n≥ threshold?"}
        FRONT["Frontier Model\nFallback only"]
    end
    subgraph BudgetEnforcer["Three-Layer Budget Enforcer"]
        L1["Layer 1\nPer-Request\n< $0.05"]
        L2["Layer 2\nPer-Agent Daily\n< $50/day"]
        L3["Layer 3\nPer-Pipeline\n< $0.10/run"]
        HALT["Budget Halt\nPipeline stops"]
        WARN["Warn at 80%\nAlert fires"]
    end
    subgraph Attribution["Cost Attribution"]
        OTEL["OTel Span\nteam + product\ncost_center tags"]
        PROM["Prometheus\nMetrics"]
        GRAF["Grafana\nDashboard"]
        FIN["Finance Report\nMonthly by BU"]
    end
    CL & EX & AN --> CHEAP --> CONF
    CONF -->|"yes"| L1
    CONF -->|"no"| FRONT --> L1
    L1 -->|"over limit"| HALT
    L1 -->|"ok"| L2
    L2 -->|"over limit"| HALT
    L2 -->|"80-100%"| WARN --> L3
    L2 -->|"ok"| L3
    L3 -->|"over limit"| HALT
    L3 -->|"ok"| OTEL
    OTEL --> PROM --> GRAF --> FIN
    style CL fill:#98D8C8,color:#333
    style EX fill:#4A90E2,color:#fff
    style AN fill:#9B59B6,color:#fff
    style CHEAP fill:#98D8C8,color:#333
    style CONF fill:#7B68EE,color:#fff
    style FRONT fill:#9B59B6,color:#fff
    style L1 fill:#4A90E2,color:#fff
    style L2 fill:#4A90E2,color:#fff
    style L3 fill:#4A90E2,color:#fff
    style HALT fill:#E74C3C,color:#fff
    style WARN fill:#FFD93D,color:#333
    style OTEL fill:#6BCF7F,color:#fff
    style PROM fill:#6BCF7F,color:#fff
    style GRAF fill:#6BCF7F,color:#fff
    style FIN fill:#6BCF7F,color:#fff
```
The teal nodes are cheap-model paths. Purple nodes are frontier-model paths. Blue nodes are the enforcement layer. Green nodes are the attribution and reporting path. The goal is to route as much volume as possible through teal while keeping the enforcement layer between every model call and production systems.
Cost Governance Checklist
- Three-layer budget model implemented: per-request, per-agent-type daily, per-pipeline execution
- `BudgetEnforcer.check_and_record` called BEFORE every LLM call, not after
- Budget configs owned by platform team - product teams cannot raise limits without review
- Agent type allocation: classification and routing tasks use small models; extraction and reasoning use mid-tier; complex analysis escalates to frontier
- Tiered model selector implemented for high-volume agent types - cheap model attempts first
- OTel `gen_ai.fleet.spend_usd` metric emitted per invocation with agent type, team, product, and cost center tags
- `gen_ai.fleet.budget_decisions` counter queryable in Grafana - budget halt rate per agent type visible
- Daily budget utilization dashboard built - alert at 80% daily utilization for any agent type
- Monthly attribution report: cost broken down by team and cost center, delivered to finance
- 30-day rolling baseline established per agent type - anomaly detection triggers on 2x baseline deviation
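The last item, the rolling baseline, can be as simple as a windowed mean per agent type. A minimal sketch; the class name and the one-week warm-up before alerting are illustrative choices, not from the article:

```python
# Sketch: 30-day rolling spend baseline with a 2x anomaly trigger.
from collections import deque


class SpendBaseline:
    """Tracks daily spend for one agent type; flags days above 2x the baseline."""

    def __init__(self, window_days: int = 30, trigger_multiple: float = 2.0) -> None:
        self._history: deque[float] = deque(maxlen=window_days)
        self._trigger = trigger_multiple

    def observe(self, daily_spend_usd: float) -> bool:
        """Record one day's spend; return True if it is anomalous vs the baseline."""
        anomalous = False
        if len(self._history) >= 7:  # warm-up: need a week of data before alerting
            baseline = sum(self._history) / len(self._history)
            anomalous = baseline > 0 and daily_spend_usd > self._trigger * baseline
        self._history.append(daily_spend_usd)
        return anomalous
```

In the fleet, the daily input would come from the `gen_ai.fleet.spend_usd` histogram aggregated per agent type; an anomaly here is an alert to the platform team, not an automatic halt.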
What Comes Next
Part 6 covers Compliance, Audit Trails, and Regulatory Requirements for Agentic Systems - the EU AI Act enforcement deadline is August 2, 2026, and the gap between running agents and running auditable agents is larger than most platform teams expect.
References
- Deloitte Insights. (January 2026). AI Tokens: How to Navigate AI's New Spend Dynamics. https://www.deloitte.com/us/en/insights/topics/emerging-technologies/ai-tokens-how-to-navigate-spend-dynamics.html
- Mavvrik. (March 2026). Mavvrik Unveils Full Stack AI Cost Governance to Address AI Bill Shock. https://natlawreview.com/press-releases/mavvrik-unveils-full-stack-ai-cost-governance-address-ai-bill-shock
- Vantage. (2026). AI Cost Observability: Measuring and Justifying Token Spend in 2026. https://www.vantage.sh/blog/finops-for-ai-token-costs
- SoftwareSeni. (March 2026). How to Build AI Infrastructure Cost Governance Without a Dedicated FinOps Team. https://www.softwareseni.com/how-to-build-ai-infrastructure-cost-governance-without-a-dedicated-finops-team/
- Runyard. (March 2026). Controlling AI Agent Costs: Swarm Budgets and Circuit Breakers. https://runyard.io/blog/swarm-budgets-cost-control
- CIO. (December 2025). How to Get AI Agent Budgets Right in 2026. https://www.cio.com/article/4099548/how-to-get-ai-agent-budgets-right-in-2026.html
- FinOps Foundation. AI FinOps Framework and Certified Practitioner pathway. https://www.finops.org
- Ranjan Kumar. (April 2026). Unified Observability Across Agent Fleets. https://ranjankumar.in/ai-control-plane-unified-observability-agent-fleet
- Ranjan Kumar. (April 2026). Retry, Fallback, and Circuit Breaking: Building LLM Infrastructure That Survives Outages. https://ranjankumar.in/harness-engineering-retry-fallback-circuit-breaking-llm-resilience
- Ranjan Kumar. (April 2026). Global Policy Enforcement vs. Per-Agent Gate Rules. https://ranjankumar.in/ai-control-plane-global-policy-enforcement-per-agent-gate-rules
Related Articles
- Compliance, Audit Trails, and Regulatory Requirements for Agentic Systems
- Unified Observability Across Agent Fleets: Building the Control Plane Metric Layer
- Agent Versioning and Deployment Strategies: Shipping Agent Updates Without Breaking Running Pipelines