AI Control Plane · Part 4

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Agent Versioning and Deployment Strategies: Shipping Agent Updates Without Breaking Running Pipelines

Deploying a new agent version into a live multi-agent pipeline is not a software deployment. It is a distributed state migration - and most teams treat it like the former.

#agent-versioning #deployment-strategies #blue-green #canary #langgraph #production-ai #control-plane

Here is a failure mode that happens exactly once at most teams before they take agent versioning seriously.

Agent 1 (extractor) is at v1.3. Agent 2 (enricher) is at v2.0. They have been running together in production for three weeks. The team ships Agent 1 v1.4 - a model change that improves extraction accuracy. The deployment completes cleanly. No errors. Traffic shifts to the new version.

Six hours later, the enricher starts failing at a rate that triggers an alert. Root cause: Agent 1 v1.4 changed the schema of its output object - a field was renamed from vendor_name to vendor. Agent 2 v2.0 was never tested against Agent 1 v1.4. It expected vendor_name. The field was missing. Enrichment failed silently for six hours before quality dropped enough to fire a threshold alert.

This is the version consistency problem. It is not a bug. It is a missing practice: testing agent version combinations as a matrix, not as individual deployments.

Why Agent Deployment is Harder Than Service Deployment

When you deploy a new version of a microservice, the contract between services is enforced by an API schema - typically OpenAPI or gRPC. If Service A changes its response shape, Service B's SDK fails type-checking at compile time or the API gateway rejects the mismatched request. The incompatibility surfaces before production traffic hits it.

Agent outputs don't work this way. LLM-generated output is natural language or loosely structured JSON. The "schema" is whatever the prompt instructs. When Agent 1 v1.4 changes its prompt to produce vendor instead of vendor_name, nothing in the deployment pipeline catches it. The output is still valid JSON. Agent 2 receives it. The field is missing. Depending on Agent 2's code, it either raises a KeyError (caught by error handling, triggering the halt protocol from Part 3), or defaults silently to None and proceeds - the worst case.
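To make the failure concrete, here is a minimal sketch (plain dicts stand in for the real agent outputs; all names are hypothetical) of the two downstream behaviors described above - the silent-default worst case, and an explicit boundary check that fails loudly instead:

```python
# silent_failure_sketch.py
# Hypothetical illustration: how the vendor_name -> vendor rename slips past
# a naive consumer but is caught by an explicit contract check at the boundary.

def enrich_naive(extracted: dict) -> dict:
    """Worst case: .get() defaults the missing field to None and proceeds."""
    return {"vendor": extracted.get("vendor_name"), "amount": extracted["amount"]}

REQUIRED_FIELDS = {"vendor_name", "amount", "currency", "invoice_date"}

def enrich_validated(extracted: dict) -> dict:
    """Contract-checked: a missing required field fails loudly at the boundary."""
    missing = REQUIRED_FIELDS - extracted.keys()
    if missing:
        raise ValueError(f"Upstream contract violation: missing {sorted(missing)}")
    return {"vendor": extracted["vendor_name"], "amount": extracted["amount"]}

# Output of extractor v1.4 after the breaking rename:
v14_output = {"vendor": "Acme", "amount": 120.0,
              "currency": "USD", "invoice_date": "2025-01-01"}

print(enrich_naive(v14_output))       # vendor is silently None - no error raised
try:
    enrich_validated(v14_output)
except ValueError as e:
    print(e)                          # caught at the boundary instead
```

The naive path is exactly the six-hour silent failure from the opening; the validated path converts it into an immediate, attributable error.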

Three practices close this gap:

  1. Semantic versioning for agent contracts - treat agent output schemas as explicit, versioned contracts, not as implicit LLM outputs
  2. Version matrix testing - test every deployed combination of agent versions before any version goes live
  3. Deployment strategies that gate on behavioral metrics - canary deployments that promote based on quality, not just latency and error rate

The unified observability layer from Part 1 provides the Tier 3 behavioral metrics that make quality-gated promotion possible. Without that infrastructure, canary deployments for agents degrade to the same latency-and-error-rate gating used for stateless services - which misses the class of failures that matters most for agents.

Semantic Versioning for Agent Contracts

The root failure is the Version Consistency Problem: in a multi-agent pipeline, any combination of agent versions that has not been explicitly tested together is untested. Deploying Agent 1 v1.4 without testing it against every version of Agent 2 currently running in production is shipping an untested system - even if each agent passed all its own unit and integration tests in isolation.

Closing the Version Consistency Problem requires two practices running together: semantic versioning for agent output contracts, and version matrix testing before every promotion.

An agent contract is the typed, versioned specification of what an agent accepts as input and produces as output.

code
# agent_contracts.py
# Versioned agent contracts - the interface layer between agents.
# Breaking changes require a major version bump.
# Downstream agents declare which versions they are compatible with.

from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel


# --- Agent 1: Extractor contract ---

class ExtractorOutputV1(BaseModel):
    """
    v1.x contract for the invoice extractor.
    Breaking changes: rename fields, remove fields, change types.
    Non-breaking changes: add Optional fields, change validation rules.
    """
    vendor_name: str          # v1.x field name - downstream agents depend on this
    amount: float
    currency: str
    invoice_date: str
    contract_version: str = "1"


class ExtractorOutputV2(BaseModel):
    """
    v2.x contract - breaking change: vendor_name renamed to vendor.
    Downstream agents must explicitly declare compatibility with v2.
    Cannot be deployed until all downstream agents are updated and tested.
    """
    vendor: str               # renamed from vendor_name - BREAKING
    amount: float
    currency: str
    invoice_date: str
    po_number: Optional[str] = None   # new optional field - non-breaking
    contract_version: str = "2"


# --- Agent 2: Enricher - declares upstream compatibility ---

@dataclass
class AgentCompatibilityManifest:
    """
    Declares which upstream contract versions this agent is tested against.
    The deployment system enforces this - it will not route traffic from
    an incompatible upstream agent version to this agent.
    """
    agent_type: str
    agent_version: str
    upstream_compatibility: dict[str, list[str]]  # agent_type -> [compatible versions]


ENRICHER_V3_MANIFEST = AgentCompatibilityManifest(
    agent_type="enricher",
    agent_version="3.0.0",
    upstream_compatibility={
        # enricher v3.0 is only tested against extractor v2.x
        # it will NOT receive traffic from extractor v1.x
        "extractor": ["2"],
    },
)

ENRICHER_V2_MANIFEST = AgentCompatibilityManifest(
    agent_type="enricher",
    agent_version="2.0.0",
    upstream_compatibility={
        # enricher v2.0 is tested against extractor v1.x only
        "extractor": ["1"],
    },
)

The compatibility manifest is enforced at the routing layer - the control plane checks that the upstream agent's contract version is in the downstream agent's compatibility list before allowing traffic to flow. If Agent 1 v1.4 still produces ExtractorOutputV1 (v1.x contract), it can flow to Enricher v2.0 safely. Only when Agent 1 v2.0 starts producing ExtractorOutputV2 does the router block traffic to Enricher v2.0 and require Enricher v3.0.
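As a sketch of that routing-layer check (the `may_route` helper is illustrative, not a specific control-plane API), the enforcement reduces to a set-membership test against the downstream manifest:

```python
# compat_check_sketch.py
# Hypothetical sketch of the routing-layer enforcement described above:
# before forwarding upstream output, the control plane verifies the upstream
# contract version appears in the downstream agent's compatibility manifest.
from dataclasses import dataclass

@dataclass
class AgentCompatibilityManifest:
    agent_type: str
    agent_version: str
    upstream_compatibility: dict[str, list[str]]  # agent_type -> compatible contract versions

def may_route(upstream_type: str, upstream_contract_version: str,
              downstream: AgentCompatibilityManifest) -> bool:
    """True only if the downstream agent declares the upstream contract version."""
    return upstream_contract_version in downstream.upstream_compatibility.get(upstream_type, [])

# Enricher v2.0 is tested against the extractor v1.x contract only:
enricher_v2 = AgentCompatibilityManifest("enricher", "2.0.0", {"extractor": ["1"]})

print(may_route("extractor", "1", enricher_v2))  # True  - v1.x contract is declared
print(may_route("extractor", "2", enricher_v2))  # False - router blocks; enricher v3 required
```

The check is deliberately dumb: all the intelligence lives in how the manifest was populated, which is the job of the matrix tests below.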

Version Matrix Testing

Version Matrix Testing is the practice of running integration tests against every combination of agent versions that will co-exist in the fleet, before any version is promoted to production. This is how you close the Version Consistency Problem.

Before any agent version is deployed to production, every combination of agent versions that will co-exist in the fleet must pass an integration test.

code
# version_matrix_test.py
# Run as part of CI before any agent version is promoted to production.
# Tests every combination of agent versions defined in the active version matrix.
# invoice_fixture is a pytest fixture (e.g. defined in conftest.py) supplying
# golden invoice text plus expected outputs.

from itertools import product
from typing import NamedTuple

import pytest


class VersionCombination(NamedTuple):
    extractor_version: str
    enricher_version: str
    payment_version: str


# The active version matrix: all versions currently running or about to run
ACTIVE_VERSIONS = {
    "extractor": ["1.3", "1.4"],   # 1.4 is the new version being promoted
    "enricher": ["2.0"],
    "payment": ["1.0"],
}


def generate_version_matrix() -> list[VersionCombination]:
    """Generate every combination of agent versions in the matrix."""
    return [
        VersionCombination(*combo)
        for combo in product(*ACTIVE_VERSIONS.values())
    ]


def build_pipeline(
    extractor_version: str,
    enricher_version: str,
    payment_version: str,
):
    """
    Builds the invoice pipeline using the specified agent versions.
    In practice: loads versioned agent code from the artifact registry.
    Returns a compiled LangGraph graph for the given version combination.
    """
    # Implementation: pull versioned agent images/modules and wire them
    # into the pipeline graph. Exact implementation is deployment-system-specific.
    raise NotImplementedError("Wire to your agent artifact registry")


def evaluate_quality(result: dict, expected_output: dict) -> float:
    """Returns a quality score 0.0-1.0 comparing result to expected."""
    # Implementation: compare extracted fields against golden set
    correct_fields = sum(
        1 for k, v in expected_output.items()
        if result.get("payment_payload", {}).get(k) == v
    )
    return correct_fields / len(expected_output) if expected_output else 0.0


@pytest.mark.parametrize("combo", generate_version_matrix())
def test_version_combination(combo: VersionCombination, invoice_fixture):
    """
    Each combination must pass a golden-set integration test.
    Failure blocks the new version from production deployment.
    """
    pipeline = build_pipeline(
        extractor_version=combo.extractor_version,
        enricher_version=combo.enricher_version,
        payment_version=combo.payment_version,
    )
    result = pipeline.invoke({"invoice_text": invoice_fixture["text"]})

    # Structural correctness: required fields present and typed
    assert result["payment_payload"]["amount"] == invoice_fixture["expected_amount"]
    assert result["payment_payload"]["vendor_id"] is not None
    assert not result["halt"], f"Pipeline halted: {result.get('halt_reason')}"

    # Behavioral correctness: output quality above threshold
    quality_score = evaluate_quality(result, invoice_fixture["expected_output"])
    assert quality_score >= 0.90, (
        f"Quality below threshold for {combo}: {quality_score:.2f}"
    )

This is the practice most teams skip. Running version matrix tests takes time and requires maintaining a golden set of test fixtures. The payoff is that the vendor_name to vendor failure described in the opening never reaches production - it is caught in the matrix test for the (extractor-1.4, enricher-2.0) combination.

The Three Deployment Strategies

With contracts and matrix testing in place, the deployment strategy determines how new versions are introduced into the live fleet.

Rolling deployment replaces agent instances one by one. At any moment during a rolling deploy, the fleet contains a mix of old and new versions. This is the default for stateless services and is wrong for agents. If Agent 1 v1.3 and Agent 1 v1.4 are running simultaneously and the pipeline load balances across them, the same session may use v1.3 for one request and v1.4 for the next. For stateful agents that checkpoint between steps - as established in the State Management layer - this creates checkpoint corruption. The state written by v1.3 may not be readable by v1.4.

Blue-green deployment maintains two full environments. Blue runs the current version. Green runs the new version. Traffic switches atomically. No mixed-version sessions. This is the correct default for agents in active pipelines.

code
# blue_green_router.py
# Atomic traffic switch between agent fleet versions.
# Sessions that started on Blue complete on Blue.
# New sessions start on Green after the switch.

import threading
from dataclasses import dataclass, field
from typing import Literal

Environment = Literal["blue", "green"]


@dataclass
class BlueGreenRouter:
    """
    Routes new sessions to the active environment.
    In-flight sessions continue on their original environment.
    Thread-safe atomic switch.
    """
    _active: Environment = "blue"
    _lock: threading.Lock = field(default_factory=threading.Lock)
    # session_id -> environment mapping for in-flight sessions
    _session_env: dict[str, Environment] = field(default_factory=dict)

    def get_environment(self, session_id: str) -> Environment:
        """
        Returns the environment for a session.
        In-flight sessions always stay on their original environment.
        New sessions go to the current active environment.
        """
        with self._lock:  # pin sessions under the same lock as the switch
            if session_id in self._session_env:
                return self._session_env[session_id]
            env = self._active
            self._session_env[session_id] = env
            return env

    def switch_to_green(self) -> None:
        """
        Atomic switch: new sessions go to Green.
        Existing sessions complete on Blue.
        Call after Green passes all pre-promotion checks.
        """
        with self._lock:
            self._active = "green"

    def rollback_to_blue(self) -> None:
        """
        Immediate rollback: new sessions return to Blue.
        In-flight Green sessions complete on Green (already committed).
        """
        with self._lock:
            self._active = "blue"

    def release_session(self, session_id: str) -> None:
        """Call when a session completes to free the mapping."""
        with self._lock:
            self._session_env.pop(session_id, None)

Canary deployment routes a percentage of new sessions to the new version and promotes based on behavioral metrics. This is the most powerful strategy for detecting quality regressions that don't surface as errors.

code
# canary_controller.py
# Promotes canary based on Tier 3 behavioral metrics, not just error rates.
# Uses the quality_score metric from the fleet telemetry layer (Part 1).

import random
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanaryController:
    """
    Manages canary promotion based on quality metrics comparison.
    Baseline = stable version metrics. Canary = new version metrics.
    Promotes when canary quality is within acceptable delta of baseline
    AND error rate is below threshold AND minimum traffic has been served.
    """
    canary_traffic_pct: float = 0.05       # start at 5%
    min_sessions_before_promote: int = 100
    quality_delta_threshold: float = 0.03  # canary quality within 3% of baseline
    error_rate_threshold: float = 0.01     # canary error rate below 1%
    max_canary_duration_hours: float = 4.0 # auto-rollback if not promoted in 4h
    _canary_start_time: Optional[float] = None
    _canary_sessions: int = 0

    def should_route_to_canary(self) -> bool:
        """Probabilistic routing based on canary traffic percentage."""
        return random.random() < self.canary_traffic_pct

    def record_canary_session(self) -> None:
        if self._canary_start_time is None:
            self._canary_start_time = time.monotonic()
        self._canary_sessions += 1

    def evaluate_promotion(
        self,
        baseline_quality: float,
        canary_quality: float,
        canary_error_rate: float,
    ) -> tuple[bool, str]:
        """
        Returns (should_promote, reason).
        Promote only if all three conditions hold.
        """
        if self._canary_sessions < self.min_sessions_before_promote:
            return False, f"Insufficient canary traffic: {self._canary_sessions} sessions"

        hours_elapsed = (time.monotonic() - (self._canary_start_time or 0)) / 3600
        if hours_elapsed > self.max_canary_duration_hours:
            return False, f"Canary duration exceeded {self.max_canary_duration_hours}h - auto-rollback"

        quality_gap = baseline_quality - canary_quality
        if quality_gap > self.quality_delta_threshold:
            return False, f"Quality regression: canary={canary_quality:.3f} baseline={baseline_quality:.3f}"

        if canary_error_rate > self.error_rate_threshold:
            return False, f"Error rate too high: {canary_error_rate:.3f}"

        return True, "Canary meets all promotion criteria"

The critical difference from service canary deployments: the promotion check uses canary_quality - a Tier 3 behavioral metric from the fleet telemetry plane - not just latency and error rate. An agent version that produces lower-quality outputs at normal speed with zero errors is still a regression. Standard canary gates miss it. Quality-gated canary catches it.
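A self-contained sketch of that gate (thresholds mirror the CanaryController defaults above; the `should_promote` helper is a condensed illustration, not the full controller): a canary with zero errors still fails promotion when quality regresses beyond the delta.

```python
# quality_gate_sketch.py
# Condensed, hypothetical version of the promotion check: quality delta and
# error rate only, with thresholds matching the CanaryController defaults.

QUALITY_DELTA_THRESHOLD = 0.03  # canary quality must be within 3% of baseline
ERROR_RATE_THRESHOLD = 0.01     # canary error rate below 1%

def should_promote(baseline_quality: float, canary_quality: float,
                   canary_error_rate: float) -> tuple[bool, str]:
    if baseline_quality - canary_quality > QUALITY_DELTA_THRESHOLD:
        return False, "quality regression"
    if canary_error_rate > ERROR_RATE_THRESHOLD:
        return False, "error rate too high"
    return True, "promote"

# Zero errors, normal latency - but extraction quality dropped 0.94 -> 0.86.
# A latency-and-error-rate gate would promote this version; the quality gate blocks it.
print(should_promote(0.94, 0.86, 0.0))    # (False, 'quality regression')
print(should_promote(0.94, 0.93, 0.002))  # (True, 'promote')
```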

Which Strategy for Which Scenario

| Scenario | Recommended Strategy | Why |
| --- | --- | --- |
| Model swap, same output schema | Canary | Quality may shift; catch regressions early |
| Prompt update, non-breaking | Canary | Output distribution changes; need behavioral validation |
| Schema change (breaking) | Blue-green | Version consistency requires atomic switch with matrix test |
| Hotfix under incident | Blue-green | Speed matters; switch atomically, rollback fast |
| New agent type added to fleet | Rolling (for new type only) | No existing sessions affected; existing agents unchanged |

Diagram: Agent Versioning and Deployment Lifecycle

The diagram below shows how all three practices - contract versioning, matrix testing, and deployment strategy selection - compose into a single promotion pipeline. A new agent version enters at the left and must pass every gate before it reaches full production traffic.

mermaid
flowchart LR
    subgraph Dev["Development"]
        NV["New Agent\nVersion\nv1.4"]
        SC["Semantic\nContract\nDefined"]
    end

    subgraph CI["CI Gate"]
        MT["Version Matrix\nTest\nAll combos"]
        CM["Compatibility\nManifest\nVerified"]
    end

    subgraph Staging["Staging"]
        INT["Integration\nTest\nGolden set"]
    end

    subgraph Deploy["Production Deployment"]
        STRAT{"Change\nType?"}
        CAN["Canary\n5% traffic\n4h window"]
        BG["Blue-Green\nAtomic switch"]

        subgraph Promote["Promotion Gate"]
            QM["Quality Score\n≥ baseline - 3%"]
            ER["Error Rate\n< 1%"]
            TC["Traffic Count\n≥ 100 sessions"]
        end
    end

    FULL["Full Production\n100% traffic"]
    RB["Rollback\nto previous"]

    NV --> SC --> MT
    MT -->|"fail"| RB
    MT -->|"pass"| CM --> INT
    INT -->|"fail"| RB
    INT -->|"pass"| STRAT

    STRAT -->|"model swap\nprompt update"| CAN
    STRAT -->|"breaking schema\nhotfix"| BG

    CAN --> QM & ER & TC
    QM & ER & TC -->|"all pass"| FULL
    QM & ER & TC -->|"any fail"| RB

    BG -->|"verified"| FULL
    BG -->|"incident"| RB

    style NV fill:#95A5A6,color:#fff
    style SC fill:#4A90E2,color:#fff
    style MT fill:#7B68EE,color:#fff
    style CM fill:#7B68EE,color:#fff
    style INT fill:#7B68EE,color:#fff
    style STRAT fill:#9B59B6,color:#fff
    style CAN fill:#4A90E2,color:#fff
    style BG fill:#4A90E2,color:#fff
    style QM fill:#FFD93D,color:#333
    style ER fill:#FFD93D,color:#333
    style TC fill:#FFD93D,color:#333
    style FULL fill:#6BCF7F,color:#fff
    style RB fill:#E74C3C,color:#fff

Three gates must pass before any version reaches full production: the CI matrix test (catches version consistency failures), the promotion criteria (catches quality regressions in canary), and the compatibility manifest verification (catches breaking schema changes). Any gate failure routes back to rollback, not forward to production.

Connecting Versioning to the Control Plane

LangSmith Fleet (launched March 2026) introduced agent identity and versioning as first-class concepts. Each agent in the fleet has an identity, a version tag, and fleet-scoped permissions. The gen_ai.agent.version attribute in OTel spans - established in Part 1's telemetry architecture - connects deployment versions to behavioral metrics automatically: every quality score, every tool call, every halt event is tagged with the agent version that produced it.

This means the canary controller's canary_quality metric is not a separate data collection effort. It is a query against the existing fleet telemetry: avg(gen_ai.agent.quality_score) where gen_ai.agent.version = "1.4". The infrastructure built in Part 1 makes quality-gated promotion a query, not an instrumentation project.
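Fleet's actual query interface is not reproduced here; the sketch below runs the equivalent aggregation over exported span records in plain Python to show the shape of the query. The record format is an assumption - only the attribute names follow the conventions above.

```python
# version_tagged_quality_sketch.py
# Illustrative only: averaging quality_score per gen_ai.agent.version over a
# batch of exported span records. Not a real LangSmith Fleet API.
from collections import defaultdict

spans = [
    {"gen_ai.agent.version": "1.3", "gen_ai.agent.quality_score": 0.94},
    {"gen_ai.agent.version": "1.3", "gen_ai.agent.quality_score": 0.92},
    {"gen_ai.agent.version": "1.4", "gen_ai.agent.quality_score": 0.88},
    {"gen_ai.agent.version": "1.4", "gen_ai.agent.quality_score": 0.90},
]

def avg_quality_by_version(spans: list[dict]) -> dict[str, float]:
    """avg(quality_score) grouped by agent version - the canary baseline query."""
    scores: dict[str, list[float]] = defaultdict(list)
    for span in spans:
        scores[span["gen_ai.agent.version"]].append(span["gen_ai.agent.quality_score"])
    return {v: round(sum(s) / len(s), 3) for v, s in scores.items()}

print(avg_quality_by_version(spans))  # {'1.3': 0.93, '1.4': 0.89}
```

The two numbers produced here are exactly the `baseline_quality` and `canary_quality` inputs the canary controller consumes.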

Agent Versioning Checklist

  • Agent output schemas defined as Pydantic models with explicit version field (contract_version)
  • Breaking vs. non-breaking changes defined: field rename/removal/type change = breaking; adding Optional field = non-breaking
  • Each downstream agent declares upstream_compatibility manifest listing tested upstream versions
  • Version matrix test runs in CI for every new agent version - tests all active version combinations
  • Blue-green router implemented with session pinning - in-flight sessions complete on their original environment
  • Canary controller promotion criteria include Tier 3 quality metrics, not just latency/error rate
  • gen_ai.agent.version attribute present on all OTel spans - version-tagged metrics available from day one
  • LangSmith Fleet agent identity configured - audit log of version deployments maintained
  • Rollback procedure documented and tested: blue-green switch is atomic and tested in staging quarterly
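One way to automate the checklist's breaking/non-breaking rule is a contract diff in CI. The sketch below is hypothetical - it models field specs as `{name: (type, required)}` pairs rather than real Pydantic models - and flags removals, type changes, and new required fields as breaking:

```python
# contract_diff_sketch.py
# Hypothetical CI helper: classify a contract change as breaking or not.
# A rename appears as a removal plus an addition, so it is always breaking.

V1_FIELDS = {
    "vendor_name": ("str", True),
    "amount": ("float", True),
    "currency": ("str", True),
    "invoice_date": ("str", True),
}
V2_FIELDS = {
    "vendor": ("str", True),      # renamed - removal + required addition
    "amount": ("float", True),
    "currency": ("str", True),
    "invoice_date": ("str", True),
    "po_number": ("str", False),  # new optional field - non-breaking on its own
}

def classify_change(old: dict, new: dict) -> str:
    """Breaking: removed fields, retyped fields, or new required fields."""
    removed = old.keys() - new.keys()
    retyped = {k for k in old.keys() & new.keys() if old[k][0] != new[k][0]}
    added_required = {k for k in new.keys() - old.keys() if new[k][1]}
    if removed or retyped or added_required:
        return "breaking: bump major version"
    return "non-breaking: bump minor version"

print(classify_change(V1_FIELDS, V2_FIELDS))  # breaking: bump major version
print(classify_change(V1_FIELDS, V1_FIELDS))  # non-breaking: bump minor version
```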

What Comes Next

Part 5 covers Cost Governance and Budget Allocation Across Agent Types - how to allocate token budgets across a fleet, attribute spend to business units, and prevent runaway agent costs before they appear on an invoice.
