For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Multi-Agent Topology Patterns: Every Topology Has a Tear Point

Orchestrator, supervisor, peer-mesh - each one fails first at a different contract, and you can predict which before you write a line of code.

#multi-agent-systems #agent-orchestration #topology-patterns #agent-contracts #production-ai

A multi-agent system went to production with a clean architecture diagram: one orchestrator, four specialist subagents, arrows pointing the right way. It passed every demo. Three weeks in, it started returning answers that were confidently wrong about half the time, and nobody could say why.

The trace told the story. A subagent had timed out on a slow tool call. The orchestrator, which had no contract for what a timeout meant, treated the empty response as a completed task with no findings. It synthesized a final answer from three subagent results instead of four and never flagged the gap. The diagram was correct. The boxes and arrows were exactly where they should be. The system failed anyway - and it failed at the single most predictable place an orchestrator fails.

That predictability is the point of this article. Multi-agent systems do not fail randomly across their surface. Each topology fails first at one specific contract between agents, and which contract that is follows directly from the topology you chose. You can name it at design time, before any code exists.

The Thesis: Topology Determines Where You Tear First

The standard framing of multi-agent design treats topology as a performance choice - orchestrator for parallelism, supervisor for control, peer-mesh for latency - and treats reliability as a separate, later concern you bolt on with retries and guardrails. That ordering is backwards. The topology you pick decides which contract between your agents will fail first. I call that contract the topology's Tear Point: the one inter-agent agreement that takes the most load under that arrangement and therefore tears before any other when the system is stressed.

This is a sharper claim than "multi-agent systems fail at coordination," which is by now well established. The largest empirical study of these failures, the Multi-Agent System Failure Taxonomy (MAST), analyzed over 1,600 execution traces across seven frameworks and found failure rates between 41 and 86.7 percent, clustered into system-design issues (roughly 44 percent) and inter-agent misalignment (roughly 32 percent) rather than raw model errors. The same work observed that different systems have different dominant failure modes tied to their structure - a star topology, for instance, tends toward termination failures. Independent work mapping coordination structures to error pathways reached a parallel conclusion: each configuration induces predictable failures rather than uniform ones.

What that body of work establishes is that topology and failure are linked. What it does not give a practitioner is the operationally useful version: given my topology, which single contract do I harden first? That is the Tear Point, and locating it on the contract - not just predicting that failure will happen somewhere - is what turns "multi-agent systems are unreliable" into a design checklist. The prior art tells you the building will settle. The Tear Point tells you which wall to reinforce.

The contracts themselves are not new, and I am not claiming them as a contribution. Formal treatments already exist: behavioral-contract models specify each agent's required input and output artifacts as pre- and post-conditions, and resource-bounded agent-contract frameworks add budgets and termination triggers as enforceable obligations. Cross-organization protocols like Agent2Agent (A2A) externalize the same agreements as published capability declarations. This article builds on that vocabulary. The contribution is the mapping: topology to Tear Point, so the contract work goes where the load actually is instead of being spread evenly across a surface that is not evenly stressed.

Why This Matters: The Tear Point Is Where Cost And Corruption Both Concentrate

Locating the Tear Point matters because contract work is not free and teams have a limited reliability budget. Hardening every boundary equally is how you end up with an over-engineered system that is still fragile at the one place that mattered. Worse, the failure at the Tear Point is usually silent.

Consider the cost side first. Multi-agent systems are expensive by construction. Anthropic's published research system, an orchestrator-worker design where a lead agent spawns parallel subagents, outperformed its single-agent baseline by a wide margin on internal evaluations but consumed on the order of fifteen times the tokens of a normal chat, with roughly 80 percent of the performance variance traced to token usage. That multiplier is the cost of the topology. Now leave the most heavily loaded contract unenforced: a subagent that recursively spawns more subagents, or a tool that returns an oversized payload, can multiply that already-large cost by another order of magnitude. The published blueprint, by the authors' own framing, ships without circuit breakers or per-run caps. The runaway bill does not come from a random agent - it comes from the orchestrator's Tear Point, the place that topology loads hardest.

Now the corruption side. In a single-agent system, an unhandled edge case produces one bad output. In a multi-agent system, an unhandled edge at the Tear Point propagates: the timed-out subagent in the opening story did not just produce one gap, it corrupted the synthesis, which became the answer, which a downstream consumer trusted. Empirical work finds that the majority of multi-agent failures are silent semantic errors - plausible-looking, unusable outputs that emit no exception. The Tear Point is exactly where those silent errors originate, because it is the boundary carrying the most weight with the least slack.

So the asymmetry is stark. Miss the Tear Point and you get silent corruption and unbounded spend concentrated at one predictable boundary. Get the topology choice itself slightly wrong - a supervisor where an orchestrator would have been marginally faster - and you get a system that is recoverable, just suboptimal. The reliability payoff is in finding the tear, not in perfecting the boxes.

The Wrong Way: Pick A Topology, Inherit The Contracts

Here is the common path. A team picks a topology, reaches for the prebuilt constructor, and ships - leaving every inter-agent contract at its framework default.

code
# WRONG: topology chosen, contracts left at framework defaults
from langgraph_supervisor import create_supervisor
from langgraph.prebuilt import create_react_agent

researcher = create_react_agent(model, tools=[web_search], name="researcher")
analyst = create_react_agent(model, tools=[run_analysis], name="analyst")
writer = create_react_agent(model, tools=[draft_report], name="writer")

# The supervisor routes between workers. Looks complete. It is not.
app = create_supervisor(
    agents=[researcher, analyst, writer],
    model=model,
    prompt="Route the task to the right specialist.",
).compile()

result = app.invoke({"messages": [("user", task)]})

This compiles, runs, and demos cleanly. Every contract is implicit, and the supervisor topology's Tear Point - the routing decision - is wide open:

  • Capability routing (the Tear Point here): the supervisor routes by reading a one-line prompt and the agents' names. There is no machine-checkable statement of what each worker accepts. Nothing stops it from sending a raw query to an agent that only accepts analysis - the canonical inter-agent mismatch, where one agent emits YAML and the next expects JSON, and the error cascades downstream.
  • Message schema: the default handoff passes the entire message history to the next agent, so context grows with each hop. This is the cost the "handoff tax" discourse has been naming - the toll you pay when a transfer carries everything instead of a scoped result.
  • Timeout contract: there is none. If run_analysis hangs, the system hangs, or returns a partial that looks complete.
  • Output specification: "Route the task to the right specialist" says nothing about the shape of a worker's result. The supervisor will synthesize whatever it gets, including an empty string.

The topology is fine. A supervisor over three specialists is reasonable. It will still fail, and it will fail first at routing, because that is where this topology puts the load.

The Right Way: Locate The Tear Point, Then Harden That Contract First

Invert the order. Name the Tear Point for your topology, harden that contract before anything else, then fill in the rest of the surface. The four contracts below are the agreements that can tear; the topology decides which one tears first.

code
import asyncio
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


# MESSAGE SCHEMA - what crosses the boundary, and nothing more. A scoped
# result, not a transcript. This is what pays down the handoff tax.
class AgentResult(BaseModel):
    status: Literal["complete", "partial", "failed", "timed_out"]
    payload: dict = Field(default_factory=dict)
    confidence: float = Field(ge=0.0, le=1.0)
    cost_tokens: int = Field(ge=0)
    notes: str = ""  # scoped summary, not transcript


# CAPABILITY DECLARATION - machine-checkable, so routing is validated, not guessed.
class Capability(BaseModel):
    name: str
    accepts: list[str]            # input content types this agent handles
    returns: str                  # output content type produced
    max_cost_tokens: int          # the budget the caller agrees to fund
    timeout_seconds: float        # the deadline the caller agrees to wait


CAPABILITIES: dict[str, Capability] = {
    "researcher": Capability(
        name="researcher", accepts=["text/query"], returns="text/findings",
        max_cost_tokens=40_000, timeout_seconds=60.0,
    ),
    "analyst": Capability(
        name="analyst", accepts=["text/findings"], returns="text/analysis",
        max_cost_tokens=20_000, timeout_seconds=30.0,
    ),
    "writer": Capability(
        name="writer", accepts=["text/analysis"], returns="text/report",
        max_cost_tokens=15_000, timeout_seconds=30.0,
    ),
}


# CAPABILITY ROUTING CHECK - the supervisor's Tear Point, closed. A request
# whose content type no worker declares is rejected at the boundary, not
# discovered three hops downstream as a cascaded mismatch.
def route(content_type: str) -> str:
    for cap in CAPABILITIES.values():
        if content_type in cap.accepts:
            return cap.name
    raise ValueError(f"no agent accepts {content_type}: routing would have guessed")


# TIMEOUT + CANCELLATION - a deadline is a first-class outcome, not an
# exception that corrupts synthesis. OUTPUT SPEC validated before trust.
async def call_agent(name: str, agent, request: dict) -> AgentResult:
    cap = CAPABILITIES[name]
    try:
        raw = await asyncio.wait_for(
            agent.ainvoke(request), timeout=cap.timeout_seconds
        )
    except asyncio.TimeoutError:
        return AgentResult(
            status="timed_out", confidence=0.0, cost_tokens=0,
            notes=f"{name} exceeded {cap.timeout_seconds}s deadline",
        )
    try:
        result = AgentResult.model_validate(raw)  # validate the boundary
    except ValidationError as exc:
        # Malformed output becomes a declared failure the caller can branch on.
        return AgentResult(
            status="failed", confidence=0.0, cost_tokens=0,
            notes=f"{name} returned invalid output: {len(exc.errors())} error(s)",
        )
    if result.cost_tokens > cap.max_cost_tokens:
        result.notes += f" [BUDGET EXCEEDED: {result.cost_tokens}/{cap.max_cost_tokens}]"
    return result

The controller then consumes contracts, not raw model output, with a global bound on the whole run:

code
GLOBAL_RUN_TIMEOUT_SECONDS = 180.0  # bounds the run regardless of per-agent deadlines


async def orchestrate(task: str, agents: dict) -> dict:
    try:
        return await asyncio.wait_for(
            _run_pipeline(task, agents), timeout=GLOBAL_RUN_TIMEOUT_SECONDS
        )
    except asyncio.TimeoutError:
        return {"status": "degraded", "reason": "global run deadline exceeded",
                "partial": None}


async def _run_pipeline(task: str, agents: dict) -> dict:
    # Dispatch to the agent the routing check selects, not a hardcoded name.
    name = route("text/query")
    findings = await call_agent(name, agents[name],
                                {"content": task, "type": "text/query"})
    # Branch on a declared status instead of treating absence as success.
    if findings.status in ("failed", "timed_out"):
        return {"status": "degraded", "reason": findings.notes, "partial": None}

    name = route("text/findings")
    analysis = await call_agent(name, agents[name],
                                {"content": findings.payload, "type": "text/findings"})
    if analysis.status in ("failed", "timed_out"):
        return {"status": "degraded", "reason": analysis.notes,
                "partial": findings.payload}  # a partial is still a labeled result

    name = route("text/analysis")
    report = await call_agent(name, agents[name],
                              {"content": analysis.payload, "type": "text/analysis"})
    return {"status": report.status, "report": report.payload,
            "confidence": report.confidence}

The topology is still a supervisor-style pipeline. The difference is sequence: the routing check - this topology's Tear Point - was hardened first, with a content-type match that fails loudly instead of guessing. Everything else (schema, timeout, output validation) fills in the rest of the surface. Move to a different topology and the same four contracts exist, but a different one moves to the front of the queue.

The Named Concept: The Tear Point

A topology's Tear Point is the inter-agent contract that carries the most load under that arrangement and therefore fails first when the system is stressed. It is a location, not a behavior - a specific face of the contract surface, named in advance, that tells you where to spend your reliability budget before you have written any code.

Three properties make it useful:

  1. It is predictable from topology alone. You do not need traces to find it. The arrangement of agents tells you where the load concentrates, and load concentration is where the tear happens.
  2. It is singular and prescriptive. Failure-mode taxonomies enumerate everything that can go wrong; the Tear Point names the one thing to fix first. That is the difference between a catalog and a design instruction.
  3. It is contract-located. The tear is not "the orchestrator" or "coordination" in the abstract - it is a specific agreement (a schema, a routing check, a deadline, an output spec) you can point to in code and write a test for.

Topology determines the Tear Point because topology determines load distribution. An orchestrator funnels every subagent boundary into one synthesis join, so the join is loaded hardest. A supervisor pushes every decision through one routing step, so routing is loaded hardest. A peer-mesh distributes control across handoffs with no center, so the handoff is loaded hardest. Same four contracts in every case; different one at the front.
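Because the Tear Point is a contract you can point to in code, it can be checked before any agent runs. The following is a minimal sketch, not the article's implementation: it assumes capability declarations shaped like the ones shown earlier (reduced here to a stdlib dataclass), and the `check_chain` helper is a hypothetical name introduced for illustration. It verifies that each agent's declared output type is something the next agent declares it accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    accepts: tuple[str, ...]   # input content types the agent handles
    returns: str               # output content type the agent produces


def check_chain(pipeline: list[Capability]) -> list[str]:
    """Return one message per broken hop: a hop is broken when the upstream
    agent's declared output is not among the downstream agent's inputs."""
    problems = []
    for upstream, downstream in zip(pipeline, pipeline[1:]):
        if upstream.returns not in downstream.accepts:
            problems.append(
                f"{upstream.name} returns {upstream.returns!r}, "
                f"but {downstream.name} accepts {list(downstream.accepts)}"
            )
    return problems


researcher = Capability("researcher", ("text/query",), "text/findings")
analyst = Capability("analyst", ("text/findings",), "text/analysis")
writer = Capability("writer", ("text/analysis",), "text/report")

good = check_chain([researcher, analyst, writer])  # every hop matches
bad = check_chain([researcher, writer])            # skips the analyst
```

A check like this could run in CI, so a mismatched declaration is caught at design time rather than in a production trace.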

The Three Topologies And Their Tear Points

mermaid
flowchart TD
    subgraph ORC["Orchestrator / Subagent - Tear Point: synthesis join"]
        O1["Orchestrator"]
        O1 -->|scoped task + deadline| W1["Subagent A"]
        O1 -->|scoped task + deadline| W2["Subagent B"]
        O1 -->|scoped task + deadline| W3["Subagent C"]
        W1 -->|AgentResult| J1{{"Synthesis join"}}
        W2 -->|AgentResult| J1
        W3 -->|AgentResult| J1
        J1 --> O1
    end

    subgraph SUP["Supervisor / Worker - Tear Point: routing decision"]
        S1{{"Routing decision"}}
        S1 -->|validated by capability| SW1["Worker A"]
        SW1 -->|result| S1
        S1 -->|validated by capability| SW2["Worker B"]
        SW2 -->|result| S1
    end

    subgraph MESH["Peer-Mesh - Tear Point: handoff"]
        P1["Agent A"] <-->|handoff schema| P2["Agent B"]
        P2 <-->|handoff schema| P3["Agent C"]
        P1 <-->|handoff schema| P3
    end

    classDef orchestrator fill:#4A90E2,stroke:#2C2C2A,color:#FFFFFF
    classDef worker fill:#98D8C8,stroke:#2C2C2A,color:#2C2C2A
    classDef tear fill:#E74C3C,stroke:#2C2C2A,color:#FFFFFF
    classDef peer fill:#FFA07A,stroke:#2C2C2A,color:#2C2C2A

    class O1 orchestrator
    class W1,W2,W3 worker
    class SW1,SW2 worker
    class J1,S1 tear
    class P1,P2,P3 peer

The red nodes are the Tear Points - the boundary each topology loads hardest. Notice they are not the same kind of object: one is a join, one is a decision, one is an edge. That is the whole argument made visual - the failure location moves with the structure.

Orchestrator / Subagent - Tear Point: the synthesis join

One agent decomposes the task, dispatches scoped work to subagents in parallel, and synthesizes the results. This is the dominant production pattern and the one Anthropic's research system uses. Each subagent runs in its own context window, which is the source of its power: parallelism and isolated reasoning a single agent cannot achieve.

Every subagent boundary feeds one synthesis step, so the synthesis join is where load concentrates and where this topology tears first. The orchestrator's final answer is only as trustworthy as its weakest subagent result, and if any result arrives malformed, empty, or timed-out without a declared status, synthesis folds the gap into a confident, fluent, wrong answer - the opening story exactly. The contracts that hold the join together are the output specification (every result validated against a schema before synthesis) and the timeout contract (a slow subagent produces a declared timed_out the join can branch on, rather than stalling or silently dropping). Harden those first. This is also the topology most exposed to runaway spend, so the capability declaration's max_cost_tokens budget is the close second.
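One way to harden the join is to make it impossible for a missing result to look like an empty one. The sketch below is illustrative only: it uses a stdlib dataclass in place of the validated result model, and `run_subagent`, `synthesis_join`, and the three toy subagents are hypothetical names. Each subagent is assumed to be a zero-argument coroutine function that returns a dict.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    status: str                      # "complete", "failed", or "timed_out"
    payload: dict = field(default_factory=dict)
    notes: str = ""


async def run_subagent(name: str, work, timeout_s: float) -> AgentResult:
    """Run one subagent; turn a timeout, crash, or empty result into a declared status."""
    try:
        payload = await asyncio.wait_for(work(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return AgentResult("timed_out", notes=f"{name} exceeded {timeout_s}s")
    except Exception as exc:  # any crash becomes a labeled failure
        return AgentResult("failed", notes=f"{name} raised {exc!r}")
    if not payload:  # an empty result is a failure, never "no findings"
        return AgentResult("failed", notes=f"{name} returned an empty payload")
    return AgentResult("complete", payload=payload)


async def synthesis_join(subagents: dict, timeout_s: float) -> dict:
    """Fan out in parallel, then join on declared statuses and report every gap."""
    names = list(subagents)
    results = await asyncio.gather(
        *(run_subagent(n, subagents[n], timeout_s) for n in names)
    )
    pairs = list(zip(names, results))
    missing = {n: r.notes for n, r in pairs if r.status != "complete"}
    return {
        "status": "degraded" if missing else "complete",
        "inputs": {n: r.payload for n, r in pairs if r.status == "complete"},
        "missing": missing,
    }


async def fast():
    return {"finding": "ok"}


async def slow():
    await asyncio.sleep(1.0)   # longer than the deadline used below
    return {"finding": "late"}


async def empty():
    return {}


outcome = asyncio.run(
    synthesis_join({"a": fast, "b": slow, "c": empty}, timeout_s=0.05)
)
```

The synthesis step then receives `inputs` and `missing` together, so a final answer built from fewer results than planned carries that fact with it.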

Use it when the task decomposes into independent parallel strands and the answer is worth the token multiplier - breadth-first research, due diligence, literature review.

Supervisor / Worker - Tear Point: the routing decision

A supervisor reads the goal, routes to the appropriate worker, collects the result, and decides the next move. The difference from an orchestrator is intent: a supervisor routes and oversees in a loop rather than fanning out and joining once. Its strength is control - per-branch budgets, the ability to cancel a worker that wanders, and an auditable trail of routing decisions.

Every task passes through one routing step, so routing is where this topology tears first. A routing decision made against a prose prompt instead of a machine-checkable capability degrades as the conversation deepens and the prompt's context fills: the supervisor starts sending work to the wrong specialist, and the same format mismatch (one agent emits YAML, the next expects JSON) cascades downstream. The contract that holds it together is the capability declaration: routing validates the request's content type against each worker's declared accepts before dispatch, so a mismatch fails at the boundary instead of three steps later. The timeout and cancellation contract is the close second, because oversight means the supervisor must be able to terminate a worker, not just wait for it. The cost of this topology is latency: every routing hop is another serial model call.
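The cancellation contract can be sketched with plain asyncio tasks. This is an illustration under assumptions, not a framework API: `supervise` and the two toy workers are hypothetical, and a real worker would need to be written so that cancellation actually interrupts it (for example, by awaiting its tool calls).

```python
import asyncio


async def supervise(worker_coro, deadline_s: float) -> dict:
    """Run a worker under supervision and cancel it if it overruns the deadline."""
    task = asyncio.create_task(worker_coro)
    await asyncio.wait({task}, timeout=deadline_s)
    if task.done():
        if task.exception() is not None:
            return {"status": "failed", "reason": repr(task.exception())}
        return {"status": "complete", "payload": task.result()}
    task.cancel()                   # request termination
    try:
        await task                  # wait until the worker has actually stopped
    except asyncio.CancelledError:
        pass
    return {"status": "cancelled", "reason": f"exceeded {deadline_s}s deadline"}


stopped = []  # records which workers ran their cleanup


async def wandering_worker():
    try:
        await asyncio.sleep(10)     # stands in for a worker that never converges
        return "never reached"
    finally:
        stopped.append("wandering_worker")  # cleanup runs on cancellation


async def quick_worker():
    return "analysis"


async def main():
    ok = await supervise(quick_worker(), deadline_s=0.5)
    cancelled = await supervise(wandering_worker(), deadline_s=0.05)
    return ok, cancelled


ok, cancelled = asyncio.run(main())
```

The point of awaiting the cancelled task is that the supervisor only reports `cancelled` once the worker has stopped, so no orphaned work keeps spending tokens in the background.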

Use it when the task needs ordering, gating, auditability, or mid-flight cancellation - compliance flows, anything where a routing decision must be traceable.

Peer-Mesh - Tear Point: the handoff

No central coordinator. Each agent hands off directly to a peer when the work crosses into another's domain, with no return trip through a hub. The strength is latency - a handoff is one hop, not a round trip - and the natural fit is conversational flows where control should follow the topic.

Control lives entirely in the handoffs, so the handoff is where this topology tears first. With no central agent to scope context, the default handoff passes full history to the peer, and across several hops the context balloons - each agent inherits everything every prior agent said, which inflates cost, slows inference, and pushes agents toward fixating on stale context. That surfaces as the loss-of-history and step-repetition modes the taxonomy catalogs: an agent buried in inherited context reverts to stale state or re-runs work a peer already completed. The contract that holds it together is the message schema: a handoff passes a scoped result and an explicit reason, never a transcript. Peer-mesh is also the hardest to debug, because there is no central log - the routing logic is distributed across every agent's handoff tools, so a misrouted conversation has no single place that decided it.
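A scoped handoff can be enforced at construction time. The sketch below is one possible shape, with every name an assumption made for illustration: the `Handoff` fields, the `hand_off` helper, the forbidden key names, and the hop cap are not taken from any framework. The shared `trace_id` is there because a mesh has no central log, so each hop has to carry its own correlation key.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

FORBIDDEN_KEYS = {"messages", "history", "transcript"}  # assumed names for raw context
MAX_HOPS = 4                                            # assumed cap on chain length


@dataclass(frozen=True)
class Handoff:
    trace_id: str      # shared by every hop, so distributed logs can be correlated
    from_agent: str
    to_agent: str
    reason: str        # why control is moving
    result: dict       # scoped result only, never the conversation
    hops: tuple = ()   # agents that have already handed off, in order


def hand_off(prev: Optional[Handoff], from_agent: str, to_agent: str,
             reason: str, result: dict) -> Handoff:
    """Build the next handoff, rejecting anything that violates the schema."""
    if not reason.strip():
        raise ValueError("handoff requires an explicit reason")
    leaked = FORBIDDEN_KEYS.intersection(result)
    if leaked:
        raise ValueError(f"handoff carries raw context: {sorted(leaked)}")
    hops = (prev.hops if prev else ()) + (from_agent,)
    if len(hops) > MAX_HOPS:
        raise ValueError(f"handoff chain exceeded {MAX_HOPS} hops: {hops}")
    trace_id = prev.trace_id if prev else uuid.uuid4().hex
    return Handoff(trace_id, from_agent, to_agent, reason, result, hops)


first = hand_off(None, "billing", "refunds",
                 "customer asked for a refund", {"order_id": "A-17"})
second = hand_off(first, "refunds", "shipping",
                  "refund needs a return label",
                  {"order_id": "A-17", "refund_ok": True})
```

Passing a dict that contains a `messages` key, or an empty reason, raises before the peer ever sees the request.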

Use it when latency matters more than central oversight and the agent count is small enough that the handoff graph stays comprehensible - interactive customer service, conversational routing.

The Deep Dive: One Failure Mode That Tears Every Topology

The Tear Point moves with topology, but one failure mode loads every topology at once: termination blindness. An agent that does not recognize when the task is complete keeps working, spawns more steps, or loops - one of the most prevalent single modes in the failure data, and the one a star topology is structurally prone to because no agent owns the "we are done" decision.

Termination blindness is not located at any single topology's Tear Point, which is exactly why it is dangerous: hardening your Tear Point does not catch it. It needs a separate, global contract: a shared, explicit definition of "done" that every agent checks against (a terminal status of complete, which the controller treats as a hard stop), plus a global run timeout that bounds every execution regardless of what the agents believe. In the code above, that is GLOBAL_RUN_TIMEOUT_SECONDS: the outer contract that fires even when every inner contract is satisfied and the agents have simply failed to stop.
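The "done" half of that contract can be sketched as a controller loop that only the terminal status can end early. This is a simplified, synchronous illustration: `run_until_done` and the toy agents are hypothetical, and the step cap is an assumption added here as a deterministic companion to the wall-clock timeout, not something the pipeline above defines.

```python
from typing import Callable

MAX_STEPS = 8  # assumed cap; complements a wall-clock run timeout


def run_until_done(step: Callable[[dict], dict], state: dict,
                   max_steps: int = MAX_STEPS) -> dict:
    """Advance the run one step at a time; stop hard on the terminal status,
    and fail closed if no agent ever declares the task complete."""
    for n in range(1, max_steps + 1):
        state = step(state)
        if state.get("status") == "complete":   # the shared definition of done
            return {"status": "complete", "steps": n, "state": state}
    return {
        "status": "degraded",
        "reason": f"no terminal status after {max_steps} steps",
        "steps": max_steps,
        "state": state,
    }


def finishing_agent(state: dict) -> dict:
    count = state["count"] + 1
    return {"count": count, "status": "complete" if count >= 3 else "working"}


def looping_agent(state: dict) -> dict:
    # Never declares completion: the termination-blind case.
    return {"count": state["count"] + 1, "status": "working"}


finished = run_until_done(finishing_agent, {"count": 0})
stuck = run_until_done(looping_agent, {"count": 0})
```

The looping agent produces a labeled degraded result after eight steps instead of running until the bill arrives.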

There is also a cost dimension that compounds at the Tear Point. Full-history handoffs and full-state rebroadcast - injecting the complete updated context into every agent that might need it - are the near-universal framework default, and they buy consistency at a cost that grows with every agent and every hop. At the peer-mesh handoff Tear Point this is the dominant failure; at the orchestrator synthesis join it is what makes runaway spend possible. The message schema is therefore doing double duty: it is a correctness contract and a cost-control contract at the same time. A scoped schema is the difference between coordination cost that scales and coordination cost that explodes.
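The gap between "scales" and "explodes" is easy to see with back-of-envelope arithmetic. The model below is deliberately crude and every number in it is an illustrative assumption: each agent is taken to add a fixed number of tokens per turn, a full-history handoff carries everything produced so far, and a scoped handoff carries a fixed-size summary.

```python
def full_history_tokens(hops: int, tokens_per_turn: int) -> int:
    """Total tokens transferred when hop k carries all k prior turns."""
    return sum(k * tokens_per_turn for k in range(1, hops + 1))


def scoped_tokens(hops: int, summary_tokens: int) -> int:
    """Total tokens transferred when every hop carries one fixed-size summary."""
    return hops * summary_tokens


# Assumed sizes: 2,000 tokens added per turn, 300-token scoped summaries.
six_full = full_history_tokens(6, 2_000)       # 2,000 * (1+2+...+6)  = 42,000
six_scoped = scoped_tokens(6, 300)             # 6 * 300              = 1,800
twelve_full = full_history_tokens(12, 2_000)   # 2,000 * (1+2+...+12) = 156,000
twelve_scoped = scoped_tokens(12, 300)         # 12 * 300             = 3,600
```

Under these assumptions, doubling the hop count doubles the scoped cost but nearly quadruples the full-history cost, because the full-history total grows with the square of the number of hops.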

Decision Guide: Find Your Tear Point Before You Build

Pick the topology by the task, then harden its Tear Point first and the rest of the surface second.

Choose the topology:

  • Task decomposes into independent parallel strands, answer worth a high token cost -> orchestrator / subagent
  • Task needs ordering, gating, auditability, or mid-flight cancellation -> supervisor / worker
  • Latency matters more than central oversight, agent count is small -> peer-mesh
  • Task fits one context window without quality loss -> do not go multi-agent at all - the token multiplier and the entire contract surface are pure overhead

Harden the Tear Point first, by topology:

Topology | Tear Point | Harden first | Close second
Orchestrator / subagent | Synthesis join | Output spec + timeout: validate every result, branch on declared status | Per-agent token budget
Supervisor / worker | Routing decision | Capability declaration: validate content type before dispatch | Cancellation contract
Peer-mesh | Handoff | Message schema: scoped result + reason, never a transcript | Distributed trace correlation

Non-negotiable for every topology, regardless of Tear Point:

  • Every agent boundary has an explicit message schema. No raw model output crosses a boundary unvalidated.
  • Every agent declares what it accepts, returns, and costs. Routing and delegation validate against the declaration, never a prompt alone.
  • Every cross-agent call has a deadline that produces a named outcome (timed_out), and the caller branches on it.
  • "Done" is defined and checked globally. A terminal status is a hard stop; a global run timeout bounds every execution. This is the one contract your Tear Point does not cover.
  • A per-agent token budget is enforced and the system fails closed - a degraded result with a stated reason beats unbounded spend or a silently incomplete answer.
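Failing closed on budget can be sketched as a small guard that refuses the next step instead of annotating an overrun after the fact. This is an illustration under assumptions: `TokenBudget`, `BudgetExceeded`, and `run_with_budget` are hypothetical names, and each step is assumed to come with a token estimate known before it runs.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the agreed limit."""


@dataclass
class TokenBudget:
    limit: int
    spent: int = 0

    def charge(self, tokens: int) -> None:
        """Record spend, refusing any charge that would exceed the limit."""
        if self.spent + tokens > self.limit:
            raise BudgetExceeded(
                f"charge of {tokens} would exceed budget "
                f"({self.spent}/{self.limit} already spent)"
            )
        self.spent += tokens


def run_with_budget(steps: list, budget: TokenBudget) -> dict:
    """Run (name, estimated_tokens) steps in order; stop with a stated reason
    at the first step the budget cannot fund."""
    completed = []
    for name, estimated_tokens in steps:
        try:
            budget.charge(estimated_tokens)
        except BudgetExceeded as exc:
            return {"status": "degraded", "reason": str(exc), "completed": completed}
        completed.append(name)   # a real system would execute the step here
    return {"status": "complete", "completed": completed}


budget = TokenBudget(limit=20_000)
outcome = run_with_budget(
    [("search", 8_000), ("read", 9_000), ("expand", 6_000)], budget
)
```

Here the third step is refused because 17,000 tokens are already spent, and the caller gets a degraded result that names both the reason and the work that did complete.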

If you can name your topology but not its Tear Point, you have chosen a shape without choosing where to reinforce it - and the failure data is clear that the reinforcement is what separates the systems that ship from the ones that quietly return wrong answers for three weeks.
