For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Subagents: How to Run Parallelism Inside a Single Agent Session Without Poisoning the Parent

Every subagent burns its own context so the parent doesn't have to. That's the entire architecture.

#subagents #multi-agent #context-engineering #claude-code #orchestrator #parallelism #context-isolation #llm-production

Your agent is four hours into a complex session. It has read 40 files, run a test suite, drafted three implementation variants, explored two dead ends, and is currently trying to audit a diff it can barely see anymore because its context is full of noise from everything it did before. The output quality has degraded visibly. The model is still smart. The context is not.

This is the core failure mode of single-context agents at scale: every operation they perform is also an operation they must carry forever. The exploration that found a dead end still occupies 8,000 tokens. The test output that confirmed a passing suite still sits in the thread. The draft variant that was rejected is still there. The parent agent is paying for every decision it ever made, not just the ones that matter now.

Subagents solve this at the architectural level. Not by making the parent smarter, not by compressing its history, but by delegating focused work to child agents that spawn in fresh context windows, do their work, and return only the result. The parent gets approximately 400 tokens of summary back. The child's entire working process - every file it read, every intermediate step, every failed attempt - is discarded. The parent stays sharp. The child burns its own context so the parent doesn't have to.

This is not a convenience feature. It is the mechanism that makes sustained, high-quality agent work possible at production time horizons.


Why Single-Context Agents Break at Depth

The failure is architectural, not incidental. A single-context agent accumulates state monotonically across its session. Each tool call appends output to the thread. Each file read adds tokens. Each test run adds the full output. There is no native mechanism to discard intermediate work that has already been processed and no longer needs to be visible.

Two mechanisms turn this accumulation into degradation:

Context rot - As established in prior Harness Engineering work, model performance degrades as the context window fills. Attention compute scales quadratically with context length, and attention weight is spread across every token in the window. Important constraints set at the beginning of the session compete for attention with 60,000 tokens of accumulated tool output. The model technically sees everything. It does not reliably act on everything.

Exploration noise - The specific failure mode for agents doing any research, analysis, or variant generation is that the exploratory process itself produces most of the noise. Reading 80 files to find 3 relevant ones means 77 files of irrelevant content permanently in the thread. The parent needed the answer from those 80 files, not the files themselves.

The naive fix is to compress or trim the context as it grows. This works to a point - the context engineering article in this series covers the strategies. But compression is lossy and adds latency. There is a class of work where the correct architectural answer is not "how do we keep the parent's context manageable" but "how do we keep this work out of the parent's context entirely."

That is what subagents are for.


The Subagent Contract

The core contract is simple enough to fit in a single sentence: a subagent receives scoped input, executes in an isolated context window, and returns a summary. The parent sees only the summary.
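As a mental model, the contract fits in a few lines of Python. Everything here - `SubagentResult`, `run_subagent`, the simulated work - is an illustrative sketch, not a real Claude Code API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubagentResult:
    summary: str         # the only thing that crosses back to the parent
    tokens_burned: int   # context the child consumed and then discarded

def run_subagent(task: str, scope: str) -> SubagentResult:
    # Simulated: the child builds a large working context in its own window...
    working_context = f"[{scope}] exploring: {task}\n" * 500   # never returned
    # ...and hands back only a compact, structured summary.
    summary = f"Findings for '{task}' in {scope}: 3 relevant files, 1 key pattern."
    return SubagentResult(summary=summary, tokens_burned=len(working_context) // 4)

result = run_subagent("where are JWT tokens validated?", "/src/auth")
assert result.tokens_burned > 10 * len(result.summary)  # child burned far more than it returned
```

The asymmetry in that final assertion is the whole point: the child's burned tokens are orders of magnitude larger than the summary the parent keeps.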

The slide above illustrates this precisely. One session, one context window for the parent. Three children - Researcher, Eval Runner, Reviewer - each spawn in fresh windows, do focused work, and return summaries. "Each child burns its own context. Parent gets ~400 tokens back."

That 400-token figure is the return on isolation. A Researcher subagent reading 80 files might consume 120,000 tokens of its own context in the process. The parent receives a 400-token summary: the three files that were relevant, the two patterns that emerged, the one finding that matters. The parent's context grew by 400 tokens, not 120,000.

In Claude Code's implementation, this maps directly to the Agent tool (formerly Task). The orchestrator invokes it with a scoped task description. The subagent executes in its own context. Subagents cannot spawn further subagents - the hierarchy is single-level, which enforces accountability: there is always exactly one orchestrator controlling the delegation tree. When done, the subagent's final message returns to the parent. Everything else in the subagent's context is discarded.

This is not the same as Agent Teams, which is Claude Code's experimental peer-to-peer coordination mode where multiple independent sessions share a task list and can communicate laterally. Subagents are hierarchical and synchronous: the parent delegates, the child executes, the result returns. Agent Teams are appropriate when workers need to share findings mid-execution or challenge each other's outputs. For the patterns covered in this article, the subagent model is sufficient and considerably cheaper.


The Wrong Way: Inline Accumulation

Here is a common pattern in early production agent implementations. The orchestrator is asked to explore a multi-repo codebase, run the test suite, and audit a diff. It does all of this sequentially in its own context:

code
# Naive: everything accumulates in the parent's context
async def analyze_codebase(task: str) -> str:
    result = ""

    # Step 1: Explore - reads 80 files, full content in thread
    for file_path in find_relevant_files(repo_root):
        content = read_file(file_path)          # 1,000-5,000 tokens each
        result += f"\n--- {file_path} ---\n{content}"

    # Step 2: Run tests - full test output in thread
    test_output = run_test_suite()              # 20,000+ tokens of stdout
    result += f"\nTest results:\n{test_output}"

    # Step 3: Audit diff - added on top of everything above
    diff = get_current_diff()
    result += f"\nDiff to audit:\n{diff}"

    # By here, context is 150,000+ tokens.
    # Model reasoning about the diff is competing with
    # 80 file reads and 20k tokens of test output for attention.
    return llm_analyze(result)

By the time the agent reaches the diff audit, its context holds the full content of 80 files and the complete test output. The model is being asked to reason about a 2,000-token diff while attending to 148,000 tokens of accumulated noise. This is not a hypothetical - this is the context profile of most single-agent implementations doing sustained multi-step work.

The measurable consequence: tasks that take the model 10 seconds to reason about correctly in a fresh 10,000-token context take 30+ seconds and produce lower-quality outputs in a polluted 150,000-token context. And the cost scales with context size, not just with the work done.


The Right Way: Delegate, Isolate, Summarize

The correct architecture delegates each bounded subtask to a subagent. The parent's context contains only the task description, the summaries returned by subagents, and the synthesis. At no point does the parent's context hold the raw artifacts that subagents produced:

code
# .claude/agents/researcher.md
# ---
# name: researcher
# description: >
#   Reads and analyzes files to answer a specific question.
#   Use when exploration would flood the main context with file contents.
#   Expects: a focused research question and a search scope.
#   Returns: a structured summary of findings only, not raw file contents.
# tools: Read, Grep, Glob
# model: claude-haiku-4-5-20251001
# ---
#
# You are a research specialist. Your job is to answer one focused question.
# Read whatever files are necessary. Return ONLY a structured summary:
# - Key findings (3-5 bullet points)
# - Files that are relevant (paths only, not content)
# - Files that are NOT relevant (so the parent doesn't re-investigate)
# - Confidence level: high / medium / low
# Do NOT return file contents. Return findings only.
code
# .claude/agents/eval-runner.md
# ---
# name: eval-runner
# description: >
#   Runs the test suite and returns a structured pass/fail report.
#   Use when test output would pollute the main context.
#   Returns: summary report with counts and failure signatures only.
# tools: Bash
# model: claude-haiku-4-5-20251001
# ---
#
# You are a test execution specialist. Run the test suite.
# Return ONLY:
# - Total: N passed, M failed, K skipped
# - Failing test names and the first error line (not full stack traces)
# - Runtime: Xs
# Do NOT return full test output. Return the report only.

In Claude Code, the orchestrator delegates via the Task tool (also called the Agent tool) which Claude invokes autonomously when a task matches a defined subagent's description. The following is illustrative pseudocode showing the delegation logic - in practice, the orchestrator is Claude itself, invoking agent definitions from .claude/agents/ automatically:

code
# PSEUDOCODE - illustrates delegation logic.
# In Claude Code: Claude invokes subagents via the Task tool automatically
# when task content matches a subagent's description in .claude/agents/

# Orchestrator delegates: parent context stays clean throughout

# Step 1: Delegate exploration → researcher subagent (defined in .claude/agents/researcher.md)
# Parent context grows by ~400 tokens (summary), not 120,000 (file contents)
research_summary = invoke_subagent(
    agent="researcher",               # matches .claude/agents/researcher.md
    task=(
        "Find all files related to authentication in this codebase. "
        "Specifically: where are JWT tokens validated? Known issues?"
    ),
    scope="/src/auth, /src/middleware",
)
# researcher reads 80 files in its own fresh context, returns 400 tokens of findings

# Step 2: Delegate test run → eval-runner subagent (runs in parallel with researcher)
# Parent context grows by ~200 tokens (report), not 20,000 (test stdout)
test_report = invoke_subagent(
    agent="eval-runner",              # matches .claude/agents/eval-runner.md
    task="Run the full test suite and return the structured report.",
)
# eval-runner runs pytest in its own fresh context, returns pass/fail report

# Step 3: Delegate diff audit → reviewer subagent (runs in parallel)
# Parent context grows by ~300 tokens (findings), not diff + surrounding context
audit_findings = invoke_subagent(
    agent="reviewer",                 # matches .claude/agents/reviewer.md
    task=(
        "Audit this diff against our security standards. "
        "Focus on: auth bypass risks, input validation gaps, error handling."
    ),
    diff=get_current_diff(),          # only the diff, not the parent's full context
)

# Parent synthesizes three ~300-400 token summaries
# Total parent context across entire operation: ~1,500 tokens of signal
# vs. 150,000+ tokens inline
return synthesize(research_summary, test_report, audit_findings)

The parent's context across the entire operation: the task description, three structured summaries totaling ~1,200 tokens, and the synthesis. The parent never saw a single file content, never processed the full test output, never held the diff alongside 80 files of context. It stayed sharp throughout.


The Four Canonical Subagent Patterns

Based on production usage across Claude Code teams and the broader practitioner community, four patterns cover the majority of legitimate subagent use cases. Each maps to a specific failure mode of single-context execution.

Pattern 1: Exploration Containment

When to use it: Any task requiring reading many files, traversing large codebases, or searching across documentation where most of what you read is irrelevant.

The failure it prevents: The parent accumulating thousands of tokens of irrelevant file content while searching for the three files that actually matter.

Contract: Subagent reads whatever is necessary. Returns paths, findings, and answers only - never raw file content. The 80-files-to-find-3 scenario costs the parent ~400 tokens, not 120,000.

A practitioner running 9 parallel code review subagents documented this pattern explicitly: each specialist (security reviewer, performance reviewer, style reviewer) reads only the files relevant to its domain and returns a structured finding list. The orchestrator synthesizes findings it never had to hold in full.

Pattern 2: Noisy Operation Containment

When to use it: Operations that produce large, verbose output that has already been processed before being useful to the parent - test runs, build outputs, log analysis, data pipeline runs.

The failure it prevents: 20,000 tokens of test stdout accumulating in the parent's context when the parent only needs to know "3 tests failed, here are the names and error signatures."

Contract: Subagent runs the operation, processes the output, returns the structured summary. The noise is discarded in the child's context, never reaching the parent.
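A minimal sketch of what this contract looks like in code, assuming pytest-style output. `summarize_test_output` and its parsing heuristics are illustrative, not part of any real tool:

```python
import re

def summarize_test_output(raw_output: str) -> str:
    """Condense verbose test stdout into the report the parent actually needs."""
    # Pull failing test names -- not full stack traces.
    failures = re.findall(r"^FAILED (\S+)", raw_output, flags=re.MULTILINE)
    counts = re.search(r"(\d+) passed.*?(\d+) failed", raw_output)
    passed, failed = (counts.group(1), counts.group(2)) if counts else ("?", "?")
    lines = [f"Total: {passed} passed, {failed} failed"]
    lines += [f"- {name}" for name in failures[:10]]  # cap the list; noise stays in the child
    return "\n".join(lines)

raw = """FAILED tests/test_auth.py::test_jwt_expiry
FAILED tests/test_auth.py::test_refresh
... thousands of tokens of stack traces ...
48 passed, 2 failed in 31.2s"""
print(summarize_test_output(raw))
```

Twenty thousand tokens of stdout in the child become a handful of lines in the parent; the raw output is discarded with the child's context.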

Pattern 3: Variant Generation and Selection

When to use it: Drafting N variants of something (implementations, designs, communications) and selecting the best one.

The failure it prevents: The parent holding all N rejected variants in its context while evaluating and selecting. Each rejected variant is noise the parent pays for permanently once it's in the thread.

Contract: Each variant is drafted by a subagent in its own context. The orchestrator receives only the variants (which are short), selects one, and discards the rejected options. The drafting process for each variant never enters the parent's context.
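A sketch of the orchestration shape, with `draft_variant` standing in for a real subagent call and a deliberately toy scoring function:

```python
from concurrent.futures import ThreadPoolExecutor

def draft_variant(spec: str, style: str) -> str:
    # Stand-in for a subagent call: the drafting process runs in an isolated
    # context; only the short finished variant crosses back to the parent.
    return f"[{style}] implementation of: {spec}"

def best_of_n(spec: str, styles: list[str], score) -> str:
    """Draft N variants in parallel, keep the winner, discard the rest."""
    with ThreadPoolExecutor(max_workers=len(styles)) as pool:
        variants = list(pool.map(lambda style: draft_variant(spec, style), styles))
    # The parent only ever holds the short variants, never their drafting history.
    return max(variants, key=score)

winner = best_of_n("retry with exponential backoff",
                   ["terse", "defensive", "functional"],
                   score=len)
assert "functional" in winner   # longest style name wins under this toy scorer
```

In a real system the scorer would itself be a reviewer subagent; the structural point is that rejected variants never persist in the parent's context.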

Pattern 4: Diff and Review Isolation

When to use it: Reviewing a change, auditing a diff, validating an output against standards - any evaluation that should not contaminate the main reasoning context.

The failure it prevents: Review context (the diff plus the reviewer's analysis) mixing with implementation context (the plan, the intermediate steps, the accepted constraints). These are semantically distinct contexts that produce lower-quality results when mixed.

Contract: The reviewer subagent receives exactly: the diff and the review criteria. It returns findings. The parent's implementation context is never exposed to the reviewer, and the review analysis never enters the implementation context.
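The contract is narrow enough to express as a payload builder. `build_review_task` and its field names are illustrative, not a real Claude Code payload:

```python
def build_review_task(diff: str, criteria: list[str]) -> dict:
    """The reviewer receives exactly two things: the diff and the criteria.
    No plan, no implementation history, no rejected approaches."""
    return {
        "agent": "reviewer",
        "task": "Audit this diff against the listed criteria. Return findings only.",
        "criteria": criteria,
        "diff": diff,
    }

payload = build_review_task("+ validate(token)", ["auth bypass risks"])
assert set(payload) == {"agent", "task", "criteria", "diff"}  # nothing else leaks in
```

The final assertion is the discipline: if the payload ever grows a field for "background" or "prior attempts", implementation context is leaking into the review.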


The Cost Profile: What You Are Trading

Subagents are not free. The decision to use them is a cost trade-off, not a pure optimization. Engineers who understand the numbers make better architectural decisions.

Startup overhead: Each subagent starts with roughly 10,000-20,000 tokens of base context before any task content is added - the model's system prompt, tool definitions, inherited configuration. For tasks that consume less than this in inline context, spawning a subagent costs more than it saves.

No amortization: Subagents provide zero amortization benefit on repeated similar queries because each call resets to a fresh context. An orchestrator that spawns a researcher subagent five times pays the startup cost five times. A persistent inline agent that builds up domain context across queries amortizes that setup cost. The right choice depends on whether accumulated context is signal (amortize) or noise (isolate).

Token multiplication: Running three parallel subagents uses roughly 3-4x the tokens of a single sequential session doing the same work, per measured production benchmarks. The time savings are real - work that takes 3 minutes sequentially takes 1 minute in parallel. The token savings are not. Princeton NLP found that a single agent matched or outperformed multi-agent systems on 64% of benchmarked tasks when given the same tools and context - multi-agent adds roughly 2 percentage points of accuracy at approximately double the cost. Subagents are not the default architecture. They are the architecture for specific, well-justified cases.

The decision heuristic from the slide is precise: reach for a subagent when the task is searching a large, multi-repo codebase; running long evals without blocking the main loop; drafting N variants and picking one; or reviewing something you don't want polluting the main context. These are the cases where the isolation benefit exceeds the startup cost.

The inverse is equally important: do not spawn a subagent when the task is a single file read, a simple lookup, a short computation, or any operation that produces less output than the subagent startup overhead. The practitioner case is clear: checking if a file exists and reading its first 10 lines does not warrant a subagent. Spawning one for that task costs 15 seconds of setup for 0.1 seconds of actual work.
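That threshold can be encoded as a one-line heuristic. The figure is a rough assumption taken from the startup-overhead estimate above, not a measured constant:

```python
def should_spawn(estimated_inline_tokens: int,
                 startup_overhead_tokens: int = 15_000) -> bool:
    """If the work would add fewer tokens inline than the subagent's own
    startup context, isolation costs more than it saves."""
    return estimated_inline_tokens > startup_overhead_tokens

assert should_spawn(120_000)       # 80-file exploration: isolate it
assert not should_spawn(2_000)     # single file read + short output: stay inline
```

The estimate itself is imprecise in practice; what matters is applying the comparison at all rather than spawning by default.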


Defining Subagents: The .claude/agents/ Format

The slide specifies the definition mechanism: .claude/agents/<name>.md. This is the persistent subagent format - defined once, available across sessions, automatically invoked when Claude's orchestrator determines the task matches the description.

code
.claude/
└── agents/
    ├── researcher.md        # Project-scoped: lives in repo, shared with team
    ├── eval-runner.md
    └── reviewer.md

~/.claude/
└── agents/
    └── security-auditor.md  # User-scoped: available across all projects

The format follows the same YAML frontmatter + Markdown body pattern as Agent Skills:

code
---
name: researcher
description: >
  Reads and analyzes files to answer a specific, bounded research question
  about this codebase. Use when file exploration would flood the main context
  with raw content. Use when given a scope and a question to answer.
  Do NOT use for tasks that require writing files or making changes.
tools:
  - Read
  - Grep
  - Glob
model: claude-haiku-4-5-20251001
---

# Researcher Agent

You are a specialist that reads files and returns structured findings.

## What you receive

- A focused research question
- An optional scope (directories or file patterns to search)

## What you return

Always return a structured summary with these sections:

### Key findings
3-5 bullet points answering the research question directly.

### Relevant files
Paths only. No content. List the files where findings came from.

### Dead ends
What you checked that was NOT relevant, so the orchestrator doesn't re-investigate.

### Confidence
high | medium | low - and one sentence on why.

## Constraints

- Return findings, not file contents
- If the question is ambiguous, answer the most plausible interpretation
  and note your assumption
- If you cannot find an answer within the given scope, say so explicitly

Two configuration decisions in this definition deserve attention:

tools restriction - The researcher is given Read, Grep, and Glob only. No Write, no Edit, no Bash. This is least-privilege enforcement at the agent boundary. A researcher that cannot write cannot accidentally modify production files while exploring. Tool restriction is not just a safety control - it is an architectural signal to the model about what its role is. Models given write tools in research contexts sometimes use them. Models given only read tools stay in their lane.

model selection - The researcher runs on claude-haiku-4-5-20251001, the cheapest available model. The orchestrator runs on a more capable model appropriate for synthesis and complex reasoning. This is the most impactful single cost optimization for subagent architectures: route focused, bounded tasks to the cheapest model that can do the job. The slide states this explicitly. File reading and grep-based exploration do not require Opus-level reasoning. Running them on Haiku cuts subagent costs by roughly 10x compared to running the same work on a frontier model.


Subagent Failure Modes to Design Around

Subagents introduce failure modes that do not exist in single-context execution. Production systems need explicit handling for each.

Silent success with wrong output - A subagent can return a success status with an empty result because the tool it called errored silently. File writes landing at wrong paths due to path confusion in isolated environments are documented in production. The rule: always validate that expected artifacts exist after subagent completion. Check session status. Confirm returned data is non-empty and well-formed. Subagents are helpers, not oracles.
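A minimal validation pass, assuming your output contract names its required sections. `validate_subagent_return` is an illustrative sketch, not a real API:

```python
import os

def validate_subagent_return(summary: str, required_sections: list[str],
                             expected_artifacts: tuple[str, ...] = ()) -> list[str]:
    """Treat a subagent's return as unvalidated input: check it before trusting it."""
    problems = []
    if not summary.strip():
        problems.append("empty summary")
    for section in required_sections:
        if section not in summary:
            problems.append(f"missing section: {section}")
    for path in expected_artifacts:
        if not os.path.exists(path):    # silent success often means missing artifacts
            problems.append(f"expected artifact missing: {path}")
    return problems

assert "empty summary" in validate_subagent_return("", ["Key findings"])
assert validate_subagent_return("Key findings: none", ["Key findings"]) == []
```

An empty problems list is the green light to incorporate the summary; anything else routes to the retry-or-escalate path.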

Context bleed on handoff - When the orchestrator's task description to a subagent is too detailed, it inadvertently imports the parent's context into the child. A researcher subagent that receives a 5,000-token task description containing implementation details, prior decisions, and rejected approaches starts its fresh context heavily polluted. The task description should be the minimum the subagent needs to answer its specific question - nothing more.

Cascading timeout - Subagents can stall. A researcher exploring a poorly specified scope can spend its entire context budget on irrelevant files without returning. Aggressive timeouts (2-5 minutes for bounded tasks) with explicit fallback behavior are mandatory. The orchestrator must have a recovery path: spawn a new subagent with a narrower scope rather than waiting indefinitely.
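A sketch of the timeout-plus-fallback recovery path using `asyncio.wait_for`. `invoke` is a hypothetical async subagent call, and the scope-narrowing rule is illustrative:

```python
import asyncio

async def delegate_with_timeout(invoke, task: str, scope: str,
                                timeout_s: float = 180.0):
    """Bound every subagent call; on timeout, retry once with a narrower
    scope instead of waiting indefinitely."""
    try:
        return await asyncio.wait_for(invoke(task, scope), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Recovery path: narrow to the first directory in the scope and retry.
        narrower = scope.split(",")[0].strip()
        return await asyncio.wait_for(invoke(task, narrower), timeout=timeout_s)
```

A second timeout should escalate to a human rather than retry again; unbounded retry loops against a stalling subagent just multiply the token burn.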

Over-decomposition - The second most common failure after under-decomposition. Splitting a task into 15 subagents when 3 would suffice adds coordination complexity, increases token overhead, and introduces more surfaces for partial failure. Anthropic's guidance suggests 2-4 subagents as the practical sweet spot for most tasks. Beyond that, coordination overhead and parallel git worktree complexity tend to outpace the gains.

Dependency-based ordering failures - Subagents that run in parallel cannot share discoveries mid-execution. If the researcher finds that authentication lives in /src/auth/v2 (not /src/auth as expected) and the reviewer was simultaneously told to audit /src/auth, the reviewer's findings are based on the wrong scope. Tasks with implicit dependencies between subagents must be sequenced, not parallelized. The slide's three examples - Researcher, Eval Runner, Reviewer - are parallel precisely because they are independent. The reviewer audits the current diff regardless of what the researcher finds.


The Clean Summary Discipline

The single most important discipline in subagent architecture is what gets returned, not what gets delegated. The subagent contract is only valuable if the summary it returns is actually concise and structured. A subagent that returns "here are all 40,000 tokens of my findings" has defeated the purpose of isolation.

This is the Clean Summary Discipline: every subagent definition must specify an output contract as precisely as its input contract. What sections does the summary contain? What is the maximum length? What must be omitted (raw content, full stack traces, verbatim file dumps)?

The output contract belongs in the subagent's SKILL.md or agent definition, not in the per-invocation task description. If the orchestrator has to remind the researcher to return summaries not file contents on every call, the researcher is underspecified.

Production subagent definitions that follow this discipline consistently outperform those that don't - not because the model is smarter, but because the model's output format is constrained at definition time rather than negotiated at invocation time.
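One way to enforce the contract mechanically at the orchestrator boundary. The 4-characters-per-token estimate and the content markers below are rough, illustrative heuristics:

```python
MAX_SUMMARY_TOKENS = 600                     # generous cap for a ~400-token summary
RAW_CONTENT_MARKERS = (                      # signatures of smuggled raw content
    "Traceback (most recent call last)",     # full stack trace
    "--- a/",                                # verbatim diff dump
)

def meets_output_contract(summary: str) -> bool:
    """Reject summaries that smuggle raw content past the isolation boundary."""
    if len(summary) // 4 > MAX_SUMMARY_TOKENS:   # ~4 chars per token, rough estimate
        return False
    return not any(marker in summary for marker in RAW_CONTENT_MARKERS)

assert meets_output_contract("Key findings:\n- JWT validated in middleware.py")
assert not meets_output_contract("Traceback (most recent call last):\n  File ...")
assert not meets_output_contract("x" * 10_000)   # ~2,500 tokens: too long
```

A failed check means the subagent definition needs fixing, not the invocation: the output contract belongs at definition time.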


Subagents in the Agent Architecture Stack

If you are building on the Harness Engineering and AI Control Plane patterns from this series, subagents sit at the execution boundary of the harness layer. The harness controls what tools are available, what policies apply, and what gets logged. Subagents operate within the harness - each subagent has its own harness instance with its own tool restrictions and policy enforcement.

This is architecturally clean: the policy enforcement in the Execution Contract Pattern from the Harness Engineering series applies per subagent, not per session. A researcher subagent gets a read-only harness. An eval runner gets a bash-restricted harness. The reviewer gets a read + diff harness. The parent orchestrator gets the synthesis harness. Policy enforcement is granular because the context boundaries are granular.

The connection to the Context Curation Loop from the previous article on context engineering is direct: subagents are the most aggressive form of context isolation in that loop. When the question is "should this output enter the parent's context, or should it be isolated to a child's context and returned as a summary?" - the subagent is the mechanism for choosing isolation.


Decision Guide: Spawn or Stay Inline?

Before spawning a subagent, answer these questions:

Is the task bounded and self-contained? If the task requires ongoing back-and-forth with the parent mid-execution, a subagent cannot deliver it. Subagents are single-shot: they receive one context, produce one result, return one summary. Tasks requiring iterative refinement based on parent feedback belong in the main thread.

Will the task produce more noise than the subagent startup cost? If the task will consume fewer than ~15,000 tokens inline, spawning a subagent likely costs more than it saves. Simple lookups, single-file reads, and short computations belong inline.

Is the task genuinely independent from other parallel work? If the task has implicit dependencies on findings from other concurrent subagents, it must be sequenced. Parallelizing dependent tasks produces incorrect results that are difficult to debug because the failure happens at the coordination layer.

Does the task require a model cheaper than the orchestrator? Bounded, focused tasks almost always do. File exploration, test execution, and format checking do not require frontier-model reasoning. Route them to Haiku. Save Sonnet or Opus for synthesis, complex reasoning, and architectural decisions.

Is the tool set for this task a strict subset of the parent's tool set? If yes, enforce it in the subagent definition. Least-privilege at the agent boundary prevents the class of failures where exploratory subagents inadvertently modify state.
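The subset check itself is trivial to enforce in review or CI. The tool names mirror the examples in this article, and `privilege_escalations` is an illustrative helper:

```python
PARENT_TOOLS = {"Read", "Grep", "Glob", "Write", "Edit", "Bash"}  # illustrative set

def privilege_escalations(subagent_tools: set[str]) -> set[str]:
    """Return any tools the subagent requests that the parent does not hold.
    A correct least-privilege definition yields an empty set."""
    return subagent_tools - PARENT_TOOLS

assert privilege_escalations({"Read", "Grep", "Glob"}) == set()
assert privilege_escalations({"Read", "WebFetch"}) == {"WebFetch"}
```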

mermaid
flowchart TD
    A[Task arrives at orchestrator]:::blue --> B{Self-contained\nand bounded?}:::purple
    B -->|No| C[Execute inline:\ntask needs parent feedback]:::grey
    B -->|Yes| D{Inline token cost\nvs startup overhead?}:::purple
    D -->|Below threshold| C
    D -->|Above threshold| E{Independent of\nother parallel tasks?}:::purple
    E -->|No - has dependencies| F[Sequence:\nrun after dependencies resolve]:::yellow
    E -->|Yes - independent| G{Tool set\nrestriction applicable?}:::purple
    G -->|Yes| H[Define restricted subagent:\nleast-privilege tool set]:::teal
    G -->|No| I[Define full-tool subagent:\nwith explicit output contract]:::teal
    H --> J[Spawn subagent:\ncheapest model that can do the job]:::green
    I --> J
    J --> K[Receive summary ~400 tokens]:::blue
    K --> L[Validate: non-empty,\nwell-formed, expected artifacts exist]:::purple
    L -->|Valid| M[Orchestrator incorporates summary]:::green
    L -->|Invalid| N[Retry with narrower scope\nor escalate]:::red

    classDef blue fill:#4A90E2,color:#fff,stroke:#3A7BC8
    classDef purple fill:#7B68EE,color:#fff,stroke:#6858DE
    classDef teal fill:#98D8C8,color:#fff,stroke:#88C8B8
    classDef yellow fill:#FFD93D,color:#333,stroke:#EFC92D
    classDef green fill:#6BCF7F,color:#fff,stroke:#5BBF6F
    classDef red fill:#E74C3C,color:#fff,stroke:#D43C2C
    classDef grey fill:#95A5A6,color:#fff,stroke:#859596

Production Checklist: Subagent Architecture Decisions

Before deploying any subagent-based workflow to production:

Definition quality

  • Does the agent definition specify an exact output contract following the Clean Summary Discipline - sections, maximum length, what to omit (raw content, full stack traces, verbatim file dumps)?
  • Is the tool set restricted to the minimum necessary for the agent's task?
  • Is the model selection intentional - cheapest model that can do the job reliably?
  • Does the description specify both when to use AND when NOT to use this subagent?

Orchestrator discipline

  • Is the task description passed to each subagent minimal - just the question and scope, not the parent's full context?
  • Are parallel subagents genuinely independent, or do they have implicit dependencies that require sequencing?
  • Is there explicit timeout handling with defined fallback behavior (retry with narrower scope, escalate to human)?
  • Does the orchestrator validate subagent output before treating it as success?

Cost awareness

  • Have you measured the token cost of the subagent startup overhead vs. the inline cost of the same task?
  • For tasks running frequently, have you considered whether accumulated context would be signal (keep inline) vs. noise (isolate)?
  • Is the number of parallel subagents bounded? Default max of 4 unless specific justification exists.

Failure handling

  • Does the orchestrator have a recovery path for subagent timeout or empty return?
  • Are expected output artifacts verified to exist after subagent completion?
  • Is there observability on per-subagent token cost, not just aggregate session cost?
