
Orchestration in Agentic AI: Tool Selection, Execution, Planning Topologies, and Context Engineering

#orchestration #agentic-ai #langgraph #tool-selection #context-engineering #react-agents #planner-executor #tool-execution #langchain #ai-systems

Most agent failures are not model failures. They are orchestration failures.

The model did exactly what it was asked to do — selecting a tool that was semantically close but operationally wrong, executing it with parameters that fit the schema but not the intent, inside a planning topology that had no recovery path, operating on a context window that was missing critical state. The model followed instructions. The instructions were wrong.

Real orchestration is about four things: which tool the model calls and how precisely it routes to the right one (tool selection), how that tool call is parameterized and executed (tool execution), how the agent plans, sequences, and recovers across steps (planning topology), and what information the model has when it makes each decision (context engineering). Get any one wrong, and your agent is confidently wrong. If your agent works in demos but falls apart in production, your orchestration is wrong — not your model.

We'll walk through each layer — not as abstractions, but as failure points you will hit in production.


The Orchestration Stack

Before going deep on any individual component, it helps to see how the layers relate:

Tool Selection
Standard · Semantic · Hierarchical
Tool Execution
Single · Parallel · Chain · Graph
Planning Topologies
ReAct · Planner-Executor · Reflection · Deep Research
Context Engineering
Memory retrieval · Dynamic context assembly

As task complexity grows, agents need to coordinate across multiple steps — managing memory, sequencing tool calls, and keeping context coherent across a long execution window. Each agent archetype makes different tradeoffs across these layers. Understanding which archetype fits which problem space will save you from over-engineering or under-engineering your system.


Agent Archetypes

Reflex Agents

A reflex agent implements a direct input-to-action mapping with no internal reasoning trace. The model sees the input, selects an action, executes it. Done.

The production scenario where this is the right choice: a high-volume triage system that classifies inbound support tickets into one of eight queues before routing them downstream. Fixed label set, sub-100ms latency requirement, thousands of requests per minute. A ReAct agent gives you the same output with ten times the latency and token cost. There is no reasoning to audit — the classification either lands in the right queue or it doesn't, and you know immediately from downstream metrics.

The failure mode is brittleness. When input falls outside the training distribution or the action space shifts — a new queue gets added, ticket language changes — reflex agents break with no recovery mechanism. There is no "I'm not sure" path. Build explicit fallback routing for out-of-distribution inputs, and monitor distribution shift before it catches you in production.
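
A minimal sketch of that fallback path, assuming the classifier can report a confidence score. The queue names, threshold, and the stubbed classify() are illustrative, not a fixed recipe:

```python
# Hypothetical sketch: confidence-gated reflex routing with a fallback queue.
# classify() stands in for a single model call returning (label, confidence).

QUEUES = {"billing", "shipping", "returns", "technical",
          "account", "fraud", "sales", "other"}
CONFIDENCE_FLOOR = 0.7  # tune against your downstream misroute rate

def classify(ticket: str) -> tuple[str, float]:
    # Placeholder for the real model call
    return ("billing", 0.92) if "invoice" in ticket else ("other", 0.40)

def route(ticket: str) -> str:
    label, confidence = classify(ticket)
    # Explicit "I'm not sure" path: anything the model can't place
    # confidently goes to human triage instead of a wrong queue.
    if label not in QUEUES or confidence < CONFIDENCE_FLOOR:
        return "human_triage"
    return label
```

Logging the fraction of tickets that land in `human_triage` doubles as a cheap distribution-shift monitor: when that fraction climbs, your input distribution has moved.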

ReAct Agents

ReAct agents run a tight loop: reason about what to do next, call a tool, incorporate the result into the next reasoning step, and continue until the task resolves.

from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate

template = """Answer the question using the tools available.
Tools: {tools}
Tool names: {tool_names}
Question: {input}
{agent_scratchpad}"""

prompt = PromptTemplate.from_template(template)
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

Where ReAct earns its cost: any task that requires adapting mid-flight — data analysis that depends on what earlier queries return, troubleshooting where the next step depends on the current error, multi-source aggregation where you don't know which sources matter until you start. The step-by-step trace also gives you something reflex agents don't: a record of why each tool was called, which makes debugging tractable when something goes wrong.

The failure modes are token inefficiency (every step burns context) and reasoning drift on long chains. If your task requires more than five or six tool calls, a ReAct agent will often start hallucinating tool arguments or forgetting earlier observations.
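
The cheapest guard against both failure modes is a hard step budget. A plain-Python sketch, with step() standing in for one reason-act-observe cycle (in LangChain, the AgentExecutor's max_iterations parameter plays the same role):

```python
# Minimal sketch of a step-budgeted ReAct loop. step() is a stand-in for
# one reason-act-observe cycle; a real agent would call the model and a tool.

MAX_STEPS = 6  # past this, argument hallucination risk rises sharply

def run_react(task: str, step) -> dict:
    observations = []
    for i in range(MAX_STEPS):
        result = step(task, observations)
        observations.append(result)
        if result.get("final"):
            return {"status": "done", "steps": i + 1, "answer": result["answer"]}
    # Budget exhausted: surface it explicitly instead of letting the loop drift
    return {"status": "budget_exhausted", "steps": MAX_STEPS, "answer": None}
```

When the budget is exhausted, escalate to a planner-executor pattern rather than raising the cap: a higher limit just gives the drift more room.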

Planner-Executor Agents

Planner-executor agents separate the planning phase from execution. The planner receives the task and produces a structured plan — a DAG or ordered list of steps. The executor works through that plan, calling tools as prescribed.

The advantages are real:

Clear decomposition. The plan is inspectable before execution begins. You can validate it, log it, and replay it.

Debuggability. When something goes wrong, you know whether the failure was in planning or execution. That distinction matters when you're trying to fix it.

Cost efficiency. The planner only runs once per task. Execution nodes can be simpler, cheaper models.

Model allocation. The planner and executor don't need to use the same model. The planner generates a DAG or step list — a reasoning-heavy task suited to a large frontier model. Each executor node runs a single, well-scoped tool call — the kind of task a smaller, faster model handles reliably. Using a 70B model for planning and a 7B model for execution can cut per-task cost by 40–60% with minimal accuracy loss on execution steps.

# Two-model planner-executor
from langchain_openai import ChatOpenAI

planner_llm = ChatOpenAI(model="gpt-4o")       # expensive, runs once per task
executor_llm = ChatOpenAI(model="gpt-4o-mini") # cheap, runs once per step

def plan_task(task: str) -> list[dict]:
    return planner_llm.invoke(planning_prompt.format(task=task))

def execute_step(step: dict, state: dict) -> dict:
    return executor_llm.invoke(execution_prompt.format(step=step, state=state))

The weakness is rigidity. A plan written at time T may be wrong by the time step N executes, especially when external state changes mid-task. Build in re-planning triggers when executor steps return unexpected results — a checkpoint that compares the actual step output against the expected output in the plan and calls the planner again if they diverge.
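
A hypothetical sketch of that checkpoint: each plan step carries an expect predicate, and replan() (an assumed helper that re-invokes the planner) regenerates the remaining steps when a result diverges:

```python
# Hypothetical re-planning checkpoint: compare each step's actual output
# against the plan's expectation and call the planner again on divergence.
# The step format, expect predicates, and replan() are assumptions.

def run_with_replanning(plan: list[dict], execute, replan, max_replans: int = 2):
    replans = 0
    i = 0
    while i < len(plan):
        step = plan[i]
        result = execute(step)
        if not step["expect"](result):               # divergence detected
            if replans >= max_replans:
                raise RuntimeError(f"step {i} diverged after {replans} re-plans")
            plan = plan[:i] + replan(plan, i, result)  # keep the completed prefix
            replans += 1
            continue                                  # retry from the revised step
        i += 1
    return "done"
```

The max_replans cap matters: without it, a persistently wrong world model turns the checkpoint into an infinite planner loop.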

Query-Decomposition Agents

Query-decomposition agents attack a complex question by breaking it into smaller, answerable sub-questions — running retrieval or tool calls against each — then assembling the individual answers into a final response. The pattern is sometimes called "self-ask with search."

The scenario where this earns its keep: a due-diligence agent that needs to answer "Is this vendor a suitable partner?" The top-level question is unanswerable as stated. But break it into sub-questions — "What is their financial stability?", "Do they have SOC 2?", "What do their support SLAs say?", "Have they had data breaches?" — and each becomes a tractable retrieval task. The decomposition forces the model to make its assumptions explicit rather than pattern-matching to an answer it hasn't actually verified. That's where the reliability gain comes from.

The failure mode is over-decomposition. If the model splits a straightforward query into a dozen sub-questions, you pay for it in latency and cost with no benefit. Set hard limits on decomposition depth — three to five levels is usually sufficient — and add a pre-decomposition check that routes simple queries directly to retrieval without decomposing at all.
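
A control-flow sketch of both limits, with is_simple(), decompose(), and retrieve() as stand-ins for model and retrieval calls:

```python
# Sketch of depth-capped decomposition with a pre-check that routes simple
# queries straight to retrieval. Only the control flow is the point here;
# the three callables are stand-ins for model and retrieval calls.

MAX_DEPTH = 3  # hard limit on decomposition depth

def answer(query: str, retrieve, decompose, is_simple, depth: int = 0) -> dict:
    # Pre-decomposition check: don't pay the overhead on easy queries,
    # and never recurse past the depth cap.
    if depth >= MAX_DEPTH or is_simple(query):
        return {"query": query, "answer": retrieve(query)}
    sub_answers = [answer(sq, retrieve, decompose, is_simple, depth + 1)
                   for sq in decompose(query)]
    return {"query": query, "sub_answers": sub_answers}
```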

Reflection Agents

Reflection agents add a self-assessment step after every action: before proceeding, the model evaluates whether the result matched expectations, whether the current plan still holds, and what needs to change. This catches reasoning errors before they compound.

The ReflAct framework pairs reflection with action in the same way ReAct pairs reasoning with action. Each tool output triggers a reflection prompt that asks the model to assess whether the result was expected, whether the plan still holds, and what should change.

reflection_prompt = """Previous action: {action}
Observed result: {observation}
Expected result: {expected}

Assess:
1. Did the result match expectations?
2. Does the current plan still hold?
3. What corrections are needed?

Reflection:"""

Where this pattern pays off: workflows involving irreversible operations — payment processing, clinical decision support, incident escalation — where a wrong committed action costs more than an extra inference call. The reflection step gives the agent a chance to catch a deviation and replan before it's too late to undo.

The cost is concrete: every reflection step is a full inference call. A 10-step agent becomes 20 LLM invocations. Budget accordingly, and restrict this pattern to workflows where the cost of a wrong committed action exceeds the cost of verification.
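
To make the gate concrete, a minimal sketch: reflect() stands in for the extra inference call, and the verdict format is an assumption for illustration. The action commits only if the reflection confirms the plan still holds:

```python
# Sketch of a reflection gate around an irreversible action. reflect() is
# the extra full inference call (the 2x cost); its verdict dict is an
# assumed format, not a fixed API.

def guarded_commit(action, observation, expected, reflect, commit, rollback):
    verdict = reflect(action, observation, expected)  # extra LLM call
    if verdict["plan_holds"]:
        return commit(action)                          # safe to make irreversible
    rollback(action)                                   # halt and replan instead
    return {"status": "halted", "reason": verdict["correction"]}
```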

Deep Research Agents

Deep research agents stack multiple patterns into a single workflow: planner-executor for structuring the research agenda, query-decomposition for breaking questions into targeted searches, ReAct for iterating on findings, and reflection for validating synthesis at each stage. The capability ceiling is genuinely high. So is everything else: latency, cost, fragility.

The non-obvious failure mode: deep research agents don't fail because they lack information. They fail because they fail to converge. The agent accumulates evidence, the context fills with partially-contradictory findings, and the synthesis step produces a confident but incoherent output. More tool calls make this worse, not better. The fix is convergence criteria enforced at each synthesis step — not just at the end — and a hard cap on search iterations per sub-question.
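
One way to sketch those convergence criteria: stop searching a sub-question once a new round of evidence no longer changes the draft synthesis. similarity() is an assumed helper (an embedding comparison would do); the thresholds are illustrative:

```python
# Sketch: enforce convergence at each synthesis step, not just at the end.
# search() and synthesize() stand in for model calls; similarity() is an
# assumed helper comparing consecutive drafts.

MAX_ITERATIONS = 5          # hard cap on searches per sub-question
CONVERGENCE_THRESHOLD = 0.95

def research_subquestion(question, search, synthesize, similarity):
    evidence, prev_draft = [], ""
    for i in range(MAX_ITERATIONS):
        evidence.append(search(question, evidence))
        draft = synthesize(question, evidence)
        # Converged: new evidence stopped changing the answer
        if prev_draft and similarity(prev_draft, draft) >= CONVERGENCE_THRESHOLD:
            return {"answer": draft, "iterations": i + 1, "converged": True}
        prev_draft = draft
    return {"answer": prev_draft, "iterations": MAX_ITERATIONS, "converged": False}
```

A non-converged result should be flagged in the final synthesis, not silently blended in: that is exactly the partially-contradictory evidence that produces confident incoherence.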

At $0.01 per 1K tokens on a mid-tier model, a 50-call run might cost $2–5 per task. Multiply that across concurrent users and you have a budget problem before you have a product. Use this archetype for high-value, low-frequency tasks where the alternative is hours of human research time.

Archetype Comparison

| Archetype | Strengths | Weaknesses | Best Fit |
| --- | --- | --- | --- |
| Reflex | Minimal latency, low cost | Brittle, no recovery | Classification, routing |
| ReAct | Auditable reasoning, adaptive | Token-heavy, drift on long chains | Exploratory analysis, troubleshooting |
| Planner-Executor | Debuggable, cost-efficient | Rigid, needs re-planning hooks | Multi-step workflows with known structure |
| Query-Decomposition | Grounded reasoning, multi-source | Over-decomposition risk | Complex Q&A, research aggregation |
| Reflection | High reliability, self-correcting | 2x inference cost | High-stakes, irreversible operations |
| Deep Research | Maximum capability, adaptive | Expensive, high latency, fragile | Research synthesis, discovery tasks |

Hard Rules

Before picking an archetype, apply these constraints. They are not guidelines — they are thresholds where a different pattern is categorically better:

flowchart LR
    A[">5 tool calls"] --> B["Use Planner-Executor"]
    C[">20 tools"] --> D["Semantic or hierarchical selection"]
    E["Irreversible step"] --> F["Add reflection or HITL checkpoint"]
    G["External state dependency"] --> H["Planner-Executor + re-plan hooks"]
    I["Latency budget <200ms"] --> J["Reflex agent or direct retrieval"]

    style A fill:#F5C4B3,stroke:#993C1D,color:#4A1B0C
    style C fill:#F5C4B3,stroke:#993C1D,color:#4A1B0C
    style E fill:#F5C4B3,stroke:#993C1D,color:#4A1B0C
    style G fill:#F5C4B3,stroke:#993C1D,color:#4A1B0C
    style I fill:#F5C4B3,stroke:#993C1D,color:#4A1B0C

    style B fill:#9FE1CB,stroke:#0F6E56,color:#04342C
    style D fill:#9FE1CB,stroke:#0F6E56,color:#04342C
    style F fill:#9FE1CB,stroke:#0F6E56,color:#04342C
    style H fill:#9FE1CB,stroke:#0F6E56,color:#04342C
    style J fill:#9FE1CB,stroke:#0F6E56,color:#04342C

These aren't rules-of-thumb you can ignore with a good argument. They're thresholds past which a simpler pattern breaks in ways that aren't fixable by prompt engineering.
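
The same thresholds can be written as an explicit dispatch function. The TaskProfile fields are assumptions about what you can estimate before dispatch; the returned labels map onto the archetypes above:

```python
# The hard rules as a routing function — a sketch, not a framework API.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    expected_tool_calls: int
    tool_count: int
    has_irreversible_step: bool
    depends_on_external_state: bool
    latency_budget_ms: int

def pick_pattern(t: TaskProfile) -> list[str]:
    # Latency is the only rule that short-circuits everything else
    if t.latency_budget_ms < 200:
        return ["reflex_or_direct_retrieval"]
    decisions = []
    if t.expected_tool_calls > 5:
        decisions.append("planner_executor")
    if t.tool_count > 20:
        decisions.append("semantic_or_hierarchical_selection")
    if t.has_irreversible_step:
        decisions.append("reflection_or_hitl_checkpoint")
    if t.depends_on_external_state:
        decisions.append("replan_hooks")
    return decisions or ["react"]
```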


Tool Selection

Adding more tools without better selection makes your system worse, not better. The model's attention is finite. More options don't increase capability — they increase noise. At small tool counts the model can figure it out. At 300 tools with overlapping descriptions and similar input schemas, you have a routing problem that requires explicit engineering at the infrastructure level, not in the prompt.

Standard Tool Selection

Standard tool selection passes the full tool list and descriptions to the foundation model and asks it to select the most appropriate tool for the given context. The output is compared against the toolset and the closest match is chosen.

Before you switch selection strategies, exhaust description engineering. A better description is cheaper than a semantic index, and it compounds — clearer descriptions improve accuracy in every strategy, not just standard. The pattern:

  • Concise, specific name: verb-noun, unambiguous (calculate_pro_rated_refund, not process_refund)
  • One-sentence purpose: what it does and when to use it, not how it works internally
  • Example invocation: a concrete input/output pair in the docstring
  • Input constraints: explicit types, allowed values, and format requirements
from langchain_core.tools import tool

# WEAK: forces the model to guess
@tool
def query_customer_db(customer_id: str) -> str:
    """Get customer data."""
    pass

# STRONG: model knows exactly when and how to call this
@tool
def get_customer_profile(customer_id: str) -> str:
    """
    Retrieve a customer's profile, account tier, and lifetime value.
    Use when you need account standing before resolving a billing dispute.

    Example: get_customer_profile("CUST-1042") -> {"tier": "premium", "ltv": 4200, ...}

    customer_id: Internal ID in format CUST-XXXX.
    Returns JSON with keys: id, tier, ltv, created_at, account_status.
    """
    pass

When accuracy is still low after description improvement, add two or three representative (task → tool) few-shot examples to the selection prompt. This can close a significant accuracy gap without any infrastructure changes — before you reach for embeddings.
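
A sketch of what that selection prompt can look like. The example pairs and wording are illustrative, not a fixed recipe:

```python
# Hypothetical few-shot selection prompt builder. The (task -> tool)
# examples are illustrative; in practice, draw them from real traffic
# where selection went wrong.

FEW_SHOT_EXAMPLES = [
    ("Customer wants money back for a cancelled annual plan",
     "calculate_pro_rated_refund"),
    ("Need the account tier before deciding on a billing dispute",
     "get_customer_profile"),
]

def build_selection_prompt(task: str, tool_descriptions: list[str]) -> str:
    examples = "\n".join(f"Task: {t}\nTool: {name}"
                         for t, name in FEW_SHOT_EXAMPLES)
    tools = "\n".join(tool_descriptions)
    return (
        f"Select the single best tool for the task.\n\n"
        f"Tools:\n{tools}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Task: {task}\nTool:"
    )
```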

This works well at small scale. The failure mode is latency at large scale — passing 300 tool descriptions in every prompt is expensive, and as descriptions start overlapping, selection accuracy degrades regardless of description quality.

Semantic Tool Selection

Semantic tool selection moves tool routing out of the prompt and into a vector retrieval step. Tool descriptions are embedded at registration time. At inference time, the agent's current task state is embedded and matched against the tool index.

flowchart LR
    A["Task embedding"] --> B["Vector similarity search"] --> C["Top-K tools"] --> D["Model selects from K"]

    style A fill:#EEEDFE,stroke:#534AB7,color:#26215C
    style B fill:#EEEDFE,stroke:#534AB7,color:#26215C
    style C fill:#CED0F6,stroke:#534AB7,color:#26215C
    style D fill:#9FA9EC,stroke:#3C3489,color:#26215C

This reduces context burden from O(total tools) to O(K) where K is typically 5–10.

In practice, retrieve a wider candidate set (say K=10) and make a second model call to filter down to the tools actually needed for the current task. The first pass is broad and cheap; the second pass is precise. This two-stage pattern catches cases where a single similarity search would return the wrong five tools.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# At registration time
tool_descriptions = [{"name": t.name, "description": t.description} for t in all_tools]
tool_index = FAISS.from_texts(
    [t["description"] for t in tool_descriptions],
    OpenAIEmbeddings(),
    metadatas=tool_descriptions
)

# At inference time: two-stage selection
def retrieve_tools(query: str, k_candidates: int = 10, k_final: int = 5) -> list:
    # Stage 1: broad semantic retrieval
    candidates = tool_index.similarity_search(query, k=k_candidates)
    candidate_tools = [t for t in all_tools
                       if t.name in [r.metadata["name"] for r in candidates]]

    # Stage 2: model call to filter to only what this task actually needs
    selected = model.invoke({
        "task": query,
        "candidate_tools": [{"name": t.name, "description": t.description}
                            for t in candidate_tools],
        "instruction": "Select only the tools required for this specific task. Return tool names as JSON list."
    })
    selected_names = parse_json_list(selected)
    return [t for t in candidate_tools if t.name in selected_names]

Two failure modes to handle: embedding retrieval misses a relevant tool (semantically dissimilar description despite functional relevance), and the second model call over-selects. For the first, maintain a set of "always-available" tools that bypass retrieval entirely — system tools like log_step or get_current_time. For the second, set a hard cap on the number of tools the model can select in the filtering pass.
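
Both safeguards fit in a few lines. The tool names and cap are illustrative:

```python
# Sketch of both safeguards: a hard cap on the filtering pass, plus
# always-available system tools that bypass retrieval entirely.

ALWAYS_AVAILABLE = {"log_step", "get_current_time"}
MAX_SELECTED = 5  # hard cap on what the filtering model call may keep

def finalize_toolset(filtered_names: list[str],
                     candidate_names: list[str]) -> list[str]:
    # Drop anything the model invented, enforce the cap, then append the
    # always-available tools (sorted for deterministic ordering).
    valid = [n for n in filtered_names if n in candidate_names][:MAX_SELECTED]
    return valid + sorted(ALWAYS_AVAILABLE - set(valid))
```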

Hierarchical Tool Selection

When your toolset is large and contains many semantically similar tools, hierarchical selection adds a second level of routing. You organize tools into groups with group-level descriptions. Selection first chooses a group, then performs a secondary search within that group.

flowchart LR
    A["Query"] --> B["Group selection"] --> C["Within-group tool selection"] --> D["Model invocation"]

    style A fill:#EEEDFE,stroke:#534AB7,color:#26215C
    style B fill:#CECBF6,stroke:#534AB7,color:#26215C
    style C fill:#AFA9EC,stroke:#3C3489,color:#26215C
    style D fill:#7F77DD,stroke:#3C3489,color:#EEEDFE

tool_groups = {
    "customer_data": {
        "description": "Tools for reading and writing customer records, profiles, and account status",
        "tools": [query_customer_db, update_customer_status, get_customer_history]
    },
    "billing": {
        "description": "Tools for invoice generation, payment processing, and refund operations",
        "tools": [create_invoice, process_payment, issue_refund]
    },
    "notifications": {
        "description": "Tools for sending email, SMS, and push notifications to customers",
        "tools": [send_email, send_sms, send_push_notification]
    }
}

def hierarchical_tool_select(query: str) -> list:
    # Stage 1: select group
    group_descriptions = {k: v["description"] for k, v in tool_groups.items()}
    selected_group = select_group(query, group_descriptions)  # your selection logic

    # Stage 2: select within group
    candidate_tools = tool_groups[selected_group]["tools"]
    return semantic_select(query, candidate_tools, k=3)

Two-stage routing, smaller search space at each level, higher precision when group boundaries are well-defined.


Tool Execution

Tool execution has two phases: parameterization and invocation. Parameterization is where most failures originate. Topology determines whether those failures stay local or cascade globally.

Parameterization

Schema design matters more than most engineers realize. Vague parameter descriptions produce garbage arguments — the model isn't guessing, it's filling in the schema you gave it.

from typing import Literal
from langchain_core.tools import tool

# BAD: vague schema
@tool
def search_orders(query: str, filters: dict) -> str:
    """Search orders."""
    pass

# GOOD: precise schema with constraints
@tool
def search_orders(
    customer_id: str,
    status: Literal["pending", "shipped", "delivered", "cancelled"],
    date_from: str,  # ISO 8601 format: YYYY-MM-DD
    date_to: str,    # ISO 8601 format: YYYY-MM-DD
    limit: int = 20  # Max 100
) -> str:
    """
    Search order records for a specific customer within a date range.
    Returns a JSON list of order objects with id, status, total, and line items.
    """
    pass

Precise types, clear constraints, and a concrete return format description. That's the whole job.

Execution Topologies

How you wire tool calls together determines your system's parallelism, latency profile, and failure propagation characteristics. Single-node failures in a chain become total failures. Single-node failures in a graph with proper consolidation stay contained.

Single Tool Execution

One tool call per agent step. Appropriate when each step depends on the previous result. Maximum debuggability, zero parallelism.

Parallel Tool Execution

When multiple tool calls are independent, run them concurrently. LangChain's RunnableParallel handles this cleanly:

from langchain_core.runnables import RunnableParallel

parallel_calls = RunnableParallel(
    customer_data=get_customer_data_tool,
    account_status=get_account_status_tool,
    recent_orders=get_recent_orders_tool
)

# All three fire simultaneously; results merge into a dict
results = parallel_calls.invoke({"customer_id": "C-12345"})

The latency of a parallel block equals its slowest member. Build timeout logic to prevent one slow external call from holding the entire block hostage.
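
A sketch of that timeout logic using asyncio (an assumption about your execution layer; LangChain runnables can also be invoked asynchronously). Timed-out calls degrade to an error sentinel instead of failing or blocking the whole block:

```python
# Sketch: per-call timeouts so one slow external call can't hold the
# parallel block hostage. The timeout value is illustrative.
import asyncio

TIMEOUT_S = 0.5

async def call_with_timeout(name, coro):
    try:
        return name, await asyncio.wait_for(coro, timeout=TIMEOUT_S)
    except asyncio.TimeoutError:
        return name, {"error": "timeout"}  # degrade, don't block the batch

async def parallel_block(named_calls: dict) -> dict:
    pairs = await asyncio.gather(
        *(call_with_timeout(name, coro) for name, coro in named_calls.items()))
    return dict(pairs)
```

Downstream steps then have to handle the sentinel explicitly, which is the point: a missing result should be a visible branch in your logic, not a silent gap in context.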

Chains

Chains wire tool outputs sequentially through LangChain Expression Language (LCEL):

from langchain_core.runnables import RunnableLambda
from langchain_core.output_parsers import JsonOutputParser

chain = (
    retrieve_customer_tool
    | RunnableLambda(lambda x: {"customer": x, "query": "purchase history"})
    | query_orders_tool
    | JsonOutputParser()
    | format_summary_tool
)

result = chain.invoke({"customer_id": "C-12345"})

Deterministic and predictable. Can't branch on intermediate results without adding conditional wrappers — use a graph when you need branching.

Graphs

Graphs are the production-grade solution for complex, non-linear workflows. Where chains force sequential execution and trees enforce strict branching, graphs let you define both conditions for branching and edges that merge parallel paths back into a shared downstream node.

LangGraph is the framework of choice here. Each node represents a discrete tool invocation or logical step. Edges — including add_conditional_edges — declare the conditions under which the agent transitions between steps.

from langgraph.graph import StateGraph, END
from typing import TypedDict, Optional

# Document processing pipeline: classify → branch by doc type → validate → store
class DocState(TypedDict):
    raw_text: str
    doc_type: Optional[str]      # "invoice" | "contract" | "report" — set after classify
    extracted: Optional[dict]    # set after extraction branch
    validated: Optional[bool]    # set after validation
    storage_id: Optional[str]    # set after successful store

def classify_document(state: DocState) -> DocState:
    doc_type = classify_doc_tool.invoke(state["raw_text"])
    return {**state, "doc_type": doc_type}

def route_by_type(state: DocState) -> str:
    # Conditional edge: branch to the right extractor
    return f"{state['doc_type']}_extractor"

def invoice_extractor(state: DocState) -> DocState:
    extracted = invoice_extract_tool.invoke(state["raw_text"])
    return {**state, "extracted": extracted}

def contract_extractor(state: DocState) -> DocState:
    extracted = contract_extract_tool.invoke(state["raw_text"])
    return {**state, "extracted": extracted}

def report_extractor(state: DocState) -> DocState:
    extracted = report_extract_tool.invoke(state["raw_text"])
    return {**state, "extracted": extracted}

def validate_and_store(state: DocState) -> DocState:
    # All branches consolidate here
    valid = validate_schema_tool.invoke(state["extracted"])
    if valid:
        storage_id = store_document_tool.invoke(state["extracted"])
        return {**state, "validated": True, "storage_id": storage_id}
    return {**state, "validated": False}

# Build the graph
workflow = StateGraph(DocState)
workflow.add_node("classify", classify_document)
workflow.add_node("invoice_extractor", invoice_extractor)
workflow.add_node("contract_extractor", contract_extractor)
workflow.add_node("report_extractor", report_extractor)
workflow.add_node("validate_and_store", validate_and_store)
workflow.set_entry_point("classify")
workflow.add_conditional_edges("classify", route_by_type)

# All three extractor branches merge into the same validation node
for extractor in ["invoice_extractor", "contract_extractor", "report_extractor"]:
    workflow.add_edge(extractor, "validate_and_store")
workflow.add_edge("validate_and_store", END)

app = workflow.compile()

The decision rule for choosing topology:

  • Start with a chain. If your task is strictly sequential (every step feeds the next, no branching), a chain is easier to reason about, test, and debug. Every additional node and edge multiplies the number of execution paths and potential error modes.
  • Escalate to a graph only when you need both branching and consolidation. If you need to branch on a condition and merge those branches back into a shared downstream step, a chain can't express this cleanly — that's when a graph earns its overhead.

Sketch the topology on paper before writing code. Label each node with its tool or logical step, draw the edges, and highlight where parallel paths reconverge. If your sketch has no merging branches, use a chain. The graph definition you'll write is exactly what you drew.

Graphs offer maximum flexibility for modeling complex, non-linear workflows. The cost is cognitive overhead — more LLM calls, more routing logic, and a larger surface area for cycles or unreachable paths. Cap your branching depth, write tests for each conditional router, and use LangGraph's built-in tracing to verify every path reaches a terminal node.


Context Engineering

Context is the actual execution environment of an LLM. The model has no persistent state, no awareness of what happened ten steps ago, and no way to reason about information that isn't in its current window. Which means the context window you construct for each invocation is the system. Most orchestration failures are context failures — everything else is downstream.

This is the most underinvested layer in agentic systems. Engineers spend weeks optimizing tool schemas and zero hours thinking about what the model actually sees when it makes a decision. That ratio is backwards.

To make it concrete: consider an e-commerce support agent handling a return request. Its context window at the classification step isn't just the user's message. It's assembled from: the system prompt defining allowed actions; the user's current message; a retrieved summary of the relevant order record; applicable return policy excerpts from a knowledge base; and the results of any tool calls already completed earlier in the workflow. Each element has to be there, in the right form, within the token budget. Leave out the policy excerpts and the model makes decisions without constraints. Include the full order history verbatim instead of a summary and you burn 2,000 tokens on data the model doesn't need at this step. Context engineering is the discipline of getting that assembly right, at every step.

Prompt Engineering Is Not Context Engineering

Prompt engineering is about crafting better instructions. Context engineering is a different problem: at each agent step, you need to decide what combination of inputs — current task goal, tool schema, prior step outputs, retrieved memory, workflow state — gets assembled into the model's window, and in what form. The quality of that assembly determines whether the model reasons correctly or confidently goes wrong.

This is what ties the other three layers together. A good planning topology produces steps that need the right context to execute. Good tool selection narrows the action space, but the model still needs the surrounding state to parameterize correctly. Without deliberate context assembly, both break down at inference time even when everything else is correct.

Context Is a State Management Problem

In a multi-step agentic workflow, context isn't a single prompt — it's a state machine. At each step, you're deciding:

  • What from the task history is still relevant?
  • Which memory fragments inform the current decision?
  • How much of the prior tool output should survive to the next step?
  • What's the right level of schema detail for the currently-selected tool?

Getting this right requires treating context construction as a first-class engineering concern, not an afterthought. The naive approach — dump everything into context — works until it doesn't, and when it fails, it fails silently. The model doesn't say "my context is too noisy." It hallucinates confidently.

Information Prioritization, Not Just Retrieval

Retrieval-augmented context is necessary but not sufficient. Retrieval gives you relevant information. Prioritization decides how much of it the model can act on without its attention being diluted.

Three types of memory serve different functions in long-horizon tasks. Episodic memory captures specific past interactions — what happened in prior agent runs, what decisions were made, what failed. This lives in a vector store and gets retrieved by semantic similarity. Semantic memory is domain knowledge — facts, policies, schemas — that informs reasoning without being task-specific. Working memory is the current task state: what step you're on, what's been decided, what's in-flight. This belongs in structured state management, not in retrieved text.

class AgentContext:
    """Constructs a deterministic context window for a single agent step."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.token_budget = {
            "goal": 200,           # Always included, fixed
            "tool_schema": 800,    # Only selected tool's schema
            "working_memory": 600, # Current step state
            "episodic": 600,       # Top-3 relevant past fragments
            "semantic": 400,       # Domain knowledge chunks
            "history": 1400,       # Compressed recent steps
        }

    def assemble(
        self,
        task_goal: str,
        current_state: dict,
        selected_tool: Tool,
        episodic_fragments: list[str],
        semantic_chunks: list[str],
        step_history: list[dict],
    ) -> str:
        goal_block = f"Task: {task_goal}"
        # Adapt to your Tool implementation: args_schema.schema() for Pydantic v1,
        # get_input_schema().schema() for Pydantic v2, or serialize manually
        tool_block = f"Tool available:\n{selected_tool.args_schema.schema()}"
        state_block = f"Current state:\n{json.dumps(current_state, indent=2)}"

        # Prioritize episodic fragments by recency and relevance score
        episodic_block = self._format_fragments(
            episodic_fragments[:3],
            budget=self.token_budget["episodic"]
        )

        # Compress history: summarize steps older than N, keep recent verbatim
        history_block = self._compress_history(
            step_history,
            budget=self.token_budget["history"]
        )

        return "\n\n".join([goal_block, tool_block, state_block,
                            episodic_block, history_block])

    def _compress_history(self, history: list[dict], budget: int) -> str:
        """Keep last 3 steps verbatim, summarize older steps."""
        recent = history[-3:]
        older = history[:-3]

        if not older:
            return self._format_steps(recent, budget)

        summary = summarize_steps(older)  # single LLM call
        recent_text = self._format_steps(recent, budget - estimate_tokens(summary))
        return f"Earlier steps (summarized):\n{summary}\n\nRecent steps:\n{recent_text}"

The Determinism Requirement

Context assembly should be deterministic. Given the same inputs — same task goal, same state, same retrieval results — the assembly function produces the same context. This sounds obvious. In practice, most teams build context assembly with hidden stochasticity: retrieval results that vary on identical queries, timestamp-dependent step compression, random sampling of memory fragments.

Non-deterministic context makes debugging nearly impossible. You can't reproduce the exact conditions that caused a bad decision. You can't write meaningful regression tests. You can't tell whether a fix worked or whether you just got lucky.

Build your context assembly as a pure function. Any stochasticity (retrieval, sampling) should happen upstream and be passed in as explicit arguments. The assembly function itself should have no side effects and no randomness.
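A minimal sketch of that separation, with all names and the toy retrieval corpus invented for illustration: the stochastic retrieval step runs upstream and pins its ordering, and the assembly function itself is pure.

```python
# Hypothetical sketch: stochastic steps run upstream, assembly stays pure.

def retrieve_fragments(query: str, k: int = 3) -> list[str]:
    # Stand-in for a vector-store lookup; in real systems this is where
    # nondeterminism lives (ANN search, tie-breaking, network ordering).
    corpus = ["refund policy v2", "billing FAQ", "escalation runbook"]
    return sorted(corpus)[:k]  # pin the order before passing downstream

def assemble_context(goal: str, state: dict, fragments: list[str]) -> str:
    # Pure function: no I/O, no clocks, no randomness. Same inputs -> same output.
    return "\n\n".join([
        f"Task: {goal}",
        f"State: {sorted(state.items())}",  # sorted for key-order stability
        "Memory:\n" + "\n".join(fragments),
    ])

# Determinism check: two calls with identical inputs must match exactly.
frags = retrieve_fragments("billing")
a = assemble_context("resolve ticket", {"id": 1}, frags)
b = assemble_context("resolve ticket", {"id": 1}, frags)
assert a == b
```

Because the function is pure, a failing run can be replayed exactly by logging its three arguments, which is the property that makes regression tests meaningful.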

Context Poisoning

A noisy or incorrect tool output early in a workflow doesn't just cause a bad step — it corrupts all subsequent reasoning. The model incorporates that output into its world model. Future decisions are made with a false premise embedded in context.

Validate tool outputs before they enter context. Set a schema for what "well-formed output" looks like and reject anything that doesn't conform. Better to surface a hard failure at step 3 than to let a malformed API response silently contaminate steps 4 through 12.

```python
def add_to_context(step_output: dict, expected_schema: dict) -> dict:
    """Validate before adding to working context."""
    errors = validate_against_schema(step_output, expected_schema)
    if errors:
        raise ContextPoisoningError(
            f"Step output failed validation. Errors: {errors}. "
            f"Output will not be added to context."
        )
    return step_output
```

Four Practices That Separate Good Context Engineering from Bad

Pull only what's relevant. Blindly appending all available memory or knowledge into context doesn't help — it dilutes attention and wastes token budget. Retrieve specifically, filter aggressively, and include only what the current step actually needs to reason correctly.

Structure the inputs. Raw text dumps make the model work harder than it needs to. Schemas, typed state dicts, and structured protocols like MCP let you pass retrieved knowledge and workflow state in a form the model can parse efficiently rather than interpret.

Compress history progressively. Don't carry full step transcripts forward indefinitely. Summarize older steps into a compact representation and keep only recent steps verbatim. The _compress_history method above does this — one summarization call buys significantly more reasoning capacity at later steps.

Rebuild at every step. Context assembled at step 1 is stale by step 5. The agent's objective, available tool, and workflow position all change as execution progresses. Treat each inference call as a fresh assembly problem, not a continuation of a static prompt.
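The first two practices can be condensed into a small selection filter. The scores, threshold, and token heuristic below are illustrative assumptions, not a prescribed implementation:

```python
# Hypothetical sketch: keep only fragments that clear a relevance bar
# and fit the token budget, instead of appending everything available.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def select_fragments(scored: list[tuple[float, str]],
                     min_score: float = 0.7,
                     budget: int = 600) -> list[str]:
    kept, used = [], 0
    for score, text in sorted(scored, reverse=True):  # best first
        if score < min_score:
            break  # filter aggressively: below the bar, stop entirely
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # skip fragments that would blow the budget
        kept.append(text)
        used += cost
    return kept

candidates = [(0.91, "Refunds over $500 require manager approval."),
              (0.85, "Duplicate charges are auto-refundable within 30 days."),
              (0.40, "Company holiday schedule for 2024.")]
selected = select_fragments(candidates)  # low-relevance fragment is dropped
```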

Every token you waste on low-signal context is a token you don't have for high-signal reasoning.


Putting It Together: A Customer Support Agent

Abstract patterns become concrete when you trace a single system end-to-end. Here's how the layers interact in a realistic customer support agent.

The task: A customer reports a billing discrepancy. The agent needs to retrieve the account, investigate the issue, determine the resolution path, and close the ticket.

Archetype: Planner-Executor. The issue type is deterministic enough to plan upfront (billing problems have a bounded resolution space), but complex enough to require explicit decomposition. A ReAct loop would work but is harder to audit when something goes wrong on a sensitive financial operation.

Tool selection: Hierarchical. The agent has access to ~40 tools across four domains: customer data, billing, notifications, and escalation. Standard selection at 40 tools is noisy. Semantic selection alone misses the billing/escalation boundary. Hierarchical selection routes to the billing group first, then selects within it.
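A sketch of what that two-stage routing might look like. The groups, the keyword router, and the tool names are invented for illustration; in practice the first stage is typically an LLM or classifier call:

```python
# Hypothetical two-stage routing: pick a tool group first, then select
# within it, so the model never sees all 40 schemas at once.

TOOL_GROUPS = {
    "customer_data": ["get_account", "get_contact_info"],
    "billing": ["get_recent_charges", "issue_refund", "apply_credit"],
    "notifications": ["send_email", "send_sms"],
    "escalation": ["create_escalation_ticket"],
}

def select_group(task: str) -> str:
    # Stand-in for an LLM or classifier routing call; keyword match for the sketch.
    keywords = {"billing": ["charge", "refund", "invoice", "credit"],
                "escalation": ["escalate", "human"],
                "notifications": ["email", "notify"],
                "customer_data": ["account", "contact"]}
    for group, words in keywords.items():
        if any(w in task.lower() for w in words):
            return group
    return "escalation"  # safe default: route ambiguity to a human

def candidate_tools(task: str) -> list[str]:
    # The second stage sees only the winning group's tools (3 schemas, not 40).
    return TOOL_GROUPS[select_group(task)]

tools = candidate_tools("Check for a duplicate charge on the account")
```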

Execution topology: Graph. The resolution path branches based on investigation results. A standard chain can't express the conditional logic; a graph can.

Context: Structured state + retrieval. Working memory lives in LangGraph's state dict (typed, deterministic, inspectable). Episodic memory retrieves past resolutions for similar billing issues. Tool schemas are added only for the active execution branch.

```python
from langgraph.graph import StateGraph, END
from langchain_core.runnables import RunnableParallel
from typing import TypedDict, Optional

class SupportState(TypedDict):
    customer_id: str
    ticket_id: str
    # Retrieved data
    account: Optional[dict]
    billing_history: Optional[list]
    # Investigation results
    discrepancy_type: Optional[str]   # "duplicate_charge" | "incorrect_amount" | "unauthorized"
    discrepancy_amount: Optional[float]
    # Resolution
    resolution_action: Optional[str]
    resolution_status: Optional[str]

# --- Planner generates the execution plan ---
def plan_investigation(state: SupportState) -> SupportState:
    plan = planner_llm.invoke({
        "task": "Investigate billing discrepancy",
        "customer_id": state["customer_id"],
        "available_steps": ["fetch_account", "fetch_billing_history",
                            "classify_discrepancy", "resolve"]
    })
    return {**state, "_plan": plan}

# --- Executor nodes ---
def fetch_account(state: SupportState) -> SupportState:
    # Hierarchical tool selection: group=customer_data, tool=get_account
    account = get_account_tool.invoke(state["customer_id"])
    return {**state, "account": account}

def fetch_billing_history(state: SupportState) -> SupportState:
    # Parallel fetch: recent charges + invoice history fire simultaneously
    results = RunnableParallel(
        charges=get_recent_charges_tool,
        invoices=get_invoice_history_tool
    ).invoke({"customer_id": state["customer_id"]})
    return {**state, "billing_history": results}

def classify_discrepancy(state: SupportState) -> SupportState:
    # Context: account data + billing history + episodic memory of similar cases
    context = context_engine.assemble(
        task_goal="Classify billing discrepancy type and amount",
        current_state={"account": state["account"], "billing": state["billing_history"]},
        selected_tool=classify_billing_issue_tool,
        episodic_fragments=retrieve_similar_cases(state["billing_history"]),
        semantic_chunks=retrieve_billing_policies(),
        step_history=state.get("_history", [])
    )
    result = classify_billing_issue_tool.invoke(context)
    return {**state,
            "discrepancy_type": result["type"],
            "discrepancy_amount": result["amount"]}

# --- Conditional routing ---
def route_resolution(state: SupportState) -> str:
    # Return values must match registered node names
    if state["discrepancy_type"] == "duplicate_charge":
        return "auto_refund"
    elif state["discrepancy_type"] == "incorrect_amount":
        return "credit_adjustment"
    else:  # "unauthorized" or unrecognized
        return "escalate"

def auto_refund(state: SupportState) -> SupportState:
    result = issue_refund_tool.invoke({
        "customer_id": state["customer_id"],
        "amount": state["discrepancy_amount"]
    })
    return {**state, "resolution_action": "refund", "resolution_status": result["status"]}

def credit_adjustment(state: SupportState) -> SupportState:
    result = apply_credit_tool.invoke({
        "customer_id": state["customer_id"],
        "amount": state["discrepancy_amount"]
    })
    return {**state, "resolution_action": "credit", "resolution_status": result["status"]}

def escalate_to_human(state: SupportState) -> SupportState:
    ticket = create_escalation_ticket_tool.invoke(state)
    return {**state, "resolution_action": "escalated", "resolution_status": ticket["id"]}

# --- Consolidation ---
def notify_customer(state: SupportState) -> SupportState:
    # All branches merge here; notification varies by resolution type
    send_resolution_email_tool.invoke({
        "customer_id": state["customer_id"],
        "resolution_type": state["resolution_action"],
        "details": state["resolution_status"]
    })
    return state

# Build the graph
workflow = StateGraph(SupportState)
workflow.add_node("plan", plan_investigation)
workflow.add_node("fetch_account", fetch_account)
workflow.add_node("fetch_billing", fetch_billing_history)
workflow.add_node("classify", classify_discrepancy)
workflow.add_node("auto_refund", auto_refund)
workflow.add_node("credit_adjustment", credit_adjustment)
workflow.add_node("escalate", escalate_to_human)
workflow.add_node("notify", notify_customer)
workflow.set_entry_point("plan")
workflow.add_edge("plan", "fetch_account")
workflow.add_edge("fetch_account", "fetch_billing")
workflow.add_edge("fetch_billing", "classify")
workflow.add_conditional_edges("classify", route_resolution)
for node in ["auto_refund", "credit_adjustment", "escalate"]:
    workflow.add_edge(node, "notify")
workflow.add_edge("notify", END)
app = workflow.compile()
```

The important decisions here aren't framework choices — they're architectural. Why planner-executor instead of ReAct? Because a financial resolution workflow needs an auditable plan before execution begins. Why hierarchical tool selection? Because routing "check for duplicate charge" to a notifications tool is an unacceptable error class. Why graph instead of chain? Because you can't know at design time whether a discrepancy will be auto-resolved or escalated.

These decisions lock in before you write a line of code. The framework just implements them.


What Breaks in Production

A few patterns that reliably cause production failures:

Tool schema drift. External APIs change. Your tool schema doesn't. The model generates arguments that were valid six months ago but are now rejected by the API. Build schema validation into your tool wrapper, not the model. In a Langfuse trace, this shows up as a spike in tool error rates on a previously stable tool, usually coinciding with a vendor deployment.
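A minimal sketch of wrapper-side validation, assuming a hypothetical refund endpoint and a hand-maintained schema snapshot:

```python
# Hypothetical sketch: validate model-generated arguments against the
# *current* API schema inside the wrapper, before the call leaves the process.

REFUND_SCHEMA = {"customer_id": str, "amount": float}  # refreshed from the vendor spec

class SchemaDriftError(ValueError):
    pass

def call_refund_api(args: dict) -> dict:
    unknown = set(args) - set(REFUND_SCHEMA)
    missing = set(REFUND_SCHEMA) - set(args)
    if unknown or missing:
        # Fail loudly here, not with an opaque 400 from the vendor.
        raise SchemaDriftError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    for key, typ in REFUND_SCHEMA.items():
        if not isinstance(args[key], typ):
            raise SchemaDriftError(f"{key} expected {typ.__name__}")
    return {"status": "ok"}  # the real HTTP call would go here

result = call_refund_api({"customer_id": "c_42", "amount": 19.99})
```

The point of the wrapper is that a drifted schema produces a typed, traceable error at the call boundary instead of a vendor rejection the model has to interpret.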

Context poisoning. A noisy or incorrect tool output early in a chain corrupts subsequent reasoning. Implement output validation at each tool step, not just at the end. The symptom in traces: the agent produces a confident, coherent answer that is factually wrong, and the error is traceable to a malformed response three steps prior — not the final step.

Unbounded recursion in ReAct loops. Set a hard limit on iterations. A ReAct agent that doesn't converge will consume your entire token budget before timing out. Five to ten iterations is a reasonable ceiling for most tasks. Detection: track iteration_count per run; anything consistently hitting your ceiling is either a broken tool or a task the agent can't actually complete.
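One way to enforce that ceiling: a hypothetical loop harness that surfaces the iteration count for tracing.

```python
# Hypothetical sketch: a hard iteration ceiling on a ReAct-style loop,
# with the count returned so traces can flag runs that hit the limit.

MAX_ITERATIONS = 8

def run_react(step_fn, max_iterations: int = MAX_ITERATIONS) -> dict:
    for i in range(1, max_iterations + 1):
        done, answer = step_fn(i)
        if done:
            return {"answer": answer, "iterations": i, "hit_ceiling": False}
    # Non-convergence is a signal, not just a timeout: log it for triage.
    return {"answer": None, "iterations": max_iterations, "hit_ceiling": True}

# A toy step function that converges on iteration 3.
result = run_react(lambda i: (i >= 3, "resolved" if i >= 3 else None))
```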

Parallelism without isolation. Parallel tool calls that share mutable state without proper locking will corrupt each other. Keep parallel calls stateless where possible. This surfaces as intermittent, non-reproducible errors — the hardest class to debug. If you're seeing flaky failures on parallel execution paths, shared state is the first thing to audit.
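A sketch of the stateless pattern: each worker receives its own deep copy of the request, and results merge only after all workers return. The fetcher functions are stand-ins.

```python
# Hypothetical sketch: parallel workers each get an isolated copy of the
# input and return values; nothing shared is mutated mid-flight.

from concurrent.futures import ThreadPoolExecutor
import copy

def fetch_charges(req: dict) -> dict:
    return {"charges": [9.99], "customer_id": req["customer_id"]}

def fetch_invoices(req: dict) -> dict:
    return {"invoices": ["inv_1"], "customer_id": req["customer_id"]}

def parallel_fetch(request: dict) -> dict:
    tasks = [fetch_charges, fetch_invoices]
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # deepcopy per task: a worker that mutates its request can't affect siblings
        futures = [pool.submit(fn, copy.deepcopy(request)) for fn in tasks]
        results = [f.result() for f in futures]
    merged = {}
    for r in results:  # the merge happens once, after all workers finish
        merged.update(r)
    return merged

out = parallel_fetch({"customer_id": "c_42"})
```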

Planning without re-planning. Planner-executor agents that don't detect mid-task failures will execute a broken plan to completion. Build explicit checkpoints that compare expected vs. actual state after each executor step. The trace signature: all executor steps succeed individually, but the final output is wrong because the plan was invalid and no step was responsible for catching it.
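A minimal checkpoint sketch, assuming each plan step is expected to populate specific state keys (the step and key names are illustrative):

```python
# Hypothetical sketch: after each executor step, compare expected vs. actual
# state and signal a re-plan instead of executing a stale plan to completion.

def checkpoint(step_name: str, expected_keys: set, state: dict) -> bool:
    missing = expected_keys - {k for k, v in state.items() if v is not None}
    if missing:
        # A silent gap here is what lets a broken plan run to the end.
        state.setdefault("_replan_reasons", []).append(
            f"{step_name}: missing {sorted(missing)}")
        return False  # signal the planner to revise the remaining steps
    return True

state = {"account": {"id": "c_42"}, "billing_history": None}
ok = checkpoint("fetch_billing", {"account", "billing_history"}, state)
```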

The Demo Trap. If your agent only works when the full tool list is in context, you don't have orchestration — you have prompt luck. This is the most common failure pattern in early-stage agentic systems: the agent performs well in a notebook with five tools and collapses in staging with fifty. The fix is not a better prompt. The fix is semantic or hierarchical tool selection so the model never sees tools irrelevant to the current step.
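A toy sketch of semantic selection. The bag-of-words "embedding" stands in for a real embedding model, and the tool descriptions are invented:

```python
# Hypothetical sketch: embed tool descriptions, embed the current step,
# keep only the top-k nearest tools so irrelevant schemas never enter context.

import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy stand-in for an embedding model

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

TOOLS = {
    "issue_refund": "refund a duplicate or incorrect charge to a customer",
    "send_email": "send a notification email to a customer",
    "get_weather": "fetch the current weather forecast for a city",
}

def select_tools(step: str, k: int = 1) -> list[str]:
    q = embed(step)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])), reverse=True)
    return ranked[:k]  # the model only ever sees these k schemas

top = select_tools("refund the duplicate charge")
```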


Conclusion

Bad orchestration is silent failure. The model completes every step, produces output, and closes the ticket — and the resolution is wrong because the context window at step 3 had a poisoned tool output, and every decision downstream was built on a false premise.

Good orchestration is controlled degradation. Every layer has explicit failure modes, validation checkpoints, and fallback behavior. When something goes wrong, you know exactly where, why, and what to do next.

The patterns in this article — agent archetypes, tool selection hierarchies, execution topologies, context assembly — are not framework features. They're engineering decisions you make before you open a code editor. LangGraph and LangChain implement these patterns; they don't replace the need to understand them.

Before writing code, answer four questions honestly: How complex is the task decomposition — does it need explicit planning or can the model reason step-by-step? How large and semantically overlapping is your tool set? What execution topology matches your branching and failure requirements? What are your constraints on latency, cost, and reliability? The answers determine the architecture. The framework is just the last mile.

Every agent step follows the same loop:

```mermaid
graph LR
    C[Context] --> D[Decision]
    D --> A[Action]
    A --> F[Feedback]
    F --> C
    style C fill:#4A90D9,color:#fff,stroke:none
    style D fill:#7B68EE,color:#fff,stroke:none
    style A fill:#50C878,color:#fff,stroke:none
    style F fill:#FF8C42,color:#fff,stroke:none
```

Orchestration is the engineering discipline that controls what enters each stage of that loop. Tool selection controls which actions are available. Tool execution controls how those actions are parameterized and run. Planning topology controls how the agent sequences, branches, and recovers. Context engineering controls what the model sees when it makes each decision.

Orchestration quality = f(
    Tool selection,          # which action the model picks
    Tool execution,          # how that action is run
    Planning topology,       # how steps are sequenced and recovered
    Context engineering      # what the model sees at each step
)

Most teams optimize one of these layers. The ones that ship optimize all four.


Further Reading

If you want to go deeper on LangGraph's state management, graph topology, and checkpoint patterns, the Building Real World Agentic AI Systems with LangGraph series covers these in full — including interrupt handling, human-in-the-loop workflows, and production deployment patterns that build directly on the architecture decisions discussed here.

