For engineers building multi-step or long-horizon agents in production.
Most developers who build multi-step agents treat the task list as a config artifact. You define it before execution starts, pass it in as part of the initial state, and the agent works through it top to bottom. Simple, predictable, easy to debug.
It works fine — until it doesn't.
The failure mode is subtle enough that you often miss it the first few times. The agent completes every task. The execution log looks clean. But the outcome is wrong — the agent finished the plan instead of achieving the goal. The task list became the reference frame, and mid-execution signals that pointed somewhere more useful got ignored because they weren't on the list.
This is not a bug in your agent. It is a design choice that was never consciously made.
The core question nobody asks upfront
Before you write a single node in your LangGraph workflow, there is one question that determines your entire task list architecture: does the next task depend only on whether the previous task completed, or on what it actually found?
The answer maps to one of three patterns:
Execution Contract Pattern (static task list) — fixed plan, execution-driven. The structure is defined before the workflow starts and never changes. Only status fields mutate.
Discovery Hypothesis Pattern (dynamic task list) — evolving plan, learning-driven. The task list is a hypothesis that gets revised as the agent uncovers findings. Structure mutates at replanning boundaries.
Hybrid Boundary Pattern — an explicit transition between the two. Static for deterministic collection, dynamic for open-ended synthesis, with a designed handoff point between them.
Before getting into which pattern fits your problem, it is worth understanding what a task list actually is in an agentic system — because the answer changes how you think about both patterns.
An LLM's reasoning is ephemeral. It happens inside a forward pass and then disappears. The context window captures inputs and outputs, but not the intermediate structure of the reasoning process — what was considered, what was discarded, what was deferred. For single-turn interactions this does not matter. For multi-step agents running over minutes or hours, it matters enormously.
A task list is how you externalize that reasoning structure into persistent, inspectable state. When an agent writes "investigate hypothesis B before hypothesis A because finding X suggests B is more likely," it is not just scheduling work — it is recording a reasoning decision in a form that survives context window boundaries, crashes, replanning cycles, and human handoffs. The task list is the agent's working theory of how to solve the problem, made visible.
The Execution Contract Pattern says the reasoning happened upfront and execution is mechanical. The Discovery Hypothesis Pattern says reasoning is continuous — the plan reflects the current best understanding at every point in execution. Both are valid. The choice is determined by the epistemic structure of your problem, not by preference.
It also connects directly to interpretability and auditing. A well-maintained task list — with status, results, and replan history — is one of the most useful artifacts you can produce from an agentic workflow. Not as a log, but as a record of how the agent thought about the problem. For any workflow where the output will be reviewed, the task list history is often more valuable than the output itself.
```mermaid
flowchart TD
    Q{"Does output change\nwhat comes next?"}
    Q -->|No| EC["Execution Contract Pattern"]
    Q -->|Yes| DH["Discovery Hypothesis Pattern"]
    EC --> B{"Both phases present?"}
    DH --> B
    B -->|Yes| HB["Hybrid Boundary Pattern"]
    style Q fill:#EAB308,color:#0f172a,stroke:#0ea5e9
    style DH fill:#6366F1,color:#FFFFFF,stroke:#0ea5e9
    style EC fill:#3B82F6,color:#FFFFFF,stroke:#0ea5e9
    style B fill:#EAB308,color:#0f172a,stroke:#0ea5e9
    style HB fill:#8B5CF6,color:#FFFFFF,stroke:#0ea5e9
```
A file processing pipeline is an Execution Contract problem. Each file gets validated, transformed, and written. Whether you transform file 3 does not depend on what file 2 contained — only that it completed. Lock the list, use it as a checkpoint, move on.
A research agent is a Discovery Hypothesis problem. Whether you run a second round of searches depends entirely on what the first round returned. You cannot define the full plan before execution starts, because the plan is a function of what you learn.
The mental model in one diagram
Before getting into code, here is the structural difference between the two patterns. The ASCII version is blog-embeddable; the Mermaid version renders natively in your MDX setup.
```
Static (Execution Contract Pattern):

[T1 pending] → execute → [T1 done]
[T2 pending] → execute → [T2 done]
[T3 pending] → execute → [T3 done]
[T4 pending] → execute → [T4 done] → ✓

Structure fixed. Only status moves.
```

```
Dynamic (Discovery Hypothesis Pattern):

[T1 pending] → execute → findings → REPLAN?
                                       ↓ yes
[T1 done] [T2 REVISED] [T3 REVISED] [T4 NEW]
    → execute → findings → REPLAN? → ...

Structure mutates based on what execution found.
```
```mermaid
flowchart LR
    subgraph Static["Execution Contract Pattern"]
        direction TB
        S1[T1: pending] --> SE1[execute] --> S1D[T1: done]
        S2[T2: pending] --> SE2[execute] --> S2D[T2: done]
        S3[T3: pending] --> SE3[execute] --> S3D[T3: done]
    end
    subgraph Dynamic["Discovery Hypothesis Pattern"]
        direction TB
        D1[T1: pending] --> DE1[execute] --> DF1{findings}
        DF1 -->|replan| RP[replan node]
        RP --> D2N[T2: REVISED]
        RP --> D3N[T3: REVISED]
        RP --> D4N[T4: NEW]
        D2N --> DE2[execute] --> DF2{findings}
        DF2 -->|continue| D3N
    end
```
The state evolution tells the whole story. In the static case, the task list is a checklist — structure fixed, only checkmarks move. In the dynamic case, the task list is a living hypothesis — the structure itself changes as the agent learns what it is dealing with.
The three patterns at a glance
| Aspect | Execution Contract | Discovery Hypothesis |
|---|---|---|
| Task mutation | Never — status only | Yes — tasks rewritten on replan |
| Predictability | High — same path every run | Medium — path depends on findings |
| Cost | Low — no replanning LLM calls | Higher — each milestone adds $0.05–$0.20 (GPT-4o) |
| Crash recovery | Clean — resume from checkpoint | Requires careful state preservation |
| Use case | Pipelines, ETL, structured generation | Research, debugging, exploratory analysis |
| Goal displacement risk | Low | High if replanning is unconstrained |
| Primary failure mode | Stale assumptions | Over-replanning or plan drift |
If most of the left column describes your workflow, use the Execution Contract Pattern. If most of the right column does, use the Discovery Hypothesis Pattern — but guard it carefully. If your workflow has both phases, you need the Hybrid Boundary Pattern.
The cost row deserves more than a table cell. Every replanning milestone requires two LLM calls — one to evaluate whether to replan (the should_replan check) and one to rewrite the task list if the decision is yes. At GPT-4o pricing, each pair runs roughly $0.05–$0.20 depending on context size. Across three milestones in a single run, that is $0.15–$0.60 in replanning overhead before the agent does any real work. This is not a reason to avoid the Discovery Hypothesis Pattern — but it is the financial argument for the Hybrid Boundary Pattern. If your workflow has a deterministic collection phase, running that phase under the Execution Contract (zero replanning cost) and reserving replanning for the synthesis phase is not just architecturally cleaner. It is cheaper.

The Cost-to-Discovery Ratio — how much replanning overhead you pay per meaningful plan revision — is the metric worth tracking. If you are spending $0.40 per run on replanning and the replan fires zero or one times, your milestones are too frequent or your trigger criteria are too sensitive.
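The Cost-to-Discovery Ratio is easy to compute from run telemetry. A sketch — the per-call costs are placeholders, not measured values; plug in your own numbers:

```python
# Sketch of the Cost-to-Discovery Ratio: replanning dollars spent per
# meaningful plan revision. The per-call costs below are placeholders.
def cost_to_discovery_ratio(
    milestones_checked: int,
    replans_fired: int,
    cost_per_check: float = 0.05,    # the should_replan evaluation call
    cost_per_rewrite: float = 0.15,  # the task-list rewrite call
) -> float:
    total = milestones_checked * cost_per_check + replans_fired * cost_per_rewrite
    if replans_fired == 0:
        return float("inf")  # pure overhead: milestones too frequent or triggers too tight
    return total / replans_fired

print(round(cost_to_discovery_ratio(milestones_checked=3, replans_fired=1), 2))  # 0.3
```

An infinite ratio — replanning overhead with zero replans — is the clearest signal that your milestone schedule needs thinning.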
What goal displacement actually looks like
Goal displacement is when an agent optimizes for completing the plan rather than achieving the underlying intent. The task list becomes the goal, not the means to a goal.
Here is a concrete example. You give a research agent this task list:
1. Search for recent papers on transformer efficiency
2. Extract key findings from top 5 results
3. Identify common themes across findings
4. Draft a summary of the current state of the field
The agent runs step 1 and finds that the most relevant recent work is actually on state space models, not transformers, and several major labs have publicly moved research resources in that direction. This is directly relevant to the user's underlying question about the current state of efficient sequence modeling.
A static agent ignores this. It extracts findings from the transformer papers it found, identifies themes within those papers, and produces a summary that is accurate within the bounds of its task list but misleading relative to what the user actually needed to know. It completed the plan. It failed the goal.
A dynamic agent reads the step 1 output, recognizes that the original framing is stale, rewrites the task list to include state space model literature, and produces a summary that reflects the actual current state of the field. It revised the plan. It achieved the goal.
The difference is not intelligence. It is architecture.
A real failure: when we built the wrong one
We built a competitive intelligence agent for a client in the fintech space. The task was straightforward on paper: monitor ten sources daily, extract mentions of three competitor products, classify sentiment, produce a summary report. Classic execution workflow. We gave it a static task list, it ran clean in staging, and we shipped it.
Three weeks in, one of the three competitor products was quietly discontinued. No press release — just a product page that went 404 and a support forum thread. The agent kept monitoring for it. Every day it searched, found nothing meaningful, classified the silence as neutral sentiment, and included a "no significant mentions" entry in the report. The summary looked complete. Nobody flagged it.
Six weeks later the client flagged it — not because someone read the report carefully, but because a competitor launched a replacement product and the intelligence trail showed nothing. They went back through six weeks of reports, all of them complete, all of them wrong in the same way. The agent had completed every task correctly. The task list had a slot for that product, the slot got filled, the report got generated. Nothing broke. The goal — track what matters in the competitive landscape — had been silently failing since week three. The fix took two days of engineering. The damage to trust took longer to repair.
The fix was a weekly source validation node that ran before the main collection loop. If a monitored source returned consistent 404s or near-zero content for five consecutive days, it flagged the product for human review before the report ran. That node had a dynamic element — it could surface a replan signal to a human operator. The collection loop itself stayed static.
The lesson was not "use dynamic task lists." It was: static task lists require that your assumptions about the world stay valid for the lifetime of the workflow. When they might not — because sources change, products get discontinued, APIs deprecate — you need either a validation gate or a replan trigger. Building the right one requires knowing which assumption is at risk.
The Execution Contract Pattern: static task lists
When the problem is deterministic and sequential, a static list is the correct choice. It gives you predictable execution, clean checkpointing, and straightforward observability. Do not reach for dynamic replanning just because it sounds more sophisticated.
The right signal is that your tasks are execution steps, not discovery steps. The tasks do not produce information that changes what tasks come next — only information that feeds into subsequent tasks as input.
Document processing pipelines, structured data extraction across a fixed schema, code generation workflows where the steps are defined by a spec, form completion, multi-step API orchestration where the sequence is known — these are static problems.
In LangGraph, a static task list lives in state and gets updated only by marking items complete. The structure is simple:
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    tasks: list[dict]  # [{"id": str, "description": str, "status": str, "result": any}]
    current_task_index: int
    context: str
    final_output: str

def get_current_task(state: AgentState) -> dict | None:
    tasks = state["tasks"]
    idx = state["current_task_index"]
    if idx >= len(tasks):
        return None
    return tasks[idx]

def execute_task(state: AgentState) -> dict:
    task = get_current_task(state)
    if task is None:
        return state
    # Execute the task — call tools, LLM, whatever the task requires
    # (run_task_logic is your task executor, defined elsewhere)
    result = run_task_logic(task, state["context"])
    # Mark complete, store result, advance index
    updated_tasks = state["tasks"].copy()
    updated_tasks[state["current_task_index"]] = {
        **task,
        "status": "complete",
        "result": result,
    }
    return {
        "tasks": updated_tasks,
        "current_task_index": state["current_task_index"] + 1,
        "context": state["context"] + f"\n\nStep {task['id']} result: {result}",
    }

def should_continue(state: AgentState) -> str:
    if state["current_task_index"] >= len(state["tasks"]):
        return "done"
    return "execute"

workflow = StateGraph(AgentState)
workflow.add_node("execute", execute_task)
workflow.set_entry_point("execute")
workflow.add_conditional_edges("execute", should_continue, {
    "execute": "execute",
    "done": END,
})
# checkpointer: your LangGraph checkpointer instance, configured elsewhere
app = workflow.compile(checkpointer=checkpointer)
```
The task list is immutable after initialization. The agent only updates status and result fields. This gives you clean crash recovery — if the agent dies at task 4, you resume with current_task_index: 3 and replay from there.
The checkpoint is the to-do list. You do not need to reconstruct state from conversation history. You read the task list, find the first incomplete task, and continue. This is the underrated production argument for task lists — not token compression, but crash recovery without a full restart. If you want to go deeper on why this matters, the to-do list as a context management primitive covers the full case.
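The resume logic itself is a few lines. A minimal sketch of the task-list side (LangGraph's checkpointer handles the persistence; this is the pure-Python core of "find the first incomplete task and continue"):

```python
# Minimal sketch of checkpoint-based resume: the task list itself is the
# checkpoint. No conversation-history replay — find the first incomplete
# task and continue from there.
def resume_point(tasks: list[dict]) -> int:
    """Index of the first task still needing work; len(tasks) if all done."""
    for i, task in enumerate(tasks):
        if task["status"] != "complete":
            return i
    return len(tasks)

tasks = [
    {"id": "t1", "status": "complete"},
    {"id": "t2", "status": "complete"},
    {"id": "t3", "status": "pending"},  # the agent died here
    {"id": "t4", "status": "pending"},
]
print(resume_point(tasks))  # 2
```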
The Discovery Hypothesis Pattern: dynamic task lists
Dynamic replanning is for problems where the work is exploratory by nature. Research, debugging, root cause analysis, open-ended code review, anything where you are discovering the shape of the problem as you go. You cannot write the task list before execution starts because you do not know what you will find.
The pattern introduces a replanning node — a point in the graph where the agent reads its accumulated findings and decides whether the current plan still makes sense. It is a deliberate pause for reflection, not a reactive response to failure.
```python
from typing import TypedDict
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_openai import ChatOpenAI
import json

# Two separate LLM clients — intentional, not a refactor target.
# Execution tasks can tolerate slight temperature for varied phrasing.
# Replanning must be temperature=0: any creativity in the replan node risks
# the agent hallucinating its way out of a difficult task by inventing
# convenient findings or generating tasks that drift from the original goal.
# Prompt drift compounds across replan cycles — a temperature of even 0.2
# on the replan LLM will produce measurably different task lists on identical
# inputs across runs, making debugging and reproducibility significantly harder.
execution_llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
replan_llm = ChatOpenAI(model="gpt-4o", temperature=0)  # strict zero — no exceptions

class DynamicAgentState(TypedDict):
    original_goal: str
    tasks: list[dict]
    current_task_index: int
    accumulated_findings: list[str]
    replan_count: int
    context: str
    task_list_versions: list[dict]  # snapshot at each replan boundary — required for auditing
    replan_trigger: str  # which of the 4 triggers fired — set by should_replan

def should_replan(state: DynamicAgentState) -> str:
    """
    Called after key milestones — not after every task.
    Replanning has a cost. Call it at natural checkpoints, not continuously.
    """
    # Only replan at defined milestones, not after every task
    milestone_indices = [2, 5, 8]
    current_idx = state["current_task_index"]
    if current_idx not in milestone_indices:
        return "continue"
    # Hard limit on replanning cycles to prevent infinite loops
    if state["replan_count"] >= 3:
        return "continue"

    findings_summary = "\n".join(state["accumulated_findings"])
    remaining_tasks = [t for t in state["tasks"] if t["status"] == "pending"]

    response = replan_llm.invoke([
        SystemMessage(content="""You are evaluating whether an agent's task plan still makes sense given what it has discovered so far. Respond with exactly one word: REPLAN or CONTINUE.

Trigger REPLAN only when findings meet at least one of these conditions:
1. Contradiction — a finding directly invalidates an assumption the remaining tasks depend on (e.g. the source being queried no longer exists)
2. Obsolescence — a finding shows that 2 or more remaining tasks will produce redundant work (e.g. a paper you planned to synthesize already synthesizes it)
3. Scope shift — findings show the original framing of the problem is wrong or incomplete in a way that changes what needs to be investigated
4. New critical path — a finding reveals a shorter or higher-confidence route to the goal that makes the existing path wasteful

Do NOT trigger REPLAN for:
- Surprising but non-actionable findings
- Partial confirmations of what existing tasks will cover anyway
- Minor additions that can be absorbed within existing task scope
- Task failures due to transient errors (use retry logic, not replanning)"""),
        HumanMessage(content=f"""Original goal: {state['original_goal']}

What we've found so far:
{findings_summary}

Remaining planned tasks:
{[t['description'] for t in remaining_tasks]}

Should the agent revise its plan based on these findings?""")
    ])
    decision = response.content.strip().upper()
    # Note: in a full implementation, parse which trigger fired from the LLM response
    # and write it to state["replan_trigger"] before routing to the replan node.
    # Here we use a simplified binary decision for clarity.
    return "replan" if decision == "REPLAN" else "continue"

def replan(state: DynamicAgentState) -> dict:
    """
    Rewrites the pending task list based on accumulated findings.
    Completed tasks are preserved — only pending tasks change.
    """
    findings_summary = "\n".join(state["accumulated_findings"])
    completed = [t for t in state["tasks"] if t["status"] == "complete"]
    # Determine which trigger fired — passed in via state or derived from should_replan
    # In practice, store the trigger reason in state before calling this node
    replan_trigger = state.get("replan_trigger", "unspecified")

    response = replan_llm.invoke([
        SystemMessage(content="""You are replanning an agent's remaining tasks based on what it has discovered.
Return a JSON array of task objects, each with 'id', 'description', and 'status: pending'.
Only return the NEW remaining tasks. Do not repeat completed ones.
Return raw JSON only — no markdown, no explanation."""),
        HumanMessage(content=f"""Original goal: {state['original_goal']}

Completed tasks so far:
{[t['description'] for t in completed]}

Findings so far:
{findings_summary}

Write the revised task list for what remains to be done.""")
    ])
    new_tasks = json.loads(response.content)

    # Preserve completed tasks, append new pending ones
    updated_tasks = [t for t in state["tasks"] if t["status"] == "complete"]
    updated_tasks.extend(new_tasks)

    # Version snapshot — full task list state at this replan boundary
    snapshot = {
        "version": state["replan_count"] + 1,
        "trigger": replan_trigger,
        "tasks_before": [t["description"] for t in state["tasks"] if t["status"] == "pending"],
        "tasks_after": [t["description"] for t in new_tasks],
        "findings_at_replan": state["accumulated_findings"][-3:],
    }
    versions = state.get("task_list_versions", [])
    versions.append(snapshot)

    return {
        "tasks": updated_tasks,
        "current_task_index": len(completed),
        "replan_count": state["replan_count"] + 1,
        "task_list_versions": versions,
    }
```
The graph now has a decision point after milestones:
```python
workflow = StateGraph(DynamicAgentState)
workflow.add_node("execute", execute_task)
workflow.add_node("check_replan", check_replan_node)  # can be a pass-through node; the routing decision lives in should_replan
workflow.add_node("replan", replan)

def tasks_remaining(state: DynamicAgentState) -> str:
    """Termination check — separate from replanning logic."""
    pending = [t for t in state["tasks"] if t["status"] == "pending"]
    return END if not pending else "check_replan"

workflow.set_entry_point("execute")
workflow.add_conditional_edges("execute", tasks_remaining, {
    "check_replan": "check_replan",
    END: END,
})
workflow.add_conditional_edges("check_replan", should_replan, {
    "replan": "replan",
    "continue": "execute",
})
workflow.add_edge("replan", "execute")
```
Two things in this implementation are non-negotiable in production. First, completed tasks are never touched. Only pending tasks get rewritten. This preserves the checkpoint property — if you crash after a replan, you still resume from the last completed task. Second, replanning has a hard ceiling. Three replanning cycles is a reasonable limit for most workflows. Without this, you can end up in a loop where findings from replanned tasks trigger another replan indefinitely.
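Both properties are cheap to enforce mechanically. A sketch of an invariant check — a hypothetical test helper, not LangGraph API — suitable for unit tests or a post-replan assertion:

```python
# A cheap mechanical check for both invariants — a hypothetical test helper,
# not LangGraph API. Run it in unit tests or as an assertion after each replan.
def check_replan_invariants(
    before: list[dict],
    after: list[dict],
    replan_count: int,
    max_replans: int = 3,
) -> None:
    completed_before = [t for t in before if t["status"] == "complete"]
    completed_after = [t for t in after if t["status"] == "complete"]
    # Invariant 1: completed tasks are immutable — same records, same order
    assert completed_after[: len(completed_before)] == completed_before, \
        "replan mutated completed tasks"
    # Invariant 2: replanning has a hard ceiling
    assert replan_count <= max_replans, \
        f"replan ceiling exceeded: {replan_count} > {max_replans}"

before = [{"id": "t1", "status": "complete"}, {"id": "t2", "status": "pending"}]
after = [{"id": "t1", "status": "complete"}, {"id": "t2b", "status": "pending"}]
check_replan_invariants(before, after, replan_count=1)  # passes silently
```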
What actually triggers replanning in production
The REPLAN or CONTINUE decision is where most dynamic agent implementations break down. The prompt above gives the LLM clear criteria, but it is worth understanding what each trigger means in practice — because the failure modes are different for each one.
Contradiction is the cleanest trigger. A finding directly kills an assumption the remaining tasks rely on. An agent researching "current Python packaging best practices" finds in step two that the tool it was about to spend three steps documenting was deprecated six months ago. Every remaining task built on that assumption is now wrong. Replan.
Obsolescence is the most common trigger in research workflows. The agent finds in step three that a paper, dataset, or resource already does the work that steps four through six were going to do. Continuing is not wrong — it is just wasteful. If two or more remaining tasks become redundant, the replanning cost pays for itself. If only one task is affected, absorb it and continue.
Scope shift is the hardest to detect reliably and the most dangerous to act on aggressively. The findings suggest the original framing was incomplete. This is the trigger most likely to cause goal drift if the replanning prompt is not anchored to the original goal. A research agent that discovers a tangential thread and rewrites its entire task list to follow it is not replanning — it is going off-task. The safeguard is to always evaluate proposed task revisions against the original goal before accepting the replan. If more than half of the new tasks do not map back to the original goal, reject the replan and continue.
New critical path is the most valuable trigger when it fires correctly. The agent finds a shortcut — a direct answer, a pre-existing synthesis, a high-confidence result that makes the long route unnecessary. Taking this path requires trusting the finding, which is where confidence scoring becomes useful. If the finding comes from a single source or has no corroboration, the shortcut might be wrong. Build a minimum confidence threshold into your replanning criteria: a new critical path should only trigger a replan if the finding meets a corroboration threshold you define upfront.
The four triggers share a common property: they all describe situations where the original plan will produce provably worse results than a revised plan. If you cannot make that case explicitly, do not replan. Curiosity is not a trigger. Interesting findings are not a trigger. Only findings that make the remaining plan actively wrong or actively wasteful justify the cost and risk of replanning.
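The corroboration threshold for the New-critical-path trigger can be as simple as counting independent sources. A sketch — the finding shape and threshold here are assumptions, to be tuned per domain:

```python
# Sketch of a corroboration gate for the New-critical-path trigger: trust the
# shortcut only if it is backed by enough independent sources. The finding
# shape and threshold are assumptions — tune them to your domain.
def corroborated(finding_sources: list[str], min_independent: int = 2) -> bool:
    return len(set(finding_sources)) >= min_independent

print(corroborated(["arxiv.org", "arxiv.org"]))           # False — one source, repeated
print(corroborated(["arxiv.org", "paperswithcode.com"]))  # True
```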
The failure modes that will hit you in production
Replanning on every task. Calling the replanning check after every single task is expensive and usually unnecessary. Replanning requires an LLM call to evaluate findings and potentially another to rewrite the task list. Do it at natural milestones — after an initial research phase, after a validation phase, after a synthesis phase — not after every individual step. The right frequency depends on your workflow, but it is almost never every task.
Losing the checkpoint on replan. A naive replan implementation replaces the entire task list, including completed tasks. Now you have lost your resume point. If the agent crashes after a replan, you have no idea which tasks were actually completed. Always preserve completed task records and append new tasks rather than replacing the full list.
Unbounded replanning loops. An agent discovers something surprising in step 3, replans, discovers something else in step 5, replans again, discovers something in the new step 3, replans again. Without a ceiling, this runs until you hit a token limit or a timeout. Set a hard limit on replan cycles and respect it.
Goal drift through successive replanning. Each replan moves slightly away from the original goal. Three or four replans later and the agent is solving a different problem than the one you gave it. The fix is to always include the original goal verbatim in the replanning prompt and evaluate proposed task revisions against it explicitly before accepting them.
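The goal-anchoring check can be enforced mechanically before a replan is accepted. A sketch — the keyword-overlap heuristic is a stand-in for what would, in production, be an LLM call that includes the original goal verbatim:

```python
# Sketch of the goal-anchoring gate: reject a replan if fewer than half of the
# proposed tasks map back to the original goal. The keyword-overlap heuristic
# is a stand-in — in production this check is itself an LLM call that includes
# the original goal verbatim.
def accept_replan(original_goal: str, proposed_tasks: list[str]) -> bool:
    goal_terms = set(original_goal.lower().split())
    aligned = sum(
        1 for task in proposed_tasks
        if goal_terms & set(task.lower().split())
    )
    return aligned * 2 >= len(proposed_tasks)  # at least half must map back

goal = "summarize efficient sequence modeling research"
print(accept_replan(goal, [
    "survey state space model research",
    "compare efficient attention variants",
]))  # True
```

Rejected replans should be logged with the proposed task list — repeated rejections on the same workflow are a signal that the replanning prompt itself is drifting.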
Treating replanning as error recovery. Replanning is not a fallback for when tasks fail. It is a deliberate architectural choice for when findings suggest the plan itself needs revision. If you use it to recover from failures, you conflate two separate concerns and end up with a system that is hard to debug and harder to observe. Handle task failures with retry logic and error nodes. Reserve replanning for when the map was wrong, not when a single step hit a transient error.
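Keeping the two concerns separate is mostly a matter of routing. A sketch of the retry side — the error labels are illustrative, to be matched against your tool layer's real exceptions — where transient failures never reach the replan check:

```python
# Sketch of keeping the two concerns separate: transient errors are retried
# with exponential backoff and never reach the replan check. The error labels
# are illustrative — match them to your tool layer's real exceptions.
import time

TRANSIENT = ("rate_limit", "timeout", "http_503")

def run_with_retry(task_fn, max_retries: int = 3, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return task_fn()
        except RuntimeError as e:
            if str(e) not in TRANSIENT or attempt == max_retries - 1:
                raise  # a real failure — surface it; do NOT route to replan
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```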
Anti-patterns: how to self-diagnose fast
If any of these describe your current implementation, you have a design problem — not a prompt problem.
❌ Execution Contract Pattern applied to a discovery workflow. You defined the task list before execution started for a research, debugging, or exploratory task. The agent will complete the plan. It will not achieve the goal.
❌ Discovery Hypothesis Pattern without a ceiling. Your dynamic agent has no max_replans limit. One surprising finding triggers a replan, which generates new tasks, which produce more surprising findings, which trigger another replan. The agent never finishes. Add a hard ceiling and treat hitting it as a signal to review your replanning trigger criteria.
❌ Replacing the full task list on replan. Your replanning node overwrites state["tasks"] entirely. Completed tasks are gone. If the agent crashes after a replan, you have no checkpoint to resume from. The replan operation should only touch pending tasks — completed tasks are immutable once marked done.
❌ Using replanning for error recovery. A task hits a rate limit or returns a 503. Your agent triggers a replan instead of a retry. Now the error is invisible in your task history and the task list has changed for the wrong reason. Replanning is for when the map is wrong. Retries are for when the road is temporarily blocked. Do not conflate them.
❌ No Hybrid Boundary in mixed workflows. Your workflow starts with deterministic data collection and ends with open-ended synthesis, but you apply the same replanning policy to both phases. The collection phase gets non-deterministic execution it does not need. The synthesis phase gets a locked plan it cannot adapt. Design the boundary explicitly.
The heuristic that works in practice
Before building anything, answer one question about your workflow: do your tasks produce information that changes what tasks should come next, or do they just produce information that feeds into subsequent tasks as input? Research produces the former. Data transformation produces the latter. If your tasks produce instructions for what to do next, you need the Discovery Hypothesis Pattern. If they produce data to pass forward, the Execution Contract Pattern is the right fit.
Most production systems need both. A research agent that feeds a report generation pipeline uses the Discovery Hypothesis Pattern in the research phase and the Execution Contract Pattern in the generation phase. The boundary between them — where dynamic findings become static inputs — is the Hybrid Boundary Pattern. Design it explicitly.
```
Hybrid Boundary Pattern:

[Execution Contract Phase] ──── boundary ────→ [Discovery Hypothesis Phase]
     ETL / collection                               Research / synthesis
     fixed task list                                evolving task list
     status-only mutations                          replan-allowed mutations
           ↓                                              ↓
  deterministic output                             open-ended output
```
```mermaid
flowchart LR
    subgraph EC["Execution Contract Phase"]
        direction TB
        C1[collect source A] --> C2[collect source B] --> C3[collect source C]
    end
    B{{Phase boundary\nfindings → static inputs}}
    subgraph DH["Discovery Hypothesis Phase"]
        direction TB
        R1[analyze findings] --> RF{replan?}
        RF -->|yes| RP[replan node]
        RF -->|no| R2[next task]
        RP --> R2
    end
    EC --> B --> DH
```
The boundary node is the most important part of the hybrid design. It is where raw collected data gets consolidated into a structured input that the discovery phase treats as fixed context. Everything before the boundary runs under the Execution Contract Pattern. Everything after runs under the Discovery Hypothesis Pattern. Mixing the policies on either side of this line is the source of most hybrid workflow failures.
In LangGraph, the boundary is a conditional edge that checks whether the static collection phase is complete and, if so, initialises the dynamic phase's task list from the collected findings:
```python
class HybridAgentState(TypedDict):
    # Shared
    goal: str
    phase: str  # "collection" | "synthesis"
    # Execution Contract phase
    collection_tasks: list[dict]
    collection_index: int
    collected_data: list[str]
    # Discovery Hypothesis phase
    synthesis_tasks: list[dict]
    synthesis_index: int
    accumulated_findings: list[str]
    replan_count: int
    task_list_versions: list[dict]
    synthesis_initialized: bool

def phase_gate(state: HybridAgentState) -> str:
    """
    Conditional edge: fires after each collection task.
    Once all collection tasks are done, initialises synthesis
    and routes to the discovery phase. Never fires again.
    """
    all_collected = all(
        t["status"] == "complete" for t in state["collection_tasks"]
    )
    if all_collected and not state["synthesis_initialized"]:
        return "init_synthesis"
    if all_collected:
        return "synthesize"
    return "collect"

def init_synthesis(state: HybridAgentState) -> dict:
    """
    Boundary node: converts collected data into synthesis tasks.
    This is the handoff — dynamic findings become static inputs.
    Runs exactly once.
    """
    # your function: takes goal + collected_data, returns list[dict] of task objects
    # each with "id", "description", "status": "pending", "result": None
    synthesis_tasks = generate_synthesis_tasks(
        goal=state["goal"],
        collected_data=state["collected_data"],
    )
    return {
        "synthesis_tasks": synthesis_tasks,
        "synthesis_index": 0,
        "synthesis_initialized": True,
        "phase": "synthesis",
    }

# Graph construction
workflow = StateGraph(HybridAgentState)
workflow.add_node("collect", execute_collection_task)    # your function: runs one collection task, appends result to collected_data
workflow.add_node("init_synthesis", init_synthesis)
workflow.add_node("synthesize", execute_synthesis_task)  # your function: runs one synthesis task, appends result to accumulated_findings
workflow.add_node("check_replan", check_replan_node)
workflow.add_node("replan", replan)

def synthesis_done(state: HybridAgentState) -> str:
    """Termination check — separate from replanning logic."""
    pending = [t for t in state["synthesis_tasks"] if t["status"] == "pending"]
    return END if not pending else "check_replan"

workflow.set_entry_point("collect")
workflow.add_conditional_edges("collect", phase_gate, {
    "collect": "collect",
    "init_synthesis": "init_synthesis",
    "synthesize": "synthesize",
})
workflow.add_edge("init_synthesis", "synthesize")
workflow.add_conditional_edges("synthesize", synthesis_done, {
    "check_replan": "check_replan",
    END: END,
})
workflow.add_conditional_edges("check_replan", should_replan, {
    "replan": "replan",
    "continue": "synthesize",
})
workflow.add_edge("replan", "synthesize")
```
init_synthesis runs exactly once — the synthesis_initialized flag prevents it from firing again if phase_gate is called after the boundary has already been crossed. The collection loop is entirely static: no replanning, no conditional branching based on content. The synthesis loop is entirely dynamic: replanning allowed, task list mutable. The boundary is explicit in both the state schema and the graph topology.
Observability: debugging dynamic agents in production
Debugging a static agent is straightforward — the task list is fixed, execution is sequential, and any failure maps directly to a task index and an error. Debugging a dynamic agent is categorically harder, because the task list itself changes shape during execution. A trace that shows "task 4 failed" tells you nothing useful if task 4 in the current run is not the same task 4 that existed when the workflow started. That asymmetry is what makes observability a first-class concern for the Discovery Hypothesis Pattern, not an afterthought.
You need three things: task list diffing at every replan boundary, trigger attribution, and task list versioning.
Task list diffing is the minimum. The replan function shown above already captures this — tasks_before and tasks_after in the snapshot give you removed, added, and preserved tasks at every boundary. Add a logger.info call alongside the snapshot to push it to your logging pipeline:
```python
# Inside the replan function, after building the snapshot:
logger.info("replan_diff", extra={
    "thread_id": config["configurable"]["thread_id"],
    "replan_count": snapshot["version"],
    "trigger": snapshot["trigger"],  # which of the 4 triggers fired
    "tasks_removed": [t for t in snapshot["tasks_before"] if t not in snapshot["tasks_after"]],
    "tasks_added": [t for t in snapshot["tasks_after"] if t not in snapshot["tasks_before"]],
    "tasks_preserved": [t for t in snapshot["tasks_before"] if t in snapshot["tasks_after"]],
})
```
Trigger attribution is what makes the diff actionable. Logging which of the four triggers fired — Contradiction, Obsolescence, Scope shift, New critical path — tells you not just that the plan changed but why. In production this becomes a dashboard metric: if Scope shift is firing on 40% of replans, your initial planning prompt is under-specified. If Contradiction is firing repeatedly on the same workflow type, your task assumptions are systematically wrong. You cannot see this without the trigger label in the log. replan_trigger is set from state.get("replan_trigger") in the main replan function — populate it in should_replan before returning, or pass it through a dedicated state field.
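The dashboard metric described above is a simple aggregation over the replan_diff log entries. A minimal sketch, assuming log entries shaped like the snapshot schema in this article (the trigger_distribution function itself is a hypothetical name, not part of any library):

```python
from collections import Counter


def trigger_distribution(replan_logs: list[dict]) -> dict[str, float]:
    """Fraction of replans attributed to each trigger label."""
    counts = Counter(entry["trigger"] for entry in replan_logs)
    total = sum(counts.values())
    return {trigger: n / total for trigger, n in counts.items()}


logs = [
    {"trigger": "scope_shift"},
    {"trigger": "scope_shift"},
    {"trigger": "contradiction"},
    {"trigger": "obsolescence"},
    {"trigger": "scope_shift"},
]
dist = trigger_distribution(logs)
# scope_shift accounts for 3 of 5 replans in this sample: the kind of
# signal that suggests the initial planning prompt is under-specified
```

In production you would run this over a rolling window per workflow type rather than a single run, so systematic problems stand out from one-off noise.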
Task list versioning is handled by the task_list_versions field in DynamicAgentState and the snapshot logic already in the main replan function. Every replan boundary produces a full snapshot: version number, trigger, tasks before, tasks after, and the last three findings that drove the decision. The complete implementation is in the Discovery Hypothesis Pattern section above — no separate code needed here.
This gives you a full reconstruction of how the agent's plan evolved across the entire run. When a dynamic agent produces a wrong output and you need to explain why, the version history is the audit trail. You can point to exactly which replan introduced the wrong direction, which trigger fired, and what findings drove it. Without this, post-mortem analysis on dynamic agents is guesswork.
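Reconstructing that audit trail from the task_list_versions snapshots is a straightforward diff walk. A sketch, assuming snapshots carry the version, trigger, tasks_before, and tasks_after fields described above (plan_evolution is a hypothetical helper, not part of the main implementation):

```python
def plan_evolution(versions: list[dict]) -> list[str]:
    """Render each task_list_versions snapshot as a one-line audit entry."""
    timeline = []
    for v in versions:
        before = {t["id"] for t in v["tasks_before"]}
        after = {t["id"] for t in v["tasks_after"]}
        added = sorted(after - before)
        removed = sorted(before - after)
        timeline.append(
            f"v{v['version']} [{v['trigger']}] added={added} removed={removed}"
        )
    return timeline


versions = [
    {
        "version": 1,
        "trigger": "contradiction",
        "tasks_before": [{"id": "t2"}, {"id": "t3"}, {"id": "t4"}],
        "tasks_after": [{"id": "t2"}, {"id": "t4"}, {"id": "t5"}],
    },
]
timeline = plan_evolution(versions)
# one entry per replan boundary: version, trigger, tasks added, tasks removed
```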
If you are using Langfuse or a similar tracing tool, the replan_diff log maps cleanly to a custom span and the version snapshots attach as metadata. Tag everything with thread_id so you can reconstruct the full execution timeline across replanning cycles in a single view. The broader observability stack for agentic systems — spans, traces, cost tracking — is covered in Agentic AI Observability: Why Traditional Monitoring Breaks with Autonomous Systems.
Human-in-the-loop at the replanning boundary
The interrupt_before feature in LangGraph is worth using here. The question is when it is optional and when it is mandatory. If you need a full walkthrough of how interrupt_before works in a LangGraph graph, Multi-Party Authorization: Requiring Human Approval Without Killing Autonomy covers the implementation pattern in depth.
```python
app = workflow.compile(
    checkpointer=checkpointer,
    interrupt_before=["replan"],  # pause before any replan for human review
)
```
Before the replan node executes, the agent surfaces its proposed task list revision and waits for approval. The user sees what the agent wants to change and why, approves or modifies the new plan, and execution resumes.
HITL at the replanning boundary is mandatory in three situations. First, when the replan involves a scope shift trigger — the original framing is being questioned, and only a human can confirm whether the new direction still serves the underlying intent. Second, when the agent is operating in a domain where a wrong replan is expensive to reverse: financial analysis, medical research, legal document review. Third, when the output of the workflow will be published or acted on without further human review — a research summary going directly into a report, an analysis feeding an automated decision system.
In high-stakes production environments — fintech, healthcare, legal — scope shift should never route autonomously to the replan node. It should always route to a human first. The other three triggers (Contradiction, Obsolescence, New critical path) can be autonomous because they represent clear, verifiable conditions: an assumption is provably wrong, work is provably redundant, or a shorter path is provably available. Scope shift is different — it is a judgment call about whether the agent's new direction still serves the user's intent, and that judgment belongs to a human. Implement this as a trigger-aware routing function rather than a blanket interrupt_before:
```python
def route_after_replan_decision(state: DynamicAgentState) -> str:
    """
    Route based on which trigger fired, not just whether to replan.
    Scope shift always goes to human in high-stakes workflows.
    """
    if state["replan_trigger"] == "scope_shift":
        return "human_review"  # interrupt_before equivalent — pause for approval
    if state["replan_trigger"] in ("contradiction", "obsolescence", "new_critical_path"):
        return "replan"  # autonomous — condition is verifiable
    return "execute"  # no replan needed


# human_review node: your implementation surfaces the proposed replan to the operator
# and waits for approval before resuming. In LangGraph this is typically an
# interrupt_before on a lightweight node that reads state and writes back the decision.
workflow.add_node("human_review", human_review_node)
workflow.add_conditional_edges("check_replan", route_after_replan_decision, {
    "human_review": "human_review",
    "replan": "replan",
    "execute": "execute",
})
workflow.add_edge("human_review", "replan")  # after approval, proceed to replan
```
This gives you fine-grained control: autonomous replanning for verifiable conditions, mandatory human review for judgment calls. The interrupt_before=["replan"] blanket approach is simpler to implement but over-interrupts on low-risk triggers and under-protects on scope shift if you forget to set it.
The concrete failure mode it prevents: an agent researching "AI infrastructure costs in India" discovers mid-execution that a related topic — compute subsidy policies — has more recent literature. It triggers a scope shift replan and rewrites its task list to focus on subsidy policy instead. The agent finishes, the report is generated, and it covers a completely different question than the one the user asked. Without HITL at the replan boundary, this failure is silent. The agent did not error. It produced a complete, well-formed report. It just answered the wrong question.
With interrupt_before=["replan"], the agent pauses and surfaces: "I found that subsidy policy literature is more recent. I want to shift focus. Do you want me to proceed?" The user says no, the original plan continues. One prompt, one second, failure avoided.
Testing both patterns before you ship
Testing the Execution Contract Pattern is straightforward. Initialize state with a known task list, run the graph, assert that the task list structure is identical at the end — only status fields changed. Any structural mutation is a test failure.
```python
def test_static_task_list_immutability():
    initial_tasks = [
        {"id": "t1", "description": "extract", "status": "pending", "result": None},
        {"id": "t2", "description": "validate", "status": "pending", "result": None},
    ]
    initial_ids = {t["id"] for t in initial_tasks}

    result = app.invoke({"tasks": initial_tasks, "current_task_index": 0, "context": ""})

    final_ids = {t["id"] for t in result["tasks"]}
    assert initial_ids == final_ids, "Task list structure mutated — Execution Contract violated"
    assert all(t["status"] == "complete" for t in result["tasks"])
```
Testing the Discovery Hypothesis Pattern requires simulating findings that should trigger each of the four replan conditions. Write one test per trigger type: inject a finding that contradicts a task assumption and assert the replan fires; inject a finding that is surprising but non-actionable and assert it does not.
```python
# mock_llm can be a pytest fixture using unittest.mock or a simple stub —
# the key is controlling what REPLAN or CONTINUE the LLM returns per test case


def test_replan_fires_on_contradiction(mock_llm):
    # Inject a finding that directly invalidates task t3's assumption
    state = build_state_with_finding(
        finding="The API documented in task t3 was deprecated in v2.0",
        pending_tasks=["t2", "t3", "t4"],
    )
    mock_llm.set_response("REPLAN")
    result = should_replan(state)
    assert result == "replan"


def test_replan_does_not_fire_on_noise(mock_llm):
    state = build_state_with_finding(
        finding="Interesting side note found in footnote 12",
        pending_tasks=["t2", "t3", "t4"],
    )
    mock_llm.set_response("CONTINUE")
    result = should_replan(state)
    assert result == "continue"
```
Testing the Hybrid Boundary means asserting that the boundary node fires exactly once, that tasks before the boundary run under static policy, and that tasks after run under dynamic policy. A boundary that fires twice or not at all is a configuration bug, not a runtime bug — catch it in tests.
```python
def test_hybrid_boundary_fires_exactly_once():
    initial_state = build_hybrid_state(
        collection_tasks=["collect A", "collect B", "collect C"],
        goal="research and synthesize findings",
    )
    result = app.invoke(initial_state)

    # Boundary fires exactly once — one version snapshot from synthesis phase
    # (collection phase produces none because it runs under static policy)
    assert result["synthesis_initialized"] is True

    # Collection tasks ran under static policy — no replanning, no versions
    assert all(t["status"] == "complete" for t in result["collection_tasks"])

    # Synthesis ran under dynamic policy — task_list_versions may exist
    # (zero versions is valid if no replan was triggered; > 0 means replan fired)
    assert isinstance(result["task_list_versions"], list)


def test_hybrid_boundary_does_not_refire():
    """synthesis_initialized flag prevents phase_gate from re-entering init_synthesis."""
    state = build_hybrid_state_post_boundary(synthesis_initialized=True)
    route = phase_gate(state)
    assert route != "init_synthesis", "Boundary node fired after synthesis was already initialised"
```
The first test covers the happy path: collection completes, boundary fires once, synthesis runs. The second is the guard test: with synthesis_initialized=True, phase_gate must never route back to init_synthesis regardless of state.
The four trigger names defined in "What actually triggers replanning in production" — Contradiction, Obsolescence, Scope shift, New critical path — are the same names you instrument in your observability setup. The trigger field in every replan_diff log entry and every task_list_versions snapshot maps directly to one of those four. If you change the trigger taxonomy in your should_replan prompt, update the observability labels to match. They are the same system.
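One way to keep the prompt taxonomy and the observability labels from drifting apart is a single shared constant that both sides import. A minimal sketch; REPLAN_TRIGGERS and validate_trigger are hypothetical names introduced here, not part of the implementation above:

```python
# Single source of truth: the same tuple feeds the should_replan prompt,
# the replan_diff log labels, and the task_list_versions snapshots.
REPLAN_TRIGGERS = ("contradiction", "obsolescence", "scope_shift", "new_critical_path")


def validate_trigger(trigger: str) -> str:
    """Reject any trigger label the taxonomy does not define."""
    if trigger not in REPLAN_TRIGGERS:
        raise ValueError(f"unknown replan trigger: {trigger!r}")
    return trigger
```

Calling validate_trigger at the point where should_replan writes the trigger into state turns a silent label mismatch into an immediate, attributable error.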
Architecture pattern summary
If you remember three things from this article:
Execution Contract Pattern (static task list) — fixed plan, execution-driven. The task list is a contract signed before execution begins. Tasks depend on completion status, not output content. Lock the structure, mutate only status fields, use the list as a crash recovery checkpoint.
Discovery Hypothesis Pattern (dynamic task list) — evolving plan, learning-driven. The task list is a hypothesis revised as the agent learns. Tasks depend on what previous tasks found. Build a replanning node, fire it at milestones not every step, preserve completed tasks on every replan, cap replan cycles with a hard ceiling.
Hybrid Boundary Pattern — explicit transition between the two. Static for collection, dynamic for synthesis, with a designed handoff node where findings become fixed inputs. The boundary is the architecture.
The wrong choice in either direction fails silently. A static architecture on a discovery problem produces agents that finish the plan and fail the goal. A dynamic architecture without guardrails produces agents that plan indefinitely and finish nothing. The architecture is the fix, not the prompt.
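The "hard ceiling on replan cycles" guardrail from the summary is a few lines of routing logic. A sketch under stated assumptions: replan_budget_gate and MAX_REPLANS are hypothetical names, and the LLM's replan verdict is passed in as a boolean rather than computed here:

```python
MAX_REPLANS = 3  # hypothetical ceiling; tune per workflow type


def replan_budget_gate(replan_count: int, llm_says_replan: bool) -> str:
    """Hard ceiling on replan cycles: once the budget is spent,
    always continue so the agent finishes something."""
    if replan_count >= MAX_REPLANS:
        return "continue"  # budget exhausted — execute the current plan as-is
    return "replan" if llm_says_replan else "continue"
```

Wired into should_replan, this is what prevents the "plans indefinitely, finishes nothing" failure mode: after MAX_REPLANS boundaries, the agent is forced to execute even if the LLM keeps voting to revise.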
References
- LangGraph documentation — State management and checkpointing: https://langchain-ai.github.io/langgraph/
- LangGraph `interrupt_before` — Human-in-the-loop patterns: https://langchain-ai.github.io/langgraph/how-tos/human_in_the_loop/
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629
- Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. https://arxiv.org/abs/2303.11366
- Wang, L., et al. (2023). A Survey on Large Language Model based Autonomous Agents. https://arxiv.org/abs/2308.11432
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. https://arxiv.org/abs/2201.11903
- LangChain blog — Building reliable agents with LangGraph: https://blog.langchain.dev
Related Articles
- Building Production-Ready AI Agents with LangGraph: A Developer's Guide to Deterministic Workflows
- Agent Building Blocks: Build Production-Ready AI Agents with LangChain | Complete Developer Guide
- When Your Chatbot Needs to Actually Do Something: Understanding AI Agents
- 5 Principles for Building Production-Grade Agentic AI Systems