Harness Engineering · Part 7

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

State Management for Agentic Systems: How to Build Agents That Don't Start Over

A long-running agent without state management is a gamble. You're betting the entire task completes before something goes wrong. At production scale, that bet loses constantly.

#harness-engineering #state-management #checkpoint-resume #agent-memory #long-running-agents #production-ai-systems

The Agent That Started Over

An agent was running a complex codebase analysis task. Fifty files. Cross-file dependency mapping. Security vulnerability identification. Estimated runtime: 12 minutes.

At minute 9, the process died. Memory limit exceeded.

The agent had no checkpoint. It restarted from file 1.

But it had already written intermediate results to a shared database - partial dependency maps, preliminary vulnerability flags. The restart overwrote some of them incorrectly. When it finally completed on the third attempt (another memory spike at minute 11 killed the second run too), the output contained contradictions: vulnerabilities flagged in the first run that were cleared in the third, dependencies mapped differently across restarts.

Nobody knew which version was correct. Three attempts. Three different outputs. One database. Zero confidence in any of it.

The agent wasn't poorly designed. The task was reasonable. The failure was one missing layer: state management. No checkpointing, no restart-from-midpoint, no idempotency on writes. Every run was a fresh start with side effects from previous runs still in the database.

A stateless agent in a stateful world produces stateful damage.

This is Part 7 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. This article covers Layer 7 - State Management - including checkpoint-resume strategies, cross-session memory, and durable state for long-running agents.


Why State Management Is a Distinct Layer

For simple LLM applications, state is trivial. Request comes in, response goes out, nothing persists between them except what you explicitly store.

For agentic systems, state is the engineering problem. Agents run for minutes or hours. They execute sequences of actions. They accumulate intermediate results. They call tools that have side effects. They may be interrupted, resumed, or run across multiple user sessions.

Without explicit state management, you have three compounding problems:

Resume fragility. If the agent process dies, it restarts from scratch. All computation performed before the failure is lost. For a 12-minute task, that's 12 minutes of wasted compute on every failure. At production scale, with many long-running tasks, this cost is material.

Idempotency failure. If the agent retries a step that already had side effects, those effects are duplicated. Database writes happen twice. API calls fire twice. Emails send twice. Without idempotency guarantees, retry-on-failure corrupts state rather than recovering it.

Context discontinuity. When a user resumes a task in a new session, the agent starts without memory of what was done in previous sessions. It re-asks questions already answered. It re-does work already completed. The user's patience degrades with every fresh start.

State management solves all three. Checkpoints enable resume-from-failure. Idempotency keys prevent duplicate side effects. Persistent memory enables cross-session continuity.

The first two of these guarantees - durable checkpoints and idempotent side effects - combine into what I call the Stateful Execution Contract: a task produces the same final output regardless of how many times it is interrupted and resumed. That contract is what separates a production agentic system from a fragile script that happens to use an LLM.


The Three State Tiers

State in agentic systems exists at three levels, each with different persistence requirements and access patterns.

Tier 1: Task State (Checkpoint-Resume)

Task state is the record of what has been done within a single task execution. Which steps have completed. What their outputs were. What still needs to happen.

Checkpoints serialize this state to durable storage at meaningful boundaries - after each tool call, after each reasoning step, after each file processed. If the process dies, it resumes from the last checkpoint rather than from step 1.

code
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Any
import json
import uuid


@dataclass
class StepResult:
    step_id: str
    step_type: str
    input: dict
    output: Any
    completed_at: str
    idempotency_key: str


@dataclass
class TaskCheckpoint:
    task_id: str
    created_at: str
    last_updated: str
    total_steps: int
    completed_steps: list[StepResult] = field(default_factory=list)
    current_step_index: int = 0
    task_output: Any = None
    status: str = "in_progress"  # in_progress | completed | failed

    def is_complete(self) -> bool:
        return self.status == "completed"

    def next_step_index(self) -> int:
        return self.current_step_index

    def step_already_completed(self, idempotency_key: str) -> StepResult | None:
        for step in self.completed_steps:
            if step.idempotency_key == idempotency_key:
                return step
        return None


class CheckpointManager:
    def __init__(self, store):
        self.store = store  # Redis, Postgres, or any durable KV store

    def create(self, task_id: str, total_steps: int) -> TaskCheckpoint:
        checkpoint = TaskCheckpoint(
            task_id=task_id,
            created_at=datetime.utcnow().isoformat(),
            last_updated=datetime.utcnow().isoformat(),
            total_steps=total_steps,
        )
        self._save(checkpoint)
        return checkpoint

    def load(self, task_id: str) -> TaskCheckpoint | None:
        data = self.store.get(f"checkpoint:{task_id}")
        if not data:
            return None
        raw = json.loads(data)
        # Rehydrate nested StepResult records (json.loads yields plain dicts)
        raw["completed_steps"] = [StepResult(**s) for s in raw["completed_steps"]]
        return TaskCheckpoint(**raw)

    def record_step(self, checkpoint: TaskCheckpoint, step: StepResult):
        checkpoint.completed_steps.append(step)
        checkpoint.current_step_index += 1
        checkpoint.last_updated = datetime.utcnow().isoformat()
        self._save(checkpoint)

    def complete(self, checkpoint: TaskCheckpoint, output: Any):
        checkpoint.task_output = output
        checkpoint.status = "completed"
        checkpoint.last_updated = datetime.utcnow().isoformat()
        self._save(checkpoint)

    def _save(self, checkpoint: TaskCheckpoint):
        self.store.set(
            f"checkpoint:{checkpoint.task_id}",
            json.dumps(asdict(checkpoint)),
            ex=86400 * 7  # Expire after 7 days
        )

The idempotency key on each step is what enables safe retry. Before executing any step with side effects, check whether a step with the same idempotency key already completed. If it did, return its recorded output rather than re-executing.

code
async def execute_with_checkpoint(
    task_id: str,
    steps: list[dict],
    checkpoint_manager: CheckpointManager,
    executor,
) -> Any:
    # Load or create checkpoint
    checkpoint = checkpoint_manager.load(task_id)
    if checkpoint and checkpoint.is_complete():
        return checkpoint.task_output
    if not checkpoint:
        checkpoint = checkpoint_manager.create(task_id, len(steps))

    # Resume from last completed step
    for i, step in enumerate(steps[checkpoint.current_step_index:],
                             start=checkpoint.current_step_index):
        idempotency_key = f"{task_id}:{step['type']}:{i}"

        # Check if already done (idempotency)
        existing = checkpoint.step_already_completed(idempotency_key)
        if existing:
            continue

        # Execute step
        output = await executor.run(step)

        # Record checkpoint
        checkpoint_manager.record_step(checkpoint, StepResult(
            step_id=str(uuid.uuid4()),
            step_type=step["type"],
            input=step,
            output=output,
            completed_at=datetime.utcnow().isoformat(),
            idempotency_key=idempotency_key,
        ))

    # Mark complete (compile_outputs is application-specific aggregation
    # of the recorded step outputs)
    final_output = compile_outputs(checkpoint.completed_steps)
    checkpoint_manager.complete(checkpoint, final_output)
    return final_output

This pattern guarantees that if the process dies at any point, the next run resumes exactly where it left off - no duplicated side effects, no lost work, no contradictory outputs from partial runs.

Without Tier 1: every process failure wastes all prior compute and risks corrupting shared state with duplicate writes. The codebase analysis incident is the exact consequence.
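To make the resume behavior concrete, here is a minimal, self-contained sketch of the same pattern. An in-memory store stands in for Redis or Postgres, and the step names, key format, and `fail_after` knob are illustrative, not part of any real API:

```python
import json

class InMemoryStore:
    """Stand-in for a durable KV store (Redis, Postgres)."""
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ex=None):
        self.data[key] = value  # 'ex' (TTL) ignored in this sketch

def run_task(store, task_id, steps, fail_after=None):
    """Run steps with checkpointing; optionally crash after N fresh steps."""
    raw = store.get(f"ckpt:{task_id}")
    done = json.loads(raw) if raw else {}  # idempotency_key -> output
    fresh = 0
    for i, step in enumerate(steps):
        key = f"{task_id}:{step}:{i}"
        if key in done:
            continue  # executed in a previous run: skip the side effect
        done[key] = f"result-of-{step}"  # the side effect happens exactly once
        store.set(f"ckpt:{task_id}", json.dumps(done))  # durable progress
        fresh += 1
        if fail_after is not None and fresh >= fail_after:
            raise RuntimeError("simulated crash")
    return [done[f"{task_id}:{s}:{i}"] for i, s in enumerate(steps)]

store = InMemoryStore()
steps = ["parse", "map_dependencies", "flag_vulnerabilities"]
try:
    run_task(store, "t1", steps, fail_after=2)  # process dies mid-task
except RuntimeError:
    pass
output = run_task(store, "t1", steps)  # resumes: only the third step runs fresh
```

Interrupt it as many times as you like; the final output is identical every time - the Stateful Execution Contract in miniature.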

Tier 2: Session State (Cross-Session Memory)

Session state persists across the boundary between separate user interactions with the same agent. When a user returns to continue a task, the agent should know what happened in previous sessions.

This is distinct from task state (which tracks steps within a single execution) and from context engineering (which manages what's in the current context window). Session state is the bridge between sessions.

code
@dataclass
class SessionSummary:
    session_id: str
    task_description: str
    decisions_made: list[str]
    work_completed: list[str]
    open_questions: list[str]
    next_recommended_steps: list[str]
    created_at: str


class SessionMemory:
    def __init__(self, vector_store, kv_store):
        self.vector_store = vector_store  # For semantic recall
        self.kv_store = kv_store          # For direct lookup

    def save_session(self, user_id: str, summary: SessionSummary):
        # Store in vector store for semantic recall
        self.vector_store.add(
            id=summary.session_id,
            content=self._to_text(summary),
            metadata={
                "user_id": user_id,
                "session_id": summary.session_id,
                "created_at": summary.created_at,
            }
        )
        # Store structured data for direct lookup
        self.kv_store.set(
            f"session:{user_id}:{summary.session_id}",
            json.dumps(asdict(summary))
        )

    def recall_relevant(self, user_id: str, current_query: str,
                        top_k: int = 3) -> list[SessionSummary]:
        results = self.vector_store.similarity_search(
            query=current_query,
            filter={"user_id": user_id},
            k=top_k
        )
        return [SessionSummary(**json.loads(
            self.kv_store.get(f"session:{user_id}:{r.metadata['session_id']}")
        )) for r in results]

    def build_continuation_context(
        self, user_id: str, current_query: str
    ) -> str:
        relevant = self.recall_relevant(user_id, current_query)
        if not relevant:
            return ""
        lines = ["[Prior session context]"]
        for s in relevant:
            lines.append(f"\nSession: {s.task_description}")
            if s.decisions_made:
                lines.append(f"Decisions made: {'; '.join(s.decisions_made)}")
            if s.work_completed:
                lines.append(f"Work completed: {'; '.join(s.work_completed)}")
            if s.open_questions:
                lines.append(f"Open questions: {'; '.join(s.open_questions)}")
        return "\n".join(lines)

    def _to_text(self, summary: SessionSummary) -> str:
        return (
            f"Task: {summary.task_description}. "
            f"Completed: {', '.join(summary.work_completed)}. "
            f"Decided: {', '.join(summary.decisions_made)}."
        )

The session summary is generated at the end of each session by asking the model to distill what happened into a structured format. This compressed representation is what gets injected at the start of the next session - not the raw conversation history, which would be too long and too noisy.
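A sketch of that distillation step, with the model call stubbed out - `llm_complete` is a placeholder for whatever completion API you use, and the prompt wording and stubbed response are illustrative:

```python
import json

SUMMARY_PROMPT = """Distill the session below into JSON with exactly these keys:
task_description, decisions_made, work_completed, open_questions,
next_recommended_steps. Each list holds short strings. Output only JSON.

Transcript:
{transcript}"""

def summarize_session(llm_complete, transcript: str) -> dict:
    """Ask the model for a structured summary, then parse it into
    the SessionSummary fields."""
    raw = llm_complete(SUMMARY_PROMPT.format(transcript=transcript))
    return json.loads(raw)

# Stubbed model response for illustration
fake_llm = lambda prompt: json.dumps({
    "task_description": "Refactor auth module",
    "decisions_made": ["Use JWT for session tokens"],
    "work_completed": ["Extracted token helpers"],
    "open_questions": ["How often to rotate signing keys?"],
    "next_recommended_steps": ["Add refresh-token flow"],
})
summary = summarize_session(fake_llm, "user: let's refactor auth ...")
```

In production you would also validate the parsed JSON against the expected schema before storing it, since the model can return malformed output.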

Without Tier 2: every new session starts blind. The agent re-asks questions already answered. The user re-explains context already given. The work already done has to be rediscovered.

Tier 3: User State (Long-Term Preferences)

User state captures standing facts about the user that persist indefinitely and inform agent behavior across all sessions and tasks.

code
class UserStateStore:
    def __init__(self, kv_store):
        self.store = kv_store

    def set(self, user_id: str, key: str, value: str):
        self.store.set(f"user:{user_id}:{key}", value)

    def get(self, user_id: str, key: str) -> str | None:
        return self.store.get(f"user:{user_id}:{key}")

    def get_all(self, user_id: str) -> dict:
        keys = self.store.keys(f"user:{user_id}:*")
        return {
            k.replace(f"user:{user_id}:", ""): self.store.get(k)
            for k in keys
        }

    def build_user_context(self, user_id: str) -> str:
        facts = self.get_all(user_id)
        if not facts:
            return ""
        lines = ["[User preferences and context]"]
        for k, v in facts.items():
            lines.append(f"- {k}: {v}")
        return "\n".join(lines)

Examples of user state: preferred programming language, coding style preferences, project tech stack, timezone, communication preferences, standing access permissions. These facts don't change session to session and shouldn't need to be re-established every time.
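For illustration, here is the same prefix-scoped key pattern with an in-memory dict standing in for the KV store. The user id, key names, and values are hypothetical:

```python
class DictKV:
    """In-memory stand-in exposing the get/set/keys surface used above."""
    def __init__(self):
        self.d = {}

    def set(self, key, value):
        self.d[key] = value

    def get(self, key):
        return self.d.get(key)

    def keys(self, pattern):
        # Supports only the "prefix*" patterns the store uses
        return [k for k in self.d if k.startswith(pattern.rstrip("*"))]

kv = DictKV()
kv.set("user:u1:preferred_language", "Python")
kv.set("user:u1:timezone", "Europe/Berlin")
kv.set("user:u1:tech_stack", "FastAPI + Postgres")

prefix = "user:u1:"
facts = {k[len(prefix):]: kv.get(k) for k in kv.keys(prefix + "*")}
context = "\n".join(["[User preferences and context]"] +
                    [f"- {k}: {v}" for k, v in facts.items()])
```

The resulting `context` block is what gets prepended to the system prompt at session start.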

Without Tier 3: every session starts with a stranger. The agent that helped you yesterday knows nothing about you today.


The Stateful Execution Contract in Practice

I call the combination of checkpointing + idempotency the Stateful Execution Contract: the guarantee that a task produces the same final output regardless of how many times it is interrupted and resumed.

The contract has three terms:

Progress is durable. Every completed step is recorded before the next step begins. Interruption cannot cause more than one step of lost work.

Side effects are idempotent. No step with external side effects executes more than once for a given task execution. Retries reuse recorded outputs rather than re-executing the step.

Output is deterministic given inputs. The final output of a task depends only on the task inputs and the steps executed, not on the number of restarts required to complete it.

Without the Stateful Execution Contract, you do not have an agentic system. You have a script that breaks under any fault.


What Observability Looks Like for This Layer

Checkpoint resume rate - what fraction of task executions resume from a checkpoint rather than starting fresh? This tells you how often your agents are being interrupted mid-task. A high resume rate signals infrastructure instability (processes dying frequently) or task timeouts (tasks running longer than expected).

Step retry rate - what fraction of steps hit an existing idempotency key on resume? Zero is ideal. A non-zero rate means processes are dying and restarting - expected at scale - and it also confirms the guard is doing its job: those steps returned their recorded outputs instead of re-executing their side effects.

Session memory recall precision - when session memory is injected at session start, does it improve task completion time or quality? Measure by comparing sessions with and without relevant prior context. If recall doesn't help, your session summaries need to be more structured.

Checkpoint storage growth - how much storage is checkpointing consuming over time? Set expiry policies. Completed task checkpoints don't need to live forever. Seven days is usually sufficient for debugging; 30 days for audit.
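The first two metrics fall out directly from checkpoint records. A minimal sketch, assuming each execution record carries a `resumed` flag plus per-run counts of steps and idempotency-key hits (the field names are illustrative):

```python
def checkpoint_metrics(runs: list[dict]) -> dict:
    """Compute resume rate and step retry rate from execution records.

    Each record: {"resumed": bool, "idempotency_hits": int, "steps": int}.
    """
    total_runs = len(runs) or 1
    total_steps = sum(r["steps"] for r in runs) or 1
    return {
        # Fraction of executions that picked up from a checkpoint
        "resume_rate": sum(r["resumed"] for r in runs) / total_runs,
        # Fraction of steps skipped because their idempotency key existed
        "step_retry_rate": sum(r["idempotency_hits"] for r in runs) / total_steps,
    }

m = checkpoint_metrics([
    {"resumed": False, "idempotency_hits": 0, "steps": 10},
    {"resumed": True,  "idempotency_hits": 4, "steps": 10},
])
```

Alert on trend, not absolute value: a resume rate that doubles week over week points at infrastructure regressions before users notice.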


What to Build First

First: Idempotency keys on all write operations. Before anything else, make every step that writes to external systems idempotent. This is the minimum viable protection against duplicate side effects from retries.
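A minimal sketch of that protection: the key is derived deterministically from the logical write, so a retry computes the same key and is skipped. The applied-keys set lives in memory here; in production it belongs in the same durable store as your checkpoints. All names are illustrative:

```python
import hashlib
import json

def idempotency_key(task_id: str, op: str, payload: dict) -> str:
    """Deterministic key: the same logical write yields the same key
    across retries, because the payload digest is order-independent."""
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{task_id}:{op}:{digest}"

class IdempotentWriter:
    """Apply each keyed write at most once; repeats are no-ops."""
    def __init__(self, write_fn):
        self.write_fn = write_fn
        self.applied = set()  # durable storage in production

    def write(self, key: str, payload: dict) -> str:
        if key in self.applied:
            return "skipped"
        self.write_fn(payload)  # the actual side effect
        self.applied.add(key)
        return "applied"

log = []
writer = IdempotentWriter(log.append)
k = idempotency_key("t1", "insert_flag", {"file": "auth.py", "cwe": "79"})
writer.write(k, {"file": "auth.py", "cwe": "79"})
writer.write(k, {"file": "auth.py", "cwe": "79"})  # retry: no second write
```

The same wrapper shape works for database inserts, outbound API calls, or email sends - anything where "twice" corrupts state.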

Second: Checkpoint-resume for tasks over 60 seconds. Any task that runs longer than 60 seconds should checkpoint. Implement the checkpoint manager and wire it into your agent execution loop.

Third: Session summaries. At the end of each session, prompt the model to generate a structured summary: decisions made, work completed, open questions, next steps. Store it.

Fourth: Session memory injection. At the start of each session, retrieve relevant prior session summaries using semantic search. Inject them into the system context.

Fifth: User state store. Capture standing user preferences and project context. Inject at session start.


The Principle

Long-running agents operate in a world where things fail. Processes die. Network calls time out. Infrastructure has incidents. These are not exceptional events - at production scale they are routine.

The question is not whether your agent will be interrupted. It's whether you designed it to resume.

Checkpoint-resume, idempotency, and session memory are not optional features for agentic systems. They are the baseline that separates a production agent from a fragile script.

An agent that starts over is not an agent. It's an expensive coin flip.



