The Agent That Started Over
An agent was running a complex codebase analysis task. Fifty files. Cross-file dependency mapping. Security vulnerability identification. Estimated runtime: 12 minutes.
At minute 9, the process died. Memory limit exceeded.
The agent had no checkpoint. It restarted from file 1.
But it had already written intermediate results to a shared database - partial dependency maps, preliminary vulnerability flags. The restart overwrote some of them incorrectly. When it finally completed on the third attempt (another memory spike at minute 11 killed the second run too), the output contained contradictions: vulnerabilities flagged in the first run that were cleared in the third, dependencies mapped differently across restarts.
Nobody knew which version was correct. Three attempts. Three different outputs. One database. Zero confidence in any of it.
The agent wasn't poorly designed. The task was reasonable. The failure was one missing layer: state management. No checkpointing, no restart-from-midpoint, no idempotency on writes. Every run was a fresh start with side effects from previous runs still in the database.
A stateless agent in a stateful world produces stateful damage.
This is Part 7 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. This article covers Layer 7 - State Management - including checkpoint-resume strategies, cross-session memory, and durable state for long-running agents.
Why State Management Is a Distinct Layer
For simple LLM applications, state is trivial. Request comes in, response goes out, nothing persists between them except what you explicitly store.
For agentic systems, state is the engineering problem. Agents run for minutes or hours. They execute sequences of actions. They accumulate intermediate results. They call tools that have side effects. They may be interrupted, resumed, or run across multiple user sessions.
Without explicit state management, you have three compounding problems:
Resume fragility. If the agent process dies, it restarts from scratch. All computation performed before the failure is lost. For a 12-minute task, that's 12 minutes of wasted compute on every failure. At production scale, with many long-running tasks, this cost is material.
Idempotency failure. If the agent retries a step that already had side effects, those effects are duplicated. Database writes happen twice. API calls fire twice. Emails send twice. Without idempotency guarantees, retry-on-failure corrupts state rather than recovering it.
Context discontinuity. When a user resumes a task in a new session, the agent starts without memory of what was done in previous sessions. It re-asks questions already answered. It re-does work already completed. The user's patience degrades with every fresh start.
State management solves all three. Checkpoints enable resume-from-failure. Idempotency keys prevent duplicate side effects. Persistent memory enables cross-session continuity.
Together these three guarantees form what I call the Stateful Execution Contract: a task produces the same final output regardless of how many times it is interrupted and resumed. That contract is what separates a production agentic system from a fragile script that happens to use an LLM.
The Three State Tiers
State in agentic systems exists at three levels, each with different persistence requirements and access patterns.
Tier 1: Task State (Checkpoint-Resume)
Task state is the record of what has been done within a single task execution. Which steps have completed. What their outputs were. What still needs to happen.
Checkpoints serialize this state to durable storage at meaningful boundaries - after each tool call, after each reasoning step, after each file processed. If the process dies, it resumes from the last checkpoint rather than from step 1.
```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Any
import json
import uuid

@dataclass
class StepResult:
    step_id: str
    step_type: str
    input: dict
    output: Any
    completed_at: str
    idempotency_key: str

@dataclass
class TaskCheckpoint:
    task_id: str
    created_at: str
    last_updated: str
    total_steps: int
    completed_steps: list[StepResult] = field(default_factory=list)
    current_step_index: int = 0
    task_output: Any = None
    status: str = "in_progress"  # in_progress | completed | failed

    def is_complete(self) -> bool:
        return self.status == "completed"

    def next_step_index(self) -> int:
        return self.current_step_index

    def step_already_completed(self, idempotency_key: str) -> StepResult | None:
        for step in self.completed_steps:
            if step.idempotency_key == idempotency_key:
                return step
        return None

class CheckpointManager:
    def __init__(self, store):
        self.store = store  # Redis, Postgres, or any durable KV store

    def create(self, task_id: str, total_steps: int) -> TaskCheckpoint:
        checkpoint = TaskCheckpoint(
            task_id=task_id,
            created_at=datetime.utcnow().isoformat(),
            last_updated=datetime.utcnow().isoformat(),
            total_steps=total_steps,
        )
        self._save(checkpoint)
        return checkpoint

    def load(self, task_id: str) -> TaskCheckpoint | None:
        data = self.store.get(f"checkpoint:{task_id}")
        if not data:
            return None
        raw = json.loads(data)
        # Rehydrate nested StepResult records from their serialized dicts
        raw["completed_steps"] = [StepResult(**s) for s in raw["completed_steps"]]
        return TaskCheckpoint(**raw)

    def record_step(self, checkpoint: TaskCheckpoint, step: StepResult):
        checkpoint.completed_steps.append(step)
        checkpoint.current_step_index += 1
        checkpoint.last_updated = datetime.utcnow().isoformat()
        self._save(checkpoint)

    def complete(self, checkpoint: TaskCheckpoint, output: Any):
        checkpoint.task_output = output
        checkpoint.status = "completed"
        checkpoint.last_updated = datetime.utcnow().isoformat()
        self._save(checkpoint)

    def _save(self, checkpoint: TaskCheckpoint):
        self.store.set(
            f"checkpoint:{checkpoint.task_id}",
            json.dumps(asdict(checkpoint)),
            ex=86400 * 7,  # expire after 7 days
        )
```

The idempotency key on each step is what enables safe retry. Before executing any step with side effects, check whether a step with the same idempotency key already completed. If it did, return its recorded output rather than re-executing.
```python
async def execute_with_checkpoint(
    task_id: str,
    steps: list[dict],
    checkpoint_manager: CheckpointManager,
    executor,
) -> Any:
    # Load or create the checkpoint
    checkpoint = checkpoint_manager.load(task_id)
    if checkpoint and checkpoint.is_complete():
        return checkpoint.task_output
    if not checkpoint:
        checkpoint = checkpoint_manager.create(task_id, len(steps))

    # Resume from the last completed step
    for i, step in enumerate(
        steps[checkpoint.current_step_index:],
        start=checkpoint.current_step_index,
    ):
        idempotency_key = f"{task_id}:{step['type']}:{i}"

        # Skip any step that already completed (idempotency)
        existing = checkpoint.step_already_completed(idempotency_key)
        if existing:
            continue

        # Execute the step
        output = await executor.run(step)

        # Record durable progress before moving on
        checkpoint_manager.record_step(checkpoint, StepResult(
            step_id=str(uuid.uuid4()),
            step_type=step["type"],
            input=step,
            output=output,
            completed_at=datetime.utcnow().isoformat(),
            idempotency_key=idempotency_key,
        ))

    # Mark complete; compile_outputs is task-specific - it assembles
    # the final result from the recorded step outputs
    final_output = compile_outputs(checkpoint.completed_steps)
    checkpoint_manager.complete(checkpoint, final_output)
    return final_output
```

This pattern guarantees that if the process dies at any point, the next run resumes from the last recorded step. At most the single in-flight step is retried; completed steps never re-execute their side effects, no work is lost, and partial runs cannot produce contradictory outputs.
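To see the resume behavior concretely, here is a compressed, dict-based toy version of the same loop: an in-memory store plus an executor that dies on its first attempt at step "b". The names here (InMemoryStore, run_task, flaky) are illustrative, not part of the harness above.

```python
import json

class InMemoryStore:
    """Stand-in for Redis/Postgres: a dict with get/set."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

def run_task(task_id, steps, store, execute):
    """Dict-based version of the checkpoint loop above."""
    raw = store.get(f"checkpoint:{task_id}")
    cp = json.loads(raw) if raw else {"done": {}, "next": 0}
    for i in range(cp["next"], len(steps)):
        key = f"{task_id}:{i}"
        if key in cp["done"]:                 # idempotency: reuse recorded output
            continue
        cp["done"][key] = execute(steps[i])   # side effect happens once
        cp["next"] = i + 1
        store.set(f"checkpoint:{task_id}", json.dumps(cp))  # durable progress
    return [cp["done"][f"{task_id}:{i}"] for i in range(len(steps))]

store = InMemoryStore()
calls = []

def flaky(step):
    calls.append(step)
    if step == "b" and len(calls) == 2:       # first attempt at "b" dies
        raise RuntimeError("process killed")
    return step.upper()

steps = ["a", "b", "c"]
try:
    run_task("t1", steps, store, flaky)       # run 1 dies mid-task
except RuntimeError:
    pass
result = run_task("t1", steps, store, flaky)  # run 2 resumes from checkpoint
print(result)   # ['A', 'B', 'C']
print(calls)    # ['a', 'b', 'b', 'c'] - "a" was never re-executed
```

Run 1 dies mid-task; run 2 loads the checkpoint, skips the completed step, and finishes. The executor sees "a" exactly once across both runs, and only the in-flight step "b" is retried.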
Without Tier 1: every process failure wastes all prior compute and risks corrupting shared state with duplicate writes. The codebase analysis incident at the top of this article is exactly this failure mode.
Tier 2: Session State (Cross-Session Memory)
Session state persists across the boundary between separate user interactions with the same agent. When a user returns to continue a task, the agent should know what happened in previous sessions.
This is distinct from task state (which tracks steps within a single execution) and from context engineering (which manages what's in the current context window). Session state is the bridge between sessions.
```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SessionSummary:
    session_id: str
    task_description: str
    decisions_made: list[str]
    work_completed: list[str]
    open_questions: list[str]
    next_recommended_steps: list[str]
    created_at: str

class SessionMemory:
    def __init__(self, vector_store, kv_store):
        self.vector_store = vector_store  # for semantic recall
        self.kv_store = kv_store          # for direct lookup

    def save_session(self, user_id: str, summary: SessionSummary):
        # Store in the vector store for semantic recall
        self.vector_store.add(
            id=summary.session_id,
            content=self._to_text(summary),
            metadata={
                "user_id": user_id,
                "session_id": summary.session_id,
                "created_at": summary.created_at,
            },
        )
        # Store structured data for direct lookup
        self.kv_store.set(
            f"session:{user_id}:{summary.session_id}",
            json.dumps(asdict(summary)),
        )

    def recall_relevant(self, user_id: str, current_query: str,
                        top_k: int = 3) -> list[SessionSummary]:
        results = self.vector_store.similarity_search(
            query=current_query,
            filter={"user_id": user_id},
            k=top_k,
        )
        return [
            SessionSummary(**json.loads(
                self.kv_store.get(f"session:{user_id}:{r.metadata['session_id']}")
            ))
            for r in results
        ]

    def build_continuation_context(self, user_id: str, current_query: str) -> str:
        relevant = self.recall_relevant(user_id, current_query)
        if not relevant:
            return ""
        lines = ["[Prior session context]"]
        for s in relevant:
            lines.append(f"\nSession: {s.task_description}")
            if s.decisions_made:
                lines.append(f"Decisions made: {'; '.join(s.decisions_made)}")
            if s.work_completed:
                lines.append(f"Work completed: {'; '.join(s.work_completed)}")
            if s.open_questions:
                lines.append(f"Open questions: {'; '.join(s.open_questions)}")
        return "\n".join(lines)

    def _to_text(self, summary: SessionSummary) -> str:
        return (
            f"Task: {summary.task_description}. "
            f"Completed: {', '.join(summary.work_completed)}. "
            f"Decided: {', '.join(summary.decisions_made)}."
        )
```

The session summary is generated at the end of each session by asking the model to distill what happened into a structured format.
This compressed representation is what gets injected at the start of the next session - not the raw conversation history, which would be too long and too noisy.
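One way to produce that structured summary is a fixed JSON-shaped prompt plus a tolerant parser. This is a sketch under assumptions: the prompt wording and the parse_summary helper are illustrative, and the actual model call is omitted.

```python
import json

SUMMARY_PROMPT = """Summarize this session as JSON with keys:
task_description, decisions_made, work_completed, open_questions,
next_recommended_steps. Lists of short strings; be specific.

Transcript:
{transcript}"""

def build_summary_prompt(transcript: str) -> str:
    return SUMMARY_PROMPT.format(transcript=transcript)

def parse_summary(session_id: str, created_at: str, reply: str) -> dict:
    """Parse the model's JSON reply, tolerating missing keys."""
    data = json.loads(reply)
    expected = ["task_description", "decisions_made", "work_completed",
                "open_questions", "next_recommended_steps"]
    out = {"session_id": session_id, "created_at": created_at}
    for key in expected:
        default = "" if key == "task_description" else []
        out[key] = data.get(key, default)
    return out

reply = ('{"task_description": "Refactor auth module", '
         '"work_completed": ["split login handler"], '
         '"decisions_made": ["keep sessions in Redis"], '
         '"open_questions": [], '
         '"next_recommended_steps": ["add tests"]}')
summary = parse_summary("s1", "2024-01-01T00:00:00", reply)
```

The resulting dict can be passed straight to `SessionSummary(**summary)` and handed to `SessionMemory.save_session`. Defaulting missing keys keeps one malformed model reply from blocking the end-of-session save.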
Without Tier 2: every new session starts blind. The agent re-asks questions already answered. The user re-explains context already given. The work already done has to be rediscovered.
Tier 3: User State (Long-Term Preferences)
User state captures standing facts about the user that persist indefinitely and inform agent behavior across all sessions and tasks.
```python
class UserStateStore:
    def __init__(self, kv_store):
        self.store = kv_store

    def set(self, user_id: str, key: str, value: str):
        self.store.set(f"user:{user_id}:{key}", value)

    def get(self, user_id: str, key: str) -> str | None:
        return self.store.get(f"user:{user_id}:{key}")

    def get_all(self, user_id: str) -> dict:
        keys = self.store.keys(f"user:{user_id}:*")
        return {
            k.replace(f"user:{user_id}:", ""): self.store.get(k)
            for k in keys
        }

    def build_user_context(self, user_id: str) -> str:
        facts = self.get_all(user_id)
        if not facts:
            return ""
        lines = ["[User preferences and context]"]
        for k, v in facts.items():
            lines.append(f"- {k}: {v}")
        return "\n".join(lines)
```

Examples of user state: preferred programming language, coding style preferences, project tech stack, timezone, communication preferences, standing access permissions. These facts don't change session to session and shouldn't need to be re-established every time.
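A quick demonstration against an in-memory store. DictKV is an illustrative stand-in for a Redis-style KV store, and UserStateStore is repeated here in condensed form so the snippet runs standalone.

```python
import fnmatch

class DictKV:
    """In-memory stand-in for a Redis-style KV store (illustrative)."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)
    def keys(self, pattern):
        return [k for k in self.data if fnmatch.fnmatch(k, pattern)]

class UserStateStore:  # condensed from the definition above
    def __init__(self, kv_store):
        self.store = kv_store
    def set(self, user_id, key, value):
        self.store.set(f"user:{user_id}:{key}", value)
    def get_all(self, user_id):
        keys = self.store.keys(f"user:{user_id}:*")
        return {k.replace(f"user:{user_id}:", ""): self.store.get(k)
                for k in keys}
    def build_user_context(self, user_id):
        facts = self.get_all(user_id)
        if not facts:
            return ""
        return "\n".join(["[User preferences and context]"] +
                         [f"- {k}: {v}" for k, v in facts.items()])

users = UserStateStore(DictKV())
users.set("u42", "preferred_language", "Python")
users.set("u42", "timezone", "Europe/Berlin")
print(users.build_user_context("u42"))
# [User preferences and context]
# - preferred_language: Python
# - timezone: Europe/Berlin
```

The context block is injected at session start; a user with no stored facts gets an empty string, so nothing is added to the prompt.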
Without Tier 3: every session starts with a stranger. The agent that helped you yesterday knows nothing about you today.
The Stateful Execution Contract in Practice
The Stateful Execution Contract introduced above is implemented by exactly these two mechanisms: checkpointing makes progress durable, and idempotency keys make retries safe. Together they guarantee that a task produces the same final output regardless of how many times it is interrupted and resumed.
The contract has three terms:
Progress is durable. Every completed step is recorded before the next step begins. Interruption cannot cause more than one step of lost work.
Side effects are idempotent. No step with external side effects executes more than once for a given task execution. Retries re-use recorded outputs, not re-execute.
Output is deterministic given inputs. The final output of a task depends only on the task inputs and the steps executed, not on the number of restarts required to complete it.
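One way to strengthen terms 2 and 3 is to derive idempotency keys from a step's content rather than its position, so a resumed run that re-plans or reorders its steps still recognizes work already done. A sketch - idempotency_key here is an illustrative variant of the index-based keys used earlier:

```python
import hashlib
import json

def idempotency_key(task_id: str, step: dict) -> str:
    """Deterministic key: same task + same step content -> same key,
    no matter how many times the process restarts."""
    digest = hashlib.sha256(
        json.dumps(step, sort_keys=True).encode()  # canonical serialization
    ).hexdigest()[:16]
    return f"{task_id}:{digest}"

k1 = idempotency_key("t1", {"type": "write", "target": "report.md"})
k2 = idempotency_key("t1", {"target": "report.md", "type": "write"})
print(k1 == k2)  # True - dict key order doesn't change the key
```

The `sort_keys=True` canonicalization matters: without it, two semantically identical steps could hash differently and defeat the deduplication.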
Without the Stateful Execution Contract, you do not have an agentic system. You have a script that breaks under any fault.
What Observability Looks Like for This Layer
Checkpoint resume rate - what fraction of task executions resume from a checkpoint rather than starting fresh? This tells you how often your agents are being interrupted mid-task. A high resume rate signals infrastructure instability (processes dying frequently) or task timeouts (tasks running longer than expected).
Idempotency hit rate - what fraction of steps are skipped because their idempotency key already has a recorded result? Zero means no restarts occurred. A non-zero rate means processes are dying and resuming - and that the checkpoint layer is doing its job: restarted runs are reusing recorded outputs instead of re-executing side effects.
Session memory recall precision - when session memory is injected at session start, does it improve task completion time or quality? Measure by comparing sessions with and without relevant prior context. If recall doesn't help, your session summaries need to be more structured.
Checkpoint storage growth - how much storage is checkpointing consuming over time? Set expiry policies. Completed task checkpoints don't need to live forever. Seven days is usually sufficient for debugging; 30 days for audit.
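These metrics fall out of the checkpoint records directly. A minimal sketch, assuming each execution is summarized as a record with resumed, steps_run, and steps_skipped (idempotency-key hits) fields - the record shape is an assumption, not a prescribed schema:

```python
def state_metrics(executions: list[dict]) -> dict:
    """Compute Layer 7 health metrics from per-execution records."""
    total = len(executions)
    resumed = sum(1 for e in executions if e["resumed"])
    run = sum(e["steps_run"] for e in executions)
    skipped = sum(e["steps_skipped"] for e in executions)
    return {
        # fraction of executions that picked up an existing checkpoint
        "resume_rate": resumed / total if total else 0.0,
        # fraction of steps served from recorded results instead of re-running
        "idempotency_hit_rate": skipped / (run + skipped) if run + skipped else 0.0,
    }

m = state_metrics([
    {"resumed": False, "steps_run": 10, "steps_skipped": 0},
    {"resumed": True,  "steps_run": 4,  "steps_skipped": 6},
])
print(m)  # {'resume_rate': 0.5, 'idempotency_hit_rate': 0.3}
```

A rising resume rate with a flat idempotency hit rate is the alarming combination: processes are dying, but restarted runs aren't finding recorded work, which suggests checkpoints aren't being written or read correctly.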
What to Build First
First: Idempotency keys on all write operations. Before anything else, make every step that writes to external systems idempotent. This is the minimum viable protection against duplicate side effects from retries.
Second: Checkpoint-resume for tasks over 60 seconds. Any task that runs longer than 60 seconds should checkpoint. Implement the checkpoint manager and wire it into your agent execution loop.
Third: Session summaries. At the end of each session, prompt the model to generate a structured summary: decisions made, work completed, open questions, next steps. Store it.
Fourth: Session memory injection. At the start of each session, retrieve relevant prior session summaries using semantic search. Inject them into the system context.
Fifth: User state store. Capture standing user preferences and project context. Inject at session start.
The Principle
Long-running agents operate in a world where things fail. Processes die. Network calls time out. Infrastructure has incidents. These are not exceptional events - at production scale they are routine.
The question is not whether your agent will be interrupted. It's whether you designed it to resume.
Checkpoint-resume, idempotency, and session memory are not optional features for agentic systems. They are the baseline that separates a production agent from a fragile script.
An agent that starts over is not an agent. It's an expensive coin flip.
What's Next in This Series
- Part 1: Harness Engineering - The Missing Layer - The full seven-layer Harness Architecture overview
- Part 2: Normalization and Input Defense - Prompt injection, input sanitization, and multi-surface consistency
- Part 3: Context Engineering - Memory architectures, retrieval strategies, and context compression
- Part 4: Gated Execution - Policy engines, human-in-the-loop design, and dry-run patterns
- Part 5: Validation Layer Design - Schema validators, semantic checks, and repair prompt patterns
- Part 6: Retry, Fallback, and Circuit Breaking - Building resilient LLM infrastructure that survives outages
- Part 8: Deterministic Constraint Systems - Building tool registries and action manifests that prevent hallucinated actions in agentic systems
References
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media.
- Gray, J., & Reuter, A. (1992). Transaction Processing: Concepts and Techniques. Morgan Kaufmann.
- Park, J. S., O'Brien, J. C., Cai, C. J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. ACM UIST 2023. https://arxiv.org/abs/2304.03442
- Zhong, W., Guo, L., Gao, Q., et al. (2024). MemoryBank: Enhancing Large Language Models with Long-Term Memory. AAAI 2024. https://arxiv.org/abs/2305.10250
- LangGraph. (2024). LangGraph Persistence and Checkpointing. https://langchain-ai.github.io/langgraph/concepts/persistence
- Anthropic. (2024). Building effective agents. https://www.anthropic.com/research/building-effective-agents
- Richardson, C. (2018). Microservices Patterns. Manning Publications.
Related Articles
- Context Engineering: What the Model Sees Is What the Model Does
- Deterministic Constraint Systems: Building Tool Registries That Keep Agents in Scope
- Normalization and Input Defense: Hardening the Entry Point of Your LLM System
- Harness Engineering: The Missing Layer Between LLMs and Production Systems