State Persistence & Agent Memory - The Complete Domain
Deep dive into agent memory systems: working memory, episodic memory, semantic memory, checkpointing patterns, context management, and long-running workflow persistence
Prerequisite: This is Part 2 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Why This Matters
Your agent is 45 minutes into a complex research task. User closes their browser. Server restarts. All progress lost.
Or worse: agent crashes mid-booking. User refreshes. Agent starts over. Now there’s an orphaned booking in your system.
The Core Problem (from Anthropic, November 2025):
“The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before.”
Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. That’s what happens when context windows fill up or processes crash.
What Goes Wrong Without This:
| Symptom | Cause |
|---|---|
| Agent starts over after user closes browser | State lived only in memory. Browser close = process kill = state gone. No checkpointing to durable storage. |
| Agent repeats completed tasks in new session | No explicit progress tracking. New session doesn't know what's done, so the agent re-does work, wasting time and tokens. |
| Can't debug failed agent runs from yesterday | State was ephemeral. Once the process died, context was lost. No audit trail, no replay capability. |
Why Context Windows Aren’t Enough
Even with 200k token windows (Claude) or 1M tokens (Gemini):
- Complex tasks overflow: Software development, research, financial modeling require more context than any window holds
- Token costs scale linearly: Keeping everything in context = expensive
- Latency increases: Larger context = slower inference
- Attention degrades: Very long contexts hurt model performance
Most production tasks require work across many sessions.
The Three Challenges
| Challenge | Problem | Solution |
|---|---|---|
| Persistence | State lost on crash/restart | Checkpoint to durable storage |
| Recovery | Don’t know what completed | Track progress explicitly |
| Context Bridging | New session lacks context | Progress files, structured handoff |
Agent Memory Systems: The Complete Picture
State management is really about memory. Understanding the different types of memory helps you design robust agents.
The Memory Taxonomy
AGENT MEMORY SYSTEMS

WORKING MEMORY (In-Context)
├── Current conversation turns
├── Active task state
├── Immediate observations
└── Token-limited, ephemeral

EPISODIC MEMORY (Session State)
├── Conversation history
├── Actions taken and results
├── Decisions made and why
└── Checkpointed, survives crashes

SEMANTIC MEMORY (Long-term Knowledge)
├── User preferences
├── Learned patterns
├── Domain knowledge
└── Vector DB, persists across sessions

PROCEDURAL MEMORY (How-to Knowledge)
├── Tool usage patterns
├── Workflow sequences
├── Successful strategies
└── Embedded in prompts/fine-tuning
Memory Type Comparison
| Memory Type | Persistence | Scope | Storage | Retrieval |
|---|---|---|---|---|
| Working | None (context window) | Current turn | LLM context | Automatic |
| Episodic | Session | Current task | Checkpointer (Postgres) | By thread_id |
| Semantic | Permanent | All tasks | Vector DB | Similarity search |
| Procedural | Permanent | All tasks | Prompts / Fine-tuning | Always loaded |
How They Map to Implementation
class AgentMemory:
def __init__(self):
# Working Memory: Current context window
self.working_memory = [] # Just conversation turns
# Episodic Memory: Checkpointed session state
self.episodic = PostgresSaver.from_conn_string(DB_URL)
# Semantic Memory: Long-term learned knowledge
self.semantic = VectorDB(embedding_model="text-embedding-3-small")
# Procedural Memory: Baked into the system prompt
self.procedural = load_system_prompt("agent_instructions.md")
def process_turn(self, user_input, thread_id):
# 1. Load episodic memory (session state)
session_state = self.episodic.load(thread_id)
# 2. Query semantic memory (relevant long-term knowledge)
relevant_knowledge = self.semantic.search(user_input, k=3)
# 3. Build working memory (context for this turn)
        self.working_memory = [
            {"role": "system", "content": self.procedural},
            *session_state.setdefault("conversation_history", []),  # create the list if missing so later appends don't fail
            {"role": "system", "content": format_knowledge(relevant_knowledge)},  # chat APIs have no "context" role
            {"role": "user", "content": user_input}
        ]
# 4. Get response
response = llm.chat(self.working_memory)
# 5. Update episodic memory
session_state["conversation_history"].append(
{"role": "user", "content": user_input}
)
session_state["conversation_history"].append(
{"role": "assistant", "content": response}
)
self.episodic.save(thread_id, session_state)
# 6. Optionally update semantic memory with learned insights
if self.should_memorize(response):
self.semantic.insert(extract_insight(response))
return response
The Context Management Problem
The core tradeoff: More context = better understanding, but also:
- Higher token costs
- Increased latency
- Attention degradation on very long contexts
The solution hierarchy:
STRATEGY 1: Keep it small (preferred)
└── Only put what's needed for THIS turn in context

STRATEGY 2: Summarize when growing
├── Compress old conversation turns
└── Keep recent turns verbatim

STRATEGY 3: Externalize to retrieval
├── Store knowledge in vector DB
└── Retrieve relevant chunks per turn

STRATEGY 4: Multi-session handoff
├── End session with progress file
└── New session starts fresh with progress context
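Strategy 2 can be sketched in a few lines. This is a minimal illustration, assuming turns are `{"role", "content"}` dicts; the `summarize` helper here is a stub that truncates, where a real implementation would make an LLM call.

```python
def summarize(turns, max_chars=200):
    """Stub summarizer: join and truncate. In practice, an LLM call."""
    joined = " | ".join(t["content"] for t in turns)
    return joined[:max_chars]

def compress_history(history, keep_recent=4):
    """Fold old turns into one summary message; keep recent turns verbatim."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "system",
               "content": "Summary of earlier turns: " + summarize(old)}
    return [summary] + recent
```

The design choice to keep here is the split: old turns are compressed together, recent turns survive word-for-word so the model can still quote them.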
Memory Flow Diagram
User Request
     │
     ▼
MEMORY ORCHESTRATION
├── WORKING MEMORY (context)
├── EPISODIC MEMORY (session)
└── SEMANTIC MEMORY (vector)
     │
     ▼
BUILD CONTEXT (select relevant from each type)
     │
     ▼
LLM CALL
     │
     ▼
UPDATE MEMORIES
├── Episodic: + this turn
└── Semantic: + extracted insights
Common Memory Anti-patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| Everything in context | Token explosion, attention degradation | Use semantic memory for stable knowledge |
| No session continuity | Agent forgets mid-conversation | Checkpoint episodic memory |
| Context as database | Slow, expensive, fragile | Store data externally, retrieve what’s needed |
| No memory pruning | Unbounded growth | TTL on episodic, compaction on working |
| Ignoring procedural | Agent reinvents wheels | Bake patterns into system prompt |
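The "no memory pruning" fix (TTL on episodic memory) can be as simple as a periodic delete. A sketch using SQLite, assuming a `checkpoints` table with an `updated_at` epoch column — an illustrative schema, not LangGraph's actual one:

```python
import sqlite3
import time

def prune_checkpoints(conn, ttl_seconds):
    """Delete checkpoints older than ttl_seconds; return rows removed."""
    cutoff = time.time() - ttl_seconds
    cur = conn.execute("DELETE FROM checkpoints WHERE updated_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run it on a schedule (cron, background task) with a TTL that matches how long you need replay/debugging capability, e.g. `prune_checkpoints(conn, 86400 * 30)` for a 30-day window.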
Solution 1: LangGraph Checkpointers
LangGraph is the industry standard for agent state management. Here’s how to use it in production.
Basic Setup
from langgraph.checkpoint.postgres import PostgresSaver
# Production: PostgreSQL for durability
checkpointer = PostgresSaver.from_conn_string(
"postgresql://user:pass@host:5432/db"
)
# Create graph with checkpointing
graph = StateGraph(AgentState)
graph.add_node("think", think_node)
graph.add_node("act", act_node)
# ... add edges ...
app = graph.compile(checkpointer=checkpointer)
# Execute with thread_id for persistence
config = {"configurable": {"thread_id": "user-123-task-456"}}
result = app.invoke({"input": "Book flight to NYC"}, config)
# Later: resume from checkpoint
# Same thread_id = same state
result = app.invoke({"input": "Make it morning flight"}, config)
What StateSnapshot Captures
# Every checkpoint stores:
{
"channel_values": {...}, # Current state data
"next_nodes": ["act"], # What to execute next
"config": {...}, # Configuration
"metadata": {
"writes": {...}, # Recent modifications
"step": 5 # Progress counter
},
"pending_tasks": [...] # Incomplete work
}
Storage Options
| Storage | Use Case | Tradeoffs |
|---|---|---|
| MemorySaver | Development | Fast, lost on restart |
| SQLiteSaver | Single-node | Local persistence, limited scale |
| PostgresSaver | Production | Multi-node, ACID guarantees |
| S3 | Archival | Long-term storage, slower access |
Production rule: Always use PostgresSaver (or equivalent) in production. MemorySaver is for local development only.
Solution 2: Checkpoint Timing
This is where most teams get it wrong. The timing of checkpoints matters.
Wrong: Checkpoint After Execution
# WRONG: If crash happens between execute and checkpoint,
# you don't know if step ran
def execute_step(self, step):
result = step.run() # Execute
self.state['completed'].append(step.id)
self.checkpoint() # Save state
# ^ If crash happens before checkpoint, step ran but state doesn't show it
return result
Right: Checkpoint Before AND After
def execute_step(self, step):
# BEFORE: Mark intent (crash here = know step was attempted)
self.state['in_progress'] = step.id
self.checkpoint()
# Execute
result = step.run()
# AFTER: Mark completion
self.state['completed'].append(step.id)
del self.state['in_progress']
self.state['last_result'] = result
self.checkpoint()
return result
Resume Logic
def resume(self):
state = self.load_checkpoint()
if 'in_progress' in state:
# Crashed during execution
step_id = state['in_progress']
# Check if step actually completed (idempotent read)
if self.check_step_completed_externally(step_id):
# Step ran, just didn't checkpoint
state['completed'].append(step_id)
del state['in_progress']
self.checkpoint()
else:
# Step didn't complete — re-execute with idempotency key
step = self.get_step(step_id)
self.execute_step(step)
# Continue from last known good state
return state
Why this works: If you crash between the two checkpoints, the in_progress marker tells you exactly what was happening. You can check if it completed and act accordingly.
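The "re-execute with idempotency key" branch deserves its own sketch. Here the effects log is a plain dict standing in for a database table keyed by step ID; `run_step_idempotent` and the log are hypothetical names for illustration.

```python
# In production this would be a database table keyed by step_id;
# a dict stands in for it here.
effects_log = {}

def run_step_idempotent(step_id, action):
    """Run action at most once per step_id; replays return the recorded result."""
    if step_id in effects_log:        # step already ran before the crash/retry
        return effects_log[step_id]   # reuse recorded result instead of redoing it
    result = action()
    effects_log[step_id] = result     # record the effect under the idempotency key
    return result
```

This is what makes crash recovery safe for side-effecting steps like bookings: replaying the step after a crash returns the recorded confirmation instead of creating the orphaned booking from the opening example.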
Solution 3: Progress Tracking Files (Anthropic Pattern)
For multi-session tasks, explicit progress files bridge context gaps.
The Two-Agent Pattern
# Initializer Agent (first run only)
def initialize_project(task):
# Set up environment
setup_environment()
# Create progress file
progress = {
"goal": task.description,
"completed_steps": [],
"blockers": [],
"next_action": "Analyze requirements",
"context": {"files": [], "apis": []}
}
write_file("claude-progress.txt", format_progress(progress))
git_commit("Initial project setup")
# Coding Agent (every session)
def continue_work():
# Read progress from last session
progress = read_file("claude-progress.txt")
# Make incremental progress
result = work_on_next_action(progress)
# Update progress for next session
progress["completed_steps"].append(result.action)
progress["next_action"] = result.next_step
write_file("claude-progress.txt", format_progress(progress))
git_commit(f"Completed: {result.action}")
Progress File Structure
# Progress: Book Flight to NYC
## Current Goal
Book morning flight to NYC for tomorrow
## Completed Steps
1. [x] Parsed user intent: destination=NYC, date=tomorrow
2. [x] Inferred departure: SFO (from calendar)
3. [x] Searched flights: 47 options found
4. [x] User clarified: wants LaGuardia, not JFK
5. [x] Filtered to LGA: 18 options
## Current Blocker
8am flight sold out while user was deciding
## Next Action
Present 9am alternative ($12 more)
## Context
- User prefers aisle seats
- Corporate travel policy: max $500
- Departure: SFO
- Arrival: LGA
Why this works: New session reads progress file first. Immediate context on what’s done, what’s blocked, what’s next. No wasted tokens re-discovering state.
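The two-agent pattern above calls a `format_progress` helper without defining it. A minimal version, assuming the dict fields from that example and rendering roughly the markdown layout shown:

```python
def format_progress(p):
    """Render a progress dict as the markdown progress-file layout above."""
    lines = ["# Progress: " + p["goal"], "", "## Completed Steps"]
    lines += [f"{i}. [x] {step}" for i, step in enumerate(p["completed_steps"], 1)]
    lines += ["", "## Blockers"] + [f"- {b}" for b in p["blockers"]]
    lines += ["", "## Next Action", p["next_action"]]
    return "\n".join(lines)
```

Markdown works well here precisely because both the writer and the reader of the file are LLM sessions: the structure survives round-trips without a parser.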
Solution 4: Hybrid Memory
For sophisticated agents, combine short-term checkpoints with long-term vector memory.
class HybridMemory:
def __init__(self, checkpointer, vector_db):
self.checkpointer = checkpointer # Short-term
self.vector_db = vector_db # Long-term
def save_session_state(self, thread_id, state):
"""Short-term: current conversation, active task"""
self.checkpointer.save(thread_id, state)
def save_insight(self, insight):
"""Long-term: learned patterns, preferences"""
embedding = embed(insight)
self.vector_db.insert(embedding, insight)
def recall_relevant(self, query, k=5):
"""Retrieve relevant long-term memories"""
return self.vector_db.search(embed(query), k=k)
def load_context(self, thread_id, current_input):
"""Combine short-term state + relevant long-term memories"""
state = self.checkpointer.load(thread_id)
memories = self.recall_relevant(current_input)
return {**state, "relevant_memories": memories}
When to Use Each
| Memory Type | Use For | Don’t Use For |
|---|---|---|
| Short-term (Checkpointer) | Current conversation, active task state | Preferences learned months ago |
| Long-term (Vector DB) | User preferences, learned patterns, domain knowledge | Ephemeral conversation turns |
Key insight: Query long-term memory as a tool (retrieve when needed), don’t jam everything into context.
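"Memory as a tool" looks like any other tool definition. A sketch, assuming an OpenAI-style tool schema; the keyword-overlap search is a stub standing in for embedding similarity, and all names here are illustrative:

```python
MEMORIES = [
    "User prefers aisle seats",
    "Corporate travel policy: max $500",
]

# OpenAI-style tool schema the model can call when it needs a memory
memory_tool = {
    "name": "search_memory",
    "description": "Retrieve relevant long-term memories for a query",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def search_memory(query, k=2):
    """Stub retrieval: keyword overlap stands in for embedding similarity."""
    words = set(query.lower().split())
    ranked = sorted(MEMORIES,
                    key=lambda m: -len(words & set(m.lower().split())))
    return ranked[:k]
```

The point is the shape: the model decides when to call `search_memory`, so only the hits it asked for enter context, instead of every stored memory on every turn.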
Observation Masking
For software engineering agents, most tokens in a turn are observation (test output, file contents). This explodes context fast.
def compact_history(history):
compacted = []
for turn in history:
if turn.type == "observation":
# Compress verbose output
compacted.append({
"type": "observation_summary",
"content": summarize(turn.content, max_tokens=100)
})
else:
# Keep action/reasoning in full
compacted.append(turn)
return compacted
# Before: 50k tokens of test output
# After: 100 token summary of test results
Result: Targets the token-heavy part while preserving decision history.
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Checkpoint too large | Save/load becomes bottleneck | Prune old observations, limit history depth |
| Checkpoint corruption | State lost or inconsistent | Atomic writes, versioning, backup checkpoints |
| Session resume confusion | Agent repeats completed tasks | Explicit progress files, structured state schema |
| No checkpoint before execution | Can’t tell if step ran on crash | Checkpoint intent BEFORE execution |
| No atomic writes | Partial checkpoint on crash | Use database transactions, write-ahead logging |
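For file-based checkpoints, the "no atomic writes" fix is the classic temp-file-plus-rename pattern. A minimal sketch (function name is illustrative), relying on `os.replace` being atomic on POSIX filesystems:

```python
import json
import os
import tempfile

def save_checkpoint_atomic(path, state):
    """Write state to path atomically: temp file + fsync + os.replace."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force bytes to disk before the rename
        os.replace(tmp, path)      # atomic: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise
```

A crash mid-write leaves only an orphaned temp file; the last committed checkpoint at `path` is untouched. Database-backed checkpointers get the same guarantee from transactions.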
Multi-Agent State (Still Fragile)
2025 Reality Check (from research):
“Multi-agent systems are not yet capable of engaging in long-context, proactive discourse with significantly more reliability than a single agent.”
Why Multi-Agent State Is Hard:
- Context fragmentation across agents
- Synchronization overhead
- Network latency disrupts state updates
- Error compounding from fragmented information
Claude Code’s Solution: Single-threaded subtasking
- Spawns subtasks but never runs parallel work
- Main agent retains comprehensive context
- Prevents error compounding from fragmented state
Recommendation: Start with single-agent, add multi-agent only when necessary.
The Checkpointing Checklist
Before deploying an agent with persistent state:
CHECKPOINT STORAGE
[ ] Using PostgreSQL (not MemorySaver) in production
[ ] Connection pooling configured
[ ] Backup strategy defined
[ ] TTL on old checkpoints to prevent unbounded growth

CHECKPOINT TIMING
[ ] Checkpoint BEFORE execution (mark intent)
[ ] Checkpoint AFTER execution (mark completion)
[ ] Resume logic handles in_progress state
[ ] Idempotent external checks for crash recovery

PROGRESS TRACKING
[ ] Explicit progress file for multi-session tasks
[ ] Git commits after significant progress (audit trail)
[ ] Clear next_action for new sessions

CONTEXT MANAGEMENT
[ ] Observation masking for verbose outputs
[ ] History pruning strategy
[ ] Long-term memory separate from session state
Key Takeaways
- Context windows aren’t enough. Complex tasks require state that survives sessions.
- Checkpoint timing matters. Checkpoint BEFORE execution to know what was attempted. Checkpoint AFTER to know what succeeded.
- Progress files bridge sessions. New session reads progress first. No wasted tokens rediscovering state.
- Hybrid memory separates concerns. Short-term state in checkpointer. Long-term knowledge in vector DB.
- Multi-agent state is fragile. Start single-agent. Add complexity only when necessary.
Next Steps
State persists. But what happens when the agent needs human judgment?
→ Part 3: Human-in-the-Loop Patterns
Or jump to another topic:
- Part 4: Cost Control — Token budgets and circuit breakers
- Part 6: Durable Execution — Temporal, Inngest, Restate frameworks