
Production Agents Series

State Persistence & Agent Memory - The Complete Domain

Deep dive into agent memory systems: working memory, episodic memory, semantic memory, checkpointing patterns, context management, and long-running workflow persistence

Prerequisite: This is Part 2 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

Your agent is 45 minutes into a complex research task. User closes their browser. Server restarts. All progress lost.

Or worse: agent crashes mid-booking. User refreshes. Agent starts over. Now there’s an orphaned booking in your system.

The Core Problem (from Anthropic, November 2025):

“The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before.”

Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. That’s what happens when context windows fill up or processes crash.

What Goes Wrong Without This:

STATE PERSISTENCE FAILURE PATTERNS

Symptom: Agent starts over after user closes browser.
Cause:   State lived only in memory. Browser close = process kill = state gone.
         No checkpointing to durable storage.

Symptom: Agent repeats completed tasks in new session.
Cause:   No explicit progress tracking. New session doesn't know what's done.
         Agent re-does work, wastes time and tokens.

Symptom: Can't debug failed agent runs from yesterday.
Cause:   State was ephemeral. Once the process died, context was lost.
         No audit trail, no replay capability.

Why Context Windows Aren’t Enough

Even with 200k token windows (Claude) or 1M tokens (Gemini):

  • Complex tasks overflow: Software development, research, financial modeling require more context than any window holds
  • Token costs scale linearly: Keeping everything in context = expensive
  • Latency increases: Larger context = slower inference
  • Attention degrades: Very long contexts hurt model performance

Most production tasks require work across many sessions.


The Three Challenges

| Challenge | Problem | Solution |
| --- | --- | --- |
| Persistence | State lost on crash/restart | Checkpoint to durable storage |
| Recovery | Don't know what completed | Track progress explicitly |
| Context Bridging | New session lacks context | Progress files, structured handoff |

Agent Memory Systems: The Complete Picture

State management is really about memory. Understanding the different types of memory helps you design robust agents.

The Memory Taxonomy

AGENT MEMORY SYSTEMS

WORKING MEMORY (In-Context)
  • Current conversation turns
  • Active task state
  • Immediate observations
  • Token-limited, ephemeral

EPISODIC MEMORY (Session State)
  • Conversation history
  • Actions taken and results
  • Decisions made and why
  • Checkpointed, survives crashes

SEMANTIC MEMORY (Long-term Knowledge)
  • User preferences
  • Learned patterns
  • Domain knowledge
  • Vector DB, persists across sessions

PROCEDURAL MEMORY (How-to Knowledge)
  • Tool usage patterns
  • Workflow sequences
  • Successful strategies
  • Embedded in prompts/fine-tuning

Memory Type Comparison

| Memory Type | Persistence | Scope | Storage | Retrieval |
| --- | --- | --- | --- | --- |
| Working | None (context window) | Current turn | LLM context | Automatic |
| Episodic | Session | Current task | Checkpointer (Postgres) | By thread_id |
| Semantic | Permanent | All tasks | Vector DB | Similarity search |
| Procedural | Permanent | All tasks | Prompts / Fine-tuning | Always loaded |

How They Map to Implementation

class AgentMemory:
    def __init__(self):
        # Working Memory: Current context window
        self.working_memory = []  # Just conversation turns

        # Episodic Memory: Checkpointed session state
        self.episodic = PostgresSaver.from_conn_string(DB_URL)

        # Semantic Memory: Long-term learned knowledge
        self.semantic = VectorDB(embedding_model="text-embedding-3-small")

        # Procedural Memory: Baked into the system prompt
        self.procedural = load_system_prompt("agent_instructions.md")

    def process_turn(self, user_input, thread_id):
        # 1. Load episodic memory (session state)
        session_state = self.episodic.load(thread_id)

        # 2. Query semantic memory (relevant long-term knowledge)
        relevant_knowledge = self.semantic.search(user_input, k=3)

        # 3. Build working memory (context for this turn)
        self.working_memory = [
            {"role": "system", "content": self.procedural},
            *session_state.get("conversation_history", []),
            {"role": "system", "content": format_knowledge(relevant_knowledge)},  # retrieved context
            {"role": "user", "content": user_input}
        ]

        # 4. Get response
        response = llm.chat(self.working_memory)

        # 5. Update episodic memory (create history on first turn)
        history = session_state.setdefault("conversation_history", [])
        history.append({"role": "user", "content": user_input})
        history.append({"role": "assistant", "content": response})
        self.episodic.save(thread_id, session_state)

        # 6. Optionally update semantic memory with learned insights
        if self.should_memorize(response):
            self.semantic.insert(extract_insight(response))

        return response

The Context Management Problem

The core tradeoff: More context = better understanding, but also:

  • Higher token costs
  • Increased latency
  • Attention degradation on very long contexts

The solution hierarchy:

CONTEXT MANAGEMENT STRATEGIES

STRATEGY 1: Keep it small (preferred)
  • Only put what's needed for THIS turn in context

STRATEGY 2: Summarize when growing
  • Compress old conversation turns
  • Keep recent turns verbatim

STRATEGY 3: Externalize to retrieval
  • Store knowledge in vector DB
  • Retrieve relevant chunks per turn

STRATEGY 4: Multi-session handoff
  • End session with progress file
  • New session starts fresh with progress context
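Strategy 2 can be sketched as follows. The `summarize` and `count_tokens` helpers here are deliberately crude stand-ins (a word count and a first-sentence extractor); in production you would swap in a real tokenizer and an LLM summarization call:

```python
def count_tokens(turns):
    # Rough proxy: whitespace word count (use a real tokenizer in production)
    return sum(len(t["content"].split()) for t in turns)

def summarize(turns):
    # Placeholder summarizer: first sentence of each turn (use an LLM call in practice)
    return " ".join(t["content"].split(".")[0] for t in turns)

def manage_context(turns, keep_recent=4, budget_tokens=200):
    """Strategy 2: compress old turns into one summary, keep recent turns verbatim."""
    if count_tokens(turns) <= budget_tokens:
        return turns  # still small enough, leave untouched
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize(old)}
    return [summary, *recent]
```

The recent turns stay verbatim because the model needs them for coreference ("make it a morning flight"); only the older tail is lossy.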

Memory Flow Diagram

MEMORY ORCHESTRATION FLOW

User Request
  ↓
BUILD CONTEXT: select relevant content from each memory type
  • Working memory (current context)
  • Episodic memory (session state)
  • Semantic memory (vector retrieval)
  ↓
LLM CALL
  ↓
UPDATE MEMORIES
  • Episodic: append this turn
  • Semantic: store any new insight

Common Memory Anti-patterns

| Anti-pattern | Problem | Fix |
| --- | --- | --- |
| Everything in context | Token explosion, attention degradation | Use semantic memory for stable knowledge |
| No session continuity | Agent forgets mid-conversation | Checkpoint episodic memory |
| Context as database | Slow, expensive, fragile | Store data externally, retrieve what's needed |
| No memory pruning | Unbounded growth | TTL on episodic, compaction on working |
| Ignoring procedural | Agent reinvents wheels | Bake patterns into system prompt |
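The "TTL on episodic" fix amounts to a periodic expiry sweep. A minimal sketch, assuming a hypothetical `checkpoints` table with an `updated_at` epoch column (SQLite here as a stand-in; the same `DELETE` works against Postgres):

```python
import sqlite3
import time

def prune_checkpoints(conn, ttl_days=30):
    """Delete episodic checkpoints older than the TTL to bound storage growth.
    Assumes a `checkpoints` table with an `updated_at` epoch-seconds column."""
    cutoff = time.time() - ttl_days * 86400
    cur = conn.execute("DELETE FROM checkpoints WHERE updated_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of expired checkpoints removed
```

Run it from a cron job or a scheduled task; without it, checkpoint tables grow without bound.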

Solution 1: LangGraph Checkpointers

LangGraph is the industry standard for agent state management. Here’s how to use it in production.

Basic Setup

from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# Production: PostgreSQL for durability
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:pass@host:5432/db"
)

# Create graph with checkpointing
graph = StateGraph(AgentState)
graph.add_node("think", think_node)
graph.add_node("act", act_node)
# ... add edges ...

app = graph.compile(checkpointer=checkpointer)

# Execute with thread_id for persistence
config = {"configurable": {"thread_id": "user-123-task-456"}}
result = app.invoke({"input": "Book flight to NYC"}, config)

# Later: resume from checkpoint
# Same thread_id = same state
result = app.invoke({"input": "Make it morning flight"}, config)

What StateSnapshot Captures

# Every checkpoint stores:
{
    "channel_values": {...},     # Current state data
    "next_nodes": ["act"],       # What to execute next
    "config": {...},             # Configuration
    "metadata": {
        "writes": {...},         # Recent modifications
        "step": 5                # Progress counter
    },
    "pending_tasks": [...]       # Incomplete work
}

Storage Options

| Storage | Use Case | Tradeoffs |
| --- | --- | --- |
| MemorySaver | Development | Fast, lost on restart |
| SqliteSaver | Single-node | Local persistence, limited scale |
| PostgresSaver | Production | Multi-node, ACID guarantees |
| S3 | Archival | Long-term storage, slower access |

Production rule: Always use PostgresSaver (or equivalent) in production. MemorySaver is for local development only.


Solution 2: Checkpoint Timing

This is where most teams get it wrong. The timing of checkpoints matters.

Wrong: Checkpoint After Execution

# WRONG: If crash happens between execute and checkpoint,
# you don't know if step ran
def execute_step(self, step):
    result = step.run()           # Execute
    self.state['completed'].append(step.id)
    self.checkpoint()             # Save state
    # ^ If crash happens before checkpoint, step ran but state doesn't show it
    return result

Right: Checkpoint Before AND After

def execute_step(self, step):
    # BEFORE: Mark intent (crash here = know step was attempted)
    self.state['in_progress'] = step.id
    self.checkpoint()

    # Execute
    result = step.run()

    # AFTER: Mark completion
    self.state['completed'].append(step.id)
    del self.state['in_progress']
    self.state['last_result'] = result
    self.checkpoint()

    return result

Resume Logic

def resume(self):
    state = self.load_checkpoint()

    if 'in_progress' in state:
        # Crashed during execution
        step_id = state['in_progress']

        # Check if step actually completed (idempotent read)
        if self.check_step_completed_externally(step_id):
            # Step ran, just didn't checkpoint
            state['completed'].append(step_id)
            del state['in_progress']
            self.checkpoint()
        else:
            # Step didn't complete — re-execute with idempotency key
            step = self.get_step(step_id)
            self.execute_step(step)

    # Continue from last known good state
    return state

Why this works: If you crash between the two checkpoints, the in_progress marker tells you exactly what was happening. You can check if it completed and act accordingly.
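The "idempotency key" mentioned above can be sketched like this: derive a deterministic key from the thread and step, so a retried step maps to the same key and the external service deduplicates it. `BookingClient` is a hypothetical stand-in for such a service, not a real API:

```python
import hashlib

def idempotency_key(thread_id, step_id):
    """Deterministic key: the same step retried after a crash yields the same key."""
    return hashlib.sha256(f"{thread_id}:{step_id}".encode()).hexdigest()[:32]

class BookingClient:
    """Hypothetical external API that deduplicates on a caller-supplied key."""
    def __init__(self):
        self._seen = {}

    def book(self, key, payload):
        if key in self._seen:  # retry after crash: return the prior result, no new booking
            return self._seen[key]
        result = {"booking_id": len(self._seen) + 1, **payload}
        self._seen[key] = result
        return result
```

With this in place, re-executing an `in_progress` step after a crash can never create the orphaned booking from the opening example.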


Solution 3: Progress Tracking Files (Anthropic Pattern)

For multi-session tasks, explicit progress files bridge context gaps.

The Two-Agent Pattern

# Initializer Agent (first run only)
def initialize_project(task):
    # Set up environment
    setup_environment()

    # Create progress file
    progress = {
        "goal": task.description,
        "completed_steps": [],
        "blockers": [],
        "next_action": "Analyze requirements",
        "context": {"files": [], "apis": []}
    }

    write_file("claude-progress.txt", format_progress(progress))
    git_commit("Initial project setup")

# Coding Agent (every session)
def continue_work():
    # Read progress from last session
    progress = read_file("claude-progress.txt")

    # Make incremental progress
    result = work_on_next_action(progress)

    # Update progress for next session
    progress["completed_steps"].append(result.action)
    progress["next_action"] = result.next_step

    write_file("claude-progress.txt", format_progress(progress))
    git_commit(f"Completed: {result.action}")

Progress File Structure

# Progress: Book Flight to NYC

## Current Goal

Book morning flight to NYC for tomorrow

## Completed Steps

1. [x] Parsed user intent: destination=NYC, date=tomorrow
2. [x] Inferred departure: SFO (from calendar)
3. [x] Searched flights: 47 options found
4. [x] User clarified: wants LaGuardia, not JFK
5. [x] Filtered to LGA: 18 options

## Current Blocker

8am flight sold out while user was deciding

## Next Action

Present 9am alternative ($12 more)

## Context

- User prefers aisle seats
- Corporate travel policy: max $500
- Departure: SFO
- Arrival: LGA

Why this works: New session reads progress file first. Immediate context on what’s done, what’s blocked, what’s next. No wasted tokens re-discovering state.
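The `format_progress`, `read_file`, and `write_file` helpers above are left undefined. One minimal, hypothetical rendering of the progress dict into the markdown layout shown might be:

```python
def format_progress(progress):
    """Render the progress dict as a markdown progress file.
    (One possible implementation of the format_progress helper; adjust to taste.)"""
    lines = [f"# Progress: {progress['goal']}", "", "## Completed Steps", ""]
    for i, step in enumerate(progress["completed_steps"], 1):
        lines.append(f"{i}. [x] {step}")
    if progress.get("blockers"):
        lines += ["", "## Current Blocker", ""]
        lines += [f"- {b}" for b in progress["blockers"]]
    lines += ["", "## Next Action", "", progress["next_action"]]
    return "\n".join(lines)
```

Markdown is a deliberate choice: the next session's agent reads it as plain prompt text, no parser required.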


Solution 4: Hybrid Memory

For sophisticated agents, combine short-term checkpoints with long-term vector memory.

class HybridMemory:
    def __init__(self, checkpointer, vector_db):
        self.checkpointer = checkpointer  # Short-term
        self.vector_db = vector_db        # Long-term

    def save_session_state(self, thread_id, state):
        """Short-term: current conversation, active task"""
        self.checkpointer.save(thread_id, state)

    def save_insight(self, insight):
        """Long-term: learned patterns, preferences"""
        embedding = embed(insight)
        self.vector_db.insert(embedding, insight)

    def recall_relevant(self, query, k=5):
        """Retrieve relevant long-term memories"""
        return self.vector_db.search(embed(query), k=k)

    def load_context(self, thread_id, current_input):
        """Combine short-term state + relevant long-term memories"""
        state = self.checkpointer.load(thread_id)
        memories = self.recall_relevant(current_input)
        return {**state, "relevant_memories": memories}

When to Use Each

| Memory Type | Use For | Don't Use For |
| --- | --- | --- |
| Short-term (Checkpointer) | Current conversation, active task state | Preferences learned months ago |
| Long-term (Vector DB) | User preferences, learned patterns, domain knowledge | Ephemeral conversation turns |

Key insight: Query long-term memory as a tool (retrieve when needed), don’t jam everything into context.
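Exposing recall as a tool might look like the following sketch. The schema is illustrative (JSON-Schema-style parameters, not any specific framework's API), paired with a dispatcher that routes the model's tool call to the memory store:

```python
# Hypothetical tool schema: the model calls recall_memory only when it needs
# long-term knowledge, instead of having everything pre-loaded into context.
RECALL_TOOL = {
    "name": "recall_memory",
    "description": "Search long-term memory for facts relevant to a query.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "k": {"type": "integer", "default": 3},
        },
        "required": ["query"],
    },
}

def handle_tool_call(memory, name, args):
    """Dispatch the model's tool call to the hybrid memory store."""
    if name == "recall_memory":
        return memory.recall_relevant(args["query"], k=args.get("k", 3))
    raise ValueError(f"unknown tool: {name}")
```

The payoff: tokens are spent on long-term memories only in the turns that actually need them.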


Observation Masking

For software engineering agents, most tokens in a turn are observation (test output, file contents). This explodes context fast.

def compact_history(history):
    compacted = []
    for turn in history:
        if turn["type"] == "observation":
            # Compress verbose output (test logs, file dumps)
            compacted.append({
                "type": "observation_summary",
                "content": summarize(turn["content"], max_tokens=100)
            })
        else:
            # Keep actions and reasoning in full
            compacted.append(turn)
    return compacted

# Before: 50k tokens of test output
# After: 100 token summary of test results

Result: Targets the token-heavy part while preserving decision history.


Common Gotchas

| Gotcha | Symptom | Fix |
| --- | --- | --- |
| Checkpoint too large | Save/load becomes bottleneck | Prune old observations, limit history depth |
| Checkpoint corruption | State lost or inconsistent | Atomic writes, versioning, backup checkpoints |
| Session resume confusion | Agent repeats completed tasks | Explicit progress files, structured state schema |
| No checkpoint before execution | Can't tell if step ran on crash | Checkpoint intent BEFORE execution |
| No atomic writes | Partial checkpoint on crash | Use database transactions, write-ahead logging |
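For file-based checkpoints, the "atomic writes" fix can be sketched as write-temp-then-rename: a crash mid-write leaves the old checkpoint intact rather than a truncated one (databases give you the same guarantee via transactions):

```python
import json
import os
import tempfile

def atomic_write_checkpoint(path, state):
    """Write to a temp file in the same directory, fsync, then rename.
    Readers see either the old checkpoint or the new one, never a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp, path)     # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise
```

The temp file lives in the same directory as the target because `os.replace` is only atomic within a single filesystem.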

Multi-Agent State (Still Fragile)

2025 Reality Check (from research):

“Multi-agent systems are not yet capable of engaging in long-context, proactive discourse with significantly more reliability than a single agent.”

Why Multi-Agent State Is Hard:

  • Context fragmentation across agents
  • Synchronization overhead
  • Network latency disrupts state updates
  • Error compounding from fragmented information

Claude Code’s Solution: Single-threaded subtasking

  • Spawns subtasks but never runs parallel work
  • Main agent retains comprehensive context
  • Prevents error compounding from fragmented state

Recommendation: Start with single-agent, add multi-agent only when necessary.


The Checkpointing Checklist

Before deploying an agent with persistent state:

CHECKPOINTING DEPLOYMENT CHECKLIST
CHECKPOINT STORAGE
[ ] Using PostgreSQL (not MemorySaver) in production
[ ] Connection pooling configured
[ ] Backup strategy defined
[ ] TTL on old checkpoints to prevent unbounded growth

CHECKPOINT TIMING
[ ] Checkpoint BEFORE execution (mark intent)
[ ] Checkpoint AFTER execution (mark completion)
[ ] Resume logic handles in_progress state
[ ] Idempotent external checks for crash recovery

PROGRESS TRACKING
[ ] Explicit progress file for multi-session tasks
[ ] Git commits after significant progress (audit trail)
[ ] Clear next_action for new sessions

CONTEXT MANAGEMENT
[ ] Observation masking for verbose outputs
[ ] History pruning strategy
[ ] Long-term memory separate from session state

Key Takeaways

  1. Context windows aren’t enough. Complex tasks require state that survives sessions.

  2. Checkpoint timing matters. Checkpoint BEFORE execution to know what was attempted. Checkpoint AFTER to know what succeeded.

  3. Progress files bridge sessions. New session reads progress first. No wasted tokens rediscovering state.

  4. Hybrid memory separates concerns. Short-term state in checkpointer. Long-term knowledge in vector DB.

  5. Multi-agent state is fragile. Start single-agent. Add complexity only when necessary.


Next Steps

State persists. But what happens when the agent needs human judgment?

Part 3: Human-in-the-Loop Patterns
