Idempotency & Safe Retries - The Stripe Pattern for Agents
Deep dive into idempotency: the single highest-leverage production requirement. Learn the Stripe pattern, error classification, jitter, and how to prevent cascading retry storms
Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Why This Matters
Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. Agent retries. API processed both requests. Customer is charged twice.
This isn’t a bug. This is correct retry logic meeting real-world latency.
Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.
The Numbers:
- 68% of teams hit budget overruns in first agent deployments
- 50% cite “runaway tool loops and recursive logic” as the cause
- API downtime surged 60% between Q1 2024 and Q1 2025
- More downtime = more retries = more duplicate operations
What Goes Wrong Without This:
- Symptom: Customer charged twice for the same order. Cause: Payment API timed out. Agent retried. Both charges processed. No idempotency key to deduplicate.
- Symptom: User receives 47 copies of the same email. Cause: Email send succeeded but the response was slow. Agent assumed failure and retried. No deduplication on sends.
- Symptom: Database has duplicate records with slight variations. Cause: INSERT succeeded, network dropped the response. Retry created a second record. No upsert or idempotency check.
What Idempotency Means
Idempotent: An operation that produces the same result when called multiple times with the same input.
Idempotent:
- GET /user/123 → Same user every time (safe to retry)
- DELETE /file/abc → File deleted, stays deleted (safe to retry)
- PUT /user/123 {name} → User updated to same value (safe to retry)

Not Idempotent:
- POST /charge/$100 → New charge every time (dangerous to retry)
- POST /email/send → New email every time (dangerous to retry)
- INSERT INTO orders → New row every time (dangerous to retry)

Made Idempotent:
- POST /charge/$100 + idempotency_key=xyz123 → Same charge on retry
- POST /email/send + message_id=abc456 → Same email, no duplicate
- INSERT ... ON CONFLICT DO NOTHING → Same row, no duplicate
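The last line is the cheapest place to start. A minimal sketch using SQLite (the table and column names are illustrative, and it assumes a SQLite build with UPSERT support, 3.24+): a retried INSERT cannot create a second row.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

def record_order(order_id, amount):
    # ON CONFLICT DO NOTHING makes the INSERT idempotent:
    # the first call inserts the row, every retry is a no-op
    conn.execute(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
        "ON CONFLICT (order_id) DO NOTHING",
        (order_id, amount),
    )
    conn.commit()

record_order("ord_123", 100.0)
record_order("ord_123", 100.0)  # retried call: still exactly one row
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1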
The Stripe Pattern
Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard:
from datetime import timedelta

# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"         - different each retry
    # Bad:  f"{user_id}:{retry_count}"       - different each retry
    # Good: f"{user_id}:{task_id}:{step_id}" - same across retries
    idempotency_key = f"{user_id}:{task_id}:{step_id}"
    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key
    )

# Server checks and stores
def handle_booking(request):
    key = request.idempotency_key
    # Check if we've processed this before
    cached = cache.get(key)
    if cached:
        return cached  # Return stored result, don't reprocess
    # First time: process and store result
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result
Key Generation Rules
| Include | Exclude | Why |
|---|---|---|
| user_id | timestamp | Timestamps change on retry |
| task_id | retry_count | Retry count changes on retry |
| step_id | random() | Random values change on retry |
| operation_type | request_id (if regenerated) | Must be stable across retries |
| external_reference | | |
The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.
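That test is easy to encode. A small sketch (make_idempotency_key is a hypothetical helper, not a library function): the key is a pure function of stable identifiers, so simulated retries all produce the same value.

def make_idempotency_key(user_id, task_id, step_id, operation):
    # Pure function of stable identifiers: no timestamps, counters, or randomness
    return f"{user_id}:{task_id}:{operation}:{step_id}"

# Three "retries" of the same logical operation yield exactly one distinct key
keys = {make_idempotency_key("u1", "t42", "s3", "book_flight") for _ in range(3)}
assert len(keys) == 1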
Three Idempotency Strategies
Strategy 1: Idempotency Keys (Stripe Pattern)
Best for: External APIs, payments, bookings
class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache
        cached = self.cache.get(idempotency_key)
        if cached:
            return cached
        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}"
)
Strategy 2: Sequence Numbers
Best for: Internal state changes, ordered operations
class OutOfOrderError(Exception):
    """Raised when an operation arrives ahead of the expected sequence number."""

class SequencedOperations:
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed
        if seq_num < self.expected_seq:
            return self.results[seq_num]
        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")
        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result
Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.
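A short usage sketch of the sequencer above (the operations are stand-ins): a retry of an already-applied step returns the stored result instead of applying the side effect again.

ops = SequencedOperations()

ops.execute(1, lambda: "debited $100")   # applied
ops.execute(2, lambda: "credited $100")  # applied
# Retry of step 1: returns the stored result, the debit does not run twice
assert ops.execute(1, lambda: "debited $100") == "debited $100"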
Strategy 3: Time Window Deduplication
Best for: Best-effort deduplication, high-volume low-stakes operations
import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result)

    def execute(self, request_hash, operation):
        now = time.time()
        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached
        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result
Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.
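The class assumes you can compute a stable request_hash. A minimal sketch (the canonicalization is one reasonable choice, not the only one): serialize the operation name and parameters deterministically, then hash.

import hashlib
import json

def request_hash(operation_name, params):
    # sort_keys makes the JSON deterministic, so retries hash identically
    canonical = json.dumps({"op": operation_name, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

dedup = TimeWindowDedup(window_seconds=300)
h = request_hash("send_email", {"to": "user@example.com", "template": "receipt"})
dedup.execute(h, lambda: "sent")
dedup.execute(h, lambda: "sent")  # same hash inside the window: cached result, no second send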
Error Classification
Not all errors should be retried. Getting this wrong causes cascading failures.
from http import HTTPStatus

# These errors are transient — retry them
RETRY_ERRORS = [
    ConnectionResetError,              # Network blip
    TimeoutError,                      # Slow response
    HTTPStatus.TOO_MANY_REQUESTS,      # 429 - Rate limited
    HTTPStatus.SERVICE_UNAVAILABLE,    # 503 - Server overloaded
    HTTPStatus.GATEWAY_TIMEOUT,        # 504 - Upstream timeout
    HTTPStatus.BAD_GATEWAY,            # 502 - Proxy error
]

# These errors are permanent — don't retry
NEVER_RETRY_ERRORS = [
    HTTPStatus.BAD_REQUEST,            # 400 - Invalid input
    HTTPStatus.UNAUTHORIZED,           # 401 - Auth failed
    HTTPStatus.FORBIDDEN,              # 403 - Not allowed
    HTTPStatus.NOT_FOUND,              # 404 - Doesn't exist
    HTTPStatus.UNPROCESSABLE_ENTITY,   # 422 - Business rule rejected
    HTTPStatus.CONFLICT,               # 409 - State conflict
]
def should_retry(error):
    # Exception classes in RETRY_ERRORS (network blips, timeouts)
    retryable_exceptions = tuple(e for e in RETRY_ERRORS if isinstance(e, type))
    if isinstance(error, retryable_exceptions):
        return True
    # HTTP status codes in RETRY_ERRORS (429, 502, 503, 504)
    retryable_statuses = {e.value for e in RETRY_ERRORS if isinstance(e, HTTPStatus)}
    if hasattr(error, 'status_code'):
        return error.status_code in retryable_statuses
    return False
The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).
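A quick sanity check of the classifier, using a stand-in error type (HttpError is hypothetical; substitute whatever your HTTP client raises with a status_code attribute):

class HttpError(Exception):
    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

assert should_retry(TimeoutError()) is True   # infrastructure: retry
assert should_retry(HttpError(503)) is True   # server overloaded: retry
assert should_retry(HttpError(400)) is False  # invalid input: retrying cannot help
assert should_retry(HttpError(422)) is False  # business rule rejected: escalate instead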
Exponential Backoff with Full Jitter
Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.
Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.
import random
import time
def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0,
    idempotency_key=None
):
    """
    Exponential backoff with full jitter.

    AWS research shows full jitter significantly reduces
    synchronized retry storms during outages.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # Don't retry permanent errors
            if attempt == max_retries - 1:
                raise  # Last attempt, give up
            # Exponential backoff: 0.1, 0.2, 0.4, 0.8, 1.6... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: random value between 0 and delay
            # This spreads retries across time, preventing thundering herd
            jittered_delay = random.uniform(0, delay)
            time.sleep(jittered_delay)
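Backoff and idempotency are meant to work together: the backoff decides when to retry, the key makes each retry safe. A usage sketch combining the helpers above (api.book and the identifiers are illustrative):

result = retry_with_backoff(
    operation=lambda: api.book(
        flight_id=flight_id,
        # Same key on every attempt, so a retry after a timeout
        # cannot create a second booking
        idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}",
    ),
    max_retries=5,
    base_delay=0.1,
)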
Why Full Jitter?
WITHOUT JITTER:
  Server fails at t=0
  All 1000 clients retry at t=1
  Server fails again
  All 1000 clients retry at t=2
  Server fails again
  ...

WITH FULL JITTER:
  Server fails at t=0
  Client A retries at t=0.3
  Client B retries at t=0.7
  Client C retries at t=0.1
  ...
  Load spreads across 0-1 second window
  Server can handle gradual recovery
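You can see the spread directly by sampling the delays several clients would draw on the same attempt (the exact numbers are random by design):

import random

base_delay, attempt = 0.1, 3
delay = min(base_delay * (2 ** attempt), 10.0)               # 0.8s, identical for every client
no_jitter = [delay for _ in range(5)]                        # [0.8, 0.8, 0.8, 0.8, 0.8] -> synchronized burst
full_jitter = [random.uniform(0, delay) for _ in range(5)]   # e.g. [0.12, 0.71, 0.33, ...] -> spread out
print(no_jitter, full_jitter)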
Cascading Retry Storm
The nightmare scenario:
1. Payment service has a 30-second outage
2. Order processing agents time out and start retrying
   → 1000 agents × 3 retries = 3000 payment requests
3. Payment retries trigger inventory checks
   → Each payment retry calls inventory
   → 3000 inventory requests
4. Inventory service overwhelmed by traffic
   → Starts timing out
   → Agents retry inventory calls
5. Inventory retries trigger shipping checks
   → Cascade continues
6. Within 60 seconds:
   → 10x normal load across all services
   → Multiple services failing
   → Complete system degradation
Prevention: Circuit Breakers
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = operation()
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

try:
    result = payment_breaker.call(lambda: payment_api.charge(amount))
except CircuitOpenError:
    # Don't even try — circuit is open
    return escalate_to_human("Payment service unavailable")
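Putting the pieces together, one way a single external call might be guarded end-to-end, reusing CircuitBreaker, retry_with_backoff, and IdempotentClient from above (guarded_call is a hypothetical helper, and the nesting order is a design choice, not a rule):

def guarded_call(breaker, idempotent_client, operation, idempotency_key):
    # Breaker on the outside: it sees one failure per logical call,
    # not one per retry attempt. Backoff spaces the retries out,
    # and the idempotency key makes each retry safe to repeat.
    return breaker.call(
        lambda: retry_with_backoff(
            operation=lambda: idempotent_client.execute(operation, idempotency_key),
            max_retries=3,
        )
    )

Swapping the nesting (retry outside the breaker) trips the breaker faster during an outage, at the cost of counting every individual attempt as a failure.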
Framework-Specific Implementation
LangGraph
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph
# Note: RetryPolicy's import path varies by langgraph version

# LangGraph has built-in retry support
graph = StateGraph(AgentState)

# Configure per-node retry
@graph.node(retry_policy=RetryPolicy(max_attempts=3, backoff_factor=2))
def call_external_api(state):
    # Idempotency key from state
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return api.call(idempotency_key=key)

# Checkpointing enables safe retry from last known state
app = graph.compile(checkpointer=PostgresSaver.from_conn_string(DATABASE_URL))
Temporal
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Temporal activities have an at-least-once execution guarantee.
    Your idempotent implementation provides a no-more-than-once business effect.
    Together = effective exactly-once execution.
    """
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Temporal handles retries with configurable policy
        return await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}"],
            start_to_close_timeout=timedelta(minutes=2),  # an activity timeout is required
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Timestamp in key | Retries create duplicates | Use stable identifiers only |
| Key too broad | Different operations collide | Include operation type in key |
| Key too narrow | Same operation not deduplicated | Include all relevant context |
| No TTL on cache | Memory leak | Set 24-48 hour TTL |
| Caching failures | Retrying failed ops returns failure | Only cache successful results |
| Retrying 400s | Wasted requests, never succeeds | Classify errors properly |
| No jitter | Thundering herd on recovery | Always use full jitter |
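The "key too broad" and "key too narrow" gotchas are easy to demonstrate with plain string keys (the values are illustrative): a key without the operation type lets different actions collide, while a key with a retry counter never matches its own retry.

# Key too broad: no operation type — book and cancel for the same step collide
key_for_book = "u1:t42:s3"
key_for_cancel = "u1:t42:s3"
assert key_for_book == key_for_cancel    # different operations, same key: wrong dedupe

# Key too narrow: retry count included — the retry is never deduplicated
key_attempt_1 = "u1:t42:book_flight:s3:retry1"
key_attempt_2 = "u1:t42:book_flight:s3:retry2"
assert key_attempt_1 != key_attempt_2    # same operation, different keys: duplicate side effects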
The Idempotency Checklist
Before deploying an agent with external actions:
KEY GENERATION
[ ] Keys use stable identifiers (user_id, task_id, step_id)
[ ] Keys do NOT include timestamps or retry counts
[ ] Keys include operation type to prevent collisions
[ ] Keys are deterministic (same input = same key)

ERROR HANDLING
[ ] Errors classified as RETRY vs NEVER_RETRY
[ ] 4xx errors (except 429) are not retried
[ ] 5xx and network errors are retried
[ ] Max retry limit is set

BACKOFF
[ ] Exponential backoff implemented
[ ] Full jitter added to prevent thundering herd
[ ] Max delay cap prevents infinite waits
[ ] Base delay appropriate for the API

CIRCUIT BREAKERS
[ ] Circuit breaker on each external dependency
[ ] Failure threshold tuned for the service
[ ] Recovery timeout allows service to stabilize
[ ] Open circuit has graceful fallback
Key Takeaways
- Idempotency is not optional. Every action with side effects needs a deduplication strategy.
- Keys must be stable. If the key changes on retry, it’s not idempotent.
- Classify errors. Retry infrastructure errors. Don’t retry business errors.
- Always use jitter. Without it, you’ll cause the outage you’re trying to survive.
- Circuit breakers prevent cascades. One failing service shouldn’t take down everything.
Next Steps
Now that your actions are idempotent, what happens when your agent crashes mid-task?
→ Part 2: State Persistence & Checkpointing
Or jump to another topic:
- Part 3: Human-in-the-Loop — When to escalate to humans
- Part 4: Cost Control — Token budgets and circuit breakers