
Production-agents Series

Idempotency & Safe Retries - The Stripe Pattern for Agents

Deep dive into idempotency: the single highest-leverage production requirement. Learn the Stripe pattern, error classification, jitter, and how to prevent cascading retry storms

Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. Agent retries. API processes both requests. Customer is charged twice.

This isn’t a bug. This is correct retry logic meeting real-world latency.

Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.

The Numbers:

  • 68% of teams hit budget overruns in their first agent deployments
  • 50% cite “runaway tool loops and recursive logic” as the cause
  • API downtime surged 60% between Q1 2024 and Q1 2025
  • More downtime = more retries = more duplicate operations

What Goes Wrong Without This:

IDEMPOTENCY FAILURE PATTERNS
Symptom: Customer charged twice for the same order.
Cause:   Payment API timed out. Agent retried. Both charges processed.
         No idempotency key to deduplicate.

Symptom: User receives 47 copies of the same email.
Cause:   Email send succeeded but response was slow. Agent assumed failure.
         Retried. No deduplication on sends.

Symptom: Database has duplicate records with slight variations.
Cause:   INSERT succeeded, network dropped response. Retry created second record.
         No upsert or idempotency check.

What Idempotency Means

Idempotent: An operation that produces the same result, and the same side effects, no matter how many times it is called with the same input.

IDEMPOTENT vs NON-IDEMPOTENT OPERATIONS
Idempotent:
GET /user/123          Same user every time (safe to retry)
DELETE /file/abc       File deleted, stays deleted (safe to retry)
PUT /user/123 {name}   User updated to same value (safe to retry)

Not Idempotent:
POST /charge/$100      New charge every time (dangerous to retry)
POST /email/send       New email every time (dangerous to retry)
INSERT INTO orders     New row every time (dangerous to retry)

Made Idempotent:
POST /charge/$100 + idempotency_key=xyz123    Same charge on retry
POST /email/send + message_id=abc456          Same email, no duplicate
INSERT ... ON CONFLICT DO NOTHING             Same row, no duplicate
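
For the database case, the "Made Idempotent" row maps directly onto a unique constraint plus an upsert clause. A minimal sketch using Python's built-in sqlite3 (the orders table and columns are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

def record_order(order_id, amount):
    # The PRIMARY KEY plus ON CONFLICT makes the insert idempotent:
    # a retry with the same order_id is silently ignored
    conn.execute(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO NOTHING",
        (order_id, amount),
    )
    conn.commit()

record_order("order-123", 100.0)
record_order("order-123", 100.0)  # Retry: no duplicate row
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1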

The Stripe Pattern

Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard:

from datetime import timedelta

# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"        - different each retry
    # Bad:  f"{user_id}:{retry_count}"      - different each retry
    # Good: f"{user_id}:{task_id}:{step_id}" - same across retries

    idempotency_key = f"{user_id}:{task_id}:{step_id}"

    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key
    )

# Server checks and stores
def handle_booking(request):
    key = request.idempotency_key

    # Check if we've processed this before
    cached = cache.get(key)
    if cached:
        return cached  # Return stored result, don't reprocess

    # First time: process and store result
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result
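
To see the server-side behavior end to end, here is a minimal, self-contained sketch: an in-memory dict stands in for the cache, a counter stands in for process_booking, and the key is passed explicitly instead of being read off a request object (all names are illustrative):

from datetime import timedelta

class InMemoryCache:
    """Toy stand-in for Redis/Memcached; TTL is accepted but not enforced."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value, ttl=None):
        self._data[key] = value

cache = InMemoryCache()
bookings_processed = []

def process_booking(request):
    bookings_processed.append(request)   # The side effect we must not repeat
    return {"status": "booked", "confirmation": len(bookings_processed)}

def handle_booking(request, idempotency_key):
    cached = cache.get(idempotency_key)
    if cached:
        return cached                    # Retry: replay the stored result
    result = process_booking(request)
    cache.set(idempotency_key, result, ttl=timedelta(hours=24))
    return result

first = handle_booking({"flight": "BA123"}, "user1:task7:step2")
retry = handle_booking({"flight": "BA123"}, "user1:task7:step2")
assert first == retry
assert len(bookings_processed) == 1      # Processed exactly once despite two calls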

Key Generation Rules

Include               Exclude                        Why
user_id               timestamp                      Timestamps change on retry
task_id               retry_count                    Retry count changes on retry
step_id               random()                       Random changes on retry
operation_type        request_id (if regenerated)    Must be stable
external_reference

The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.
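
One way to apply the test mechanically is to centralize key construction in a helper that only accepts stable identifiers. A sketch (the function name and the hashing step are illustrative; the raw colon-joined string from the Stripe example works just as well):

import hashlib

def make_idempotency_key(user_id, task_id, step_id, operation_type, external_reference=""):
    # Only stable identifiers go in: no timestamps, retry counts, or random values
    raw = f"{user_id}:{task_id}:{step_id}:{operation_type}:{external_reference}"
    # Hashing keeps the key a fixed length without changing its stability
    return hashlib.sha256(raw.encode()).hexdigest()

# The test in code: retrying the same logical operation yields the same key
assert make_idempotency_key("u1", "t42", "s3", "book_flight") == \
       make_idempotency_key("u1", "t42", "s3", "book_flight")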


Three Idempotency Strategies

Strategy 1: Idempotency Keys (Stripe Pattern)

Best for: External APIs, payments, bookings

class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache
        cached = self.cache.get(idempotency_key)
        if cached:
            return cached

        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}"
)

Strategy 2: Sequence Numbers

Best for: Internal state changes, ordered operations

class OutOfOrderError(Exception):
    """Raised when a sequence number arrives before its predecessors have been processed."""

class SequencedOperations:
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed
        if seq_num < self.expected_seq:
            return self.results[seq_num]

        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")

        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result

Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.
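
A brief usage sketch under that constraint, with a single ordered caller (values are illustrative):

ops = SequencedOperations()

ops.execute(1, lambda: "create order")   # Runs: first expected sequence number
ops.execute(2, lambda: "charge card")    # Runs: next in order
ops.execute(1, lambda: "create order")   # Duplicate delivery: returns the stored result, nothing re-runs

try:
    ops.execute(4, lambda: "ship")       # Sequence 3 was skipped, so this is rejected
except OutOfOrderError:
    pass                                 # Caller must replay 3 before 4 is accepted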

Strategy 3: Time Window Deduplication

Best for: Best-effort deduplication, high-volume low-stakes operations

import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result)

    def execute(self, request_hash, operation):
        now = time.time()

        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached

        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result

Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.
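
The request_hash should be derived from the request's stable content. A sketch using a canonical (sorted-keys) JSON dump; the tool name and arguments are illustrative:

import hashlib
import json

def hash_request(tool_name, arguments):
    # Canonical JSON so the same logical request always produces the same hash
    canonical = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

calls = []
def send():
    calls.append(1)                      # Stand-in side effect for the real email call
    return {"status": "sent"}

dedup = TimeWindowDedup(window_seconds=300)
h = hash_request("send_email", {"to": "a@example.com", "subject": "Receipt"})

dedup.execute(h, send)
dedup.execute(h, send)                   # Within the window: cached result, no second send
assert len(calls) == 1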


Error Classification

Not all errors should be retried. Getting this wrong causes cascading failures.

from http import HTTPStatus

# Transient infrastructure errors: retry them
RETRY_EXCEPTIONS = (
    ConnectionResetError,           # Network blip
    TimeoutError,                   # Slow response
)

RETRY_STATUS_CODES = {
    HTTPStatus.TOO_MANY_REQUESTS,   # 429 - Rate limited
    HTTPStatus.SERVICE_UNAVAILABLE, # 503 - Server overloaded
    HTTPStatus.GATEWAY_TIMEOUT,     # 504 - Upstream timeout
    HTTPStatus.BAD_GATEWAY,         # 502 - Proxy error
}

# Permanent business errors: don't retry
NEVER_RETRY_STATUS_CODES = {
    HTTPStatus.BAD_REQUEST,           # 400 - Invalid input
    HTTPStatus.UNAUTHORIZED,          # 401 - Auth failed
    HTTPStatus.FORBIDDEN,             # 403 - Not allowed
    HTTPStatus.NOT_FOUND,             # 404 - Doesn't exist
    HTTPStatus.UNPROCESSABLE_ENTITY,  # 422 - Business rule rejected
    HTTPStatus.CONFLICT,              # 409 - State conflict
}

def should_retry(error):
    # Infrastructure exceptions (network, timeout) are always worth retrying
    if isinstance(error, RETRY_EXCEPTIONS):
        return True
    # HTTP errors: retry only the transient status codes
    status = getattr(error, 'status_code', None)
    return status in {code.value for code in RETRY_STATUS_CODES}

The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).


Exponential Backoff with Full Jitter

Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.

Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.

import random
import time

def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0
):
    """
    Exponential backoff with full jitter.

    AWS research shows full jitter significantly reduces
    synchronized retry storms during outages.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # Don't retry permanent errors

            if attempt == max_retries - 1:
                raise  # Last attempt, give up

            # Exponential backoff: 0.1, 0.2, 0.4, 0.8, 1.6... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)

            # Full jitter: random value between 0 and delay
            # This spreads retries across time, preventing thundering herd
            jittered_delay = random.uniform(0, delay)

            time.sleep(jittered_delay)

Why Full Jitter?

JITTER PREVENTS THUNDERING HERD
WITHOUT JITTER:
Server fails at t=0
All 1000 clients retry at t=1
Server fails again
All 1000 clients retry at t=2
Server fails again
...

WITH FULL JITTER:
Server fails at t=0
Client A retries at t=0.3
Client B retries at t=0.7
Client C retries at t=0.1
...
Load spreads across 0-1 second window
Server can handle gradual recovery

Cascading Retry Storm

The nightmare scenario:

CASCADING RETRY STORM
1. Payment service has 30-second outage

2. Order processing agents timeout, start retrying
  1000 agents × 3 retries = 3000 payment requests

3. Payment retries trigger inventory checks
  Each payment retry calls inventory
  3000 inventory requests

4. Inventory service overwhelmed by traffic
  Starts timing out
  Agents retry inventory calls

5. Inventory retries trigger shipping checks
  Cascade continues

6. Within 60 seconds:
  10x normal load across all services
  Multiple services failing
  Complete system degradation

Prevention: Circuit Breakers

import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected without being attempted."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = operation()
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

try:
    result = payment_breaker.call(lambda: payment_api.charge(amount))
except CircuitOpenError:
    # Don't even try — circuit is open
    return escalate_to_human("Payment service unavailable")
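
These pieces compose: the circuit breaker decides whether to attempt the call at all, the backoff decides when to retry, and the idempotency key makes every attempt safe to repeat. A sketch of one way to wire them together, reusing payment_breaker, retry_with_backoff, should_retry, and the stand-in payment_api / escalate_to_human names from the snippets above:

def charge_with_protection(user_id, task_id, step_id, amount):
    # Stable key: every retry of this step sends the same key, so the
    # provider deduplicates even if our timeout fires mid-charge
    key = f"{user_id}:{task_id}:{step_id}:charge"

    def attempt():
        # The breaker wraps each individual attempt
        return payment_breaker.call(
            lambda: payment_api.charge(amount, idempotency_key=key)
        )

    try:
        # Backoff + jitter spaces out the attempts; should_retry() filters
        # out permanent errors so failed validation isn't retried at all
        return retry_with_backoff(attempt, max_retries=5)
    except CircuitOpenError:
        # Circuit is open: stop retrying and hand off instead
        return escalate_to_human("Payment service unavailable")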

Framework-Specific Implementation

LangGraph

from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.types import RetryPolicy

# LangGraph has built-in per-node retry support
graph = StateGraph(AgentState)

def call_external_api(state):
    # Idempotency key from stable state fields
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return {"result": api.call(idempotency_key=key)}

# Configure per-node retry when registering the node
graph.add_node("call_external_api", call_external_api,
               retry=RetryPolicy(max_attempts=3, backoff_factor=2))

# Checkpointing enables safe retry from last known state
app = graph.compile(checkpointer=PostgresSaver.from_conn_string(DATABASE_URL))

Temporal

from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Temporal activities have at-least-once execution guarantee.
    Your idempotent implementation provides no-more-than-once business effect.
    Together = effective exactly-once execution.
    """
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Temporal handles retries with configurable policy
        return await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}"],
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )

Common Gotchas

Gotcha             Symptom                               Fix
Timestamp in key   Retries create duplicates             Use stable identifiers only
Key too broad      Different operations collide          Include operation type in key
Key too narrow     Same operation not deduplicated       Include all relevant context
No TTL on cache    Memory leak                           Set 24-48 hour TTL
Caching failures   Retrying failed ops returns failure   Only cache successful results
Retrying 400s      Wasted requests, never succeeds       Classify errors properly
No jitter          Thundering herd on recovery           Always use full jitter
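
The "caching failures" row deserves emphasis: if the first attempt fails and you store that failure under the idempotency key, every retry replays the failure forever. A sketch of the idea, assuming (as an illustration) the API returns a dict with a status field:

def execute_idempotent(cache, operation, idempotency_key, ttl=86400):
    cached = cache.get(idempotency_key)
    if cached is not None:
        return cached

    # If operation() raises, nothing is cached and the next retry gets a clean attempt
    result = operation()

    # Only store results that represent a completed side effect.
    # Caching an error response would turn one failure into a permanent one.
    if result.get("status") == "success":
        cache.set(idempotency_key, result, ttl=ttl)
    return result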

The Idempotency Checklist

Before deploying an agent with external actions:

IDEMPOTENCY DEPLOYMENT CHECKLIST
KEY GENERATION
[ ] Keys use stable identifiers (user_id, task_id, step_id)
[ ] Keys do NOT include timestamps or retry counts
[ ] Keys include operation type to prevent collisions
[ ] Keys are deterministic (same input = same key)

ERROR HANDLING
[ ] Errors classified as RETRY vs NEVER_RETRY
[ ] 4xx errors (except 429) are not retried
[ ] 5xx and network errors are retried
[ ] Max retry limit is set

BACKOFF
[ ] Exponential backoff implemented
[ ] Full jitter added to prevent thundering herd
[ ] Max delay cap prevents infinite waits
[ ] Base delay appropriate for the API

CIRCUIT BREAKERS
[ ] Circuit breaker on each external dependency
[ ] Failure threshold tuned for the service
[ ] Recovery timeout allows service to stabilize
[ ] Open circuit has graceful fallback

Key Takeaways

  1. Idempotency is not optional. Every action with side effects needs a deduplication strategy.

  2. Keys must be stable. If the key changes on retry, it’s not idempotent.

  3. Classify errors. Retry infrastructure errors. Don’t retry business errors.

  4. Always use jitter. Without it, you’ll cause the outage you’re trying to survive.

  5. Circuit breakers prevent cascades. One failing service shouldn’t take down everything.


Next Steps

Now that your actions are idempotent, what happens when your agent crashes mid-task?

Part 2: State Persistence & Checkpointing

Or jump to another topic: