
Production-agents Series

Production Agents Overview - Why 98% Haven't Deployed

Deep dive into why most agent deployments fail, the six capabilities tutorials skip, and how to build agents that survive production

Context: This is Part 0 of the Production Agents Deep Dive series. For a quick introduction, read The Agent Loop Is a Lie first.

Why This Matters

You’ve built an agent. It works in development. It demos beautifully. You deploy it.

Then it:

  • Books the same flight twice when the API times out
  • Loses all progress when a user closes their browser
  • Sends 47 follow-up emails because it didn’t know it was waiting for a response
  • Burns through your monthly API budget in 3 hours
  • Does the wrong thing without crashing — and you don’t find out until a customer complains

You’re not alone. Only 2% of organizations have successfully deployed agentic AI at scale. Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027 due to cost overruns, unclear business value, or inadequate risk controls.

The problem isn’t your agent’s reasoning. It’s everything around the reasoning that tutorials don’t teach.

What Goes Wrong Without This:

PRODUCTION FAILURE PATTERNS
Symptom: Agent works in dev, fails in production.
Cause:   Dev has no timeouts, no crashes, no concurrent users.
         Production has all of these. Your agent wasn't built for them.

Symptom: Costs spiral out of control after launch.
Cause:   Agents consume 5-20x more tokens than simple chains.
         Without budgets and circuit breakers, loops run forever.

Symptom: Agent "completes" tasks but users complain about wrong results.
Cause:   Agents fail silently. Traditional monitoring misses semantic errors.
         You're tracking latency, not correctness.

The Six Capabilities Tutorials Skip

Every tutorial teaches observe-think-act. Here’s what they leave out:

THE AGENT LOOP

  OBSERVE  (environment)
     |
     v
  THINK    (reasoning)
     |
     v
  ACT      (execute)
     |
     v
  (repeat)

PRODUCTION VIEW

  IDEMPOTENCY        STATE PERSISTENCE    HUMAN ESCALATION
  Safe retries       Checkpoints          Judgment
  No duplicates      Recovery             Approval

                     THE LOOP
                     (20% of the work)

  COST CONTROL       OBSERVABILITY        SECURITY
  Token budgets      Silent failure       Sandboxing
  Circuit breakers   detection            Prompt injection

1. Idempotency

The problem: Agent calls book_flight(). API times out. Agent retries. Customer gets charged twice.

The solution: Every action with side effects needs an idempotency key. Same key = same result, no duplicates.

# Bad: Retry creates duplicate
result = api.book(flight_id)

# Good: Retry returns same result
result = api.book(flight_id, idempotency_key=f"{user_id}:{task_id}:{step_id}")
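
The key only helps if something on the receiving side remembers it. As a minimal sketch of that idea, the wrapper below stores one result per key and returns it on a retry instead of running the action again. `IdempotentExecutor`, `make_key`, and `book_flight` are hypothetical names, and the in-memory dict is an assumption standing in for a durable store.

```python
from typing import Any, Callable, Dict, List


class IdempotentExecutor:
    """Runs a side-effecting action at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: an in-memory dict stands in for a durable store.
        self._results: Dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        # A retry with a key we have already seen returns the stored
        # result instead of executing the action a second time.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


def make_key(user_id: str, task_id: str, step_id: str) -> str:
    # Stable identifiers only: a timestamp would differ on every retry.
    return f"{user_id}:{task_id}:{step_id}"


bookings: List[str] = []


def book_flight() -> str:
    # Hypothetical side effect: each real call creates a booking.
    bookings.append("FL123")
    return f"confirmation-{len(bookings)}"


executor = IdempotentExecutor()
key = make_key("user-1", "task-9", "step-2")
first = executor.run(key, book_flight)
retry = executor.run(key, book_flight)  # simulated retry after a timeout
```

Here `retry` equals `first` and only one booking exists. The sketch does not cover a crash between running the action and storing its result. Closing that gap requires the downstream API itself to honor the key.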

Deep dive: Part 1: Idempotency & Safe Retries


2. State Persistence

The problem: Agent crashes mid-task. User closes browser. Server restarts. All progress lost.

The solution: Checkpoint state after every significant step. Resume from last checkpoint on restart.

# Bad: State in memory
self.state = {}  # Lost on crash

# Good: Checkpoint to durable storage
self.state['in_progress'] = step.id
self.checkpoint()  # Survives crashes
result = step.run()
self.state['completed'].append(step.id)
self.checkpoint()
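
One way to make the checkpoint-then-resume pattern concrete is the sketch below. It assumes a local JSON file as the durable store purely for illustration, and `CheckpointedTask` is a hypothetical name. It marks a step as in progress before running it and skips completed steps after a restart.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class CheckpointedTask:
    """Persists step progress to a JSON file (assumed store) and resumes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.state = {"in_progress": None, "completed": []}
        if os.path.exists(path):
            # Resume: load whatever the previous process last checkpointed.
            with open(path) as f:
                self.state = json.load(f)

    def checkpoint(self) -> None:
        # Write to a temp file, then atomically swap it into place so a
        # crash mid-write never leaves a half-written checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def run(self, steps: Dict[str, Callable[[], None]]) -> List[str]:
        executed = []
        for step_id, fn in steps.items():
            if step_id in self.state["completed"]:
                continue  # finished before the restart
            self.state["in_progress"] = step_id
            self.checkpoint()  # checkpoint BEFORE execution
            fn()
            executed.append(step_id)
            self.state["completed"].append(step_id)
            self.state["in_progress"] = None
            self.checkpoint()
        return executed


calls: List[str] = []
steps = {
    "search": lambda: calls.append("search"),
    "book": lambda: calls.append("book"),
}
path = os.path.join(tempfile.mkdtemp(), "task.json")
first_run = CheckpointedTask(path).run(steps)
second_run = CheckpointedTask(path).run(steps)  # simulated restart
```

A step that was still marked `in_progress` at restart may or may not have executed, so this sketch runs it again. That is only safe when the step itself is idempotent.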

Deep dive: Part 2: State Persistence & Checkpointing


3. Human-in-the-Loop

The problem: Agent makes $50K decision autonomously. It’s wrong. No one reviewed it.

The solution: Escalate to humans for high-stakes, low-confidence, or irreversible decisions. Not as a fallback — as a feature.

# Bad: Humans are error handlers
try:
    agent.run(task)  # placeholder for the agent's main call
except AgentFailed:
    notify_human("Help!")

# Good: Humans are decision makers
if decision.confidence < 0.7 or decision.amount > 500:
    decision = await human.review(decision)
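
The routing rule can be pulled into a small testable function. The sketch below reuses the 0.7 confidence and 500 amount thresholds from the snippet above and adds irreversibility as a third trigger. The `Decision` dataclass and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    # Hypothetical shape of an agent decision.
    action: str
    confidence: float
    amount: float
    reversible: bool = True


def needs_human_review(
    decision: Decision,
    min_confidence: float = 0.7,
    max_amount: float = 500.0,
) -> bool:
    """Escalate low-confidence, high-stakes, or irreversible decisions."""
    return (
        decision.confidence < min_confidence
        or decision.amount > max_amount
        or not decision.reversible
    )


routine = Decision("refund", confidence=0.95, amount=40.0)
unsure = Decision("refund", confidence=0.55, amount=40.0)
costly = Decision("purchase", confidence=0.99, amount=50_000.0)
final = Decision("delete_account", confidence=0.99, amount=0.0, reversible=False)
```

Keeping the rule in one function makes the thresholds easy to tune and lets you count how often each trigger fires.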

Deep dive: Part 3: Human-in-the-Loop Patterns


4. Cost Control

The problem: Agent enters loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.

The solution: Token budgets per task, circuit breakers for loops, max step limits.

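class BudgetExceeded(Exception):
    """Raised when a task would exceed its token budget."""


class TokenBudget:
    def __init__(self, max_tokens=50000):
        self.max = max_tokens
        self.used = 0

    def check(self, tokens):
        # Reject the call before spending if it would exceed the budget
        if self.used + tokens > self.max:
            raise BudgetExceeded()
        self.used += tokens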
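
A token budget caps spend, but a loop can still waste the whole budget repeating one action. The sketch below shows one possible guard for the other two controls named above: a maximum step count and a simple breaker that trips when the same action repeats too often. `LoopGuard` and both exception names are hypothetical, and the default limits are arbitrary.

```python
from typing import Optional


class StepLimitExceeded(Exception):
    """The task used more steps than allowed."""


class LoopDetected(Exception):
    """The agent repeated the same action too many times in a row."""


class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 3) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self._last_action: Optional[str] = None
        self._repeats = 0

    def record(self, action: str) -> None:
        """Call once per agent step, before executing the action."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(f"more than {self.max_steps} steps")
        if action == self._last_action:
            self._repeats += 1
        else:
            self._last_action = action
            self._repeats = 1
        if self._repeats > self.max_repeats:
            raise LoopDetected(f"{action!r} repeated {self._repeats} times")


guard = LoopGuard(max_steps=10, max_repeats=3)
tripped = False
try:
    for _ in range(10):
        guard.record("search_flights(query='NYC')")
except LoopDetected:
    tripped = True
```

With `max_repeats=3` the fourth identical call trips the breaker, well before the step limit is reached.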

Deep dive: Part 4: Cost Control & Token Budgets


5. Observability

The problem: Agent completes task. User says result is wrong. You check logs. Latency was fine. No errors. What happened?

The solution: Track tool selection, reasoning traces, confidence scores. Detect semantic failures, not just crashes.

# Bad: Traditional monitoring
metrics.record_latency(response_time)

# Good: Agent-specific observability
audit.log({
    'tool_selected': decision.tool,
    'alternatives_considered': decision.alternatives,
    'confidence': decision.confidence,
    'reasoning': decision.chain_of_thought
})
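
As a rough illustration of what "detect semantic failures" could mean in code, the sketch below builds the same audit record as a JSON line and flags records that look suspect even though nothing crashed. The function names, the `expected_tools` set, and the 0.7 threshold are assumptions, not a prescribed schema.

```python
import json
import time
from typing import Any, Dict, List, Set


def build_audit_record(
    task_id: str,
    tool: str,
    alternatives: List[str],
    confidence: float,
    reasoning: str,
) -> Dict[str, Any]:
    return {
        "ts": time.time(),
        "task_id": task_id,
        "tool_selected": tool,
        "alternatives_considered": alternatives,
        "confidence": confidence,
        "reasoning": reasoning,
    }


def suspect_reasons(
    record: Dict[str, Any],
    expected_tools: Set[str],
    min_confidence: float = 0.7,
) -> List[str]:
    """Return reasons a 'successful' step deserves a second look."""
    reasons = []
    if record["confidence"] < min_confidence:
        reasons.append("low_confidence")
    if record["tool_selected"] not in expected_tools:
        reasons.append("unexpected_tool")
    return reasons


record = build_audit_record(
    task_id="task-9",
    tool="send_email",
    alternatives=["book_flight"],
    confidence=0.55,
    reasoning="User asked to rebook",
)
line = json.dumps(record)  # one JSON object per log line
flags = suspect_reasons(record, expected_tools={"search_flights", "book_flight"})
```

Neither flag corresponds to an exception or a latency spike, which is why latency-only dashboards miss this class of failure.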

Deep dive: Part 5: Observability & Silent Failures


6. Security

The problem: Agent reads email. Email contains prompt injection. Agent follows injected instructions. Data exfiltrated.

The solution: Sandbox tool execution. Validate inputs and outputs. Match isolation level to risk.

SECURITY ISOLATION LEVELS
Low risk (RAG, search):       Hardened containers
Medium risk (code execution): gVisor / GKE Sandbox
High risk (financial):        Firecracker MicroVMs
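
Sandboxing limits what a compromised tool can reach. Validating tool calls limits what the agent can ask for in the first place. A minimal sketch of the validation side, assuming a hypothetical tool allowlist and a hypothetical recipient-domain allowlist:

```python
from typing import Any, Dict


class ToolCallRejected(Exception):
    pass


# Hypothetical allowlist: tool name -> required argument names and types.
ALLOWED_TOOLS: Dict[str, Dict[str, type]] = {
    "search_flights": {"origin": str, "destination": str},
    "send_email": {"to": str, "body": str},
}
# Hypothetical policy: the agent may only email these domains.
ALLOWED_EMAIL_DOMAINS = {"example.com"}


def validate_tool_call(name: str, args: Dict[str, Any]) -> None:
    """Raise ToolCallRejected unless the call matches the allowlist."""
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        raise ToolCallRejected(f"tool not allowed: {name}")
    if set(args) != set(schema):
        raise ToolCallRejected(f"unexpected arguments for {name}")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise ToolCallRejected(f"bad type for {name}.{key}")
    if name == "send_email":
        domain = args["to"].rsplit("@", 1)[-1].lower()
        if domain not in ALLOWED_EMAIL_DOMAINS:
            raise ToolCallRejected(f"recipient domain not allowed: {domain}")


def is_allowed(name: str, args: Dict[str, Any]) -> bool:
    try:
        validate_tool_call(name, args)
        return True
    except ToolCallRejected:
        return False


legit = is_allowed("send_email", {"to": "ops@example.com", "body": "done"})
exfil = is_allowed("send_email", {"to": "attacker@evil.test", "body": "secrets"})
```

An injected instruction to mail data to an outside address fails the check regardless of how persuasive the injected prompt was, because the check never consults the model.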

Deep dive: Part 6: Security & Sandboxing


When to Use Agents vs Pipelines

Not every problem needs an agent. Here’s how to decide:

DECISION MATRIX: PIPELINE vs AGENT

  USE A PIPELINE WHEN:                  USE AN AGENT WHEN:

  • Steps are fixed and known           • Steps depend on results
  • Input → Output is predictable       • Need to adapt to surprises
  • Failures are simple (retry/fail)    • Failures need judgment
  • No external state changes           • Actions have side effects
  • Speed > flexibility                 • Flexibility > speed

  Examples:                             Examples:
  • RAG (retrieve → generate)           • Customer support (varies)
  • Summarization                       • Code generation (iterative)
  • Classification                      • Research tasks (exploratory)
  • Extraction                          • Multi-step bookings

The test: If you can draw the flowchart before running, it’s a pipeline. If the flowchart depends on what happens, it’s an agent.
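
The same distinction shows up in the control flow. In the sketch below the pipeline's steps are a fixed list known before it runs, while the agent picks each next step from the latest observation. `run_pipeline`, `run_agent`, and the string-matching `policy` are hypothetical stand-ins. A real agent would call a model where `policy` is called here.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_pipeline(text: str, steps: List[Callable[[str], str]]) -> str:
    # The flowchart is the list itself: same steps, same order, every time.
    for step in steps:
        text = step(text)
    return text


def run_agent(
    observation: str,
    choose_next: Callable[[str], Optional[str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> Tuple[str, List[str]]:
    # The path is only known afterwards: each step depends on the last result.
    history: List[str] = []
    for _ in range(max_steps):
        tool_name = choose_next(observation)
        if tool_name is None:
            break
        observation = tools[tool_name](observation)
        history.append(tool_name)
    return observation, history


def policy(obs: str) -> Optional[str]:
    # Stand-in for the model's decision.
    if "unverified" in obs:
        return "verify"
    if "unpaid" in obs:
        return "pay"
    return None


tools = {
    "verify": lambda o: o.replace("unverified", "verified"),
    "pay": lambda o: o.replace("unpaid", "paid"),
}
cleaned = run_pipeline("  Hello ", [str.strip, str.lower])
final, path = run_agent("booking: unverified, unpaid", policy, tools)
```

The `max_steps` bound is part of the sketch on purpose: an agent loop without one has no guaranteed end.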


The Production Checklist

Before deploying an agent:

PRODUCTION DEPLOYMENT CHECKLIST
IDEMPOTENCY
[ ] Every external action has an idempotency key
[ ] Keys use stable identifiers (not timestamps)
[ ] Retries classified (RETRY vs NEVER_RETRY errors)
[ ] Backoff includes jitter

STATE PERSISTENCE
[ ] Checkpointing to durable storage (PostgreSQL in production)
[ ] Checkpoint BEFORE execution, not after
[ ] Resume logic handles in_progress state
[ ] Progress files for multi-session tasks

HUMAN ESCALATION
[ ] Confidence-based routing defined
[ ] High-stakes actions require approval
[ ] Escalation metrics tracked
[ ] Sampling strategy for scale

COST CONTROL
[ ] Token budget per task
[ ] Max step limit per loop
[ ] Circuit breakers on failure spikes
[ ] Cost alerts configured

OBSERVABILITY
[ ] Structured audit logging
[ ] Tool selection tracked
[ ] Semantic failure detection
[ ] Replay capability for debugging

SECURITY
[ ] Sandboxing appropriate to risk level
[ ] Input validation on tool calls
[ ] Output sanitization
[ ] Prompt injection defenses
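
Two of the idempotency items in the checklist, retry classification and backoff with jitter, can be sketched together. The exception names below are hypothetical stand-ins for the RETRY and NEVER_RETRY classes, and the delay uses the "full jitter" variant of exponential backoff, which is one common choice rather than the only one.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Transient failure (timeout, rate limit): retrying may succeed."""


class NeverRetryError(Exception):
    """Permanent failure (validation, auth): retrying cannot help."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    attempt = 0
    while True:
        try:
            # NeverRetryError is deliberately not caught: it propagates
            # to the caller on the first occurrence.
            return fn()
        except RetryableError:
            attempt += 1
            if attempt >= max_attempts:
                raise
            # Full jitter: a random delay up to the exponential bound, so
            # many clients retrying at once do not synchronize.
            sleep(rng.uniform(0, min(cap, base * 2 ** (attempt - 1))))


attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("timeout")
    return "ok"


delays = []
result = call_with_retries(flaky, sleep=delays.append)  # record, don't sleep
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting. Combine this with an idempotency key so that each retry is safe to repeat.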

Series Roadmap

This series covers each capability in depth:

Part  Topic                    What You’ll Learn
0     Overview (you are here)  Why 98% haven’t deployed, the six capabilities
1     Idempotency              Safe retries, Stripe pattern, cascading failure prevention
2     State Persistence        Checkpointing, LangGraph patterns, hybrid memory
3     Human-in-the-Loop        Confidence routing, scaling without rubber-stamping
4     Cost Control             Token budgets, circuit breakers, model routing
5     Observability            Silent failure detection, semantic monitoring
6     Durable Execution        Temporal, Inngest, Restate, AWS/Azure/GCP offerings
7     Security                 Sandboxing levels, prompt injection defense
8     Testing                  Simulation-based testing, evaluation metrics

Key Takeaways

  1. The loop is 20% of the work. The other 80% is handling failures, persisting state, controlling costs, and keeping humans in the loop.

  2. Tutorials optimize for understanding. Production optimizes for reliability. They’re different goals.

  3. Agents fail differently than APIs. They don’t crash — they do the wrong thing quietly. Your monitoring needs to catch semantic failures, not just exceptions.

  4. Start with one capability. Add idempotency first (highest leverage). Then state persistence. Then the rest.

  5. Not every problem needs an agent. If steps are fixed, use a pipeline. Agents add flexibility at the cost of complexity.


Next Steps

Ready to go deeper? Start with Part 1: Idempotency & Safe Retries — it’s the single highest-leverage fix for production agents.

Or jump to the capability you need most: