
The Agent Loop Is a Lie

You’ve seen this diagram:

THE AGENT LOOP

   OBSERVE    environment
      ↓
    THINK     reasoning
      ↓
     ACT      execute
      ↓
   (repeat)

Elegant. Clean. Fits on a slide.

It’s also a lie.

Not because it’s wrong — it’s a fine abstraction for the happy path. The lie is that tutorials present it as complete. It’s not. It’s maybe 20% of what a production agent does. The other 80% is everything that happens when things go wrong.

What Actually Happens

The loop says: OBSERVE → THINK → ACT.

Production says:

PARTIAL_OBSERVE → MAYBE_THINK → FAILED_ACT
        ↓
RETRY → TIMEOUT → FALLBACK
        ↓
THINK_AGAIN → DIFFERENT_ACT
        ↓
OBSERVE_SIDE_EFFECTS → CLEANUP

Here’s a real trace. User says “Book me a flight to NYC tomorrow”:

[00:00] Parse intent → destination=NYC, date=tomorrow
      MISSING: departure city, time, airline

[00:01] Should I ask or infer? Calendar shows SF meetings.
DECISION: assume SF departure (risky)

[00:02] Call flight_search(SFO → JFK)
[00:07] TIMEOUT. Retry.
[00:09] 47 flights found.

[00:09] Wait — JFK or LGA or EWR? User said "NYC"
Should have asked. Now have 47 flights to 3 airports.

[00:10] Check history: user took JFK twice before.
Filter to JFK. 18 flights.
Show top 3 by price.

[00:13] User: "I wanted LaGuardia"
INVALIDATE previous reasoning. Re-filter.

[00:15] User: "Book the 8am one"
[00:20] ERROR 409: Seat no longer available.
Flight sold out while user was deciding.

[00:21] Check alternatives. 9am exists, $12 more.
"That flight just sold out. 9am is available. Book it?"

Seven things happened that the loop doesn’t model:

  1. Partial observation — missing departure city
  2. Inference under uncertainty — assumed SF, was wrong about airport
  3. Timeout and retry — API didn’t respond
  4. Mid-task correction — user wanted LGA, not JFK
  5. External state change — flight sold out during conversation
  6. Multi-step sequences — not single atomic actions
  7. Recovery and alternatives — fallback to 9am flight

The loop was the starting point. What actually ran was a mess of retries, corrections, and fallbacks.

The Four Gaps

Gap 1: Observation Is Partial

The loop implies you observe, then have full state. Reality: you never have full state.

# Tutorial version
state = agent.observe()  # Returns complete state

# Production
state = agent.observe()  # Always partial

# The real question: do I ask or infer?
if confidence(state) < THRESHOLD:
    if cost_of_asking < cost_of_being_wrong:
        state = agent.ask_for_clarification()
    else:
        state = agent.infer_from_context()
        state.mark_inferred()  # Track what we guessed

The decision to ask vs. infer is a judgment call with real tradeoffs. Asking is slow and annoying. Inferring is fast and risky. The loop pretends this decision doesn’t exist.

Gap 2: Thinking Produces Plans, Not Actions

The loop implies: think once, get one action. Reality: thinking produces a strategy with fallbacks.

# Tutorial
action = agent.think(state)
agent.execute(action)

# Production: think returns a plan
plan = agent.think(state)

# Plan structure:
# - primary: first thing to try
# - fallback: if primary fails
# - timeout: how long to wait
# - retry_policy: how to retry
# - abort_conditions: when to give up
# - human_trigger: when to escalate

result = execute_with_recovery(plan)

The “think” step doesn’t produce an action. It produces a strategy for what to do when things go wrong.
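Here’s a minimal sketch of what execute_with_recovery could look like, assuming the plan fields above plus two hypothetical helpers, run_with_timeout and escalate_to_human:

def execute_with_recovery(plan):
    # A sketch only: try the primary, then the fallback, retrying each
    # under the plan's retry policy.
    result = None
    for action in (plan.primary, plan.fallback):
        if action is None:
            continue
        for _ in range(plan.retry_policy.max_attempts):
            result = run_with_timeout(action, plan.timeout)
            if result.ok:
                return result
            if plan.abort_conditions.met(result):
                return escalate_to_human(plan, result)  # give up early
    # Primary and fallback both exhausted: this is plan.human_trigger
    return escalate_to_human(plan, result)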

Gap 3: Actions Have Messy Outcomes

The loop implies: action succeeds or fails, back to observe. Reality: actions have partial success, side effects, deferred results, and pending operations.

# Tutorial
if result.success:
    continue
else:
    handle_error()

# Production: four kinds of messy outcomes

# 1. Partial success
if result.partial:
    # Flight booked, but seat not assigned yet
    # Email sent, but delivery not confirmed
    agent.track_pending(result.pending_items)

# 2. Side effects
if result.side_effects:
    # Booking consumed travel credit
    # API call triggered rate limiting
    agent.update_world_model(result.side_effects)

# 3. Deferred results
if result.deferred:
    # Confirmation comes in 24-48 hours
    # Human approval required
    agent.schedule_followup(result.callback)

# 4. Pending operations (critical!)
# Track what you're WAITING FOR, not just what you've DONE
if result.pending:
    # Email sent, waiting for CRM to sync
    # Payment initiated, waiting for confirmation
    agent.track_awaiting(result.pending_items, timeout=300)

Actions aren’t atomic. They’re transactions that might half-succeed, trigger side effects, or need followup later.

The 47 emails bug: An email agent observed “no response,” sent a follow-up, checkpointed, then observed “no response” again (email takes time to sync to CRM). It sent 47 follow-ups before someone noticed. The fix: track pending state — what you’re waiting for, not just what you’ve done.
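A minimal sketch of that fix, assuming a durable pending-operations store hanging off the agent:

def maybe_send_followup(agent, thread_id):
    key = f"followup:{thread_id}"
    # "No response" is only actionable if nothing is already in flight;
    # otherwise the CRM just hasn't synced yet.
    if agent.pending.has(key):
        return
    agent.send_followup(thread_id)
    agent.pending.add(key, timeout=300)  # expect the CRM sync within 5 min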

Gap 4: The Loop Is 20% of the Agent

The loop is the core reasoning cycle. But production agents need infrastructure around it:

WHAT THE LOOP COVERS          WHAT PRODUCTION NEEDS
─────────────────────         ─────────────────────
 Observe                      Observe
 Think                        Think
 Act                          Act
                              + Retry with backoff
                              + Timeout handling
                              + Partial failure recovery
                              + State persistence
                              + Checkpoint/resume
                              + Idempotency
                              + Human escalation
                              + Audit logging
                              + Replay for debugging

The loop is necessary but not sufficient.

The Real Hard Problems

The reviewers are right: I’ve been diagnosing without prescribing. Here are the actual problems you’ll hit and how to solve them.

Problem 1: Idempotency

This is the killer. If book_flight() times out and you retry, do you get two bookings?

# Wrong: retry blindly
def book_flight(flight_id):
    return api.book(flight_id)  # Might create duplicate

# Right: idempotency key
def book_flight(flight_id, idempotency_key):
    # Same key = same result, no duplicate
    return api.book(flight_id, idempotency_key=idempotency_key)

# Generate key from task context (stable identifiers only!)
# Bad:  f"{user_id}:{timestamp}" - different on retry
# Bad:  f"{user_id}:{retry_count}" - different on retry
# Good: f"{user_id}:{task_id}:{step.id}" - same across retries
key = f"{user_id}:{task_id}:{step.id}"

Every external action needs an idempotency strategy:

  • API calls: Use idempotency keys if supported
  • Emails: Check if already sent before sending
  • Database writes: Use upserts or check-then-write
  • Payments: Always use idempotency keys
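For channels without native idempotency keys, check-then-write against a stable id gets you most of the way. A sketch, where the store and crm client are hypothetical:

def send_once(store, crm, task_id, step_id, draft):
    message_id = f"{task_id}:{step_id}"      # stable across retries
    if store.exists(message_id):
        return store.get(message_id)         # already sent: reuse the result
    store.put(message_id, status='sending')  # record intent before acting
    result = crm.send_email(draft)
    store.put(message_id, status='sent', result=result)
    return result
    # Note: store.put must enforce uniqueness on message_id, or two
    # concurrent retries can both pass the exists() check.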

Problem 2: State Persistence

Agent crashes mid-task. User closes browser. Server restarts. What happens to the booking?

# Wrong: state lives in memory
class Agent:
    def __init__(self):
        self.state = {}  # Lost on crash

# Right: checkpoint to durable storage
class Agent:
    def __init__(self, task_id, storage):
        self.task_id = task_id
        self.storage = storage
        self.state = storage.load(task_id) or {'completed': []}

    def checkpoint(self):
        self.storage.save(self.task_id, self.state)

    def execute_step(self, step):
        # Checkpoint BEFORE execution (mark intent)
        self.state['in_progress'] = step.id
        self.checkpoint()

        # Execute
        result = step.run()

        # Checkpoint AFTER execution (mark completion)
        self.state['completed'].append(step.id)
        del self.state['in_progress']
        self.state['last_result'] = result
        self.checkpoint()
        return result

    def resume(self):
        """On restart, check if we crashed mid-step"""
        if 'in_progress' in self.state:
            step_id = self.state['in_progress']
            # Check if step actually completed (idempotent read)
            if self.check_step_completed(step_id):
                self.state['completed'].append(step_id)
            # Otherwise, re-execute with idempotency key

Why checkpoint before execution? If you crash between step.run() and checkpoint(), you don’t know if the step ran. Checkpointing intent first lets you detect and handle this on resume.

Caveat: For true reliability, ensure your storage supports atomic writes. If checkpoint() can partially fail, you need write-ahead logging or versioned checkpoints.
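If you’re on plain files, write-to-temp plus rename gets you atomic checkpoints, since os.replace is atomic on the same filesystem. A minimal sketch:

import json
import os

class FileStorage:
    """A sketch of atomic, file-based checkpoint storage."""

    def __init__(self, directory):
        self.directory = directory

    def save(self, task_id, state):
        path = os.path.join(self.directory, f"{task_id}.json")
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the swap
        os.replace(tmp, path)     # atomic: readers see old or new, never half

    def load(self, task_id):
        try:
            with open(os.path.join(self.directory, f"{task_id}.json")) as f:
                return json.load(f)
        except FileNotFoundError:
            return None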

Problem 3: Replay for Debugging

Agent failed last night. How do you figure out why?

# Wrong: print statements, hope for the best
def think(state):
    print(f"Thinking about {state}")  # Lost forever
    return decide(state)

# Right: structured audit log
def think(self, state):
    decision = decide(state)
    self.audit.log({
        'timestamp': now(),
        'task_id': self.task_id,
        'phase': 'think',
        'input_state': state,
        'decision': decision,
        'reasoning': decision.explanation,
        'confidence': decision.confidence
    })

    return decision

# Now you can replay:
# 1. Load audit log for failed task
# 2. See exact state at each step
# 3. Reproduce the failure deterministically

Good agent frameworks give you replay for free. If yours doesn’t, build it.
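A minimal replay harness over that log might look like this, assuming a hypothetical audit_store.load that returns the logged entries in order:

def replay(task_id, audit_store, decide):
    for entry in audit_store.load(task_id):
        if entry['phase'] != 'think':
            continue
        fresh = decide(entry['input_state'])  # re-run on the recorded state
        if fresh != entry['decision']:
            # Divergence means nondeterminism or a code change since the
            # failure. Either way, this is where to start looking.
            print(f"Divergence at {entry['timestamp']}: "
                  f"logged {entry['decision']}, got {fresh}")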

Problem 4: Human Escalation (Not Failure)

Human-in-the-loop isn’t a fallback for when the agent fails. It’s a feature for when judgment is needed.

# Wrong: humans are error handlers
except AgentStuck:
    notify_human("Agent failed, please help")

# Right: humans are decision makers
class EscalationPolicy:
    def should_escalate(self, decision):
        # High stakes? Get human approval.
        if decision.involves_payment and decision.amount > 500:
            return True

        # Low confidence? Ask human.
        if decision.confidence < 0.7:
            return True

        # Irreversible? Double-check.
        if not decision.reversible:
            return True

        return False

# In the agent loop:
decision = agent.think(state)
if escalation_policy.should_escalate(decision):
    decision = await human.review(decision)  # Human approves/modifies
result = agent.execute(decision)

Design for human collaboration from day one. The agent prepares options; the human makes judgment calls.

Scaling caveat: Human escalation doesn’t scale linearly. As agent volume grows, human bandwidth doesn’t. At one fintech, month 1 looked great — humans approved/rejected thoughtfully. By month 6, the approval queue was 200 items deep and humans were rubber-stamping everything. Month 9: a $50K fraud slipped through.

The fix: don’t escalate 100% of edge cases. Use sampling — escalate a random 10-20% of borderline decisions. Humans become the audit mechanism, not the last line of defense. Track override rates and adjust thresholds based on what humans actually catch.
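One way to implement that sampling, as a sketch assuming a two-threshold confidence band (the numbers are illustrative):

import random

class SampledEscalationPolicy:
    def __init__(self, hard_floor=0.5, soft_floor=0.7, sample_rate=0.15):
        self.hard_floor = hard_floor    # below this, always escalate
        self.soft_floor = soft_floor    # below this, borderline band
        self.sample_rate = sample_rate  # fraction of borderline cases reviewed

    def should_escalate(self, decision):
        if decision.confidence < self.hard_floor:
            return True   # genuinely uncertain: human decides
        if decision.confidence < self.soft_floor:
            # Borderline: sample, so humans audit instead of gate.
            return random.random() < self.sample_rate
        return False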

The Minimal Production Agent

Here’s what you actually need. Not a diagram — working structure.

class ProductionAgent:
    def __init__(self, task_id, storage, audit, escalation):
        self.task_id = task_id
        self.storage = storage
        self.audit = audit
        self.escalation = escalation
        self.state = storage.load(task_id) or initial_state()

    async def run(self):
        while not self.state.complete:
            try:
                # 1. Observe (with partial state handling)
                observation = await self.observe()
                if observation.needs_clarification:
                    await self.ask_user(observation.questions)
                    continue

                # 2. Think (produces plan, not action)
                plan = self.think(observation)
                self.audit.log('plan', plan)

                # 3. Check if human needed
                if self.escalation.should_escalate(plan):
                    plan = await self.escalation.get_approval(plan)

                # 4. Execute with recovery
                for step in plan.steps:
                    result = await self.execute_with_retry(
                        step,
                        idempotency_key=self.make_key(step),
                        timeout=plan.timeout,
                        retries=plan.retries
                    )

                    self.state.record(step, result)
                    self.storage.checkpoint(self.state)  # Persist

                    if result.failed and not plan.fallback:
                        return await self.escalation.handoff(
                            "Step failed, no fallback",
                            self.state
                        )

            except Exception as e:
                self.audit.log('error', e, self.state)
                return await self.escalation.handoff(str(e), self.state)

        return self.state.result

~50 lines. Handles:

  • Partial observation → ask for clarification
  • Plans with fallbacks → not single actions
  • Human escalation → built-in, not bolted-on
  • Idempotency → key per step
  • Persistence → checkpoint after each step
  • Audit → log for replay
  • Recovery → retry with backoff, fallback, or escalate
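For completeness, here’s a sketch of the execute_with_retry helper the agent assumes: a timeout per attempt, exponential backoff with jitter, and the same idempotency key on every attempt. TransientError is a placeholder for whatever your client marks as retryable.

import asyncio
import random

class TransientError(Exception):
    """Placeholder for your client's retryable errors."""

async def execute_with_retry(step, idempotency_key, timeout, retries):
    delay = 1.0
    for attempt in range(retries + 1):
        try:
            # Same key on every attempt: a retry after an ambiguous
            # timeout cannot create a duplicate action.
            return await asyncio.wait_for(
                step.run(idempotency_key=idempotency_key),
                timeout=timeout,
            )
        except (asyncio.TimeoutError, TransientError):
            if attempt == retries:
                raise
            await asyncio.sleep(delay + random.random())  # jittered backoff
            delay *= 2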

What Changes Tomorrow

If you’re building agents:

1. Add idempotency keys to every external action. This is the single highest-leverage fix. Without it, retries create duplicates.

2. Checkpoint state after each step. Use Redis, Postgres, S3, whatever. Just don’t keep state in memory only.

3. Log decisions, not just actions. When debugging, you need to know why the agent did something, not just what it did.

4. Design human escalation as a feature. Define when to escalate before things go wrong. Not “agent failed,” but “agent needs judgment.”

5. Build replay from day one. You will need to debug a failed agent run. Make sure you can reproduce it.

6. Set timeouts on everything. Every API call, every LLM call, every user wait. Without timeouts, agents hang forever and resources leak.

The loop is a fine abstraction. It’s just not complete. Wrap it in recovery, persistence, and human collaboration. Then you have a production agent.


Appendix: Architecture Diagrams

What Tutorials Leave Out

This is what a production agent actually needs — the orchestrator, error handling, state management, and human handoff that tutorials skip:

Production Agent Components

                     ORCHESTRATOR
                  (not in tutorials!)
                           │
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
   LOOP EXECUTOR     ERROR HANDLER     STATE MANAGER
    (the loop)        (recovery)         (memory)
         │                 │                 │
         ▼                 ▼                 ▼
   TOOL EXECUTOR     HUMAN HANDOFF      MONITORING
   (actual work)     (escalation)     (observability)

The Full Production View

PRODUCTION AGENT ARCHITECTURE

                     EVENT DISPATCHER
   (user input, tool responses, timeouts, external events)
         │                 │                 │
         ▼                 ▼                 ▼
   TASK MANAGER       TOOL ROUTER      HUMAN INTERFACE
   - queue            - dispatch       - clarifications
   - prioritize       - timeout        - approvals
   - interrupt        - retry          - escalations
   - checkpoint       - idempotency    - handoffs
         │                 │                 │
         └─────────────────┼─────────────────┘
                           ▼
                     STATE MANAGER
                - world model
                - conversation history
                - pending operations
                - checkpoints
                - audit log
                           │
                           ▼
                      REASONING
                - plan generation
                - confidence scoring
                - fallback selection
                - escalation triggers

The loop lives inside REASONING. Everything else makes it production-ready.


Go Deeper: Production Agents Series

This post covers the “what.” The deep dive series covers the “how.”

Part  Topic               What You’ll Learn
────  ─────────────────   ─────────────────────────────────────────────
0     Overview            Why 98% of orgs haven’t deployed agents at scale
1     Idempotency         Safe retries, the Stripe pattern, cascading failure prevention
2     State Persistence   Checkpointing, LangGraph patterns, hybrid memory
3     Human-in-the-Loop   Confidence routing, scaling without rubber-stamping
4     Cost Control        Token budgets, circuit breakers, model routing
5     Observability       Silent failure detection, semantic monitoring
6     Durable Execution   Temporal, Inngest, Restate, AWS/Azure/GCP offerings
7     Security            Sandboxing levels, prompt injection defense
8     Testing             Simulation-based testing, evaluation metrics