The Agent Loop Is a Lie
You’ve seen this diagram:
```
      ┌───────────┐
  ┌─▶ │  OBSERVE  │  ← environment
  │   └─────┬─────┘
  │         │
  │         ▼
  │   ┌───────────┐
  │   │   THINK   │  ← reasoning
  │   └─────┬─────┘
  │         │
  │         ▼
  │   ┌───────────┐
  │   │    ACT    │  → execute
  │   └─────┬─────┘
  │         │
  └─────────┘   (repeat)
```
Elegant. Clean. Fits on a slide.
It’s also a lie.
Not because it’s wrong — it’s a fine abstraction for the happy path. The lie is that tutorials present it as complete. It’s not. It’s maybe 20% of what a production agent does. The other 80% is everything that happens when things go wrong.
What Actually Happens
The loop says: OBSERVE → THINK → ACT.
Production says:
```
PARTIAL_OBSERVE → MAYBE_THINK → FAILED_ACT
        ↓
RETRY → TIMEOUT → FALLBACK
        ↓
THINK_AGAIN → DIFFERENT_ACT
        ↓
OBSERVE_SIDE_EFFECTS → CLEANUP
```
Here’s a real trace. User says “Book me a flight to NYC tomorrow”:
```
Parse intent → destination=NYC, date=tomorrow
MISSING: departure city, time, airline
Should I ask or infer? Calendar shows SF meetings.
DECISION: assume SF departure (risky)

Call flight_search(SFO → JFK)
TIMEOUT. Retry. 47 flights found.

Wait — JFK or LGA or EWR? User said "NYC"
Should have asked. Now have 47 flights to 3 airports.

Check history: user took JFK twice before.
Filter to JFK. 18 flights. Show top 3 by price.

User: "I wanted LaGuardia"
INVALIDATE previous reasoning. Re-filter.

User: "Book the 8am one"
ERROR 409: Seat no longer available.
Flight sold out while user was deciding.

Check alternatives. 9am exists, $12 more.
"That flight just sold out. 9am is available. Book it?"
```
Seven things happened that the loop doesn’t model:
- Partial observation — missing departure city
- Inference under uncertainty — assumed SF, was wrong about airport
- Timeout and retry — API didn’t respond
- Mid-task correction — user wanted LGA, not JFK
- External state change — flight sold out during conversation
- Multi-step sequences — not single atomic actions
- Recovery and alternatives — fallback to 9am flight
The loop was a starting point. What we needed was a mess of retries, corrections, and fallbacks.
The Four Gaps
Gap 1: Observation Is Partial
The loop implies you observe, then have full state. Reality: you never have full state.
```python
# Tutorial version
state = agent.observe()  # Returns complete state

# Production
state = agent.observe()  # Always partial

# The real question: do I ask or infer?
if confidence(state) < THRESHOLD:
    if cost_of_asking < cost_of_being_wrong:
        state = agent.ask_for_clarification()
    else:
        state = agent.infer_from_context()
        state.mark_inferred()  # Track what we guessed
```
The decision to ask vs. infer is a judgment call with real tradeoffs. Asking is slow and annoying. Inferring is fast and risky. The loop pretends this decision doesn’t exist.
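One way to make that tradeoff explicit is a tiny expected-cost check. A sketch; the cost constants, threshold, and function name are illustrative, not from any framework:

```python
# Hypothetical cost model for the ask-vs-infer decision.
# The numbers are illustrative; calibrate them for your domain.
ASK_COST = 1.0     # one clarifying question: slow, mildly annoying
WRONG_COST = 10.0  # a wrong guess: rework, apologies, lost trust

def should_ask(confidence: float) -> bool:
    """Ask when the expected cost of guessing wrong exceeds asking."""
    return (1 - confidence) * WRONG_COST > ASK_COST

# confidence 0.95 → expected wrong-guess cost 0.5 → infer
# confidence 0.70 → expected wrong-guess cost 3.0 → ask
```

The point isn't the exact numbers. It's that the decision exists and deserves a policy, not an accident of prompt wording.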
Gap 2: Thinking Produces Plans, Not Actions
The loop implies: think once, get one action. Reality: thinking produces a strategy with fallbacks.
```python
# Tutorial
action = agent.think(state)
agent.execute(action)

# Production: think returns a plan
plan = agent.think(state)

# Plan structure:
# - primary: first thing to try
# - fallback: if primary fails
# - timeout: how long to wait
# - retry_policy: how to retry
# - abort_conditions: when to give up
# - human_trigger: when to escalate

result = execute_with_recovery(plan)
```
The “think” step doesn’t produce an action. It produces a strategy for what to do when things go wrong.
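For concreteness, here's what execute_with_recovery might look like. A minimal sketch against the plan structure above; the attribute names (.primary, .retry_policy, .abort_conditions) come from the comments in the previous block and are illustrative, not a real library API:

```python
import time

class EscalateToHuman(Exception):
    """Raised when both the primary and fallback strategies fail."""

def execute_with_recovery(plan):
    """Try the primary action, then the fallback, with retries."""
    for action in (plan.primary, plan.fallback):
        if action is None:
            continue
        for attempt in range(plan.retry_policy.max_attempts):
            result = action.run(timeout=plan.timeout)
            if result.success:
                return result
            if any(cond(result) for cond in plan.abort_conditions):
                return result  # unrecoverable: don't waste retries
            time.sleep(plan.retry_policy.backoff * (2 ** attempt))
    raise EscalateToHuman("primary and fallback both exhausted")
```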
Gap 3: Actions Have Messy Outcomes
The loop implies: action succeeds or fails, back to observe. Reality: actions have partial success, side effects, deferred results, and pending operations.
```python
# Tutorial
if result.success:
    continue
else:
    handle_error()

# Production: four kinds of messy outcomes

# 1. Partial success
if result.partial:
    # Flight booked, but seat not assigned yet
    # Email sent, but delivery not confirmed
    agent.track_pending(result.pending_items)

# 2. Side effects
if result.side_effects:
    # Booking consumed travel credit
    # API call triggered rate limiting
    agent.update_world_model(result.side_effects)

# 3. Deferred results
if result.deferred:
    # Confirmation comes in 24-48 hours
    # Human approval required
    agent.schedule_followup(result.callback)

# 4. Pending operations (critical!)
# Track what you're WAITING FOR, not just what you've DONE
if result.pending:
    # Email sent, waiting for CRM to sync
    # Payment initiated, waiting for confirmation
    agent.track_awaiting(result.pending_items, timeout=300)
```
Actions aren’t atomic. They’re transactions that might half-succeed, trigger side effects, or need followup later.
The 47 emails bug: An email agent observed “no response,” sent a follow-up, checkpointed, then observed “no response” again (email takes time to sync to CRM). It sent 47 follow-ups before someone noticed. The fix: track pending state — what you’re waiting for, not just what you’ve done.
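A minimal sketch of that fix, assuming an agent with a durable state dict and a checkpoint() method (as in the persistence section below); send_followup and the 48-hour sync window are illustrative:

```python
import time

def maybe_send_followup(agent, thread_id):
    """Check pending state before acting, not just completed state."""
    pending = agent.state.setdefault('pending', {})
    key = f"followup:{thread_id}"

    if key in pending:
        # Already sent one; we're waiting for the CRM to sync.
        if time.time() - pending[key] < 48 * 3600:
            return  # still inside the sync window: do nothing
        del pending[key]  # window expired; a follow-up is fair again

    send_followup(thread_id)    # hypothetical send helper
    pending[key] = time.time()  # record what we're waiting for
    agent.checkpoint()          # persist before observing again
```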
Gap 4: The Loop Is 20% of the Agent
The loop is the core reasoning cycle. But production agents need infrastructure around it:
```
WHAT THE LOOP COVERS        WHAT PRODUCTION NEEDS
─────────────────────       ─────────────────────
✓ Observe                   ✓ Observe
✓ Think                     ✓ Think
✓ Act                       ✓ Act
                            + Retry with backoff
                            + Timeout handling
                            + Partial failure recovery
                            + State persistence
                            + Checkpoint/resume
                            + Idempotency
                            + Human escalation
                            + Audit logging
                            + Replay for debugging
```
The loop is necessary but not sufficient.
The Real Hard Problems
The reviewers are right: I’ve been diagnosing without prescribing. Here are the actual problems you’ll hit and how to solve them.
Problem 1: Idempotency
This is the killer. If book_flight() times out and you retry, do you get two bookings?
```python
# Wrong: retry blindly
def book_flight(flight_id):
    return api.book(flight_id)  # Might create duplicate

# Right: idempotency key
def book_flight(flight_id, idempotency_key):
    # Same key = same result, no duplicate
    return api.book(flight_id, idempotency_key=idempotency_key)

# Generate the key from task context (stable identifiers only!)
# Bad:  f"{user_id}:{timestamp}"         - different on every retry
# Bad:  f"{user_id}:{retry_count}"       - different on every retry
# Good: f"{user_id}:{task_id}:{step.id}" - same across retries
key = f"{user_id}:{task_id}:{step.id}"
```
Every external action needs an idempotency strategy:
- API calls: Use idempotency keys if supported
- Emails: Check if already sent before sending (see the sketch after this list)
- Database writes: Use upserts or check-then-write
- Payments: Always use idempotency keys
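For the email case, a hedged sketch of check-then-send against a durable key-value store; storage and email_api are stand-ins, not real libraries:

```python
def send_email_idempotent(storage, key: str, to: str, body: str):
    """Send at most once per key, using the store as a sent-ledger."""
    cached = storage.get(f"sent:{key}")
    if cached is not None:
        return cached  # already sent: return the recorded result

    result = email_api.send(to=to, body=body)
    storage.set(f"sent:{key}", result)  # record success before returning
    return result
```

Note the remaining gap: a crash between send and set still resends once on resume. True exactly-once needs the key enforced on the provider's side, which is why the payments bullet says "always".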
Problem 2: State Persistence
Agent crashes mid-task. User closes browser. Server restarts. What happens to the booking?
```python
# Wrong: state lives in memory
class Agent:
    def __init__(self):
        self.state = {}  # Lost on crash

# Right: checkpoint to durable storage
class Agent:
    def __init__(self, task_id, storage):
        self.task_id = task_id
        self.storage = storage
        self.state = storage.load(task_id) or {'completed': []}

    def checkpoint(self):
        self.storage.save(self.task_id, self.state)

    def execute_step(self, step):
        # Checkpoint BEFORE execution (mark intent)
        self.state['in_progress'] = step.id
        self.checkpoint()
        # Execute
        result = step.run()
        # Checkpoint AFTER execution (mark completion)
        self.state['completed'].append(step.id)
        del self.state['in_progress']
        self.state['last_result'] = result
        self.checkpoint()
        return result

    def resume(self):
        """On restart, check if we crashed mid-step."""
        if 'in_progress' in self.state:
            step_id = self.state['in_progress']
            # Check if the step actually completed (idempotent read)
            if self.check_step_completed(step_id):
                self.state['completed'].append(step_id)
                del self.state['in_progress']
                self.checkpoint()
            # Otherwise, re-execute with the same idempotency key
```
Why checkpoint before execution? If you crash between step.run() and checkpoint(), you don’t know if the step ran. Checkpointing intent first lets you detect and handle this on resume.
Caveat: For true reliability, ensure your storage supports atomic writes. If checkpoint() can partially fail, you need write-ahead logging or versioned checkpoints.
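One common way to get that atomicity on a filesystem is write-to-temp-then-rename; os.replace is atomic within a filesystem, so readers see either the old checkpoint or the new one, never a torn write. A sketch:

```python
import json
import os
import tempfile

def atomic_checkpoint(path: str, state: dict):
    """Crash-safe checkpoint: write to a temp file, fsync, then rename."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap: old or new, never partial
    except BaseException:
        os.unlink(tmp_path)
        raise
```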
Problem 3: Replay for Debugging
Agent failed last night. How do you figure out why?
```python
# Wrong: print statements, hope for the best
def think(state):
    print(f"Thinking about {state}")  # Lost forever
    return decide(state)

# Right: structured audit log (as a method on the agent)
def think(self, state):
    decision = decide(state)
    self.audit.log({
        'timestamp': now(),
        'task_id': self.task_id,
        'phase': 'think',
        'input_state': state,
        'decision': decision,
        'reasoning': decision.explanation,
        'confidence': decision.confidence,
    })
    return decision

# Now you can replay:
# 1. Load the audit log for the failed task
# 2. See the exact state at each step
# 3. Reproduce the failure deterministically
```
Good agent frameworks give you replay for free. If yours doesn’t, build it.
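With logs like that, a replay harness is a few lines. A sketch, assuming the audit entry shape above and a deterministic decide() — pin the model version and sampling settings, or replay will only verify the non-LLM logic (audit_log.load is an illustrative API):

```python
def replay(audit_log, task_id):
    """Re-run decide() on each logged input and flag the first divergence."""
    for entry in audit_log.load(task_id):
        if entry['phase'] != 'think':
            continue
        replayed = decide(entry['input_state'])
        if replayed != entry['decision']:
            print(f"divergence at {entry['timestamp']}: "
                  f"logged={entry['decision']} replayed={replayed}")
            return entry
    print("replay matches the original run")
```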
Problem 4: Human Escalation (Not Failure)
Human-in-the-loop isn’t a fallback for when the agent fails. It’s a feature for when judgment is needed.
```python
# Wrong: humans are error handlers
except AgentStuck:
    notify_human("Agent failed, please help")

# Right: humans are decision makers
class EscalationPolicy:
    def should_escalate(self, decision):
        # High stakes? Get human approval.
        if decision.involves_payment and decision.amount > 500:
            return True
        # Low confidence? Ask the human.
        if decision.confidence < 0.7:
            return True
        # Irreversible? Double-check.
        if not decision.reversible:
            return True
        return False

# In the agent loop:
decision = agent.think(state)
if escalation_policy.should_escalate(decision):
    decision = await human.review(decision)  # Human approves/modifies
result = agent.execute(decision)
```
Design for human collaboration from day one. The agent prepares options; the human makes judgment calls.
Scaling caveat: Human escalation doesn’t scale linearly. As agent volume grows, human bandwidth doesn’t. At one fintech, month 1 looked great — humans approved/rejected thoughtfully. By month 6, the approval queue was 200 items deep and humans were rubber-stamping everything. Month 9: a $50K fraud slipped through.
The fix: don’t escalate 100% of edge cases. Use sampling — escalate a random 10-20% of borderline decisions. Humans become the audit mechanism, not the last line of defense. Track override rates and adjust thresholds based on what humans actually catch.
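A sketch of that policy; the confidence band, sample rate, and attribute names are illustrative:

```python
import random

class SampledEscalation:
    """Escalate all high-stakes decisions, sample the borderline ones."""

    def __init__(self, sample_rate=0.15):
        self.sample_rate = sample_rate
        self.reviewed = 0
        self.overridden = 0

    def should_escalate(self, decision) -> bool:
        if decision.involves_payment and decision.amount > 500:
            return True  # always review high stakes
        if 0.5 <= decision.confidence < 0.7:
            return random.random() < self.sample_rate  # sample the band
        return False

    def record_review(self, human_changed_it: bool):
        """Track the override rate to tune thresholds over time."""
        self.reviewed += 1
        self.overridden += int(human_changed_it)
```

The override rate (overridden / reviewed) tells you whether the band is right: near zero means you can sample less; high means the agent needs a tighter confidence threshold.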
The Minimal Production Agent
Here’s what you actually need. Not a diagram — working structure.
```python
class ProductionAgent:
    def __init__(self, task_id, storage, audit, escalation):
        self.task_id = task_id
        self.storage = storage
        self.audit = audit
        self.escalation = escalation
        self.state = storage.load(task_id) or initial_state()

    async def run(self):
        while not self.state.complete:
            try:
                # 1. Observe (with partial state handling)
                observation = await self.observe()
                if observation.needs_clarification:
                    await self.ask_user(observation.questions)
                    continue

                # 2. Think (produces plan, not action)
                plan = self.think(observation)
                self.audit.log('plan', plan)

                # 3. Check if human needed
                if self.escalation.should_escalate(plan):
                    plan = await self.escalation.get_approval(plan)

                # 4. Execute with recovery
                for step in plan.steps:
                    result = await self.execute_with_retry(
                        step,
                        idempotency_key=self.make_key(step),
                        timeout=plan.timeout,
                        retries=plan.retries,
                    )
                    self.state.record(step, result)
                    self.storage.checkpoint(self.state)  # Persist

                    if result.failed and not plan.fallback:
                        return await self.escalation.handoff(
                            "Step failed, no fallback",
                            self.state,
                        )
            except Exception as e:
                self.audit.log('error', e, self.state)
                return await self.escalation.handoff(str(e), self.state)

        return self.state.result
```
~50 lines. Handles:
- Partial observation → ask for clarification
- Plans with fallbacks → not single actions
- Human escalation → built-in, not bolted-on
- Idempotency → key per step
- Persistence → checkpoint after each step
- Audit → log for replay
- Recovery → retry with backoff, fallback, or escalate (execute_with_retry is sketched below)
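The run loop calls execute_with_retry without defining it. A minimal sketch, assuming step.run is an async callable that accepts the idempotency key and that your tool layer raises a retryable error type (both illustrative):

```python
import asyncio

class TransientError(Exception):
    """Stand-in for whatever your tool layer raises on retryable failures."""

async def execute_with_retry(step, idempotency_key, timeout, retries):
    """Bounded retries with exponential backoff and a hard timeout."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(
                step.run(idempotency_key=idempotency_key),
                timeout=timeout,
            )
        except (asyncio.TimeoutError, TransientError):
            if attempt == retries:
                raise  # out of retries: let the caller escalate
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```

Because the same idempotency key is passed on every attempt, a retry after a timeout can't double-book.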
What Changes Tomorrow
If you’re building agents:
1. Add idempotency keys to every external action. This is the single highest-leverage fix. Without it, retries create duplicates.
2. Checkpoint state after each step. Use Redis, Postgres, S3, whatever. Just don’t keep state in memory only.
3. Log decisions, not just actions. When debugging, you need to know why the agent did something, not just what it did.
4. Design human escalation as a feature. Define when to escalate before things go wrong. Not “agent failed,” but “agent needs judgment.”
5. Build replay from day one. You will need to debug a failed agent run. Make sure you can reproduce it.
6. Set timeouts on everything. Every API call, every LLM call, every user wait. Without timeouts, agents hang forever and resources leak.
The loop is a fine abstraction. It’s just not complete. Wrap it in recovery, persistence, and human collaboration. Then you have a production agent.
Appendix: Architecture Diagrams
What Tutorials Leave Out
This is what a production agent actually needs — the orchestrator, error handling, state management, and human handoff that tutorials skip:
```
              ┌─────────────────────────┐
              │      ORCHESTRATOR       │
              │   (not in tutorials!)   │
              └───────────┬─────────────┘
                          │
  ┌───────────────────────┼───────────────────────┐
  │                       │                       │
  ▼                       ▼                       ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ LOOP EXECUTOR │ │ ERROR HANDLER │ │ STATE MANAGER │
│  (the loop)   │ │  (recovery)   │ │   (memory)    │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
        │                 │                 │
        ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ TOOL EXECUTOR │ │ HUMAN HANDOFF │ │  MONITORING   │
│ (actual work) │ │ (escalation)  │ │(observability)│
└───────────────┘ └───────────────┘ └───────────────┘
```
The Full Production View
```
┌──────────────────────────────────────────────────────────────┐
│                       EVENT DISPATCHER                        │
│    (user input, tool responses, timeouts, external events)    │
└─────────────────────────────┬─────────────────────────────────┘
                              │
         ┌────────────────────┼─────────────────────┐
         │                    │                     │
         ▼                    ▼                     ▼
┌──────────────┐      ┌──────────────┐     ┌──────────────────┐
│ TASK MANAGER │      │ TOOL ROUTER  │     │ HUMAN INTERFACE  │
│              │      │              │     │                  │
│ - queue      │      │ - dispatch   │     │ - clarifications │
│ - prioritize │      │ - timeout    │     │ - approvals      │
│ - interrupt  │      │ - retry      │     │ - escalations    │
│ - checkpoint │      │ - idempotency│     │ - handoffs       │
└──────┬───────┘      └──────┬───────┘     └────────┬─────────┘
       │                     │                      │
       └─────────────────────┼──────────────────────┘
                             │
                             ▼
                ┌────────────────────────┐
                │     STATE MANAGER      │
                │                        │
                │ - world model          │
                │ - conversation history │
                │ - pending operations   │
                │ - checkpoints          │
                │ - audit log            │
                └───────────┬────────────┘
                            │
                            ▼
                ┌────────────────────────┐
                │       REASONING        │
                │                        │
                │ - plan generation      │
                │ - confidence scoring   │
                │ - fallback selection   │
                │ - escalation triggers  │
                └────────────────────────┘
```
The loop lives inside REASONING. Everything else makes it production-ready.
Go Deeper: Production Agents Series
This post covers the “what.” The deep dive series covers the “how.”
| Part | Topic | What You’ll Learn |
|---|---|---|
| 0 | Overview | Why 98% of orgs haven’t deployed agents at scale |
| 1 | Idempotency | Safe retries, the Stripe pattern, cascading failure prevention |
| 2 | State Persistence | Checkpointing, LangGraph patterns, hybrid memory |
| 3 | Human-in-the-Loop | Confidence routing, scaling without rubber-stamping |
| 4 | Cost Control | Token budgets, circuit breakers, model routing |
| 5 | Observability | Silent failure detection, semantic monitoring |
| 6 | Durable Execution | Temporal, Inngest, Restate, AWS/Azure/GCP offerings |
| 7 | Security | Sandboxing levels, prompt injection defense |
| 8 | Testing | Simulation-based testing, evaluation metrics |