Production Agents Series

Cost Control & Token Budgets - Preventing $10K Surprises

Deep dive into cost control for production agents: token budgets, circuit breakers, model routing, max step limits, and preventing runaway loops that burn through API credits

Prerequisite: This is Part 4 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

Your agent enters a loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.

This isn’t hypothetical:

  • 68% of teams hit budget overruns in their first agent deployments
  • 50% cite “runaway tool loops and recursive logic” as the cause
  • Agents consume 5-20x more tokens than simple chains

What Goes Wrong Without This:

COST CONTROL FAILURE PATTERNS

Symptom: Monthly API bill 10x higher than expected.
Cause:   Agent retry loop when external API was down.
         No circuit breaker. Kept calling LLM for 6 hours.

Symptom: Single user task consumed $500 in tokens.
Cause:   Complex research task with no budget limit.
         Agent kept gathering more context, expanding scope.

Symptom: Costs vary wildly between identical requests.
Cause:   No model routing. Using GPT-4 for tasks GPT-3.5 handles fine.
         No visibility into per-task costs.

Why Agents Are Expensive

Agents aren’t just more LLM calls. They’re structurally more expensive.

Factor                    Simple Chain   Agent
LLM calls per task        1-3            5-50+
Context size growth       None           Accumulates each turn
Retries                   Rare           Common (external dependencies)
Tool outputs in context   Minimal        Large (file contents, API responses)
Loops                     None           Yes (observe-think-act)

Example cost breakdown:

COST COMPARISON: RAG vs AGENT
Simple RAG query:
1 embedding call:     $0.0001
1 completion call:    $0.01
Total:                $0.01

Agent research task:
5 planning calls:     $0.05
20 tool calls:        $0.20
10 analysis calls:    $0.10
3 retry loops:        $0.15
Total:                $0.50

50x more expensive for a single task. At scale, this compounds.
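
A back-of-the-envelope extrapolation using the per-task figures above shows how fast it compounds (the daily volume is a hypothetical number for illustration, not a benchmark):

# Illustrative scale math using the costs from the comparison above
RAG_COST_PER_TASK = 0.01      # $ per simple RAG query
AGENT_COST_PER_TASK = 0.50    # $ per agent research task
TASKS_PER_DAY = 10_000        # hypothetical volume

rag_daily = RAG_COST_PER_TASK * TASKS_PER_DAY      # $100/day
agent_daily = AGENT_COST_PER_TASK * TASKS_PER_DAY  # $5,000/day

print(f"RAG:   ${rag_daily:,.0f}/day (~${rag_daily * 30:,.0f}/month)")
print(f"Agent: ${agent_daily:,.0f}/day (~${agent_daily * 30:,.0f}/month)")

At that volume, the same workload goes from roughly $3K/month to $150K/month.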


Pattern 1: Token Budgets

Every task gets a budget. Exceed it, gracefully stop.

import logging

logger = logging.getLogger(__name__)

class TokenBudgetExceeded(Exception):
    """Raised when a task's token budget is exhausted."""
    def __init__(self, used, max, message):
        self.used = used
        self.max = max
        super().__init__(message)

class TokenBudget:
    def __init__(self, max_tokens=50000, warn_at=0.8):
        self.max = max_tokens
        self.warn_threshold = warn_at
        self.used = 0
        self.warning_issued = False

    def consume(self, tokens):
        self.used += tokens

        if not self.warning_issued and self.used >= self.max * self.warn_threshold:
            self.warning_issued = True
            logger.warning(f"Token budget at {self.used}/{self.max} ({self.warn_threshold*100}%)")

        if self.used >= self.max:
            raise TokenBudgetExceeded(
                used=self.used,
                max=self.max,
                message="Task exceeded token budget. Gracefully stopping."
            )

    @property
    def remaining(self):
        return max(0, self.max - self.used)

    @property
    def percentage_used(self):
        return self.used / self.max

# Usage in agent
budget = TokenBudget(max_tokens=100000)

for step in agent_loop():
    try:
        response = llm.call(prompt)
        budget.consume(response.usage.total_tokens)
    except TokenBudgetExceeded:
        return agent.graceful_shutdown("Budget exceeded")

Budget Sizing Guidelines

Task Type           Suggested Budget   Rationale
Simple Q&A          5,000 tokens       1-2 turns max
Document analysis   50,000 tokens      Large context, few turns
Research task       100,000 tokens     Many tool calls, iteration
Code generation     150,000 tokens     Multiple files, testing
Complex workflow    500,000 tokens     Multi-step, human-in-loop

Start conservative. Increase based on actual usage patterns, not guesses.
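
A minimal sketch of wiring these defaults in, reusing the TokenBudget class above (the task-type keys and numbers mirror the table; treat them as starting points, not canon):

# Hypothetical per-task-type defaults mirroring the sizing table above
BUDGET_DEFAULTS = {
    "simple_qa": 5_000,
    "document_analysis": 50_000,
    "research": 100_000,
    "code_generation": 150_000,
    "complex_workflow": 500_000,
}

def budget_for(task_type, overrides=None):
    """Pick a per-task budget, falling back to a conservative default."""
    overrides = overrides or {}
    max_tokens = overrides.get(task_type) or BUDGET_DEFAULTS.get(task_type, 5_000)
    return TokenBudget(max_tokens=max_tokens)

budget = budget_for("research")  # TokenBudget(max_tokens=100_000)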


Pattern 2: Circuit Breakers for Loops

Agents loop. Loops can run forever. Circuit breakers stop them.

class LoopLimitExceeded(Exception):
    """Raised when the agent exceeds its total iteration limit."""

class StuckInLoop(Exception):
    """Raised when the agent repeats the same action too many times."""

class LoopBreaker:
    def __init__(self, max_iterations=25, max_same_action=3):
        self.max_iterations = max_iterations
        self.max_same_action = max_same_action
        self.iterations = 0
        self.action_history = []

    def check(self, action):
        self.iterations += 1
        self.action_history.append(action)

        # Too many total iterations
        if self.iterations >= self.max_iterations:
            raise LoopLimitExceeded(
                f"Agent exceeded {self.max_iterations} iterations"
            )

        # Stuck in same action
        recent = self.action_history[-self.max_same_action:]
        if len(recent) == self.max_same_action and len(set(recent)) == 1:
            raise StuckInLoop(
                f"Agent repeated '{action}' {self.max_same_action} times"
            )

# Usage
breaker = LoopBreaker(max_iterations=25, max_same_action=3)

while not done:
    action = agent.decide()
    breaker.check(action.type)  # Raises if stuck
    result = agent.execute(action)

Loop Detection Strategies

Strategy               Detects                    Implementation
Max iterations         Runaway loops              Counter, hard limit
Same action repeated   Stuck agent                Track last N actions
No progress            Spinning without results   Track state changes
Time limit             Slow infinite loops        Wall clock timeout
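
LoopBreaker above covers the first two rows. A minimal sketch of the other two strategies, assuming the agent can produce a hashable snapshot of its state each iteration (it reuses the StuckInLoop exception from above):

import time

class ProgressGuard:
    """Backstop for the 'no progress' and 'time limit' strategies."""
    def __init__(self, max_seconds=300, max_stalls=5):
        self.deadline = time.monotonic() + max_seconds
        self.max_stalls = max_stalls
        self.last_state = None
        self.stalls = 0

    def check(self, state_snapshot):
        # Wall-clock backstop catches slow infinite loops
        if time.monotonic() > self.deadline:
            raise TimeoutError("Agent exceeded wall-clock limit")

        # No-progress detection: state unchanged across iterations
        if state_snapshot == self.last_state:
            self.stalls += 1
            if self.stalls >= self.max_stalls:
                raise StuckInLoop(f"No state change for {self.stalls} iterations")
        else:
            self.stalls = 0
            self.last_state = state_snapshot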

Pattern 3: Model Routing

Use expensive models only when needed.

class ModelRouter:
    def __init__(self):
        self.models = {
            "simple": "gpt-4o-mini",      # $0.15/1M input
            "standard": "gpt-4o",         # $5/1M input
            "complex": "claude-opus",     # $15/1M input
        }

    def route(self, task):
        # Classify task complexity
        if task.type in ["clarification", "formatting", "simple_qa"]:
            return self.models["simple"]

        if task.requires_reasoning or task.type in ["analysis", "planning"]:
            return self.models["standard"]

        if task.type in ["code_review", "complex_research", "multi_step"]:
            return self.models["complex"]

        return self.models["standard"]  # Default

# Usage
router = ModelRouter()
model = router.route(current_task)
response = llm.call(model=model, prompt=prompt)

Model Cost Comparison (Dec 2024)

Model           Input (per 1M)   Output (per 1M)   Use For
GPT-4o-mini     $0.15            $0.60             Formatting, simple tasks
GPT-4o          $5               $15               Standard reasoning
Claude Sonnet   $3               $15               Balanced cost/quality
Claude Opus     $15              $75               Complex tasks, code
GPT-4-turbo     $10              $30               Legacy compatibility

The math: If 60% of your tasks can use mini models, you save ~95% on those tasks.
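
A quick sanity check on that claim, using the input prices from the table (illustrative; real savings depend on your input/output token mix):

MINI, STANDARD = 0.15, 5.00   # $ per 1M input tokens, from the table above

per_task_savings = 1 - MINI / STANDARD      # ~97% cheaper per routed task
blended = 0.6 * MINI + 0.4 * STANDARD       # $2.09 per 1M input tokens after routing
overall_reduction = 1 - blended / STANDARD  # ~58% cheaper across all traffic

print(f"Per routed task:  ~{per_task_savings:.0%} cheaper")
print(f"Across all tasks: ~{overall_reduction:.0%} cheaper overall")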


Pattern 4: Cost Tracking

You can’t control what you don’t measure.

class CostTracker:
    # Pricing per 1K tokens (update as needed)
    PRICING = {
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
        "gpt-4o": {"input": 0.005, "output": 0.015},
        "claude-sonnet": {"input": 0.003, "output": 0.015},
        "claude-opus": {"input": 0.015, "output": 0.075},
    }

    def __init__(self, alert_threshold=10.0):
        self.total_cost = 0
        self.cost_by_model = {}
        self.cost_by_task_type = {}
        self.alert_threshold = alert_threshold

    def record(self, model, input_tokens, output_tokens, task_type=None):
        pricing = self.PRICING.get(model, {"input": 0.01, "output": 0.03})

        cost = (
            (input_tokens * pricing["input"] / 1000) +
            (output_tokens * pricing["output"] / 1000)
        )

        self.total_cost += cost
        self.cost_by_model[model] = self.cost_by_model.get(model, 0) + cost

        if task_type:
            self.cost_by_task_type[task_type] = (
                self.cost_by_task_type.get(task_type, 0) + cost
            )

        if self.total_cost >= self.alert_threshold:
            self.trigger_alert()

        return cost

    def trigger_alert(self):
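        # `alert` is a stand-in for your alerting client (Slack webhook, PagerDuty, etc.)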
        alert.send(
            channel="slack-finops",
            message=f"Agent cost alert: ${self.total_cost:.2f} exceeded threshold"
        )

    def report(self):
        return {
            "total_cost": self.total_cost,
            "by_model": self.cost_by_model,
            "by_task_type": self.cost_by_task_type,
        }
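
Usage, assuming a response object that exposes token counts (the llm client and usage attribute names are placeholders; they vary by provider):

tracker = CostTracker(alert_threshold=25.0)

response = llm.call(model="gpt-4o", prompt=prompt)
cost = tracker.record(
    model="gpt-4o",
    input_tokens=response.usage.prompt_tokens,       # attribute names vary by provider
    output_tokens=response.usage.completion_tokens,
    task_type="analysis",
)

logger.info(f"Step cost: ${cost:.4f}")
print(tracker.report())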

Cost Attribution Dimensions

Dimension       How to Track                Why It Matters
Per request     Tag spans with request_id   Identify expensive requests
Per user        Tag with user_id            Fair billing, abuse detection
Per task type   Classify tasks              Optimize high-cost task types
Per model       Track model in each call    Validate routing effectiveness
Per feature     Feature flags on tasks      ROI by feature
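
One minimal way to carry these dimensions, extending the CostTracker above with arbitrary tags (a sketch, not tied to any particular tracing library):

from collections import defaultdict

class AttributedCostTracker(CostTracker):
    """CostTracker plus arbitrary attribution dimensions."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.cost_by_tag = defaultdict(float)  # ("user_id", "u123") -> cost

    def record_tagged(self, model, input_tokens, output_tokens, tags=None):
        cost = self.record(model, input_tokens, output_tokens)
        for key, value in (tags or {}).items():
            self.cost_by_tag[(key, value)] += cost
        return cost

# tracker.record_tagged("gpt-4o", 1200, 300,
#     tags={"request_id": "r-42", "user_id": "u123", "feature": "search"})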

Pattern 5: Max Step Limits

Hard limits prevent catastrophic runaway.

class AgentExecutor:
    def __init__(self, agent, max_steps=50, max_tool_calls=100):
        self.agent = agent
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls

    def run(self, task):
        steps = 0
        tool_calls = 0

        while not task.is_complete():
            steps += 1

            if steps > self.max_steps:
                return self.force_completion(
                    task,
                    reason=f"Exceeded max steps ({self.max_steps})"
                )

            action = self.agent.decide(task)

            if action.is_tool_call:
                tool_calls += 1
                if tool_calls > self.max_tool_calls:
                    return self.force_completion(
                        task,
                        reason=f"Exceeded max tool calls ({self.max_tool_calls})"
                    )

            task = self.agent.execute(action)

        return task.result

    def force_completion(self, task, reason):
        logger.warning(f"Force completing task: {reason}")
        return self.agent.summarize_progress(task, interrupted=True)
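
Usage, layering the hard limits on top of the budget and breaker from earlier (agent and task are the same placeholders as in the snippets above):

executor = AgentExecutor(agent=agent, max_steps=50, max_tool_calls=100)
result = executor.run(task)  # real result, or a progress summary if force-completed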

Alerting Strategy

# Example alerting rules

alerts:
  - name: high_cost_request
    condition: request_cost > $5
    severity: warning
    action: log_and_review

  - name: budget_exceeded
    condition: daily_cost > $100
    severity: critical
    action: page_oncall

  - name: runaway_loop
    condition: iterations > 30
    severity: critical
    action: kill_and_alert

  - name: cost_spike
    condition: hourly_cost > 3x_average
    severity: warning
    action: investigate

  - name: model_misrouting
    condition: expensive_model_on_simple_task
    severity: info
    action: log_for_review
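
These rules are written for readability rather than any specific alerting product. A toy evaluator shows the shape of the logic (rule names and thresholds mirror the YAML above):

# Toy evaluator for threshold rules like the ones above (thresholds in dollars)
RULES = [
    {"name": "high_cost_request", "metric": "request_cost", "threshold": 5.0,   "severity": "warning"},
    {"name": "budget_exceeded",   "metric": "daily_cost",   "threshold": 100.0, "severity": "critical"},
]

def evaluate(metrics):
    """Return every rule whose threshold the current metrics exceed."""
    return [r for r in RULES if metrics.get(r["metric"], 0) > r["threshold"]]

fired = evaluate({"request_cost": 7.20, "daily_cost": 42.0})
# -> [{"name": "high_cost_request", ...}]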

Common Gotchas

Gotcha                 Symptom                          Fix
No budget in dev       Works in dev, explodes in prod   Budget in all environments
Budget too tight       Tasks fail legitimately          Monitor actual usage, adjust
No graceful shutdown   Task fails with no results       Implement partial result return
Static routing         Over-using expensive models      Dynamic complexity detection
No per-user limits     One user burns budget for all    User-level quotas
Alerting too late      See bill at end of month         Real-time cost monitoring

The Cost Control Checklist

Before deploying an agent:

COST CONTROL DEPLOYMENT CHECKLIST
TOKEN BUDGETS
[ ] Per-task budget defined
[ ] Warning at 80% threshold
[ ] Graceful shutdown when exceeded
[ ] Budget sizes based on actual usage data

LOOP PROTECTION
[ ] Max iterations limit
[ ] Same-action detection
[ ] Time limit as backstop
[ ] Progress tracking (no-op detection)

MODEL ROUTING
[ ] Task complexity classification
[ ] Model selection based on task
[ ] Default model is cost-efficient
[ ] Override for critical tasks

COST TRACKING
[ ] Per-request cost calculation
[ ] Per-user attribution
[ ] Per-task-type breakdown
[ ] Real-time dashboards

ALERTING
[ ] Per-request cost alerts
[ ] Daily budget alerts
[ ] Anomaly detection
[ ] Oncall escalation configured

Key Takeaways

  1. Agents are 5-20x more expensive than chains. Budget accordingly.

  2. Token budgets are mandatory. No task runs without a limit.

  3. Circuit breakers prevent runaway loops. Max iterations + stuck detection.

  4. Model routing saves 90%+ on simple tasks. Use expensive models selectively.

  5. You can’t control what you don’t measure. Track cost by request, user, task type.


Next Steps

Costs are controlled. But how do you know if your agent is doing the right thing?

Part 5: Observability & Silent Failures
