Token Bucket | Concepts

TL;DR

Token bucket is a rate limiting algorithm where tokens accumulate in a bucket at a fixed rate up to a maximum capacity. Each request consumes a token. If tokens are available, the request proceeds; if not, it’s rejected. This allows bursts up to bucket capacity while enforcing an average rate equal to the refill rate.

Visual Overview

Token Bucket Algorithm

TOKEN BUCKET MECHANICS
┌────────────────────────────────────────────────────┐
│                                                    │
│  Bucket State: [●●●●●] (5/5 tokens)                │
│                                                    │
│  Parameters:                                       │
│  - Capacity: 5 tokens (max burst)                  │
│  - Refill: 1 token/second (average rate)           │
│                                                    │
│  Request 1: [●●●●○] → ALLOWED (4 remaining)        │
│  Request 2: [●●●○○] → ALLOWED (3 remaining)        │
│  Request 3: [●●○○○] → ALLOWED (2 remaining)        │
│  Request 4: [●○○○○] → ALLOWED (1 remaining)        │
│  Request 5: [○○○○○] → ALLOWED (0 remaining)        │
│  Request 6: [○○○○○] → REJECTED (no tokens!)        │
│                                                    │
│  After 3 seconds: [●●●○○] (3 tokens refilled)      │
│                                                    │
└────────────────────────────────────────────────────┘

TWO KEY PARAMETERS
┌────────────────────────────────────────────────────┐
│                                                    │
│  CAPACITY (bucket size)                            │
│  └─ Maximum burst size                             │
│  └─ How many requests can fire instantly           │
│                                                    │
│  REFILL RATE (tokens per second)                   │
│  └─ Steady-state throughput                        │
│  └─ Average rate over time                         │
│                                                    │
│  Example: capacity=100, refill=10/sec              │
│  - Burst: 100 requests instantly                   │
│  - Sustained: 10 requests/second average           │
│                                                    │
└────────────────────────────────────────────────────┘

Core Explanation

What is Token Bucket?

Real-World Analogy: Think of a parking garage with a limited number of spaces. Cars (requests) arrive and need a parking spot (token). The garage has 50 spots (capacity). Cars leave at a steady rate of 1 per minute (refill rate).

If spots are available, cars park immediately
If the garage is full, cars must wait or go elsewhere
After rush hour, spots gradually become available again
A burst of arrivals is fine—up to 50 cars—but then they must slow down

This is why token bucket is perfect for APIs: it allows legitimate burst traffic (page loads, app startup) while preventing sustained overload.

How It Works

The algorithm maintains two pieces of state per rate-limited key:

tokens: Current number of available tokens (float)
last_refill: Timestamp of last refill calculation

Token Bucket State Machine

┌────────────────────────────────────────────────────┐
│                                                    │
│  On Request Arrival:                               │
│  ┌─────────────────────────────────────────────┐   │
│  │ 1. Calculate elapsed time since last_refill │   │
│  │ 2. Add tokens: elapsed × refill_rate        │   │
│  │ 3. Cap at capacity (no overflow)            │   │
│  │ 4. Update last_refill = now                 │   │
│  └─────────────────────────────────────────────┘   │
│                    ↓                               │
│  ┌─────────────────────────────────────────────┐   │
│  │ Check: tokens >= 1?                         │   │
│  │     YES → tokens -= 1, return ALLOW         │   │
│  │     NO  → return REJECT (429 Too Many)      │   │
│  └─────────────────────────────────────────────┘   │
│                                                    │
│  State per key: ~16 bytes (float + timestamp)      │
│  Time complexity: O(1) per request                 │
│                                                    │
└────────────────────────────────────────────────────┘

The Algorithm

on request:
  now = current_time()
  elapsed = now - last_refill
  tokens_to_add = elapsed * refill_rate
  tokens = min(capacity, tokens + tokens_to_add)
  last_refill = now

  if tokens >= 1:
    tokens -= 1
    return ALLOW
  else:
    return REJECT

Key Properties

Property	Value	Notes
Time Complexity	O(1) per request	Simple arithmetic
Space Complexity	O(1) per key	~16 bytes (tokens + timestamp)
Burst Tolerance	Up to capacity	Allows full burst immediately
Average Rate	refill_rate	Long-term steady state

Token Bucket vs Leaky Bucket

TOKEN BUCKET
┌────────────────────────────────────────────────────┐
│  Requests → Check Tokens → Allow/Reject            │
│                                                    │
│  [●●●●●] full bucket                               │
│  ↓↓↓↓↓ (5 rapid requests)                          │
│  [○○○○○] empty → REJECT until refill               │
│                                                    │
│  ✓ Allows bursts up to capacity                    │
│  ✓ Good for bursty but legitimate traffic          │
│  ✓ Simple to implement                             │
└────────────────────────────────────────────────────┘

LEAKY BUCKET
┌────────────────────────────────────────────────────┐
│  Requests → Queue → Drain at fixed rate            │
│                                                    │
│  ↓↓↓↓↓ (5 rapid requests)                          │
│  [●●●●●] queue fills                               │
│     ↓ (drains one at a time at fixed rate)         │
│    ●→out  ●→out  ●→out...                          │
│                                                    │
│  ✓ Smooths output to constant rate                 │
│  ✓ Good for shaping traffic to downstream          │
│  ✕ Adds latency (queueing)                         │
└────────────────────────────────────────────────────┘

KEY DIFFERENCE
┌────────────────────────────────────────────────────┐
│  Token Bucket: "Can you handle this burst?"        │
│    → Yes/No immediately (no queueing)              │
│                                                    │
│  Leaky Bucket: "Process requests at steady rate"   │
│    → Queues bursts, drains smoothly                │
│                                                    │
│  Use Token Bucket for: API rate limiting           │
│  Use Leaky Bucket for: Network traffic shaping     │
└────────────────────────────────────────────────────┘

Real Systems Using Token Bucket

System	Implementation	Configuration	Use Case
AWS API Gateway	Native token bucket	Configurable per stage	API management
NGINX	ngx_http_limit_req_module	burst parameter controls capacity	Web server rate limiting
Envoy Proxy	Local/global rate limiting	Configurable via xDS	Service mesh
Google Cloud Endpoints	Token bucket style	Quota configuration	API quotas
Kong Gateway	Rate limiting plugin	Configurable policy	API gateway

Note: Implementations may vary slightly. Verify current behavior in official documentation.

Configuration Patterns

Common Token Bucket Configurations

API ENDPOINT PATTERNS (Illustrative)
┌────────────────────────────────────────────────────┐
│  Endpoint Type     │ Capacity │ Refill    │ Notes  │
│────────────────────┼──────────┼───────────┼─────── │
│  Standard API      │ 100      │ 10/sec    │ Burst  │
│  Expensive Query   │ 10       │ 1/sec     │ Heavy  │
│  Login/Auth        │ 5        │ 1/min     │ Slow   │
│  Webhook Receiver  │ 500      │ 100/sec   │ Spiky  │
└────────────────────────────────────────────────────┘

TIERED RATE LIMITS
┌────────────────────────────────────────────────────┐
│  Tier       │ Bucket Size │ Refill Rate │ Monthly  │
│─────────────┼─────────────┼─────────────┼───────── │
│  Free       │ Small       │ Low         │ Trial    │
│  Developer  │ Medium      │ Medium      │ Basic    │
│  Business   │ Large       │ High        │ Standard │
│  Enterprise │ Custom      │ Custom      │ Custom   │
└────────────────────────────────────────────────────┘
(Exact limits vary by provider—consult documentation)

When to Use Token Bucket

✓ Perfect Use Cases

Use Case	Scenario	Requirement	Configuration	Trade-off
Public APIs with bursty clients	REST API called by mobile apps	Allow burst on app open, enforce average rate	capacity=50, refill=10/sec	Allows short bursts, blocks sustained overload
CDN/edge rate limiting	Edge nodes rate-limit per IP	Fast O(1) check, minimal state	capacity=100, refill=20/sec per IP	Larger capacity = more tolerance for legitimate spikes
Tiered API access	Free vs paid API tiers	Different burst/rate per subscription	Free(10,1), Pro(100,10), Enterprise(1000,100)	Need to track tier per API key
Internal service protection	Database connection rate limiting	Prevent connection storms	capacity=20, refill=5/sec per service	May queue legitimate requests during spikes

✕ When NOT to Use

When Token Bucket May Not Fit

STRICT BILLING/QUOTA ENFORCEMENT
Problem: Token bucket allows bursts that may exceed quota
Example: "1000 requests/hour" → user burns 100 instantly
Alternative: Sliding window log for exact counting
         (counter variant is also approximate)
When OK: If slight overage is acceptable

SMOOTH OUTPUT REQUIRED
Problem: Token bucket doesn't smooth output rate
Example: Database can handle 10 req/sec, not 100 in 1 second
Alternative: Leaky bucket (queues and drains steadily)
When OK: If downstream handles bursts fine

GLOBAL DISTRIBUTED RATE LIMITING
Problem: Local buckets can allow N × limit globally
Example: 10 regions × 100 burst = 1000 concurrent burst
Alternative: Centralized limiter or eventual sync
When OK: If approximate limiting is acceptable

Interview Application

Common Interview Question

Q: “How would you implement rate limiting for a public API? Walk me through token bucket and when you’d choose it over alternatives.”

Strong Answer:

“I’d implement rate limiting using a token bucket algorithm for most API use cases. Here’s my reasoning:

Why Token Bucket:

Burst tolerance: Real clients are bursty—mobile apps make 10 requests on launch, then go quiet. Token bucket handles this gracefully.

O(1) per request: Just arithmetic—no scanning, no sorting. Critical for high-throughput APIs.

Simple state: Two values per key (tokens, timestamp). Easy to store in Redis.

The Algorithm:

Each request checks: do I have tokens? If yes, consume one and proceed. If no, reject with 429.

Tokens refill at a steady rate (e.g., 10/second) up to capacity (e.g., 100).

Capacity controls burst size; refill rate controls sustained throughput.

Distributed Implementation:

Store {tokens, last_refill} in Redis per API key

Use Lua script for atomic check-and-decrement

Handle Redis failures by failing open (allow request)—better to risk some overload than block everyone

When I’d choose alternatives:

Sliding window: When I need exact counting for billing (no burst overage)

Leaky bucket: When I need to smooth output to a downstream service

Fixed window: Never—boundary burst problem makes it unreliable

Headers I’d return:

X-RateLimit-Limit: Bucket capacity

X-RateLimit-Remaining: Current tokens

X-RateLimit-Reset: When bucket refills to capacity

Retry-After: Seconds until tokens available (on 429)”

Follow-up: What’s the boundary burst problem and how does token bucket handle it?

“The boundary burst problem affects fixed window rate limiting. If the window resets at the top of each minute, a client can make 100 requests at 0:59 and 100 more at 1:00—effectively 200 requests in 2 seconds while technically staying within ‘100/minute’.

Token bucket avoids this because it doesn’t have discrete windows. Tokens accumulate continuously. If you burn all tokens at 0:59, you don’t magically get more at 1:00—you wait for refill. The maximum burst is always capped at capacity, regardless of timing.”

Follow-up: How do you handle token bucket in a multi-region deployment?

“There are three approaches with different trade-offs:

Sticky routing: Route each user to one region. Simple but adds latency for users far from their assigned region.

Synchronized global state: Single Redis cluster, all regions check it. Accurate but adds cross-region latency to every request.

Local with eventual sync: Each region has local bucket, periodically sync totals. Fast but user might get N × limit globally.

For most APIs, I’d choose option 3—slightly exceeding limits is acceptable compared to adding 100ms+ latency to every request. For billing-critical endpoints, option 2 or sticky routing.”

Code Example

Token Bucket Rate Limiter (Python)

import time
from dataclasses import dataclass
from typing import Tuple, Dict

@dataclass
class BucketState:
    """State for a single rate-limited key."""
    tokens: float
    last_refill: float

class TokenBucketRateLimiter:
    """
    In-memory token bucket rate limiter.

    For production, replace self.buckets with Redis storage.
    """

    def __init__(self, capacity: int, refill_rate: float):
        """
        Args:
            capacity: Maximum tokens (burst size)
            refill_rate: Tokens added per second
    # ... omitted: keep concept snippets short
    # Wait 3 seconds for refill
    print("\nWaiting 3 seconds...")
    time.sleep(3)

    # Try again
    print("\nAfter waiting:")
    for i in range(4):
        allowed, info = limiter.is_allowed(user)
        status = "✓ ALLOWED" if allowed else "✗ REJECTED"
        print(f"  Request {i+1}: {status} (remaining: {info['remaining']})")

Token Bucket with Redis (Production)

import time
import redis

class RedisTokenBucket:
    """
    Distributed token bucket using Redis.

    Uses Lua script for atomic check-and-decrement.
    """

    LUA_SCRIPT = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    -- Get current state (or initialize)
    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or capacity
    local last_refill = tonumber(bucket[2]) or now

    -- Calculate refill
    # ... omitted: keep concept snippets short
        result = self.script(
            keys=[f"ratelimit:{key}"],
            args=[self.capacity, self.refill_rate, now]
        )

        return result[0] == 1, {
            "remaining": result[1],
            "limit": self.capacity,
            "reset_at": result[2]
        }

Express.js Middleware

const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

// Token bucket middleware factory
function tokenBucket({ capacity, refillRate, keyGenerator }) {
  const luaScript = `
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or capacity
    local last_refill = tonumber(bucket[2]) or now

    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)

    local allowed = 0
    if tokens >= 1 then
      tokens = tokens - 1
  // ... omitted: keep concept snippets short
    }
  };
}

// Usage
app.use('/api/', tokenBucket({
  capacity: 100,      // Burst of 100
  refillRate: 10,     // 10 requests/second sustained
  keyGenerator: (req) => req.user?.id || req.ip,
}));

See It In Action:

Rate Limiting Explainer - Compares token bucket with sliding window

Related Concepts:

Rate Limiting - Parent concept
Sliding Window - Alternative: smoother enforcement, no burst
Leaky Bucket - Smooths output rate with queueing

Quick Self-Check

Can explain token bucket in 60 seconds?
Understand the difference between capacity and refill rate?
Know why token bucket allows bursts while leaky bucket smooths output?
Can implement atomic token bucket in Redis with Lua?
Understand the trade-off with fixed window (boundary burst problem)?
Know when to choose sliding window over token bucket?