TL;DR
Token bucket is a rate limiting algorithm where tokens accumulate in a bucket at a fixed rate up to a maximum capacity. Each request consumes a token. If tokens are available, the request proceeds; if not, it’s rejected. This allows bursts up to bucket capacity while enforcing an average rate equal to the refill rate.
Visual Overview
TOKEN BUCKET MECHANICS ┌────────────────────────────────────────────────────┐ │ │ │ Bucket State: [●●●●●] (5/5 tokens) │ │ │ │ Parameters: │ │ - Capacity: 5 tokens (max burst) │ │ - Refill: 1 token/second (average rate) │ │ │ │ Request 1: [●●●●○] → ALLOWED (4 remaining) │ │ Request 2: [●●●○○] → ALLOWED (3 remaining) │ │ Request 3: [●●○○○] → ALLOWED (2 remaining) │ │ Request 4: [●○○○○] → ALLOWED (1 remaining) │ │ Request 5: [○○○○○] → ALLOWED (0 remaining) │ │ Request 6: [○○○○○] → REJECTED (no tokens!) │ │ │ │ After 3 seconds: [●●●○○] (3 tokens refilled) │ │ │ └────────────────────────────────────────────────────┘ TWO KEY PARAMETERS ┌────────────────────────────────────────────────────┐ │ │ │ CAPACITY (bucket size) │ │ └─ Maximum burst size │ │ └─ How many requests can fire instantly │ │ │ │ REFILL RATE (tokens per second) │ │ └─ Steady-state throughput │ │ └─ Average rate over time │ │ │ │ Example: capacity=100, refill=10/sec │ │ - Burst: 100 requests instantly │ │ - Sustained: 10 requests/second average │ │ │ └────────────────────────────────────────────────────┘
Core Explanation
What is Token Bucket?
Real-World Analogy: Think of a parking garage with a limited number of spaces. Cars (requests) arrive and need a parking spot (token). The garage has 50 spots (capacity). Cars leave at a steady rate of 1 per minute (refill rate).
- If spots are available, cars park immediately
- If the garage is full, cars must wait or go elsewhere
- After rush hour, spots gradually become available again
- A burst of arrivals is fine—up to 50 cars—but then they must slow down
This is why token bucket is perfect for APIs: it allows legitimate burst traffic (page loads, app startup) while preventing sustained overload.
How It Works
The algorithm maintains two pieces of state per rate-limited key:
- tokens: Current number of available tokens (float)
- last_refill: Timestamp of last refill calculation
┌────────────────────────────────────────────────────┐ │ │ │ On Request Arrival: │ │ ┌─────────────────────────────────────────────┐ │ │ │ 1. Calculate elapsed time since last_refill │ │ │ │ 2. Add tokens: elapsed × refill_rate │ │ │ │ 3. Cap at capacity (no overflow) │ │ │ │ 4. Update last_refill = now │ │ │ └─────────────────────────────────────────────┘ │ │ ↓ │ │ ┌─────────────────────────────────────────────┐ │ │ │ Check: tokens >= 1? │ │ │ │ YES → tokens -= 1, return ALLOW │ │ │ │ NO → return REJECT (429 Too Many) │ │ │ └─────────────────────────────────────────────┘ │ │ │ │ State per key: ~16 bytes (float + timestamp) │ │ Time complexity: O(1) per request │ │ │ └────────────────────────────────────────────────────┘
The Algorithm
on request:
now = current_time()
elapsed = now - last_refill
tokens_to_add = elapsed * refill_rate
tokens = min(capacity, tokens + tokens_to_add)
last_refill = now
if tokens >= 1:
tokens -= 1
return ALLOW
else:
return REJECT
Key Properties
| Property | Value | Notes |
|---|---|---|
| Time Complexity | O(1) per request | Simple arithmetic |
| Space Complexity | O(1) per key | ~16 bytes (tokens + timestamp) |
| Burst Tolerance | Up to capacity | Allows full burst immediately |
| Average Rate | refill_rate | Long-term steady state |
Token Bucket vs Leaky Bucket
TOKEN BUCKET ┌────────────────────────────────────────────────────┐ │ Requests → Check Tokens → Allow/Reject │ │ │ │ [●●●●●] full bucket │ │ ↓↓↓↓↓ (5 rapid requests) │ │ [○○○○○] empty → REJECT until refill │ │ │ │ ✓ Allows bursts up to capacity │ │ ✓ Good for bursty but legitimate traffic │ │ ✓ Simple to implement │ └────────────────────────────────────────────────────┘ LEAKY BUCKET ┌────────────────────────────────────────────────────┐ │ Requests → Queue → Drain at fixed rate │ │ │ │ ↓↓↓↓↓ (5 rapid requests) │ │ [●●●●●] queue fills │ │ ↓ (drains one at a time at fixed rate) │ │ ●→out ●→out ●→out... │ │ │ │ ✓ Smooths output to constant rate │ │ ✓ Good for shaping traffic to downstream │ │ ✕ Adds latency (queueing) │ └────────────────────────────────────────────────────┘ KEY DIFFERENCE ┌────────────────────────────────────────────────────┐ │ Token Bucket: "Can you handle this burst?" │ │ → Yes/No immediately (no queueing) │ │ │ │ Leaky Bucket: "Process requests at steady rate" │ │ → Queues bursts, drains smoothly │ │ │ │ Use Token Bucket for: API rate limiting │ │ Use Leaky Bucket for: Network traffic shaping │ └────────────────────────────────────────────────────┘
Real Systems Using Token Bucket
| System | Implementation | Configuration | Use Case |
|---|---|---|---|
| AWS API Gateway | Native token bucket | Configurable per stage | API management |
| NGINX | ngx_http_limit_req_module | burst parameter controls capacity | Web server rate limiting |
| Envoy Proxy | Local/global rate limiting | Configurable via xDS | Service mesh |
| Google Cloud Endpoints | Token bucket style | Quota configuration | API quotas |
| Kong Gateway | Rate limiting plugin | Configurable policy | API gateway |
Note: Implementations may vary slightly. Verify current behavior in official documentation.
Configuration Patterns
API ENDPOINT PATTERNS (Illustrative) ┌────────────────────────────────────────────────────┐ │ Endpoint Type │ Capacity │ Refill │ Notes │ │────────────────────┼──────────┼───────────┼─────── │ │ Standard API │ 100 │ 10/sec │ Burst │ │ Expensive Query │ 10 │ 1/sec │ Heavy │ │ Login/Auth │ 5 │ 1/min │ Slow │ │ Webhook Receiver │ 500 │ 100/sec │ Spiky │ └────────────────────────────────────────────────────┘ TIERED RATE LIMITS ┌────────────────────────────────────────────────────┐ │ Tier │ Bucket Size │ Refill Rate │ Monthly │ │─────────────┼─────────────┼─────────────┼───────── │ │ Free │ Small │ Low │ Trial │ │ Developer │ Medium │ Medium │ Basic │ │ Business │ Large │ High │ Standard │ │ Enterprise │ Custom │ Custom │ Custom │ └────────────────────────────────────────────────────┘ (Exact limits vary by provider—consult documentation)
When to Use Token Bucket
✓ Perfect Use Cases
| Use Case | Scenario | Requirement | Configuration | Trade-off |
|---|---|---|---|---|
| Public APIs with bursty clients | REST API called by mobile apps | Allow burst on app open, enforce average rate | capacity=50, refill=10/sec | Allows short bursts, blocks sustained overload |
| CDN/edge rate limiting | Edge nodes rate-limit per IP | Fast O(1) check, minimal state | capacity=100, refill=20/sec per IP | Larger capacity = more tolerance for legitimate spikes |
| Tiered API access | Free vs paid API tiers | Different burst/rate per subscription | Free(10,1), Pro(100,10), Enterprise(1000,100) | Need to track tier per API key |
| Internal service protection | Database connection rate limiting | Prevent connection storms | capacity=20, refill=5/sec per service | May queue legitimate requests during spikes |
✕ When NOT to Use
STRICT BILLING/QUOTA ENFORCEMENT Problem: Token bucket allows bursts that may exceed quota Example: "1000 requests/hour" → user burns 100 instantly Alternative: Sliding window log for exact counting (counter variant is also approximate) When OK: If slight overage is acceptable SMOOTH OUTPUT REQUIRED Problem: Token bucket doesn't smooth output rate Example: Database can handle 10 req/sec, not 100 in 1 second Alternative: Leaky bucket (queues and drains steadily) When OK: If downstream handles bursts fine GLOBAL DISTRIBUTED RATE LIMITING Problem: Local buckets can allow N × limit globally Example: 10 regions × 100 burst = 1000 concurrent burst Alternative: Centralized limiter or eventual sync When OK: If approximate limiting is acceptable
Interview Application
Common Interview Question
Q: “How would you implement rate limiting for a public API? Walk me through token bucket and when you’d choose it over alternatives.”
Strong Answer:
“I’d implement rate limiting using a token bucket algorithm for most API use cases. Here’s my reasoning:
Why Token Bucket:
- Burst tolerance: Real clients are bursty—mobile apps make 10 requests on launch, then go quiet. Token bucket handles this gracefully.
- O(1) per request: Just arithmetic—no scanning, no sorting. Critical for high-throughput APIs.
- Simple state: Two values per key (tokens, timestamp). Easy to store in Redis.
The Algorithm:
- Each request checks: do I have tokens? If yes, consume one and proceed. If no, reject with 429.
- Tokens refill at a steady rate (e.g., 10/second) up to capacity (e.g., 100).
- Capacity controls burst size; refill rate controls sustained throughput.
Distributed Implementation:
- Store
{tokens, last_refill}in Redis per API key- Use Lua script for atomic check-and-decrement
- Handle Redis failures by failing open (allow request)—better to risk some overload than block everyone
When I’d choose alternatives:
- Sliding window: When I need exact counting for billing (no burst overage)
- Leaky bucket: When I need to smooth output to a downstream service
- Fixed window: Never—boundary burst problem makes it unreliable
Headers I’d return:
X-RateLimit-Limit: Bucket capacityX-RateLimit-Remaining: Current tokensX-RateLimit-Reset: When bucket refills to capacityRetry-After: Seconds until tokens available (on 429)”
Follow-up: What’s the boundary burst problem and how does token bucket handle it?
“The boundary burst problem affects fixed window rate limiting. If the window resets at the top of each minute, a client can make 100 requests at 0:59 and 100 more at 1:00—effectively 200 requests in 2 seconds while technically staying within ‘100/minute’.
Token bucket avoids this because it doesn’t have discrete windows. Tokens accumulate continuously. If you burn all tokens at 0:59, you don’t magically get more at 1:00—you wait for refill. The maximum burst is always capped at capacity, regardless of timing.”
Follow-up: How do you handle token bucket in a multi-region deployment?
“There are three approaches with different trade-offs:
Sticky routing: Route each user to one region. Simple but adds latency for users far from their assigned region.
Synchronized global state: Single Redis cluster, all regions check it. Accurate but adds cross-region latency to every request.
Local with eventual sync: Each region has local bucket, periodically sync totals. Fast but user might get N × limit globally.
For most APIs, I’d choose option 3—slightly exceeding limits is acceptable compared to adding 100ms+ latency to every request. For billing-critical endpoints, option 2 or sticky routing.”
Code Example
Token Bucket Rate Limiter (Python)
import time
from dataclasses import dataclass
from typing import Tuple, Dict
@dataclass
class BucketState:
"""State for a single rate-limited key."""
tokens: float
last_refill: float
class TokenBucketRateLimiter:
"""
In-memory token bucket rate limiter.
For production, replace self.buckets with Redis storage.
"""
def __init__(self, capacity: int, refill_rate: float):
"""
Args:
capacity: Maximum tokens (burst size)
refill_rate: Tokens added per second
# ... omitted: keep concept snippets short
# Wait 3 seconds for refill
print("\nWaiting 3 seconds...")
time.sleep(3)
# Try again
print("\nAfter waiting:")
for i in range(4):
allowed, info = limiter.is_allowed(user)
status = "✓ ALLOWED" if allowed else "✗ REJECTED"
print(f" Request {i+1}: {status} (remaining: {info['remaining']})")
Token Bucket with Redis (Production)
import time
import redis
class RedisTokenBucket:
"""
Distributed token bucket using Redis.
Uses Lua script for atomic check-and-decrement.
"""
LUA_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
-- Get current state (or initialize)
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Calculate refill
# ... omitted: keep concept snippets short
result = self.script(
keys=[f"ratelimit:{key}"],
args=[self.capacity, self.refill_rate, now]
)
return result[0] == 1, {
"remaining": result[1],
"limit": self.capacity,
"reset_at": result[2]
}
Express.js Middleware
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);
// Token bucket middleware factory
function tokenBucket({ capacity, refillRate, keyGenerator }) {
const luaScript = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
local allowed = 0
if tokens >= 1 then
tokens = tokens - 1
// ... omitted: keep concept snippets short
}
};
}
// Usage
app.use('/api/', tokenBucket({
capacity: 100, // Burst of 100
refillRate: 10, // 10 requests/second sustained
keyGenerator: (req) => req.user?.id || req.ip,
}));
Related Content
See It In Action:
- Rate Limiting Explainer - Compares token bucket with sliding window
Related Concepts:
- Rate Limiting - Parent concept
- Sliding Window - Alternative: smoother enforcement, no burst
- Leaky Bucket - Smooths output rate with queueing
Quick Self-Check
- Can explain token bucket in 60 seconds?
- Understand the difference between capacity and refill rate?
- Know why token bucket allows bursts while leaky bucket smooths output?
- Can implement atomic token bucket in Redis with Lua?
- Understand the trade-off with fixed window (boundary burst problem)?
- Know when to choose sliding window over token bucket?
Production signal