Controlling the rate of requests to a service to prevent overload, ensure fair usage, and protect against abuse
TL;DR
Rate limiting controls how many requests a client can make to a service within a time window. It protects backend resources from overload, ensures fair usage across clients, and defends against abuse like brute-force attacks. The two most common algorithms are token bucket (allows bursts) and sliding window (smooth enforcement).
Visual Overview
Core Explanation
What is Rate Limiting?
Real-World Analogy: Think of rate limiting like a nightclub bouncer. The club has a capacity of 100 people. The bouncer lets people in one at a time, but if the club is full, new arrivals must wait outside. Some VIPs (premium API users) might get a higher limit or skip the line.
Rate limiting enforces boundaries on how many requests a client can make:
- Per user: Each authenticated user gets N requests/minute
- Per IP: Unauthenticated requests limited by IP address
- Per API key: Different tiers get different limits
- Global: Protect the entire service from overload
How It Works
Every rate limiter needs to answer two questions:
- Identification: Who is making this request? (user ID, IP, API key)
- Counting: How many requests have they made recently?
The implementation varies by algorithm, but the flow is consistent:
Request arrives → Identify client → Check counter → Allow or Reject
Two Main Algorithms
Token Bucket:
- Tokens accumulate at a steady rate (refill)
- Each request consumes one token
- Burst allowed up to bucket capacity
- Best for: APIs where legitimate traffic is bursty
Sliding Window Counter:
- Counts requests in a sliding time window
- Weights previous window to prevent boundary bursts
- Smooth enforcement, no burst allowance
- Best for: Strict rate enforcement, billing limits
Real Systems Using Rate Limiting
| System | Algorithm Style | Notes | Use Case |
|---|---|---|---|
| GitHub API | Token bucket style | Tiered by authentication; check current docs for limits | Developer API access |
| Stripe | Sliding window | Different limits for live vs test mode | Payment processing |
| Twitter/X API | Tiered windows | Varies significantly by endpoint and tier | Social media API |
| AWS API Gateway | Token bucket | Fully configurable per stage | API management |
| Cloudflare | Leaky bucket | Rule-based configuration | Edge rate limiting |
Note: Specific limits change frequently. Always verify current limits in official documentation.
Case Study: API Gateway Rate Limiting
When to Use Rate Limiting
✓ Perfect Use Cases
✕ When NOT to Use (or Use Carefully)
Interview Application
Common Interview Question
Q: “You’re designing an API for a public service. How would you implement rate limiting? What algorithm would you choose?”
Strong Answer:
“I’d implement rate limiting at the API gateway level with a token bucket algorithm. Here’s my approach:
Why Token Bucket:
- Allows legitimate bursts: Users often make multiple quick requests (page load, app startup)
- Simple state: Just two values per client (tokens, last_refill_time)
- Configurable: Capacity controls burst size, refill rate controls steady-state
Implementation:
- Store state in Redis for distributed rate limiting
- Key: `ratelimit:{user_id}` with token count and timestamp
- Use an atomic Redis Lua script (or MULTI/EXEC) for check-and-decrement
Configuration:
- Capacity: 100 tokens (max burst)
- Refill: 10 tokens/second (100 req/sec steady-state)
- Different limits per API tier
Response Headers:
- `X-RateLimit-Limit`: Maximum requests allowed
- `X-RateLimit-Remaining`: Requests left in window
- `X-RateLimit-Reset`: Unix timestamp when limit resets
- `Retry-After`: Seconds to wait (on 429)
Edge Cases:
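A small helper along these lines can assemble those headers from a limiter's result. The shape of the `info` dict here mirrors the token bucket example later in this section; the helper itself is an illustrative sketch:

```python
import time

def rate_limit_headers(allowed: bool, info: dict) -> dict[str, str]:
    """Build standard rate-limit response headers from a limiter result."""
    headers = {
        "X-RateLimit-Limit": str(info["limit"]),
        "X-RateLimit-Remaining": str(info["remaining"]),
        "X-RateLimit-Reset": str(info["reset_at"]),
    }
    if not allowed:
        # On a 429, tell the client how long to back off
        headers["Retry-After"] = str(max(0, info["reset_at"] - int(time.time())))
    return headers
```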
- Clock skew: Use Redis server time, not client time
- Distributed: Single Redis cluster for consistency
- Failover: Fail open (allow) if Redis unavailable—better to risk abuse than block all users”
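That fail-open behavior can be sketched as a thin wrapper. Here `limiter.is_allowed` is assumed to return `(bool, dict)` as in the code example below, and the broad `except` would be narrowed to the Redis client's connection errors in practice:

```python
def check_with_fail_open(limiter, key: str):
    """Fail open: if the rate-limit store is unreachable, allow the request."""
    try:
        return limiter.is_allowed(key)
    except Exception:  # in production, catch only the Redis client's connection errors
        # Store is down: allow rather than reject every user.
        # Emit a metric here so the outage is visible.
        return True, {"remaining": None, "limit": None, "reset_at": None}
```

The opposite policy, failing closed, is appropriate for abuse-sensitive endpoints such as login, where blocking traffic is safer than admitting it unchecked.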
Follow-up: How would you handle distributed rate limiting across multiple regions?
“For multi-region, I’d use local rate limiting with global synchronization:
- Each region has local Redis for low-latency checks
- Async sync between regions (eventual consistency)
- Accept that users might get slightly more than limit globally
- Alternative: Single global Redis with latency cost
The trade-off is accuracy vs latency. For most APIs, slightly exceeding limits across regions is acceptable.”
Code Example
Token Bucket Rate Limiter (Python + Redis)
```python
import time

import redis


class TokenBucketRateLimiter:
    """
    Distributed token bucket rate limiter using Redis.

    Allows bursts up to capacity while enforcing average rate.
    """

    def __init__(self, redis_client: redis.Redis, capacity: int, refill_rate: float):
        """
        Args:
            redis_client: Redis connection
            capacity: Maximum tokens (burst size)
            refill_rate: Tokens added per second
        """
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate

    def is_allowed(self, key: str) -> tuple[bool, dict]:
        """
        Check if request is allowed and consume a token if so.

        Returns:
            (allowed: bool, info: dict with remaining, reset_at)
        """
        # NOTE: production code should derive `now` from the Redis TIME
        # command to avoid clock skew across application servers.
        now = time.time()
        bucket_key = f"ratelimit:{key}"

        # Lua script for atomic check-and-update.
        # This runs entirely on the Redis server (no race conditions).
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        -- Get current state
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now

        -- Calculate tokens to add since last refill
        local elapsed = now - last_refill
        local tokens_to_add = elapsed * refill_rate
        tokens = math.min(capacity, tokens + tokens_to_add)

        -- Check if we can consume a token
        local allowed = 0
        if tokens >= 1 then
            tokens = tokens - 1
            allowed = 1
        end

        -- Update state
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)  -- Clean up after 1 hour idle

        return {allowed, tokens, now + (capacity - tokens) / refill_rate}
        """

        result = self.redis.eval(
            lua_script,
            1,  # number of keys
            bucket_key,
            self.capacity,
            self.refill_rate,
            now,
        )

        allowed = result[0] == 1
        remaining = int(result[1])
        reset_at = int(result[2])

        return allowed, {
            "remaining": remaining,
            "limit": self.capacity,
            "reset_at": reset_at,
        }


# Usage example
if __name__ == "__main__":
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # 10 requests/second with burst of 20
    limiter = TokenBucketRateLimiter(
        redis_client=redis_client,
        capacity=20,      # Allow burst of 20 requests
        refill_rate=10,   # Refill 10 tokens/second
    )

    user_id = "user_123"

    # Simulate requests
    for i in range(25):
        allowed, info = limiter.is_allowed(user_id)
        if allowed:
            print(f"Request {i+1}: ALLOWED (remaining: {info['remaining']})")
        else:
            print(f"Request {i+1}: REJECTED (retry at: {info['reset_at']})")
            # In real code: return 429 with Retry-After header
```
Express.js Middleware Example
```javascript
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
});

// Create rate limiter middleware
const apiLimiter = rateLimit({
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),
  // 100 requests per 15 minutes
  windowMs: 15 * 60 * 1000,
  max: 100,
  // Return rate limit info in headers
  standardHeaders: true,
  legacyHeaders: false,
  // Custom key generator (by user ID if authenticated, else IP)
  keyGenerator: (req) => {
    return req.user?.id || req.ip;
  },
  // Custom response when rate limited
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      message: 'Please try again later',
      retryAfter: Math.ceil(req.rateLimit.resetTime / 1000),
    });
  },
});

// Apply to all API routes
app.use('/api/', apiLimiter);

// Stricter limit for auth endpoints
const authLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 5, // 5 attempts per minute
  message: 'Too many login attempts, please try again later',
});

app.use('/api/auth/login', authLimiter);
```
Related Content
See It In Action:
- Rate Limiting Explainer - Visual walkthrough of token bucket vs sliding window
Related Concepts:
- Token Bucket - Burst-tolerant algorithm
- Sliding Window - Smooth enforcement algorithm
- Load Balancing - Distributing traffic across servers
- Circuit Breaker - Failing fast when downstream is unhealthy
Quick Self-Check
- Can explain rate limiting in 60 seconds?
- Understand difference between token bucket and sliding window?
- Know what HTTP 429 means and what headers to return?
- Can implement distributed rate limiting with Redis?
- Understand why sliding window prevents boundary burst?
- Know when to use rate limiting vs circuit breaker?
Interview Notes
- Comes up in roughly 80% of API design interviews
- Powers systems at: essentially all production APIs
- Key talking points: resource protection, fair usage enforcement