Rate Limiting

Controlling the rate of requests to a service to prevent overload, ensure fair usage, and protect against abuse

TL;DR

Rate limiting controls how many requests a client can make to a service within a time window. It protects backend resources from overload, ensures fair usage across clients, and defends against abuse like brute-force attacks. The two most common algorithms are token bucket (allows bursts) and sliding window (smooth enforcement).

Visual Overview

Rate Limiting Flow

Core Explanation

What is Rate Limiting?

Real-World Analogy: Think of rate limiting like a nightclub bouncer. The club has a capacity of 100 people. The bouncer lets people in one at a time, but if the club is full, new arrivals must wait outside. Some VIPs (premium API users) might get a higher limit or skip the line.

Rate limiting enforces boundaries on how many requests a client can make:

  • Per user: Each authenticated user gets N requests/minute
  • Per IP: Unauthenticated requests limited by IP address
  • Per API key: Different tiers get different limits
  • Global: Protect the entire service from overload

How It Works

Every rate limiter needs to answer two questions:

  1. Identification: Who is making this request? (user ID, IP, API key)
  2. Counting: How many requests have they made recently?

The implementation varies by algorithm, but the flow is consistent:

Request arrives → Identify client → Check counter → Allow or Reject
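The flow above can be sketched in a few lines. This is a minimal in-memory illustration (not production code): it identifies the client by an ID, counts requests in a fixed one-second window, and allows or rejects. The class and method names are hypothetical.

```python
from collections import defaultdict

class SimpleLimiter:
    """Fixed-window limiter illustrating identify -> count -> allow/reject."""

    def __init__(self, max_per_second: int):
        self.max_per_second = max_per_second
        self.counts = defaultdict(int)   # (client_id, window) -> request count

    def handle(self, client_id: str, now: float) -> bool:
        # In real use, pass time.time(); an explicit arg keeps this deterministic.
        window = int(now)                # current one-second window
        key = (client_id, window)
        if self.counts[key] >= self.max_per_second:
            return False                 # reject: counter exhausted
        self.counts[key] += 1
        return True                      # allow: counter incremented

limiter = SimpleLimiter(max_per_second=3)
results = [limiter.handle("alice", now=100.0) for _ in range(5)]
print(results)  # first 3 allowed, last 2 rejected in the same window
```

A real limiter would replace the in-process dictionary with shared storage (e.g. Redis, as shown later) so that all application servers see the same counters.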

Two Main Algorithms

Token Bucket vs Sliding Window

Token Bucket:

  • Tokens accumulate at a steady rate (refill)
  • Each request consumes one token
  • Burst allowed up to bucket capacity
  • Best for: APIs where legitimate traffic is bursty

Sliding Window Counter:

  • Counts requests in a sliding time window
  • Weights previous window to prevent boundary bursts
  • Smooth enforcement, no burst allowance
  • Best for: Strict rate enforcement, billing limits
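The weighting idea behind the sliding window counter can be sketched as follows. This is an illustrative single-process version (class name and parameters are assumptions): the previous window's count is scaled by how much of it still overlaps the sliding window, which is what prevents a burst at the window boundary.

```python
class SlidingWindowCounter:
    """Sliding-window counter: weights the previous window's count."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}   # window index -> request count

    def is_allowed(self, now: float) -> bool:
        current = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        prev_count = self.counts.get(current - 1, 0)
        curr_count = self.counts.get(current, 0)
        # Weighted estimate of requests over the last full window length
        estimated = prev_count * (1 - elapsed_fraction) + curr_count
        if estimated >= self.limit:
            return False
        self.counts[current] = curr_count + 1
        return True

limiter = SlidingWindowCounter(limit=2, window_seconds=60)
print(limiter.is_allowed(0.0))   # True
print(limiter.is_allowed(1.0))   # True
print(limiter.is_allowed(2.0))   # False: weighted estimate hit the limit
```

Unlike the token bucket, there is no burst allowance here: once the weighted estimate reaches the limit, requests are rejected until enough of the window slides past.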

Real Systems Using Rate Limiting

System | Algorithm Style | Notes | Use Case
GitHub API | Token bucket style | Tiered by authentication; check current docs for limits | Developer API access
Stripe | Sliding window | Different limits for live vs test mode | Payment processing
Twitter/X API | Tiered windows | Varies significantly by endpoint and tier | Social media API
AWS API Gateway | Token bucket | Fully configurable per stage | API management
Cloudflare | Leaky bucket | Rule-based configuration | Edge rate limiting

Note: Specific limits change frequently. Always verify current limits in official documentation.

Case Study: API Gateway Rate Limiting

Multi-Tier Rate Limiting

When to Use Rate Limiting

✓ Perfect Use Cases

Rate Limiting Use Cases

✕ When NOT to Use (or Use Carefully)

When Rate Limiting May Not Fit

Interview Application

Common Interview Question

Q: “You’re designing an API for a public service. How would you implement rate limiting? What algorithm would you choose?”

Strong Answer:

“I’d implement rate limiting at the API gateway level with a token bucket algorithm. Here’s my approach:

Why Token Bucket:

  1. Allows legitimate bursts: Users often make multiple quick requests (page load, app startup)
  2. Simple state: Just two values per client (tokens, last_refill_time)
  3. Configurable: Capacity controls burst size, refill rate controls steady-state

Implementation:

  • Store state in Redis for distributed rate limiting
  • Key: ratelimit:{user_id} with token count and timestamp
  • Use a server-side Lua script (or MULTI/EXEC) for an atomic check-and-decrement

Configuration:

  • Capacity: 100 tokens (max burst)
  • Refill: 10 tokens/second (100 req/sec steady-state)
  • Different limits per API tier

Response Headers:

  • X-RateLimit-Limit: Maximum requests allowed
  • X-RateLimit-Remaining: Requests left in window
  • X-RateLimit-Reset: Unix timestamp when limit resets
  • Retry-After: Seconds to wait (on 429)
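A small helper can translate a limiter's result into these headers. This is a hypothetical sketch (the function name and argument shapes are assumptions); the key detail is that `Retry-After` is relative (seconds to wait), while `X-RateLimit-Reset` is an absolute Unix timestamp.

```python
import math

def rate_limit_headers(limit: int, remaining: int, reset_at: int,
                       allowed: bool, now: float) -> dict:
    """Build rate-limit response headers from a limiter's result.

    reset_at is assumed to be a Unix timestamp; now is the current time
    (pass time.time() in real use).
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_at),
    }
    if not allowed:
        # Retry-After is relative seconds, unlike the absolute Reset timestamp
        headers["Retry-After"] = str(max(0, math.ceil(reset_at - now)))
    return headers

headers = rate_limit_headers(limit=100, remaining=0, reset_at=1700000010,
                             allowed=False, now=1700000000.0)
print(headers["Retry-After"])  # "10"
```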

Edge Cases:

  • Clock skew: Use Redis server time, not client time
  • Distributed: Single Redis cluster for consistency
  • Failover: Fail open (allow) if Redis unavailable—better to risk abuse than block all users”
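The fail-open edge case can be expressed as a thin wrapper around any limiter call. A minimal sketch, assuming the underlying check raises `ConnectionError` when the store is unreachable (the function names here are illustrative):

```python
def allow_request(check, key: str) -> bool:
    """Fail open: if the rate-limit store is down, allow the request."""
    try:
        return check(key)
    except ConnectionError:
        # Availability over strict enforcement: better to risk some abuse
        # than to reject every user while the store is unavailable.
        return True

def flaky_check(key: str) -> bool:
    raise ConnectionError("redis unavailable")

print(allow_request(flaky_check, "user_123"))  # True: failed open
```

The opposite policy (fail closed, i.e. reject on error) makes sense for security-sensitive endpoints such as login, where blocking is safer than allowing.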

Follow-up: How would you handle distributed rate limiting across multiple regions?

“For multi-region, I’d use local rate limiting with global synchronization:

  1. Each region has local Redis for low-latency checks
  2. Async sync between regions (eventual consistency)
  3. Accept that users might get slightly more than limit globally
  4. Alternative: Single global Redis with latency cost

The trade-off is accuracy vs latency. For most APIs, slightly exceeding limits across regions is acceptable.”
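One simple way to picture the accuracy-vs-latency trade-off: give each region a local share of the global limit and reconcile asynchronously. This is a deliberately naive sketch (the even split and class name are assumptions, and the background sync between regions is not shown):

```python
class RegionalLimiter:
    """Each region enforces a local share of the global limit."""

    def __init__(self, global_limit: int, num_regions: int):
        # A background sync process (not shown) would rebalance unused
        # capacity between regions, trading accuracy for latency.
        self.local_limit = global_limit // num_regions
        self.count = 0

    def is_allowed(self) -> bool:
        if self.count >= self.local_limit:
            return False
        self.count += 1
        return True

region = RegionalLimiter(global_limit=100, num_regions=4)
print(sum(region.is_allowed() for _ in range(30)))  # 25 allowed locally
```

With a static split, a client routed to a single region is capped below the global limit, while a client spraying requests across all regions can briefly exceed it; asynchronous reconciliation narrows both gaps without adding cross-region latency to every request.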

Code Example

Token Bucket Rate Limiter (Python + Redis)

import time
import redis

class TokenBucketRateLimiter:
    """
    Distributed token bucket rate limiter using Redis.

    Allows bursts up to capacity while enforcing average rate.
    """

    def __init__(self, redis_client: redis.Redis, capacity: int, refill_rate: float):
        """
        Args:
            redis_client: Redis connection
            capacity: Maximum tokens (burst size)
            refill_rate: Tokens added per second
        """
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate

    def is_allowed(self, key: str) -> tuple[bool, dict]:
        """
        Check if request is allowed and consume a token if so.

        Returns:
            (allowed: bool, info: dict with remaining, reset_at)
        """
        now = time.time()
        bucket_key = f"ratelimit:{key}"

        # Lua script for atomic check-and-update
        # This runs entirely on Redis server (no race conditions)
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        -- Get current state
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now

        -- Calculate tokens to add since last refill
        local elapsed = now - last_refill
        local tokens_to_add = elapsed * refill_rate
        tokens = math.min(capacity, tokens + tokens_to_add)

        -- Check if we can consume a token
        local allowed = 0
        if tokens >= 1 then
            tokens = tokens - 1
            allowed = 1
        end

        -- Update state
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)  -- Clean up after 1 hour idle

        return {allowed, tokens, now + (capacity - tokens) / refill_rate}
        """

        result = self.redis.eval(
            lua_script,
            1,  # number of keys
            bucket_key,
            self.capacity,
            self.refill_rate,
            now
        )

        allowed = result[0] == 1
        remaining = int(result[1])
        reset_at = int(result[2])

        return allowed, {
            "remaining": remaining,
            "limit": self.capacity,
            "reset_at": reset_at
        }


# Usage example
if __name__ == "__main__":
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # 10 requests/second with burst of 20
    limiter = TokenBucketRateLimiter(
        redis_client=redis_client,
        capacity=20,       # Allow burst of 20 requests
        refill_rate=10     # Refill 10 tokens/second
    )

    user_id = "user_123"

    # Simulate requests
    for i in range(25):
        allowed, info = limiter.is_allowed(user_id)

        if allowed:
            print(f"Request {i+1}: ALLOWED (remaining: {info['remaining']})")
        else:
            print(f"Request {i+1}: REJECTED (retry at: {info['reset_at']})")
            # In real code: return 429 with Retry-After header

Express.js Middleware Example

const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
});

// Create rate limiter middleware
const apiLimiter = rateLimit({
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),

  // 100 requests per 15 minutes
  windowMs: 15 * 60 * 1000,
  max: 100,

  // Return rate limit info in headers
  standardHeaders: true,
  legacyHeaders: false,

  // Custom key generator (by user ID if authenticated, else IP)
  keyGenerator: (req) => {
    return req.user?.id || req.ip;
  },

  // Custom response when rate limited
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      message: 'Please try again later',
      // resetTime is a Date; report seconds to wait, not an absolute timestamp
      retryAfter: Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000),
    });
  },
});

// Apply to all API routes
app.use('/api/', apiLimiter);

// Stricter limit for auth endpoints
const authLimiter = rateLimit({
  windowMs: 60 * 1000,  // 1 minute
  max: 5,               // 5 attempts per minute
  message: 'Too many login attempts, please try again later',
});

app.use('/api/auth/login', authLimiter);


Quick Self-Check

  • Can you explain rate limiting in 60 seconds?
  • Do you understand the difference between token bucket and sliding window?
  • Do you know what HTTP 429 means and which headers to return?
  • Can you implement distributed rate limiting with Redis?
  • Do you understand why the sliding window prevents boundary bursts?
  • Do you know when to use rate limiting versus a circuit breaker?
Interview Notes

  • Interview relevance: 80% of API design interviews
  • Production impact: used by virtually all production APIs
  • Performance: resource protection
  • Scalability: fair usage enforcement
Fair usage enforcement