Controlling the rate of requests to a service to prevent overload, ensure fair usage, and protect against abuse
TL;DR
Rate limiting controls how many requests a client can make to a service within a time window. It protects backend resources from overload, ensures fair usage across clients, and defends against abuse like brute-force attacks. The two most common algorithms are token bucket (allows bursts) and sliding window (smooth enforcement).
Visual Overview
Core Explanation
What is Rate Limiting?
Real-World Analogy: Think of rate limiting like a nightclub bouncer. The club has a capacity of 100 people. The bouncer lets people in one at a time, but if the club is full, new arrivals must wait outside. Some VIPs (premium API users) might get a higher limit or skip the line.
Rate limiting enforces boundaries on how many requests a client can make:
- Per user: Each authenticated user gets N requests/minute
- Per IP: Unauthenticated requests limited by IP address
- Per API key: Different tiers get different limits
- Global: Protect the entire service from overload
How It Works
Every rate limiter needs to answer two questions:
- Identification: Who is making this request? (user ID, IP, API key)
- Counting: How many requests have they made recently?
The implementation varies by algorithm, but the flow is consistent:
Request arrives → Identify client → Check counter → Allow or Reject
Two Main Algorithms
Token Bucket:
- Tokens accumulate at a steady rate (refill)
- Each request consumes one token
- Burst allowed up to bucket capacity
- Best for: APIs where legitimate traffic is bursty
Sliding Window Counter:
- Counts requests in a sliding time window
- Weights previous window to prevent boundary bursts
- Smooth enforcement, no burst allowance
- Best for: Strict rate enforcement, billing limits
Real Systems Using Rate Limiting
| System | Algorithm Style | Notes | Use Case |
|---|---|---|---|
| GitHub API | Token bucket style | Tiered by authentication; check current docs for limits | Developer API access |
| Stripe | Sliding window | Different limits for live vs test mode | Payment processing |
| Twitter/X API | Tiered windows | Varies significantly by endpoint and tier | Social media API |
| AWS API Gateway | Token bucket | Fully configurable per stage | API management |
| Cloudflare | Leaky bucket | Rule-based configuration | Edge rate limiting |
Note: Specific limits change frequently. Always verify current limits in official documentation.
Case Study: API Gateway Rate Limiting
When to Use Rate Limiting
✓ Perfect Use Cases
✕ When NOT to Use (or Use Carefully)
Interview Application
Common Interview Question
Q: “You’re designing an API for a public service. How would you implement rate limiting? What algorithm would you choose?”
Strong Answer:
“I’d implement rate limiting at the API gateway level with a token bucket algorithm. Here’s my approach:
Why Token Bucket:
- Allows legitimate bursts: Users often make multiple quick requests (page load, app startup)
- Simple state: Just two values per client (tokens, last_refill_time)
- Configurable: Capacity controls burst size, refill rate controls steady-state
Implementation:
- Store state in Redis for distributed rate limiting
- Key: `ratelimit:{user_id}` with token count and timestamp
- Use an atomic Redis Lua script (or MULTI/EXEC) for check-and-decrement
Configuration:
- Capacity: 100 tokens (max burst)
- Refill: 10 tokens/second (100 req/sec steady-state)
- Different limits per API tier
Response Headers:
- `X-RateLimit-Limit`: Maximum requests allowed
- `X-RateLimit-Remaining`: Requests left in window
- `X-RateLimit-Reset`: Unix timestamp when limit resets
- `Retry-After`: Seconds to wait (on 429)
Edge Cases:
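A small helper along these lines can assemble those headers from a limiter's result. The shape of the `info` dict here mirrors the token bucket example later in this section; the helper itself is an illustrative sketch:

```python
import time

def rate_limit_headers(allowed: bool, info: dict) -> dict[str, str]:
    """Build standard rate-limit response headers from a limiter result."""
    headers = {
        "X-RateLimit-Limit": str(info["limit"]),
        "X-RateLimit-Remaining": str(info["remaining"]),
        "X-RateLimit-Reset": str(info["reset_at"]),
    }
    if not allowed:
        # On a 429, tell the client how long to back off
        headers["Retry-After"] = str(max(0, info["reset_at"] - int(time.time())))
    return headers
```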
- Clock skew: Use Redis server time, not client time
- Distributed: Single Redis cluster for consistency
- Failover: Fail open (allow) if Redis unavailable—better to risk abuse than block all users”
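That fail-open behavior can be sketched as a thin wrapper. Here `limiter.is_allowed` is assumed to return `(bool, dict)` as in the code example below, and the broad `except` would be narrowed to the Redis client's connection errors in practice:

```python
def check_with_fail_open(limiter, key: str):
    """Fail open: if the rate-limit store is unreachable, allow the request."""
    try:
        return limiter.is_allowed(key)
    except Exception:  # in production, catch only the Redis client's connection errors
        # Store is down: allow rather than reject every user.
        # Emit a metric here so the outage is visible.
        return True, {"remaining": None, "limit": None, "reset_at": None}
```

The opposite policy, failing closed, is appropriate for abuse-sensitive endpoints such as login, where blocking traffic is safer than admitting it unchecked.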
Follow-up: How would you handle distributed rate limiting across multiple regions?
“For multi-region, I’d use local rate limiting with global synchronization:
- Each region has local Redis for low-latency checks
- Async sync between regions (eventual consistency)
- Accept that users might get slightly more than limit globally
- Alternative: Single global Redis with latency cost
The trade-off is accuracy vs latency. For most APIs, slightly exceeding limits across regions is acceptable.”
Code Example
Token Bucket Rate Limiter (Python + Redis)
```python
import time

import redis


class TokenBucketRateLimiter:
    """
    Distributed token bucket rate limiter using Redis.

    Allows bursts up to capacity while enforcing average rate.
    """

    def __init__(self, redis_client: redis.Redis, capacity: int, refill_rate: float):
        """
        Args:
            redis_client: Redis connection
            capacity: Maximum tokens (burst size)
            refill_rate: Tokens added per second
        """
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate

    def is_allowed(self, key: str) -> tuple[bool, dict]:
        """
        Check if request is allowed and consume a token if so.

        Returns:
            (allowed: bool, info: dict with remaining, reset_at)
        """
        # NOTE: production code should derive `now` from the Redis TIME
        # command to avoid clock skew across application servers.
        now = time.time()
        bucket_key = f"ratelimit:{key}"

        # Lua script for atomic check-and-update.
        # This runs entirely on the Redis server (no race conditions).
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        -- Get current state
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now

        -- Calculate tokens to add since last refill
        local elapsed = now - last_refill
        local tokens_to_add = elapsed * refill_rate
        tokens = math.min(capacity, tokens + tokens_to_add)

        -- Check if we can consume a token
        local allowed = 0
        if tokens >= 1 then
            tokens = tokens - 1
            allowed = 1
        end

        -- Update state
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)  -- Clean up after 1 hour idle

        return {allowed, tokens, now + (capacity - tokens) / refill_rate}
        """

        result = self.redis.eval(
            lua_script,
            1,  # number of keys
            bucket_key,
            self.capacity,
            self.refill_rate,
            now,
        )

        allowed = result[0] == 1
        remaining = int(result[1])
        reset_at = int(result[2])

        return allowed, {
            "remaining": remaining,
            "limit": self.capacity,
            "reset_at": reset_at,
        }


# Usage example
if __name__ == "__main__":
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # 10 requests/second with burst of 20
    limiter = TokenBucketRateLimiter(
        redis_client=redis_client,
        capacity=20,      # Allow burst of 20 requests
        refill_rate=10,   # Refill 10 tokens/second
    )

    user_id = "user_123"

    # Simulate requests
    for i in range(25):
        allowed, info = limiter.is_allowed(user_id)
        if allowed:
            print(f"Request {i+1}: ALLOWED (remaining: {info['remaining']})")
        else:
            print(f"Request {i+1}: REJECTED (retry at: {info['reset_at']})")
            # In real code: return 429 with Retry-After header
```
Express.js Middleware Example
```javascript
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
});

// Create rate limiter middleware
const apiLimiter = rateLimit({
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),
  // 100 requests per 15 minutes
  windowMs: 15 * 60 * 1000,
  max: 100,
  // Return rate limit info in headers
  standardHeaders: true,
  legacyHeaders: false,
  // Custom key generator (by user ID if authenticated, else IP)
  keyGenerator: (req) => {
    return req.user?.id || req.ip;
  },
  // Custom response when rate limited
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      message: 'Please try again later',
      retryAfter: Math.ceil(req.rateLimit.resetTime / 1000),
    });
  },
});

// Apply to all API routes
app.use('/api/', apiLimiter);

// Stricter limit for auth endpoints
const authLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 5, // 5 attempts per minute
  message: 'Too many login attempts, please try again later',
});

app.use('/api/auth/login', authLimiter);
```
Related Content
See It In Action:
- Rate Limiting Explainer - Visual walkthrough of token bucket vs sliding window
Related Concepts:
- Token Bucket - Burst-tolerant algorithm
- Sliding Window - Smooth enforcement algorithm
- Load Balancing - Distributing traffic across servers
- Circuit Breaker - Failing fast when downstream is unhealthy
Quick Self-Check
- Can explain rate limiting in 60 seconds?
- Understand difference between token bucket and sliding window?
- Know what HTTP 429 means and what headers to return?
- Can implement distributed rate limiting with Redis?
- Understand why sliding window prevents boundary burst?
- Know when to use rate limiting vs circuit breaker?
Interview Notes
- Comes up in roughly 80% of API design interviews
- Powers systems at: essentially all production APIs
- Key talking points: resource protection, fair usage enforcement