Load Balancing

Distributing incoming requests across multiple servers to optimize resource utilization, minimize latency, and prevent any single server from becoming a bottleneck

TL;DR

Load balancing distributes network traffic or computational workload across multiple servers, using algorithms such as round-robin, least-connections, or consistent hashing, so that no single server is overwhelmed. It underpins scalability, high availability, and efficient resource utilization, and is implemented by tools such as AWS ELB, nginx, and HAProxy.

Visual Overview

Load Balancing Overview
WITHOUT LOAD BALANCING (Single Server)

  All traffic → Single Server

  100 req/s → [ Server: overloaded ]

  Problems:
  - Single point of failure
  - Limited capacity
  - High latency under load
  - No redundancy


WITH LOAD BALANCING (Distributed)

  100 req/s → [ Load Balancer: nginx / ELB / HAProxy ]
                   ↓             ↓             ↓
               Server 1      Server 2      Server 3
               ~33 req/s     ~33 req/s     ~33 req/s

  Benefits:
  - High availability (failover)
  - Horizontal scalability (add servers)
  - Better resource utilization
  - Health checks & auto-routing


LOAD BALANCING ALGORITHMS COMPARISON

 Round Robin (sequential distribution):
 Request 1 → Server 1
 Request 2 → Server 2
 Request 3 → Server 3
 Request 4 → Server 1 (cycle repeats)

 Least Connections (dynamic balancing):
 Server 1: 5 active connections
 Server 2: 3 active connections ← chosen
 Server 3: 8 active connections
 → Route to the server with the fewest connections

 Hash-Based Routing (sticky routing):
 hash(user_id) % num_servers
 User 123 → Server 2 (always the same server)
 User 456 → Server 1 (always the same server)
 → The same client routes to the same server while the server count is unchanged
 → Consistent hashing replaces the modulo with a hash ring, so adding or
   removing a server remaps only a fraction of clients


LAYER 4 VS LAYER 7 LOAD BALANCING

 Layer 4 (Transport Layer - TCP/UDP):

   Client 1.2.3.4:5678
        ↓
   Load balancer sees: IP + Port
   Routes based on: TCP connection
   Cannot see: HTTP headers, URLs, cookies
        ↓
   Backend server receives original connection

 + Faster (no HTTP parsing)
 + Lower latency (~1-2ms)
 - Limited routing logic

 Layer 7 (Application Layer - HTTP):

   Client request: GET /api/users, Cookie: xyz
        ↓
   Load balancer sees: Full HTTP request
   Routes based on: URL, headers, cookies
        ↓
   /api/users → Backend Pool A
   /static/*  → Backend Pool B (CDN)

 + Advanced routing (path, host, cookie)
 + SSL termination
 - Slower (HTTP parsing, ~5-10ms)


Core Explanation

What is Load Balancing?

Load balancing is the process of distributing incoming requests across multiple backend servers to:

  1. Optimize resource utilization: No server is overloaded while others are idle
  2. Maximize throughput: Handle more requests by adding servers
  3. Minimize latency: Route to least-loaded or nearest server
  4. Ensure high availability: Route around failed servers

Load Balancing Algorithms

1. Round Robin

Round Robin Algorithm
Simple sequential distribution:

Incoming request → Backend server
Request 1 → Server 1
Request 2 → Server 2
Request 3 → Server 3
Request 4 → Server 1 (cycle repeats)

Pros:
 Simple implementation
 Even distribution (if all requests equal)
 Minimal state (only a rotating index, no per-server metrics)

Cons:
 Doesn't account for server capacity
 Doesn't account for request complexity
 Long-running requests can overload one server

Use case: Stateless microservices with uniform requests
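
The rotation above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular load balancer implements it; the server names and the `round_robin` helper are placeholders invented for the example.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Return an endless iterator that walks the server list in order."""
    return cycle(servers)


picker = round_robin(["server1", "server2", "server3"])

# Four requests: the fourth wraps around to the first server.
assignments = [next(picker) for _ in range(4)]
print(assignments)
```

The only state is the iterator's position, which is why round robin is cheap but blind to how busy each server actually is.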

2. Weighted Round Robin

Weighted Round Robin Algorithm
Distribute based on server capacity:

Backend servers with weights:
Server 1 (weight=5): More powerful
Server 2 (weight=3): Medium capacity
Server 3 (weight=2): Less powerful

Distribution pattern (per cycle of 10 requests):
5 requests → Server 1
3 requests → Server 2
2 requests → Server 3
(repeat)

Pros:
 Accounts for heterogeneous server capacity
 Efficient resource utilization

Cons:
 Still doesn't account for dynamic load
 Requires manual weight configuration

Use case: Mixed hardware (different CPU/RAM capacities)
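
A sketch of one way to honor the 5/3/2 weights, assuming a "smooth" variant that interleaves picks instead of sending five requests in a row to Server 1. The function name and server names are hypothetical, and real products may schedule differently.

```python
from typing import Dict, List


def smooth_weighted_round_robin(weights: Dict[str, int], n: int) -> List[str]:
    """Pick n servers so that each appears in proportion to its weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks: List[str] = []
    for _ in range(n):
        # Every server earns credit equal to its weight on each round.
        for name, weight in weights.items():
            current[name] += weight
        # The server with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


weights = {"server1": 5, "server2": 3, "server3": 2}
picks = smooth_weighted_round_robin(weights, 10)
print(picks)
print({name: picks.count(name) for name in weights})
```

Over any full cycle of 10 requests the counts match the weights, but the heavier server is not hit in one burst.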

3. Least Connections

Least Connections Algorithm
Route to server with fewest active connections:

Real-time server state:
Server 1: 25 active connections
Server 2: 15 active connections ← chosen
Server 3: 30 active connections

New request → Server 2 (fewest connections)

Pros:
 Dynamic load balancing
 Accounts for long-running connections
 Better for variable request durations

Cons:
 Requires tracking connection state
 More complex implementation

Use case: HTTP/1.1 keep-alive, websockets, long-polling
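
A minimal sketch of the selection step, using the connection counts from the example above. The `least_connections` helper is hypothetical; a real balancer would also decrement the count when a connection closes.

```python
from typing import Dict


def least_connections(active: Dict[str, int]) -> str:
    """Return the server that currently has the fewest active connections."""
    return min(active, key=active.get)


active = {"server1": 25, "server2": 15, "server3": 30}

chosen = least_connections(active)
active[chosen] += 1  # the new request now counts against the chosen server
print(chosen, active[chosen])
```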

4. Weighted Least Connections

Weighted Least Connections Algorithm
Combines least connections with server weights:

Formula: score = active connections / weight (lowest score wins)

Server 1: 20 connections, weight=5 → score = 4.0
Server 2: 12 connections, weight=3 → score = 4.0
Server 3: 10 connections, weight=2 → score = 5.0 (highest, so not chosen)

Route to Server 1 or Server 2 (tied for the lowest score)

Pros:
 Best of both worlds (capacity + dynamic load)

Use case: Production systems with mixed hardware
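
The score calculation can be checked with a short sketch. The helper name is hypothetical, and breaking the tie in favor of the first server listed is an assumption made for this example; implementations differ on tie-breaking.

```python
from typing import Dict, Tuple


def weighted_least_connections(servers: Dict[str, Tuple[int, int]]) -> str:
    """servers maps name -> (active_connections, weight); lowest ratio wins."""
    return min(servers, key=lambda name: servers[name][0] / servers[name][1])


servers = {"server1": (20, 5), "server2": (12, 3), "server3": (10, 2)}

scores = {name: conns / weight for name, (conns, weight) in servers.items()}
chosen = weighted_least_connections(servers)
print(scores)
print(chosen)
```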

5. Least Response Time

Least Response Time Algorithm
Route to server with fastest response time:

Recent response times (moving average):
Server 1: 50ms average
Server 2: 30ms average ← chosen
Server 3: 100ms average

Pros:
 Optimizes user experience
 Automatically adapts to server performance
 Accounts for network latency

Cons:
 Requires continuous response-time measurement
 Can amplify cascading failures (a server that fails fast looks "fastest" and attracts more traffic)

Use case: Geo-distributed deployments
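
A sketch of the moving-average bookkeeping, assuming an exponentially weighted moving average with a smoothing factor of 0.2. Both the averaging scheme and the `ResponseTimeTracker` class are assumptions for illustration; the text above only says "moving average".

```python
from typing import Dict


class ResponseTimeTracker:
    """Tracks a smoothed response time per server and picks the fastest."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha  # weight given to the newest sample
        self.avg_ms: Dict[str, float] = {}

    def record(self, server: str, elapsed_ms: float) -> None:
        previous = self.avg_ms.get(server)
        if previous is None:
            self.avg_ms[server] = elapsed_ms
        else:
            self.avg_ms[server] = (
                self.alpha * elapsed_ms + (1 - self.alpha) * previous
            )

    def pick(self) -> str:
        return min(self.avg_ms, key=self.avg_ms.get)


tracker = ResponseTimeTracker()
tracker.record("server1", 50)
tracker.record("server2", 30)
tracker.record("server3", 100)
first_pick = tracker.pick()

# One slow response pushes server2's average up and shifts traffic away.
tracker.record("server2", 200)
second_pick = tracker.pick()
print(first_pick, second_pick)
```

Smoothing keeps a single outlier from flipping the choice too aggressively; a larger `alpha` reacts faster but is noisier.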

6. IP Hash (Modulo and Consistent Hashing)

IP Hash
Hash the client IP to deterministically select a server:

hash(client_ip) % num_servers

Client 1.2.3.4 → hash % 3 = 1 → Server 1 (always)
Client 5.6.7.8 → hash % 3 = 2 → Server 2 (always)
Client 9.10.11.12 → hash % 3 = 0 → Server 3 (always)

Pros:
 Session persistence (same client → same server)
 Useful for caching (server caches client data)
 No shared session storage needed

Cons:
 Uneven distribution if client IPs are clustered (for example, many users behind one NAT)
 With plain modulo hashing, adding or removing a server remaps most clients; consistent hashing (a hash ring) limits the remapping to roughly 1/N of clients

Use case: Stateful applications with server-side sessions
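
The formula hash(client_ip) % num_servers is plain modulo hashing. The sketch below compares it with a hash ring, which is what "consistent hashing" usually refers to. The `HashRing` class, the 100 virtual nodes per server, the use of MD5 as a stable hash, and the synthetic client IPs are all assumptions for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


def stable_hash(value: str) -> int:
    """Hash that is identical across runs (unlike Python's built-in hash)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._points: List[int] = []      # sorted positions on the ring
        self._owner: Dict[int, str] = {}  # position -> server
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        for i in range(self.vnodes):
            point = stable_hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        # First ring position clockwise from the key, wrapping at the end.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


servers = ["server1", "server2", "server3"]
keys = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]

# Modulo hashing: grow the pool from 3 to 4 servers.
grown = servers + ["server4"]
mod_before = {k: servers[stable_hash(k) % 3] for k in keys}
mod_after = {k: grown[stable_hash(k) % 4] for k in keys}
mod_moved = sum(mod_before[k] != mod_after[k] for k in keys)

# Hash ring: same change.
ring = HashRing(servers)
ring_before = {k: ring.lookup(k) for k in keys}
ring.add("server4")
ring_after = {k: ring.lookup(k) for k in keys}
ring_moved = sum(ring_before[k] != ring_after[k] for k in keys)

print(f"modulo moved {mod_moved} of {len(keys)} clients")
print(f"ring moved   {ring_moved} of {len(keys)} clients")
```

On the ring, every client that moves lands on the new server; nobody is shuffled between the existing ones, which is what keeps caches and sessions mostly intact.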

7. Least Bandwidth

Least Bandwidth Algorithm
Route to server currently serving least bandwidth:

Server 1: 500 Mbps
Server 2: 300 Mbps ← chosen
Server 3: 700 Mbps

Use case: Video streaming, large file downloads

Layer 4 vs Layer 7 Load Balancing

Layer 4 (Transport Layer)

Layer 4 Load Balancing
OSI Layer: Transport (TCP/UDP)

  What it sees:                         
  - Source IP + Port                    
  - Destination IP + Port               
  - TCP/UDP protocol                    
                                        
  What it CAN'T see:                    
  - HTTP headers                        
  - URLs, query parameters              
  - Cookies                             
  - Request body                        
                                        
  Routing decisions based on:           
  - IP address                          
  - Port number                         
  - Protocol (TCP vs UDP)               


Example: AWS Network Load Balancer (NLB)

Pros:
 Very fast (< 1ms latency)
 High throughput (millions of requests/sec)
 Low CPU usage
 Supports any TCP/UDP protocol
 Preserves client IP (pass-through)

Cons:
 No content-based routing
 No SSL/TLS termination in pure pass-through mode (some products, such as AWS NLB with TLS listeners, add it)
 Limited health checks

Use case: TCP-based services, ultra-low latency requirements

Layer 7 (Application Layer)

Layer 7 Load Balancing
OSI Layer: Application (HTTP/HTTPS)

  What it sees:                         
  - Full HTTP request                   
  - Headers (User-Agent, Host, etc.)    
  - URL path and query parameters       
  - Cookies                             
  - Request body                        
                                        
  Routing decisions based on:           
  - URL path: /api/* → API servers
  - Host header: api.example.com        
  - Cookie: user_id=123                 
  - HTTP method: POST vs GET            
  - Custom headers                      


Example: AWS Application Load Balancer (ALB), nginx

Pros:
 Content-based routing (path, host, headers)
 SSL/TLS termination (decrypt at LB)
 Advanced health checks (HTTP status codes)
 Request/response manipulation
 Web Application Firewall (WAF) integration

Cons:
 Slower (5-10ms latency due to HTTP parsing)
 Higher CPU usage
 More complex configuration

Use case: HTTP microservices, API gateways, web applications

Health Checks & Failover

Health Check Mechanisms:

1. Active Health Checks:
 
  Load Balancer → Backend Server
  GET /health every 10 seconds

  Server responds: 200 OK
  or
  Server timeout/error → Mark unhealthy
 

Configuration:

- Interval: 10s (how often to check)
- Timeout: 5s (max wait for response)
- Unhealthy threshold: 3 (failures before marking down)
- Healthy threshold: 2 (successes before marking up)

2. Passive Health Checks:
 
  Monitor real traffic:
  Server returns 5xx errors → Unhealthy
  Server timeout → Unhealthy
  Server 2xx responses → Healthy
 

Failover Flow:

 1. Server 2 fails health check         
 2. Load balancer marks Server 2 DOWN   
 3. New requests → Server 1 & 3 only
 4. Server 2 recovers                   
 5. Passes health checks (2x)           
 6. Load balancer marks Server 2 UP     
 7. Resume sending traffic to Server 2  
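
The thresholds in the configuration above (3 failures to mark a server down, 2 successes to bring it back) can be modeled as a small state machine. This is a sketch under those assumptions; `HealthState` is a hypothetical class, and it counts consecutive results only, which is one common interpretation of such thresholds.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    unhealthy_threshold: int = 3  # consecutive failures before marking down
    healthy_threshold: int = 2    # consecutive successes before marking up
    healthy: bool = True
    _streak: int = 0              # consecutive results contradicting the state

    def record(self, check_passed: bool) -> bool:
        """Feed one health-check result; return whether the server is in rotation."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state, reset
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


server2 = HealthState()
# Three failed checks, then two successful ones (the failover flow above).
states = [server2.record(ok) for ok in [False, False, False, True, True]]
print(states)

# A single success in the middle resets the failure streak.
flaky = HealthState()
flaky_states = [flaky.record(ok) for ok in [False, False, True, False]]
print(flaky_states)
```

Requiring a streak in both directions avoids flapping: one dropped probe does not eject a server, and one lucky probe does not readmit it.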


Session Persistence (Sticky Sessions)

Problem: a user's session is stored on one specific server

Without sticky sessions:

 Request 1: Login → Server 1 (session created)
 Request 2: Get data → Server 2
 (Server 2 doesn't have the session)


Solution 1: Cookie-based sticky sessions:

 Request 1: Login → Server 1
 Response: Set-Cookie: server=1
 Request 2: Cookie: server=1 → Server 1
 (LB reads the cookie and routes to Server 1)


Solution 2: IP hash sticky sessions:

 hash(client_ip) → always the same server
 Client 1.2.3.4 → Server 1 (always)


Solution 3: Shared session store, sometimes called session replication (better):

 Store sessions in Redis/Memcached
 Any server can access session          
 No sticky sessions needed             
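
A sketch of why an external session store removes the need for stickiness. The in-memory `SessionStore` stands in for Redis or Memcached, and both classes are hypothetical names used only for this example.

```python
import uuid
from typing import Dict, Optional


class SessionStore:
    """In-memory stand-in for a shared store such as Redis or Memcached."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, str]] = {}

    def save(self, session_id: str, data: Dict[str, str]) -> None:
        self._sessions[session_id] = data

    def load(self, session_id: str) -> Optional[Dict[str, str]]:
        return self._sessions.get(session_id)


class AppServer:
    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        session_id = uuid.uuid4().hex
        self.store.save(session_id, {"user": user})
        return session_id

    def current_user(self, session_id: str) -> Optional[str]:
        session = self.store.load(session_id)
        return session["user"] if session else None


shared = SessionStore()
server1 = AppServer("server1", shared)
server2 = AppServer("server2", shared)
isolated = AppServer("server3", SessionStore())  # keeps sessions to itself

session_id = server1.login("alice")
print(server2.current_user(session_id))   # served by a different server
print(isolated.current_user(session_id))  # the "without sticky sessions" failure
```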


Real Systems Using Load Balancing

| System | Type | Algorithms | Key Features | Use Case |
|---|---|---|---|---|
| AWS ELB (ALB) | Layer 7 | Round robin, least outstanding requests | Content-based routing, SSL termination | HTTP microservices |
| AWS NLB | Layer 4 | Flow hash | Ultra-low latency, static IP | TCP services, high throughput |
| nginx | Layer 7 | Round robin, least_conn, ip_hash | Open source, highly configurable | Web servers, API gateway |
| HAProxy | Layer 4/7 | Weighted RR, least_conn, consistent hash | High performance, advanced ACLs | Enterprise load balancing |
| Envoy | Layer 7 | Weighted RR, least_request, ring_hash | Service mesh, observability | Kubernetes, microservices |
| Cloudflare | Layer 7 | Geo-routing, weighted pools | DDoS protection, CDN | Global load balancing |

Case Study: AWS Application Load Balancer

AWS ALB Architecture

 Internet
    ↓
 ALB (multi-AZ for high availability)
  - Availability Zone 1
  - Availability Zone 2
    ↓
 Target Groups:
  - API Servers (port 3000):   /api/*   → API target group
  - Web Servers (port 80):     /*       → Web target group
  - Admin Servers (port 8080): /admin/* → Admin target group


Routing Rules:

1. Path-based: /api/* → API servers
2. Host-based: admin.example.com → Admin servers
3. Header-based: X-API-Version: v2 → V2 servers

Health Checks:

- Protocol: HTTP
- Path: /health
- Interval: 30s
- Timeout: 5s
- Healthy threshold: 5
- Unhealthy threshold: 2

Features:
 SSL/TLS termination (offload from servers)
 WebSocket support
 HTTP/2 support
 Integration with Auto Scaling
 CloudWatch metrics
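
The routing rules in this case study can be modeled as an ordered list where the first matching rule wins. The sketch below is not the AWS API: the `Rule` class, the `route` function, the target group names, and the rule order (header, then host, then path) are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Dict, List, Optional


@dataclass
class Rule:
    target_group: str
    path: Optional[str] = None  # glob pattern such as "/api/*"
    host: Optional[str] = None  # exact host name, lower case
    headers: Dict[str, str] = field(default_factory=dict)

    def matches(self, host: str, path: str, headers: Dict[str, str]) -> bool:
        if self.path is not None and not fnmatchcase(path, self.path):
            return False
        if self.host is not None and host.lower() != self.host:
            return False
        # Exact-key lookup for brevity; real HTTP header names are case-insensitive.
        return all(headers.get(k) == v for k, v in self.headers.items())


def route(rules: List[Rule], host: str, path: str,
          headers: Optional[Dict[str, str]] = None, default: str = "web") -> str:
    """Return the target group of the first matching rule, else the default."""
    headers = headers or {}
    for rule in rules:
        if rule.matches(host, path, headers):
            return rule.target_group
    return default


RULES = [
    Rule("v2", headers={"X-API-Version": "v2"}),
    Rule("admin", host="admin.example.com"),
    Rule("api", path="/api/*"),
]

print(route(RULES, "example.com", "/api/users"))
print(route(RULES, "admin.example.com", "/"))
print(route(RULES, "example.com", "/api/users", {"X-API-Version": "v2"}))
print(route(RULES, "example.com", "/index.html"))
```

Rule order matters: with the path rule listed first, the versioned request would go to the plain API group instead of the V2 servers.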

Case Study: nginx Load Balancer

# nginx.conf - Load Balancer Configuration

# Define upstream backend servers
upstream backend {
    # Load balancing algorithm
    least_conn;  # Use least connections

    # Backend servers with weights
    server backend1.example.com:8080 weight=5;
    server backend2.example.com:8080 weight=3;
    server backend3.example.com:8080 weight=2;

    # Server with max connections limit
    server backend4.example.com:8080 max_conns=100;

    # Backup server (used only if others fail)
    server backup.example.com:8080 backup;

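    # Connection pooling (not a health check): keep up to 32 idle
    # upstream connections open per worker process
    keepalive 32;
}

server {
    # ... omitted: keep concept snippets short
    server_name example.com;

    ssl_certificate /etc/nginx/ssl/cert.pem;
    ssl_certificate_key /etc/nginx/ssl/key.pem;

    # SSL termination (decrypt here, forward HTTP to backend)
    location / {
        proxy_pass http://backend;

        # Needed for the upstream keepalive pool to be used
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}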

When to Use Load Balancing

✓ Perfect Use Cases

| Use Case | Scenario | Solution | Detail | Benefit |
|---|---|---|---|---|
| High Traffic Web Applications | E-commerce site with millions of users | Layer 7 ALB with 50 backend servers | Requirements: handle 100,000 requests/second | Horizontal scalability, failover, health checks |
| Microservices Architecture | 100+ microservices communicating | Service mesh (Envoy/Linkerd) with load balancing per service | | Automatic service discovery, circuit breaking, observability |
| Global Applications (Geo-LB) | Users worldwide accessing application | DNS-based load balancing (Route53, Cloudflare) | Route: US users → US region, EU users → EU region | Low latency, disaster recovery |
| Database Read Replicas | Read-heavy application with MySQL replicas | Load balancer distributing reads across 5 replicas | Algorithm: least connections (account for query duration) | Scale read throughput |

✕ When NOT to Use (or Use Carefully)

| Anti-Pattern | Problem | Alternative / Solution | Example |
|---|---|---|---|
| Single Server Deployment | Adds complexity and latency for no benefit | Direct connection to server | Development environment, small apps |
| Stateful TCP Connections (Without Sticky Sessions) | Connection state lost on failover | Use connection pooling or client-side retry logic | Database connections, SSH sessions |
| Very Low Latency Requirements (< 1ms) | Load balancer adds latency (1-10ms) | Client-side load balancing (gRPC, Thrift) | High-frequency trading, real-time gaming |

Interview Application

Common Interview Question

Q: “Design a load balancing solution for a REST API with 10 backend servers. How would you ensure high availability and optimal performance?”

Strong Answer:

“I’d design a multi-layered load balancing solution:

Architecture:

I would put DNS-based routing in front to send users to the nearest datacenter, then use a Layer 7 load balancer such as ALB or nginx for HTTP routing. If the workload has raw TCP services or very low latency requirements, I would add a Layer 4 balancer for that path instead of forcing everything through HTTP-aware routing.

Algorithm Selection:

I would use least-connections for API endpoints because request duration varies and long-running calls should not pile up on one server. Static assets can use round robin because requests are uniform and fast. If the application still stores session state server-side, I would use cookie-based or IP-hash affinity, while calling out that the better fix is to move session state out of the backend instance.

High Availability:

Availability comes from active and passive health signals. The balancer should probe GET /health about every 10 seconds, watch real 5xx traffic, and mark a server unhealthy after a small failure threshold such as three misses. Once a server is unhealthy, it leaves the pool, traffic moves to healthy instances, and retries are bounded by circuit-breaker behavior. I would spread both balancers and servers across three availability zones so one zone loss does not remove the service.

Performance Optimizations:

For performance, I would terminate TLS at the balancer to remove CPU work from backends, keep pooled connections open to reduce handshake overhead, and cache static responses where that does not weaken correctness. Those optimizations help only if the health and retry behavior stays bounded; otherwise they can hide overload until it becomes an incident.

Monitoring:

The dashboard needs request rate, error rate, p50 and p99 latency, health-check state, and traffic distribution per server. Alerts should fire on rising 5xxs, high p99, repeated health-check failures, and skewed distribution because those map directly to user pain and capacity imbalance.

Scaling:

Scaling should be boring: add servers when sustained CPU or queue depth crosses the target, auto-register new instances with the balancer, and drain connections before removing a server. The graceful shutdown path matters as much as scale-out because bad draining turns deploys into user-visible errors.

Trade-offs:

The trade-off is latency versus routing intelligence. Layer 7 may add 5-10ms compared with roughly 1ms for Layer 4, but it gives path routing, host routing, cookie behavior, and TLS termination. For ultra-low latency, I would use Layer 4 or client-side load balancing.”

Code Example

Simple Round Robin Load Balancer

import requests
import time
from typing import List
from dataclasses import dataclass
import threading

@dataclass
class BackendServer:
    """Represents a backend server"""
    host: str
    port: int
    weight: int = 1
    healthy: bool = True
    active_connections: int = 0

class LoadBalancer:
    """
    Simple load balancer implementing multiple algorithms
    """
    def __init__(self, servers: List[BackendServer]):
        self.servers = servers
        self.current_index = 0  # For round robin
    # ... omitted: keep concept snippets short
            result = lb.forward_request('/api/users', algorithm='ip_hash',
                                       client_ip=ip)
            print(f"Client {ip} → {result['server']}")
        except Exception as e:
            print(f"Request from {ip} failed: {e}")

    # Get status
    print("\n=== Load Balancer Status ===")
    import json
    print(json.dumps(lb.get_status(), indent=2))

Layer 7 HTTP Load Balancer with Path Routing

from flask import Flask, request, Response
import requests

app = Flask(__name__)

# Define backend pools
BACKEND_POOLS = {
    'api': [
        'http://api1.example.com:3000',
        'http://api2.example.com:3000',
        'http://api3.example.com:3000',
    ],
    'web': [
        'http://web1.example.com:80',
        'http://web2.example.com:80',
    ],
    'admin': [
        'http://admin1.example.com:8080',
    ]
}

# Round robin counters
    # ... omitted: keep concept snippets short
            response.content,
            status=response.status_code,
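            # Drop hop-by-hop and length/encoding headers: requests has already
            # decoded the body, so the upstream values may no longer match it.
            headers={k: v for k, v in response.headers.items()
                     if k.lower() not in ('connection', 'content-encoding',
                                          'content-length', 'transfer-encoding')}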
        )

    except requests.RequestException as e:
        return Response(f"Bad Gateway: {e}", status=502)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)

Used In Systems:

  • AWS ELB/ALB/NLB: Cloud load balancing
  • nginx/HAProxy: Open-source load balancers
  • Kubernetes: Service load balancing with kube-proxy

Explained In Detail:

  • System Design Deep Dive - Load balancing in production systems

Quick Self-Check

  • Can explain load balancing in 60 seconds?
  • Know difference between Layer 4 and Layer 7 load balancing?
  • Understand 3+ load balancing algorithms and their trade-offs?
  • Can explain health checks and failover mechanisms?
  • Know when to use sticky sessions vs session replication?
  • Can design a load balancing solution for given requirements?

Production signal

Why this concept matters

Interview: 80% of system design interviews
Production: AWS ELB, nginx, HAProxy
Performance: Optimal response times
Scale: Horizontal scaling