How distributed systems copy data across multiple nodes to achieve high availability, fault tolerance, and geographic distribution—and the fundamental trade-offs involved
TL;DR
Replication is copying data across multiple nodes to survive failures, distribute load, and reduce latency. The core trade-off: synchronous replication guarantees consistency but adds latency; asynchronous replication is fast but risks data loss. Choose your strategy based on whether correctness or availability matters more.
Why Replicate?
| Motivation | Single Node | With Replication |
|---|---|---|
| Availability | Node fails = down | Failover to replica |
| Read Scaling | One node handles all | Distribute reads |
| Latency | Single location | Replicas near users |
| Durability | Disk fails = data loss | Multiple copies survive |
Replication Strategies
1. Single-Leader (Primary-Replica)
One node (the leader) accepts all writes; followers replicate from it.
```
      Writes
        ↓
     [Leader]
     /   |   \
    ↓    ↓    ↓
  [F1] [F2] [F3]   ← Followers
    ↑    ↑    ↑
      Reads
```
Used by: PostgreSQL, MySQL, MongoDB, Redis
Pros: Simple; strong consistency is possible
Cons: Leader is a write bottleneck; failover is complex
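A minimal sketch of the flow, assuming toy in-memory `Node`/`Leader` classes (not any real database API): the leader applies a write locally, then fans it out to followers, and reads can be served by any node once replication completes.

```python
# Toy single-leader flow (hypothetical classes, not a real database API).

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Leader(Node):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)        # commit locally first
        for f in self.followers:      # then fan out to followers
            f.apply(key, value)       # (sync or async in real systems)


followers = [Node("f1"), Node("f2"), Node("f3")]
leader = Leader("leader", followers)
leader.write("x", 1)

# Reads can hit any node once replication completes.
assert all(f.data["x"] == 1 for f in followers)
```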
2. Multi-Leader
Multiple nodes accept writes; changes are replicated between leaders, typically asynchronously.
```
[Leader DC1] ←────→ [Leader DC2]
     |                   |
[Follower]          [Follower]
```
Used by: CouchDB, multi-datacenter deployments, collaborative editing systems
Pros: Local writes in each region
Cons: Conflict resolution required (see the sketch below)
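One common (and lossy) resolution strategy is last-write-wins. A toy sketch with a hypothetical `merge_lww` helper:

```python
# Last-write-wins merge (toy sketch). Each value carries a timestamp;
# on merge, the later timestamp wins and the losing write is discarded.

def merge_lww(a, b):
    """a, b: dicts of key -> (timestamp, value)."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


dc1 = {"title": (100, "Meeting at 9")}   # written in DC1 at t=100
dc2 = {"title": (105, "Meeting at 10")}  # written in DC2 at t=105
print(merge_lww(dc1, dc2))               # {'title': (105, 'Meeting at 10')}
```

Simple to implement, but the losing write vanishes silently, which is why many systems prefer application-level merge logic.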
3. Leaderless
All nodes accept reads and writes. Quorum determines success.
```
      Client
     /   |   \
    ↓    ↓    ↓
  [N1] [N2] [N3]
```
Used by: Cassandra, Riak, and Amazon's Dynamo design (which inspired DynamoDB)
Pros: No failover needed; high availability
Cons: Eventual consistency; conflict resolution required
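The usual correctness condition is that write and read quorums overlap on at least one node, which holds whenever W + R > N. A quick sketch of the arithmetic:

```python
# Quorum overlap check: with N replicas, a write ACKed by W nodes and a
# read querying R nodes share at least one up-to-date node iff W + R > N.

def quorums_overlap(n, w, r):
    return w + r > n


print(quorums_overlap(3, 2, 2))  # True: the classic N=3, W=2, R=2 setup
print(quorums_overlap(3, 1, 1))  # False: fast, but reads can miss writes
```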
Synchronous vs Asynchronous
Synchronous Replication
Leader waits for replica acknowledgment before confirming write.
```
Client   → Leader:   WRITE x=1
Leader   → Follower: Replicate x=1
Follower → Leader:   ACK
Leader   → Client:   SUCCESS
```
Guarantee: If the write succeeds, the data is on multiple nodes
Cost: Write latency includes the slowest synchronous replica
Asynchronous Replication
Leader confirms immediately, replicates in background.
```
Client → Leader:   WRITE x=1
Leader → Client:   SUCCESS
Leader → Follower: Replicate x=1   (background)
```
Guarantee: None; a leader crash can lose writes already acknowledged to clients
Benefit: Low latency
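A toy sketch contrasting the two policies, using hypothetical `write_sync`/`write_async` helpers rather than any real driver API:

```python
# Toy contrast of the two acknowledgment policies.

from collections import deque


class Follower:
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        self.data[key] = value        # pretend this crossed the network
        return "ACK"


def write_sync(follower, key, value):
    ack = follower.replicate(key, value)   # block until the follower ACKs
    assert ack == "ACK"
    return "SUCCESS"                       # now durable on both nodes


pending = deque()                          # background replication queue

def write_async(key, value):
    pending.append((key, value))           # replicated later, off the hot path
    return "SUCCESS"                       # follower may not have it yet;
                                           # a leader crash here loses the write


f = Follower()
print(write_sync(f, "x", 1), f.data)        # SUCCESS {'x': 1}
print(write_async("y", 2), list(pending))   # SUCCESS [('y', 2)]
```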
Semi-Synchronous
Wait for at least one replica (not all).
```
W = 2 (write quorum)
Wait for leader + 1 follower
Remaining followers: async
```
Used by: MySQL semi-sync, PostgreSQL synchronous_commit
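For example, PostgreSQL can express this with quorum syntax in synchronous_standby_names (a sketch; the standby names are placeholders):

```
# postgresql.conf sketch (standby names are placeholders)
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby_a, standby_b)'
```

Each commit then waits for the leader plus any one of the listed standbys; the rest stay asynchronous.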
Replication Lag
The delay between a write committing on the leader and becoming visible on a replica.

```
Time →
Leader:   Write x=1 ──── Write x=2 ──── Write x=3
Replica:  ────── x=1 ───────── x=2 ───────── x=3
             ↑              ↑              ↑
         lag: 50ms      lag: 50ms      lag: 50ms
```
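A toy model of the effect, assuming a fixed 50 ms lag: a read that lands on the replica inside the lag window returns stale (or missing) data.

```python
# Toy lag model: the replica applies the leader's log 50 ms late, so a
# read inside that window returns stale (or missing) data.

leader_log = [(0, "x", 1), (100, "x", 2), (200, "x", 3)]  # (ms, key, value)
LAG_MS = 50

def replica_read(t_ms):
    """Latest value of x the replica has applied by time t_ms."""
    value = None
    for ts, _key, v in leader_log:
        if ts + LAG_MS <= t_ms:      # this entry has reached the replica
            value = v
    return value

print(replica_read(30))    # None: x=1 is still in flight
print(replica_read(120))   # 1: the leader already has x=2
```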
Lag Causes Problems
Read-your-writes violation:
```
User: POST /profile   (update goes to the leader)
User: GET  /profile   (read routed to a stale replica)
User: sees the old profile. Confused!
```
Solutions:
- Read your own writes from the leader
- Track the user's last write timestamp and only read from replicas that have caught up (sketched below)
- Pin the user's session to the same replica (session affinity)
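A sketch of the timestamp-tracking option, with plain dicts standing in for the leader and replica (hypothetical helpers, not a real client API):

```python
# Read-your-writes routing sketch (hypothetical helpers; leader and
# replica are plain dicts standing in for real nodes).

last_write_ts = {}            # user_id -> timestamp of their last write

def write(user_id, key, value, leader, now_ts):
    leader[key] = value
    last_write_ts[user_id] = now_ts

def read(user_id, key, leader, replica, replica_applied_ts):
    fresh_enough = replica_applied_ts >= last_write_ts.get(user_id, 0)
    source = replica if fresh_enough else leader   # fall back to the leader
    return source.get(key)

leader, replica = {}, {}
write("alice", "profile", "new bio", leader, now_ts=100)
print(read("alice", "profile", leader, replica, replica_applied_ts=90))
# 'new bio': served by the leader because the replica is behind (90 < 100)
```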
Failover
When leader fails, promote a follower.
```
Before:  [Leader*] → [F1] → [F2]
             ↓ crash
After:   [Leader]    [F1*] → [F2]
          (dead)    (new leader)
```
Challenges:
- Detection: Is leader dead or just slow?
- Election: Which follower becomes leader?
- Reconfiguration: Redirect clients to new leader
- Data loss: Async replication may lose recent writes
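A sketch of the detection and election heuristics, with made-up timeout values: a leader is only *suspected* dead after missed heartbeats, and the most caught-up follower is promoted to minimize data loss.

```python
# Failure detection + promotion heuristics (toy values). A missed
# heartbeat only *suggests* the leader is dead; it may just be slow.

HEARTBEAT_TIMEOUT_MS = 1000

def leader_suspected_dead(last_heartbeat_ms, now_ms):
    return now_ms - last_heartbeat_ms > HEARTBEAT_TIMEOUT_MS

def pick_new_leader(followers):
    """followers: list of (name, last_replicated_offset).
    Promote the most caught-up follower to minimize data loss."""
    return max(followers, key=lambda f: f[1])[0]

print(leader_suspected_dead(0, 1500))              # True (dead, or just slow!)
print(pick_new_leader([("f1", 90), ("f2", 120)]))  # 'f2'
```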
Split-Brain
Network partition creates two leaders:
```
[Old Leader] ──X── [New Leader + Followers]
      ↑                      ↑
   Client A               Client B
```
Both accept writes → data divergence!
Prevention: Quorum-based leader election (Raft, Paxos)
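These protocols typically pair elections with epoch (fencing) numbers so a deposed leader's writes are rejected. A toy sketch:

```python
# Fencing sketch: each election bumps an epoch number, and storage
# rejects writes stamped with an older epoch, so a deposed leader that
# never noticed the partition cannot corrupt data.

CURRENT_EPOCH = 2      # set when the new leader was elected

def accept_write(write_epoch, key, value, store):
    if write_epoch < CURRENT_EPOCH:
        raise PermissionError(f"stale leader (epoch {write_epoch}) fenced off")
    store[key] = value

store = {}
accept_write(2, "x", 1, store)       # new leader: accepted
try:
    accept_write(1, "x", 0, store)   # old leader: rejected
except PermissionError as e:
    print(e)
```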
Replication Mechanisms
| Mechanism | How It Works | Used By |
|---|---|---|
| Statement-based | Replicate SQL statements | MySQL (legacy) |
| WAL shipping | Send write-ahead log | PostgreSQL |
| Logical (row-based) | Send changed rows | MySQL binlog |
| Trigger-based | Application-level capture | Custom solutions |
Statement-Based Problems
```sql
-- Leader executes:
INSERT INTO logs VALUES (NOW(), 'event');

-- Replica executes the same statement:
INSERT INTO logs VALUES (NOW(), 'event');

-- NOW() returns a different time! Data diverges.
```
Issues: Non-deterministic functions (NOW(), RAND()), auto-increment IDs, triggers with side effects
WAL Shipping
Send the physical log bytes:
```
Block 427: [old_bytes] → [new_bytes]
Block 893: [old_bytes] → [new_bytes]
```
Pros: Exact byte-for-byte replication
Cons: Tied to the storage engine and its version
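A toy sketch of why this couples replicas to the storage engine: the replica patches raw bytes at block offsets, with no notion of rows or schemas.

```python
# Physical-replication sketch: the replica patches raw bytes at block
# offsets, with no notion of rows or schemas. Both sides must therefore
# run compatible storage-engine versions.

data_file = bytearray(1024)                 # toy data file

wal_records = [
    (427, b"\x01\x02"),    # (byte offset, new bytes)
    (893, b"\xff"),
]

def apply_wal(records, data):
    for offset, new_bytes in records:
        data[offset:offset + len(new_bytes)] = new_bytes

apply_wal(wal_records, data_file)
print(data_file[427:429])   # bytearray(b'\x01\x02')
```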
Logical (Row-Based)
```json
{
  "table": "users",
  "op": "UPDATE",
  "before": {"id": 1, "name": "old"},
  "after": {"id": 1, "name": "new"}
}
```
Pros: Version-independent, human-readable, enables change data capture (CDC)
Cons: More verbose than physical WAL
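A sketch of a replica applying such an event, assuming a hypothetical `apply_change` helper and rows keyed by a primary-key `id` column, matching the JSON above:

```python
# Applying a logical change event (toy sketch; assumes rows are keyed by
# a primary-key "id" column, matching the JSON event above).

def apply_change(event, tables):
    rows = tables.setdefault(event["table"], {})
    if event["op"] in ("INSERT", "UPDATE"):
        row = event["after"]
        rows[row["id"]] = row
    elif event["op"] == "DELETE":
        rows.pop(event["before"]["id"], None)

tables = {"users": {1: {"id": 1, "name": "old"}}}
apply_change({"table": "users", "op": "UPDATE",
              "before": {"id": 1, "name": "old"},
              "after": {"id": 1, "name": "new"}}, tables)
print(tables["users"][1])   # {'id': 1, 'name': 'new'}
```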
Production Systems
| System | Strategy | Sync Mode | Typical Lag |
|---|---|---|---|
| PostgreSQL | Single-leader | Configurable | sub-second (async) |
| MySQL | Single-leader | Semi-sync | sub-second |
| MongoDB | Replica set | Configurable | sub-second |
| Cassandra | Leaderless | Quorum | 10-100 ms |
| Kafka | Single-leader per partition | ISR (in-sync replicas) | < 100 ms |
Related Content
See It In Action:
- Eventual Consistency - ~80 second animated visual explanation of async replication
Prerequisites:
- Distributed Systems Basics - Foundation concepts
Related Concepts:
- Leader-Follower Replication - Single-leader deep dive
- Consensus - Leader election algorithms
- Eventual Consistency - Async replication guarantees
- Write-Ahead Log - How WAL shipping works
Used In Systems:
- PostgreSQL, MySQL, MongoDB (database replication)
- Kafka (partition replication with ISR)
- Redis (leader-follower, Redis Cluster)
- etcd, Consul (Raft-based replication)
Next Recommended: Leader-Follower Replication - Deep dive into the most common pattern
Interview Notes
- Comes up in roughly 85% of distributed systems interviews
- Powers every highly available database, Kafka, and distributed storage systems
- Know the sync vs. async replication latency/durability trade-off
- Know the motivations: read scaling and geographic distribution