
Replication

How distributed systems copy data across multiple nodes to achieve high availability, fault tolerance, and geographic distribution—and the fundamental trade-offs involved

TL;DR

Replication is copying data across multiple nodes to survive failures, distribute load, and reduce latency. The core trade-off: synchronous replication guarantees consistency but adds latency; asynchronous replication is fast but risks data loss. Choose your strategy based on whether correctness or availability matters more.

Visual Overview

Replication Overview

Why Replicate?

| Motivation | Single Node | With Replication |
|---|---|---|
| Availability | Node fails = down | Failover to replica |
| Read Scaling | One node handles all | Distribute reads |
| Latency | Single location | Replicas near users |
| Durability | Disk fails = data loss | Multiple copies survive |

Replication Strategies

1. Single-Leader (Primary-Replica)

One node (the leader) accepts all writes; followers replicate changes from the leader.

         Writes
            ↓
        [Leader]
        /   |   \
       ↓    ↓    ↓
    [F1]  [F2]  [F3]   Followers
       ↓    ↓    ↓
         Reads

Used by: PostgreSQL, MySQL, MongoDB, Redis

Pros: Simple; strong consistency possible
Cons: Leader is a write bottleneck; failover complexity
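The write path above can be sketched in a few lines. This is a minimal illustration with hypothetical `Leader`/`Follower` classes, not any database's actual API: the leader appends every write to an ordered log, and followers replay the log from where they left off.

```python
# Minimal sketch of single-leader replication (hypothetical classes).
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered replication log

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))  # followers replay this entry later

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # log position already replayed

    def pull(self, leader):
        # Apply only the log entries we have not yet seen, in order.
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)

leader = Leader()
followers = [Follower(), Follower()]
leader.write("x", 1)
for f in followers:
    f.pull(leader)
# All replicas now agree: f.data["x"] == 1 on every follower
```

Because every follower replays the same log in the same order, all replicas converge to the same state; the cost is that the single leader serializes all writes.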

2. Multi-Leader

Multiple nodes accept writes. Replicate changes between leaders.

    [Leader DC1] ──── [Leader DC2]
        |                   |
    [Follower]          [Follower]

Used by: CockroachDB (multi-region), collaborative editing

Pros: Local writes in each region
Cons: Conflict resolution required

3. Leaderless

All nodes accept reads and writes. Quorum determines success.

    Client
   /   |   \
  ↓    ↓    ↓
[N1]  [N2]  [N3]

Used by: Cassandra, DynamoDB, Riak

Pros: No failover needed; high availability
Cons: Eventually consistent; conflict resolution required
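The quorum rule can be made concrete. The sketch below is illustrative (not Cassandra's or DynamoDB's actual protocol): with N replicas, a write succeeds once W replicas acknowledge it and a read contacts R replicas; choosing R + W > N guarantees the read set overlaps the latest write set, so the reader can pick the newest version.

```python
# Sketch of leaderless quorum reads/writes (illustrative, simplified versioning).
N, W, R = 3, 2, 2
replicas = [{} for _ in range(N)]
clock = 0  # simple global version counter (real systems use per-key versions)

def write(key, value):
    global clock
    clock += 1
    acks = 0
    for rep in replicas:
        rep[key] = (value, clock)  # in reality some replicas may miss the write
        acks += 1
        if acks == W:
            break                  # stop once the write quorum is met
    return acks >= W

def read(key):
    # Contact R replicas and return the value with the highest version.
    versions = [rep[key] for rep in replicas[:R] if key in rep]
    return max(versions, key=lambda v: v[1])[0]

write("x", "v1")
print(read("x"))  # "v1": the read quorum overlaps the write quorum
```

With N=3, W=2, R=2 the overlap is guaranteed even though replica N3 never saw the write; that overlap is what stands in for a leader.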

Synchronous vs Asynchronous

Synchronous Replication

Leader waits for replica acknowledgment before confirming write.

Client → Leader: WRITE x=1
Leader → Follower: Replicate x=1
Follower → Leader: ACK
Leader → Client: SUCCESS

Guarantee: If the write succeeds, the data is on multiple nodes
Cost: Latency = leader write + slowest replica's acknowledgment

Asynchronous Replication

Leader confirms immediately, replicates in background.

Client → Leader: WRITE x=1
Leader → Client: SUCCESS
Leader → Follower: Replicate x=1 (background)

Guarantee: None; a leader crash can lose writes already acknowledged to clients
Benefit: Low latency

Semi-Synchronous

Wait for at least one replica (not all).

W = 2 (write quorum)
Wait for leader + 1 follower
Remaining followers: async

Used by: MySQL semi-sync, PostgreSQL synchronous_commit
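A toy sketch of the semi-synchronous rule (hypothetical helper, not MySQL's actual protocol): the commit is acknowledged once the first `sync_count` followers have the data, and the remaining followers catch up afterwards.

```python
# Sketch of semi-synchronous replication: wait for sync_count follower acks,
# replicate to the rest asynchronously (modeled here as a deferred loop).
def semi_sync_write(followers, key, value, sync_count=1):
    acked = 0
    pending = []
    for f in followers:
        if acked < sync_count:
            f[key] = value        # synchronous: wait for this follower's ack
            acked += 1
        else:
            pending.append(f)     # asynchronous: replicate later
    committed = acked >= sync_count  # commit point: quorum of acks reached
    for f in pending:             # background catch-up (would be a task/thread)
        f[key] = value
    return committed

followers = [{}, {}, {}]
semi_sync_write(followers, "x", 1)  # returns True: 1 sync ack received
```

The appeal is that durability holds as long as the leader and the one synchronous follower do not fail together, while write latency tracks the fastest follower rather than the slowest.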

Replication Lag

The delay between write on leader and visibility on replica.

Time →
Leader:   Write x=1 ─────── Write x=2 ─────── Write x=3
Replica:  ───────── x=1 ─────────── x=2 ─────────── x=3
               Lag: 50ms       Lag: 50ms       Lag: 50ms

Lag Causes Problems

Read-your-writes violation:

User: POST update profile
User: GET profile (routed to stale replica)
User: Sees old profile! Confused.

Solutions:

  • Read own writes from leader
  • Track write timestamp, ensure replica is caught up
  • Session affinity to same replica
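The second solution above (track the write timestamp) can be sketched as follows. The helpers and the `replicated_up_to` watermark are hypothetical, but the pattern is general: the client remembers when its last write happened and only reads from a replica that has replayed at least that far, falling back to the leader otherwise.

```python
# Sketch of read-your-writes via a client-tracked write timestamp.
import time

class Replica:
    def __init__(self):
        self.data = {}
        self.replicated_up_to = 0.0  # timestamp of the last applied write

leader = Replica()
stale_replica = Replica()  # lagging: has not replayed recent writes

def write(key, value):
    ts = time.time()
    leader.data[key] = value
    leader.replicated_up_to = ts
    return ts  # the client stores this as its "last write" watermark

def read_your_writes(key, last_write_ts, replica):
    # If the replica has not caught up to our last write, read from the leader.
    source = replica if replica.replicated_up_to >= last_write_ts else leader
    return source.data[key]

ts = write("profile", "new bio")
print(read_your_writes("profile", ts, stale_replica))  # "new bio", via the leader
```

Note this only guarantees the user sees *their own* writes; other users' reads may still be stale, which is usually acceptable.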

Failover

When leader fails, promote a follower.

Before:  [Leader*]   [F1]   [F2]
             ↓ crash
After:   [Leader]    [F1*]  [F2]
         (dead)   (new leader)

(* marks the current leader)

Challenges:

  1. Detection: Is leader dead or just slow?
  2. Election: Which follower becomes leader?
  3. Reconfiguration: Redirect clients to new leader
  4. Data loss: Async replication may lose recent writes
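Challenge 1 (detection) is usually handled with heartbeat timeouts. The thresholds below are assumed values for illustration: a follower suspects the leader only after several heartbeat intervals pass without contact, trading detection speed against false alarms on a slow network.

```python
# Sketch of leader failure detection by heartbeat timeout (assumed thresholds).
import time

HEARTBEAT_INTERVAL = 0.5   # leader sends a heartbeat every 500 ms (assumption)
SUSPECT_AFTER = 3          # missed intervals before declaring the leader dead

def leader_suspected(last_heartbeat_at, now):
    # Suspect the leader once the silence exceeds SUSPECT_AFTER intervals.
    return (now - last_heartbeat_at) > HEARTBEAT_INTERVAL * SUSPECT_AFTER

now = time.time()
print(leader_suspected(now - 0.4, now))  # False: within one interval
print(leader_suspected(now - 2.0, now))  # True: four intervals of silence
```

There is no safe universal timeout: too short and a slow-but-alive leader triggers needless failovers; too long and the system stays down longer than necessary.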

Split-Brain

Network partition creates two leaders:

[Old Leader] ──X── [New Leader + Followers]
      ↑                      ↑
   Client A               Client B

Both accept writes → data divergence!

Prevention: Quorum-based leader election (Raft, Paxos)
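The core of quorum-based prevention fits in one line: a candidate becomes leader only with votes from a strict majority, so after a partition at most one side can elect a leader for a given term. A simplified, Raft-style sketch:

```python
# Sketch of the majority rule behind Raft/Paxos leader election (simplified).
def can_become_leader(votes_received, cluster_size):
    return votes_received > cluster_size // 2  # strict majority required

# A 5-node cluster split 3 / 2 by a network partition:
print(can_become_leader(3, 5))  # True:  the majority side elects a leader
print(can_become_leader(2, 5))  # False: the minority side cannot -> no split-brain
```

This is also why clusters usually have an odd number of nodes: in a 4-node cluster split 2/2, neither side has a majority and the whole system stalls.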

Replication Mechanisms

| Mechanism | How It Works | Used By |
|---|---|---|
| Statement-based | Replicate SQL statements | MySQL (legacy) |
| WAL shipping | Ship the write-ahead log | PostgreSQL |
| Logical (row-based) | Ship changed rows | MySQL binlog |
| Trigger-based | Application-level capture | Custom solutions |

Statement-Based Problems

-- Leader executes:
INSERT INTO logs VALUES (NOW(), 'event');

-- Replica executes same statement:
INSERT INTO logs VALUES (NOW(), 'event');
-- NOW() returns different time! Data diverges.

Issues: Non-deterministic functions, auto-increment IDs

WAL Shipping

Send the physical log bytes:

Block 427: [old_bytes] → [new_bytes]
Block 893: [old_bytes] → [new_bytes]

Pros: Exact byte-for-byte replication
Cons: Tied to the storage engine version

Logical (Row-Based)

{
  "table": "users",
  "op": "UPDATE",
  "before": {"id": 1, "name": "old"},
  "after": {"id": 1, "name": "new"}
}

Pros: Version-independent, readable, enables CDC
Cons: More verbose than WAL
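Applying such a change event on a replica (or a CDC consumer) is straightforward. The sketch below assumes the event shape shown above with an `id` primary key; it is not modeled on any specific CDC tool:

```python
# Sketch of applying a logical (row-based) change event on a replica.
import json

def apply_change(tables, event):
    table = tables.setdefault(event["table"], {})
    if event["op"] in ("INSERT", "UPDATE"):
        row = event["after"]
        table[row["id"]] = row               # upsert by primary key
    elif event["op"] == "DELETE":
        table.pop(event["before"]["id"], None)

tables = {}
event = json.loads('{"table": "users", "op": "UPDATE", '
                   '"before": {"id": 1, "name": "old"}, '
                   '"after": {"id": 1, "name": "new"}}')
apply_change(tables, event)
print(tables["users"][1]["name"])  # "new"
```

Because the event describes rows rather than SQL or storage bytes, the same stream can feed a replica, a search index, or a cache, which is exactly what makes logical replication the basis for CDC.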

Production Systems

| System | Strategy | Sync Mode | Typical Lag |
|---|---|---|---|
| PostgreSQL | Single-leader | Configurable | Sub-second (async) |
| MySQL | Single-leader | Semi-sync | Sub-second |
| MongoDB | Replica set | Configurable | Sub-second |
| Cassandra | Leaderless | Quorum | 10-100 ms |
| Kafka | Single-leader per partition | ISR | Under 100 ms |


Used In Systems:

  • PostgreSQL, MySQL, MongoDB (database replication)
  • Kafka (partition replication with ISR)
  • Redis (leader-follower, Redis Cluster)
  • etcd, Consul (Raft-based replication)

Next Recommended: Leader-Follower Replication - Deep dive into the most common pattern

Interview Notes

⭐ Must-Know: appears in roughly 85% of distributed systems interviews
🏭 Production Impact: underpins every HA database, Kafka, and distributed storage
⚡ Performance: sync vs async replication is the key latency trade-off
📈 Scalability: enables read scaling and geographic distribution