How distributed systems agree on a single value or state across multiple nodes, enabling coordination despite failures and network partitions
75% of L6+ interviews
Powers coordination systems like etcd, ZooKeeper, and Consul
Key use cases: distributed locks, leader election
TL;DR
Consensus is a fundamental problem in distributed systems where multiple nodes must agree on a single value or decision, even in the presence of failures. Algorithms like Raft and Paxos solve this by using leader election, quorums, and log replication to ensure all nodes eventually agree on the same state. Essential for distributed locks, configuration management, and coordination services like etcd, Consul, and ZooKeeper.
Visual Overview
THE CONSENSUS PROBLEM
Goal: N nodes agree on a single value

  Node 1 proposes: value = "A"
  Node 2 proposes: value = "B"
  Node 3 proposes: value = "A"
              ↓
  Consensus algorithm runs...
              ↓
  Node 1 decides: value = "A" ✓
  Node 2 decides: value = "A" ✓
  Node 3 decides: value = "A" ✓

Properties:
  1. Agreement: All nodes decide the same value
  2. Validity: The decided value was proposed by some node
  3. Termination: All nodes eventually decide
RAFT CONSENSUS (Simplified)
Phase 1: LEADER ELECTION
  Nodes N1, N2, N3 start as followers
  N1's election timeout fires → N1 starts an election
  N1 votes for itself and requests votes from N2, N3
  N2 votes YES, N3 votes YES
  N1 becomes LEADER (majority) ✓

Phase 2: LOG REPLICATION
  Leader receives command: SET x=5
  1. Leader appends to its log: [term=1, SET x=5]
  2. Leader sends the entry to followers
  3. Followers append it to their logs
  4. Followers ACK
  5. Leader receives majority ACK ✓
  6. Leader commits the entry
  7. Entry is applied to the state machine

Result: All nodes have the same log, hence the same state
LEADER ELECTION (Detailed)
Initial state: 3 followers (N1, N2, N3)

Election timeout (N1 times out first):
  N1 → Candidate (term=1)
  N1 votes for itself (vote count = 1)
  N1 sends RequestVote to N2, N3

N2 receives RequestVote:
  - Term=1 (same as N2's term)
  - N2 hasn't voted this term → votes YES

N3 receives RequestVote:
  - Term=1, N3 hasn't voted → votes YES

N1 receives 2 more votes (3/3 total = majority) ✓
  N1 → Leader
  N1 sends heartbeats to N2, N3
  N2, N3 remain followers
SPLIT VOTE (and Recovery)
4-node cluster: N1 and N2 time out simultaneously

  N1 → Candidate (term=1), votes for itself
  N2 → Candidate (term=1), votes for itself

  N1 requests votes from N2, N3, N4
  N2 requests votes from N1, N3, N4

  N3 votes for N1 (its request arrived first), rejects N2
  N4 votes for N2 (its request arrived first), rejects N1

Vote count (majority of 4 is 3):
  N1: 2 votes (self + N3) = NOT a majority ✗
  N2: 2 votes (self + N4) = NOT a majority ✗

Election times out, no leader elected

Retry with fresh random timeouts (makes another tie unlikely)
  N1 times out first → wins the next election ✓
Core Explanation
What is Consensus?
Consensus is the problem of getting multiple distributed nodes to agree on a single value, even when:
- Nodes fail (crash)
- Messages are delayed or lost
- Network partitions occur
Consensus Properties:
- Agreement: All non-faulty nodes decide on the same value
- Validity: The decided value must have been proposed by some node
- Termination: All non-faulty nodes eventually decide
- Integrity: Nodes decide at most once
Real-World Analogies:
- Board of directors voting on decision
- Jury reaching verdict
- Politicians passing legislation
Why Consensus is Hard (FLP Impossibility)
The FLP Result (1985):
Fischer, Lynch, and Paterson proved:
"In an asynchronous system, no deterministic consensus algorithm
can guarantee termination if even ONE node may crash."
What this means:
- Async network: Can't distinguish slow vs crashed
- Even 1 failure: Can block progress forever
- Deterministic: No randomness allowed
Real systems work around this by:
1. Timeouts (assume crashed after T seconds)
2. Randomization (random backoff)
3. Partial synchrony (eventual bounds)
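A minimal sketch of the first two workarounds, with illustrative values: a node only suspects the leader after a timeout, and the timeout itself is randomized so competing candidates rarely collide. The helper names here are hypothetical, not from any specific implementation.

```python
import random
import time

def new_election_timeout(base_ms: float = 150.0, jitter_ms: float = 150.0) -> float:
    """Randomized timeout in seconds, so two nodes rarely time out together."""
    return (base_ms + random.uniform(0, jitter_ms)) / 1000.0

def leader_suspected(last_heartbeat: float, timeout: float) -> bool:
    """Timeout-based failure detection: a slow leader and a crashed leader
    look identical, so after `timeout` seconds of silence we assume a crash."""
    return time.time() - last_heartbeat > timeout

# Example: a follower re-arms its timer after each heartbeat it receives
timeout = new_election_timeout()
last_heartbeat = time.time()
if leader_suspected(last_heartbeat, timeout):
    print("start an election")
```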
Consensus Algorithms
1. Raft (Understandable Consensus)
Raft Design Goals:
- Understandable (simpler than Paxos)
- Practical (production-ready)
- Safe (proven correct)
Key Components:
1. Leader Election
2. Log Replication
3. Safety Guarantees
Terms (Logical Clock):
Term 1: Leader = N1
Term 2: Leader = N2   (N1 crashed)
Term 3: no leader     (split vote)
Term 4: Leader = N1
Terms ensure:
- Only one leader per term
- Stale leaders detected
- Log ordering preserved
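A rough sketch of how a node applies the term rule to every incoming message; the field names are illustrative, not taken from a particular implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TermState:
    current_term: int = 0
    role: str = "follower"
    voted_for: Optional[int] = None

def check_term(node: TermState, rpc_term: int) -> bool:
    """Return True if the message may be processed, False if it is stale."""
    if rpc_term > node.current_term:
        node.current_term = rpc_term   # adopt the newer term
        node.role = "follower"         # a stale leader/candidate steps down
        node.voted_for = None          # votes are per-term
    return rpc_term >= node.current_term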
Raft Leader Election:
State Machine:
Follower --(election timeout)--> Candidate --(wins majority)--> Leader
Candidate --(election timeout)--> Candidate (new election)
Candidate or Leader --(sees higher term)--> Follower
Election Process:
1. Follower waits for heartbeat from leader
2. If timeout (150-300ms random):
- Increment term
- Become candidate
- Vote for self
- Send RequestVote to all peers
3. Receive votes:
- Majority? → Become leader
- Another leader elected? → Become follower
- Timeout? → Start new election
4. Leader sends heartbeats (prevent new elections)
Safety: Only one leader per term (majority quorum)
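The vote-granting rule itself is small; a hedged sketch (state handling simplified, persistence omitted, names illustrative):

```python
from typing import Optional, Tuple

def handle_request_vote(current_term: int, voted_for: Optional[int],
                        my_last_log_term: int, my_last_log_index: int,
                        cand_id: int, cand_term: int,
                        cand_last_log_term: int, cand_last_log_index: int) -> Tuple[int, bool]:
    """Return (term_to_adopt, vote_granted). The caller records voted_for on a grant."""
    if cand_term < current_term:
        return current_term, False                    # stale candidate
    if cand_term > current_term:
        current_term, voted_for = cand_term, None     # newer term resets our vote
    if voted_for not in (None, cand_id):
        return current_term, False                    # at most one vote per term
    # Candidate's log must be at least as up-to-date as ours (Raft's election restriction)
    log_ok = (cand_last_log_term, cand_last_log_index) >= (my_last_log_term, my_last_log_index)
    return current_term, log_ok
```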
Raft Log Replication:
Log Structure:
Index:    1    2    3    4    5
Term:     1    1    1    2    3
Cmd:     x=3  y=9  x=5  y=2  x=1
Status:   ✓    ✓    ✓    ✓    ?
         (committed)        (pending)
Replication Steps:
1. Leader receives command from client
2. Leader appends to local log (uncommitted)
3. Leader sends AppendEntries RPC to followers
4. Followers append to logs, return ACK
5. Leader receives majority ACK → commits entry
6. Leader applies to state machine
7. Leader notifies followers to commit
8. Followers apply to state machines
Safety Rules:
- Log Matching: Same index + term → logs identical up to that point
- Leader Completeness: A leader has all committed entries
- State Machine Safety: Same log → same state
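Step 5 (commit on majority ACK) can be sketched as a pure function over the leader's bookkeeping. `match_index` tracks the highest entry each follower is known to store; the names are illustrative.

```python
from typing import Dict

def advance_commit_index(match_index: Dict[str, int], leader_last_index: int,
                         commit_index: int, entry_term: Dict[int, int],
                         current_term: int) -> int:
    """Return the new commit index: the highest entry replicated on a majority
    (counting the leader itself), and only for entries from the current term."""
    cluster_size = len(match_index) + 1               # followers + leader
    for n in range(leader_last_index, commit_index, -1):
        replicas = 1 + sum(1 for idx in match_index.values() if idx >= n)
        if replicas > cluster_size // 2 and entry_term.get(n) == current_term:
            return n
    return commit_index

# 5-node cluster: followers have replicated up to indexes 7, 5, 3, 0
match = {"f1": 7, "f2": 5, "f3": 3, "f4": 0}
terms = {n: 2 for n in range(1, 8)}
print(advance_commit_index(match, leader_last_index=7, commit_index=3,
                           entry_term=terms, current_term=2))  # -> 5
```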
2. Paxos (Classic Consensus)
Paxos Phases:
Phase 1: PREPARE (proposal ordering)
  Proposer:
    - Generates proposal number N
    - Sends PREPARE(N) to acceptors
  Acceptor:
    - If N > highest number seen:
        → Promise not to accept any N' < N
        → Return any value it has already accepted
    - Else: Reject
Phase 2: ACCEPT (propose value)
  Proposer:
    - Receives promises from a majority
    - Chooses a value (must reuse any returned accepted value)
    - Sends ACCEPT(N, V) to acceptors
  Acceptor:
    - If N >= highest promised:
        → Accept (N, V)
        → Notify learners
    - Else: Reject
Safety: Once value chosen, never changes
Liveness: May not terminate (dueling proposers)
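A compact sketch of a single-decree Paxos acceptor implementing the two phases above; proposal numbers are assumed globally unique (e.g. (round, proposer_id) pairs).

```python
class Acceptor:
    """Single-decree Paxos acceptor: remembers its promise and any accepted value."""

    def __init__(self):
        self.promised_n = None       # highest proposal number promised
        self.accepted_n = None       # proposal number of the accepted value, if any
        self.accepted_value = None

    def on_prepare(self, n):
        """Phase 1: promise not to accept anything numbered below n."""
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            # Return any already-accepted value so the proposer must reuse it
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_value = n, value
            return ("accepted", n)
        return ("reject", self.promised_n)
```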
Paxos vs Raft:
Paxos:
+ Theoretical foundation (Lamport's Paxos, published 1998)
+ More flexible (multi-leader variants)
- Complex to understand
- Hard to implement correctly
Raft:
+ Easier to understand (designed for clarity)
+ Easier to implement (clear leader)
+ Better for teaching and adoption
- Slightly less flexible than Multi-Paxos
In Practice:
- etcd uses Raft
- Google Chubby uses Paxos
- Both work well in production
Handling Failures
Node Failures:
Scenario: 5-node cluster, up to 2 failures tolerable
Leader Fails:
1. Followers detect missing heartbeats (timeout)
2. Followers start elections
3. New leader elected (majority quorum)
4. New leader has all committed entries (safety)
5. Processing resumes
Follower Fails:
1. Leader continues with remaining nodes
2. Leader still has majority (3/5 available)
3. Failed node recovers → catches up from the leader
Majority Fails:
1. No majority quorum available
2. Cluster unavailable (cannot make progress)
3. Prevents split-brain (consistency > availability)
4. Wait for nodes to recover
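The quorum arithmetic behind these scenarios is simple enough to sketch directly:

```python
def quorum_size(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """Nodes that can fail while a quorum can still form."""
    return n - quorum_size(n)

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates {tolerable_failures(n)} failures")
# 3 nodes tolerate 1 failure, 5 tolerate 2, 7 tolerate 3;
# 4 nodes still tolerate only 1, which is why odd cluster sizes are preferred.
```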
Network Partitions:
Scenario: 5 nodes split into [3] and [2]
Partition 1: [N1, N2, N3] (majority)
- Can elect a leader ✓
- Can commit entries ✓
- Remains available
Partition 2: [N4, N5] (minority)
- Cannot elect a leader ✗
- Cannot commit entries ✗
- Becomes unavailable
When partition heals:
- Minority nodes recognize the higher term
- Minority nodes become followers
- Minority nodes catch up from the leader
- Cluster reunited ✓
Safety: No split-brain (only one partition has quorum)
Use Cases
1. Distributed Configuration
etcd for Kubernetes:
- Store cluster configuration
- Service discovery
- Distributed locks
- Leader election for controllers
Why consensus?
- Consistent view of configuration
- Atomic updates
- Survive node failures
2. Leader Election
Kafka Controller Election:
- One broker is controller
- Controller manages partitions
- If controller fails, elect new one
Using ZooKeeper (consensus-based):
1. Brokers try to create /controller node
2. First to create it → becomes controller
3. Others watch node for changes
4. If controller dies, node deleted
5. New election triggered
3. Distributed Locks
Acquiring lock with etcd:
1. Client creates unique lease
2. Client writes key with lease
3. Key creation succeeds → lock acquired
4. Other clients see the key exists → wait
5. Lease expires → key deleted → lock released
Consensus guarantees:
- Only one client gets lock (linearizable)
- Lock survives client/server failures
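A sketch of that flow against a hypothetical key-value client `kv`; the method names below (`grant_lease`, `put_if_absent`, `delete_if_value`, `revoke_lease`) are illustrative, not the actual etcd client API.

```python
import uuid

def acquire_lock(kv, lock_key: str, ttl_seconds: int = 30):
    """Return a (client_id, lease) token, or None if the lock is already held."""
    client_id = str(uuid.uuid4())
    lease = kv.grant_lease(ttl_seconds)                     # lock auto-releases if we crash
    created = kv.put_if_absent(lock_key, client_id, lease)  # atomic create, linearizable via Raft
    return (client_id, lease) if created else None

def release_lock(kv, lock_key: str, token) -> None:
    client_id, lease = token
    kv.delete_if_value(lock_key, expected=client_id)        # delete only if we still own it
    kv.revoke_lease(lease)
```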
Real Systems Using Consensus
| System      | Algorithm | Use Case                       | Key Features                        |
|-------------|-----------|--------------------------------|-------------------------------------|
| etcd        | Raft      | Kubernetes config, locks       | Strongly consistent key-value store |
| Consul      | Raft      | Service discovery, config      | Multi-datacenter support            |
| ZooKeeper   | Zab       | Coordination, leader election  | Widely adopted (Kafka, Hadoop)      |
| CockroachDB | Raft      | Distributed SQL                | Range-level consensus               |
| TiKV        | Raft      | Distributed key-value          | Part of TiDB database               |
| Spanner     | Paxos     | Google's distributed database  | Global consistency with TrueTime    |
Case Study: etcd with Raft
etcd Architecture:
etcd Cluster (3 nodes):
  etcd1 (Leader)    etcd2 (Follower)    etcd3 (Follower)
         ↑
         │  (client requests)
         ↓
Kubernetes API Server
  - Reads cluster state from etcd
  - Writes updates to etcd
  - Watches for changes
Write Flow:
1. Client sends PUT /pods/pod-1 to any etcd node
2. If follower: Forward to leader
3. Leader appends to log
4. Leader replicates to followers
5. Majority ACK (2/3) → commit ✓
6. Leader applies to state machine
7. Leader responds to client
8. Followers apply to state machines
Guarantees:
✓ Linearizable reads/writes
✓ Consistent snapshots
✓ Survives minority failures (1 of 3)
✓ MVCC for historical reads
Case Study: ZooKeeper
ZooKeeper (Zab Protocol):
ZooKeeper Ensemble (3 or 5 nodes):
  Leader:    processes all writes
  Followers: serve reads (and vote in the quorum)
  Observers: scale reads (do not vote in the quorum)
Kafka Controller Election using ZK:
1. Broker starts, connects to ZooKeeper
2. Tries to create /controller ephemeral node
3. First broker → creates node → becomes controller
4. Other brokers → node exists → become followers
5. All brokers watch /controller for changes
6. Controller dies → ephemeral node deleted
7. All brokers notified → start new election
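A sketch of this election flow, assuming the kazoo ZooKeeper client library; the path and payload are illustrative, not Kafka's actual format.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def try_become_controller(broker_id: int) -> bool:
    try:
        # Ephemeral node: ZooKeeper deletes it automatically if our session dies
        zk.create("/controller", str(broker_id).encode(), ephemeral=True)
        return True                      # we won the election
    except NodeExistsError:
        return False                     # another broker is already controller

def on_controller_change(event):
    # Fired when /controller is created, changed, or deleted
    if event.type == "DELETED":
        try_become_controller(broker_id=1)   # re-run the election

# ZooKeeper watches are one-shot; re-register after firing (or use kazoo's DataWatch)
zk.exists("/controller", watch=on_controller_change)
```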
Guarantees:
✓ Sequential consistency (not linearizable)
✓ Atomic updates
✓ Ordered operations
✓ Session management with leases
When to Use Consensus
✅ Perfect Use Cases
Distributed Configuration Management
Scenario: Kubernetes cluster configuration
Requirement: All nodes see same config, survive failures
Solution: etcd with Raft
Benefit: Consistent view, automatic failover
Leader Election
Scenario: Kafka controller election
Requirement: Exactly one controller at all times
Solution: ZooKeeper consensus
Benefit: No split-brain, automatic re-election
Distributed Locks
Scenario: Ensure only one job runs (cron)
Requirement: Mutual exclusion across servers
Solution: etcd lease with Raft
Benefit: Lock survives failures, no duplicate execution
❌ When NOT to Use
High-Throughput Data Storage
Problem: Consensus is slow (requires majority)
Alternative: Eventual consistency (Cassandra, DynamoDB)
Example: Storing millions of writes/second
Multi-Datacenter with Low Latency
Problem: Consensus requires majority across DCs (high latency)
Alternative: Async replication, conflict resolution
Example: Global social media application
Simple Use Cases
Problem: Consensus is complex (operational overhead)
Alternative: Single leader with backups
Example: Small application with few nodes
Interview Application
Common Interview Question
Q: "How would you implement a distributed lock service that survives node failures and network partitions?"
Strong Answer:
"I'd build a distributed lock service using consensus (Raft):
Architecture:
- 3 or 5 node cluster running Raft consensus
- Lease-based locks with automatic expiration
- Strong consistency guarantees (linearizable)
Lock Acquisition:
AcquireLock(lockName, leaseDuration):
  1. Generate a unique client ID
  2. Send to leader: CREATE lock/{lockName} with clientID, lease
  3. Raft replicates to a majority (quorum)
  4. If successfully created: return lock token
  5. If it already exists: return failure (lock held)
Lock Release:
ReleaseLock(lockName, clientID):
  1. Send to leader: DELETE lock/{lockName} if owner == clientID
  2. Raft replicates the deletion
  3. Majority ACK → lock released
Lease Expiration:
- Lock has TTL (e.g., 30 seconds)
- Client must renew lease (heartbeat every 10s)
- If client crashes: Lease expires, lock auto-released
- Prevents orphaned locks
Handling Failures:
- Client Failure:
- Lease expires → lock released
- Other clients can acquire lock
- Leader Failure:
- Followers detect missing heartbeats
- New election (majority quorum)
- New leader has all committed locks
- Processing resumes
- Network Partition:
- Majority partition: can grant/release locks ✓
- Minority partition: cannot grant locks ✗ (no quorum)
- Prevents split-brain (no duplicate locks)
Consistency Guarantees:
- Mutual Exclusion: Only one client holds lock (consensus ensures)
- Deadlock-Free: Leases prevent orphaned locks
- Fault Tolerance: Survives minority failures
API Design:
# Acquire the lock with a 30-second lease
token = lock_service.acquire("my-lock", ttl=30)
if token:
    try:
        process_job()      # do the critical work
    finally:
        lock_service.release("my-lock", token)
else:
    retry_later()          # lock held by another client
Trade-offs:
- Latency: 10-50ms (consensus overhead)
- Throughput: ~1000s locks/sec (consensus bottleneck)
- Availability: Requires majority (unavailable during partition)
But acceptable for coordination use cases where consistency > performance
Real-World Example: etcd implements this exact design for Kubernetes distributed locks"
Code Example
Simplified Raft-Style Leader Election
import time
import random
import threading
from enum import Enum
from typing import List


class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class RaftNode:
    """Simplified Raft node (leader election only, no real RPCs)."""

    def __init__(self, node_id: int, peers: List[int]):
        self.node_id = node_id
        self.peers = peers                      # includes this node's own id
        self.state = NodeState.FOLLOWER
        self.current_term = 0
        self.voted_for = None
        self.leader_id = None

        # Timeouts (seconds): randomized election timeout avoids split votes
        self.election_timeout = random.uniform(150, 300) / 1000  # 150-300 ms
        self.heartbeat_interval = 50 / 1000                       # 50 ms
        self.last_heartbeat = time.time()

        # Vote tracking
        self.votes_received = set()

        # Background thread running the election/heartbeat state machine
        self.running = True
        self.thread = threading.Thread(target=self.run, daemon=True)
        self.thread.start()

    def run(self):
        """Main loop: behave according to the current state."""
        while self.running:
            if self.state == NodeState.FOLLOWER:
                self._follower_loop()
            elif self.state == NodeState.CANDIDATE:
                self._candidate_loop()
            elif self.state == NodeState.LEADER:
                self._leader_loop()
            time.sleep(0.01)  # 10 ms tick

    def _follower_loop(self):
        """Follower: if no heartbeat arrives before the timeout, start an election."""
        if time.time() - self.last_heartbeat > self.election_timeout:
            print(f"Node {self.node_id}: Election timeout, becoming candidate")
            self.state = NodeState.CANDIDATE

    def _candidate_loop(self):
        """Candidate: start a new election for the next term."""
        self.current_term += 1
        self.voted_for = self.node_id
        self.votes_received = {self.node_id}
        print(f"Node {self.node_id}: Starting election for term {self.current_term}")

        # Request votes from peers. In real Raft this is a RequestVote RPC;
        # here we simply simulate every peer granting its vote.
        for peer in self.peers:
            if peer != self.node_id:
                self.votes_received.add(peer)

        majority = len(self.peers) // 2 + 1
        if len(self.votes_received) >= majority:
            print(f"Node {self.node_id}: Won election with {len(self.votes_received)} votes")
            self.state = NodeState.LEADER
            self.leader_id = self.node_id
        else:
            # Election failed; wait out the timeout, then retry with a new term
            time.sleep(self.election_timeout)
            print(f"Node {self.node_id}: Election failed, retrying")

    def _leader_loop(self):
        """Leader: send periodic heartbeats so followers don't start elections."""
        print(f"Node {self.node_id}: Sending heartbeats (term {self.current_term})")
        for peer in self.peers:
            if peer != self.node_id:
                # In real Raft: AppendEntries RPC with no entries (heartbeat)
                pass
        time.sleep(self.heartbeat_interval)

    def receive_heartbeat(self, term: int, leader_id: int):
        """Handle a heartbeat from the current leader."""
        if term >= self.current_term:
            self.current_term = term
            self.leader_id = leader_id
            self.state = NodeState.FOLLOWER
            self.last_heartbeat = time.time()
            print(f"Node {self.node_id}: Received heartbeat from leader {leader_id}")

    def stop(self):
        """Stop the node's background thread."""
        self.running = False


# Usage example: without real RPCs every node eventually elects itself,
# but the state transitions and timing mirror real Raft.
if __name__ == '__main__':
    nodes = [
        RaftNode(node_id=1, peers=[1, 2, 3]),
        RaftNode(node_id=2, peers=[1, 2, 3]),
        RaftNode(node_id=3, peers=[1, 2, 3]),
    ]

    time.sleep(5)   # let elections run

    for node in nodes:
        node.stop()
Related Content
Prerequisites:
- Leader-Follower Replication - Replication patterns
- Quorum - Majority voting
Related Concepts:
- Eventual Consistency - Alternative consistency model
- Sharding - Partitioning with consensus
Used In Systems:
- etcd: Kubernetes configuration and coordination
- Consul: Service discovery and configuration
- ZooKeeper: Distributed coordination for Kafka, Hadoop
Explained In Detail:
- Distributed Systems Deep Dive - Consensus algorithms in depth
Quick Self-Check
- Can explain the consensus problem in 60 seconds?
- Understand Raft leader election process?
- Know how log replication works?
- Can explain how consensus handles network partitions?
- Understand when to use consensus vs eventual consistency?
- Know real systems using consensus (etcd, ZooKeeper)?