
Consensus

How distributed systems agree on a single value or state across multiple nodes, enabling coordination despite failures and network partitions

TL;DR

Consensus is a fundamental problem in distributed systems where multiple nodes must agree on a single value or decision, even in the presence of failures. Algorithms like Raft and Paxos solve this by using leader election, quorums, and log replication to ensure all nodes eventually agree on the same state. Consensus is essential for distributed locks, configuration management, and coordination services like etcd, Consul, and ZooKeeper.

Visual Overview

Consensus Overview

Core Explanation

What is Consensus?

Consensus is the problem of getting multiple distributed nodes to agree on a single value, even when:

  • Nodes fail (crash)
  • Messages are delayed or lost
  • Network partitions occur

Consensus Properties:

  1. Agreement: All non-faulty nodes decide on the same value
  2. Validity: The decided value must have been proposed by some node
  3. Termination: All non-faulty nodes eventually decide
  4. Integrity: Nodes decide at most once
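
The first two properties are safety conditions you can verify over the outcome of a single consensus instance. A minimal sketch of that check, in illustrative Python (not from any consensus library):

def check_consensus(proposed, decided):
    """Check Agreement and Validity for one consensus instance.

    proposed: the set of values proposed by nodes.
    decided: mapping of non-faulty node id -> the value it decided.
    """
    values = set(decided.values())
    agreement = len(values) <= 1                 # all nodes decided the same value
    validity = values.issubset(set(proposed))    # the decision was actually proposed
    return agreement and validity

# All nodes decide "A", which was proposed: Agreement and Validity hold
assert check_consensus(proposed={"A", "B"}, decided={1: "A", 2: "A", 3: "A"})
# Nodes disagree: Agreement is violated
assert not check_consensus(proposed={"A", "B"}, decided={1: "A", 2: "B", 3: "A"})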

Real-World Analogies:

  • A board of directors voting on a decision
  • A jury reaching a verdict
  • Lawmakers passing legislation

Why Consensus is Hard (FLP Impossibility)

The FLP Result (1985):

Fischer, Lynch, and Paterson proved that in a fully asynchronous system, no deterministic consensus algorithm can guarantee termination if even a single node may crash. Practical algorithms such as Raft and Paxos work around this by assuming partial synchrony (timeouts and heartbeats): they preserve safety unconditionally and achieve liveness only while the network behaves well enough.

FLP Impossibility Result

Consensus Algorithms

1. Raft (Understandable Consensus)

Raft Design Goals

Raft Leader Election:

Every node starts as a follower. A follower that misses heartbeats for an election timeout becomes a candidate, increments the term, votes for itself, and requests votes from its peers; a candidate that gathers votes from a majority becomes leader and starts sending heartbeats (a runnable sketch of this flow appears in the Code Example section below).

Raft Leader Election

Raft Log Replication:

Raft Log Replication
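
The core of Raft log replication is the AppendEntries consistency check: a follower accepts new entries only if its log already holds an entry matching the leader's prevLogIndex and prevLogTerm; otherwise the leader backs up and retries with an earlier index. A minimal, illustrative sketch of the follower-side check (simplified from the Raft paper, not a production implementation):

def handle_append_entries(log, prev_log_index, prev_log_term,
                          entries, leader_commit, commit_index):
    """Follower-side AppendEntries handling (log portion only).

    log is a list of (term, command) pairs; indexes are 1-based as in the
    Raft paper. Returns (success, new_log, new_commit_index).
    """
    # Consistency check: the follower must already hold the entry the leader
    # believes immediately precedes the new ones, with the same term.
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False, log, commit_index  # leader decrements nextIndex and retries

    # Simplification: overwrite everything after prev_log_index with the
    # leader's entries (real Raft truncates only on an actual conflict).
    new_log = log[:prev_log_index] + list(entries)

    # A follower may advance its commit index up to what the leader committed.
    new_commit_index = max(commit_index, min(leader_commit, len(new_log)))
    return True, new_log, new_commit_index

Once an entry is stored on a majority of logs, the leader marks it committed and every node applies it to its state machine in log order, which is what keeps all replicas identical.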

2. Paxos (Classic Consensus)

Paxos Phases
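
In single-decree Paxos the two phases reduce to two rules on the acceptor: promise to ignore proposals numbered below the highest prepare seen (Phase 1), and accept a value unless a higher-numbered promise has since been made (Phase 2). A minimal, illustrative acceptor sketch (not tied to any particular library):

class PaxosAcceptor:
    """Single-decree Paxos acceptor: tracks the highest promise and any accepted value."""

    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # proposal number of the accepted value, if any
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1b: promise to reject lower-numbered proposals, and report any
        # previously accepted value so the proposer can adopt it.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", n, self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n)

    def on_accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject", self.promised_n)

A proposer that collects promises from a majority must propose the value from the highest-numbered accepted reply, if any; that rule is what keeps Paxos safe when multiple proposers compete.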

Paxos vs Raft:

Paxos vs Raft Comparison

Handling Failures

Node Failures:

Node Failures

Network Partitions:

Network Partitions
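
The reason the minority side of a partition stalls is majority-quorum arithmetic: only one side of any partition can hold a majority, so only one side can elect a leader or commit writes. A quick illustration:

def quorum_size(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Crash failures (or partitioned-away nodes) the cluster can survive."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n}-node cluster: quorum = {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failed or partitioned nodes")

# During a partition, only the side holding at least quorum_size(n) nodes can
# make progress; the minority side refuses writes, which is what prevents
# split-brain (two leaders committing conflicting values).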

Use Cases

1. Distributed Configuration

Distributed Configuration

2. Leader Election

Leader Election

3. Distributed Locks

Distributed Locks

Real Systems Using Consensus

System      | Algorithm       | Use Case                       | Key Features
etcd        | Raft            | Kubernetes config, locks       | Strongly consistent key-value store
Consul      | Raft            | Service discovery, config      | Multi-datacenter support
ZooKeeper   | Zab (Raft-like) | Coordination, leader election  | Widely adopted (Kafka, Hadoop)
CockroachDB | Raft            | Distributed SQL                | Range-level consensus
TiKV        | Raft            | Distributed key-value          | Part of the TiDB database
Spanner     | Paxos           | Google’s distributed database  | Global consistency with TrueTime

Case Study: etcd with Raft

etcd with Raft Architecture
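
To make the case study concrete, here is a sketch of the client-side experience using the python-etcd3 library; the method names reflect that client as best recalled, so treat the exact signatures (especially lock()) as assumptions and verify them against the library's documentation:

import etcd3

# Connect to one etcd member; writes are forwarded to the Raft leader and
# replicated to a majority before they are acknowledged.
client = etcd3.client(host="127.0.0.1", port=2379)

# Strongly consistent configuration: once put() returns, the value has been
# committed through Raft and linearizable reads will see it.
client.put("/config/feature-flags/new-ui", "enabled")
value, _metadata = client.get("/config/feature-flags/new-ui")

# Lease-backed distributed lock: if this process crashes, the lease expires
# and the lock is released automatically (no orphaned locks).
lock = client.lock("jobs/nightly-report", ttl=30)  # signature assumed
if lock.acquire(timeout=5):
    try:
        pass  # critical section: at most one holder cluster-wide
    finally:
        lock.release()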

Case Study: ZooKeeper

ZooKeeper (Zab Protocol)
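
For comparison, the same coordination primitives through ZooKeeper's kazoo client (Zab plays the role that Raft plays in etcd); the method names are from kazoo as best recalled, so treat the exact signatures as assumptions:

from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble; Zab replicates every write to a quorum.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral znode: removed automatically when this session ends, which is how
# ZooKeeper detects crashed lock or lease holders.
zk.create("/services/worker-1", b"10.0.0.5:8080", ephemeral=True, makepath=True)

# Built-in leader-election recipe: blocks until this client wins, then runs
# the callable for as long as it holds leadership.
election = zk.Election("/election/report-job", identifier="worker-1")
election.run(lambda: print("worker-1 is now the leader"))

zk.stop()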

When to Use Consensus

✓ Perfect Use Cases

Distributed Configuration Management

Distributed Configuration Management Use Case

Leader Election

Leader Election Use Case

Distributed Locks

Distributed Locks Use Case

✕ When NOT to Use

High-Throughput Data Storage

High-Throughput Data Storage Warning

Multi-Datacenter with Low Latency

Multi-Datacenter Warning

Simple Use Cases

Simple Use Cases Warning

Interview Application

Common Interview Question

Q: “How would you implement a distributed lock service that survives node failures and network partitions?”

Strong Answer:

“I’d build a distributed lock service using consensus (Raft):

Architecture:

  • 3 or 5 node cluster running Raft consensus
  • Lease-based locks with automatic expiration
  • Strong consistency guarantees (linearizable)

Lock Acquisition:

AcquireLock(lockName, leaseDuration):
  1. Generate unique client ID
  2. Send to leader: CREATE lock/{lockName} with clientID, lease
  3. Raft replicates to majority (quorum)
  4. If successfully created: Return lock token
  5. If already exists: Return failure (lock held)

Lock Release:

ReleaseLock(lockName, clientID):
  1. Send to leader: DELETE lock/{lockName} if owner==clientID
  2. Raft replicates deletion
  3. Majority ACK → lock released

Lease Expiration:

  • Lock has TTL (e.g., 30 seconds)
  • Client must renew lease (heartbeat every 10s)
  • If client crashes: Lease expires, lock auto-released
  • Prevents orphaned locks
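
A minimal sketch of how each node applies these committed lock commands to its replicated state machine; applying them deterministically in log order is what keeps every replica's lock table identical (the names and command format here are illustrative):

def apply_lock_command(locks, command):
    """Apply one committed lock command to the lock table.

    locks: dict of lock name -> (owner_client_id, lease_expiry_timestamp).
    command: {"op": "acquire" | "release", "name": ..., "client_id": ...,
              "ttl": ..., "ts": <timestamp carried inside the committed entry>}
    Carrying the timestamp in the entry keeps the apply step deterministic.
    """
    name, now = command["name"], command["ts"]

    # Lazily drop an expired lease before evaluating the command.
    if name in locks and locks[name][1] <= now:
        del locks[name]

    if command["op"] == "acquire":
        if name in locks:
            return False                                  # held by a live lease
        locks[name] = (command["client_id"], now + command["ttl"])
        return True

    if command["op"] == "release":
        if name in locks and locks[name][0] == command["client_id"]:
            del locks[name]                               # only the owner may release
            return True
        return False

    raise ValueError(f"unknown op: {command['op']!r}")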

Handling Failures:

  1. Client Failure:
    • Lease expires → lock released
    • Other clients can acquire lock
  2. Leader Failure:
    • Followers detect missing heartbeats
    • New election (majority quorum)
    • New leader has all committed locks
    • Processing resumes
  3. Network Partition:
    • Majority partition: Can grant/release locks ✓
    • Minority partition: Cannot grant locks ✗ (no quorum)
    • Prevents split-brain (no duplicate locks)

Consistency Guarantees:

  • Mutual Exclusion: Only one client holds lock (consensus ensures)
  • Deadlock-Free: Leases prevent orphaned locks
  • Fault Tolerance: Survives minority failures

API Design:

# Acquire the lock with a 30-second lease
token = lock_service.acquire("my-lock", ttl=30)

if token:
    try:
        # Do the critical work while holding the lock
        process_job()
    finally:
        lock_service.release("my-lock", token)
else:
    # Lock is held by another client
    retry_later()

Trade-offs:

  • Latency: 10-50ms (consensus overhead)
  • Throughput: ~1000s locks/sec (consensus bottleneck)
  • Availability: Requires majority (unavailable during partition)

These trade-offs are acceptable for coordination use cases where consistency matters more than raw performance.

Real-World Example: etcd implements this exact design for Kubernetes distributed locks”

Code Example

Simplified Raft-Style Leader Election

import time
import random
import threading
from enum import Enum
from typing import Dict, List

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    """Simplified Raft node (leader election only)"""

    def __init__(self, node_id: int, peers: List[int]):
        self.node_id = node_id
        self.peers = peers
        self.state = NodeState.FOLLOWER

        self.current_term = 0
        self.voted_for = None
        self.leader_id = None

        # Timeouts
        self.election_timeout = random.uniform(150, 300) / 1000  # 150-300 ms, stored in seconds
        self.heartbeat_interval = 50 / 1000  # 50ms
        self.last_heartbeat = time.time()

        # Vote tracking
        self.votes_received = set()

        # Thread for running election/heartbeat logic
        self.running = True
        self.thread = threading.Thread(target=self.run, daemon=True)
        self.thread.start()

    def run(self):
        """Main loop for node"""
        while self.running:
            if self.state == NodeState.FOLLOWER:
                self._follower_loop()
            elif self.state == NodeState.CANDIDATE:
                self._candidate_loop()
            elif self.state == NodeState.LEADER:
                self._leader_loop()

            time.sleep(0.01)  # 10ms tick

    def _follower_loop(self):
        """Follower waits for heartbeat"""
        if time.time() - self.last_heartbeat > self.election_timeout:
            print(f"Node {self.node_id}: Election timeout, becoming candidate")
            self.state = NodeState.CANDIDATE

    def _candidate_loop(self):
        """Candidate runs election"""
        # Start new election
        self.current_term += 1
        self.voted_for = self.node_id
        self.votes_received = {self.node_id}

        print(f"Node {self.node_id}: Starting election for term {self.current_term}")

        # Request votes from peers (simplified: assume every peer grants its vote)
        # In real Raft this is a RequestVote RPC, and each peer grants at most
        # one vote per term
        self.votes_received.update(self.peers)

        majority = len(self.peers) // 2 + 1

        if len(self.votes_received) >= majority:
            print(f"Node {self.node_id}: Won election with {len(self.votes_received)} votes")
            self.state = NodeState.LEADER
            self.leader_id = self.node_id
        else:
            # Not enough votes: wait out the election timeout and try again
            time.sleep(self.election_timeout)
            print(f"Node {self.node_id}: Election failed, retrying")

    def _leader_loop(self):
        """Leader sends heartbeats"""
        print(f"Node {self.node_id}: Sending heartbeats (term {self.current_term})")

        # Send heartbeats to all followers
        # (In real Raft: AppendEntries RPC)
        for peer in self.peers:
            if peer != self.node_id:
                # Send heartbeat...
                pass

        time.sleep(self.heartbeat_interval)

    def receive_heartbeat(self, term: int, leader_id: int):
        """Handle heartbeat from leader"""
        if term >= self.current_term:
            self.current_term = term
            self.leader_id = leader_id
            self.state = NodeState.FOLLOWER
            self.last_heartbeat = time.time()

            print(f"Node {self.node_id}: Received heartbeat from leader {leader_id}")

    def stop(self):
        """Stop node"""
        self.running = False

# Usage Example
if __name__ == '__main__':
    # Create 3-node cluster
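    # NOTE: the nodes in this simplified demo are not actually connected, so
    # each one runs (and wins) its own simulated election; real Raft nodes
    # exchange RequestVote RPCs and at most one leader is elected per term.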
    nodes = [
        RaftNode(node_id=1, peers=[1, 2, 3]),
        RaftNode(node_id=2, peers=[1, 2, 3]),
        RaftNode(node_id=3, peers=[1, 2, 3])
    ]

    # Let election run
    time.sleep(5)

    # Stop all nodes
    for node in nodes:
        node.stop()

Used In Systems:

  • etcd: Kubernetes configuration and coordination
  • Consul: Service discovery and configuration
  • ZooKeeper: Distributed coordination for Kafka, Hadoop

Explained In Detail:

  • Distributed Systems Deep Dive - Consensus algorithms in depth

Quick Self-Check

  • Can you explain the consensus problem in 60 seconds?
  • Do you understand the Raft leader election process?
  • Do you know how log replication works?
  • Can you explain how consensus handles network partitions?
  • Do you understand when to use consensus versus eventual consistency?
  • Do you know real systems that use consensus (etcd, ZooKeeper)?

Interview Notes

  • Relevance: must-know topic; appears in roughly 75% of L6+ interviews
  • Production impact: powers coordination in etcd, ZooKeeper, and Consul
  • Key applications: distributed locks, leader election