
Distributed Systems Basics

The fundamental concepts of distributed computing: how multiple machines coordinate to appear as a single coherent system, navigating network partitions, failures, and the CAP theorem

TL;DR

A distributed system is a collection of independent computers that appear to users as a single coherent system. The fundamental challenge: coordinating multiple machines over an unreliable network while handling partial failures. Key constraints are captured by the CAP theorem—during network partitions, you must choose between consistency and availability.

Visual Overview

(Figure: Distributed Systems Overview)

Why Distributed Systems?

The Single Machine Limit

Every system eventually hits the limits of a single machine:

Constraint      Single Machine Limit     Distributed Solution
CPU             ~128 cores               1000s of machines
Memory          ~12 TB                   Petabytes across cluster
Storage         ~100 TB                  Exabytes across cluster
Availability    Machine fails = down     Redundancy, failover
Latency         Geography bound          Edge locations worldwide

Core Motivations

  1. Scalability: Handle more load than one machine can
  2. Availability: Continue operating despite failures
  3. Latency: Serve users from nearby locations
  4. Cost: Commodity hardware vs expensive mainframes

The CAP Theorem

Proven by Seth Gilbert and Nancy Lynch (2002), building on Eric Brewer’s conjecture:

During a network partition, a distributed system must choose between Consistency and Availability.

The Three Properties

Consistency (C): Every read returns the most recent write or an error. All nodes see the same data at the same time.

Availability (A): Every request to a non-failing node receives a response (no timeouts, no errors).

Partition Tolerance (P): The system continues operating despite arbitrary network partitions.

Why You Can’t Have All Three

Consider two nodes during a network partition:

Node A                    Node B
   │                         │
   │    Network Partition    │
   │        ═══════X═════    │
   │                         │
Client writes X=1 to A       │
   │                         │
   │   Cannot replicate      │
   │   to B (partition)      │
   │                         │
   │                   Client reads X from B
   │                         │
   │        CHOICE:          │
   │   Return X=0 (stale)    │   AP: Available but inconsistent
   │   Return error          │   CP: Consistent but unavailable
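
To make the trade-off concrete, here is a minimal Python sketch of the scenario above (the `Replica` class and its fields are illustrative, not taken from any real system): a replica that missed the write X=1 because of the partition either answers with the stale value (AP) or refuses to answer (CP).

```python
# Minimal sketch (illustrative names): a replica cut off from the writer must
# either serve stale data (AP) or refuse to serve at all (CP).
class PartitionError(Exception):
    """Raised when a CP replica refuses to answer during a partition."""

class Replica:
    def __init__(self, mode: str):
        self.mode = mode            # "AP" or "CP"
        self.data = {"X": 0}        # stale: the write X=1 never replicated here
        self.partitioned = True     # link to the other side is down

    def read(self, key: str) -> int:
        if self.partitioned and self.mode == "CP":
            # Consistent but unavailable: better to error than serve stale data.
            raise PartitionError("cannot confirm latest value; try another node")
        # Available but possibly stale: answer with whatever is stored locally.
        return self.data[key]

print(Replica("AP").read("X"))      # -> 0 (stale but available)
try:
    Replica("CP").read("X")
except PartitionError as err:
    print("CP refused:", err)
```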

Real-World CAP Choices

System       Choice         Behavior During Partition
ZooKeeper    CP             Minority partition stops serving
etcd         CP             Requires quorum for reads/writes
Cassandra    AP             Continues serving, eventual consistency
DynamoDB     AP (default)   Eventually consistent reads
Spanner      CP             Sacrifices availability for consistency

Fallacies of Distributed Computing

Classic misconceptions that lead to system failures:

  1. The network is reliable — Packets drop, connections break
  2. Latency is zero — Cross-continent: 100-200ms RTT
  3. Bandwidth is infinite — Congestion, throttling
  4. The network is secure — Man-in-the-middle attacks, spoofing
  5. Topology doesn’t change — Nodes join, leave, fail
  6. There is one administrator — Multi-team, multi-org
  7. Transport cost is zero — Serialization, bandwidth costs
  8. The network is homogeneous — Different protocols, versions

Key Concepts

Network Partitions

A partition occurs when network failure divides the cluster:

Before Partition:
[N1]  [N2]  [N3]  [N4]  [N5]

After Partition:
[N1]  [N2]     ═══X═══     [N3]  [N4]  [N5]
 └─ Minority ─┘               └─── Majority ───┘

Split-brain: Both sides think they’re the leader → data divergence.
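
A common defense against split-brain is to let a node act as leader only while it can reach a majority of the cluster, so the minority side steps down. A minimal sketch of that rule (the `can_lead` helper is illustrative):

```python
# Sketch of quorum-based leadership: a node keeps acting as leader only while
# it can reach a strict majority of the cluster, so at most one side of a
# partition can have a leader.
def can_lead(reachable_peers: int, cluster_size: int) -> bool:
    # Count this node plus the peers it can still contact; require a strict majority.
    return (reachable_peers + 1) > cluster_size // 2

# 5-node cluster partitioned into [N1, N2] and [N3, N4, N5]:
print(can_lead(reachable_peers=1, cluster_size=5))   # False: minority side steps down
print(can_lead(reachable_peers=2, cluster_size=5))   # True: majority side may keep a leader
```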

Failure Detection

How do you know if a node is dead or just slow?

Method        Mechanism                        Trade-off
Heartbeats    Periodic “I’m alive” messages    Timeout = false positives
Phi Accrual   Probability of failure           More accurate, complex
Gossip        Nodes share failure info         Eventually consistent
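
A heartbeat-based detector is the simplest of these. The sketch below (class and method names are illustrative) marks a node as suspect once no heartbeat has arrived within a timeout, which is exactly where the false positives come from: a slow or partitioned node looks identical to a dead one.

```python
import time

class HeartbeatDetector:
    """Sketch of a timeout-based failure detector (illustrative names)."""

    def __init__(self, timeout_s: float = 3.0):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def record_heartbeat(self, node: str) -> None:
        # Called whenever an "I'm alive" message arrives from `node`.
        self.last_seen[node] = time.monotonic()

    def suspected_dead(self, node: str) -> bool:
        # A merely slow or partitioned node exceeds the timeout just like a
        # dead one, which is why this approach produces false positives.
        last = self.last_seen.get(node)
        return last is None or (time.monotonic() - last) > self.timeout_s

detector = HeartbeatDetector(timeout_s=3.0)
detector.record_heartbeat("node-b")
print(detector.suspected_dead("node-b"))   # False: heartbeat just arrived
```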

Ordering and Time

In distributed systems, “happened before” is ambiguous:

Clock Type       Mechanism                  Use Case
Lamport Clocks   Logical counters           Causal ordering
Vector Clocks    Per-node counters          Detect conflicts
TrueTime         Atomic clocks + GPS        Google Spanner
HLC              Hybrid logical/physical    CockroachDB
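
As a concrete example, a Lamport clock takes only a few lines: every local event increments a counter, and every received message pushes the counter past the sender's timestamp, so "send happened before receive" is reflected in the numbers. The sketch below is illustrative, not any particular library's API.

```python
class LamportClock:
    """Sketch of a Lamport logical clock: counters that capture causal order."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event: just advance the counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # On receive, jump past the sender's timestamp so the receive event
        # is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()            # a's clock: 1
t_recv = b.receive(t_send)   # b's clock: 2 (> 1, causal order preserved)
print(t_send, t_recv)
```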

Patterns for Handling Failures

1. Replication

Copy data to multiple nodes:

  • Leader-Follower: One writes, others follow
  • Multi-Leader: Multiple writers, conflict resolution
  • Leaderless: All nodes equal, quorum writes
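
A minimal leader-follower sketch (illustrative classes, with synchronous replication for simplicity; real systems often replicate asynchronously, so followers may briefly lag):

```python
# Sketch of leader-follower (primary-backup) replication: all writes go to the
# leader, which forwards them to followers; reads may hit any replica.
class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        # Synchronous replication keeps the example simple; asynchronous
        # replication would return before every follower has applied the write.
        for follower in self.followers:
            follower.apply(key, value)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 1)
print(followers[0].data["x"])   # 1: the write is visible on a follower
```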

2. Quorum Systems

Require overlapping read and write quorums: W + R > N

N = 5 replicas
W = 3 (write to 3 nodes)
R = 3 (read from 3 nodes)

At least one node in R saw the write in W
→ Guarantees a read sees the latest acknowledged write
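
The overlap argument can be checked directly: with N = 5, W = 3, R = 3, the write set and the read set together name six replicas but there are only five, so by the pigeonhole principle they must share at least one. A small illustrative sketch:

```python
import random

# Sketch of the W + R > N overlap argument: if a write reaches W of N replicas
# and a read consults R replicas, at least one replica appears in both sets,
# so the read can return the latest acknowledged value.
N, W, R = 5, 3, 3
replicas = list(range(N))

write_set = set(random.sample(replicas, W))   # replicas that acknowledged the write
read_set = set(random.sample(replicas, R))    # replicas consulted by the read

assert W + R > N
assert write_set & read_set, "W + R > N guarantees the sets overlap"
print("overlap:", write_set & read_set)
```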

3. Consensus Algorithms

Agree on a single value despite failures:

  • Paxos: Theoretical foundation, complex
  • Raft: Understandable consensus
  • Zab: ZooKeeper’s algorithm
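
As a rough illustration of how these protocols keep elections safe, here is a highly simplified, Raft-flavored voting rule: each node grants at most one vote per term, so at most one candidate can gather a majority in that term. This is a sketch only; real Raft additionally requires the candidate's log to be at least as up to date as the voter's.

```python
# Simplified sketch of a Raft-style vote handler (not the full algorithm).
class Voter:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, candidate_id: str, term: int) -> bool:
        if term < self.current_term:
            return False                     # stale candidate from an old term
        if term > self.current_term:
            self.current_term = term         # newer term: forget any earlier vote
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True                      # grant our single vote for this term
        return False                         # already voted for someone else

voter = Voter()
print(voter.request_vote("A", term=1))   # True
print(voter.request_vote("B", term=1))   # False: one vote per term
```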

Used In Systems:

  • Every cloud-native application
  • Databases (PostgreSQL, MySQL with replication)
  • Message queues (Kafka, RabbitMQ)
  • Coordination services (ZooKeeper, etcd, Consul)

Next Recommended: Consensus - Learn how distributed systems agree on values despite failures

Interview Notes
⭐ Must-Know

  • Interview Relevance: ~90% of system design interviews
  • Production Impact: powers every cloud-native system
  • Performance: dominated by network latency and coordination overhead
  • Scalability: the foundation of horizontal scaling