TL;DR
Failover is the process of automatically transferring operations from a failed primary system to a standby backup, minimizing downtime and ensuring service continuity. It’s essential for high-availability systems, enabling recovery from hardware failures, software crashes, and network issues within seconds to minutes.
Visual Overview
NORMAL OPERATION (Leader-Follower) ┌────────────────────────────────────────────────────┐ │ │ │ Clients → Primary/Leader │ │ ↓ │ │ [Replicating] │ │ ↓ │ │ Standby/Follower (Hot/Warm/Cold) │ │ │ │ Status: Primary is HEALTHY ✓ │ └────────────────────────────────────────────────────┘ FAILURE DETECTION ┌────────────────────────────────────────────────────┐ │ Heartbeat Monitor │ │ ↓ │ │ Primary ✗ (No heartbeat for 10 seconds) │ │ ↓ │ │ Failure Threshold Exceeded │ │ ↓ │ │ TRIGGER FAILOVER! │ └────────────────────────────────────────────────────┘ FAILOVER PROCESS ┌────────────────────────────────────────────────────┐ │ T0: Primary fails (crash, network partition) │ │ T1: Health checker detects failure (5-10s) │ │ T2: Standby promoted to new primary (2-5s) │ │ T3: Clients redirected to new primary (1-2s) │ │ T4: Service restored │ │ │ │ Total downtime: 8-17 seconds │ │ │ │ New topology: │ │ Clients → New Primary (old Standby) ✓ │ │ ↓ │ │ Old Primary (offline) ✗ │ └────────────────────────────────────────────────────┘ SPLIT-BRAIN PROBLEM (What Can Go Wrong) ┌────────────────────────────────────────────────────┐ │ Network Partition: │ │ │ │ Region A ║ Region B │ │ Primary ✓ ║ Standby │ │ (still alive) ║ (promoted to Primary) ✓ │ │ ↓ ║ ↓ │ │ Clients A ║ Clients B │ │ write X=1 ║ write X=2 │ │ ║ │ │ Result: TWO PRIMARIES = DATA DIVERGENCE ✕ │ │ │ │ Solution: Fencing (kill old primary with authority)│ └────────────────────────────────────────────────────┘
Core Explanation
What is Failover?
Failover is the automatic or manual process of switching from a failed primary system to a standby backup system to maintain service availability. It’s a key component of high-availability (HA) architectures.
Key Components:
- Primary/Leader: Active system handling requests
- Standby/Follower: Backup system ready to take over
- Health Monitor: Detects primary failures via heartbeats
- Failover Orchestrator: Promotes standby to primary
- Fencing Mechanism: Prevents split-brain scenarios
Standby Types
1. Hot Standby (Active-Passive)
Setup: Primary: Handles all traffic Standby: Fully synced, ready to serve immediately Replication: Continuous, synchronous or near-sync Failover Time: Seconds (fastest) Cost: High (standby hardware running idle) Example: Primary DB: PostgreSQL with streaming replication Standby DB: Replica in sync, can become primary instantly Use Case: Financial systems, databases (99.99% uptime)
2. Warm Standby
Setup: Primary: Handles all traffic Standby: Partially synced, needs brief preparation Replication: Periodic or continuous async Failover Time: Minutes Cost: Medium (standby runs but lighter workload) Example: Primary: Application server with session state Standby: Server running but not in load balancer Use Case: Web applications, moderate SLA (99.9% uptime)
3. Cold Standby
Setup: Primary: Handles all traffic Standby: Offline, restored from backup Replication: Periodic backups only Failover Time: Hours (restore from backup + configure) Cost: Low (standby hardware can be repurposed) Example: Primary: Production database Standby: Backup snapshots on S3, restore to new instance Use Case: Development systems, cost-sensitive applications
Failure Detection
Heartbeat Monitoring:
HEARTBEAT PROTOCOL: ┌─────────────────────────────────────────────┐ │ Every 1-5 seconds: │ │ │ │ Primary → "I'm alive" → Health Monitor │ │ │ │ If no heartbeat for N seconds: │ │ - N = 10s: Aggressive (false positives) │ │ - N = 30s: Conservative (slower failover) │ │ - N = 10s with 3 retries: Balanced ✓ │ └─────────────────────────────────────────────┘ What to Monitor: ✓ Process is running (liveness probe) ✓ Service is responsive (readiness probe) ✓ Database connections work ✓ Disk space available ✓ CPU/Memory not exhausted False Positive Causes: - Temporary network glitch - GC pause (Java stop-the-world) - High CPU causing timeout - Switch/router failure (not server failure) Mitigation: - Multiple independent monitors - Quorum-based decision (3/5 monitors agree) - Grace period + retries
Failover Process Steps
Step-by-Step Breakdown:
1. FAILURE DETECTION (5-30s) - Health monitor misses N heartbeats - Multiple monitors reach consensus - Declare primary unhealthy 2. FENCING (1-5s) - Prevent split-brain - Kill old primary (STONITH: Shoot The Other Node In The Head) - Revoke old primary's network access - Acquire distributed lock/lease 3. PROMOTION (2-10s) - Standby catches up replication lag (if any) - Standby promoted to primary role - Update metadata (e.g., Kafka controller registry) 4. DNS/ROUTING UPDATE (1-60s) - Update DNS to point to new primary - Or update load balancer configuration - Or use floating IP (instant failover) 5. CLIENT RECONNECTION (0-30s) - Clients detect connection failure - Retry with backoff - Discover new primary endpoint - Resume operations Total Downtime: 8-135 seconds (depends on config)
RTO vs RPO
Recovery Time Objective (RTO):
RTO = Maximum acceptable downtime Examples: - E-commerce site: RTO = 5 minutes (lose sales) - Payment processing: RTO = 30 seconds (critical) - Analytics dashboard: RTO = 30 minutes (non-critical) Factors affecting RTO: - Standby type (hot < warm < cold) - Failure detection time - Promotion process complexity - Client reconnection speed
Recovery Point Objective (RPO):
RPO = Maximum acceptable data loss Examples: - Bank transactions: RPO = 0 (no loss acceptable) - User comments: RPO = 5 minutes (acceptable) - Log aggregation: RPO = 1 hour (can lose some logs) Factors affecting RPO: - Replication strategy (sync vs async) - Commit acknowledgment (quorum required?) - Checkpoint frequency - Backup frequency (for cold standby) TRADEOFF: Low RPO (sync replication) = Higher latency High RPO (async replication) = Lower latency, more data loss
Split-Brain Problem
What is Split-Brain?
Network partition causes both nodes to think they're primary: Before partition: Primary A (serving traffic) ← Health Monitor → Standby B After partition: Primary A (still serving) ║ Standby B (promoted, now serving) ↓ ║ ↓ Clients in Region A ║ Clients in Region B write X=1 ║ write X=2 PROBLEM: Divergent data, corruption when partition heals
Prevention Techniques:
1. Fencing (STONITH)
Before promoting standby, KILL old primary Methods: - Power off via IPMI/iLO - Network isolation (block at switch) - Kernel panic trigger - Forceful process termination Use case: Shared storage systems (SAN)
2. Distributed Locks / Leases
Acquire lock before becoming primary Example (etcd): 1. Standby tries to acquire lease: /leader 2. Only succeeds if old primary's lease expired 3. Old primary cannot write without valid lease 4. Result: At most one primary at a time ✓ Use case: Kafka controller election
3. Quorum / Witness Node
Use majority voting to determine primary Setup: 3 nodes (A, B, C) - A is primary (has 2/3 votes) - Network partition: A isolated, B+C can see each other - B+C have majority (2/3), can elect new primary - A has minority (1/3), cannot remain primary - Result: Only B or C can be new primary Use case: Distributed databases (Cassandra, Riak)
4. Generation Number / Epoch
Increment version number on each failover Primary A has epoch=5 After failover, Primary B has epoch=6 If A comes back, it sees epoch=5 < 6 A demotes itself and syncs from B Use case: Kafka, ZooKeeper
Real Systems Using Failover
| System | Failover Type | Detection Time | RTO | RPO | Split-Brain Prevention |
|---|---|---|---|---|---|
| PostgreSQL | Hot Standby (streaming replication) | 10-30s | ~30s | 0-5s | Witness server, fencing |
| MySQL | Semi-sync replication | 10-30s | ~1min | 0 (sync ack) | GTID, virtual IP |
| Redis Sentinel | Hot Standby (Sentinel monitors) | 5-15s | ~10s | 0-1s | Quorum (majority of Sentinels) |
| Kafka Controller | Hot Standby (ZooKeeper election) | 10-30s | ~20s | 0 (committed log) | ZooKeeper leader election, epoch |
| AWS RDS | Multi-AZ (automated failover) | 30-60s | 60-120s | 0 | AWS orchestration |
| Cassandra | Leaderless (no failover needed) | N/A | N/A | N/A | Quorum reads/writes |
Case Study: PostgreSQL Failover
PostgreSQL Streaming Replication + Patroni Architecture: ┌─────────────────────────────────────────────────┐ │ Primary (Leader) │ │ ↓ (continuous WAL streaming) │ │ Standby 1 (sync replica, zero lag) │ │ Standby 2 (async replica, small lag) │ │ │ │ Health Monitor: Patroni (uses etcd for DCS) │ └─────────────────────────────────────────────────┘ Failure Scenario: 1. Primary crashes (hardware failure) 2. Patroni on each node detects (10s heartbeat miss) 3. Standbys try to acquire leadership lock in etcd 4. Standby 1 (most up-to-date) acquires lock 5. Standby 1 runs pg_ctl promote 6. Standby 1 becomes new primary (~5s) 7. Patroni updates DNS or floating IP (2s) 8. Applications reconnect to new primary (~5s) Total downtime: ~22 seconds RPO: 0 (sync replication to Standby 1) Configuration: synchronous_standby_names = 'standby1' synchronous_commit = on wal_level = replica max_wal_senders = 5 Result: Zero data loss, ~20s downtime
Case Study: Kafka Controller Failover
Kafka Controller Election (via ZooKeeper) Normal Operation: - One broker is controller (manages partition leaders) - Controller broker has ephemeral node in ZooKeeper: /controller Failure: 1. Controller broker crashes 2. ZooKeeper detects session timeout (6-10s) 3. ZooKeeper deletes /controller ephemeral node 4. All brokers watch /controller for changes 5. Brokers race to create /controller node 6. First to create becomes new controller 7. New controller loads cluster metadata from ZK 8. New controller sends LeaderAndIsr requests to brokers 9. Partition leaders updated, cluster operational Total downtime: ~10-20s (partition leadership updates) RPO: 0 (committed messages replicated to ISR) Split-Brain Prevention: - ZooKeeper ensures only one /controller node - Controller Epoch incremented on each election - Brokers reject requests with old epoch
When to Use Failover
Use Failover When:
High Availability Required
Scenario: E-commerce checkout service Requirement: 99.95% uptime (4 hours downtime/year) Solution: Hot standby with automatic failover Trade-off: Cost of redundant infrastructure
Data Loss Unacceptable
Scenario: Payment transaction database Requirement: Zero data loss (RPO = 0) Solution: Synchronous replication to hot standby Trade-off: Higher write latency
RTO Measured in Seconds/Minutes
Scenario: Live video streaming control plane Requirement: Failover < 30 seconds Solution: Hot standby with fast health checks Trade-off: False positives from aggressive timeouts
When NOT to Use Failover:
Stateless Services
Problem: Failover is overkill Solution: Use load balancer with multiple active instances Example: Stateless REST APIs, web servers Benefit: Simpler, no failover orchestration needed
Eventually Consistent Systems
Problem: Failover adds complexity without benefit Solution: Multi-master or leaderless replication Example: Cassandra, DynamoDB (quorum writes) Benefit: No single point of failure, continuous availability
Cost-Sensitive Non-Critical Systems
Problem: Hot standby doubles infrastructure cost Solution: Use cold standby (restore from backup) Example: Development databases, analytics pipelines Benefit: Save money, accept longer downtime
Interview Application
Common Interview Question
Q: “Design a highly available database for a payment system. How would you handle primary database failure?”
Strong Answer:
“For a payment system where data loss is unacceptable, I’d design a hot standby failover system with these characteristics:
Architecture:
- Primary database with synchronous replication to hot standby
- Hot standby in different availability zone (same region for low latency)
- Health monitoring via Patroni or similar tool (10-second heartbeat)
- Distributed coordination using etcd or ZooKeeper for split-brain prevention
Failure Handling:
- Detection (10s): Patroni detects primary unresponsive after 2 missed heartbeats
- Fencing (2s): Revoke old primary’s write access via network isolation
- Promotion (5s): Standby promoted to primary, acquires leadership lock in etcd
- Routing (5s): Update floating IP or DNS to point to new primary
- Reconnection (5s): Applications retry connections, resume transactions
Guarantees:
- RTO: ~27 seconds (acceptable for payment system)
- RPO: 0 seconds (synchronous replication means zero data loss)
- Consistency: Transactions committed to both primary and standby before ACK
Trade-offs:
- Latency: Synchronous replication adds ~5-10ms to write latency
- Cost: Hot standby doubles database infrastructure cost
- Complexity: Patroni/etcd adds operational complexity
Split-Brain Prevention:
- Use distributed lock in etcd (only one primary can hold lock)
- Primary must renew lease every 5 seconds or lose write access
- Generation numbers (epochs) to detect stale primaries
Real-World Example:
- Similar to AWS RDS Multi-AZ: synchronous replication, automatic failover
- Or PostgreSQL + Patroni (used by Zalando, widely adopted)”
Why This Answer Works:
- Identifies appropriate failover type (hot standby) for use case
- Explains step-by-step process with timing
- Discusses RTO/RPO trade-offs explicitly
- Addresses split-brain problem proactively
- References real implementations
Code Example
Implementing Simple Failover with Health Checks
// Health Monitor for Failover Detection
class HealthMonitor {
constructor(primaryUrl, standbyUrl, config) {
this.primaryUrl = primaryUrl;
this.standbyUrl = standbyUrl;
this.config = {
heartbeatInterval: config.heartbeatInterval || 5000, // 5s
failureThreshold: config.failureThreshold || 3, // 3 misses
...config,
};
this.consecutiveFailures = 0;
this.currentPrimary = primaryUrl;
this.isFailedOver = false;
}
async start() {
setInterval(() => this.checkHealth(), this.config.heartbeatInterval);
}
async checkHealth() {
try {
const response = await fetch(`${this.currentPrimary}/health`, {
timeout: 3000, // 3s timeout
});
if (response.ok) {
// Primary is healthy
this.consecutiveFailures = 0;
console.log(
`[${new Date().toISOString()}] Primary healthy: ${this.currentPrimary}`
);
} else {
this.handleFailure();
}
} catch (error) {
// Network error, timeout, or server down
this.handleFailure();
}
}
handleFailure() {
this.consecutiveFailures++;
console.log(
`[${new Date().toISOString()}] Primary unhealthy (${this.consecutiveFailures}/${this.config.failureThreshold})`
);
if (this.consecutiveFailures >= this.config.failureThreshold) {
this.triggerFailover();
}
}
async triggerFailover() {
if (this.isFailedOver) {
console.log("Already failed over, skipping");
return;
}
console.log(`[${new Date().toISOString()}] TRIGGERING FAILOVER!`);
try {
// Step 1: Fence old primary (prevent split-brain)
await this.fenceOldPrimary();
// Step 2: Promote standby to primary
await this.promoteStandby();
// Step 3: Update routing
this.currentPrimary = this.standbyUrl;
this.isFailedOver = true;
this.consecutiveFailures = 0;
console.log(
`[${new Date().toISOString()}] Failover complete. New primary: ${this.currentPrimary}`
);
// Notify operators
await this.sendAlert("Failover completed", {
oldPrimary: this.primaryUrl,
newPrimary: this.standbyUrl,
});
} catch (error) {
console.error("Failover failed:", error);
await this.sendAlert("Failover FAILED", { error: error.message });
}
}
async fenceOldPrimary() {
// In production: disable old primary at network/firewall level
// Or send STONITH command to power management
console.log("Fencing old primary (preventing writes)...");
try {
await fetch(`${this.primaryUrl}/admin/disable`, {
method: "POST",
timeout: 2000,
});
} catch (error) {
// Old primary might be completely down, that's OK
console.log("Could not fence old primary (might be completely down)");
}
}
async promoteStandby() {
console.log("Promoting standby to primary...");
const response = await fetch(`${this.standbyUrl}/admin/promote`, {
method: "POST",
timeout: 10000, // Give it time to catch up replication
});
if (!response.ok) {
throw new Error("Failed to promote standby");
}
// Wait for promotion to complete
await this.waitForStandbyReady();
}
async waitForStandbyReady() {
const maxWait = 30000; // 30s max
const startTime = Date.now();
while (Date.now() - startTime < maxWait) {
try {
const response = await fetch(`${this.standbyUrl}/health`);
if (response.ok) {
const data = await response.json();
if (data.role === "primary") {
console.log("Standby successfully promoted to primary");
return;
}
}
} catch (error) {
// Still promoting, retry
}
await new Promise(resolve => setTimeout(resolve, 1000)); // Wait 1s
}
throw new Error("Standby promotion timeout");
}
async sendAlert(message, details) {
// In production: Send to PagerDuty, Slack, email, etc.
console.error(`ALERT: ${message}`, details);
}
}
// Usage
const monitor = new HealthMonitor(
"http://primary.db.example.com:5432",
"http://standby.db.example.com:5432",
{
heartbeatInterval: 5000, // Check every 5 seconds
failureThreshold: 3, // Failover after 3 consecutive failures
}
);
monitor.start();
// Expected timeline on failure:
// T0: Primary crashes
// T5: First health check fails (1/3)
// T10: Second health check fails (2/3)
// T15: Third health check fails (3/3) → Trigger failover
// T16: Fence old primary (1s)
// T21: Promote standby (5s)
// T22: Update routing
// Total downtime: ~22 seconds
Split-Brain Prevention with Distributed Lock
// Using etcd for distributed locking to prevent split-brain
const { Etcd3 } = require("etcd3");
class FailoverCoordinator {
constructor(etcdHosts, nodeId) {
this.etcd = new Etcd3({ hosts: etcdHosts });
this.nodeId = nodeId;
this.leaderKey = "/cluster/leader";
this.lease = null;
this.isLeader = false;
}
async tryBecomeLeader() {
try {
// Create a lease (TTL = 10 seconds)
this.lease = this.etcd.lease(10);
// Try to acquire leader lock with this lease
const result = await this.etcd
.if(this.leaderKey, "Create", "==", 0) // Only if key doesn't exist
.then(
this.etcd.put(this.leaderKey).value(this.nodeId).lease(this.lease)
)
.else(this.etcd.get(this.leaderKey))
.commit();
if (result.succeeded) {
this.isLeader = true;
console.log(`[${this.nodeId}] Became leader!`);
// Keep renewing lease to maintain leadership
this.startLeaseRenewal();
return true;
} else {
const currentLeader = result.responses[0].kvs[0].value.toString();
console.log(
`[${this.nodeId}] Failed to become leader. Current leader: ${currentLeader}`
);
return false;
}
} catch (error) {
console.error(`[${this.nodeId}] Error acquiring leadership:`, error);
return false;
}
}
async startLeaseRenewal() {
// Renew lease every 5 seconds (TTL is 10s, so we have buffer)
this.renewalInterval = setInterval(async () => {
try {
await this.lease.keepaliveOnce();
console.log(`[${this.nodeId}] Lease renewed`);
} catch (error) {
console.error(
`[${this.nodeId}] Failed to renew lease, losing leadership`
);
this.isLeader = false;
clearInterval(this.renewalInterval);
// Try to become leader again
setTimeout(() => this.tryBecomeLeader(), 1000);
}
}, 5000);
}
async stepDown() {
if (this.lease) {
await this.lease.revoke();
clearInterval(this.renewalInterval);
}
this.isLeader = false;
console.log(`[${this.nodeId}] Stepped down from leadership`);
}
canWrite() {
// Only leader can write (prevents split-brain)
return this.isLeader;
}
}
// Usage on Primary and Standby nodes:
// Primary node
const primary = new FailoverCoordinator(["localhost:2379"], "node-primary");
await primary.tryBecomeLeader(); // Acquires lock
// Standby node
const standby = new FailoverCoordinator(["localhost:2379"], "node-standby");
await standby.tryBecomeLeader(); // Fails (primary holds lock)
// Simulate primary failure (lease expires after 10s without renewal)
// ... network partition or crash ...
// After 10s, standby can acquire lock
await standby.tryBecomeLeader(); // Succeeds! Becomes new leader
// If old primary comes back:
await primary.tryBecomeLeader(); // Fails! Standby is now leader
// Old primary cannot write without leadership lock ✓
Related Content
Prerequisites:
- Distributed Systems Basics - Foundation concepts
Related Concepts:
- Leader-Follower Replication - Understanding replication for standby
- Consensus - Leader election algorithms
- Quorum - Majority-based decisions for failover
- Health Checks - Failure detection mechanisms
Used In Systems:
- High-availability databases (PostgreSQL, MySQL, Redis)
- Distributed coordination (ZooKeeper, etcd)
- Message brokers (Kafka controller election)
Explained In Detail:
- Distributed Systems Deep Dive - Failover patterns in depth
Quick Self-Check
- Can explain failover in 60 seconds?
- Understand difference between hot/warm/cold standby?
- Can explain RTO vs RPO trade-offs?
- Understand split-brain problem and prevention techniques?
- Know failure detection with heartbeats and thresholds?
- Can design failover for a production database?
Production signal