Failover is the process of automatically transferring operations from a failed primary system to a standby backup, minimizing downtime and ensuring service continuity. It’s essential for high-availability systems, enabling recovery from hardware failures, software crashes, and network issues within seconds to minutes.
Visual Overview
NORMAL OPERATION (Leader-Follower)
┌────────────────────────────────────────────────────
│ Clients → Primary/Leader
│               ↓
│         [Replicating]
│               ↓
│ Standby/Follower (Hot/Warm/Cold)
│
│ Status: Primary is HEALTHY ✓
└────────────────────────────────────────────────────

FAILURE DETECTION

┌────────────────────────────────────────────────────
│ Heartbeat Monitor
│         ↓
│ Primary ✗ (No heartbeat for 10 seconds)
│         ↓
│ Failure Threshold Exceeded
│         ↓
│ TRIGGER FAILOVER!
└────────────────────────────────────────────────────

FAILOVER PROCESS

┌────────────────────────────────────────────────────
│ T0: Primary fails (crash, network partition)
│ T1: Health checker detects failure (5-10s)
│ T2: Standby promoted to new primary (2-5s)
│ T3: Clients redirected to new primary (1-2s)
│ T4: Service restored
│
│ Total downtime: 8-17 seconds
│
│ New topology:
│ Clients → New Primary (old Standby) ✓
│               ↓
│ Old Primary (offline) ✗
└────────────────────────────────────────────────────

SPLIT-BRAIN PROBLEM (What Can Go Wrong)

┌────────────────────────────────────────────────────
│ Network Partition:
│
│ Region A          ║  Region B
│ Primary ✓         ║  Standby
│ (still alive)     ║  (promoted to Primary) ✓
│      ↓            ║       ↓
│ Clients A         ║  Clients B
│ write X=1         ║  write X=2
│                   ║
│ Result: TWO PRIMARIES = DATA DIVERGENCE ✕
│
│ Solution: Fencing (kill old primary with authority)
└────────────────────────────────────────────────────
Core Explanation
What is Failover?
Failover is the automatic or manual process of switching from a failed primary system to a standby backup system to maintain service availability. It’s a key component of high-availability (HA) architectures.
Key Components:
Primary/Leader: Active system handling requests
Standby/Follower: Backup system ready to take over
Health Monitor: Detects primary failures via heartbeats
Failover Orchestrator: Promotes standby to primary
Fencing Mechanism: Prevents split-brain scenarios
Standby Types
1. Hot Standby (Active-Passive)
Setup:
Primary: Handles all traffic
Standby: Fully synced, ready to serve immediately
Replication: Continuous, synchronous or near-sync
Failover Time: Seconds (fastest)
Cost: High (standby hardware running idle)
Example:
Primary DB: PostgreSQL with streaming replication
Standby DB: Replica in sync, can become primary instantly
Use Case: Financial systems, databases (99.99% uptime)
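To verify that a hot standby really is in sync before relying on instant promotion, you can measure its replay lag. Below is a minimal sketch, assuming a PostgreSQL standby at a placeholder hostname and using psycopg2; pg_last_xact_replay_timestamp() is the built-in function that reports when the standby last applied a transaction.

import psycopg2

# Placeholder connection details for the standby.
conn = psycopg2.connect(host="standby.internal", dbname="appdb",
                        user="monitor", password="secret")
with conn.cursor() as cur:
    # Seconds since the last replayed transaction; a large value means the standby is lagging.
    cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
    lag_seconds = cur.fetchone()[0]
print(f"Replication lag: approximately {lag_seconds:.1f}s")
conn.close()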
2. Warm Standby
Setup:
Primary: Handles all traffic
Standby: Partially synced, needs brief preparation
Replication: Periodic or continuous async
Failover Time: Minutes
Cost: Medium (standby runs but lighter workload)
Example:
Primary: Application server with session state
Standby: Server running but not in load balancer
Use Case: Web applications, moderate SLA (99.9% uptime)
3. Cold Standby
Setup:
Primary: Handles all traffic
Standby: Offline, restored from backup
Replication: Periodic backups only
Failover Time: Hours (restore from backup + configure)
Cost: Low (standby hardware can be repurposed)
Example:
Primary: Production database
Standby: Backup snapshots on S3, restore to new instance
Use Case: Development systems, cost-sensitive applications
Failure Detection
Heartbeat Monitoring:
HEARTBEAT PROTOCOL:
┌─────────────────────────────────────────────
│ Every 1-5 seconds:
│
│   Primary → "I'm alive" → Health Monitor
│
│ If no heartbeat for N seconds:
│   - N = 10s: Aggressive (false positives)
│   - N = 30s: Conservative (slower failover)
│   - N = 10s with 3 retries: Balanced ✓
└─────────────────────────────────────────────

What to Monitor:
✓ Process is running (liveness probe)
✓ Service is responsive (readiness probe)
✓ Database connections work
✓ Disk space available
✓ CPU/Memory not exhausted
False Positive Causes:
- Temporary network glitch
- GC pause (Java stop-the-world)
- High CPU causing timeout
- Switch/router failure (not server failure)
Mitigation:
- Multiple independent monitors
- Quorum-based decision (3/5 monitors agree)
- Grace period + retries
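A minimal sketch of the grace-period-plus-retries approach, assuming the primary is reachable at a placeholder address and that a simple TCP connect is an acceptable liveness probe (a real monitor would combine this with quorum voting across several independent monitors):

import socket
import time

PRIMARY = ("primary.internal", 5432)   # placeholder primary address
HEARTBEAT_INTERVAL = 5                 # seconds between probes
MAX_MISSES = 3                         # consecutive misses before declaring failure

def primary_is_alive() -> bool:
    # Liveness probe: can we open a TCP connection within 2 seconds?
    try:
        with socket.create_connection(PRIMARY, timeout=2):
            return True
    except OSError:
        return False

misses = 0
while True:
    misses = 0 if primary_is_alive() else misses + 1
    if misses >= MAX_MISSES:
        # Grace period exhausted: hand off to the failover orchestrator here.
        print("Primary declared unhealthy, triggering failover")
        break
    time.sleep(HEARTBEAT_INTERVAL)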
Failover Process Steps
Step-by-Step Breakdown:
1. FAILURE DETECTION (5-30s)
- Health monitor misses N heartbeats
- Multiple monitors reach consensus
- Declare primary unhealthy
2. FENCING (1-5s)
- Prevent split-brain
- Kill old primary (STONITH: Shoot The Other Node In The Head)
- Revoke old primary's network access
- Acquire distributed lock/lease
3. PROMOTION (2-10s)
- Standby catches up replication lag (if any)
- Standby promoted to primary role
- Update metadata (e.g., Kafka controller registry)
4. DNS/ROUTING UPDATE (1-60s)
- Update DNS to point to new primary
- Or update load balancer configuration
- Or use floating IP (instant failover)
5. CLIENT RECONNECTION (0-30s)
- Clients detect connection failure
- Retry with backoff
- Discover new primary endpoint
- Resume operations
Total Downtime: 8-135 seconds (depends on config)
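The steps above can be expressed as a small orchestration routine. This is an illustrative sketch only; fence_old_primary, promote_standby, and update_routing are hypothetical hooks standing in for STONITH, a pg_ctl-promote-style promotion, and a DNS/floating-IP update.

import time

def fence_old_primary():
    # Hypothetical: cut the old primary's power or network access (step 2).
    print("fencing old primary")

def promote_standby():
    # Hypothetical: wait for replication catch-up, then promote the standby (step 3).
    print("promoting standby")

def update_routing():
    # Hypothetical: repoint DNS, the load balancer, or a floating IP (step 4).
    print("updating routing")

def run_failover():
    start = time.monotonic()
    fence_old_primary()
    promote_standby()
    update_routing()
    # Step 5 (client reconnection) happens on the client side via retry with backoff.
    print(f"failover completed in {time.monotonic() - start:.1f}s")

run_failover()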
Recovery Point Objective (RPO):

RPO = Maximum acceptable data loss

Examples:
- Bank transactions: RPO = 0 (no loss acceptable)
- User comments: RPO = 5 minutes (acceptable)
- Log aggregation: RPO = 1 hour (can lose some logs)
Factors affecting RPO:
- Replication strategy (sync vs async)
- Commit acknowledgment (quorum required?)
- Checkpoint frequency
- Backup frequency (for cold standby)
TRADEOFF:
Low RPO (sync replication) = Higher latency
High RPO (async replication) = Lower latency, more data loss
Split-Brain Problem
What is Split-Brain?
Network partition causes both nodes to think they're primary:
Before partition:

  Primary A (serving traffic) ← Health Monitor → Standby B

After partition:

  Primary A (still serving)   ║  Standby B (promoted, now serving)
        ↓                     ║        ↓
  Clients in Region A         ║  Clients in Region B
  write X=1                   ║  write X=2

PROBLEM: Divergent data, corruption when partition heals
Prevention Techniques:
1. Fencing (STONITH)
Before promoting standby, KILL old primary
Methods:
- Power off via IPMI/iLO
- Network isolation (block at switch)
- Kernel panic trigger
- Forceful process termination
Use case: Shared storage systems (SAN)
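As one concrete illustration of the "power off via IPMI" option, a fencing agent might shell out to the ipmitool CLI. The BMC address and credentials below are placeholders, and a real agent must confirm the power-off actually succeeded before allowing promotion.

import subprocess

BMC_HOST = "10.0.0.50"   # placeholder BMC address of the failed primary
BMC_USER = "admin"       # placeholder credentials
BMC_PASS = "secret"

def stonith_power_off() -> bool:
    # Force a hard power-off of the old primary through its management controller.
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASS, "chassis", "power", "off"],
        capture_output=True, text=True,
    )
    return result.returncode == 0   # only promote the standby if fencing succeeded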
2. Distributed Locks / Leases
Acquire lock before becoming primary
Example (etcd):
1. Standby tries to acquire lease: /leader
2. Only succeeds if old primary's lease expired
3. Old primary cannot write without valid lease
4. Result: At most one primary at a time ✓
Use case: Kafka controller election
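A lease-based election loop might look roughly like the sketch below. The dcs object is a hypothetical coordination client (etcd, Consul, and ZooKeeper clients expose similar primitives): the atomic create succeeds only when the previous lease has expired, and leadership is lost automatically if the holder stops refreshing.

import time

LEASE_TTL = 10   # seconds; the lease expires if not refreshed in time

def try_acquire_leadership(dcs, node_id: str):
    # dcs is a hypothetical client offering lease grants and compare-and-set writes.
    lease = dcs.grant_lease(ttl=LEASE_TTL)
    # Atomic create: succeeds only if /leader is absent, i.e. the old primary's lease expired.
    if not dcs.put_if_absent("/leader", node_id, lease=lease):
        return None
    return lease

def hold_leadership(lease):
    # Refresh well inside the TTL; if this node crashes, the lease expires
    # and a standby can win the next election.
    while True:
        lease.refresh()
        time.sleep(LEASE_TTL / 3)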
3. Quorum / Witness Node
Use majority voting to determine primary
Setup: 3 nodes (A, B, C)
- A is primary (has 2/3 votes)
- Network partition: A isolated, B+C can see each other
- B+C have majority (2/3), can elect new primary
- A has minority (1/3), cannot remain primary
- Result: Only B or C can be new primary
Use case: Distributed databases (Cassandra, Riak)
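The majority rule itself is tiny; here is a toy version of the check each side of the partition would run before claiming (or keeping) the primary role:

def has_quorum(votes_for_me: int, cluster_size: int) -> bool:
    # True only if a strict majority of nodes agrees this node may lead.
    return votes_for_me > cluster_size // 2

# 3-node cluster, A partitioned away from B and C:
print(has_quorum(votes_for_me=1, cluster_size=3))  # A: False -> must step down
print(has_quorum(votes_for_me=2, cluster_size=3))  # B or C: True -> may become primary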
4. Generation Number / Epoch
Increment version number on each failover
Primary A has epoch=5
After failover, Primary B has epoch=6
If A comes back, it sees epoch=5 < 6
A demotes itself and syncs from B
Use case: Kafka, ZooKeeper
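Replicas and storage nodes can use the epoch to reject traffic from a deposed primary. A minimal sketch of that check, mirroring the epoch numbers above:

class Replica:
    def __init__(self):
        self.current_epoch = 6          # learned from the latest election

    def handle_write(self, epoch: int, data: bytes) -> bool:
        if epoch < self.current_epoch:  # request from a deposed primary (epoch 5)
            return False                # reject: the sender must demote itself and re-sync
        self.current_epoch = epoch      # accept and remember the newest epoch
        # ... apply the write here ...
        return True

replica = Replica()
print(replica.handle_write(epoch=5, data=b"X=1"))  # False: old primary A is fenced off
print(replica.handle_write(epoch=6, data=b"X=2"))  # True: new primary B is accepted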
Real Systems Using Failover
| System | Failover Type | Detection Time | RTO | RPO | Split-Brain Prevention |
|---|---|---|---|---|---|
| PostgreSQL | Hot Standby (streaming replication) | 10-30s | ~30s | 0-5s | Witness server, fencing |
| MySQL | Semi-sync replication | 10-30s | ~1min | 0 (sync ack) | GTID, virtual IP |
| Redis Sentinel | Hot Standby (Sentinel monitors) | 5-15s | ~10s | 0-1s | Quorum (majority of Sentinels) |
| Kafka Controller | Hot Standby (ZooKeeper election) | 10-30s | ~20s | 0 (committed log) | ZooKeeper leader election, epoch |
| AWS RDS | Multi-AZ (automated failover) | 30-60s | 60-120s | 0 | AWS orchestration |
| Cassandra | Leaderless (no failover needed) | N/A | N/A | N/A | Quorum reads/writes |
Case Study: PostgreSQL Failover
PostgreSQL Failover with Patroni
PostgreSQL Streaming Replication + Patroni

Architecture:

┌─────────────────────────────────────────────────
│ Primary (Leader)
│     ↓ (continuous WAL streaming)
│ Standby 1 (sync replica, zero lag)
│ Standby 2 (async replica, small lag)
│
│ Health Monitor: Patroni (uses etcd for DCS)
└─────────────────────────────────────────────────

Failure Scenario:
1. Primary crashes (hardware failure)
2. Patroni on each node detects (10s heartbeat miss)
3. Standbys try to acquire leadership lock in etcd
4. Standby 1 (most up-to-date) acquires lock
5. Standby 1 runs pg_ctl promote
6. Standby 1 becomes new primary (~5s)
7. Patroni updates DNS or floating IP (2s)
8. Applications reconnect to new primary (~5s)
Total downtime: ~22 seconds
RPO: 0 (sync replication to Standby 1)
Configuration:
synchronous_standby_names = 'standby1'
synchronous_commit = on
wal_level = replica
max_wal_senders = 5
Result: Zero data loss, ~20s downtime
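On the application side, step 8 ("applications reconnect") usually amounts to retrying with backoff until the routing change takes effect. A minimal sketch using psycopg2 with a placeholder DSN pointing at the floating IP or DNS name:

import time
import psycopg2

DSN = "host=db.internal dbname=appdb user=app password=secret connect_timeout=3"  # placeholder

def connect_with_retry(max_attempts: int = 10):
    # Retry with exponential backoff while DNS / the floating IP moves to the new primary.
    delay = 0.5
    for _ in range(max_attempts):
        try:
            return psycopg2.connect(DSN)
        except psycopg2.OperationalError:
            time.sleep(delay)
            delay = min(delay * 2, 10)   # cap the backoff at 10 seconds
    raise RuntimeError("could not reach the database after failover")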
Case Study: Kafka Controller Failover
Kafka Controller Election (via ZooKeeper)
Normal Operation:
- One broker is controller (manages partition leaders)
- Controller broker has ephemeral node in ZooKeeper: /controller
Failure:
1. Controller broker crashes
2. ZooKeeper detects session timeout (6-10s)
3. ZooKeeper deletes /controller ephemeral node
4. All brokers watch /controller for changes
5. Brokers race to create /controller node
6. First to create becomes new controller
7. New controller loads cluster metadata from ZK
8. New controller sends LeaderAndIsr requests to brokers
9. Partition leaders updated, cluster operational
Total downtime: ~10-20s (partition leadership updates)
RPO: 0 (committed messages replicated to ISR)
Split-Brain Prevention:
- ZooKeeper ensures only one /controller node
- Controller Epoch incremented on each election
- Brokers reject requests with old epoch
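The "race to create /controller" step can be sketched with the kazoo ZooKeeper client: every broker tries to create the same ephemeral node and exactly one succeeds. This is a simplified illustration (real brokers also record the controller epoch), and the connection string is a placeholder.

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="zk1.internal:2181")   # placeholder ZooKeeper ensemble
zk.start()

BROKER_ID = b"3"

try:
    # Ephemeral node: it disappears automatically if this broker's ZK session dies.
    zk.create("/controller", BROKER_ID, ephemeral=True)
    print("won the election: this broker is now the controller")
except NodeExistsError:
    # Lost the race: watch /controller and retry when the current controller's session expires.
    print("another broker is already the controller")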
When to Use Failover
Use Failover When:
High Availability Required
Scenario: E-commerce checkout service
Requirement: 99.95% uptime (4 hours downtime/year)
Solution: Hot standby with automatic failover
Trade-off: Cost of redundant infrastructure
Data Loss Unacceptable
Scenario: Payment transaction database
Requirement: Zero data loss (RPO = 0)
Solution: Synchronous replication to hot standby
Trade-off: Higher write latency
RTO Measured in Seconds/Minutes
Scenario: Live video streaming control plane
Requirement: Failover < 30 seconds
Solution: Hot standby with fast health checks
Trade-off: False positives from aggressive timeouts
When NOT to Use Failover:
Stateless Services
Problem: Failover is overkill
Solution: Use load balancer with multiple active instances
Example: Stateless REST APIs, web servers
Benefit: Simpler, no failover orchestration needed
Eventually Consistent Systems
Problem: Failover adds complexity without benefit
Solution: Multi-master or leaderless replication
Example: Cassandra, DynamoDB (quorum writes)
Benefit: No single point of failure, continuous availability
Cost-Sensitive Non-Critical Systems
Problem: Hot standby doubles infrastructure cost
Solution: Use cold standby (restore from backup)
Example: Development databases, analytics pipelines
Benefit: Save money, accept longer downtime
Interview Application
Common Interview Question
Q: “Design a highly available database for a payment system. How would you handle primary database failure?”
Strong Answer:
“For a payment system where data loss is unacceptable, I’d design a hot standby failover system with these characteristics:
Architecture:
Primary database with synchronous replication to hot standby
Hot standby in different availability zone (same region for low latency)
Health monitoring via Patroni or similar tool (10-second heartbeat)
Distributed coordination using etcd or ZooKeeper for split-brain prevention
Failure Handling:
Detection (10s): Patroni detects primary unresponsive after 2 missed heartbeats
Fencing (2s): Revoke old primary’s write access via network isolation
Promotion (5s): Standby promoted to primary, acquires leadership lock in etcd
Routing (5s): Update floating IP or DNS to point to new primary