Failover | Concepts

TL;DR

Failover is the process of automatically transferring operations from a failed primary system to a standby backup, minimizing downtime and ensuring service continuity. It’s essential for high-availability systems, enabling recovery from hardware failures, software crashes, and network issues within seconds to minutes.

Visual Overview

Failover Overview

NORMAL OPERATION (Leader-Follower)
┌────────────────────────────────────────────────────┐
│                                                    │
│  Clients → Primary/Leader                          │
│               ↓                                    │
│          [Replicating]                             │
│               ↓                                    │
│          Standby/Follower (Hot/Warm/Cold)          │
│                                                    │
│  Status: Primary is HEALTHY ✓                      │
└────────────────────────────────────────────────────┘

FAILURE DETECTION
┌────────────────────────────────────────────────────┐
│ Heartbeat Monitor                                  │
│ ↓                                                  │
│ Primary ✗ (No heartbeat for 10 seconds)            │
│ ↓                                                  │
│ Failure Threshold Exceeded                         │
│ ↓                                                  │
│ TRIGGER FAILOVER!                                  │
└────────────────────────────────────────────────────┘

FAILOVER PROCESS
┌────────────────────────────────────────────────────┐
│ T0: Primary fails (crash, network partition)       │
│ T1: Health checker detects failure (5-10s)         │
│ T2: Standby promoted to new primary (2-5s)         │
│ T3: Clients redirected to new primary (1-2s)       │
│ T4: Service restored                               │
│                                                    │
│ Total downtime: 8-17 seconds                       │
│                                                    │
│ New topology:                                      │
│ Clients → New Primary (old Standby) ✓              │
│ ↓                                                  │
│ Old Primary (offline) ✗                            │
└────────────────────────────────────────────────────┘

SPLIT-BRAIN PROBLEM (What Can Go Wrong)
┌────────────────────────────────────────────────────┐
│ Network Partition:                                 │
│                                                    │
│ Region A ║ Region B                                │
│ Primary ✓ ║ Standby                                │
│ (still alive) ║ (promoted to Primary) ✓            │
│ ↓ ║ ↓                                              │
│ Clients A ║ Clients B                              │
│ write X=1 ║ write X=2                              │
│ ║                                                  │
│ Result: TWO PRIMARIES = DATA DIVERGENCE ✕          │
│                                                    │
│ Solution: Fencing (kill old primary with authority)│
└────────────────────────────────────────────────────┘

Core Explanation

What is Failover?

Failover is the automatic or manual process of switching from a failed primary system to a standby backup system to maintain service availability. It’s a key component of high-availability (HA) architectures.

Key Components:

Primary/Leader: Active system handling requests
Standby/Follower: Backup system ready to take over
Health Monitor: Detects primary failures via heartbeats
Failover Orchestrator: Promotes standby to primary
Fencing Mechanism: Prevents split-brain scenarios

Standby Types

1. Hot Standby (Active-Passive)

Hot Standby

Setup:
Primary: Handles all traffic
Standby: Fully synced, ready to serve immediately

Replication: Continuous, synchronous or near-sync
Failover Time: Seconds (fastest)
Cost: High (standby hardware running idle)

Example:
Primary DB: PostgreSQL with streaming replication
Standby DB: Replica in sync, can become primary instantly

Use Case: Financial systems, databases (99.99% uptime)

2. Warm Standby

Warm Standby

Setup:
Primary: Handles all traffic
Standby: Partially synced, needs brief preparation

Replication: Periodic or continuous async
Failover Time: Minutes
Cost: Medium (standby runs but lighter workload)

Example:
Primary: Application server with session state
Standby: Server running but not in load balancer

Use Case: Web applications, moderate SLA (99.9% uptime)

3. Cold Standby

Cold Standby

Setup:
Primary: Handles all traffic
Standby: Offline, restored from backup

Replication: Periodic backups only
Failover Time: Hours (restore from backup + configure)
Cost: Low (standby hardware can be repurposed)

Example:
Primary: Production database
Standby: Backup snapshots on S3, restore to new instance

Use Case: Development systems, cost-sensitive applications

Failure Detection

Heartbeat Monitoring:

Heartbeat Monitoring

HEARTBEAT PROTOCOL:
┌─────────────────────────────────────────────┐
│  Every 1-5 seconds:                         │
│                                             │
│  Primary → "I'm alive" → Health Monitor     │
│                                             │
│  If no heartbeat for N seconds:             │
│  - N = 10s: Aggressive (false positives)    │
│  - N = 30s: Conservative (slower failover)  │
│  - N = 10s with 3 retries: Balanced ✓       │
└─────────────────────────────────────────────┘

What to Monitor:
✓ Process is running (liveness probe)
✓ Service is responsive (readiness probe)
✓ Database connections work
✓ Disk space available
✓ CPU/Memory not exhausted

False Positive Causes:

- Temporary network glitch
- GC pause (Java stop-the-world)
- High CPU causing timeout
- Switch/router failure (not server failure)

Mitigation:

- Multiple independent monitors
- Quorum-based decision (3/5 monitors agree)
- Grace period + retries

Failover Process Steps

Step-by-Step Breakdown:

Failover Process Steps

1. FAILURE DETECTION (5-30s)
 - Health monitor misses N heartbeats
 - Multiple monitors reach consensus
 - Declare primary unhealthy

2. FENCING (1-5s)
 - Prevent split-brain
 - Kill old primary (STONITH: Shoot The Other Node In The Head)
 - Revoke old primary's network access
 - Acquire distributed lock/lease

3. PROMOTION (2-10s)
 - Standby catches up replication lag (if any)
 - Standby promoted to primary role
 - Update metadata (e.g., Kafka controller registry)

4. DNS/ROUTING UPDATE (1-60s)
 - Update DNS to point to new primary
 - Or update load balancer configuration
 - Or use floating IP (instant failover)

5. CLIENT RECONNECTION (0-30s)
 - Clients detect connection failure
 - Retry with backoff
 - Discover new primary endpoint
 - Resume operations

Total Downtime: 8-135 seconds (depends on config)

RTO vs RPO

Recovery Time Objective (RTO):

Recovery Time Objective (RTO)

RTO = Maximum acceptable downtime

Examples:

- E-commerce site: RTO = 5 minutes (lose sales)
- Payment processing: RTO = 30 seconds (critical)
- Analytics dashboard: RTO = 30 minutes (non-critical)

Factors affecting RTO:

- Standby type (hot < warm < cold)
- Failure detection time
- Promotion process complexity
- Client reconnection speed

Recovery Point Objective (RPO):

Recovery Point Objective (RPO)

RPO = Maximum acceptable data loss

Examples:

- Bank transactions: RPO = 0 (no loss acceptable)
- User comments: RPO = 5 minutes (acceptable)
- Log aggregation: RPO = 1 hour (can lose some logs)

Factors affecting RPO:

- Replication strategy (sync vs async)
- Commit acknowledgment (quorum required?)
- Checkpoint frequency
- Backup frequency (for cold standby)

TRADEOFF:
Low RPO (sync replication) = Higher latency
High RPO (async replication) = Lower latency, more data loss

Split-Brain Problem

What is Split-Brain?

Split-Brain Problem

Network partition causes both nodes to think they're primary:

Before partition:
Primary A (serving traffic) ← Health Monitor → Standby B

After partition:
Primary A (still serving) ║ Standby B (promoted, now serving)
↓ ║ ↓
Clients in Region A ║ Clients in Region B
write X=1 ║ write X=2

PROBLEM: Divergent data, corruption when partition heals

Prevention Techniques:

1. Fencing (STONITH)

Fencing (STONITH)

Before promoting standby, KILL old primary

Methods:

- Power off via IPMI/iLO
- Network isolation (block at switch)
- Kernel panic trigger
- Forceful process termination

Use case: Shared storage systems (SAN)

2. Distributed Locks / Leases

Distributed Locks / Leases

Acquire lock before becoming primary

Example (etcd):

1. Standby tries to acquire lease: /leader
2. Only succeeds if old primary's lease expired
3. Old primary cannot write without valid lease
4. Result: At most one primary at a time ✓

Use case: Kafka controller election

3. Quorum / Witness Node

Quorum / Witness Node

Use majority voting to determine primary

Setup: 3 nodes (A, B, C)

- A is primary (has 2/3 votes)
- Network partition: A isolated, B+C can see each other
- B+C have majority (2/3), can elect new primary
- A has minority (1/3), cannot remain primary
- Result: Only B or C can be new primary

Use case: Distributed databases (Cassandra, Riak)

4. Generation Number / Epoch

Generation Number / Epoch

Increment version number on each failover

Primary A has epoch=5
After failover, Primary B has epoch=6
If A comes back, it sees epoch=5 < 6
A demotes itself and syncs from B

Use case: Kafka, ZooKeeper

Real Systems Using Failover

System	Failover Type	Detection Time	RTO	RPO	Split-Brain Prevention
PostgreSQL	Hot Standby (streaming replication)	10-30s	~30s	0-5s	Witness server, fencing
MySQL	Semi-sync replication	10-30s	~1min	0 (sync ack)	GTID, virtual IP
Redis Sentinel	Hot Standby (Sentinel monitors)	5-15s	~10s	0-1s	Quorum (majority of Sentinels)
Kafka Controller	Hot Standby (ZooKeeper election)	10-30s	~20s	0 (committed log)	ZooKeeper leader election, epoch
AWS RDS	Multi-AZ (automated failover)	30-60s	60-120s	0	AWS orchestration
Cassandra	Leaderless (no failover needed)	N/A	N/A	N/A	Quorum reads/writes

Case Study: PostgreSQL Failover

PostgreSQL Failover with Patroni

PostgreSQL Streaming Replication + Patroni

Architecture:
┌─────────────────────────────────────────────────┐
│ Primary (Leader)                                │
│ ↓ (continuous WAL streaming)                    │
│ Standby 1 (sync replica, zero lag)              │
│ Standby 2 (async replica, small lag)            │
│                                                 │
│ Health Monitor: Patroni (uses etcd for DCS)     │
└─────────────────────────────────────────────────┘

Failure Scenario:

1. Primary crashes (hardware failure)
2. Patroni on each node detects (10s heartbeat miss)
3. Standbys try to acquire leadership lock in etcd
4. Standby 1 (most up-to-date) acquires lock
5. Standby 1 runs pg_ctl promote
6. Standby 1 becomes new primary (~5s)
7. Patroni updates DNS or floating IP (2s)
8. Applications reconnect to new primary (~5s)

Total downtime: ~22 seconds
RPO: 0 (sync replication to Standby 1)

Configuration:
synchronous_standby_names = 'standby1'
synchronous_commit = on
wal_level = replica
max_wal_senders = 5

Result: Zero data loss, ~20s downtime

Case Study: Kafka Controller Failover

Kafka Controller Failover

Kafka Controller Election (via ZooKeeper)

Normal Operation:

- One broker is controller (manages partition leaders)
- Controller broker has ephemeral node in ZooKeeper: /controller

Failure:

1. Controller broker crashes
2. ZooKeeper detects session timeout (6-10s)
3. ZooKeeper deletes /controller ephemeral node
4. All brokers watch /controller for changes
5. Brokers race to create /controller node
6. First to create becomes new controller
7. New controller loads cluster metadata from ZK
8. New controller sends LeaderAndIsr requests to brokers
9. Partition leaders updated, cluster operational

Total downtime: ~10-20s (partition leadership updates)
RPO: 0 (committed messages replicated to ISR)

Split-Brain Prevention:

- ZooKeeper ensures only one /controller node
- Controller Epoch incremented on each election
- Brokers reject requests with old epoch

When to Use Failover

Use Failover When:

Trigger	Scenario	Requirement	Solution	Trade-off
High Availability Required	E-commerce checkout service	99.95% uptime (4 hours downtime/year)	Hot standby with automatic failover	Cost of redundant infrastructure
Data Loss Unacceptable	Payment transaction database	Zero data loss (RPO = 0)	Synchronous replication to hot standby	Higher write latency
RTO Measured in Seconds/Mins	Live video streaming control plane	Failover < 30 seconds	Hot standby with fast health checks	False positives from aggressive timeouts

When NOT to Use Failover:

Anti-pattern	Problem	Solution	Example	Benefit
Stateless Services	Failover is overkill	Use load balancer with multiple active instances	Stateless REST APIs, web servers	Simpler, no failover orchestration needed
Eventually Consistent Systems	Failover adds complexity without benefit	Multi-master or leaderless replication	Cassandra, DynamoDB (quorum writes)	No single point of failure, continuous availability
Cost-Sensitive Non-Critical	Hot standby doubles infrastructure cost	Use cold standby (restore from backup)	Development databases, analytics pipelines	Save money, accept longer downtime

Interview Application

Common Interview Question

Q: “Design a highly available database for a payment system. How would you handle primary database failure?”

Strong Answer:

“For a payment system where data loss is unacceptable, I’d design a hot standby failover system with these characteristics:

Architecture:

Primary database with synchronous replication to hot standby

Hot standby in different availability zone (same region for low latency)

Health monitoring via Patroni or similar tool (10-second heartbeat)

Distributed coordination using etcd or ZooKeeper for split-brain prevention

Failure Handling:

Detection (10s): Patroni detects primary unresponsive after 2 missed heartbeats

Fencing (2s): Revoke old primary’s write access via network isolation

Promotion (5s): Standby promoted to primary, acquires leadership lock in etcd

Routing (5s): Update floating IP or DNS to point to new primary

Reconnection (5s): Applications retry connections, resume transactions

Guarantees:

RTO: ~27 seconds (acceptable for payment system)

RPO: 0 seconds (synchronous replication means zero data loss)

Consistency: Transactions committed to both primary and standby before ACK

Trade-offs:

Latency: Synchronous replication adds ~5-10ms to write latency

Cost: Hot standby doubles database infrastructure cost

Complexity: Patroni/etcd adds operational complexity

Split-Brain Prevention:

Use distributed lock in etcd (only one primary can hold lock)

Primary must renew lease every 5 seconds or lose write access

Generation numbers (epochs) to detect stale primaries

Real-World Example:

Similar to AWS RDS Multi-AZ: synchronous replication, automatic failover

Or PostgreSQL + Patroni (used by Zalando, widely adopted)”

Why This Answer Works:

Identifies appropriate failover type (hot standby) for use case
Explains step-by-step process with timing
Discusses RTO/RPO trade-offs explicitly
Addresses split-brain problem proactively
References real implementations

Code Example

Implementing Simple Failover with Health Checks

// Health Monitor for Failover Detection
class HealthMonitor {
  constructor(primaryUrl, standbyUrl, config) {
    this.primaryUrl = primaryUrl;
    this.standbyUrl = standbyUrl;
    this.config = {
      heartbeatInterval: config.heartbeatInterval || 5000, // 5s
      failureThreshold: config.failureThreshold || 3, // 3 misses
      ...config,
    };

    this.consecutiveFailures = 0;
    this.currentPrimary = primaryUrl;
    this.isFailedOver = false;
  }

  async start() {
    setInterval(() => this.checkHealth(), this.config.heartbeatInterval);
  }

  async checkHealth() {
    try {
  // ... omitted: keep concept snippets short

// Expected timeline on failure:
// T0:  Primary crashes
// T5:  First health check fails (1/3)
// T10: Second health check fails (2/3)
// T15: Third health check fails (3/3) → Trigger failover
// T16: Fence old primary (1s)
// T21: Promote standby (5s)
// T22: Update routing
// Total downtime: ~22 seconds

Split-Brain Prevention with Distributed Lock

// Using etcd for distributed locking to prevent split-brain
const { Etcd3 } = require("etcd3");

class FailoverCoordinator {
  constructor(etcdHosts, nodeId) {
    this.etcd = new Etcd3({ hosts: etcdHosts });
    this.nodeId = nodeId;
    this.leaderKey = "/cluster/leader";
    this.lease = null;
    this.isLeader = false;
  }

  async tryBecomeLeader() {
    try {
      // Create a lease (TTL = 10 seconds)
      this.lease = this.etcd.lease(10);

      // Try to acquire leader lock with this lease
      const result = await this.etcd
        .if(this.leaderKey, "Create", "==", 0) // Only if key doesn't exist
        .then(
          this.etcd.put(this.leaderKey).value(this.nodeId).lease(this.lease)
  // ... omitted: keep concept snippets short

// Simulate primary failure (lease expires after 10s without renewal)
// ... network partition or crash ...

// After 10s, standby can acquire lock
await standby.tryBecomeLeader(); // Succeeds! Becomes new leader

// If old primary comes back:
await primary.tryBecomeLeader(); // Fails! Standby is now leader
// Old primary cannot write without leadership lock ✓

Prerequisites:

Distributed Systems Basics - Foundation concepts

Related Concepts:

Leader-Follower Replication - Understanding replication for standby
Consensus - Leader election algorithms
Quorum - Majority-based decisions for failover
Health Checks - Failure detection mechanisms

Used In Systems:

High-availability databases (PostgreSQL, MySQL, Redis)
Distributed coordination (ZooKeeper, etcd)
Message brokers (Kafka controller election)

Explained In Detail:

Distributed Systems Deep Dive - Failover patterns in depth

Quick Self-Check

Can explain failover in 60 seconds?
Understand difference between hot/warm/cold standby?
Can explain RTO vs RPO trade-offs?
Understand split-brain problem and prevention techniques?
Know failure detection with heartbeats and thresholds?
Can design failover for a production database?

TL;DR

Visual Overview

Core Explanation

What is Failover?

Standby Types

Failure Detection

Failover Process Steps

RTO vs RPO

Split-Brain Problem

Real Systems Using Failover

Case Study: PostgreSQL Failover

Case Study: Kafka Controller Failover

When to Use Failover

Use Failover When:

When NOT to Use Failover:

Interview Application

Common Interview Question

Code Example

Implementing Simple Failover with Health Checks

Split-Brain Prevention with Distributed Lock

Related Content

Quick Self-Check

Why this concept matters