Failover

Automatic switching to a backup system or replica when the primary fails, ensuring service continuity with minimal downtime

TL;DR

Failover is the process of automatically transferring operations from a failed primary system to a standby backup, minimizing downtime and ensuring service continuity. It’s essential for high-availability systems, enabling recovery from hardware failures, software crashes, and network issues within seconds to minutes.

Core Explanation

What is Failover?

Failover is the automatic or manual process of switching from a failed primary system to a standby backup system to maintain service availability. It’s a key component of high-availability (HA) architectures.

Key Components:

  1. Primary/Leader: Active system handling requests
  2. Standby/Follower: Backup system ready to take over
  3. Health Monitor: Detects primary failures via heartbeats
  4. Failover Orchestrator: Promotes standby to primary
  5. Fencing Mechanism: Prevents split-brain scenarios

Standby Types

1. Hot Standby (Active-Passive)

A fully running replica that continuously receives replication from the primary and can take over within seconds. Fastest recovery, highest cost, since the standby hardware sits mostly idle.

2. Warm Standby

A replica that is running but only periodically synchronized (for example via scheduled log shipping). Takeover takes minutes, and the most recent writes may be lost.

3. Cold Standby

A backup instance that is provisioned and restored from backups only after the primary fails. Cheapest option, but recovery can take hours. A small comparison sketch follows.
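
The three standby types trade recovery speed against cost. As a rough comparison, here is a minimal sketch with illustrative ballpark RTO/RPO figures (assumptions for this example, not measurements) and a hypothetical helper that picks the cheapest type meeting given recovery targets:

// Illustrative comparison of standby types. The numbers are ballpark assumptions,
// not benchmarks; adjust them to your own environment.
const STANDBY_TYPES = [
  { name: "cold", rtoSeconds: 4 * 3600, rpoSeconds: 24 * 3600, relativeCost: 1 }, // restore from backups
  { name: "warm", rtoSeconds: 10 * 60,  rpoSeconds: 5 * 60,    relativeCost: 2 }, // periodic sync
  { name: "hot",  rtoSeconds: 30,       rpoSeconds: 0,         relativeCost: 3 }, // continuous replication
];

function chooseStandbyType(rtoTargetSeconds, rpoTargetSeconds) {
  // Types are ordered cheapest-first, so the first match is the cheapest option.
  return STANDBY_TYPES.find(
    (t) => t.rtoSeconds <= rtoTargetSeconds && t.rpoSeconds <= rpoTargetSeconds
  );
}

console.log(chooseStandbyType(60, 5)?.name);     // "hot"  - payment-style targets
console.log(chooseStandbyType(3600, 900)?.name); // "warm" - an internal tool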

Failure Detection

Heartbeat Monitoring:

The standby (or an external monitor) expects a heartbeat from the primary every few seconds. After a configured number of consecutive misses, the primary is declared failed and failover is triggered. Worst-case detection time is roughly the heartbeat interval multiplied by the failure threshold, as the sketch below shows.
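
A minimal sketch of that arithmetic (the numbers mirror the health-check example later on this page):

// Worst-case failure detection time for interval-based heartbeats: the primary
// can crash right after a successful check, so the monitor needs
// `failureThreshold` full intervals before declaring it dead.
function detectionTimeMs(heartbeatIntervalMs, failureThreshold) {
  return heartbeatIntervalMs * failureThreshold;
}

console.log(detectionTimeMs(5000, 3)); // 15000 ms, matching the timeline in the code example below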

Failover Process Steps

Step-by-Step Breakdown:

  1. Detect: the health monitor misses consecutive heartbeats and declares the primary failed
  2. Fence: isolate the old primary so it can no longer accept writes (prevents split-brain)
  3. Promote: the standby catches up on any remaining replication and is promoted to primary
  4. Reroute: clients are redirected to the new primary via floating IP, DNS, or proxy update
  5. Reconnect: applications retry their connections and resume normal operation
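
To make steps 4 and 5 concrete, here is a minimal client-side sketch. The /primary endpoint and the connect callback are hypothetical placeholders, not a specific driver's API: the application asks a router (or the failover orchestrator) for the current primary before connecting, and re-resolves with backoff if its connection drops mid-failover.

// Client-side view of failover: resolve the current primary, connect, and
// re-resolve with retries when the connection breaks during a failover.
async function resolvePrimary(routerUrl) {
  const response = await fetch(`${routerUrl}/primary`); // hypothetical routing endpoint
  if (!response.ok) throw new Error("router unavailable");
  const { host } = await response.json();
  return host;
}

async function connectWithFailover(routerUrl, connect, maxAttempts = 10) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const primaryHost = await resolvePrimary(routerUrl);
      return await connect(primaryHost); // application-specific connection logic
    } catch (error) {
      // The primary may be mid-failover; back off briefly and re-resolve.
      console.warn(`Connect attempt ${attempt} failed: ${error.message}`);
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
  throw new Error("Could not reach a primary after failover");
}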

RTO vs RPO

Recovery Time Objective (RTO):

RTO is the maximum acceptable time to restore service after a failure, in other words how long the system is allowed to be down. Hot standby failover typically targets an RTO of seconds to a few minutes.

Recovery Point Objective (RPO):

RPO is the maximum acceptable amount of data loss, expressed as a window of time. An RPO of zero requires synchronous replication; asynchronous replication leaves a window equal to the replication lag.
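
A quick worked example with hypothetical timestamps, showing how the two metrics are measured after an incident:

// Measuring the RTO and RPO actually achieved during an outage.
const failureAt        = Date.parse("2024-01-01T12:00:00Z"); // primary crashed
const serviceRestored  = Date.parse("2024-01-01T12:00:45Z"); // standby serving traffic
const lastReplicatedTx = Date.parse("2024-01-01T11:59:58Z"); // newest write present on the standby

const achievedRtoSeconds = (serviceRestored - failureAt) / 1000;  // 45s of downtime
const achievedRpoSeconds = (failureAt - lastReplicatedTx) / 1000; // 2s of writes lost

console.log({ achievedRtoSeconds, achievedRpoSeconds });
// With synchronous replication (as in the payment-system answer below), the RPO term is 0.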

Split-Brain Problem

What is Split-Brain?

Split-brain occurs when a network partition or overly aggressive failure detection leaves two nodes each believing they are the primary. Both accept writes, the datasets diverge, and reconciling them afterwards is painful or impossible.

Prevention Techniques:

1. Fencing (STONITH)

Forcibly isolate the old primary before promoting the standby, for example by cutting its network access, revoking its storage access, or powering it off ("Shoot The Other Node In The Head").

2. Distributed Locks / Leases

Only the node holding a lock in a coordination service such as etcd or ZooKeeper may act as primary. The lock is tied to a lease that must be renewed continuously, so a partitioned or crashed primary loses it automatically.

3. Quorum / Witness Node

Require a majority of nodes, optionally including a lightweight witness that stores no data, to agree before promoting a new primary. A minority partition can never reach quorum, so it cannot elect its own leader (the sketch after this list includes the quorum calculation).

4. Generation Number / Epoch

Every promotion increments a monotonically increasing epoch (generation) number. Replicas and clients reject requests that carry a stale epoch, so a deposed primary that comes back cannot overwrite newer data.
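
A minimal sketch of the last two techniques (not any specific database's protocol): a replica that rejects writes carrying a stale epoch, plus the usual majority-quorum calculation used before promoting a new primary.

// quorum(): how many votes a majority needs, e.g. 2 of 3, 3 of 5.
function quorum(clusterSize) {
  return Math.floor(clusterSize / 2) + 1;
}

// A replica that tracks the highest epoch it has seen and rejects writes
// from any primary still using an older (stale) epoch.
class Replica {
  constructor() {
    this.currentEpoch = 0;
  }

  acceptNewLeader(epoch) {
    // Called when a new primary is elected; the election bumps the epoch.
    if (epoch > this.currentEpoch) this.currentEpoch = epoch;
  }

  applyWrite(write) {
    if (write.epoch < this.currentEpoch) {
      // A deposed primary is still sending writes: refuse them.
      throw new Error(
        `rejected write from stale epoch ${write.epoch} (current: ${this.currentEpoch})`
      );
    }
    return "applied";
  }
}

// Usage: after failover the cluster is at epoch 2; the old primary still writes with epoch 1.
const replica = new Replica();
replica.acceptNewLeader(2);
console.log(replica.applyWrite({ epoch: 2, key: "balance", value: 100 })); // "applied"
// replica.applyWrite({ epoch: 1, key: "balance", value: 90 });            // throws: stale epoch
console.log(quorum(3)); // 2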

Real Systems Using Failover

| System | Failover Type | Detection Time | RTO | RPO | Split-Brain Prevention |
| --- | --- | --- | --- | --- | --- |
| PostgreSQL | Hot Standby (streaming replication) | 10-30s | ~30s | 0-5s | Witness server, fencing |
| MySQL | Semi-sync replication | 10-30s | ~1min | 0 (sync ack) | GTID, virtual IP |
| Redis Sentinel | Hot Standby (Sentinel monitors) | 5-15s | ~10s | 0-1s | Quorum (majority of Sentinels) |
| Kafka Controller | Hot Standby (ZooKeeper election) | 10-30s | ~20s | 0 (committed log) | ZooKeeper leader election, epoch |
| AWS RDS | Multi-AZ (automated failover) | 30-60s | 60-120s | 0 | AWS orchestration |
| Cassandra | Leaderless (no failover needed) | N/A | N/A | N/A | Quorum reads/writes |

Case Study: PostgreSQL Failover

Patroni runs next to each PostgreSQL instance, stores cluster state and a leader lease in a DCS such as etcd or ZooKeeper, and automatically promotes the most up-to-date healthy replica when the leader's lease expires.

Case Study: Kafka Controller Failover

Kafka elects a single controller broker, via ZooKeeper in older versions or a KRaft quorum in newer ones. When the controller fails, another broker wins the election, and a new controller epoch fences any lingering requests from the old controller.

When to Use Failover

Use Failover When:

High Availability Required

User-facing services with uptime SLAs (99.9% and above), where minutes of downtime translate directly into lost revenue or broken SLOs.

Data Loss Unacceptable

Systems of record such as payments, orders, or ledgers, where synchronous replication to a hot standby keeps the RPO at zero.

RTO Measured in Seconds/Minutes

Services whose recovery budget is measured in seconds or minutes, which rules out manual recovery or restoring from backups.

When NOT to Use Failover:

Stateless Services

Stateless services don't need failover. Run multiple instances behind a load balancer and let it route around failed ones.

Eventually Consistent Systems

Leaderless, eventually consistent stores such as Cassandra tolerate node failures through quorum reads and writes; there is no single primary to fail over.

Cost-Sensitive Non-Critical Systems

A hot standby roughly doubles infrastructure cost. For non-critical systems, restoring from backups (a cold standby) is often good enough.

Interview Application

Common Interview Question

Q: “Design a highly available database for a payment system. How would you handle primary database failure?”

Strong Answer:

“For a payment system where data loss is unacceptable, I’d design a hot standby failover system with these characteristics:

Architecture:

  1. Primary database with synchronous replication to hot standby
  2. Hot standby in different availability zone (same region for low latency)
  3. Health monitoring via Patroni or similar tool (5-second heartbeat)
  4. Distributed coordination using etcd or ZooKeeper for split-brain prevention

Failure Handling:

  1. Detection (10s): Patroni detects primary unresponsive after 2 missed heartbeats
  2. Fencing (2s): Revoke old primary’s write access via network isolation
  3. Promotion (5s): Standby promoted to primary, acquires leadership lock in etcd
  4. Routing (5s): Update floating IP or DNS to point to new primary
  5. Reconnection (5s): Applications retry connections, resume transactions

Guarantees:

  • RTO: ~27 seconds (acceptable for payment system)
  • RPO: 0 seconds (synchronous replication means zero data loss)
  • Consistency: Transactions committed to both primary and standby before ACK

Trade-offs:

  • Latency: Synchronous replication adds ~5-10ms to write latency
  • Cost: Hot standby doubles database infrastructure cost
  • Complexity: Patroni/etcd adds operational complexity

Split-Brain Prevention:

  • Use distributed lock in etcd (only one primary can hold lock)
  • Primary must renew lease every 5 seconds or lose write access
  • Generation numbers (epochs) to detect stale primaries

Real-World Example:

  • Similar to AWS RDS Multi-AZ: synchronous replication, automatic failover
  • Or PostgreSQL + Patroni (used by Zalando, widely adopted)”

Why This Answer Works:

  • Identifies appropriate failover type (hot standby) for use case
  • Explains step-by-step process with timing
  • Discusses RTO/RPO trade-offs explicitly
  • Addresses split-brain problem proactively
  • References real implementations

Code Example

Implementing Simple Failover with Health Checks

// Health Monitor for Failover Detection
class HealthMonitor {
  constructor(primaryUrl, standbyUrl, config) {
    this.primaryUrl = primaryUrl;
    this.standbyUrl = standbyUrl;
    this.config = {
      heartbeatInterval: config.heartbeatInterval || 5000, // 5s
      failureThreshold: config.failureThreshold || 3, // 3 misses
      ...config,
    };

    this.consecutiveFailures = 0;
    this.currentPrimary = primaryUrl;
    this.isFailedOver = false;
  }

  async start() {
    setInterval(() => this.checkHealth(), this.config.heartbeatInterval);
  }

  async checkHealth() {
    try {
      const response = await fetch(`${this.currentPrimary}/health`, {
        signal: AbortSignal.timeout(3000), // 3s timeout (Node 18+ global fetch)
      });

      if (response.ok) {
        // Primary is healthy
        this.consecutiveFailures = 0;
        console.log(
          `[${new Date().toISOString()}] Primary healthy: ${this.currentPrimary}`
        );
      } else {
        this.handleFailure();
      }
    } catch (error) {
      // Network error, timeout, or server down
      this.handleFailure();
    }
  }

  handleFailure() {
    this.consecutiveFailures++;
    console.log(
      `[${new Date().toISOString()}] Primary unhealthy (${this.consecutiveFailures}/${this.config.failureThreshold})`
    );

    if (this.consecutiveFailures >= this.config.failureThreshold) {
      this.triggerFailover();
    }
  }

  async triggerFailover() {
    if (this.isFailedOver) {
      console.log("Already failed over, skipping");
      return;
    }

    console.log(`[${new Date().toISOString()}] TRIGGERING FAILOVER!`);

    try {
      // Step 1: Fence old primary (prevent split-brain)
      await this.fenceOldPrimary();

      // Step 2: Promote standby to primary
      await this.promoteStandby();

      // Step 3: Update routing
      this.currentPrimary = this.standbyUrl;
      this.isFailedOver = true;
      this.consecutiveFailures = 0;

      console.log(
        `[${new Date().toISOString()}] Failover complete. New primary: ${this.currentPrimary}`
      );

      // Notify operators
      await this.sendAlert("Failover completed", {
        oldPrimary: this.primaryUrl,
        newPrimary: this.standbyUrl,
      });
    } catch (error) {
      console.error("Failover failed:", error);
      await this.sendAlert("Failover FAILED", { error: error.message });
    }
  }

  async fenceOldPrimary() {
    // In production: disable old primary at network/firewall level
    // Or send STONITH command to power management
    console.log("Fencing old primary (preventing writes)...");

    try {
      await fetch(`${this.primaryUrl}/admin/disable`, {
        method: "POST",
        signal: AbortSignal.timeout(2000),
      });
    } catch (error) {
      // Old primary might be completely down, that's OK
      console.log("Could not fence old primary (might be completely down)");
    }
  }

  async promoteStandby() {
    console.log("Promoting standby to primary...");

    const response = await fetch(`${this.standbyUrl}/admin/promote`, {
      method: "POST",
      signal: AbortSignal.timeout(10000), // Give it time to catch up replication
    });

    if (!response.ok) {
      throw new Error("Failed to promote standby");
    }

    // Wait for promotion to complete
    await this.waitForStandbyReady();
  }

  async waitForStandbyReady() {
    const maxWait = 30000; // 30s max
    const startTime = Date.now();

    while (Date.now() - startTime < maxWait) {
      try {
        const response = await fetch(`${this.standbyUrl}/health`);
        if (response.ok) {
          const data = await response.json();
          if (data.role === "primary") {
            console.log("Standby successfully promoted to primary");
            return;
          }
        }
      } catch (error) {
        // Still promoting, retry
      }

      await new Promise(resolve => setTimeout(resolve, 1000)); // Wait 1s
    }

    throw new Error("Standby promotion timeout");
  }

  async sendAlert(message, details) {
    // In production: Send to PagerDuty, Slack, email, etc.
    console.error(`ALERT: ${message}`, details);
  }
}

// Usage
const monitor = new HealthMonitor(
  "http://primary.db.example.com:5432",
  "http://standby.db.example.com:5432",
  {
    heartbeatInterval: 5000, // Check every 5 seconds
    failureThreshold: 3, // Failover after 3 consecutive failures
  }
);

monitor.start();

// Expected timeline on failure:
// T0:  Primary crashes
// T5:  First health check fails (1/3)
// T10: Second health check fails (2/3)
// T15: Third health check fails (3/3) → Trigger failover
// T16: Fence old primary (1s)
// T21: Promote standby (5s)
// T22: Update routing
// Total downtime: ~22 seconds

Split-Brain Prevention with Distributed Lock

// Using etcd for distributed locking to prevent split-brain
const { Etcd3 } = require("etcd3");

class FailoverCoordinator {
  constructor(etcdHosts, nodeId) {
    this.etcd = new Etcd3({ hosts: etcdHosts });
    this.nodeId = nodeId;
    this.leaderKey = "/cluster/leader";
    this.lease = null;
    this.isLeader = false;
  }

  async tryBecomeLeader() {
    try {
      // Create a lease (TTL = 10 seconds) and grant it to obtain its ID,
      // so the leader key is deleted automatically if we stop renewing
      this.lease = this.etcd.lease(10);
      const leaseId = await this.lease.grant();

      // Try to acquire leader lock with this lease
      const result = await this.etcd
        .if(this.leaderKey, "Create", "==", 0) // Only if key doesn't exist
        .then(
          this.etcd.put(this.leaderKey).value(this.nodeId).lease(leaseId)
        )
        .else(this.etcd.get(this.leaderKey))
        .commit();

      if (result.succeeded) {
        this.isLeader = true;
        console.log(`[${this.nodeId}] Became leader!`);

        // Keep renewing lease to maintain leadership
        this.startLeaseRenewal();

        return true;
      } else {
        const currentLeader =
          result.responses[0].response_range.kvs[0].value.toString();
        console.log(
          `[${this.nodeId}] Failed to become leader. Current leader: ${currentLeader}`
        );
        return false;
      }
    } catch (error) {
      console.error(`[${this.nodeId}] Error acquiring leadership:`, error);
      return false;
    }
  }

  async startLeaseRenewal() {
    // Renew lease every 5 seconds (TTL is 10s, so we have buffer)
    this.renewalInterval = setInterval(async () => {
      try {
        await this.lease.keepaliveOnce();
        console.log(`[${this.nodeId}] Lease renewed`);
      } catch (error) {
        console.error(
          `[${this.nodeId}] Failed to renew lease, losing leadership`
        );
        this.isLeader = false;
        clearInterval(this.renewalInterval);

        // Try to become leader again
        setTimeout(() => this.tryBecomeLeader(), 1000);
      }
    }, 5000);
  }

  async stepDown() {
    if (this.lease) {
      await this.lease.revoke();
      clearInterval(this.renewalInterval);
    }
    this.isLeader = false;
    console.log(`[${this.nodeId}] Stepped down from leadership`);
  }

  canWrite() {
    // Only leader can write (prevents split-brain)
    return this.isLeader;
  }
}

// Usage on Primary and Standby nodes:

// Primary node
const primary = new FailoverCoordinator(["localhost:2379"], "node-primary");
await primary.tryBecomeLeader(); // Acquires lock

// Standby node
const standby = new FailoverCoordinator(["localhost:2379"], "node-standby");
await standby.tryBecomeLeader(); // Fails (primary holds lock)

// Simulate primary failure (lease expires after 10s without renewal)
// ... network partition or crash ...

// After 10s, standby can acquire lock
await standby.tryBecomeLeader(); // Succeeds! Becomes new leader

// If old primary comes back:
await primary.tryBecomeLeader(); // Fails! Standby is now leader
// Old primary cannot write without leadership lock ✓

Used In Systems:

  • High-availability databases (PostgreSQL, MySQL, Redis)
  • Distributed coordination (ZooKeeper, etcd)
  • Message brokers (Kafka controller election)

Explained In Detail:

  • Distributed Systems Deep Dive - Failover patterns in depth

Quick Self-Check

  • Can explain failover in 60 seconds?
  • Understand difference between hot/warm/cold standby?
  • Can explain RTO vs RPO trade-offs?
  • Understand split-brain problem and prevention techniques?
  • Know failure detection with heartbeats and thresholds?
  • Can design failover for a production database?

Interview Notes

  • Interview Relevance: appears in roughly 70% of HA design interviews
  • Production Impact: powers systems at Netflix, AWS, Google
  • Performance: enables 99.99%+ uptime
  • Scalability: RTO/RPO optimization