Skip to content

Failure Detection

Mechanisms to identify when nodes in a distributed system have failed, enabling recovery and fault tolerance

TL;DR

Failure detection determines whether a node in a distributed system is alive or dead. The fundamental challenge: you cannot distinguish a crashed node from a slow node or a network partition—all three look the same (no response). Failure detectors make a judgment call with a trade-off between detection speed and false positive rate. Advanced approaches like Phi Accrual provide probabilistic detection instead of binary alive/dead.

Visual Overview

Failure Detection

Core Explanation

What is Failure Detection?

Real-World Analogy: Imagine you’re a 911 dispatcher with a list of emergency responders. You call each one every hour to confirm they’re available. If someone doesn’t answer, are they:

  • Dead? (actual failure)
  • In a tunnel with no signal? (network partition)
  • On another call? (overloaded)
  • Asleep? (slow to respond)

You can’t know for sure. After 3 missed calls, you might mark them unavailable and route emergencies elsewhere. That’s failure detection—making a judgment call under uncertainty.

The Fundamental Problem

In an asynchronous distributed system:

  • There’s no upper bound on message delivery time
  • A message might arrive in 1ms or 10 minutes
  • You cannot distinguish “dead” from “very slow” with certainty

This is known as the FLP impossibility result: in an asynchronous system with even one faulty process, no deterministic algorithm can guarantee consensus.

Failure Detector Properties

Failure Detector Properties

Common Failure Detection Approaches

Failure Detection Approaches

Phi Accrual Failure Detector Deep Dive

Phi Accrual Failure Detector

Real Systems Using Failure Detection

SystemApproachConfigurationNotes
Apache CassandraPhi Accrualphi_convict_threshold (default: 8)Adaptive to network
Akka ClusterPhi AccrualConfigurable thresholdBuilt-in module
ZooKeeperTimeout-basedSession timeoutSimple fixed timeout
etcd/RaftHeartbeat timeoutElection timeoutFor leader failure
ConsulGossip + timeoutConfigurableSWIM protocol variant
KubernetesTimeout-basedNode not ready timeoutKubelet heartbeats

Note: Implementation details vary by version. Verify in current documentation.

Failure Detection States

Failure Detection State Machine

When to Use Different Approaches

✓ Approach Selection Guide

Choosing a Failure Detection Approach

✕ Pitfalls to Avoid

Failure Detection Pitfalls

Interview Application

Common Interview Question

Q: “In distributed systems, how do you detect node failures? What are the challenges and trade-offs?”

Strong Answer:

“Failure detection is fundamentally challenging because you can’t distinguish a crashed node from a slow node or a network partition—all produce the same observable behavior: no response.

The Trade-off: You’re choosing between detection speed and false positive rate:

  • Short timeout (1s): Fast detection, but network hiccups trigger false alarms
  • Long timeout (30s): Fewer false alarms, but traffic goes to dead nodes for 30 seconds

Common Approaches:

  1. Fixed Timeout (simplest):
    • No heartbeat within N seconds → suspected
    • Pros: Simple, predictable
    • Cons: Doesn’t adapt to network conditions
  2. Phi Accrual (adaptive):
    • Calculate probability of failure based on heartbeat history
    • φ = -log₁₀(P(alive))
    • φ > 8 typically means dead (~99.999999% confidence)
    • Adapts to actual network variance
  3. Gossip-based (distributed):
    • Nodes share suspicions with peers
    • Majority suspicion → confirmed failure
    • No single point of failure

Real-World Examples:

  • Cassandra uses Phi Accrual (phi_convict_threshold: 8)
  • Kubernetes uses fixed timeout (~40s)
  • Consul uses gossip-based (SWIM protocol)

Key insight: Perfect failure detection is impossible in asynchronous systems (FLP impossibility). Real systems use ‘eventually perfect’ detectors—they may make temporary mistakes but eventually converge to correct.”

Follow-up: How do you handle false positives in failure detection?

“False positives are dangerous—they can cause:

  • Unnecessary failovers (disruption)
  • Split brain (two primaries)
  • Cascade failures (eviction storm)

Mitigation strategies:

  1. Suspected state: Don’t act immediately. Mark as suspected first, then confirm with additional checks before marking dead.

  2. Peer confirmation: Before declaring a node dead, ask other nodes if they can reach it. Majority agreement prevents network-partition-based false positives.

  3. Hysteresis: Require sustained failure (3 missed heartbeats, not 1) before suspecting. Require sustained health before marking recovered.

  4. Rate limiting evictions: Don’t evict more than N nodes per minute. If you’re seeing mass evictions, something systemic is wrong.

  5. Self-fencing: If a node suspects it might be partitioned (its heartbeats aren’t being acknowledged), it should stop serving requests. ‘Better to be down than to be split-brain.’”

Follow-up: Explain the Phi Accrual failure detector.

“Phi Accrual replaces binary alive/dead with a suspicion level (φ) representing ‘how suspicious is the current situation?’

How it works:

  1. Track heartbeat inter-arrival times
  2. Model as normal distribution (mean μ, std dev σ)
  3. When checking: calculate how unlikely is this gap
  4. φ = -log₁₀(P(heartbeat still coming))

Interpretation:

  • φ = 1: ~10% chance of failure (probably fine)
  • φ = 3: ~0.1% chance alive (suspicious)
  • φ = 8: ~10⁻⁸ chance alive (almost certainly dead)

Why it’s better:

  • Adapts to actual network conditions
  • Fewer false positives during temporary slowness
  • Higher conviction threshold possible without slow detection

Configuration: Choose φ threshold based on risk tolerance. Higher threshold = fewer false positives but slower detection.”

Code Example

Phi Accrual Failure Detector (Python)

import time
import math
from collections import deque
from typing import Optional
from dataclasses import dataclass

@dataclass
class PhiAccrualDetector:
    """
    Phi Accrual Failure Detector.

    Calculates suspicion level (phi) based on heartbeat arrival history.
    Higher phi = more likely the node has failed.
    """

    # Configuration
    threshold: float = 8.0  # phi above this = suspected
    min_samples: int = 5    # minimum heartbeats before calculating
    max_samples: int = 200  # sliding window size

    # State
    arrival_times: deque = None
    last_heartbeat: float = None

    def __post_init__(self):
        self.arrival_times = deque(maxlen=self.max_samples)
        self.last_heartbeat = None

    def heartbeat_received(self) -> None:
        """Record a heartbeat arrival."""
        now = time.time()

        if self.last_heartbeat is not None:
            interval = now - self.last_heartbeat
            self.arrival_times.append(interval)

        self.last_heartbeat = now

    def _mean_and_stddev(self) -> tuple[float, float]:
        """Calculate mean and standard deviation of intervals."""
        if len(self.arrival_times) < self.min_samples:
            return None, None

        intervals = list(self.arrival_times)
        n = len(intervals)
        mean = sum(intervals) / n

        variance = sum((x - mean) ** 2 for x in intervals) / n
        stddev = math.sqrt(variance)

        # Ensure minimum stddev to avoid division issues
        stddev = max(stddev, 0.1)

        return mean, stddev

    def phi(self) -> Optional[float]:
        """
        Calculate current phi (suspicion level).

        Returns:
            phi value, or None if not enough samples
        """
        if self.last_heartbeat is None:
            return None

        mean, stddev = self._mean_and_stddev()
        if mean is None:
            return None

        # Time since last heartbeat
        t = time.time() - self.last_heartbeat

        # Calculate P(interval > t) using normal distribution CDF
        # P_later = 1 - CDF(t) = 1 - (1/2)[1 + erf((t-μ)/(σ√2))]
        # Simplified: P_later = (1/2)[1 - erf((t-μ)/(σ√2))]

        z = (t - mean) / (stddev * math.sqrt(2))
        p_later = 0.5 * (1 - math.erf(z))

        # Avoid log(0)
        if p_later <= 0:
            return float('inf')

        # phi = -log10(P_later)
        return -math.log10(p_later)

    def is_available(self) -> bool:
        """
        Check if node should be considered available.

        Returns:
            True if phi is below threshold or not enough samples
        """
        current_phi = self.phi()
        if current_phi is None:
            return True  # Benefit of doubt with insufficient data

        return current_phi < self.threshold

    def status(self) -> dict:
        """Get current detector status."""
        mean, stddev = self._mean_and_stddev()
        current_phi = self.phi()

        return {
            "phi": round(current_phi, 2) if current_phi else None,
            "threshold": self.threshold,
            "is_available": self.is_available(),
            "samples": len(self.arrival_times),
            "mean_interval": round(mean, 3) if mean else None,
            "stddev": round(stddev, 3) if stddev else None,
            "last_heartbeat_ago": (
                round(time.time() - self.last_heartbeat, 2)
                if self.last_heartbeat else None
            ),
        }


# Usage example
if __name__ == "__main__":
    import random

    print("=== Phi Accrual Failure Detector Demo ===\n")

    detector = PhiAccrualDetector(threshold=8.0)

    # Simulate normal heartbeats (1 second interval ± jitter)
    print("Simulating healthy node (1s interval ± 0.1s jitter):")
    for i in range(10):
        detector.heartbeat_received()
        status = detector.status()
        print(f"  Heartbeat {i+1}: φ={status['phi']}, available={status['is_available']}")
        time.sleep(1 + random.uniform(-0.1, 0.1))

    print("\nSimulating slow heartbeat (3s delay):")
    time.sleep(3)
    status = detector.status()
    print(f"  After 3s gap: φ={status['phi']}, available={status['is_available']}")

    detector.heartbeat_received()
    print("  Heartbeat received, recovering...")

    print("\nSimulating node failure (no more heartbeats):")
    for i in range(8):
        time.sleep(1)
        status = detector.status()
        print(f"  +{i+1}s: φ={status['phi']}, available={status['is_available']}")

        if not status['is_available']:
            print("  → Node marked as unavailable!")
            break

    print(f"\nFinal status: {detector.status()}")

Adaptive Timeout (Simpler Alternative)

import time
from collections import deque
from typing import Optional

class AdaptiveTimeoutDetector:
    """
    Adaptive timeout failure detector.

    Simpler than Phi Accrual but still adapts to network conditions.
    timeout = mean + (k * stddev)
    """

    def __init__(self, k: float = 4.0, min_samples: int = 5, max_samples: int = 100):
        """
        Args:
            k: Number of standard deviations to add to mean
            min_samples: Minimum heartbeats before adapting
            max_samples: Sliding window size
        """
        self.k = k
        self.min_samples = min_samples
        self.intervals = deque(maxlen=max_samples)
        self.last_heartbeat: Optional[float] = None
        self.default_timeout = 5.0  # Used before enough samples

    def heartbeat_received(self) -> None:
        """Record heartbeat arrival."""
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def get_timeout(self) -> float:
        """Calculate adaptive timeout."""
        if len(self.intervals) < self.min_samples:
            return self.default_timeout

        intervals = list(self.intervals)
        mean = sum(intervals) / len(intervals)
        variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
        stddev = variance ** 0.5

        return mean + (self.k * stddev)

    def is_available(self) -> bool:
        """Check if node is available."""
        if self.last_heartbeat is None:
            return False

        elapsed = time.time() - self.last_heartbeat
        return elapsed < self.get_timeout()

See It In Action:

Related Concepts:

Quick Self-Check

  • Can explain why failure detection is fundamentally hard?
  • Understand the trade-off between detection speed and false positives?
  • Know the difference between timeout-based and phi accrual?
  • Can explain what phi represents in phi accrual?
  • Understand why “suspected” state exists before “dead”?
  • Know how gossip-based detection avoids single point of failure?
Interview Notes
💼60% of distributed systems interviews
Interview Relevance
60% of distributed systems interviews
🏭Cluster management
Production Impact
Powers systems at Cluster management
Detection latency vs accuracy
Performance
Detection latency vs accuracy query improvement
📈False positive rate
Scalability
False positive rate