Mechanisms to identify when nodes in a distributed system have failed, enabling recovery and fault tolerance
TL;DR
Failure detection determines whether a node in a distributed system is alive or dead. The fundamental challenge: you cannot distinguish a crashed node from a slow node or a network partition—all three look the same (no response). Failure detectors make a judgment call with a trade-off between detection speed and false positive rate. Advanced approaches like Phi Accrual provide probabilistic detection instead of binary alive/dead.
Visual Overview
Core Explanation
What is Failure Detection?
Real-World Analogy: Imagine you’re a 911 dispatcher with a list of emergency responders. You call each one every hour to confirm they’re available. If someone doesn’t answer, are they:
- Dead? (actual failure)
- In a tunnel with no signal? (network partition)
- On another call? (overloaded)
- Asleep? (slow to respond)
You can’t know for sure. After 3 missed calls, you might mark them unavailable and route emergencies elsewhere. That’s failure detection—making a judgment call under uncertainty.
The Fundamental Problem
In an asynchronous distributed system:
- There’s no upper bound on message delivery time
- A message might arrive in 1ms or 10 minutes
- You cannot distinguish “dead” from “very slow” with certainty
This is known as the FLP impossibility result: in an asynchronous system with even one faulty process, no deterministic algorithm can guarantee consensus.
Failure Detector Properties
Common Failure Detection Approaches
Phi Accrual Failure Detector Deep Dive
Real Systems Using Failure Detection
| System | Approach | Configuration | Notes |
|---|---|---|---|
| Apache Cassandra | Phi Accrual | phi_convict_threshold (default: 8) | Adaptive to network |
| Akka Cluster | Phi Accrual | Configurable threshold | Built-in module |
| ZooKeeper | Timeout-based | Session timeout | Simple fixed timeout |
| etcd/Raft | Heartbeat timeout | Election timeout | For leader failure |
| Consul | Gossip + timeout | Configurable | SWIM protocol variant |
| Kubernetes | Timeout-based | Node not ready timeout | Kubelet heartbeats |
Note: Implementation details vary by version. Verify in current documentation.
Failure Detection States
When to Use Different Approaches
✓ Approach Selection Guide
✕ Pitfalls to Avoid
Interview Application
Common Interview Question
Q: “In distributed systems, how do you detect node failures? What are the challenges and trade-offs?”
Strong Answer:
“Failure detection is fundamentally challenging because you can’t distinguish a crashed node from a slow node or a network partition—all produce the same observable behavior: no response.
The Trade-off: You’re choosing between detection speed and false positive rate:
- Short timeout (1s): Fast detection, but network hiccups trigger false alarms
- Long timeout (30s): Fewer false alarms, but traffic goes to dead nodes for 30 seconds
Common Approaches:
- Fixed Timeout (simplest):
- No heartbeat within N seconds → suspected
- Pros: Simple, predictable
- Cons: Doesn’t adapt to network conditions
- Phi Accrual (adaptive):
- Calculate probability of failure based on heartbeat history
- φ = -log₁₀(P(alive))
- φ > 8 typically means dead (~99.999999% confidence)
- Adapts to actual network variance
- Gossip-based (distributed):
- Nodes share suspicions with peers
- Majority suspicion → confirmed failure
- No single point of failure
Real-World Examples:
- Cassandra uses Phi Accrual (phi_convict_threshold: 8)
- Kubernetes uses fixed timeout (~40s)
- Consul uses gossip-based (SWIM protocol)
Key insight: Perfect failure detection is impossible in asynchronous systems (FLP impossibility). Real systems use ‘eventually perfect’ detectors—they may make temporary mistakes but eventually converge to correct.”
Follow-up: How do you handle false positives in failure detection?
“False positives are dangerous—they can cause:
- Unnecessary failovers (disruption)
- Split brain (two primaries)
- Cascade failures (eviction storm)
Mitigation strategies:
Suspected state: Don’t act immediately. Mark as suspected first, then confirm with additional checks before marking dead.
Peer confirmation: Before declaring a node dead, ask other nodes if they can reach it. Majority agreement prevents network-partition-based false positives.
Hysteresis: Require sustained failure (3 missed heartbeats, not 1) before suspecting. Require sustained health before marking recovered.
Rate limiting evictions: Don’t evict more than N nodes per minute. If you’re seeing mass evictions, something systemic is wrong.
Self-fencing: If a node suspects it might be partitioned (its heartbeats aren’t being acknowledged), it should stop serving requests. ‘Better to be down than to be split-brain.’”
Follow-up: Explain the Phi Accrual failure detector.
“Phi Accrual replaces binary alive/dead with a suspicion level (φ) representing ‘how suspicious is the current situation?’
How it works:
- Track heartbeat inter-arrival times
- Model as normal distribution (mean μ, std dev σ)
- When checking: calculate how unlikely is this gap
- φ = -log₁₀(P(heartbeat still coming))
Interpretation:
- φ = 1: ~10% chance of failure (probably fine)
- φ = 3: ~0.1% chance alive (suspicious)
- φ = 8: ~10⁻⁸ chance alive (almost certainly dead)
Why it’s better:
- Adapts to actual network conditions
- Fewer false positives during temporary slowness
- Higher conviction threshold possible without slow detection
Configuration: Choose φ threshold based on risk tolerance. Higher threshold = fewer false positives but slower detection.”
Code Example
Phi Accrual Failure Detector (Python)
import time
import math
from collections import deque
from typing import Optional
from dataclasses import dataclass
@dataclass
class PhiAccrualDetector:
"""
Phi Accrual Failure Detector.
Calculates suspicion level (phi) based on heartbeat arrival history.
Higher phi = more likely the node has failed.
"""
# Configuration
threshold: float = 8.0 # phi above this = suspected
min_samples: int = 5 # minimum heartbeats before calculating
max_samples: int = 200 # sliding window size
# State
arrival_times: deque = None
last_heartbeat: float = None
def __post_init__(self):
self.arrival_times = deque(maxlen=self.max_samples)
self.last_heartbeat = None
def heartbeat_received(self) -> None:
"""Record a heartbeat arrival."""
now = time.time()
if self.last_heartbeat is not None:
interval = now - self.last_heartbeat
self.arrival_times.append(interval)
self.last_heartbeat = now
def _mean_and_stddev(self) -> tuple[float, float]:
"""Calculate mean and standard deviation of intervals."""
if len(self.arrival_times) < self.min_samples:
return None, None
intervals = list(self.arrival_times)
n = len(intervals)
mean = sum(intervals) / n
variance = sum((x - mean) ** 2 for x in intervals) / n
stddev = math.sqrt(variance)
# Ensure minimum stddev to avoid division issues
stddev = max(stddev, 0.1)
return mean, stddev
def phi(self) -> Optional[float]:
"""
Calculate current phi (suspicion level).
Returns:
phi value, or None if not enough samples
"""
if self.last_heartbeat is None:
return None
mean, stddev = self._mean_and_stddev()
if mean is None:
return None
# Time since last heartbeat
t = time.time() - self.last_heartbeat
# Calculate P(interval > t) using normal distribution CDF
# P_later = 1 - CDF(t) = 1 - (1/2)[1 + erf((t-μ)/(σ√2))]
# Simplified: P_later = (1/2)[1 - erf((t-μ)/(σ√2))]
z = (t - mean) / (stddev * math.sqrt(2))
p_later = 0.5 * (1 - math.erf(z))
# Avoid log(0)
if p_later <= 0:
return float('inf')
# phi = -log10(P_later)
return -math.log10(p_later)
def is_available(self) -> bool:
"""
Check if node should be considered available.
Returns:
True if phi is below threshold or not enough samples
"""
current_phi = self.phi()
if current_phi is None:
return True # Benefit of doubt with insufficient data
return current_phi < self.threshold
def status(self) -> dict:
"""Get current detector status."""
mean, stddev = self._mean_and_stddev()
current_phi = self.phi()
return {
"phi": round(current_phi, 2) if current_phi else None,
"threshold": self.threshold,
"is_available": self.is_available(),
"samples": len(self.arrival_times),
"mean_interval": round(mean, 3) if mean else None,
"stddev": round(stddev, 3) if stddev else None,
"last_heartbeat_ago": (
round(time.time() - self.last_heartbeat, 2)
if self.last_heartbeat else None
),
}
# Usage example
if __name__ == "__main__":
import random
print("=== Phi Accrual Failure Detector Demo ===\n")
detector = PhiAccrualDetector(threshold=8.0)
# Simulate normal heartbeats (1 second interval ± jitter)
print("Simulating healthy node (1s interval ± 0.1s jitter):")
for i in range(10):
detector.heartbeat_received()
status = detector.status()
print(f" Heartbeat {i+1}: φ={status['phi']}, available={status['is_available']}")
time.sleep(1 + random.uniform(-0.1, 0.1))
print("\nSimulating slow heartbeat (3s delay):")
time.sleep(3)
status = detector.status()
print(f" After 3s gap: φ={status['phi']}, available={status['is_available']}")
detector.heartbeat_received()
print(" Heartbeat received, recovering...")
print("\nSimulating node failure (no more heartbeats):")
for i in range(8):
time.sleep(1)
status = detector.status()
print(f" +{i+1}s: φ={status['phi']}, available={status['is_available']}")
if not status['is_available']:
print(" → Node marked as unavailable!")
break
print(f"\nFinal status: {detector.status()}")
Adaptive Timeout (Simpler Alternative)
import time
from collections import deque
from typing import Optional
class AdaptiveTimeoutDetector:
"""
Adaptive timeout failure detector.
Simpler than Phi Accrual but still adapts to network conditions.
timeout = mean + (k * stddev)
"""
def __init__(self, k: float = 4.0, min_samples: int = 5, max_samples: int = 100):
"""
Args:
k: Number of standard deviations to add to mean
min_samples: Minimum heartbeats before adapting
max_samples: Sliding window size
"""
self.k = k
self.min_samples = min_samples
self.intervals = deque(maxlen=max_samples)
self.last_heartbeat: Optional[float] = None
self.default_timeout = 5.0 # Used before enough samples
def heartbeat_received(self) -> None:
"""Record heartbeat arrival."""
now = time.time()
if self.last_heartbeat is not None:
self.intervals.append(now - self.last_heartbeat)
self.last_heartbeat = now
def get_timeout(self) -> float:
"""Calculate adaptive timeout."""
if len(self.intervals) < self.min_samples:
return self.default_timeout
intervals = list(self.intervals)
mean = sum(intervals) / len(intervals)
variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
stddev = variance ** 0.5
return mean + (self.k * stddev)
def is_available(self) -> bool:
"""Check if node is available."""
if self.last_heartbeat is None:
return False
elapsed = time.time() - self.last_heartbeat
return elapsed < self.get_timeout()
Related Content
See It In Action:
- Heartbeat & Failure Detection Explainer - Visual walkthrough of the trade-offs
Related Concepts:
- Heartbeat - The liveness signal
- Health Checks - Deeper health verification
- Failover - What happens after detection
- Gossip Protocol - Distributed failure detection
Quick Self-Check
- Can explain why failure detection is fundamentally hard?
- Understand the trade-off between detection speed and false positives?
- Know the difference between timeout-based and phi accrual?
- Can explain what phi represents in phi accrual?
- Understand why “suspected” state exists before “dead”?
- Know how gossip-based detection avoids single point of failure?
Interview Notes
60% of distributed systems interviews
Powers systems at Cluster management
Detection latency vs accuracy query improvement
False positive rate