Failure detection mechanisms in distributed systems: how to determine if a node is alive, dead, or just slow, enabling automatic failover and self-healing systems
TL;DR
Health checks are probes that determine if a service is alive and functioning correctly. They enable load balancers to route traffic away from failing nodes, orchestrators to restart unhealthy containers, and distributed systems to trigger failover. The key trade-off: faster detection means more false positives.
Why Health Checks Matter
The Fundamental Problem
How do you know if a remote node is dead or just slow?
| Scenario | Network Response | Reality |
|---|---|---|
| Node crashed | Timeout | Dead |
| Node overloaded | Timeout | Alive but struggling |
| Network partition | Timeout | Alive but unreachable |
| GC pause | Timeout then responds | Alive |
To the caller, all four look the same: no response within the timeout.
Impact of Getting It Wrong
| Tuning | Failure mode | Consequence |
|---|---|---|
| Too aggressive | False positives | Healthy nodes marked dead, cascading restarts |
| Too conservative | False negatives | Dead nodes continue receiving traffic |
Health Check Patterns
1. HTTP Health Endpoints
from flask import Flask

app = Flask(__name__)
# `db` and `cache` stand in for the application's real database and cache clients.

# Simple liveness check: the process is up and able to answer requests
@app.route('/health/live')
def liveness():
    return {'status': 'alive'}, 200

# Readiness check with dependencies: the service can do useful work right now
@app.route('/health/ready')
def readiness():
    if not db.is_connected():
        return {'status': 'not ready', 'reason': 'DB unavailable'}, 503
    if not cache.is_connected():
        return {'status': 'not ready', 'reason': 'Cache unavailable'}, 503
    return {'status': 'ready'}, 200
2. TCP Health Checks
For non-HTTP services:
- Open TCP connection to port
- Success = port responding
- Used by: AWS ELB, HAProxy
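A minimal TCP check is just a connection attempt with a timeout; the sketch below (host, port, and timeout values are illustrative) also shows the limitation that a refused, unreachable, or timed-out connection all collapse into the same "unhealthy" answer.

import socket

def tcp_health_check(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Success only proves the port accepts connections, not that the
    application behind it can do useful work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Timeout, connection refused, host unreachable: all look the same here.
        return False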
3. gRPC Health Checking Protocol
service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;   // empty string = the server as a whole
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
}
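On the Python server side, the standard grpcio-health-checking package ships an implementation of this service; a minimal sketch, assuming that package is installed (the service name and port are illustrative):

from concurrent import futures

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))

# Register the standard Health service next to your own services.
health_servicer = health.HealthServicer()
health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

# Report status for the server as a whole ('') and for a named service.
health_servicer.set('', health_pb2.HealthCheckResponse.SERVING)
health_servicer.set('my.package.MyService', health_pb2.HealthCheckResponse.SERVING)

server.add_insecure_port('[::]:50051')
server.start()
server.wait_for_termination()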
Heartbeat Protocols
Push-Based (Heartbeats)
Node periodically sends “I’m alive” to monitor.
Pros: Lower monitor load
Cons: A dead node is simply silent, which must be distinguished from a network issue
Pull-Based (Polling)
Monitor periodically checks each node.
Pros: Centralized view of cluster health
Cons: The monitor becomes a bottleneck at scale
Gossip-Based
Nodes share health info peer-to-peer.
Pros: Scalable, no single point of failure
Cons: Eventually consistent detection
Used by: Cassandra, Consul
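To make the push-based variant concrete, here is a minimal timeout-based monitor sketch (node IDs, interval, and allowance are illustrative); note how it declares a node dead after a fixed number of missed intervals, which is exactly the rigidity the phi accrual detector below avoids.

import time

HEARTBEAT_INTERVAL_S = 1.0   # how often each node is expected to report in
MISSED_BEATS_ALLOWED = 3     # tolerate a few late or lost heartbeats

last_seen: dict[str, float] = {}   # node id -> timestamp of last heartbeat

def record_heartbeat(node_id: str) -> None:
    """Called whenever a heartbeat message arrives from a node."""
    last_seen[node_id] = time.monotonic()

def dead_nodes() -> list[str]:
    """Nodes whose last heartbeat is older than the allowed window."""
    deadline = time.monotonic() - HEARTBEAT_INTERVAL_S * MISSED_BEATS_ALLOWED
    return [node for node, ts in last_seen.items() if ts < deadline]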
Phi Accrual Failure Detector
Instead of a binary alive/dead verdict, compute a continuous suspicion level φ from the observed history of heartbeat arrival times:
φ(t) = -log10(P_later(t - t_last))
where P_later is the probability, estimated from the distribution of past heartbeat inter-arrival times, that a heartbeat would arrive even later than the time already elapsed since the last one. Higher φ means stronger suspicion of failure:
φ = 1 → ~10% chance the suspicion is wrong (the node is actually alive)
φ = 2 → ~1% chance the suspicion is wrong
φ = 8 → ~0.000001% chance the suspicion is wrong
Threshold: Mark the node dead when φ exceeds a configured threshold, commonly 8
Advantage: Adapts to observed network conditions automatically.
Used by: Cassandra, Akka
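A simplified single-node sketch of the idea, assuming heartbeat inter-arrival times are roughly normally distributed (real implementations such as Cassandra's use a bounded sample window and different tail estimates; the window size and threshold here are illustrative):

import math
import statistics
import time

arrivals: list[float] = []   # timestamps of received heartbeats

def record_heartbeat() -> None:
    arrivals.append(time.monotonic())
    del arrivals[:-1000]   # keep a bounded window of recent samples

def phi() -> float:
    """Current suspicion level for the monitored node."""
    if len(arrivals) < 2:
        return 0.0
    intervals = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean = statistics.fmean(intervals)
    stdev = statistics.pstdev(intervals) or 1e-6   # avoid division by zero
    elapsed = time.monotonic() - arrivals[-1]
    # P_later: probability a heartbeat arrives even later than `elapsed`,
    # under a normal model of inter-arrival times (1 - CDF).
    z = (elapsed - mean) / stdev
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    return -math.log10(max(p_later, 1e-15))   # cap to avoid log(0)

def is_dead(threshold: float = 8.0) -> bool:
    return phi() > threshold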
Kubernetes Health Probes
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10   # Wait after start
      periodSeconds: 5          # Check every 5s
      timeoutSeconds: 3         # Timeout per check
      failureThreshold: 3       # Failures before restart
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 1       # Remove from service immediately
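With these numbers, a hard failure triggers a restart after roughly failureThreshold × periodSeconds = 3 × 5s = 15 seconds of consecutive liveness failures (plus up to timeoutSeconds per attempt), while a single failed readiness probe pulls the pod out of Service endpoints within one 5-second period.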
Configuration Trade-offs
| Parameter | Lower Value | Higher Value |
|---|---|---|
| Check interval | Faster detection, more load | Slower detection, less load |
| Timeout | More false positives from slow-but-healthy nodes | Slower to flag hung nodes |
| Failure threshold | Quick failover | Tolerates transient issues |
Production Recommendations
# Typical production settings
liveness:
  interval: 10s
  timeout: 5s
  failureThreshold: 3   # 30s to restart
readiness:
  interval: 5s
  timeout: 3s
  failureThreshold: 1   # Immediate traffic removal
Anti-Patterns
1. Health Check Does Too Much
# BAD: Health check that takes 30 seconds
@app.route('/health')
def health():
    run_full_database_integrity_check()  # Takes 30s!
    return {'status': 'healthy'}
Health checks should be fast (under 100ms).
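One common fix, sketched below, is to run the expensive check on a background timer and have the endpoint return only the cached result (this reuses the app and run_full_database_integrity_check names from the example above; the 60-second refresh interval is illustrative):

import threading

_health_status = {'status': 'healthy'}   # cached result served by the endpoint

def _refresh_health_status() -> None:
    """Run the expensive check off the request path, then reschedule."""
    try:
        run_full_database_integrity_check()   # still slow, but no longer blocks callers
        _health_status['status'] = 'healthy'
    except Exception as exc:
        _health_status['status'] = f'unhealthy: {exc}'
    threading.Timer(60, _refresh_health_status).start()

_refresh_health_status()

@app.route('/health')
def health():
    # Responds in well under 100ms regardless of how slow the real check is.
    code = 200 if _health_status['status'] == 'healthy' else 503
    return dict(_health_status), code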
2. No Dependency Isolation
# BAD: every dependency, critical or not, gates readiness
@app.route('/health/ready')
def ready():
    check_database()       # Critical
    check_analytics_db()   # Not critical for serving traffic
Only check critical dependencies in readiness.
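A better shape, sketched with the same hypothetical clients as the earlier readiness example (analytics_db is an assumed analytics client): fail readiness only for dependencies the request path cannot live without, and merely report the rest.

@app.route('/health/ready')
def ready():
    # Critical: the service cannot serve traffic without the primary database.
    if not db.is_connected():
        return {'status': 'not ready', 'reason': 'DB unavailable'}, 503
    # Non-critical: surface it for observability, but keep taking traffic.
    analytics_ok = analytics_db.is_connected()
    return {'status': 'ready',
            'analytics': 'ok' if analytics_ok else 'degraded'}, 200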
3. Cascading Failures
If health checks fail under load, the struggling node is removed, its traffic shifts to the remaining nodes, and they fail under the extra load too.
Solution: Circuit breakers, gradual rollout, load shedding.
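As a sketch of the load-shedding option (the concurrency budget and the use of Flask's before_request/teardown_request hooks are illustrative, reusing the app object from the earlier examples): reject excess application traffic early with a 503, but never let pure load flip the health endpoints, so the node is not ejected and the spiral stops.

import threading

from flask import g, request

MAX_IN_FLIGHT = 100          # illustrative concurrency budget
_in_flight = 0
_lock = threading.Lock()

@app.before_request
def shed_load():
    global _in_flight
    # Health endpoints are exempt: load alone should not get the node ejected.
    if request.path.startswith('/health'):
        return None
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Tell callers to back off instead of queueing until they time out.
            return {'error': 'overloaded, retry later'}, 503, {'Retry-After': '1'}
        _in_flight += 1
        g.admitted = True    # this request holds a concurrency slot
    return None

@app.teardown_request
def release_slot(exc=None):
    global _in_flight
    if g.get('admitted'):
        with _lock:
            _in_flight -= 1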
Related Content
Prerequisites:
- Distributed Systems Basics - Foundation concepts
Related Concepts:
- Failover - What happens when health checks fail
- Consensus - Leader election on failure
- Load Balancing - Traffic routing away from unhealthy nodes
Used In Systems:
- Kubernetes (liveness/readiness probes)
- AWS ELB/ALB (target health checks)
- Consul (service health checks)
- Every HA deployment
Next Recommended: Failover - Learn what happens after detecting a failure
Interview Notes
- Comes up in roughly 60% of production-focused interviews
- Relevant to essentially every HA deployment
- Core trade-off to articulate: detection latency vs. false-positive rate
- Message complexity: O(N) per round for centralized polling, O(N²) for all-to-all heartbeating