Heartbeat & Failure Detection
How distributed systems detect when nodes fail using periodic heartbeat signals and timeout-based detection.
How Do You Know a Node is Dead?
In distributed systems, you can't tell from the outside whether an unresponsive node is dead, just slow, or unreachable due to a network partition. All three scenarios look the same: no response.
This uncertainty is fundamental — you cannot reliably distinguish "dead" from "very slow" in an asynchronous system.
- No response could mean crash, overload, or network issue
- Direct pings don't solve the problem (they can time out too)
- Systems need a protocol to make decisions despite uncertainty
Periodic "I'm Alive" Signals
The heartbeat pattern solves this with periodic signals. Each node sends a small "I'm alive" message at regular intervals (e.g., every second). If a node stops sending heartbeats, something is probably wrong.
This is push-based detection — nodes announce themselves rather than waiting to be polled.
- Heartbeat period: How often to send (e.g., 1 second)
- Lightweight messages: Just a timestamp or sequence number
- Alternative: Pull-based polling (monitor checks nodes)
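As a rough illustration of the push-based sender described above, here is a minimal Python sketch. The class name `HeartbeatSender`, the JSON payload fields, and the `send` callback (standing in for a real transport such as a UDP socket) are all assumptions made for this example, not part of any particular system. The clock is passed in explicitly so the behavior can be followed without real timers.

```python
import json
from typing import Callable, List


class HeartbeatSender:
    """Push-based sender: the node announces itself on a fixed period."""

    def __init__(self, node_id: str, send: Callable[[bytes], None],
                 interval: float = 1.0):
        self.node_id = node_id
        self.send = send            # hypothetical transport hook
        self.interval = interval    # heartbeat period in seconds
        self.seq = 0                # sequence number of the last heartbeat
        self.next_due = 0.0         # earliest time the next heartbeat may go out

    def tick(self, now: float) -> bool:
        """Send a heartbeat if one is due at time `now`; return True if sent."""
        if now < self.next_due:
            return False
        self.seq += 1
        # Lightweight message: just an identity, a sequence number and a timestamp.
        payload = {"node": self.node_id, "seq": self.seq, "ts": now}
        self.send(json.dumps(payload).encode("utf-8"))
        self.next_due = now + self.interval
        return True


# Usage: drive the sender with a simulated clock and collect what it emits.
sent: List[bytes] = []
sender = HeartbeatSender(node_id="node-a", send=sent.append, interval=1.0)
for now in (0.0, 0.4, 1.0, 1.5, 2.0):
    sender.tick(now)
```

With a 1-second interval, the calls at 0.4 and 1.5 send nothing, so three heartbeats go out in total. In a real node, `tick` would typically be driven by a timer or background loop.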
No Heartbeat? Start the Clock
When a heartbeat fails to arrive on schedule, the monitor doesn't immediately declare the node dead. Instead, it measures how long the node has been silent since its last heartbeat. If that silence exceeds the timeout period (typically 2-3× the heartbeat interval), the node is suspected.
This grace period handles temporary network delays and brief pauses.
- Timeout: Multiple heartbeat periods (for tolerance)
- Common setting: timeout = 3× heartbeat interval
- Monitor tracks "last seen" timestamp per node
- Suspected state before confirmed failure
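The bookkeeping above can be sketched in a few lines of Python. This is one possible shape under stated assumptions: the names `HeartbeatMonitor`, `record_heartbeat`, and `suspected` are invented for the example, and timestamps are passed in rather than read from a system clock.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks a "last seen" timestamp per node and reports suspects."""

    def __init__(self, heartbeat_interval: float = 1.0, multiplier: float = 3.0):
        # Timeout is a multiple of the heartbeat period, for tolerance.
        self.timeout = heartbeat_interval * multiplier
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node_id: str, now: float) -> None:
        """Every heartbeat resets the node's silence clock."""
        self.last_seen[node_id] = now

    def suspected(self, now: float) -> List[str]:
        """Nodes whose silence has exceeded the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Usage: node-a keeps reporting, node-b goes silent after t=0.
monitor = HeartbeatMonitor(heartbeat_interval=1.0, multiplier=3.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=2.5)
early = monitor.suspected(now=3.0)  # node-b silent for exactly 3.0 s: not over the timeout yet
late = monitor.suspected(now=3.5)   # node-b silent for 3.5 s: now suspected
```

Here `early` is empty and `late` contains only `node-b`. A node is only suspected at this stage; what happens next is a separate decision.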
Timeout Exceeded: Node Marked Dead
Once the timeout expires with no heartbeat, the monitor marks the node as dead. This triggers downstream actions: removing the node from load balancer rotation, initiating failover, or starting leader election.
The node may actually be alive but unreachable — the system must proceed anyway.
- Dead node removed from active set
- Triggers failover or rebalancing
- May be a "false positive" (node alive but partitioned)
- System continues without waiting indefinitely
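The transition to "dead" and the downstream trigger might look like the following Python sketch. Everything named here is hypothetical: `FailureDetector`, the `on_dead` callback (a stand-in for failover, rebalancing, or leader election), and in particular the policy of re-admitting a node when a late heartbeat arrives, which is one possible choice rather than a rule.

```python
from typing import Callable, Dict, List, Set


class FailureDetector:
    """Marks silent nodes dead and notifies a downstream hook."""

    def __init__(self, timeout: float, on_dead: Callable[[str], None]):
        self.timeout = timeout
        self.on_dead = on_dead      # hypothetical hook: failover, rebalancing, ...
        self.last_seen: Dict[str, float] = {}
        self.active: Set[str] = set()

    def record_heartbeat(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now
        # Assumption: a heartbeat from a node previously marked dead re-admits it.
        self.active.add(node_id)

    def sweep(self, now: float) -> None:
        """Remove every active node whose timeout has expired."""
        for node_id in sorted(self.active):  # sorted() copies, so removal is safe
            if now - self.last_seen[node_id] > self.timeout:
                self.active.discard(node_id)
                self.on_dead(node_id)


# Usage: node-b falls silent, is marked dead, then turns out to be alive.
failed: List[str] = []
detector = FailureDetector(timeout=3.0, on_dead=failed.append)
detector.record_heartbeat("node-a", now=0.0)
detector.record_heartbeat("node-b", now=0.0)
detector.record_heartbeat("node-a", now=3.0)
detector.sweep(now=4.0)                        # node-b silent for 4.0 s: marked dead
active_after_sweep = set(detector.active)
detector.sweep(now=5.0)                        # node-b already removed: no second callback
detector.record_heartbeat("node-b", now=5.5)   # false positive: it was only partitioned
```

The callback fires once for `node-b`, and the system moves on without it. The late heartbeat at 5.5 shows the false-positive case: the node was alive all along, so any failover that already ran has to cope with it coming back.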
Fast Detection vs False Positives
Heartbeat configuration is a balance:
- Short timeout: Detect failures quickly, but network hiccups cause false positives (healthy nodes marked dead)
- Long timeout: Fewer false alarms, but slow to detect real failures (traffic continues to dead nodes)
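To make the trade-off concrete, the short Python sketch below counts how many gaps between heartbeats from a healthy node would have tripped each candidate timeout. The gap values are made up for illustration, and the helper name `false_positive_counts` is an assumption of this example.

```python
from typing import Dict, List


def false_positive_counts(gaps: List[float],
                          timeouts: List[float]) -> Dict[float, int]:
    """For each timeout, count gaps from a healthy node that would exceed it."""
    return {t: sum(1 for g in gaps if g > t) for t in timeouts}


# Hypothetical gaps (seconds) between heartbeats from a node that never failed;
# the 2.4 s and 3.6 s entries stand in for network hiccups.
healthy_gaps = [1.0, 1.1, 0.9, 2.4, 1.0, 3.6, 1.2, 1.0]
counts = false_positive_counts(healthy_gaps, timeouts=[2.0, 3.0, 5.0])
```

On this sample, a 2-second timeout produces two false alarms, a 3-second timeout produces one, and a 5-second timeout produces none. The price of the 5-second setting is that a node that really crashed keeps receiving traffic for at least 5 seconds before anyone notices.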
There's no perfect setting: tune based on your tolerance for each type of error. Advanced systems use adaptive algorithms (like the Phi accrual detector) that compute a continuous suspicion score from the observed history of heartbeat inter-arrival times instead of issuing a binary alive/dead verdict.
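A simplified sketch of the accrual idea follows, in Python. It assumes heartbeat inter-arrival times follow an exponential distribution, which keeps the arithmetic short; the original Phi accrual formulation fits a normal distribution instead, so treat this as an approximation of the technique rather than a faithful implementation. The class and method names are invented for the example.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    """Suspicion score for one node, from recent heartbeat inter-arrival times."""

    def __init__(self, window_size: int = 100):
        self.intervals: Deque[float] = deque(maxlen=window_size)  # sliding window
        self.last_arrival: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Higher phi means a heartbeat this late is less plausible."""
        if self.last_arrival is None or not self.intervals:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0.0:
            return 0.0  # degenerate history; avoid dividing by zero
        elapsed = now - self.last_arrival
        # Exponential model (assumption): P(gap > elapsed) = exp(-elapsed / mean)
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))


# Usage: steady 1-second heartbeats, then silence.
detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.record_heartbeat(t)
phi_soon = detector.phi(now=3.5)   # half an interval late: low suspicion
phi_late = detector.phi(now=13.0)  # ten intervals late: high suspicion
```

The score grows with the length of the silence relative to what this node's history makes normal, so a node on a jittery link is judged more leniently than one on a stable link. The caller picks a threshold on phi to act on, which replaces the fixed timeout with a tunable confidence level.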