Heartbeat & Failure Detection

Read this as: How quickly can the system suspect failure without flapping?

Failure Trap: A missing heartbeat is suspicion, not proof; low timeouts create false failures.
Decision Rule: Tune heartbeats from observed latency, add grace windows, and route recovery through explicit state transitions.

How Do You Know a Node Is Dead?

In distributed systems, you can't tell if a node is dead, just slow, or unreachable due to a network partition. All three scenarios look the same: no response.

This uncertainty is fundamental: in an asynchronous system, where message delays have no upper bound, you cannot reliably distinguish "dead" from "very slow".

No response could mean crash, overload, or network issue
Direct pings don't solve the problem (they can time out too)
Systems need a protocol to make decisions despite uncertainty

Periodic "I'm Alive" Signals

The heartbeat pattern solves this with periodic signals. Each node sends a small "I'm alive" message at regular intervals (e.g., every second). If a node stops sending heartbeats, something is probably wrong.

This is push-based detection — nodes announce themselves rather than waiting to be polled.

Heartbeat period: How often to send (e.g., 1 second)
Lightweight messages: Just a timestamp or sequence number
Alternative: Pull-based polling (monitor checks nodes)
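
As a rough illustration of the push side, the sketch below builds small heartbeat messages carrying a sequence number and a timestamp. The names Heartbeat, HeartbeatSender, and run, and the send callable that stands in for the network, are hypothetical and chosen only for this example.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    node_id: str
    seq: int        # increases by one per heartbeat from this node
    sent_at: float  # sender's clock reading, in seconds


class HeartbeatSender:
    """Builds the small "I'm alive" messages a node pushes to its monitor."""

    def __init__(self, node_id, clock=time.monotonic):
        self.node_id = node_id
        self._clock = clock
        self._seq = itertools.count(1)

    def next_heartbeat(self):
        return Heartbeat(self.node_id, next(self._seq), self._clock())


def run(sender, send, interval_s=1.0, beats=3, sleep=time.sleep):
    """Push `beats` heartbeats through `send`, pausing `interval_s` after each.

    `send` is a placeholder for whatever transport is used (UDP, RPC, ...).
    """
    for _ in range(beats):
        send(sender.next_heartbeat())
        sleep(interval_s)
```

Calling run(HeartbeatSender("node-a"), print) would print three heartbeats roughly one second apart. Injecting clock and sleep keeps the sketch testable without real waiting.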

No Heartbeat? Start the Clock

A single missed heartbeat does not make the monitor declare the node dead. The monitor measures the time since the last heartbeat it received from that node, and if nothing arrives within the timeout period (typically 2-3× the heartbeat interval), the node is marked as suspected.

This grace period handles temporary network delays and brief pauses.

Timeout: Multiple heartbeat periods (for tolerance)
Common setting: timeout = 3× heartbeat interval
Monitor tracks "last seen" timestamp per node
Suspected state before confirmed failure
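
The bookkeeping above can be sketched as follows. This is a minimal example, assuming a 1-second heartbeat interval and the 3× multiplier mentioned in the list; HeartbeatMonitor and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Tracks a "last seen" timestamp per node and reports suspects."""

    def __init__(self, heartbeat_interval_s=1.0, timeout_multiplier=3):
        # Timeout spans several heartbeat periods to tolerate brief delays.
        self.timeout_s = heartbeat_interval_s * timeout_multiplier
        self.last_seen = {}

    def record_heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def suspected(self, now):
        # A node is suspected once its silence exceeds the timeout.
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(heartbeat_interval_s=1.0, timeout_multiplier=3)
monitor.record_heartbeat("a", now=0.0)
monitor.record_heartbeat("b", now=0.0)
monitor.record_heartbeat("b", now=2.5)  # "b" keeps reporting, "a" goes quiet
print(monitor.suspected(now=3.5))
```

At time 3.5, node "a" has been silent for 3.5 seconds and is reported, while "b" was seen 1 second ago and is not.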

Timeout Exceeded: Node Marked Dead

If a suspected node stays silent until the timeout (plus any grace window) has fully expired, the monitor marks it as dead. This triggers downstream actions: removing the node from load balancer rotation, initiating failover, or starting leader election.

The node may actually be alive but unreachable. The system must proceed anyway, because waiting for certainty would mean waiting indefinitely.

Dead node removed from active set
Triggers failover or rebalancing
May be a "false positive" (node alive but partitioned)
System continues without waiting indefinitely
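
One way to route this through explicit state transitions is sketched below. It assumes two thresholds (suspect after 3 seconds of silence, dead after 6), which is an illustrative choice and not something the text prescribes; FailureDetector, State, and the on_dead callback are hypothetical names.

```python
from enum import Enum


class State(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"


class FailureDetector:
    def __init__(self, suspect_after_s=3.0, dead_after_s=6.0, on_dead=None):
        self.suspect_after_s = suspect_after_s
        self.dead_after_s = dead_after_s
        self.on_dead = on_dead or (lambda node_id: None)
        self.last_seen = {}
        self.state = {}
        self.active = set()  # nodes currently eligible for traffic

    def heartbeat(self, node_id, now):
        # Any heartbeat returns the node to ALIVE, even after a false positive.
        self.last_seen[node_id] = now
        self.state[node_id] = State.ALIVE
        self.active.add(node_id)

    def tick(self, now):
        for node_id, seen in self.last_seen.items():
            silence = now - seen
            current = self.state[node_id]
            if silence > self.dead_after_s and current is not State.DEAD:
                self.state[node_id] = State.DEAD
                self.active.discard(node_id)
                self.on_dead(node_id)  # e.g. start failover or rebalancing
            elif silence > self.suspect_after_s and current is State.ALIVE:
                self.state[node_id] = State.SUSPECTED
```

The on_dead callback fires once per transition into DEAD, so failover is not triggered repeatedly on every tick. A node that was only partitioned rejoins the active set as soon as its next heartbeat arrives.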

Fast Detection vs False Positives

Heartbeat configuration is a balance:

Short timeout: Detect failures quickly, but network hiccups cause false positives (healthy nodes marked dead)
Long timeout: Fewer false alarms, but slow to detect real failures (traffic continues to dead nodes)
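
To make the trade-off concrete, the sketch below replays a list of heartbeat gaps recorded from a healthy node against several candidate timeouts. The gap values are made up for illustration, and evaluate_timeouts is a hypothetical helper. The worst-case detection figure is measured from the last heartbeat received, so it is simply the timeout itself.

```python
def evaluate_timeouts(healthy_gaps_s, candidate_timeouts_s):
    """Count how many healthy gaps each timeout would misread as a failure."""
    results = {}
    for timeout in candidate_timeouts_s:
        false_positives = sum(1 for gap in healthy_gaps_s if gap > timeout)
        results[timeout] = {
            "false_positives": false_positives,
            # A node that dies right after a heartbeat is noticed this late.
            "worst_case_detection_s": timeout,
        }
    return results


# Hypothetical gaps (seconds) between heartbeats from a node that never failed.
gaps = [1.0, 1.1, 0.9, 2.4, 1.0, 3.2]
for timeout, stats in evaluate_timeouts(gaps, [2.0, 3.0, 5.0]).items():
    print(timeout, stats)
```

With these sample gaps, a 2-second timeout misfires twice, a 3-second timeout once, and a 5-second timeout never, at the cost of slower detection.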

There's no perfect setting; tune based on your tolerance for each type of error. Advanced systems use adaptive detectors (such as the Phi accrual failure detector) that compute a continuous suspicion score from the recent history of heartbeat inter-arrival times instead of issuing a binary alive/dead verdict.
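
The idea of a suspicion score can be sketched in a simplified form. The example below is not the published Phi accrual algorithm; it assumes heartbeat gaps follow an exponential distribution with the observed mean, which gives phi = elapsed / (mean × ln 10). SimplePhiDetector and the window size are hypothetical.

```python
import math
from collections import deque


class SimplePhiDetector:
    """Suspicion score from recent heartbeat inter-arrival times (simplified)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # sliding window of recent gaps
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        if not self.intervals:
            return 0.0  # not enough history to judge yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Assumed model: P(gap > elapsed) = exp(-elapsed / mean).
        # phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = SimplePhiDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
print(detector.phi(4.0), detector.phi(8.0))
```

The score grows the longer the silence lasts relative to the node's usual rhythm, so the caller picks a threshold (for example, act when phi exceeds some chosen value) instead of a fixed timeout.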