Heartbeat & Failure Detection

How distributed systems detect when nodes fail using periodic heartbeat signals and timeout-based detection.

Read this as: "How quickly can the system suspect failure without flapping?"
Failure Trap
A missing heartbeat is suspicion, not proof; low timeouts create false failures.
Decision Rule
Tune heartbeats from observed latency, add grace windows, and route recovery through explicit state transitions.
Figure: Heartbeat and failure detection in a distributed system. Five steps: a monitor cannot tell a dead node from a slow one; nodes send periodic "I'm alive" heartbeats; a missed heartbeat starts a timeout clock; when the timeout is exceeded the node is marked dead and failover begins; and choosing the timeout trades fast detection against false positives.

How Do You Know a Node is Dead?

In a distributed system, silence alone cannot tell you whether a node is dead, merely slow, or unreachable because of a network partition. All three scenarios look the same: no response.

This uncertainty is fundamental — you cannot reliably distinguish "dead" from "very slow" in an asynchronous system.

  • No response could mean crash, overload, or network issue
  • Direct pings don't solve the problem (they can time out too)
  • Systems need a protocol to make decisions despite uncertainty

Periodic "I'm Alive" Signals

The heartbeat pattern solves this with periodic signals. Each node sends a small "I'm alive" message at regular intervals (e.g., every second). If a node stops sending heartbeats, something is probably wrong.

This is push-based detection — nodes announce themselves rather than waiting to be polled.

  • Heartbeat period: How often to send (e.g., 1 second)
  • Lightweight messages: Just a timestamp or sequence number
  • Alternative: Pull-based polling (monitor checks nodes)
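
As a rough illustration of the push-based pattern above, the sketch below builds a lightweight heartbeat (node id, sequence number, timestamp) and sends it at a fixed period. The names make_heartbeat and send_heartbeats, the JSON encoding, and the injectable send, clock, and sleep parameters are assumptions made for this example, not part of any particular system.

```python
import json
import time
from typing import Callable


def make_heartbeat(node_id: str, seq: int, now: float) -> bytes:
    """Encode a small "I'm alive" message: who, which beat, and when."""
    return json.dumps({"node": node_id, "seq": seq, "ts": now}).encode("utf-8")


def send_heartbeats(
    node_id: str,
    send: Callable[[bytes], None],   # hypothetical transport, e.g. a UDP send
    count: int,
    interval_s: float = 1.0,         # heartbeat period from the text (1 second)
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Push `count` heartbeats, one every `interval_s` seconds."""
    for seq in range(count):
        send(make_heartbeat(node_id, seq, clock()))
        if seq < count - 1:
            sleep(interval_s)
```

Passing send, clock, and sleep as parameters keeps the loop easy to test without a network or real delays; a real node would likely run this loop indefinitely in a background thread or task.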

No Heartbeat? Start the Clock

When a heartbeat is missed, the monitor doesn't immediately declare the node dead. It starts a timeout counter. If no heartbeat arrives within the timeout period (typically 2-3× the heartbeat interval), the node is suspected.

This grace period handles temporary network delays and brief pauses.

  • Timeout: Multiple heartbeat periods (for tolerance)
  • Common setting: timeout = 3× heartbeat interval
  • Monitor tracks "last seen" timestamp per node
  • Suspected state before confirmed failure
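
The bookkeeping described above can be sketched as a small monitor that keeps a "last seen" timestamp per node and flags any node that has been silent for longer than the timeout. The class name HeartbeatMonitor and its methods are hypothetical; timestamps are passed in explicitly so the example is deterministic, whereas a real monitor would more likely read time.monotonic().

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports silent nodes."""

    def __init__(self, heartbeat_interval_s: float = 1.0,
                 timeout_multiplier: float = 3.0) -> None:
        # Common setting from the text: timeout = 3 x heartbeat interval.
        self.timeout_s = heartbeat_interval_s * timeout_multiplier
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node_id: str, now: float) -> None:
        """Update the "last seen" timestamp when a heartbeat arrives."""
        self.last_seen[node_id] = now

    def suspected(self, now: float) -> List[str]:
        """Return nodes whose silence exceeds the timeout (suspected, not proven dead)."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(heartbeat_interval_s=1.0, timeout_multiplier=3.0)
monitor.record_heartbeat("N1", now=0.0)
monitor.record_heartbeat("N3", now=0.0)
monitor.record_heartbeat("N1", now=3.0)   # N1 keeps beating, N3 goes quiet
print(monitor.suspected(now=3.5))          # N3 has been silent for 3.5s > 3s
```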

Timeout Exceeded: Node Marked Dead

Once the timeout (plus any grace window) expires with no heartbeat, the monitor moves the node from suspected to dead. This triggers downstream actions: removing the node from load balancer rotation, initiating failover, or starting leader election.

The node may actually be alive but unreachable — the system must proceed anyway.

  • Dead node removed from active set
  • Triggers failover or rebalancing
  • May be a "false positive" (node alive but partitioned)
  • System continues without waiting indefinitely
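
One way to make these transitions explicit is a small state machine: alive, then suspected after the timeout, then dead after an additional grace window, with a callback that starts failover. This is only a sketch under assumptions: the names NodeState, FailureDetector, grace_s, and on_dead are invented for illustration, and the separate grace window is one possible reading of "suspected state before confirmed failure".

```python
from enum import Enum
from typing import Callable, Dict, Optional


class NodeState(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"


class FailureDetector:
    def __init__(self, timeout_s: float = 3.0, grace_s: float = 2.0,
                 on_dead: Optional[Callable[[str], None]] = None) -> None:
        self.timeout_s = timeout_s
        self.grace_s = grace_s            # assumed extra window before declaring death
        self.on_dead = on_dead            # hypothetical hook, e.g. start failover
        self.last_seen: Dict[str, float] = {}
        self.state: Dict[str, NodeState] = {}

    def heartbeat(self, node_id: str, now: float) -> None:
        # Any heartbeat is an explicit transition back to ALIVE, which also
        # covers a false positive where the node was only partitioned.
        self.last_seen[node_id] = now
        self.state[node_id] = NodeState.ALIVE

    def tick(self, now: float) -> None:
        """Periodically re-evaluate every node's state from its silence."""
        for node_id, seen in list(self.last_seen.items()):
            silent = now - seen
            if self.state[node_id] is NodeState.ALIVE and silent > self.timeout_s:
                self.state[node_id] = NodeState.SUSPECTED
            if (self.state[node_id] is NodeState.SUSPECTED
                    and silent > self.timeout_s + self.grace_s):
                self.state[node_id] = NodeState.DEAD
                if self.on_dead is not None:
                    self.on_dead(node_id)  # fires once per death, not on every tick
```

Because a dead node returns to alive only through heartbeat(), the system never waits indefinitely, yet a wrongly evicted node can rejoin. A production system might additionally require a re-admission step before routing traffic to it again.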

Fast Detection vs False Positives

Heartbeat configuration is a balance:

  • Short timeout: Detect failures quickly, but network hiccups cause false positives (healthy nodes marked dead)
  • Long timeout: Fewer false alarms, but slow to detect real failures (traffic continues to dead nodes)

There's no perfect setting; tune based on your tolerance for each type of error. Advanced systems use adaptive algorithms (like the Phi accrual detector) that compute a continuous suspicion score from the history of heartbeat inter-arrival times instead of a binary alive/dead verdict.
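
To give a feel for the accrual idea, here is a deliberately simplified sketch. It assumes heartbeat inter-arrival times are exponentially distributed, which reduces the suspicion score to phi = elapsed / (mean_interval × ln 10); the original Phi accrual formulation fits a normal distribution instead. The class name PhiAccrualDetector and the window size are illustrative choices.

```python
import math
from collections import deque
from typing import Deque, Optional


class PhiAccrualDetector:
    """Simplified accrual detector for one node (exponential approximation)."""

    def __init__(self, window_size: int = 100) -> None:
        # Sliding window of recent gaps between consecutive heartbeats.
        self.intervals: Deque[float] = deque(maxlen=window_size)
        self.last_arrival: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion score: grows the longer the node stays silent."""
        if not self.intervals or self.last_arrival is None:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # P(gap > elapsed) = exp(-elapsed / mean); phi = -log10 of that probability.
        return elapsed / (mean * math.log(10))
```

Instead of a fixed timeout, the caller compares phi against a threshold of its choosing: a higher threshold means fewer false positives but slower detection, and the score adapts automatically when the observed heartbeat spacing changes.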