~100s Visual Explainer

Heartbeat & Failure Detection

How distributed systems detect when nodes fail using periodic heartbeat signals and timeout-based detection.

[Interactive animation: a monitor receives per-second heartbeats from nodes N1–N3, starts a timeout clock when N3 goes silent, marks it dead and initiates failover, then contrasts a short timeout (3 s: fast detection, more false positives) with a long one (6 s: fewer false alarms, slower detection).]

How Do You Know a Node is Dead?

In a distributed system, you cannot tell whether a node has crashed, is merely slow, or is unreachable behind a network partition. All three scenarios look identical from the outside: no response.

This uncertainty is fundamental — you cannot reliably distinguish "dead" from "very slow" in an asynchronous system.

  • No response could mean crash, overload, or network issue
  • Direct pings don't solve the problem (they can timeout too)
  • Systems need a protocol to make decisions despite uncertainty

Periodic "I'm Alive" Signals

The heartbeat pattern solves this with periodic signals. Each node sends a small "I'm alive" message at regular intervals (e.g., every second). If a node stops sending heartbeats, something is probably wrong.

This is push-based detection — nodes announce themselves rather than waiting to be polled.

  • Heartbeat period: How often to send (e.g., 1 second)
  • Lightweight messages: Just a timestamp or sequence number
  • Alternative: Pull-based polling (monitor checks nodes)
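As a sketch, a push-based heartbeat sender needs little more than a node id, a sequence counter, and a timer. The payload shape and the `send` callback here are illustrative assumptions, not a standard wire format:

```python
import itertools
import json
import time

HEARTBEAT_INTERVAL = 1.0  # seconds; tune per system


def make_heartbeat(node_id, seq, now=None):
    """Build a lightweight 'I'm alive' payload: just identity,
    a sequence number, and a timestamp."""
    return json.dumps({
        "node": node_id,
        "seq": seq,
        "ts": now if now is not None else time.time(),
    })


def heartbeat_loop(node_id, send, interval=HEARTBEAT_INTERVAL):
    """Push-based detection: the node announces itself at a fixed
    period rather than waiting to be polled."""
    for seq in itertools.count():
        send(make_heartbeat(node_id, seq))
        time.sleep(interval)
```

The monotonically increasing sequence number lets a monitor detect reordered or dropped heartbeats, not just silence.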

No Heartbeat? Start the Clock

When a heartbeat is missed, the monitor doesn't immediately declare the node dead. It starts a timeout counter. If no heartbeat arrives within the timeout period (typically 2-3× the heartbeat interval), the node is suspected.

This grace period handles temporary network delays and brief pauses.

  • Timeout: Multiple heartbeat periods (for tolerance)
  • Common setting: timeout = 3× heartbeat interval
  • Monitor tracks "last seen" timestamp per node
  • Suspected state before confirmed failure
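A minimal monitor can be sketched as a map from node id to "last seen" timestamp, with a timeout of three heartbeat intervals. Class and method names here are illustrative:

```python
import time

HEARTBEAT_INTERVAL = 1.0
TIMEOUT = 3 * HEARTBEAT_INTERVAL  # common rule of thumb: 2-3x the interval


class HeartbeatMonitor:
    """Track a 'last seen' timestamp per node; suspect nodes that go quiet."""

    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}

    def record(self, node_id, now=None):
        """Called whenever a heartbeat arrives from node_id."""
        self.last_seen[node_id] = now if now is not None else time.time()

    def suspected(self, now=None):
        """Nodes silent for longer than the timeout (not yet confirmed dead)."""
        now = now if now is not None else time.time()
        return [node for node, seen in self.last_seen.items()
                if now - seen > self.timeout]
```

Note that a missed heartbeat only moves a node into the suspected list; the grace period absorbs ordinary network jitter.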

Timeout Exceeded: Node Marked Dead

Once the timeout expires with no heartbeat, the monitor marks the node as dead. This triggers downstream actions: removing the node from load balancer rotation, initiating failover, or starting leader election.

The node may actually be alive but unreachable — the system must proceed anyway.

  • Dead node removed from active set
  • Triggers failover or rebalancing
  • May be a "false positive" (node alive but partitioned)
  • System continues without waiting indefinitely
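As a self-contained sketch, the downstream reaction can be a small routine that prunes timed-out nodes from the active set and invokes a failover callback. All names here are illustrative:

```python
def mark_dead_and_failover(last_seen, active, timeout, now, on_failover):
    """Remove nodes whose timeout has expired from the active set and
    trigger failover for each. `last_seen` maps node id -> timestamp of
    the most recent heartbeat; `on_failover` might remove the node from
    load balancer rotation or start leader election."""
    for node, seen in list(last_seen.items()):
        if node in active and now - seen > timeout:
            active.discard(node)     # stop routing traffic to it
            on_failover(node)        # may be a false positive; proceed anyway
    return active
```

The key design point: the routine commits to a decision at timeout expiry rather than waiting indefinitely, accepting that a partitioned-but-alive node may be evicted.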

Fast Detection vs False Positives

Heartbeat configuration is a trade-off:

  • Short timeout: Detect failures quickly, but network hiccups cause false positives (healthy nodes marked dead)
  • Long timeout: Fewer false alarms, but slow to detect real failures (traffic continues to dead nodes)

There's no perfect setting — tune based on your tolerance for each type of error. Advanced systems use adaptive algorithms (like the Phi accrual detector) that compute a suspicion score from arrival latencies instead of binary alive/dead.
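A rough sketch of the phi accrual idea: model recent inter-arrival times as approximately normal, then report a suspicion score that grows as the current silence becomes statistically unlikely. This is a simplified illustration of the concept, not the full algorithm from the literature:

```python
import math


class PhiAccrualDetector:
    """Suspicion score from heartbeat inter-arrival times (sketch).

    phi is roughly -log10 of the probability that a heartbeat would
    still arrive this late, under a normal model of past intervals.
    Higher phi = more suspicious; callers pick a threshold instead of
    a binary alive/dead timeout."""

    def __init__(self, window=100):
        self.window = window
        self.intervals = []      # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def phi(self, now):
        """Current suspicion score; 0 means no evidence of failure."""
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)  # floor to avoid division by ~0
        elapsed = now - self.last_arrival
        # Tail probability that the next heartbeat arrives later than
        # `elapsed`, assuming normally distributed intervals.
        z = (elapsed - mean) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p_later, 1e-12))
```

With heartbeats arriving steadily every second, phi stays near zero just after an arrival and climbs rapidly once the silence exceeds anything seen before, which is what lets the detector adapt to each node's observed latency distribution.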

What's Next?

Now that you understand failure detection, explore how it enables higher-level patterns: Raft Consensus uses heartbeats for leader election, Failover explains what happens after detection, and Health Checks covers the application-level perspective.