Heartbeat | Concepts

TL;DR

A heartbeat is a periodic “I’m alive” message sent by nodes in a distributed system. If heartbeats stop arriving, the sender is presumed failed. Heartbeats are the foundation of failure detection, enabling leader election, cluster membership, and automatic failover. The core trade-off is detection speed vs false positive rate—shorter timeouts detect failures faster but trigger more false alarms.

Visual Overview

Heartbeat Protocol

Worker Node ──♥──♥──♥──♥──♥──► Monitor
            1s  1s  1s  1s  1s

♥ = heartbeat ("I'm alive")
Heartbeat interval: 1s
Timeout threshold: 3s (3× interval)

Monitor tracks "last seen" per node

FAILURE TIMELINE
├──♥──┼──♥──┼──♥──┼──?──┼──?──┼ TIMEOUT
0s    1s    2s    3s    4s    5s
           ↑
    Last heartbeat

After 3s with no heartbeat:
└─ Node marked SUSPECTED
└─ After confirmation → marked DEAD
└─ Trigger failover, remove from cluster
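
The timeline above can be read as a small state machine. The sketch below is one way to express it, assuming a hypothetical `confirm_after` grace period to stand in for the "confirmation" step (the diagram does not say how confirmation happens; real systems may ask a second observer instead).

```python
from enum import Enum


class Status(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"


def classify(silence: float, timeout: float = 3.0, confirm_after: float = 2.0) -> Status:
    """Map seconds of silence since the last heartbeat to a node status.

    `confirm_after` is an assumed extra grace period between SUSPECTED and DEAD.
    """
    if silence <= timeout:
        return Status.ALIVE
    if silence <= timeout + confirm_after:
        return Status.SUSPECTED
    return Status.DEAD


for silence in (1.0, 3.5, 6.0):
    print(f"{silence:.1f}s of silence -> {classify(silence).value}")
```

Keeping SUSPECTED separate from DEAD gives a late heartbeat a chance to cancel the failover before it starts.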

Core Explanation

What is a Heartbeat?

Real-World Analogy: Think of a scuba diving buddy system. You and your partner agree to check in every 30 seconds with an “OK” hand signal. If your partner doesn’t respond for 2 minutes, you assume something’s wrong and start emergency procedures. That hand signal is a heartbeat—a simple, periodic “I’m fine” message.

The same principle works in distributed systems: nodes periodically announce they’re alive. Silence means trouble.

How Heartbeats Work

Heartbeat Mechanism

SENDER (Worker Node)

while running:
    send_heartbeat(monitor)
    sleep(interval)

Message:
{ "node_id": "worker-1",
  "timestamp": 1704067200,
  "status": "healthy",
  "metadata": { "load": 0.5 }
}


RECEIVER (Monitor)

# Track last seen for each node, using the monitor's own clock
last_seen = {}

on heartbeat_received(node_id, ts):
    last_seen[node_id] = now    # receive time; the sender's ts may be skewed

every check_interval:
    for node_id, seen_at in last_seen:
        if now - seen_at > timeout:
            mark_suspected(node_id)
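
The receiver logic above can be made runnable in a few lines. This is a minimal sketch, not a production monitor: the `Monitor` class and its injectable `clock` parameter are assumptions added so the example can simulate time instead of sleeping.

```python
import time
from typing import Callable, Dict, List


class Monitor:
    """Receiver side: remembers when each node was last heard from."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the example runs without sleeping
        self.last_seen: Dict[str, float] = {}

    def on_heartbeat(self, node_id: str) -> None:
        # Stamp with the monitor's own clock; the sender's clock may be skewed.
        self.last_seen[node_id] = self.clock()

    def suspected(self) -> List[str]:
        now = self.clock()
        return sorted(n for n, seen in self.last_seen.items() if now - seen > self.timeout)


# Simulated run: worker-2 stops sending after t=2.
now = 0.0
monitor = Monitor(timeout=3.0, clock=lambda: now)
for tick in range(7):                 # t = 0, 1, ..., 6 seconds
    now = float(tick)
    monitor.on_heartbeat("worker-1")  # healthy: sends every second
    if tick <= 2:
        monitor.on_heartbeat("worker-2")
    print(f"t={tick}s suspected={monitor.suspected()}")
```

With a 3s timeout and a last heartbeat at t=2, worker-2 only becomes suspected once more than 3 seconds of silence have passed, which in this simulation is the check at t=6.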

Push vs Pull: Heartbeat vs Health Check

Heartbeat vs Health Check

HEARTBEAT (PUSH-BASED)
┌────────────────────────────────────────────────────┐
│                                                    │
│  Worker ────♥────♥────♥────► Monitor               │
│         (push)                                     │
│                                                    │
│  ✓ Node proactively sends                          │
│  ✓ Lower latency detection                         │
│  ✓ Works when monitor can't reach worker           │
│  ✕ Requires cooperative node (must send)           │
│  ✕ Can't verify deeper health                      │
│                                                    │
└────────────────────────────────────────────────────┘

HEALTH CHECK (PULL-BASED)
┌────────────────────────────────────────────────────┐
│                                                    │
│  Monitor ────?────?────?────► Worker               │
│          (poll)    ◄────OK────                     │
│                                                    │
│  ✓ Works with uncooperative services               │
│  ✓ Can check deeper health (DB, disk, app)         │
│  ✓ Centralized control                             │
│  ✕ Higher detection latency                        │
│  ✕ Requires network path to worker                 │
│                                                    │
└────────────────────────────────────────────────────┘

COMMON PATTERN: COMBINE BOTH
┌────────────────────────────────────────────────────┐
│                                                    │
│  1. Heartbeat for fast liveness detection          │
│  2. Health check for deeper status verification    │
│                                                    │
│  Example: Kubernetes                               │
│  - Kubelet sends heartbeats to API server          │
│  - Kubelet also polls container health endpoints   │
│                                                    │
└────────────────────────────────────────────────────┘
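
One way to combine the two signals is sketched below. The function name, the state labels, and the `probe` callable are all hypothetical; `probe` stands in for whatever pull-based check applies (an HTTP request to a health endpoint, a database ping).

```python
from typing import Callable


def node_state(
    seconds_since_heartbeat: float,
    timeout: float,
    probe: Callable[[], bool],
) -> str:
    """Combine push-based liveness with a pull-based deep check.

    `probe` is a hypothetical callable that returns True when the service
    can actually do work.
    """
    if seconds_since_heartbeat > timeout:
        return "suspected"        # no recent heartbeat: skip the probe entirely
    try:
        healthy = probe()
    except Exception:
        healthy = False           # an unreachable or crashing probe counts as unhealthy
    return "ready" if healthy else "alive-but-unhealthy"


print(node_state(1.0, 3.0, probe=lambda: True))
print(node_state(1.0, 3.0, probe=lambda: False))
print(node_state(4.0, 3.0, probe=lambda: True))
```

The heartbeat answers the cheap question first, so the more expensive probe only runs against nodes that are known to be alive.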

The Configuration Trade-off

Heartbeat Configuration Trade-offs

THE FUNDAMENTAL TRADE-OFF
┌────────────────────────────────────────────────────┐
│                                                    │
│  FAST DETECTION          vs         ACCURACY       │
│  (short timeout)              (long timeout)       │
│                                                    │
│  ✓ Detect failures quickly   ✓ Fewer false alarms  │
│  ✕ More false positives      ✕ Slow to detect      │
│  ✕ Network hiccup = "dead"   ✕ Traffic to dead     │
│                                                    │
│  No perfect answer—tune for your use case          │
│                                                    │
└────────────────────────────────────────────────────┘

CONFIGURATION EXAMPLES
┌────────────────────────────────────────────────────┐
│                                                    │
│  Use Case       │ Interval │ Timeout │ Trade-off   │
│  ───────────────┼──────────┼─────────┼───────────  │
│  Leader election│ 100ms    │ 300ms   │ Fast (FP OK)│
│  Database HA    │ 1s       │ 3s      │ Balanced    │
│  Service mesh   │ 5s       │ 15s     │ Low FP      │
│  Monitoring     │ 30s      │ 90s     │ Very low FP │
│                                                    │
│  Rule of thumb: timeout = 3× interval              │
│                                                    │
└────────────────────────────────────────────────────┘
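
The arithmetic behind the rule of thumb is small enough to write down. The sketch below assumes the monitor checks for timeouts periodically (the hypothetical `check_interval` parameter), which adds to the worst-case delay between the last heartbeat and detection.

```python
def heartbeat_budget(interval: float, multiplier: int = 3, check_interval: float = 0.0) -> dict:
    """Derive timeout figures from the 'timeout = multiplier x interval' rule of thumb."""
    timeout = multiplier * interval
    return {
        "timeout": timeout,
        # Heartbeats that can be lost before the timeout elapses.
        "tolerated_missed_beats": multiplier - 1,
        # Worst case, measured from the last heartbeat received: a check runs
        # just before the timeout expires, so detection waits for the next check.
        "worst_case_detection": timeout + check_interval,
    }


for name, interval in [("Leader election", 0.1), ("Database HA", 1.0), ("Service mesh", 5.0)]:
    b = heartbeat_budget(interval, check_interval=interval)
    print(f"{name}: timeout={b['timeout']:g}s, "
          f"tolerates {b['tolerated_missed_beats']} missed beats, "
          f"worst case={b['worst_case_detection']:g}s")
```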

FALSE POSITIVE IMPACT
┌────────────────────────────────────────────────────┐
│                                                    │
│  Low stakes (service discovery):                   │
│  └─ False positive = temporary routing change      │
│  └─ Not a big deal, can be aggressive              │
│                                                    │
│  High stakes (database failover):                  │
│  └─ False positive = split brain possible          │
│  └─ Unnecessary failover = disruption              │
│  └─ Be conservative, require confirmation          │
│                                                    │
└────────────────────────────────────────────────────┘
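
For the high-stakes case, "require confirmation" can mean waiting until several independent observers agree. A minimal sketch, assuming hypothetical observers named `monitor-a` through `monitor-c` that each report whether they have timed out on the target node:

```python
from typing import Dict


def confirmed_dead(reports: Dict[str, bool], quorum: int) -> bool:
    """Return True only if at least `quorum` observers report a timeout.

    `reports` maps an observer name to whether that observer has timed out
    on the target node.
    """
    return sum(reports.values()) >= quorum


# One observer behind a flaky link sees a timeout; the others still hear heartbeats.
reports = {"monitor-a": True, "monitor-b": False, "monitor-c": False}
print(confirmed_dead(reports, quorum=2))   # a single report is not enough

reports["monitor-b"] = True
print(confirmed_dead(reports, quorum=2))   # two independent reports confirm
```

A single observer's network hiccup then marks the node as suspected at most, and failover waits for agreement.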

Real Systems Using Heartbeats

System	Interval	Timeout	Notes
Kubernetes	10s (default)	40s (default)	Kubelet to API server
Apache ZooKeeper	~1/3 of session timeout (client pings)	Session timeout (negotiated, 2× to 20× tickTime by default)	Heartbeat in session
etcd	100ms (default)	Election timeout, 1000ms (default)	Raft heartbeats
Consul	1s (default)	10s (default)	Gossip-based
Amazon ELB	Configurable	Unhealthy threshold × interval	Health checks

Note: Defaults vary by version. Always verify in current documentation.

Heartbeat Patterns in Practice

Common Heartbeat Architectures

CENTRALIZED MONITOR
┌────────────────────────────────────────────────────┐
│                                                    │
│     Worker 1 ────♥─────┐                           │
│     Worker 2 ────♥─────┼──► Central Monitor        │
│     Worker 3 ────♥─────┘                           │
│                                                    │
│  ✓ Simple to implement                             │
│  ✓ Single source of truth                          │
│  ✕ Monitor is single point of failure              │
│  Use: Small clusters, Kubernetes control plane     │
│                                                    │
└────────────────────────────────────────────────────┘

PEER-TO-PEER (RING)
┌────────────────────────────────────────────────────┐
│                                                    │
│       Worker 1 ──♥──► Worker 2                     │
│          ↑               │                         │
│          ♥               ♥                         │
│          │               ↓                         │
│       Worker 4 ◄──♥── Worker 3                     │
│                                                    │
│  ✓ No single point of failure                      │
│  ✓ O(1) messages per node                          │
│  ✕ Longer detection path                           │
│  Use: Simple fault-tolerant clusters               │
│                                                    │
└────────────────────────────────────────────────────┘
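
The ring layout is just a successor mapping. The sketch below uses hypothetical node names and helper functions to show who sends to whom, and how the ring closes up again when a node is removed.

```python
from typing import Dict, List


def ring_targets(nodes: List[str]) -> Dict[str, str]:
    """Map each node to the successor it sends heartbeats to."""
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}


def ring_after_failure(nodes: List[str], failed: str) -> Dict[str, str]:
    """Rebuild the ring without the failed node so monitoring continues."""
    return ring_targets([n for n in nodes if n != failed])


nodes = ["worker-1", "worker-2", "worker-3", "worker-4"]
print(ring_targets(nodes))
print(ring_after_failure(nodes, failed="worker-2"))
```

Each node sends one heartbeat stream regardless of cluster size, but only one neighbor notices a failure directly, and everyone else has to be told.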

GOSSIP-BASED
┌────────────────────────────────────────────────────┐
│                                                    │
│  Each node gossips to k random peers               │
│  "I'm alive, and here's who I've heard from..."    │
│                                                    │
│  Node 1 ──♥──► Node 3 ──♥──► Node 5                │
│     │                   │                          │
│     ♥                   ♥                          │
│     ↓                   ↓                          │
│  Node 4          Node 2                            │
│                                                    │
│  ✓ Highly scalable (O(log N) propagation)          │
│  ✓ Robust to failures                              │
│  ✕ Eventually consistent failure detection         │
│  Use: Large clusters (Cassandra, Consul)           │
│                                                    │
└────────────────────────────────────────────────────┘
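
The "here's who I've heard from" part can be sketched as a table of heartbeat counters that nodes merge when they gossip. This is a simplified illustration of counter-based gossip, not the scheme used by the systems named above, which use more elaborate detectors; the `merge` and `suspects` helpers are hypothetical.

```python
from typing import Dict, List, Tuple

# node_id -> (heartbeat counter, local time when the counter last increased)
Table = Dict[str, Tuple[int, float]]


def merge(local: Table, remote_counters: Dict[str, int], now: float) -> None:
    """Merge a peer's heartbeat counters into the local table (higher counter wins)."""
    for node_id, counter in remote_counters.items():
        known_counter, _ = local.get(node_id, (-1, now))
        if counter > known_counter:
            local[node_id] = (counter, now)  # fresher news: restamp with local time


def suspects(local: Table, now: float, timeout: float) -> List[str]:
    """Nodes whose counter has not advanced within `timeout` seconds."""
    return sorted(n for n, (_, seen) in local.items() if now - seen > timeout)


table: Table = {"node-1": (5, 0.0), "node-2": (7, 0.0)}
merge(table, {"node-1": 9, "node-2": 7, "node-3": 1}, now=4.0)
print(table)
print(suspects(table, now=6.0, timeout=5.0))
```

A node is suspected only when no peer has brought a higher counter for it within the timeout, which is why detection is eventually consistent: different nodes may reach that conclusion at different times.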

When to Use Heartbeats

✓ Perfect Use Cases

Use Case	Scenario	Requirement	Configuration	Trade-off
Leader election	Database with primary/replica setup	Detect leader failure for failover	100ms interval, 300ms timeout	Fast detection, some false positives
Cluster membership	Service discovery, load balancing	Know which nodes are available	1s interval, 5s timeout	Moderate detection, low false positives
Session keepalive	ZooKeeper sessions, distributed locks	Maintain session while client active	Session-based, configurable timeout	Balance responsiveness vs overhead
Worker pool monitoring	Task queue with worker processes	Redistribute tasks from dead workers	5s interval, 15s timeout	Lower urgency, very low false positives

✕ When NOT to Use (or Use Carefully)

Situation	Problem	Example	Alternative	When OK
Need deeper health status	Heartbeat says “alive” but app is broken	Service running but can’t connect to DB	Health checks with dependency probing	Combine heartbeat + health check
Deterministic failure detection	A missed heartbeat can’t prove the node failed	Two nodes each think the other is dead (split brain)	Consensus protocols (Raft, Paxos)	Accept occasional false positives
Very high frequency systems	Heartbeat overhead at 100+ Hz	Real-time trading, gaming	Integrated health in message protocol	Lower frequency acceptable
Asymmetric networks	Heartbeat path ≠ data path	Node reachable for heartbeat but not data	Probe actual service endpoints	Network is symmetric

Interview Application

Common Interview Question

Q: “Explain how heartbeats work in distributed systems and the trade-offs involved in configuring them.”

Strong Answer:

“Heartbeats are periodic ‘I’m alive’ messages used for failure detection. Here’s how they work:

Mechanism:

Sender: Every N seconds, send a heartbeat to the monitor

Receiver: Track ‘last seen’ timestamp per node

Detection: If no heartbeat for timeout period, mark node as suspected failed

The Core Trade-off:

Config	Detection Speed	False Positives	Example
100ms/300ms	Very fast	High	Leader election
1s/3s	Fast	Moderate	Database HA
5s/15s	Slow	Low	Service mesh

Why false positives matter:

Network hiccup during timeout window = healthy node marked dead

Consequence: unnecessary failover, split brain risk

Why detection speed matters:

Slow detection = traffic continues to dead node

Consequence: errors, latency, data loss

Rule of thumb: timeout = 3× interval. For a 1-second interval, use 3-second timeout—tolerates 2 missed heartbeats before suspecting failure.

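Real-world example: Kubernetes uses a 10s heartbeat interval with a 40s timeout by default. After ~40s of no heartbeats the node is marked NotReady; pods are evicted only after a further toleration period (5 minutes by default). This is tuned for stability over speed: Kubernetes prioritizes avoiding false positives.”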

Follow-up: How do you handle the case where a heartbeat succeeds but the service is actually broken?

“Heartbeats only prove the process is running, not that it’s healthy. A service can send heartbeats while:

Its database connection is dead

It’s in an infinite loop

It’s out of memory but not crashed

Solutions:

Liveness + Readiness separation (Kubernetes model):

Liveness probe: Is the process alive? (heartbeat)

Readiness probe: Can it serve traffic? (deeper health check)

Application-level heartbeat:

Include health status in heartbeat message

{ alive: true, db_connected: true, queue_healthy: true }

Hierarchical health checks:

Heartbeat for fast liveness

Periodic deep health check (every 30s) for readiness

Best practice: Use heartbeats for ‘is the process running?’ and separate health checks for ‘can it serve requests?’”
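
An application-level heartbeat like the one described in the answer could be built as follows. This is a sketch under assumptions: the `build_heartbeat` helper, the `ready` summary field, and the two example checks are hypothetical, and real checks would ping the database or inspect queue depth.

```python
import json
import time
from typing import Callable, Dict


def build_heartbeat(node_id: str, checks: Dict[str, Callable[[], bool]]) -> str:
    """Serialize a heartbeat that carries the result of each named health check."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises is reported as failing
    payload = {
        "node_id": node_id,
        "timestamp": time.time(),
        "alive": True,                    # sending at all proves the process runs
        "ready": all(results.values()),   # assumed summary flag for routing decisions
        **results,
    }
    return json.dumps(payload)


# Hypothetical checks standing in for real dependency probes.
message = build_heartbeat(
    "worker-1",
    {"db_connected": lambda: True, "queue_healthy": lambda: False},
)
print(message)
```

The monitor can then keep the node in the cluster (it is alive) while routing traffic away from it (it is not ready).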

Follow-up: What’s the difference between a heartbeat and a lease?

“They’re related but serve different purposes:

Heartbeat: Continuous signal—‘I’m still here.’ Monitor tracks last-seen timestamp. No explicit acknowledgment required.

Lease: Time-limited grant—‘You have permission until T.’ Must be renewed before expiry. Server explicitly grants/extends.

Key difference:

Heartbeat: Detection is passive (monitor notices absence)

Lease: Detection is active (lease holder knows when it expires)

Example:

ZooKeeper sessions: Heartbeat keeps session alive

Distributed locks: Lease on lock auto-expires if not renewed

Leases add safety: if a node partitions, it knows its lease expires and should stop acting as leader. With pure heartbeats, a partitioned node might keep acting as leader, thinking it’s fine.”
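
The holder-side check that makes a lease safer than a bare heartbeat fits in a few lines. A minimal sketch, assuming a hypothetical `Lease` class with an injectable clock and a `safety_margin` for clock drift; the network round trip to the granting server is left out.

```python
import time
from typing import Callable


class Lease:
    """Holder-side view of a time-limited grant."""

    def __init__(self, duration: float, clock: Callable[[], float] = time.monotonic):
        self.duration = duration
        self.clock = clock
        self.expires_at = clock() + duration

    def renew(self) -> None:
        # In a real system this runs only after the granting server acknowledges.
        self.expires_at = self.clock() + self.duration

    def is_valid(self, safety_margin: float = 0.0) -> bool:
        # The holder consults its own clock; no message from the server is
        # needed to learn that the lease has lapsed.
        return self.clock() < self.expires_at - safety_margin


now = 0.0
lease = Lease(duration=10.0, clock=lambda: now)
now = 8.0
print(lease.is_valid())                    # still within the grant
print(lease.is_valid(safety_margin=3.0))   # stricter check with a drift margin
now = 12.0                                 # partitioned: no renewal got through
print(lease.is_valid())
```

A leader that calls `is_valid()` before every privileged action steps down on its own once renewals stop, even though nobody can reach it to say so.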

Code Example

Heartbeat System (Python)

import time
import threading
from dataclasses import dataclass, field
from typing import Dict, Callable, Optional
from enum import Enum

class NodeStatus(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"

@dataclass
class NodeState:
    """Tracked state for a node."""
    node_id: str
    last_heartbeat: float
    status: NodeStatus = NodeStatus.ALIVE
    metadata: dict = field(default_factory=dict)

class HeartbeatMonitor:
    """
    Centralized heartbeat monitor.
    # ... omitted: keep concept snippets short
    time.sleep(5)

    print("\nFinal status:")
    print(f"  worker-1: {monitor.get_status('worker-1')}")
    print(f"  worker-2: {monitor.get_status('worker-2')}")
    print(f"  Alive nodes: {monitor.get_alive_nodes()}")

    # Cleanup
    worker2.stop()
    monitor.stop()
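
Because the listing above is abridged, here is a compact, self-contained sketch of how a monitor and sender with the same method names could fit together. It is not the omitted code: the constructor parameters, the `receive_heartbeat` method, and the shortened intervals are assumptions, and the SUSPECTED to DEAD confirmation step is left out.

```python
import threading
import time
from enum import Enum
from typing import Callable, Dict, List, Optional


class NodeStatus(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"  # not assigned in this sketch; confirmation logic is left out


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that go quiet."""

    def __init__(self, timeout: float, check_interval: float):
        self.timeout = timeout
        self.check_interval = check_interval
        self._last_seen: Dict[str, float] = {}
        self._status: Dict[str, NodeStatus] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._check_loop, daemon=True)
        self._thread.start()

    def receive_heartbeat(self, node_id: str, metadata: Optional[dict] = None) -> None:
        # `metadata` is accepted but not stored in this sketch.
        with self._lock:
            self._last_seen[node_id] = time.monotonic()  # monitor's clock, not the sender's
            self._status[node_id] = NodeStatus.ALIVE     # a late heartbeat clears suspicion

    def get_status(self, node_id: str) -> Optional[NodeStatus]:
        with self._lock:
            return self._status.get(node_id)

    def get_alive_nodes(self) -> List[str]:
        with self._lock:
            return sorted(n for n, s in self._status.items() if s is NodeStatus.ALIVE)

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _check_loop(self) -> None:
        # Event.wait() acts as an interruptible sleep between checks.
        while not self._stop.wait(self.check_interval):
            now = time.monotonic()
            with self._lock:
                for node_id, seen in self._last_seen.items():
                    if now - seen > self.timeout:
                        self._status[node_id] = NodeStatus.SUSPECTED


class HeartbeatSender:
    """Sends a heartbeat every `interval` seconds from a background thread."""

    def __init__(self, node_id: str, monitor: HeartbeatMonitor, interval: float,
                 metadata_fn: Optional[Callable[[], dict]] = None):
        self.node_id = node_id
        self.monitor = monitor
        self.interval = interval
        self.metadata_fn = metadata_fn
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            metadata = self.metadata_fn() if self.metadata_fn else None
            self.monitor.receive_heartbeat(self.node_id, metadata)
            self._stop.wait(self.interval)


# Demo with short intervals so it finishes in under two seconds.
monitor = HeartbeatMonitor(timeout=0.5, check_interval=0.05)
worker1 = HeartbeatSender("worker-1", monitor, interval=0.05)
worker2 = HeartbeatSender("worker-2", monitor, interval=0.05)
time.sleep(0.2)
worker1.stop()      # simulate a crash: worker-1 goes silent
time.sleep(1.0)     # well past the 0.5s timeout
print(monitor.get_status("worker-1"), monitor.get_status("worker-2"))
print("Alive nodes:", monitor.get_alive_nodes())
worker2.stop()
monitor.stop()
```

Using `threading.Event.wait` instead of `time.sleep` lets `stop()` interrupt both loops immediately rather than waiting out a full interval.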

Heartbeat with Metadata (Production Pattern)

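import psutil

def get_node_health() -> dict:
    """Collect node health metrics to include in heartbeat."""
    return {
        # With no interval, cpu_percent() measures since the previous call;
        # the very first call returns a meaningless 0.0.
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage('/').percent,
        "load_avg": psutil.getloadavg()[0],  # 1-minute load average
        # net_connections() may need elevated privileges on some platforms.
        "connections": len(psutil.net_connections()),
    }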

# Usage
sender = HeartbeatSender(
    node_id="worker-1",
    monitor=monitor,
    interval=1.0,
    metadata_fn=get_node_health  # Include health in each heartbeat
)

See It In Action:

Heartbeat & Failure Detection Explainer - Visual walkthrough of timeout detection

Related Concepts:

Failure Detection - The broader problem heartbeats solve
Health Checks - Pull-based alternative
Consensus - Uses heartbeats for leader detection

Quick Self-Check

Can you explain heartbeats in 60 seconds?
Do you understand the trade-off between detection speed and false positives?
Do you know the difference between heartbeat (push) and health check (pull)?
Can you implement a basic heartbeat monitor with timeouts?
Do you understand why timeout = 3× interval is a common rule of thumb?
Do you know when to use heartbeats vs leases?