Circuit Breaker | Concepts

TL;DR

A circuit breaker wraps calls to external services and tracks failures. When failures reach a threshold, the breaker “trips open”: subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes; if it fails, the breaker reopens and the timeout starts again.

Visual Overview

Circuit Breaker State Machine

THREE-STATE MACHINE
┌────────────────────────────────────────────────────┐
│                                                    │
│                    ┌──────────┐                    │
│                    │  CLOSED  │←─────────────┐     │
│                    │ (normal) │              │     │
│                    └────┬─────┘              │     │
│                         │                    │     │
│            failures ≥ threshold              │     │
│                         │                success   │
│                         ↓                    │     │
│                    ┌──────────┐              │     │
│  ┌────────────────→│   OPEN   │              │     │
│  │                 │  (trip)  │              │     │
│  │                 └────┬─────┘              │     │
│  │                      │                    │     │
│  │              timeout expires              │     │
│  │                      │                    │     │
│  │                      ↓                    │     │
│  │                ┌───────────┐              │     │
│  │     failure    │ HALF-OPEN │──────────────┘     │
│  └────────────────│  (test)   │                    │
│                   └───────────┘                    │
│                                                    │
└────────────────────────────────────────────────────┘

STATE BEHAVIORS
┌────────────────────────────────────────────────────┐
│  CLOSED: Normal operation                          │
│  ├─ Requests pass through to downstream            │
│  ├─ Failures counted                               │
│  └─ Threshold breach → trip to OPEN                │
│                                                    │
│  OPEN: Protection mode                             │
│  ├─ All requests fail immediately                  │
│  ├─ No network calls made (fail fast)              │
│  └─ Timer running for recovery attempt             │
│                                                    │
│  HALF-OPEN: Testing recovery                       │
│  ├─ Allow ONE test request through                 │
│  ├─ Success → close breaker                        │
│  └─ Failure → reopen breaker                       │
└────────────────────────────────────────────────────┘
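
The three states and their transitions can be sketched in a few lines of Python. This is a minimal, single-threaded illustration rather than a production implementation: the class name `MiniBreaker`, the `CircuitOpenError` exception, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class MiniBreaker:
    """Illustrative three-state breaker (hypothetical, not thread-safe)."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # timeout expired: let a test call through
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        # Any success (including the half-open test call) closes the breaker.
        self.state = "closed"
        self.failures = 0

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed test call reopens immediately; otherwise trip at the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Because a success resets the counter, this sketch counts consecutive failures. Passing a fake `clock` makes the OPEN to HALF-OPEN transition testable without real waiting.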

Core Explanation

What is a Circuit Breaker?

Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.

Software circuit breakers work the same way:

Overload = too many failures calling a downstream service
Trip = stop calling that service
Reset = test if service recovered, then resume

The Problem It Solves

Cascade Failure Without Circuit Breaker

WITHOUT CIRCUIT BREAKER: CASCADE FAILURE
┌────────────────────────────────────────────────────┐
│                                                    │
│  Service A          Service B          Service C   │
│  ┌──────┐          ┌──────┐          ┌──────┐      │
│  │      │ ───────► │      │ ───────► │ SLOW │      │
│  │      │          │      │          │  ✗   │      │
│  └──────┘          └──────┘          └──────┘      │
│                                                    │
│  1. Service C becomes slow (5s timeouts)           │
│                                                    │
│  2. Service B threads block waiting for C          │
│     Thread pool: [████████████] exhausted!         │
│                                                    │
│  3. Service B stops responding to A                │
│     Service A threads block waiting for B          │
│     Thread pool: [████████████] exhausted!         │
│                                                    │
│  4. Service A fails → User sees error              │
│                                                    │
│  One slow service took down the entire chain!      │
└────────────────────────────────────────────────────┘

With Circuit Breaker: Fail Fast

┌────────────────────────────────────────────────────┐
│                                                    │
│  Service A          Service B          Service C   │
│  ┌──────┐          ┌──────┐  [CB]    ┌──────┐      │
│  │      │ ───────► │      │──┤├────► │ SLOW │      │
│  │      │          │      │  OPEN    │  ✗   │      │
│  └──────┘          └──────┘          └──────┘      │
│                                                    │
│  1. Service C becomes slow                         │
│                                                    │
│  2. Circuit breaker detects failures, trips OPEN   │
│                                                    │
│  3. Service B returns fast failure (no wait!)      │
│     "Service C unavailable" in <1ms                │
│                                                    │
│  4. Service B stays healthy                        │
│     Thread pool: [██░░░░░░░░░░] plenty free        │
│                                                    │
│  5. Service A gets quick error, can show fallback  │
│     User sees degraded experience, not failure     │
└────────────────────────────────────────────────────┘
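
A back-of-the-envelope calculation shows why slow calls exhaust a thread pool. By Little's law, the number of threads held at once is roughly the arrival rate multiplied by the time each call takes. The request rate, pool size, and healthy latency below are made-up numbers for illustration; only the 5 s timeout and the ~1 ms fast failure come from the scenarios above.

```python
def threads_in_use(requests_per_second: float, seconds_per_call: float) -> float:
    """Little's law: average concurrency = arrival rate x time per call."""
    return requests_per_second * seconds_per_call


POOL_SIZE = 200   # hypothetical thread pool size for Service B
RATE = 100.0      # hypothetical requests per second reaching Service B

healthy = threads_in_use(RATE, 0.05)      # assume C normally answers in 50 ms
degraded = threads_in_use(RATE, 5.0)      # C hits the 5 s timeout
fail_fast = threads_in_use(RATE, 0.001)   # breaker open, ~1 ms rejection

for label, needed in [("healthy", healthy), ("degraded", degraded), ("fail fast", fail_fast)]:
    status = "exhausted" if needed > POOL_SIZE else "ok"
    print(f"{label:>9}: ~{needed:g} threads needed of {POOL_SIZE} -> {status}")
```

Under these assumed numbers, the degraded case needs more threads than the pool has, while the fail-fast case needs almost none.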

Timeline Example

┌────────────────────────────────────────────────────────────┐
│  Time:   0s    5s    10s   40s   41s   42s   45s           │
│          │     │     │     │     │     │     │             │
│  State:  CLOSED─────►OPEN──────────────►HALF──►CLOSED      │
│          │     │     │     │           │OPEN │             │
│          │     │     │     │           │     │             │
│  Events: │     │     │     │           │     │             │
│          │     │     │     │           │     └─ success!   │
│          │     │     │     │           └─ test request     │
│          │     │     │     └─ timeout expires (30s)        │
│          │     │     └─ 5th failure → TRIP                 │
│          │     └─ failures accumulating                    │
│          └─ normal operation                               │
│                                                            │
│  Requests:                                                 │
│  0-10s:  ✓ ✓ ✗ ✗ ✗ ✗ ✗ [TRIP]                              │
│  10-40s: ✗ ✗ ✗ ✗ ✗ (instant fail, no network call)         │
│  41s:    ✓ (test request succeeds)                         │
│  42s+:   ✓ ✓ ✓ ✓ (normal operation resumed)                │
└────────────────────────────────────────────────────────────┘

Configuration Parameters

Parameter	Description	Typical Value	Trade-off
Failure Threshold	Failures before tripping	5-10	Low = sensitive, High = slow to protect
Open Timeout	How long to stay open before allowing a test request	30-60 seconds	Short = fast recovery, Long = gentle on recovering service
Success Threshold	Successes in half-open before closing	1-3	Low = fast recovery, High = more confidence
Window Size	Rolling time window for counting failures (instead of counting consecutive failures)	60 seconds	Short = forgets failures quickly, Long = old failures keep counting
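
The Window Size row is easiest to see in code. The sketch below counts failures inside a rolling time window rather than consecutively. `RollingFailureWindow` and its method names are hypothetical, chosen for this example, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import deque
from typing import Deque


class RollingFailureWindow:
    """Counts failures that happened within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0, failure_threshold: int = 5) -> None:
        self.window_seconds = window_seconds
        self.failure_threshold = failure_threshold
        self._failure_times: Deque[float] = deque()

    def record_failure(self, now: float) -> None:
        self._failure_times.append(now)

    def failures_in_window(self, now: float) -> int:
        # Drop timestamps that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._failure_times and self._failure_times[0] <= cutoff:
            self._failure_times.popleft()
        return len(self._failure_times)

    def should_trip(self, now: float) -> bool:
        return self.failures_in_window(now) >= self.failure_threshold
```

With consecutive counting, a single success resets the count; with a rolling window, intermittent failures keep accumulating until they age out.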

Real Systems Using Circuit Breakers

Library/System	Language	Features	Use Case
Hystrix (Netflix)	Java	Bulkheads, fallbacks, metrics	Legacy (in maintenance mode) but well-documented
Resilience4j	Java	Modern, lightweight, Spring integration	Recommended for new Java projects
Polly	.NET	Policies, retry + circuit breaker	C# applications
opossum	Node.js	Simple, Prometheus metrics	JavaScript/TypeScript services
gobreaker	Go	Simple, concurrent-safe	Go microservices
Istio	Service mesh	Sidecar-based, no code changes	Kubernetes environments

Case Study: E-Commerce Checkout

E-COMMERCE CHECKOUT WITH CIRCUIT BREAKERS
┌────────────────────────────────────────────────────┐
│                                                    │
│  Checkout Service                                  │
│        │                                           │
│        ├──[CB]──► Payment Service                  │
│        │          └─ Fallback: "Pay later" option  │
│        │                                           │
│        ├──[CB]──► Inventory Service                │
│        │          └─ Fallback: Cached stock levels │
│        │                                           │
│        ├──[CB]──► Shipping Calculator              │
│        │          └─ Fallback: Flat rate estimate  │
│        │                                           │
│        └──[CB]──► Recommendation Service           │
│                   └─ Fallback: Hide section        │
│                                                    │
│  SCENARIO: Payment service down                    │
│  ├─ Payment CB trips OPEN                          │
│  ├─ Checkout offers "Pay later" or "PayPal"        │
│  ├─ Other services unaffected                      │
│  └─ Customer can still complete order              │
└────────────────────────────────────────────────────┘
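
Each fallback in the diagram amounts to "try the guarded call, and if it fails or is rejected, return a degraded answer". A minimal sketch follows; the function names (`call_with_fallback`, `fetch_live_stock`) and the cached data are hypothetical stand-ins, not part of any library.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary(); on any exception (including a breaker rejection), use fallback()."""
    try:
        return primary()
    except Exception:
        return fallback()


CACHED_STOCK: Dict[str, int] = {"sku-123": 7}  # hypothetical cached stock levels


def fetch_live_stock() -> int:
    # Stand-in for a breaker-guarded call to the inventory service.
    raise ConnectionError("inventory service unavailable")


stock = call_with_fallback(fetch_live_stock, lambda: CACHED_STOCK["sku-123"])
print(f"stock level shown to customer: {stock}")
```

In a real service you would likely catch a narrower set of exceptions and log the failure before falling back, so that degraded responses remain visible to operators.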

When to Use Circuit Breakers

✓ Perfect Use Cases

EXTERNAL API CALLS
Scenario: Calling third-party payment processor
Requirement: Don't let payment issues kill entire checkout
Configuration: Threshold=3, Timeout=60s
Fallback: Offer alternative payment methods

DATABASE CONNECTIONS
Scenario: Primary DB under heavy load
Requirement: Don't exhaust connection pool
Configuration: Threshold=5, Timeout=30s
Fallback: Read from replica, queue writes

MICROSERVICE CALLS
Scenario: Calling inventory service during checkout
Requirement: Checkout works even if inventory is slow
Configuration: Threshold=5, Timeout=30s
Fallback: Use cached inventory, verify at shipment

EXPENSIVE OPERATIONS
Scenario: ML model inference service
Requirement: Don't block on slow predictions
Configuration: Threshold=3, Timeout=10s
Fallback: Use simpler heuristic, default recommendation
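
One way to keep per-dependency settings like these in one place is a small settings table in code. The sketch below copies the thresholds, timeouts, and fallbacks from the four scenarios above; the `BreakerSettings` type and the dictionary keys are made-up names for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BreakerSettings:
    failure_threshold: int   # failures before tripping
    open_timeout_s: float    # seconds to stay open before a test request
    fallback: str            # human-readable description of the fallback


# Values copied from the four scenarios above; the keys are hypothetical names.
SETTINGS: Dict[str, BreakerSettings] = {
    "payment_api": BreakerSettings(3, 60.0, "offer alternative payment methods"),
    "primary_db": BreakerSettings(5, 30.0, "read from replica, queue writes"),
    "inventory": BreakerSettings(5, 30.0, "use cached inventory, verify at shipment"),
    "ml_inference": BreakerSettings(3, 10.0, "simpler heuristic or default recommendation"),
}

for name, s in SETTINGS.items():
    print(f"{name:<13} threshold={s.failure_threshold} "
          f"timeout={s.open_timeout_s:g}s -> {s.fallback}")
```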

✕ When NOT to Use

CRITICAL PATH WITH NO FALLBACK
Problem: If payment MUST succeed, a circuit breaker only turns a slow failure into a fast one; the request still fails
Alternative: Retry with backoff, queue for later processing
When OK: If you have a meaningful fallback (alternative payment)

SIMPLE INTERNAL CALLS
Problem: Overhead not worth it for simple, reliable calls
Alternative: Just handle errors normally
When OK: For unreliable or slow internal services

FIRE-AND-FORGET CALLS
Problem: Async calls don't block the caller, so there are no waiting threads for a breaker to protect
Alternative: Dead letter queues, retry queues
When OK: If you need to track failure rates for alerting
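
The "retry with backoff" alternative mentioned above can be sketched as follows. This is one common variant (capped exponential backoff with full jitter), not the only way to do it; the function name, parameter names, and default values are assumptions for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 0.5, max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any exception with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # jitter spreads retries out over time
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. Retrying only makes sense for operations that are safe to repeat, such as idempotent requests.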

Interview Application

Common Interview Question

Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”

Strong Answer:

“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:

Why Circuit Breakers:

Prevent cascade failures: One slow service shouldn’t take down the entire system

Fail fast: Return errors in milliseconds instead of waiting for timeouts

Allow recovery: Give failing services breathing room to recover

Enable fallbacks: Return cached data or degraded responses

Implementation:

Use a library like Resilience4j (Java) or Polly (.NET)

Configure per-dependency: payment service might have stricter thresholds than recommendations

Configuration for critical path (e.g., inventory check):

Failure threshold: 5 failures in 60 seconds

Open timeout: 30 seconds

Half-open: Allow 1 test request

Fallback strategy:

Inventory: Return cached stock levels, verify at shipment

Payment: Offer alternative payment methods

Recommendations: Hide the section entirely

Monitoring:

Track circuit breaker state in metrics (Prometheus/Grafana)

Alert when breaker trips (indicates downstream problem)

Dashboard showing breaker states across all services

Combined with other patterns:

Retry with exponential backoff for transient failures

Bulkheads to isolate thread pools per dependency

Timeouts to bound how long we wait”

Follow-up: How do you decide on circuit breaker thresholds?

“I’d start with conservative defaults and tune based on data:

Start with:

Failure threshold: 5 consecutive or 50% in 10 requests

Timeout: 30 seconds

Half-open test count: 1

Tune based on:

Normal error rate: If the service has a 1% baseline error rate, the threshold should sit well above that baseline so normal noise doesn’t trip the breaker

Recovery time: How long does the service typically take to recover?

Business impact: Critical services might need faster tripping

Monitor and adjust:

If breaker trips too often on transient errors → raise threshold

If cascade failures still occur → lower threshold

If service recovers but breaker stays open → shorten timeout”
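
The baseline-error-rate point can be made concrete with a little probability. Assuming failures are independent (a simplification, since real outages are correlated), the sketch below estimates how often a healthy service would trip each starting threshold purely by bad luck. The function names and the sample baseline rates are chosen for illustration.

```python
from math import comb


def p_consecutive(p_fail: float, n: int) -> float:
    """Probability that n given calls in a row all fail."""
    return p_fail ** n


def p_at_least(p_fail: float, k: int, window: int) -> float:
    """Probability of at least k failures among `window` independent calls."""
    return sum(comb(window, i) * p_fail ** i * (1 - p_fail) ** (window - i)
               for i in range(k, window + 1))


for baseline in (0.01, 0.10, 0.30):
    print(f"baseline {baseline:.0%}: "
          f"5 in a row = {p_consecutive(baseline, 5):.2e}, "
          f">=50% of 10 = {p_at_least(baseline, 5, 10):.2e}")
```

At a 1% baseline both rules almost never trip by chance. At a 30% baseline, the "50% in 10 requests" rule would trip on roughly 15% of windows under this model, which is the kind of signal that suggests raising the threshold.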

Code Example

Circuit Breaker with Resilience4j (Java)

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;

        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN when at least 50% of recorded calls have failed
            .failureRateThreshold(50)  // failure rate threshold, in percent
            .minimumNumberOfCalls(5)    // need at least 5 recorded calls before the rate is evaluated

            // Stay OPEN for 30 seconds before testing
    // ... omitted: keep concept snippets short
    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}

Simple Python Implementation

import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


    # ... omitted: keep concept snippets short
        try:
            result = unreliable_service()
            print(f"Call {i+1}: {result}")
        except CircuitOpenError as e:
            print(f"Call {i+1}: BLOCKED - {e}")
        except Exception as e:
            print(f"Call {i+1}: FAILED - {e}")

        print(f"  Status: {cb.status}")
        time.sleep(1)

See It In Action:

Circuit Breaker Explainer - Visual walkthrough of state transitions

Related Concepts:

Failover - What happens after failure detection
Health Checks - Proactive health verification
Rate Limiting - Another traffic control pattern

Quick Self-Check

Can you explain a circuit breaker in 60 seconds?
Do you understand the three states and their transitions?
Do you know what “fail fast” means and why it matters?
Can you configure thresholds and explain the trade-offs?
Do you understand how circuit breakers relate to retries and timeouts?
Can you implement a fallback strategy?