TL;DR
A circuit breaker wraps calls to external services and tracks failures. When failures exceed a threshold, the breaker “trips open”: subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes; if it fails, the breaker reopens and the timeout starts again.
Visual Overview
THREE-STATE MACHINE

             ┌──────────┐
      ┌─────►│  CLOSED  │
      │      │ (normal) │
      │      └────┬─────┘
      │           │ failures > threshold
      │           ▼
      │      ┌──────────┐
   success   │   OPEN   │◄─────┐
      │      │  (trip)  │      │
      │      └────┬─────┘      │
      │           │ timeout    │ failure
      │           ▼ expires    │
      │      ┌───────────┐     │
      └──────┤ HALF-OPEN ├─────┘
             │  (test)   │
             └───────────┘

STATE BEHAVIORS

CLOSED: Normal operation
├─ Requests pass through to downstream
├─ Failures counted
└─ Threshold breach → trip to OPEN

OPEN: Protection mode
├─ All requests fail immediately
├─ No network calls made (fail fast)
└─ Timer running for recovery attempt

HALF-OPEN: Testing recovery
├─ Allow ONE test request through
├─ Success → close breaker
└─ Failure → reopen breaker
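The state transitions above can be written down as a small lookup table. This is a minimal sketch, not any library's API: the event names (`threshold_exceeded`, `timeout_expired`, `success`, `failure`) are made up for illustration.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_expired"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or stay in place if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair not listed in the table leaves the state unchanged, which is why a success reported while the breaker is OPEN has no effect.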
Core Explanation
What is a Circuit Breaker?
Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.
Software circuit breakers work the same way:
- Overload = too many failures calling a downstream service
- Trip = stop calling that service
- Reset = test if service recovered, then resume
The Problem It Solves
WITHOUT CIRCUIT BREAKER: CASCADE FAILURE

Service A          Service B          Service C
┌──────┐           ┌──────┐           ┌──────┐
│      │ ────────► │      │ ────────► │ SLOW │
│      │           │      │           │  ✗   │
└──────┘           └──────┘           └──────┘

1. Service C becomes slow (5s timeouts)

2. Service B threads block waiting for C
   Thread pool: [████████████] exhausted!

3. Service B stops responding to A
   Service A threads block waiting for B
   Thread pool: [████████████] exhausted!

4. Service A fails → User sees error

One slow service took down the entire chain!
WITH CIRCUIT BREAKER

Service A          Service B   [CB]   Service C
┌──────┐           ┌──────┐           ┌──────┐
│      │ ────────► │      │ ───┤├───► │ SLOW │
│      │           │      │    OPEN   │  ✗   │
└──────┘           └──────┘           └──────┘

1. Service C becomes slow

2. Circuit breaker detects failures, trips OPEN

3. Service B returns fast failure (no wait!)
   "Service C unavailable" in <1ms

4. Service B stays healthy
   Thread pool: [██░░░░░░░░░░] plenty free

5. Service A gets quick error, can show fallback
   User sees degraded experience, not failure
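A rough calculation shows why slow calls exhaust a thread pool. By Little's law, a pool of N threads that each block for L seconds per call can sustain at most N / L requests per second. The pool size (200) and healthy latency (50 ms) below are assumptions chosen for illustration; only the 5-second timeout comes from the scenario above.

```python
def max_throughput(threads: int, latency_seconds: float) -> float:
    """Little's law: sustainable requests/second = concurrency / latency."""
    return threads / latency_seconds


THREADS = 200  # assumed pool size, not a figure from the article

healthy = max_throughput(THREADS, 0.05)  # assumed 50 ms per call when C is healthy
degraded = max_throughput(THREADS, 5.0)  # every call waits out the 5 s timeout

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.0f} req/s")
```

Under these assumptions Service B's capacity falls by a factor of 100 once every call waits for the timeout, so ordinary traffic is enough to fill the pool. Failing fast keeps the per-call latency small and the pool free.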
Timeline Example
Time:    0s    5s    10s   40s   41s   42s   45s
         │     │     │     │     │     │     │
State:   CLOSED─────►OPEN─►HALF───────►CLOSED
         │     │     │     OPEN  │     │
Events:  │     │     │     │     │     │
         │     │     │     │     │     └─ success!
         │     │     │     │     └─ test request
         │     │     │     └─ timeout expires (30s)
         │     │     └─ 5th failure → TRIP
         │     └─ failures accumulating
         └─ normal operation

Requests:
0-10s:  ✓ ✓ ✗ ✗ ✗ ✗ ✗ [TRIP]
10-40s: ✗ ✗ ✗ ✗ ✗ (instant fail, no network call)
41s:    ✓ (test request succeeds)
42s+:   ✓ ✓ ✓ ✓ (normal operation resumed)
Configuration Parameters
| Parameter | Description | Typical Value | Trade-off |
|---|---|---|---|
| Failure Threshold | Failures before tripping | 5-10 | Low = sensitive, High = slow to protect |
| Timeout | How long to stay open | 30-60 seconds | Short = fast recovery, Long = gentle on recovering service |
| Success Threshold | Successes in half-open before closing | 1-3 | Low = fast recovery, High = more confidence |
| Window Size | Time window for counting failures | 60 seconds | Short = old failures forgotten quickly, Long = old failures keep counting toward the threshold |
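The Window Size parameter implies that failures are counted over a rolling time window rather than as a simple running total. A minimal sketch of such a counter follows; the class name `RollingFailureWindow` is hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class RollingFailureWindow:
    """Counts failures that occurred within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._failures = deque()  # failure timestamps, oldest first

    def record_failure(self, now: float) -> None:
        self._failures.append(now)

    def count(self, now: float) -> int:
        # Evict timestamps that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()
        return len(self._failures)


window = RollingFailureWindow(window_seconds=60.0)
for t in (0.0, 10.0, 20.0, 30.0):
    window.record_failure(t)
```

A breaker built on this would compare `window.count(now)` against the failure threshold, so a burst of errors from several minutes ago no longer contributes to tripping.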
Real Systems Using Circuit Breakers
| Library/System | Language | Features | Use Case |
|---|---|---|---|
| Hystrix (Netflix) | Java | Bulkheads, fallbacks, metrics | Legacy (in maintenance mode) but well-documented |
| Resilience4j | Java | Modern, lightweight, Spring integration | Recommended for new Java projects |
| Polly | .NET | Policies, retry + circuit breaker | C# applications |
| opossum | Node.js | Simple, Prometheus metrics | JavaScript/TypeScript services |
| gobreaker | Go | Simple, concurrent-safe | Go microservices |
| Istio | Service mesh | Sidecar-based, no code changes | Kubernetes environments |
Case Study: E-Commerce Checkout
E-COMMERCE CHECKOUT WITH CIRCUIT BREAKERS

Checkout Service
│
├──[CB]──► Payment Service
│          └─ Fallback: "Pay later" option
│
├──[CB]──► Inventory Service
│          └─ Fallback: Cached stock levels
│
├──[CB]──► Shipping Calculator
│          └─ Fallback: Flat rate estimate
│
└──[CB]──► Recommendation Service
           └─ Fallback: Hide section

SCENARIO: Payment service down
├─ Payment CB trips OPEN
├─ Checkout offers "Pay later" or "PayPal"
├─ Other services unaffected
└─ Customer can still complete order
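The per-dependency fallbacks in this case study can be sketched as a small helper that tries the protected call and falls back on any failure. The helper `call_with_fallback`, the stand-in `shipping_quote`, and the flat rate of 5.99 are all hypothetical; `CircuitOpenError` mirrors the exception name used in the Python example later in this article.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised by a breaker that is currently rejecting calls."""


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run `primary`; on any failure, including an open breaker, use `fallback`."""
    try:
        return primary()
    except Exception:
        return fallback()


def shipping_quote() -> float:
    # Stand-in for a breaker-protected call whose breaker is currently open.
    raise CircuitOpenError("shipping calculator unavailable")


FLAT_RATE = 5.99  # illustrative flat-rate estimate

quote = call_with_fallback(shipping_quote, lambda: FLAT_RATE)
print(f"shipping quote: {quote}")
```

Each dependency gets its own fallback, so an open breaker on one call degrades only that part of the checkout.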
When to Use Circuit Breakers
✓ Perfect Use Cases
EXTERNAL API CALLS
Scenario: Calling third-party payment processor
Requirement: Don't let payment issues kill entire checkout
Configuration: Threshold=3, Timeout=60s
Fallback: Offer alternative payment methods
DATABASE CONNECTIONS
Scenario: Primary DB under heavy load
Requirement: Don't exhaust connection pool
Configuration: Threshold=5, Timeout=30s
Fallback: Read from replica, queue writes
MICROSERVICE CALLS
Scenario: Calling inventory service during checkout
Requirement: Checkout works even if inventory is slow
Configuration: Threshold=5, Timeout=30s
Fallback: Use cached inventory, verify at shipment
EXPENSIVE OPERATIONS
Scenario: ML model inference service
Requirement: Don't block on slow predictions
Configuration: Threshold=3, Timeout=10s
Fallback: Use simpler heuristic, default recommendation
✕ When NOT to Use
CRITICAL PATH WITH NO FALLBACK
Problem: If payment MUST succeed, a circuit breaker only makes the failure faster; it does not make the call succeed
Alternative: Retry with backoff, queue for later processing
When OK: If you have a meaningful fallback (alternative payment)
SIMPLE INTERNAL CALLS
Problem: Overhead not worth it for simple, reliable calls
Alternative: Just handle errors normally
When OK: For unreliable or slow internal services
FIRE-AND-FORGET CALLS
Problem: Async calls don't block the caller, so there are no waiting threads for a breaker to protect
Alternative: Dead letter queues, retry queues
When OK: If you need to track failure rates for alerting
Interview Application
Common Interview Question
Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”
Strong Answer:
“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:
Why Circuit Breakers:
- Prevent cascade failures: One slow service shouldn’t take down the entire system
- Fail fast: Return errors in milliseconds instead of waiting for timeouts
- Allow recovery: Give failing services breathing room to recover
- Enable fallbacks: Return cached data or degraded responses
Implementation:
- Use a library like Resilience4j (Java) or Polly (.NET)
- Configure per-dependency: payment service might have stricter thresholds than recommendations
Configuration for critical path (e.g., inventory check):
- Failure threshold: 5 failures in 60 seconds
- Open timeout: 30 seconds
- Half-open: Allow 1 test request
Fallback strategy:
- Inventory: Return cached stock levels, verify at shipment
- Payment: Offer alternative payment methods
- Recommendations: Hide the section entirely
Monitoring:
- Track circuit breaker state in metrics (Prometheus/Grafana)
- Alert when breaker trips (indicates downstream problem)
- Dashboard showing breaker states across all services
Combined with other patterns:
- Retry with exponential backoff for transient failures
- Bulkheads to isolate thread pools per dependency
- Timeouts to bound how long we wait”
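The "retry with exponential backoff" item in this answer can be sketched as follows. This is an illustrative helper under stated assumptions, not a library API: the function name, the default delays, and the use of full jitter are choices made for the example. The `sleep` parameter is injectable so the behavior can be checked without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

Retries suit transient failures only. When a dependency is persistently failing, repeated retries add load, which is the situation the circuit breaker is meant to cut off.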
Follow-up: How do you decide on circuit breaker thresholds?
“I’d start with conservative defaults and tune based on data:
Start with:
- Failure threshold: 5 consecutive or 50% in 10 requests
- Timeout: 30 seconds
- Half-open test count: 1
Tune based on:
- Normal error rate: If the service has a 1% baseline error rate, set the threshold well above that so routine errors don't trip the breaker
- Recovery time: How long does the service typically take to recover?
- Business impact: Critical services might need faster tripping
Monitor and adjust:
- If breaker trips too often on transient errors → raise threshold
- If cascade failures still occur → lower threshold
- If service recovers but breaker stays open → shorten timeout”
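The starting rule from this answer, "5 consecutive or 50% in 10 requests", can be sketched as a small decision object. The class name `TripRule` and the choice to evaluate the rate only once 10 outcomes have been seen are assumptions made for this example.

```python
from collections import deque


class TripRule:
    """Trips on N consecutive failures, or on a failure rate over the last M calls."""

    def __init__(self, consecutive_limit: int = 5, window: int = 10,
                 rate_limit: float = 0.5):
        self.consecutive_limit = consecutive_limit
        self.rate_limit = rate_limit
        self._consecutive = 0
        self._recent = deque(maxlen=window)  # True means the call failed

    def record(self, failed: bool) -> bool:
        """Record one call outcome; return True if the breaker should trip."""
        self._recent.append(failed)
        self._consecutive = self._consecutive + 1 if failed else 0
        window_full = len(self._recent) == self._recent.maxlen
        rate = sum(self._recent) / len(self._recent)
        return (self._consecutive >= self.consecutive_limit
                or (window_full and rate >= self.rate_limit))
```

Five failures in a row trip the rule immediately, while an alternating success/failure pattern trips it only once ten outcomes are on record and half of them are failures.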
Code Example
Circuit Breaker with Resilience4j (Java)
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;
import java.util.function.Supplier;
public class PaymentService {
    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;
        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN when at least 50% of recorded calls fail
            .failureRateThreshold(50)
            // Evaluate the failure rate only after at least 5 calls
            .minimumNumberOfCalls(5)
            // Stay OPEN for 30 seconds before testing
    // ... omitted: keep concept snippets short

    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}
Simple Python Implementation
import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps
T = TypeVar('T')
class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1

# ... omitted: keep concept snippets short

    try:
        result = unreliable_service()
        print(f"Call {i+1}: {result}")
    except CircuitOpenError as e:
        print(f"Call {i+1}: BLOCKED - {e}")
    except Exception as e:
        print(f"Call {i+1}: FAILED - {e}")
    print(f" Status: {cb.status}")
    time.sleep(1)
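Because the listing above is abridged, here is a separate, self-contained sketch of the three-state logic. It is not the omitted code: it counts consecutive failures, is not thread-safe, and does not enforce a single in-flight test call under concurrency. The `clock` parameter is an addition for this example so the 30-second timeout can be simulated without sleeping.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = CircuitState.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.state is CircuitState.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = CircuitState.HALF_OPEN  # timeout expired: allow a test call
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        # Any success resets the count and closes the breaker.
        self._failures = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed test call reopens at once; otherwise trip at the threshold.
        if (self.state is CircuitState.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()


# Simulated clock: a one-element list that the example advances by hand.
now = [0.0]
cb = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0, clock=lambda: now[0])


def failing() -> str:
    raise ConnectionError("downstream unavailable")


for _ in range(5):
    try:
        cb.call(failing)
    except ConnectionError:
        pass

print(cb.state)  # the fifth failure trips the breaker
```

Advancing `now[0]` past 30 and making a successful call moves the breaker through HALF_OPEN back to CLOSED, matching the timeline earlier in the article.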
Related Content
See It In Action:
- Circuit Breaker Explainer - Visual walkthrough of state transitions
Related Concepts:
- Failover - What happens after failure detection
- Health Checks - Proactive health verification
- Rate Limiting - Another traffic control pattern
Quick Self-Check
- Can explain circuit breaker in 60 seconds?
- Understand the three states and transitions?
- Know what “fail fast” means and why it matters?
- Can configure thresholds and explain trade-offs?
- Understand relationship with retries and timeouts?
- Can implement a fallback strategy?
Production signal