A resilience pattern that prevents cascade failures by failing fast when a downstream service is unhealthy
TL;DR
A circuit breaker wraps calls to external services and tracks failures. When failures exceed a threshold, the breaker “trips open”—subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes. This prevents cascade failures and protects system resources.
Visual Overview
Core Explanation
What is a Circuit Breaker?
Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.
Software circuit breakers work the same way:
- Overload = too many failures calling a downstream service
- Trip = stop calling that service
- Reset = test if service recovered, then resume
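Mapped onto software, these transitions form a small state machine. Here is a minimal sketch in Python (the state and event names are illustrative, chosen to match the implementation further down this page):

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # Normal operation: calls pass through
    OPEN = "open"            # Tripped: calls fail fast
    HALF_OPEN = "half_open"  # Testing: a probe call is allowed

# Transitions keyed by (current state, event)
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}

def next_state(state: State, event: str) -> State:
    """Return the new state, or stay put if the event doesn't apply."""
    return TRANSITIONS.get((state, event), state)
```

Note there is no direct CLOSED → HALF_OPEN transition: the breaker only probes after it has tripped and the open timeout has elapsed.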
The Problem It Solves
Timeline Example
Configuration Parameters
| Parameter | Description | Typical Value | Trade-off |
|---|---|---|---|
| Failure Threshold | Failures before tripping | 5-10 | Low = sensitive, High = slow to protect |
| Timeout | How long to stay open | 30-60 seconds | Short = fast recovery, Long = gentle on recovering service |
| Success Threshold | Successes in half-open before closing | 1-3 | Low = fast recovery, High = more confidence |
| Window Size | Time window for counting failures | 60 seconds | Rolling vs consecutive failures |
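The "Window Size" trade-off (rolling window vs consecutive failures) can be made concrete with a small sketch. This counts failures inside a rolling time window using a deque; the class and method names are illustrative, not from any particular library:

```python
import time
from collections import deque
from typing import Optional

class RollingFailureWindow:
    """Count failures within a rolling time window (e.g. 60 seconds)."""

    def __init__(self, window_seconds: float = 60.0, threshold: int = 5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.failure_times: deque = deque()

    def record_failure(self, now: Optional[float] = None) -> None:
        self.failure_times.append(time.time() if now is None else now)

    def should_trip(self, now: Optional[float] = None) -> bool:
        """True if failures within the window have reached the threshold."""
        now = time.time() if now is None else now
        # Evict failures that have aged out of the window
        while self.failure_times and now - self.failure_times[0] > self.window_seconds:
            self.failure_times.popleft()
        return len(self.failure_times) >= self.threshold
```

A consecutive-failure counter is simpler (reset on any success) but trips on short bursts; a rolling window tolerates occasional failures spread over time.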
Real Systems Using Circuit Breakers
| Library/System | Language | Features | Use Case |
|---|---|---|---|
| Hystrix (Netflix) | Java | Bulkheads, fallbacks, metrics | Legacy but well-documented |
| Resilience4j | Java | Modern, lightweight, Spring integration | Recommended for new Java projects |
| Polly | .NET | Policies, retry + circuit breaker | C# applications |
| opossum | Node.js | Simple, Prometheus metrics | JavaScript/TypeScript services |
| gobreaker | Go | Simple, concurrent-safe | Go microservices |
| Istio | Service mesh | Sidecar-based, no code changes | Kubernetes environments |
Case Study: E-Commerce Checkout
When to Use Circuit Breakers
✓ Perfect Use Cases
✕ When NOT to Use
Interview Application
Common Interview Question
Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”
Strong Answer:
“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:
Why Circuit Breakers:
- Prevent cascade failures: One slow service shouldn’t take down the entire system
- Fail fast: Return errors in milliseconds instead of waiting for timeouts
- Allow recovery: Give failing services breathing room to recover
- Enable fallbacks: Return cached data or degraded responses
Implementation:
- Use a library like Resilience4j (Java) or Polly (.NET)
- Configure per-dependency: payment service might have stricter thresholds than recommendations
Configuration for critical path (e.g., inventory check):
- Failure threshold: 5 failures in 60 seconds
- Open timeout: 30 seconds
- Half-open: Allow 1 test request
Fallback strategy:
- Inventory: Return cached stock levels, verify at shipment
- Payment: Offer alternative payment methods
- Recommendations: Hide the section entirely
Monitoring:
- Track circuit breaker state in metrics (Prometheus/Grafana)
- Alert when breaker trips (indicates downstream problem)
- Dashboard showing breaker states across all services
Combined with other patterns:
- Retry with exponential backoff for transient failures
- Bulkheads to isolate thread pools per dependency
- Timeouts to bound how long we wait”
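Combining retry-with-backoff and a circuit breaker deserves care: retries should stop immediately once the breaker is open, otherwise they defeat the fail-fast behavior. A minimal Python sketch (the decorator and its parameters are illustrative, not a real library API):

```python
import random
import time
from functools import wraps

def retry_with_backoff(max_attempts: int = 3, base_delay: float = 0.1,
                       give_up_on: tuple = ()):
    """Retry transient failures with exponential backoff and jitter.

    `give_up_on` lists exception types that should NOT be retried --
    e.g. the error a circuit breaker raises when it is open, where
    retrying would only hammer a service already known to be down.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except give_up_on:
                    raise  # Breaker is open: fail fast, no retries
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    # Exponential backoff with jitter: ~0.1s, ~0.2s, ~0.4s, ...
                    delay = base_delay * (2 ** attempt) * (0.5 + random.random())
                    time.sleep(delay)
        return wrapper
    return decorator
```

Typical layering: the retry decorator wraps the breaker-protected call, so a tripped breaker short-circuits the whole retry loop.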
Follow-up: How do you decide on circuit breaker thresholds?
“I’d start with conservative defaults and tune based on data:
Start with:
- Failure threshold: 5 consecutive or 50% in 10 requests
- Timeout: 30 seconds
- Half-open test count: 1
Tune based on:
- Normal error rate: If service has 1% baseline errors, threshold should be higher
- Recovery time: How long does the service typically take to recover?
- Business impact: Critical services might need faster tripping
Monitor and adjust:
- If breaker trips too often on transient errors → raise threshold
- If cascade failures still occur → lower threshold
- If service recovers but breaker stays open → shorten timeout”
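One way to make "threshold should be higher than the baseline error rate" concrete. This is an illustrative heuristic, not a standard formula:

```python
def suggested_failure_rate_threshold(baseline_error_rate: float,
                                     margin: float = 5.0,
                                     floor: float = 0.2,
                                     ceiling: float = 0.8) -> float:
    """Pick a failure-rate trip threshold well above normal noise.

    baseline_error_rate: the service's steady-state error rate (e.g. 0.01).
    margin: how many times the baseline counts as "clearly unhealthy".
    The result is clamped so the breaker never trips on normal noise
    (floor) but still trips during a real outage (ceiling).
    """
    return min(ceiling, max(floor, baseline_error_rate * margin))
```

For a service with a 1% baseline error rate this yields the 20% floor; a noisier service with 10% baseline errors gets a 50% threshold, matching the intuition that the trip point must clear normal operating noise.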
Code Example
Circuit Breaker with Resilience4j (Java)
```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;

        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN when 50% of recent calls fail
            .failureRateThreshold(50)
            .minimumNumberOfCalls(5)  // Need at least 5 calls to evaluate
            // Stay OPEN for 30 seconds before testing
            .waitDurationInOpenState(Duration.ofSeconds(30))
            // In HALF-OPEN, allow 3 test calls
            .permittedNumberOfCallsInHalfOpenState(3)
            // Sliding window for failure rate calculation
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)
            // What counts as a failure
            .recordExceptions(PaymentException.class, TimeoutException.class)
            .ignoreExceptions(InvalidCardException.class) // Don't count client errors
            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event handlers for monitoring
        circuitBreaker.getEventPublisher()
            .onStateTransition(event -> {
                System.out.println("Circuit breaker state: " +
                    event.getStateTransition().getFromState() + " -> " +
                    event.getStateTransition().getToState());
                // Send to metrics system (Prometheus, DataDog, etc.)
            })
            .onCallNotPermitted(event ->
                System.out.println("Call blocked by circuit breaker"));
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Wrap the call with circuit breaker
        Supplier<PaymentResult> paymentCall = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> {
                // This is the actual call that might fail
                return paymentGateway.charge(request);
            });

        try {
            return paymentCall.get();
        } catch (CallNotPermittedException e) {
            // Circuit breaker is OPEN - fail fast with fallback
            return handleCircuitOpen(request);
        } catch (PaymentException e) {
            // Payment failed (circuit breaker recorded this)
            throw e;
        }
    }

    private PaymentResult handleCircuitOpen(PaymentRequest request) {
        // Fallback options when payment service is unavailable:

        // Option 1: Offer alternative payment
        // return new PaymentResult(PaymentStatus.DEFERRED,
        //     "Payment service unavailable. Try PayPal?");

        // Option 2: Queue for later processing
        // paymentQueue.enqueue(request);
        // return new PaymentResult(PaymentStatus.QUEUED,
        //     "Payment will be processed shortly");

        // Option 3: Return error with helpful message
        return new PaymentResult(PaymentStatus.SERVICE_UNAVAILABLE,
            "Payment processing temporarily unavailable. Please try again in a few minutes.");
    }

    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}
```
Simple Python Implementation
```python
import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open and call is blocked."""
    pass


class CircuitBreaker:
    """
    Simple circuit breaker implementation.

    Usage:
        cb = CircuitBreaker("payment-service")

        @cb
        def call_payment_service():
            return requests.post(...)

        result = call_payment_service()  # Raises CircuitOpenError if open
    """

    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.half_open_calls = 0

    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        """Execute function with circuit breaker protection."""
        # Check if we should transition from OPEN to HALF_OPEN
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                print(f"[{self.name}] Transitioning to HALF_OPEN")
            else:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is OPEN. "
                    f"Retry after {self._time_until_retry():.1f}s"
                )

        # Check half-open call limit
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.config.half_open_max_calls:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is HALF_OPEN and at capacity"
                )
            self.half_open_calls += 1

        # Execute the call
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        """Handle successful call."""
        if self.state == CircuitState.HALF_OPEN:
            # Successful test call - close the circuit
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            print(f"[{self.name}] SUCCESS in HALF_OPEN → CLOSED")
        elif self.state == CircuitState.CLOSED:
            # Reset failure count on success (for consecutive failure mode)
            self.failure_count = 0

    def _on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.state == CircuitState.HALF_OPEN:
            # Failed test call - back to open
            self.state = CircuitState.OPEN
            print(f"[{self.name}] FAILURE in HALF_OPEN → OPEN")
        elif self.state == CircuitState.CLOSED:
            if self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN
                print(f"[{self.name}] Threshold reached → OPEN")

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to try recovery."""
        if self.last_failure_time is None:
            return True
        elapsed = time.time() - self.last_failure_time
        return elapsed >= self.config.recovery_timeout

    def _time_until_retry(self) -> float:
        """Calculate seconds until circuit breaker will try half-open."""
        if self.last_failure_time is None:
            return 0
        elapsed = time.time() - self.last_failure_time
        return max(0, self.config.recovery_timeout - elapsed)

    @property
    def status(self) -> dict:
        return {
            "name": self.name,
            "state": self.state.value,
            "failure_count": self.failure_count,
            "time_until_retry": (self._time_until_retry()
                                 if self.state == CircuitState.OPEN else None),
        }


# Usage example
if __name__ == "__main__":
    import random

    cb = CircuitBreaker("test-service", CircuitBreakerConfig(
        failure_threshold=3,
        recovery_timeout=5.0,
    ))

    @cb
    def unreliable_service():
        if random.random() < 0.7:  # 70% failure rate
            raise Exception("Service unavailable")
        return "Success!"

    for i in range(20):
        try:
            result = unreliable_service()
            print(f"Call {i+1}: {result}")
        except CircuitOpenError as e:
            print(f"Call {i+1}: BLOCKED - {e}")
        except Exception as e:
            print(f"Call {i+1}: FAILED - {e}")
        print(f"  Status: {cb.status}")
        time.sleep(1)
```
Related Content
See It In Action:
- Circuit Breaker Explainer - Visual walkthrough of state transitions
Related Concepts:
- Failover - What happens after failure detection
- Health Checks - Proactive health verification
- Rate Limiting - Another traffic control pattern
Quick Self-Check
- Can explain circuit breaker in 60 seconds?
- Understand the three states and transitions?
- Know what “fail fast” means and why it matters?
- Can configure thresholds and explain trade-offs?
- Understand relationship with retries and timeouts?
- Can implement a fallback strategy?
Interview Notes
- Comes up in roughly 70% of microservices interviews
- Used across virtually all microservice architectures
- Key points to hit: prevents cascade failures, protects system resources