Circuit Breaker

A resilience pattern that prevents cascade failures by failing fast when a downstream service is unhealthy

TL;DR

A circuit breaker wraps calls to external services and tracks failures. When failures exceed a threshold, the breaker “trips open”—subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes.

Visual Overview

Circuit Breaker State Machine
THREE-STATE MACHINE

                                                    
                                        
  CLOSED (normal)  → OPEN (trip)       when failures > threshold
  OPEN (trip)      → HALF-OPEN (test)  when timeout expires
  HALF-OPEN (test) → CLOSED (normal)   on success
  HALF-OPEN (test) → OPEN (trip)       on failure


STATE BEHAVIORS

  CLOSED: Normal operation
   • Requests pass through to downstream
   • Failures counted
   • Threshold breach → trip to OPEN

  OPEN: Protection mode
   • All requests fail immediately
   • No network calls made (fail fast)
   • Timer running for recovery attempt

  HALF-OPEN: Testing recovery
   • Allow ONE test request through
   • Success → close breaker
   • Failure → reopen breaker
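
The transitions listed above can be written down as a small lookup table. This is a minimal sketch rather than any library's API; the names State, TRANSITIONS, next_state and the event strings are hypothetical and chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_expired"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or stay in place if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Events that have no entry leave the state unchanged, so a success reported while the breaker is OPEN does not close it; only the HALF-OPEN test request can do that.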


Core Explanation

What is a Circuit Breaker?

Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.

Software circuit breakers work the same way:

  • Overload = too many failures calling a downstream service
  • Trip = stop calling that service
  • Reset = test if service recovered, then resume

The Problem It Solves

Cascade Failure Without Circuit Breaker
WITHOUT CIRCUIT BREAKER: CASCADE FAILURE

                                                    
  Service A → Service B → Service C (SLOW)

  1. Service C becomes slow (5s timeouts)

  2. Service B threads block waiting for C
     Thread pool: [████████████] exhausted!

  3. Service B stops responding to A
     Service A threads block waiting for B
     Thread pool: [████████████] exhausted!

  4. Service A fails → User sees error

  One slow service took down the entire chain!


With Circuit Breaker: Fail Fast

                                                    
  Service A → Service B → [CB: OPEN] → Service C (SLOW)

  1. Service C becomes slow

  2. Circuit breaker detects failures, trips OPEN

  3. Service B returns fast failure (no wait!)
     "Service C unavailable" in <1ms

  4. Service B stays healthy
     Thread pool: [██░░░░░░░░░░] plenty free

  5. Service A gets quick error, can show fallback
     User sees degraded experience, not failure
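
The difference between the two scenarios above comes down to how long each call holds a thread. A back-of-the-envelope sketch using Little's law (average concurrent calls = arrival rate x time per call) makes this concrete. The request rate, pool size, and the 50 ms healthy latency below are assumed numbers for illustration, not figures from a real system.

```python
def threads_in_use(requests_per_second: float, seconds_per_call: float) -> float:
    """Little's law: average concurrent calls = arrival rate x time per call."""
    return requests_per_second * seconds_per_call


RATE = 100        # requests/second reaching Service B (assumed)
POOL_SIZE = 200   # worker threads in Service B (assumed)

healthy = threads_in_use(RATE, 0.05)    # Service C answers in 50 ms (assumed)
slow = threads_in_use(RATE, 5.0)        # Service C hangs until the 5 s timeout
tripped = threads_in_use(RATE, 0.001)   # breaker rejects in about 1 ms

for label, needed in [("healthy", healthy), ("slow", slow), ("tripped", tripped)]:
    verdict = "exhausted" if needed > POOL_SIZE else "fits"
    print(f"{label}: ~{needed:g} threads needed, pool of {POOL_SIZE} {verdict}")
```

Under these assumptions the slow case needs about 500 threads against a pool of 200, while the tripped case needs a fraction of one thread, which is why Service B stays responsive.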


Timeline Example

Circuit Breaker Timeline

  Time   State       Event
  0s     CLOSED      normal operation
  5s     CLOSED      failures accumulating
  10s    OPEN        5th failure → TRIP
  40s    HALF-OPEN   timeout expires (30s)
  41s    HALF-OPEN   test request
  42s    CLOSED      success!

  Requests:
  0-10s:   requests sent; failures accumulate [TRIP]
  10-40s:  (instant fail, no network call)
  41s:     (test request succeeds)
  42s+:    (normal operation resumed)

Configuration Parameters

Parameter          Description                            Typical Value  Trade-off
Failure Threshold  Failures before tripping               5-10           Low = sensitive, High = slow to protect
Timeout            How long to stay open                  30-60 seconds  Short = fast recovery, Long = gentle on recovering service
Success Threshold  Successes in half-open before closing  1-3            Low = fast recovery, High = more confidence
Window Size        Time window for counting failures      60 seconds     Rolling vs consecutive failures
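
The "rolling vs consecutive" trade-off in the last row is easier to see side by side. The sketch below is one possible way to count failures under each policy; the class names are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class RollingFailureWindow:
    """Counts failures observed within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now: float) -> None:
        self._failures.append(now)

    def count(self, now: float) -> int:
        # Drop failures that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures)


class ConsecutiveFailureCounter:
    """Counts failures in a row; any success resets the count."""

    def __init__(self):
        self.count = 0

    def record_failure(self) -> None:
        self.count += 1

    def record_success(self) -> None:
        self.count = 0
```

A rolling window still trips on a service that fails intermittently between successes, whereas a consecutive counter only trips on an unbroken run of failures.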

Real Systems Using Circuit Breakers

Library/System     Language      Features                                 Use Case
Hystrix (Netflix)  Java          Bulkheads, fallbacks, metrics            Legacy but well-documented
Resilience4j       Java          Modern, lightweight, Spring integration  Recommended for new Java projects
Polly              .NET          Policies, retry + circuit breaker        C# applications
opossum            Node.js       Simple, Prometheus metrics               JavaScript/TypeScript services
gobreaker          Go            Simple, concurrent-safe                  Go microservices
Istio              Service mesh  Sidecar-based, no code changes           Kubernetes environments

Case Study: E-Commerce Checkout

Circuit Breakers in E-Commerce
E-COMMERCE CHECKOUT WITH CIRCUIT BREAKERS

                                                    
  Checkout Service

    [CB] → Payment Service
           Fallback: "Pay later" option

    [CB] → Inventory Service
           Fallback: Cached stock levels

    [CB] → Shipping Calculator
           Fallback: Flat rate estimate

    [CB] → Recommendation Service
           Fallback: Hide section

  SCENARIO: Payment service down
   • Payment CB trips OPEN
   • Checkout offers "Pay later" or "PayPal"
   • Other services unaffected
   • Customer can still complete order
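
The fallback idea in this case study can be sketched as a small helper that tries the protected call and falls back when the breaker rejects it or the call fails. Everything here is hypothetical: CircuitOpenError stands in for whatever error a real breaker raises, and the flat-rate value is invented for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises when it rejects a call."""


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary(); if the call is rejected or fails, return fallback()."""
    try:
        return primary()
    except Exception:
        # Covers CircuitOpenError as well as ordinary downstream errors.
        return fallback()


def shipping_quote() -> float:
    # Simulates the Shipping Calculator breaker being OPEN.
    raise CircuitOpenError("shipping calculator unavailable")


FLAT_RATE = 5.99  # hypothetical flat-rate estimate

quote = call_with_fallback(shipping_quote, lambda: FLAT_RATE)
print(f"Shipping estimate: {quote}")
```

Keeping the fallback next to the call site makes it explicit which degraded answer each dependency provides, mirroring the per-service fallbacks listed above.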


When to Use Circuit Breakers

✓ Perfect Use Cases

Circuit Breaker Use Cases
EXTERNAL API CALLS
Scenario: Calling third-party payment processor
Requirement: Don't let payment issues kill entire checkout
Configuration: Threshold=3, Timeout=60s
Fallback: Offer alternative payment methods

DATABASE CONNECTIONS
Scenario: Primary DB under heavy load
Requirement: Don't exhaust connection pool
Configuration: Threshold=5, Timeout=30s
Fallback: Read from replica, queue writes

MICROSERVICE CALLS
Scenario: Calling inventory service during checkout
Requirement: Checkout works even if inventory is slow
Configuration: Threshold=5, Timeout=30s
Fallback: Use cached inventory, verify at shipment

EXPENSIVE OPERATIONS
Scenario: ML model inference service
Requirement: Don't block on slow predictions
Configuration: Threshold=3, Timeout=10s
Fallback: Use simpler heuristic, default recommendation

✕ When NOT to Use

When Circuit Breakers Don't Fit
CRITICAL PATH WITH NO FALLBACK
Problem: If payment MUST succeed, a circuit breaker only fails the request sooner; it cannot make the payment go through
Alternative: Retry with backoff, queue for later processing
When OK: If you have a meaningful fallback (alternative payment)

SIMPLE INTERNAL CALLS
Problem: Overhead not worth it for simple, reliable calls
Alternative: Just handle errors normally
When OK: For unreliable or slow internal services

FIRE-AND-FORGET CALLS
Problem: Async calls don't block the caller, so there is little to gain from failing fast
Alternative: Dead letter queues, retry queues
When OK: If you need to track failure rates for alerting

Interview Application

Common Interview Question

Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”

Strong Answer:

“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:

Why Circuit Breakers:

  1. Prevent cascade failures: One slow service shouldn’t take down the entire system
  2. Fail fast: Return errors in milliseconds instead of waiting for timeouts
  3. Allow recovery: Give failing services breathing room to recover
  4. Enable fallbacks: Return cached data or degraded responses

Implementation:

  • Use a library like Resilience4j (Java) or Polly (.NET)
  • Configure per-dependency: payment service might have stricter thresholds than recommendations

Configuration for critical path (e.g., inventory check):

  • Failure threshold: 5 failures in 60 seconds
  • Open timeout: 30 seconds
  • Half-open: Allow 1 test request

Fallback strategy:

  • Inventory: Return cached stock levels, verify at shipment
  • Payment: Offer alternative payment methods
  • Recommendations: Hide the section entirely

Monitoring:

  • Track circuit breaker state in metrics (Prometheus/Grafana)
  • Alert when breaker trips (indicates downstream problem)
  • Dashboard showing breaker states across all services

Combined with other patterns:

  • Retry with exponential backoff for transient failures
  • Bulkheads to isolate thread pools per dependency
  • Timeouts to bound how long we wait”
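
As a rough illustration of how retries and a breaker interact, the sketch below retries transient errors with exponential backoff but gives up immediately when the breaker is open. It assumes the retried function is already wrapped by a breaker that raises CircuitOpenError (a hypothetical name); the sleep function is injectable so the delays can be observed without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises when it rejects a call."""


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            # The breaker is open: retrying would defeat fail-fast.
            raise
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Re-raising CircuitOpenError without sleeping keeps the retry loop from adding load to a dependency the breaker has already judged unhealthy.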

Follow-up: How do you decide on circuit breaker thresholds?

“I’d start with conservative defaults and tune based on data:

Start with:

  • Failure threshold: 5 consecutive or 50% in 10 requests
  • Timeout: 30 seconds
  • Half-open test count: 1

Tune based on:

  • Normal error rate: If the service has 1% baseline errors, set the threshold well above that baseline so routine errors don't trip the breaker
  • Recovery time: How long does the service typically take to recover?
  • Business impact: Critical services might need faster tripping

Monitor and adjust:

  • If breaker trips too often on transient errors → raise threshold
  • If cascade failures still occur → lower threshold
  • If service recovers but breaker stays open → shorten timeout”
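
One way to reason about the baseline error rate is to estimate how often a healthy service would trip the breaker purely by chance. The sketch below computes that with a binomial tail, under the simplifying assumption that failures are independent; the function name is hypothetical.

```python
from math import comb


def false_trip_probability(window: int, threshold: int, baseline_error_rate: float) -> float:
    """P(at least `threshold` failures in `window` calls) for a healthy service.

    Assumes each call fails independently with the baseline error rate.
    """
    p = baseline_error_rate
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(threshold, window + 1)
    )


for threshold in (1, 2, 5):
    prob = false_trip_probability(window=10, threshold=threshold, baseline_error_rate=0.01)
    print(f"{threshold} failure(s) in 10 calls: {prob:.2e}")
```

With a 1% baseline, tripping on a single failure in 10 requests would fire in roughly 9.6% of windows, while 5 failures in 10 is vanishingly unlikely unless the service is genuinely unhealthy.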

Code Example

Circuit Breaker with Resilience4j (Java)

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;

        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN once at least 50% of recorded calls have failed
            .failureRateThreshold(50)  // 50% failure rate
            .minimumNumberOfCalls(5)    // Need at least 5 calls to evaluate

            // Stay OPEN for 30 seconds before testing
            .waitDurationInOpenState(Duration.ofSeconds(30))
    // ... omitted: keep concept snippets short
    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}

Simple Python Implementation

import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


    # ... omitted: keep concept snippets short
        try:
            result = unreliable_service()
            print(f"Call {i+1}: {result}")
        except CircuitOpenError as e:
            print(f"Call {i+1}: BLOCKED - {e}")
        except Exception as e:
            print(f"Call {i+1}: FAILED - {e}")

        print(f"  Status: {cb.status}")
        time.sleep(1)
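
Because the snippet above omits the breaker class itself, here is one possible minimal implementation that fits the names it uses (CircuitState, CircuitBreakerConfig, CircuitOpenError, cb.status). It is a sketch under assumptions, not the omitted original: it counts consecutive failures, trips on the Nth failure, is not thread-safe, and takes an injectable clock (a hypothetical parameter) so it can be exercised without sleeping.

```python
import time
from dataclasses import dataclass
from enum import Enum
from functools import wraps
from typing import Callable, Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the wrapped function."""


class CircuitBreaker:
    def __init__(self, config: Optional[CircuitBreakerConfig] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.config = config or CircuitBreakerConfig()
        self._clock = clock            # injectable so tests need not sleep
        self.state = CircuitState.CLOSED
        self._failures = 0             # consecutive failures
        self._opened_at = 0.0
        self._half_open_calls = 0

    @property
    def status(self) -> str:
        return f"{self.state.value} (failures={self._failures})"

    def _before_call(self) -> None:
        if self.state is CircuitState.OPEN:
            if self._clock() - self._opened_at < self.config.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            # Recovery timeout elapsed: start probing.
            self.state = CircuitState.HALF_OPEN
            self._half_open_calls = 0
        if self.state is CircuitState.HALF_OPEN:
            if self._half_open_calls >= self.config.half_open_max_calls:
                raise CircuitOpenError("half-open probe limit reached")
            self._half_open_calls += 1

    def _on_success(self) -> None:
        # Any success closes the breaker and clears the failure count.
        self.state = CircuitState.CLOSED
        self._failures = 0

    def _on_failure(self) -> None:
        self._failures += 1
        if (self.state is CircuitState.HALF_OPEN
                or self._failures >= self.config.failure_threshold):
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()

    def __call__(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            self._before_call()
            try:
                result = func(*args, **kwargs)
            except Exception:
                self._on_failure()
                raise
            self._on_success()
            return result
        return wrapper
```

With this sketch, a function is protected by creating cb = CircuitBreaker() and decorating the function with @cb; a production version would also need locking around the state changes.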

Quick Self-Check

  • Can explain circuit breaker in 60 seconds?
  • Understand the three states and transitions?
  • Know what “fail fast” means and why it matters?
  • Can configure thresholds and explain trade-offs?
  • Understand relationship with retries and timeouts?
  • Can implement a fallback strategy?

Production signal

Why this concept matters

  • Interview: 70% of microservices interviews
  • Production: All microservice architectures
  • Performance: Prevents cascade failures
  • Scale: Protects resources