
Circuit Breaker

A resilience pattern that prevents cascade failures by failing fast when a downstream service is unhealthy

TL;DR

A circuit breaker wraps calls to external services and tracks failures. When failures exceed a threshold, the breaker “trips open”—subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes. This prevents cascade failures and protects system resources.

Visual Overview

Circuit Breaker State Machine

Core Explanation

What is a Circuit Breaker?

Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.

Software circuit breakers work the same way:

  • Overload = too many failures calling a downstream service
  • Trip = stop calling that service
  • Reset = test if service recovered, then resume

The Problem It Solves

Cascade Failure Without Circuit Breaker
With Circuit Breaker: Fail Fast

Timeline Example

Circuit Breaker Timeline

Configuration Parameters

  • Failure Threshold — failures before tripping. Typical: 5-10. Trade-off: low = sensitive, high = slow to protect.
  • Timeout — how long to stay open. Typical: 30-60 seconds. Trade-off: short = fast recovery, long = gentler on a recovering service.
  • Success Threshold — successes in half-open before closing. Typical: 1-3. Trade-off: low = fast recovery, high = more confidence.
  • Window Size — time window for counting failures. Typical: 60 seconds. Trade-off: rolling vs. consecutive failure counting.
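The "rolling vs. consecutive" distinction in the last row can be sketched with a count-based sliding window (the class and parameter names below are illustrative, not from any particular library):

```python
from collections import deque

class SlidingWindowTracker:
    """Tracks the outcome of the last `window_size` calls and reports the
    failure rate over that window (count-based rolling window)."""

    def __init__(self, window_size: int = 10):
        self.window = deque(maxlen=window_size)  # True = failed call

    def record(self, failed: bool) -> None:
        self.window.append(failed)

    def failure_rate(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_trip(self, rate_threshold: float = 0.5, min_calls: int = 5) -> bool:
        # Require a minimum sample size so one failure out of one call
        # doesn't trip the breaker immediately.
        return len(self.window) >= min_calls and self.failure_rate() >= rate_threshold
```

Unlike a consecutive-failure counter, a success inside the window does not reset the count, so an intermittently failing service can still trip the breaker.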

Real Systems Using Circuit Breakers

  • Hystrix (Netflix) — Java. Bulkheads, fallbacks, metrics. Legacy but well-documented.
  • Resilience4j — Java. Modern, lightweight, Spring integration. Recommended for new Java projects.
  • Polly — .NET. Policies, retry + circuit breaker. C# applications.
  • opossum — Node.js. Simple, Prometheus metrics. JavaScript/TypeScript services.
  • gobreaker — Go. Simple, concurrent-safe. Go microservices.
  • Istio — service mesh. Sidecar-based, no code changes. Kubernetes environments.

Case Study: E-Commerce Checkout

Circuit Breakers in E-Commerce

When to Use Circuit Breakers

✓ Perfect Use Cases

Circuit Breaker Use Cases

✕ When NOT to Use

When Circuit Breakers Don't Fit

Interview Application

Common Interview Question

Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”

Strong Answer:

“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:

Why Circuit Breakers:

  1. Prevent cascade failures: One slow service shouldn’t take down the entire system
  2. Fail fast: Return errors in milliseconds instead of waiting for timeouts
  3. Allow recovery: Give failing services breathing room to recover
  4. Enable fallbacks: Return cached data or degraded responses

Implementation:

  • Use a library like Resilience4j (Java) or Polly (.NET)
  • Configure per-dependency: payment service might have stricter thresholds than recommendations

Configuration for critical path (e.g., inventory check):

  • Failure threshold: 5 failures in 60 seconds
  • Open timeout: 30 seconds
  • Half-open: Allow 1 test request

Fallback strategy:

  • Inventory: Return cached stock levels, verify at shipment
  • Payment: Offer alternative payment methods
  • Recommendations: Hide the section entirely
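As a sketch, the inventory fallback above might look like this (the service call, cache, and field names are hypothetical):

```python
# Hypothetical live-inventory call; in a real system this would be a network request.
def fetch_live_inventory(sku: str) -> int:
    raise TimeoutError("inventory service unavailable")  # simulate an outage

# Stale-but-usable cached stock levels, refreshed periodically in a real system.
CACHED_STOCK = {"sku-123": 7}

def check_inventory(sku: str) -> dict:
    """Prefer the live service; on failure, serve cached stock and tag the
    response so the order can be re-verified at shipment time."""
    try:
        return {"sku": sku, "stock": fetch_live_inventory(sku), "source": "live"}
    except (TimeoutError, ConnectionError):
        return {"sku": sku, "stock": CACHED_STOCK.get(sku, 0), "source": "cache"}
```

Tagging the response with its source lets downstream code decide whether a re-check is needed before shipping.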

Monitoring:

  • Track circuit breaker state in metrics (Prometheus/Grafana)
  • Alert when breaker trips (indicates downstream problem)
  • Dashboard showing breaker states across all services

Combined with other patterns:

  • Retry with exponential backoff for transient failures
  • Bulkheads to isolate thread pools per dependency
  • Timeouts to bound how long we wait”
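A minimal sketch of how retry-with-backoff composes with a circuit breaker (illustrative names; `breaker_is_open` stands in for whatever state check your library exposes). The key point: an open breaker short-circuits the retry loop instead of burning attempts against a dead service.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class CircuitOpenError(Exception):
    pass

def call_with_retry(func: Callable[[], T],
                    breaker_is_open: Callable[[], bool],
                    max_attempts: int = 3,
                    base_delay: float = 0.1) -> T:
    """Retry with exponential backoff, failing fast if the breaker is open."""
    for attempt in range(max_attempts):
        if breaker_is_open():
            # Don't retry against an open circuit -- fail fast instead.
            raise CircuitOpenError("circuit open; skipping retries")
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the real error
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")
```

In practice the breaker also records each failure; libraries like Resilience4j provide decorators so the two patterns compose without hand-rolled glue.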

Follow-up: How do you decide on circuit breaker thresholds?

“I’d start with conservative defaults and tune based on data:

Start with:

  • Failure threshold: 5 consecutive or 50% in 10 requests
  • Timeout: 30 seconds
  • Half-open test count: 1

Tune based on:

  • Normal error rate: If service has 1% baseline errors, threshold should be higher
  • Recovery time: How long does the service typically take to recover?
  • Business impact: Critical services might need faster tripping

Monitor and adjust:

  • If breaker trips too often on transient errors → raise threshold
  • If cascade failures still occur → lower threshold
  • If service recovers but breaker stays open → shorten timeout”
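One way to sanity-check a threshold against the normal error rate is to estimate how often a *healthy* service would trip the breaker by chance. This back-of-the-envelope sketch assumes failures are independent:

```python
from math import comb

def false_trip_probability(baseline_error_rate: float,
                           window: int, threshold: int) -> float:
    """P(at least `threshold` failures among `window` calls) for a healthy
    service failing at its normal baseline rate. A high value means the
    breaker would trip on ordinary noise."""
    p = baseline_error_rate
    return sum(comb(window, k) * p**k * (1 - p)**(window - k)
               for k in range(threshold, window + 1))
```

With a 1% baseline, 5 failures in a 10-call window is vanishingly unlikely for a healthy service, so tripping there is a strong signal; at a 50% baseline the same rule trips on most windows, so the threshold would need raising.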

Code Example

Circuit Breaker with Resilience4j (Java)

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;

        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN when at least 50% of recent calls fail
            .failureRateThreshold(50)   // failure rate threshold, as a percentage
            .minimumNumberOfCalls(5)    // need at least 5 calls before evaluating

            // Stay OPEN for 30 seconds before testing
            .waitDurationInOpenState(Duration.ofSeconds(30))

            // In HALF-OPEN, allow 3 test calls
            .permittedNumberOfCallsInHalfOpenState(3)

            // Sliding window for failure rate calculation
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)

            // What counts as a failure
            .recordExceptions(PaymentException.class, TimeoutException.class)
            .ignoreExceptions(InvalidCardException.class)  // Don't count client errors

            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event handlers for monitoring
        circuitBreaker.getEventPublisher()
            .onStateTransition(event -> {
                System.out.println("Circuit breaker state: " +
                    event.getStateTransition().getFromState() + " -> " +
                    event.getStateTransition().getToState());
                // Send to metrics system (Prometheus, DataDog, etc.)
            })
            .onCallNotPermitted(event -> {
                System.out.println("Call blocked by circuit breaker");
            });
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Wrap the call with circuit breaker
        Supplier<PaymentResult> paymentCall = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> {
                // This is the actual call that might fail
                return paymentGateway.charge(request);
            });

        try {
            return paymentCall.get();
        } catch (CallNotPermittedException e) {
            // Circuit breaker is OPEN - fail fast with fallback
            return handleCircuitOpen(request);
        } catch (PaymentException e) {
            // Payment failed (circuit breaker recorded this)
            throw e;
        }
    }

    private PaymentResult handleCircuitOpen(PaymentRequest request) {
        // Fallback options when payment service is unavailable:

        // Option 1: Offer alternative payment
        // return new PaymentResult(PaymentStatus.DEFERRED,
        //     "Payment service unavailable. Try PayPal?");

        // Option 2: Queue for later processing
        // paymentQueue.enqueue(request);
        // return new PaymentResult(PaymentStatus.QUEUED,
        //     "Payment will be processed shortly");

        // Option 3: Return error with helpful message
        return new PaymentResult(PaymentStatus.SERVICE_UNAVAILABLE,
            "Payment processing temporarily unavailable. Please try again in a few minutes.");
    }

    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}

Simple Python Implementation

import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


class CircuitBreaker:
    """
    Simple circuit breaker implementation.

    Usage:
        cb = CircuitBreaker("payment-service")

        @cb
        def call_payment_service():
            return requests.post(...)

        result = call_payment_service()  # Raises CircuitOpenError if open
    """

    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.half_open_calls = 0

    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        """Execute function with circuit breaker protection."""

        # Check if we should transition from OPEN to HALF_OPEN
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                print(f"[{self.name}] Transitioning to HALF_OPEN")
            else:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is OPEN. "
                    f"Retry after {self._time_until_retry():.1f}s"
                )

        # Check half-open call limit
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.config.half_open_max_calls:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is HALF_OPEN and at capacity"
                )
            self.half_open_calls += 1

        # Execute the call
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()  # record the failure, then re-raise
            raise

    def _on_success(self):
        """Handle successful call."""
        if self.state == CircuitState.HALF_OPEN:
            # Successful test call - close the circuit
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            print(f"[{self.name}] SUCCESS in HALF_OPEN → CLOSED")
        elif self.state == CircuitState.CLOSED:
            # Reset failure count on success (for consecutive failure mode)
            self.failure_count = 0

    def _on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.state == CircuitState.HALF_OPEN:
            # Failed test call - back to open
            self.state = CircuitState.OPEN
            print(f"[{self.name}] FAILURE in HALF_OPEN → OPEN")
        elif self.state == CircuitState.CLOSED:
            if self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN
                print(f"[{self.name}] Threshold reached → OPEN")

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to try recovery."""
        if self.last_failure_time is None:
            return True
        elapsed = time.time() - self.last_failure_time
        return elapsed >= self.config.recovery_timeout

    def _time_until_retry(self) -> float:
        """Calculate seconds until circuit breaker will try half-open."""
        if self.last_failure_time is None:
            return 0
        elapsed = time.time() - self.last_failure_time
        return max(0, self.config.recovery_timeout - elapsed)

    @property
    def status(self) -> dict:
        return {
            "name": self.name,
            "state": self.state.value,
            "failure_count": self.failure_count,
            "time_until_retry": self._time_until_retry() if self.state == CircuitState.OPEN else None
        }


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open and call is blocked."""
    pass


# Usage example
if __name__ == "__main__":
    import random

    cb = CircuitBreaker("test-service", CircuitBreakerConfig(
        failure_threshold=3,
        recovery_timeout=5.0
    ))

    @cb
    def unreliable_service():
        if random.random() < 0.7:  # 70% failure rate
            raise Exception("Service unavailable")
        return "Success!"

    for i in range(20):
        try:
            result = unreliable_service()
            print(f"Call {i+1}: {result}")
        except CircuitOpenError as e:
            print(f"Call {i+1}: BLOCKED - {e}")
        except Exception as e:
            print(f"Call {i+1}: FAILED - {e}")

        print(f"  Status: {cb.status}")
        time.sleep(1)

Quick Self-Check

  • Can explain circuit breaker in 60 seconds?
  • Understand the three states and transitions?
  • Know what “fail fast” means and why it matters?
  • Can configure thresholds and explain trade-offs?
  • Understand relationship with retries and timeouts?
  • Can implement a fallback strategy?

Interview Notes

  • Must-know: appears in roughly 70% of microservices interviews
  • Production impact: used across all microservice architectures
  • Performance: prevents cascade failures by failing fast
  • Scalability: protects system resources