Circuit Breaker
How to prevent cascade failures in microservices — the circuit breaker pattern for resilient service communication.
One Slow Service Kills Everything
When a downstream service becomes slow or unresponsive, callers keep sending requests. Threads block waiting for timeouts. Connection pools exhaust. Soon the caller itself becomes unresponsive, and the failure cascades upstream.
One broken service can take down your entire system.
- Slow responses are worse than failures: they tie up threads and connections until the timeout fires
- Thread pool exhaustion
- Connection pool exhaustion
- Cascade failures through the call graph
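A rough way to see why slow responses exhaust a thread pool is Little's law: the average number of busy threads equals the request rate multiplied by the time each request holds a thread. The sketch below applies it to made-up numbers; the function name, the request rate, and both latencies are assumptions chosen for illustration, not measurements.

```python
def threads_in_use(requests_per_second: int, latency_ms: int) -> float:
    """Average number of threads busy at once (Little's law: L = rate * time)."""
    return requests_per_second * latency_ms / 1000


# Hypothetical load: 100 requests per second against one downstream service.
healthy = threads_in_use(100, 50)       # downstream answers in 50 ms
degraded = threads_in_use(100, 30_000)  # every call waits out a 30 s timeout

print(healthy)   # 5.0
print(degraded)  # 3000.0
```

Under these assumed numbers the same traffic needs 600 times as many threads once every call runs into the timeout, which is far more than a typical pool provides.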
Fail Fast, Protect Resources
A circuit breaker wraps calls to external services. When failures exceed a threshold, the breaker "trips open" — subsequent calls fail immediately without attempting the network call. This protects your resources and gives the downstream service time to recover.
The name comes from the electrical circuit breaker, which cuts the current before an overloaded wire can start a fire.
- Wrapper around external calls
- Tracks success/failure metrics
- Trips open when failures exceed threshold
- Fails fast: immediate rejection, no network call
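The four points above can be sketched as a small wrapper class. This is a deliberately minimal illustration, not a production implementation: `SimpleBreaker`, `CircuitOpenError`, and `flaky_call` are hypothetical names, the sketch treats "reaching the threshold" as the trip condition, and it has no recovery path, so once open it stays open.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class SimpleBreaker:
    """Hypothetical minimal breaker: counts outcomes and trips open."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.successes = 0
        self.is_open = False

    def call(self, func, *args, **kwargs):
        if self.is_open:
            # Fail fast: func (the network call) is never invoked.
            raise CircuitOpenError("circuit open; call not attempted")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.is_open = True
            raise
        self.successes += 1
        return result


attempts = []  # records every real attempt against the downstream service


def flaky_call():
    """Stand-in for a downstream call that always fails."""
    attempts.append(1)
    raise ConnectionError("downstream unavailable")


breaker = SimpleBreaker(failure_threshold=3)
for _ in range(5):
    try:
        breaker.call(flaky_call)
    except (ConnectionError, CircuitOpenError):
        pass

print(len(attempts))  # 3: the last two calls were rejected without running flaky_call
```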
Closed → Open → Half-Open → ...
The circuit breaker has three states:
- CLOSED: Normal operation. Requests pass through. Failures are counted.
- OPEN: Breaker has tripped. All requests fail immediately without calling downstream.
- HALF-OPEN: After a timeout, allow one test request. If it succeeds, close the breaker. If it fails, return to OPEN and restart the timeout.
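The three states and the moves between them can be written down as a transition table. The event names below (`THRESHOLD_REACHED`, `TIMEOUT_ELAPSED`, `TEST_SUCCEEDED`, `TEST_FAILED`) are labels invented for this sketch; any pair not listed leaves the state unchanged.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Event(Enum):
    THRESHOLD_REACHED = "threshold reached"
    TIMEOUT_ELAPSED = "timeout elapsed"
    TEST_SUCCEEDED = "test succeeded"
    TEST_FAILED = "test failed"


# Every transition described in the three state definitions.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TEST_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TEST_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the new state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
for event in (Event.THRESHOLD_REACHED, Event.TIMEOUT_ELAPSED, Event.TEST_SUCCEEDED):
    state = next_state(state, event)
print(state)  # State.CLOSED
```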
Watching the Breaker Trip
Imagine calling a payment service. Five requests fail in a row (threshold=5). The breaker trips open. For the next 30 seconds, all payment calls return an error immediately — no network attempt.
After 30 seconds, one test request goes through. If the payment service is back, the breaker closes and normal operation resumes.
- Failure counter increments on each failure
- Threshold reached → trip to OPEN
- OPEN state: fail immediately (protect resources)
- Automatic recovery via HALF-OPEN test
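The payment scenario can be replayed in code with a fake clock, so the 30 seconds do not have to pass in real time. This is a single-threaded sketch under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, `charge`, and the `fake` dictionary are hypothetical, and a real breaker would also need locking so that only one test request runs in HALF-OPEN.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without a network attempt."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the example can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; no network attempt")
            self.state = "HALF_OPEN"  # timeout elapsed: let one test call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed test call, or reaching the threshold, (re)opens the breaker.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result


fake = {"now": 0.0, "healthy": False}  # simulated time and service health


def charge():
    """Stand-in for the payment service call."""
    if not fake["healthy"]:
        raise ConnectionError("payment service down")
    return "charged"


breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0,
                         clock=lambda: fake["now"])

for _ in range(5):  # five failures in a row
    try:
        breaker.call(charge)
    except ConnectionError:
        pass
print(breaker.state)  # OPEN

fake["now"] = 10.0  # still inside the 30-second window
try:
    breaker.call(charge)
except CircuitOpenError:
    print("rejected without calling charge()")

fake["now"] = 31.0  # window has passed and the service has recovered
fake["healthy"] = True
print(breaker.call(charge))  # charged
print(breaker.state)  # CLOSED
```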
Tuning the Breaker
Circuit breaker parameters need tuning:
- Failure threshold: Too low = false trips on transient errors. Too high = slow to protect.
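- Timeout (how long the breaker stays OPEN before the HALF-OPEN test): Too short = hammering a recovering service. Too long = slow recovery.
- Window: Count failures over a rolling time window, or count only consecutive failures. A rolling window catches intermittent failures; a consecutive count resets on every success.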
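The difference between the two counting strategies shows up when failures are interleaved with occasional successes. The sketch below compares them on one made-up sequence of outcomes; both class names, the threshold of 3, and the 10-second window are assumptions chosen for illustration.

```python
from collections import deque


class ConsecutiveCounter:
    """Trips after `threshold` failures in a row; any success resets the streak."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.streak = 0

    def record(self, success, now):
        self.streak = 0 if success else self.streak + 1

    def should_trip(self, now):
        return self.streak >= self.threshold


class RollingWindowCounter:
    """Trips when `threshold` failures fall within the last `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.failure_times = deque()

    def record(self, success, now):
        if not success:
            self.failure_times.append(now)

    def should_trip(self, now):
        # Drop failures that have aged out of the window.
        while self.failure_times and now - self.failure_times[0] > self.window_seconds:
            self.failure_times.popleft()
        return len(self.failure_times) >= self.threshold


# One outcome per second: two failures, a success, two failures, a success, a failure.
outcomes = [False, False, True, False, False, True, False]

consecutive = ConsecutiveCounter(threshold=3)
rolling = RollingWindowCounter(threshold=3, window_seconds=10.0)
for second, ok in enumerate(outcomes):
    consecutive.record(ok, now=float(second))
    rolling.record(ok, now=float(second))

print(consecutive.should_trip(now=6.0))  # False: the streak never reached 3
print(rolling.should_trip(now=6.0))      # True: 5 failures within 10 seconds
```

On this sequence the service fails five times out of seven, yet the consecutive counter never trips because each success resets it. Which behaviour is preferable depends on the traffic pattern and on how the downstream service tends to fail.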
Monitor your breakers — an open breaker is a signal something is wrong.