Circuit breaker pattern in system design

A circuit breaker protects your service from repeatedly calling a failing dependency. It fails fast, gives the dependency time to recover, and helps prevent cascading failures.

The Short Answer

A circuit breaker protects your service from repeatedly calling a dependency that is already failing.

Instead of letting every request wait on a slow or broken service, the circuit breaker can temporarily stop calls and fail fast.

The key idea: if a dependency is clearly unhealthy, stop hammering it. Give it time to recover, and protect your own service from being dragged down with it.

The Real Problem

Imagine an Order Service calling a Payment Service. The Payment Service becomes slow or unavailable.

Without Circuit Breaker

Order Service calls Payment Service
Payment Service is slow
Requests wait and time out
Threads pile up in Order Service

With Circuit Breaker

Failures cross threshold
Circuit opens
New calls fail fast
Order Service stays healthier

The circuit breaker does not fix the Payment Service. It protects the caller and gives the failing dependency breathing room.

Why Timeouts and Retries Are Not Enough

Timeouts and retries are useful, but they do not fully solve the problem.

If the dependency is only briefly flaky, retries may help. But if the dependency is consistently failing, retries can make the situation worse by sending even more traffic to something already overloaded.

text
Dependency starts failing
Timeouts happen
Retries increase traffic
Dependency gets more overloaded
More failures
Caller starts failing too

Retries are optimistic: “maybe it will work next time.”

Circuit breakers are defensive: “it is probably still broken, so do not keep calling it right now.”
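To put rough numbers on that retry spiral, here is a small worked calculation. The class and method names are hypothetical, and the figures are illustrative rather than measurements: if every attempt fails, each incoming request turns into one original call plus all of its retries.

```java
// Hypothetical helper for illustration only.
final class RetryAmplification {
    private RetryAmplification() {}

    /**
     * Worst-case calls per second reaching a dependency when every attempt fails:
     * each request makes 1 original call plus maxRetries retries.
     */
    static long worstCaseCallsPerSecond(long requestsPerSecond, int maxRetries) {
        return requestsPerSecond * (1L + maxRetries);
    }
}
```

With 100 requests per second and 3 retries each, `worstCaseCallsPerSecond(100, 3)` gives 400 calls per second aimed at a dependency that was already struggling at 100.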

The Three Circuit Breaker States

A circuit breaker is usually explained using three states:

Closed

Normal state. Requests are allowed through. The breaker watches failures.

Open

Failure threshold reached. Requests fail fast without calling the dependency.

Half-Open

After a wait period, allow a small number of trial requests to see if the dependency recovered.

This gives the system a controlled way to stop traffic, wait, test recovery, and then return to normal.

Mental Model: State Transitions

Closed: calls pass through.

Open: calls fail fast.

Half-Open: trial calls allowed.

Closed → Open: too many failures or timeouts.

Open → Half-Open: wait period expires.

Half-Open → Closed: trial calls succeed.

Half-Open → Open: trial calls fail.
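The four transitions above can be written down as a small lookup function. This is only a sketch: the enum and event names are made up for illustration and do not come from any particular library.

```java
// Hypothetical names for illustration; not taken from any specific library.
enum BreakerState { CLOSED, OPEN, HALF_OPEN }

enum BreakerEvent { FAILURE_THRESHOLD_REACHED, WAIT_EXPIRED, TRIAL_SUCCEEDED, TRIAL_FAILED }

final class BreakerTransitions {
    private BreakerTransitions() {}

    /** Returns the next state, or the current state if the event does not apply to it. */
    static BreakerState next(BreakerState current, BreakerEvent event) {
        switch (current) {
            case CLOSED:
                // Closed -> Open: too many failures or timeouts.
                return event == BreakerEvent.FAILURE_THRESHOLD_REACHED ? BreakerState.OPEN : current;
            case OPEN:
                // Open -> Half-Open: wait period expires.
                return event == BreakerEvent.WAIT_EXPIRED ? BreakerState.HALF_OPEN : current;
            case HALF_OPEN:
                // Half-Open -> Closed on success, back to Open on failure.
                if (event == BreakerEvent.TRIAL_SUCCEEDED) return BreakerState.CLOSED;
                if (event == BreakerEvent.TRIAL_FAILED) return BreakerState.OPEN;
                return current;
            default:
                return current;
        }
    }
}
```

Note that there is no direct path from Open to Closed: the breaker always passes through Half-Open and a successful trial first.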

Simple Pseudocode

A circuit breaker usually wraps a remote call.

text
if circuit is OPEN:
    fail fast

try:
    call dependency
    record success
catch failure:
    record failure
    maybe open circuit

The important part is that in the open state, the service avoids even making the remote call.

Simple Java Example

This is a deliberately simplified example. Real systems usually use a library such as Resilience4j, but the basic idea is easier to understand in plain Java.

java
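import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Simplified, single-threaded sketch: the state is not synchronized,
// so do not share one instance across threads as written.
public class SimpleCircuitBreaker {
    enum State {
        CLOSED,    // calls pass through; failures are counted
        OPEN,      // calls fail fast without reaching the dependency
        HALF_OPEN  // wait period is over; the next call is a trial
    }

    private State state = State.CLOSED;
    private int failureCount = 0;            // consecutive failures since the last success
    private final int failureThreshold = 3;  // failures in a row that open the circuit
    private final Duration openDuration = Duration.ofSeconds(5); // time to stay open
    private Instant openedAt;                // set every time the circuit opens

    public String call(Supplier<String> dependencyCall) {
        if (state == State.OPEN) {
            // Still inside the wait period: reject without calling the dependency.
            if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                throw new RuntimeException("Circuit is open. Failing fast.");
            }

            // Wait period expired: let this call through as a trial.
            state = State.HALF_OPEN;
        }

        try {
            String result = dependencyCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException ex) {
            recordFailure();
            throw ex; // the caller still sees the original failure
        }
    }

    private void recordSuccess() {
        // Any success (including a successful trial) closes the circuit.
        failureCount = 0;
        state = State.CLOSED;
    }

    private void recordFailure() {
        failureCount++;

        // A failed trial reopens immediately; otherwise open at the threshold.
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }
}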

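This example counts consecutive failures, and any success resets the count. After three failures in a row, the breaker opens and calls fail fast for five seconds. The next call after that is let through as a trial: if it succeeds the breaker closes, and if it fails the breaker opens again. The class is not thread-safe as written, and it lets every caller through once the wait period expires rather than a limited number of trial calls.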

What Happens When the Circuit Is Open?

When the circuit is open, your service has a few options:

Fail fast

Return an error immediately instead of waiting on a dependency that is probably failing.

Use fallback

Return cached data, default data, or reduced functionality if that is safe for the product.

Degrade gracefully

Disable non-critical features while preserving the main user flow.

Queue for later

For background work, store the task and retry asynchronously later.
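As one illustration of the "queue for later" option, here is a minimal in-memory sketch. All names are hypothetical, the sender is assumed to be wrapped by a circuit breaker that throws a `RuntimeException` when the call fails or the circuit is open, and a real system would more likely use a durable queue than an in-process one.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical, single-threaded sketch: tasks live only in memory.
final class DeferredWorkQueue {
    private final Queue<String> pending = new ArrayDeque<>();

    /** Tries the task now; if the breaker-wrapped sender throws, keeps the task for later. */
    boolean sendOrDefer(String task, Consumer<String> sender) {
        try {
            sender.accept(task);
            return true;
        } catch (RuntimeException ex) {
            pending.add(task);
            return false;
        }
    }

    /** Retries pending tasks in order and stops at the first failure. Returns how many were sent. */
    int drain(Consumer<String> sender) {
        int sent = 0;
        while (!pending.isEmpty()) {
            try {
                sender.accept(pending.peek());
            } catch (RuntimeException ex) {
                break; // still failing: leave this task and the rest queued
            }
            pending.remove();
            sent++;
        }
        return sent;
    }

    int size() {
        return pending.size();
    }
}
```

A background job could call `drain` on a schedule. Because it stops at the first failure, it does not hammer a dependency that is still down.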

Example: Graceful Degradation

Suppose the recommendation service is failing on an ecommerce site.

text
Product Page
Recommendation Service fails
Circuit breaker opens
Hide recommendations
Still allow product view and checkout

That is much better than failing the entire product page just because recommendations are unavailable.
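A minimal sketch of that product page flow might look like the following. The class names are hypothetical, and `recommendationCall` is assumed to be already wrapped by a circuit breaker, so it throws quickly when the circuit is open.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical page model: an empty recommendations list means "hide the section".
final class ProductPage {
    final String productName;
    final List<String> recommendations;

    ProductPage(String productName, List<String> recommendations) {
        this.productName = productName;
        this.recommendations = recommendations;
    }
}

final class ProductPageService {
    private ProductPageService() {}

    /** Builds the page even when the breaker-wrapped recommendation call fails. */
    static ProductPage render(String productName, Supplier<List<String>> recommendationCall) {
        List<String> recommendations;
        try {
            recommendations = recommendationCall.get();
        } catch (RuntimeException ex) {
            // Recommendations are non-critical: degrade to an empty section.
            recommendations = Collections.emptyList();
        }
        return new ProductPage(productName, recommendations);
    }
}
```

The core product data is never placed behind the recommendation call, so a failure there cannot block product view or checkout.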

Important Configuration Choices

Circuit breakers are powerful, but the configuration matters.

Failure threshold

How many failures or what failure percentage should open the circuit?

Sliding window

Are you measuring failures over the last 10 requests, last 100 requests, or last 30 seconds?

Open duration

How long should the breaker stay open before testing recovery?

Half-open trial count

How many requests should be allowed through before deciding the dependency is healthy again?

Failure types

Should timeouts count? What about 500s, 429s, validation errors, or authentication failures?

Fallback behavior

What should the user see when the dependency is unavailable?
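To make the threshold and sliding-window choices concrete, here is a sketch of a count-based window that tracks the failure percentage over the last N calls. The class is hypothetical, and waiting for a full window before opening is one possible design choice, not the only one.

```java
// Hypothetical count-based sliding window; not thread-safe.
final class SlidingWindowFailureRate {
    private final boolean[] outcomes; // ring buffer, true = failure
    private int next = 0;             // slot the next outcome is written to
    private int recorded = 0;         // outcomes currently in the window
    private int failures = 0;         // failures currently in the window

    SlidingWindowFailureRate(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.outcomes = new boolean[windowSize];
    }

    void record(boolean failed) {
        if (recorded == outcomes.length) {
            // Window is full: the outcome being overwritten leaves the count.
            if (outcomes[next]) failures--;
        } else {
            recorded++;
        }
        outcomes[next] = failed;
        if (failed) failures++;
        next = (next + 1) % outcomes.length;
    }

    /** Failure percentage over the recorded calls; 0 when nothing has been recorded. */
    double failureRatePercent() {
        return recorded == 0 ? 0.0 : 100.0 * failures / recorded;
    }

    /** Opens only once the window is full, to avoid tripping on the first few calls. */
    boolean shouldOpen(double thresholdPercent) {
        return recorded == outcomes.length && failureRatePercent() >= thresholdPercent;
    }
}
```

A percentage over a window tolerates an occasional failure under high traffic, whereas a simple consecutive-failure counter can miss a dependency that fails most calls but occasionally succeeds.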

What Counts as a Failure?

Not every error should trip a circuit breaker.

Usually count

Timeouts, connection failures, dependency 5xx errors, and clear service-unavailable responses.

Usually do not count

Bad user input, validation errors, authentication errors, or authorization errors.

A circuit breaker should protect against dependency health problems, not normal business validation failures.
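One way to encode that split is a small classifier the breaker consults before recording a failure. This is a sketch with hypothetical names; the exact rules depend on the dependency, and borderline cases such as 429 are a judgment call that this version deliberately does not count.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical classifier for illustration.
final class FailureClassifier {
    private FailureClassifier() {}

    /** Counts dependency 5xx responses; ignores 4xx such as 400, 401, 403 (and 429 in this sketch). */
    static boolean countsAsFailure(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }

    /**
     * Counts timeouts and I/O problems. Connection failures such as
     * java.net.ConnectException are subclasses of IOException.
     */
    static boolean countsAsFailure(Throwable error) {
        return error instanceof TimeoutException || error instanceof IOException;
    }
}
```

Anything the classifier rejects, such as a validation error, is passed back to the caller without touching the breaker's failure count.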

Circuit Breakers and Retries Together

Circuit breakers and retries are often used together, but they must be coordinated.

text
Safe pattern:
timeout per call
limited retries with backoff
circuit breaker around dependency
fallback or graceful degradation

If the circuit is open, do not keep retrying the same failing call. That defeats the purpose of the circuit breaker.
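A sketch of that coordination: bounded retries with exponential backoff that stop immediately when the breaker reports it is open. `CircuitOpenException` and `BoundedRetry` are hypothetical names, and `protectedCall` is assumed to already be wrapped by a breaker and a per-call timeout.

```java
import java.util.function.Supplier;

/** Hypothetical exception a breaker throws while it is open. */
final class CircuitOpenException extends RuntimeException {
    CircuitOpenException(String message) {
        super(message);
    }
}

final class BoundedRetry {
    private BoundedRetry() {}

    /** Calls protectedCall up to maxAttempts times, doubling the backoff after each failure. */
    static <T> T callWithRetry(Supplier<T> protectedCall, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return protectedCall.get();
            } catch (CircuitOpenException open) {
                throw open; // breaker is open: retrying would defeat its purpose
            } catch (RuntimeException ex) {
                last = ex;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoff);
                    } catch (InterruptedException interrupted) {
                        Thread.currentThread().interrupt(); // preserve the interrupt flag
                        throw ex;
                    }
                    backoff *= 2;
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

Placing the retry loop outside the breaker means every failed attempt is recorded by the breaker, and once it opens the loop ends on the next attempt instead of burning through the remaining retries.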

When Not to Use a Circuit Breaker

Circuit breakers are not always necessary.

  • Simple local function calls
  • Very low-volume internal tools
  • Cases where a normal timeout is enough
  • Operations where fail-fast behavior is worse than waiting

They are most useful around remote dependencies, especially when failure can spread from one service to another.

How to Answer This in an Interview

I would explain that a circuit breaker protects a service from repeatedly calling a failing dependency. It starts closed and allows calls. If failures exceed a threshold, it opens and fails fast. After a cool-down period, it becomes half-open and allows a few trial calls. If those succeed, it closes again; if they fail, it opens again. I would combine it with timeouts, bounded retries, fallback, and monitoring.

Common Interview Follow-Ups

Is a circuit breaker the same as a retry?

No. A retry tries the operation again because the failure may be temporary. A circuit breaker stops calling a dependency when it is likely to keep failing.

What are the three circuit breaker states?

Closed, Open, and Half-Open. Closed allows calls, Open fails fast, and Half-Open allows trial calls to check recovery.

Does a circuit breaker fix the downstream service?

No. It protects the caller and reduces pressure on the downstream service, but it does not directly repair the dependency.

What should happen when the circuit is open?

The service can fail fast, return cached/default data, gracefully degrade, or queue work for later depending on the use case.

What mistake do candidates make?

They memorize Closed/Open/Half-Open but do not explain the actual reason: preventing repeated slow calls from consuming resources and causing cascading failures.

Final Takeaway

A circuit breaker is a resilience pattern for remote dependencies. It detects repeated failures, temporarily stops calls, fails fast, tests recovery later, and helps prevent one failing service from dragging down the rest of the system.