Why Your Microservice Outages Cascade (And How to Stop Them)

If you've built an application with microservices, you know the benefits. Each service is specialized and independent. But this independence can create a hidden weakness.

Imagine your Post Service needs to get the author's name from the Profile Service. It's a simple, routine request. But what happens if the Profile Service is down or responding slowly? The Post Service will keep sending requests, waiting for an answer. Soon, its own resources get tied up, its performance drops, and it might crash.

This is a cascading failure. A problem in one small service ripples through your system, causing a much larger outage.

To prevent this, we can use a design pattern called the Circuit Breaker.

In this guide, I'll explain the concept and show you how it works. I've also put a complete working example, built with Node.js and Redis, in my GitHub repository if you want to run it yourself.

The Big Idea: The Three States of a Circuit Breaker

The circuit breaker works by moving between three simple states. If you understand these, you understand the entire pattern.

1. CLOSED: Everything is Normal

This is the default state. The circuit is "closed," so requests are allowed to flow from your service to the dependency, just like normal.

While everything is working, the circuit breaker quietly monitors the calls for failures (like errors or timeouts). It keeps a count of any recent failures. If the number of failures passes a certain limit, the circuit breaker "trips" and moves to the OPEN state.

2. OPEN: The Service is Unavailable

Now, the circuit is "open." For a set amount of time (for example, 30 seconds), the circuit breaker will immediately block any new requests to the failing service.

It won't even try to make the network call. Instead, it instantly returns an error. This is a key benefit: failing fast.

Your service doesn't have to wait for another failed request to time out. It gets an immediate response and can handle the error gracefully, maybe by showing a default message or returning cached data. This gives the troubled service time to recover without being overwhelmed by constant new requests.

Once this waiting period (often called the reset timeout) ends, the circuit breaker moves to the HALF-OPEN state.

3. HALF-OPEN: Time for a Test

The timeout is over, and the circuit breaker is ready to check if the dependency has recovered. It will allow a single request to go through to the service.

If this single request succeeds: The circuit breaker assumes the service is healthy again. It resets its failure count and moves back to the CLOSED state, allowing requests to flow normally.
If this single request fails: The circuit breaker assumes the service is still having problems. It moves back to the OPEN state and starts the timeout again.
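
To make the three states concrete, here is a minimal in-memory sketch of the state machine. It is not the code from the repository: the class name, the option names (failureThreshold, resetTimeoutMs), the default values, and the injectable now clock are all assumptions made for illustration.

```javascript
// Illustrative sketch only; names and defaults are assumptions, not the repo's API.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold; // failures allowed before tripping
    this.resetTimeoutMs = resetTimeoutMs;     // how long to stay OPEN
    this.now = now;                           // injectable clock (useful in tests)
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
    this.trialInFlight = false;
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit is OPEN'); // fail fast: fn is never called
      }
      this.state = 'HALF-OPEN'; // waiting period is over
    }
    if (this.state === 'HALF-OPEN') {
      if (this.trialInFlight) throw new Error('Circuit is HALF-OPEN');
      this.trialInFlight = true; // let exactly one trial request through
    }
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  recordSuccess() {
    this.state = 'CLOSED';
    this.failures = 0;
    this.trialInFlight = false;
  }

  recordFailure() {
    this.failures += 1;
    const trialFailed = this.state === 'HALF-OPEN';
    this.trialInFlight = false;
    if (trialFailed || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now(); // restart the waiting period
    }
  }
}
```

This version counts consecutive failures, since any success resets the counter. It only sees a failure when fn rejects, so a slow call has to be given its own timeout before the breaker can count it.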

Why This Pattern Is So Useful

The circuit breaker pattern solves several critical problems in distributed systems.

It Protects Your Services

Without a circuit breaker, a slow or failing dependency can use up your service's resources, like connection pools and memory. This can cause your service to crash. The circuit breaker prevents this by stopping the calls before they can cause harm.
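
A breaker can only count a slow call as a failure if that call eventually gives up. One common way to do that is to race the call against a timer. The helper below is a small sketch; the name withTimeout and the error message are my own choices, not something taken from the repository.

```javascript
// Sketch: reject if fn() has not settled within `ms` milliseconds.
function withTimeout(fn, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    Promise.resolve()
      .then(fn)                 // run the call (sync throws become rejections)
      .then(resolve, reject)    // whichever settles first wins
      .finally(() => clearTimeout(timer)); // do not leave the timer running
  });
}
```

Note that this only stops waiting. It does not cancel the underlying work, so a real HTTP call would also want an abort signal or a client-level timeout to release the connection.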

It Helps Dependencies Recover

When a service is overloaded, the worst thing you can do is send it more traffic. By blocking requests, the circuit breaker gives the struggling service the breathing room it needs to recover, whether that means finishing a long process, restarting, or clearing its memory.

It Makes Failures Easier to See

When a circuit trips, it's a very clear signal that something is wrong with a specific dependency. This makes monitoring much more straightforward. You can create alerts for when a circuit opens, making it easier to find and fix the root cause of problems.
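
One way to make a tripped circuit visible is to report every state change through a callback that your logging or alerting can hook into. This is a sketch under assumptions: createStateTracker, the onStateChange callback, and the event fields are hypothetical names, not part of the repository.

```javascript
// Sketch: report state transitions so monitoring can alert when a circuit opens.
function createStateTracker(name, onStateChange) {
  let state = 'CLOSED';
  return {
    get state() { return state; },
    transition(next, reason) {
      if (next === state) return; // only report real transitions
      const event = { circuit: name, from: state, to: next, reason, at: new Date().toISOString() };
      state = next;
      onStateChange(event);
    },
  };
}

// Example hook: structured JSON logs that a log-based alert could match on.
const tracker = createStateTracker('profileService', (event) => {
  if (event.to === 'OPEN') {
    console.error(JSON.stringify(event));
  } else {
    console.info(JSON.stringify(event));
  }
});

tracker.transition('OPEN', 'failure threshold reached');
```

An alert on events where to is OPEN then points straight at the dependency that is misbehaving.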

Building a Circuit Breaker

Enough theory. Let's look at a practical implementation.

You can find a complete, working example that I've built here:
GitHub: SystemDesign/circuit-breaker

The repository includes:
A profileService (a service that we can make fail on purpose for testing).
A postService (a service that depends on the profileService).
A circuitBreaker.js module that contains all the logic.
Full setup instructions.

What's in the Code

The implementation includes two simple Express servers to simulate a real microservice architecture.

The core of the project is the circuit breaker logic. It uses Redis to store the state of the circuit (OPEN, CLOSED, or HALF-OPEN), the failure count, and timestamps. Using a central store like Redis is important because if you have multiple instances of your service, they can all share the same circuit breaker state. An outage detected by one instance will be known by all of them.
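
To give a rough idea of what sharing state through Redis can look like, here is a sketch. It assumes client is a connected node-redis v4 client and uses only its get, set, incr, pExpire, and del methods. The key layout (circuit:<name>:failures and circuit:<name>:openedAt) and the option names are hypothetical, and the repository may store things differently. Instead of storing the state name, this version derives it from the timestamp of the last trip.

```javascript
// Sketch of shared breaker state in Redis. `client` is assumed to be a connected
// node-redis v4 client; the key layout and option names are hypothetical.
const key = (name, field) => `circuit:${name}:${field}`;

async function getState(client, name, { resetTimeoutMs = 30000, now = Date.now() } = {}) {
  const openedAt = await client.get(key(name, 'openedAt'));
  if (openedAt === null) return 'CLOSED';
  return now - Number(openedAt) < resetTimeoutMs ? 'OPEN' : 'HALF-OPEN';
}

async function recordFailure(client, name, { failureThreshold = 3, windowMs = 60000, now = Date.now() } = {}) {
  const alreadyTripped = (await client.get(key(name, 'openedAt'))) !== null;
  const failures = await client.incr(key(name, 'failures'));
  if (failures === 1) {
    await client.pExpire(key(name, 'failures'), windowMs); // only count recent failures
  }
  if (alreadyTripped || failures >= failureThreshold) {
    await client.set(key(name, 'openedAt'), String(now)); // (re)open for every instance
  }
  return failures;
}

async function recordSuccess(client, name) {
  await client.del([key(name, 'failures'), key(name, 'openedAt')]); // back to CLOSED
}
```

The separate reads and writes here are not atomic, and nothing stops two instances from both sending a trial request in HALF-OPEN. A production version would likely wrap these steps in a transaction or a Lua script.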

How to Use It

The implementation is designed to be easy to use. You just wrap your service call inside the circuit breaker's execute function.

Instead of calling the other service directly, you let the circuit breaker handle it. The circuit breaker manages all the logic for tracking failures, changing states, and handling timeouts.

If the circuit is open, your function will receive an error right away. You can then catch this error and decide what to do, like returning fallback data. This is much better than a user waiting 30 seconds for a request to time out.
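
In code, wrapping a call and falling back might look roughly like this. I'm assuming the breaker exposes an async execute(fn) method that takes a function and rejects when the circuit is open; check the repository for the exact signature. The fetchProfile function and the shape of the fallback object are hypothetical.

```javascript
// Sketch: `breaker` is any object with an async execute(fn) method (an assumed
// interface), and fetchProfile is a hypothetical function that calls the profileService.
async function getAuthor(breaker, fetchProfile, authorId) {
  try {
    return await breaker.execute(() => fetchProfile(authorId));
  } catch (err) {
    // Open circuit (instant rejection) or a failed call: degrade gracefully.
    return { id: authorId, name: 'Unknown author', degraded: true };
  }
}

// Stand-in breaker that simulates an open circuit, to show the fallback path.
const openBreaker = {
  execute: async () => { throw new Error('Circuit is OPEN'); },
};

getAuthor(openBreaker, async (id) => ({ id, name: 'Ada' }), 42)
  .then((author) => console.log(author));
```

The post can still be rendered with a placeholder author, which is usually a better experience than an error page.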

Seeing it in Action

The code in the repository lets you easily test the circuit breaker's behavior.

Start both services. Make a request to the postService. It will successfully call the profileService, and everything works as expected.

Make the service fail. Use a special endpoint to toggle the profileService into a "failing" mode, where it returns errors.

Watch the circuit trip. Send a few requests to the postService. You'll see in the logs that the first few calls fail. After a set number of failures, the circuit breaker will trip and move to the OPEN state.

See it fail fast. Try to send another request right away. This time, you'll get an instant error message. The circuit breaker didn't even try to contact the profileService. This is the pattern working correctly.

Watch it recover. Wait for the configured timeout (e.g., 10 seconds). Send another request. The circuit will move to HALF-OPEN and try the call one more time. If you've fixed the profileService, the call will succeed, and the circuit will move back to CLOSED. The system has healed itself automatically.
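
If you're curious what the failure switch from the second step could look like, here is a dependency-free sketch. The repository uses Express, so this is not its code: the routes (/profiles/<id> and /toggle-failure), the port, and the response bodies are made up for illustration, and the handler is written against Node's plain http request and response objects.

```javascript
// Sketch of a profile service handler with a failure switch. The routes
// (/profiles/<id> and /toggle-failure) are hypothetical, not the repo's.
let failing = false;

function sendJson(res, status, payload) {
  res.writeHead(status, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(payload));
}

function profileHandler(req, res) {
  if (req.method === 'POST' && req.url === '/toggle-failure') {
    failing = !failing; // flip between healthy and failing mode
    return sendJson(res, 200, { failing });
  }
  if (req.method === 'GET' && req.url.startsWith('/profiles/')) {
    if (failing) return sendJson(res, 500, { error: 'Simulated failure' });
    const id = req.url.slice('/profiles/'.length);
    return sendJson(res, 200, { id, name: `Author ${id}` });
  }
  return sendJson(res, 404, { error: 'Not found' });
}

// To serve it: require('node:http').createServer(profileHandler).listen(3001);
```

Toggling the switch on makes every profile lookup return a 500, which is what drives the breaker toward OPEN. Toggling it off again lets the next HALF-OPEN trial succeed.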

Scaling the Pattern with Redis

The example uses Redis to share the circuit's state. This is a powerful feature for real-world applications. When you run multiple copies of a service, you want them all to know about the health of your dependencies.

If one instance detects a failure and opens the circuit, Redis ensures all other instances also know not to send requests. No instance has to discover the problem on its own.

For even faster reactions, you could add Redis Pub/Sub. When an instance trips a circuit, it publishes a message. Every other instance subscribes to that channel and updates its own copy of the state as soon as the message arrives, so the whole system can react to an outage almost immediately instead of each instance finding out on its next check.
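
Here is a rough sketch of that idea. It assumes publisher and subscriber are two connected node-redis v4 clients (a client that subscribes needs its own connection) and relies only on their publish and subscribe methods. The channel name, the message format, and the local Map cache are hypothetical.

```javascript
// Sketch: `publisher` and `subscriber` are assumed to be two connected node-redis v4
// clients. The channel name and message shape are made up for illustration.
const CHANNEL = 'circuit-events';

// Call this on the instance that trips the circuit.
async function announceOpen(publisher, circuit, openedAt = Date.now()) {
  await publisher.publish(CHANNEL, JSON.stringify({ circuit, state: 'OPEN', openedAt }));
}

// Call this once at startup on every instance; `localState` is a plain Map cache.
async function followCircuitEvents(subscriber, localState) {
  await subscriber.subscribe(CHANNEL, (message) => {
    const { circuit, state, openedAt } = JSON.parse(message);
    localState.set(circuit, { state, openedAt });
  });
}
```

Pub/Sub messages are not stored, so an instance that starts up later would still need to read the current state from Redis once before relying on the channel.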

When Should You Use This?

Circuit breakers are powerful, but they aren't needed for every single interaction.

Good Use Cases:

Calls to external services or third-party APIs.
Calls between your own microservices that can fail or slow down.
Any situation where a failure could cascade and affect other parts of your system.
When you need a way to degrade service gracefully instead of failing completely.

When to Avoid Them:

Calls to a local database.
Critical operations that must succeed, like the final step in a payment process.
Simple applications with no external dependencies.

You've Got This!

The Circuit Breaker pattern is a practical way to build more reliable and resilient systems. It accepts that failures will happen and gives your services a smart way to handle them.

By preventing one small problem from taking down your entire application, you build systems that are stronger and easier to manage.

What We Covered:

The three states: CLOSED, OPEN, and HALF-OPEN.
Why failing fast is better than long timeouts.
How circuit breakers stop cascading failures.
A real implementation you can explore and use.

Feel free to head over to the GitHub repo to see the code, run it for yourself, and adapt it to your own projects.

Now you have another tool to help you build better, more fault-tolerant systems. Happy coding!