Reliability + System Design

Health checks in system design

Health checks help infrastructure decide whether an instance is alive, ready for traffic, or should be restarted. They are essential for load balancing, deployments, and auto-recovery.


The Short Answer

A health check is a small test used to decide whether a service instance is healthy enough to run, receive traffic, or stay in rotation.

Health checks answer questions like: “Is this process alive?”, “Is it ready for traffic?”, and “Should the load balancer keep sending requests here?”

The Real Problem

Imagine a load balancer sending traffic to three app servers.

No Health Checks

Load Balancer → App 1, App 2 (broken), App 3

Some users still hit the broken App 2.

With Health Checks

Load Balancer → App 1 (healthy), App 3 (healthy); App 2 is removed from rotation

Traffic goes only to healthy instances.

Health checks help infrastructure stop sending traffic to instances that are crashed, stuck, overloaded, still starting, or unable to serve requests correctly.

Three Common Types of Health Checks

Liveness

Is the app alive?

If this fails, the platform may restart the container.

Readiness

Is the app ready for traffic?

If this fails, traffic should not be routed here.

Startup

Has the app finished starting?

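Useful for slow-starting apps so they are not killed too early. In Kubernetes, for example, liveness and readiness probes are held back until the startup probe succeeds.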

Liveness is about restarting broken instances. Readiness is about routing traffic only to instances that can serve requests.
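To make the three signals concrete, here is a minimal sketch of how a platform might turn probe results into an action. The names `ProbeAction` and `ProbePolicy` are hypothetical and do not come from any real orchestrator; the ordering (startup first, then liveness, then readiness) is an assumption chosen to match the descriptions above.

```java
/** Hypothetical set of actions a platform could take for one instance. */
enum ProbeAction { WAIT, RESTART, REMOVE_FROM_TRAFFIC, ROUTE_TRAFFIC }

final class ProbePolicy {
    /**
     * Decide what to do with one instance from its latest probe results.
     * Startup is consulted first so a slow-starting app is not restarted
     * only because liveness or readiness would fail while it boots.
     * A real platform would also restart an instance whose startup
     * check keeps failing past some budget; that is omitted here.
     */
    static ProbeAction decide(boolean startupPassed, boolean live, boolean ready) {
        if (!startupPassed) return ProbeAction.WAIT;        // still booting
        if (!live) return ProbeAction.RESTART;              // stuck or dead
        if (!ready) return ProbeAction.REMOVE_FROM_TRAFFIC; // alive, cannot serve
        return ProbeAction.ROUTE_TRAFFIC;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false, false)); // WAIT
        System.out.println(decide(true, false, false));  // RESTART
        System.out.println(decide(true, true, false));   // REMOVE_FROM_TRAFFIC
        System.out.println(decide(true, true, true));    // ROUTE_TRAFFIC
    }
}
```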

Liveness vs Readiness: The Most Important Distinction

This distinction matters in both interviews and production systems, because the two failures call for different actions.

Liveness Failure

App is stuck or deadlocked

Restart the process/container

Readiness Failure

App is alive but cannot serve traffic

Remove from traffic; do not necessarily restart

Example: an app may be alive but still warming up caches, applying migrations, overloaded, or unable to reach a required dependency. That is a readiness problem, not necessarily a liveness problem.
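The warmup case can be sketched from the application's side. This example uses the JDK's built-in `com.sun.net.httpserver.HttpServer` rather than Spring so that it stays self-contained. The class name `WarmupAwareApp`, the `/live` and `/ready` paths, and the single warmup flag are illustrative assumptions, not a recommended production layout.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical app exposing separate /live and /ready endpoints. */
final class WarmupAwareApp {
    private final AtomicBoolean warmedUp = new AtomicBoolean(false);

    /** Call once caches are loaded and the app can serve real requests. */
    void markWarmedUp() { warmedUp.set(true); }

    /** Liveness: if this code runs at all, the process is alive. */
    int liveStatus() { return 200; }

    /** Readiness: 503 until warmup completes, then 200. */
    int readyStatus() { return warmedUp.get() ? 200 : 503; }

    /** Serve both endpoints on the given port and return the running server. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/live", exchange -> respond(exchange, liveStatus()));
        server.createContext("/ready", exchange -> respond(exchange, readyStatus()));
        server.start();
        return server;
    }

    private static void respond(HttpExchange exchange, int status) throws IOException {
        exchange.sendResponseHeaders(status, -1); // -1 means no response body
        exchange.close();
    }
}
```

While the app is warming up, `/live` answers 200 and `/ready` answers 503, so the instance is kept running but receives no traffic. Calling `markWarmedUp()` flips `/ready` to 200 without a restart.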

Simple Spring Boot Health Endpoint

A very simple health endpoint may look like this:

java

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthController {

    // Returns 200 with a fixed body; this proves only that the HTTP layer responds.
    @GetMapping("/health")
    public String health() {
        return "OK";
    }
}

This only proves the web server can respond. It does not prove the whole application is ready to serve real user traffic.

A Better Readiness Check

A readiness check may verify important dependencies needed to serve traffic.

java

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ReadinessController {
    private final DatabaseClient databaseClient;

    public ReadinessController(DatabaseClient databaseClient) {
        this.databaseClient = databaseClient;
    }

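    @GetMapping("/ready")
    public ResponseEntity<String> ready() {
        // A critical dependency is down: answer 503 so traffic is routed elsewhere.
        if (!databaseClient.canConnect()) {
            return ResponseEntity
                    .status(503)
                    .body("Database unavailable");
        }

        // All critical dependencies are reachable: answer 200.
        return ResponseEntity.ok("READY");
    }

    // Minimal abstraction over the database; canConnect() should be a cheap check.
    interface DatabaseClient {
        boolean canConnect();
    }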
}

If this returns 503, the load balancer or orchestrator can stop routing traffic to that instance until it becomes ready again.

But Do Not Make Health Checks Too Heavy

Health checks should be useful, but they should not overload your own system.

Too shallow

Only checking that the process responds may miss broken dependencies.

Too deep

Checking every dependency on every health request can create extra load and false failures.

Good liveness check

Checks whether the process is alive and not obviously stuck.

Good readiness check

Checks whether the instance can serve the important traffic it is about to receive.

The trap: if every instance marks itself unhealthy because one shared dependency is down, the load balancer may remove all instances even though the app itself could still serve degraded responses.
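One possible way to avoid this trap is to separate critical dependencies from optional ones, and to report an instance as not ready only when a critical one fails. The sketch below assumes that split; `DependencyAwareReadiness`, its `Status` values, and the dependency names in the usage note are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical readiness evaluator that treats critical and optional dependencies differently. */
final class DependencyAwareReadiness {
    enum Status { READY, DEGRADED, NOT_READY }

    private final Map<String, BooleanSupplier> critical = new LinkedHashMap<>();
    private final Map<String, BooleanSupplier> optional = new LinkedHashMap<>();

    DependencyAwareReadiness critical(String name, BooleanSupplier check) {
        critical.put(name, check);
        return this;
    }

    DependencyAwareReadiness optional(String name, BooleanSupplier check) {
        optional.put(name, check);
        return this;
    }

    /** NOT_READY if any critical check fails; DEGRADED if only optional checks fail. */
    Status evaluate() {
        for (BooleanSupplier check : critical.values()) {
            if (!check.getAsBoolean()) return Status.NOT_READY;
        }
        for (BooleanSupplier check : optional.values()) {
            if (!check.getAsBoolean()) return Status.DEGRADED;
        }
        return Status.READY;
    }

    /** Only NOT_READY maps to 503, so a degraded instance stays in rotation. */
    static int httpStatus(Status status) {
        return status == Status.NOT_READY ? 503 : 200;
    }
}
```

For example, `new DependencyAwareReadiness().critical("database", () -> true).optional("recommendations", () -> false).evaluate()` yields `DEGRADED`, which still maps to 200. Whether a shared dependency counts as critical remains a per-service judgment.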

Health Checks and Load Balancers

Load balancers use health checks to decide which instances should receive traffic.

text

Load Balancer
    checks /ready on App 1 → 200 OK
    checks /ready on App 2 → 503
    checks /ready on App 3 → 200 OK

Traffic goes to App 1 and App 3 only.

This is a major reason health checks improve availability: they help route around bad instances automatically.
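The routing decision in the diagram above can be sketched as a round-robin selector that skips instances whose last probe failed. This is a simplified model, not how any particular load balancer is implemented: `HealthAwareBalancer` is a hypothetical name, and treating any non-2xx status as a failure is an assumption.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin balancer that routes only to healthy instances. */
final class HealthAwareBalancer {
    private final List<String> instances;
    private final Set<String> unhealthy = ConcurrentHashMap.newKeySet();
    private final AtomicInteger cursor = new AtomicInteger();

    HealthAwareBalancer(List<String> instances) {
        this.instances = List.copyOf(instances);
    }

    /** Record the HTTP status of the latest /ready probe for one instance. */
    void recordProbe(String instance, int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            unhealthy.remove(instance);
        } else {
            unhealthy.add(instance);
        }
    }

    /** Pick the next instance in round-robin order, skipping unhealthy ones. */
    Optional<String> next() {
        for (int i = 0; i < instances.size(); i++) {
            int index = Math.floorMod(cursor.getAndIncrement(), instances.size());
            String candidate = instances.get(index);
            if (!unhealthy.contains(candidate)) return Optional.of(candidate);
        }
        return Optional.empty(); // every instance is out of rotation
    }
}
```

With App 2 reporting 503, successive calls to `next()` alternate between App 1 and App 3. Once App 2 reports 200 again, it rejoins the rotation without any manual step.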

Health Checks and Deployments

Health checks are also important during deployments.

text

Deploy new version
    ↓
New instance starts
    ↓
Startup/readiness checks run
    ↓
Only after passing readiness does it receive traffic

This helps avoid sending users to a version that has started but is not actually ready yet.
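The gating step in this flow might look like the sketch below: a new instance joins the rotation only after a number of consecutive readiness successes. `ReadinessGate` and its `successThreshold` parameter are hypothetical names, and dropping out of rotation on a single failure is a simplification; real platforms usually apply a failure threshold as well.

```java
/** Hypothetical gate that admits an instance after N consecutive readiness successes. */
final class ReadinessGate {
    private final int successThreshold;
    private int consecutiveSuccesses = 0;

    ReadinessGate(int successThreshold) {
        if (successThreshold < 1) {
            throw new IllegalArgumentException("successThreshold must be at least 1");
        }
        this.successThreshold = successThreshold;
    }

    /** Feed one probe result; returns whether the instance may now receive traffic. */
    synchronized boolean record(boolean probePassed) {
        if (probePassed) {
            consecutiveSuccesses = Math.min(consecutiveSuccesses + 1, successThreshold);
        } else {
            consecutiveSuccesses = 0; // any failure resets the streak
        }
        return inRotation();
    }

    synchronized boolean inRotation() {
        return consecutiveSuccesses >= successThreshold;
    }
}
```

Requiring more than one success guards against an instance that answers its first probe and then fails while it is still settling.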

Common Things to Check

What you check depends on the system, but common checks include:

process is alive
HTTP server can respond
database connection pool is usable
required cache or queue is reachable
disk is not full
critical config loaded correctly
instance is not overloaded
startup/warmup has completed
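A few of the local checks in this list can be written with the standard library alone. The sketch below is illustrative: `LocalChecks`, the thresholds, and the config keys are assumptions, and appropriate limits depend on the service.

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helpers for cheap, local health signals. */
final class LocalChecks {
    /** True when at least minFreeFraction of the partition holding path is usable. */
    static boolean diskHasSpace(File path, double minFreeFraction) {
        long total = path.getTotalSpace(); // 0 if path does not name a partition
        if (total == 0) return false;
        return (double) path.getUsableSpace() / total >= minFreeFraction;
    }

    /** True when every required key is present with a non-blank value. */
    static boolean configLoaded(Map<String, String> config, String... requiredKeys) {
        for (String key : requiredKeys) {
            String value = config.get(key);
            if (value == null || value.isBlank()) return false;
        }
        return true;
    }

    /** True while the number of in-flight requests is below the limit. */
    static boolean notOverloaded(AtomicInteger inFlight, int maxInFlight) {
        return inFlight.get() < maxInFlight;
    }
}
```

A readiness endpoint could combine these, for example `diskHasSpace(new File("/"), 0.05) && notOverloaded(inFlight, 200)`, where the path and both limits are placeholders.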

What Not to Check

A health check should not become a full integration test for the entire company.

Do not call dozens of downstream services every few seconds.
Do not perform expensive database queries.
Do not mutate production data.
Do not make health checks depend on optional features.
Do not expose sensitive internal details publicly.

Health checks should be cheap, fast, safe, and designed around the decision they support: restart, receive traffic, or alert.
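One way to keep a dependency check cheap is to cache its result briefly, so that frequent probes do not each hit the dependency. The sketch below assumes a hypothetical `CachedCheck` wrapper with a caller-supplied clock (to keep it testable); the time-to-live value is a tuning choice and not something the text above prescribes.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical wrapper that reuses a check result until it is older than ttlMillis. */
final class CachedCheck implements BooleanSupplier {
    private final BooleanSupplier delegate;
    private final long ttlMillis;
    private final LongSupplier clockMillis;
    private boolean hasRun = false;
    private boolean lastResult;
    private long lastRunAt;

    CachedCheck(BooleanSupplier delegate, long ttlMillis, LongSupplier clockMillis) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    /** Runs the underlying check at most once per TTL window. */
    @Override
    public synchronized boolean getAsBoolean() {
        long now = clockMillis.getAsLong();
        if (!hasRun || now - lastRunAt >= ttlMillis) {
            lastResult = delegate.getAsBoolean();
            lastRunAt = now;
            hasRun = true;
        }
        return lastResult;
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. The trade-off is staleness: with a 5-second TTL, a failure can go unreported for up to 5 seconds.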

Health Checks vs Monitoring

Health checks and monitoring are related, but they are not the same.

Health check

A specific signal used by infrastructure to make an action decision, such as routing traffic or restarting a container.

Monitoring

A broader view of system behavior over time: metrics, logs, traces, dashboards, alerts, and trends.

A service can pass its health check but still have problems visible in monitoring, such as high p99 latency, rising error rate, or slow database queries.
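A small worked example of that gap, using made-up latency samples: the health check still passes, yet the 99th percentile is far above the median. `LatencyView` is a hypothetical helper, and the nearest-rank percentile used here is one common definition among several.

```java
import java.util.Arrays;

/** Hypothetical helper that computes a nearest-rank percentile over latency samples. */
final class LatencyView {
    /** Nearest-rank percentile for p in 1..100, in the same unit as the samples. */
    static long percentile(long[] samples, int p) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (p * sorted.length + 99) / 100; // integer form of ceil(p * n / 100)
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] latenciesMs = new long[100];
        Arrays.fill(latenciesMs, 20); // 98 fast requests at 20 ms...
        latenciesMs[98] = 2000;       // ...and two slow ones at 2000 ms
        latenciesMs[99] = 2000;
        boolean healthCheckPasses = true; // the endpoint still answers 200
        System.out.println("health check passes: " + healthCheckPasses);
        System.out.println("p50 = " + percentile(latenciesMs, 50) + " ms"); // 20 ms
        System.out.println("p99 = " + percentile(latenciesMs, 99) + " ms"); // 2000 ms
    }
}
```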

Health Check Design Questions

In interviews, it helps to ask:

Is this check for liveness, readiness, or startup?
Who consumes the result: load balancer, Kubernetes, monitor, or human?
What action happens when it fails?
Should this instance be restarted or only removed from traffic?
Which dependencies are critical for this endpoint?
Could this check cause cascading failure?
How often will it run?
What timeout should the health check use?
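The timeout question can be made concrete with a sketch that runs a check under a time budget and treats a slow or failing check as unhealthy. `TimedCheck` is a hypothetical helper and the timeout values in the usage note are placeholders; counting a timeout as a failure is itself a design decision.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper that enforces a time budget on a health check. */
final class TimedCheck {
    // Daemon threads so a hung check cannot keep the JVM from exiting.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "health-check");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns the check's result, or false if it times out, throws, or is interrupted. */
    static boolean runWithTimeout(BooleanSupplier check, long timeoutMillis) {
        Callable<Boolean> task = check::getAsBoolean;
        Future<Boolean> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow check
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the check itself threw
        }
    }
}
```

A call such as `TimedCheck.runWithTimeout(databaseCheck, 200)` bounds the cost of one probe. The budget should stay well below the prober's own timeout, and the probe interval should be longer than both.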

How to Answer This in an Interview

I would explain that health checks let infrastructure decide whether an instance should receive traffic, restart, or remain out of rotation. I would separate liveness from readiness: liveness tells us whether the process should be restarted, while readiness tells us whether it should receive traffic. I would keep checks cheap and fast, include only critical dependencies in readiness, and avoid making every instance unhealthy because one optional dependency is down.

Common Interview Follow-Ups

What is the difference between liveness and readiness?

Liveness asks whether the process is alive and should keep running. Readiness asks whether the instance is ready to receive traffic.

Should a readiness check include the database?

If the service cannot serve its main traffic without the database, then yes, a lightweight database check can make sense. But it should be cheap and should not check optional dependencies unnecessarily.

Can health checks cause outages?

Yes. If health checks are too strict or depend on a shared failing dependency, all instances may mark themselves unhealthy at once and be removed from traffic.

Is /health enough?

Usually not by itself. A basic /health endpoint only proves the process can respond. Production systems often separate /live, /ready, and sometimes /startup.

What mistake do candidates make?

They say “add a health check” but do not explain what it checks, who consumes it, what action happens on failure, or how it avoids false failures.

Final Takeaway

Health checks are small signals with big consequences. Use liveness to decide when to restart, readiness to decide when to route traffic, and startup checks for slow-starting applications. Keep checks fast, safe, and aligned with the action they trigger.