Reliability + System Design

Reliability concepts in system design

Reliability means a system performs its intended function correctly and consistently over time. It includes availability, correctness, durability, recovery, observability, and predictable behavior during failures.

Tags: Reliability, Availability, SLO, Error Budget, System Design

The Short Answer

Reliability means the system performs its intended function correctly and consistently over time.

Availability asks, “Can users reach the system?” Reliability asks, “Does the system keep working correctly when users depend on it?”

A system can be available but still unreliable if it returns wrong results, loses data, times out often, or behaves unpredictably.

Availability vs Reliability

Available But Not Reliable

API returns 200 OK
But sometimes returns wrong inventory
User can order out-of-stock item

Reliable System

API responds successfully
Data is correct
Behavior is predictable

Availability is part of reliability, but reliability also includes correctness, durability, predictable latency, recovery, and safe behavior during failures.

Simple Example

Imagine a payment system. If the payment API is reachable but occasionally double-charges customers, nobody would call it reliable.

Payment API is online
User submits payment
Network timeout happens
Client retries
User is charged twice

Available? Maybe.
Reliable? No.

This is why reliability is not just uptime. It is about whether the system behaves correctly under normal and failure conditions.
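
One common way to make the retry in the scenario above safe is an idempotency key: the client reuses the same key for every retry of one logical payment, and the server charges at most once per key. The sketch below is illustrative only. `PaymentService`, `charge`, and the in-memory dictionary are hypothetical stand-ins for a real payment API and a durable store.

```python
import uuid


class PaymentService:
    """Toy payment service that deduplicates charges by idempotency key."""

    def __init__(self):
        # Hypothetical in-memory store; a real system would need a durable,
        # transactional store so keys survive crashes and restarts.
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, user_id: str, amount_cents: int) -> dict:
        # A retry with an already-processed key returns the stored result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.charges_made += 1  # stands in for actually moving money
        result = {
            "charge_id": str(uuid.uuid4()),
            "user_id": user_id,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once per logical payment, reused on retries

first = service.charge(key, "user-1", 2500)
retry = service.charge(key, "user-1", 2500)  # client retried after a timeout

print(service.charges_made)  # 1
print(first == retry)        # True
```

This toy version is not safe under concurrent requests. In practice the key check and the charge would need to happen atomically, for example inside a database transaction with a unique constraint on the key.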

What Reliability Includes

Correctness

The system returns the right result and preserves business invariants.

Availability

The system is usable when users need it.

Durability

Important data is not lost after crashes, restarts, or hardware failures.

Latency predictability

Response times stay within a predictable range, and tail latency does not spike unexpectedly for large groups of users.

Recoverability

The system can recover quickly after failures.

Fault tolerance

The system can continue operating when some components fail.

Observability

Engineers can detect, debug, and understand failures.

Operational safety

Deployments, config changes, and migrations do not frequently break production.

Reliability Is About Failure Assumptions

Reliable systems are designed with the assumption that things will fail.

  • servers crash
  • deployments contain bugs
  • networks become slow
  • databases fail over
  • queues build up
  • third-party APIs go down
  • traffic spikes happen
  • humans make mistakes

Reliability is not pretending failures will not happen. Reliability is designing so failures are contained, detected, recovered from, and learned from.

How Reliability Is Measured

Reliability should be measured from the user's point of view. Common measures include:

Success rate

What percentage of valid requests succeed?

Error rate

How often do users receive failed responses?

Latency

How long do users wait, especially at p95 and p99?

Durability

Was important data preserved correctly?

Recovery time

How long does it take to recover after failure?

Incident frequency

How often do serious production issues happen?
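
As a rough illustration of the first few measures, the sketch below computes a success rate and p95/p99 latency from a list of request records. The `RequestRecord` type, the nearest-rank percentile method, and the sample data are all assumptions made for the example. Real monitoring systems usually compute these from histograms or streaming estimates.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ok: bool            # did the valid request succeed?
    latency_ms: float   # how long the user waited


def success_rate(records):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in records) / len(records)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up sample: 100 requests with latencies of 1..100 ms, two of which failed.
records = [RequestRecord(ok=(i not in (50, 100)), latency_ms=float(i)) for i in range(1, 101)]
latencies = [r.latency_ms for r in records]

print(success_rate(records))      # 0.98
print(percentile(latencies, 95))  # 95.0
print(percentile(latencies, 99))  # 99.0
```

Tail percentiles matter because an average can look healthy while a meaningful share of users still waits far longer than the typical request.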

SLI, SLO, and SLA

These terms show up constantly in reliability discussions.

SLI

Service Level Indicator

The measurement.

Example: request success rate.

SLO

Service Level Objective

The internal target.

Example: 99.9% success rate.

SLA

Service Level Agreement

The external promise.

Example: a customer contract that commits to a stated service level, often with credits or penalties if it is missed.

In interviews, SLOs are especially useful because they turn vague reliability requirements into measurable targets.

Error Budgets

An error budget is the amount of unreliability the system is allowed to have while still meeting its SLO.

If SLO = 99.9% success rate
Then error budget = 0.1% of valid requests may fail

Error budgets are useful because they give teams a shared, quantitative way to trade off reliability against feature velocity.

Budget Available

Reliability target is being met
Continue shipping carefully

Budget Burned

Too many failures
Slow feature work and fix reliability
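
To make the arithmetic concrete, here is a minimal sketch that turns an SLO and a request count into an error budget and reports whether it has been burned. The function name `error_budget`, the traffic numbers, and the simple "remaining > 0" rule are illustrative assumptions, not a standard policy.

```python
from fractions import Fraction


def error_budget(slo_percent: str, total_requests: int, failed_requests: int) -> dict:
    """Return the failures an SLO allows in a window and how much budget is left."""
    # Fraction avoids floating-point noise in values like 1 - 0.999.
    allowed = (1 - Fraction(slo_percent) / 100) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": float(allowed),
        "remaining_failures": float(remaining),
        "status": "budget available" if remaining > 0 else "budget burned",
    }


healthy = error_budget("99.9", total_requests=1_000_000, failed_requests=250)
print(healthy["allowed_failures"])    # 1000.0
print(healthy["remaining_failures"])  # 750.0
print(healthy["status"])              # budget available

burned = error_budget("99.9", total_requests=1_000_000, failed_requests=1_400)
print(burned["status"])               # budget burned
```

With a 99.9% SLO and one million valid requests in the window, 1,000 failures are allowed. A team that has used 250 of them still has room to ship, while a team at 1,400 has overspent.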

How to Improve Reliability

Reduce single points of failure

Use redundancy across instances, zones, services, and data stores where needed.

Use timeouts and retries carefully

Set timeouts so calls never wait forever, and prevent retry storms with retry limits, exponential backoff, and jitter.
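
A minimal sketch of a bounded retry with exponential backoff and full jitter might look like the following. `call_with_retries`, its default values, and the `flaky` function are hypothetical names chosen for the example. The operation being called is assumed to enforce its own timeout and to be safe to repeat.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying transient failures a limited number of times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the failure to the caller
            # Exponential backoff, capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait up to the backoff keeps clients from retrying in lockstep.
            sleep(random.uniform(0, backoff))


attempts = {"count": 0}


def flaky():
    """Simulated dependency that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, attempts["count"], len(delays))  # ok 3 2
```

Retrying only makes sense for operations that are safe to repeat, which is one reason retries and idempotency are usually designed together.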

Add circuit breakers

Stop repeatedly calling dependencies that are clearly failing.
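
The idea can be sketched as a small state machine: after a run of consecutive failures the breaker "opens" and rejects calls immediately, then allows a trial call once a cooldown has passed. The class below is a simplified, single-threaded illustration. `CircuitBreaker`, `CircuitOpenError`, and the thresholds are assumptions for the example, not any particular library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency recently failed; failing fast")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("simulated outage")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

print(outcomes)  # ['failed', 'failed', 'rejected']

now[0] = 11.0  # cooldown has passed, so a trial call is allowed
print(breaker.call(lambda: "ok"))  # ok
```

Failing fast protects both sides: callers stop tying up threads on a dependency that is down, and the dependency gets room to recover.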

Gracefully degrade

Keep core user flows working when optional dependencies fail.
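
As a small illustration, the sketch below treats product data as a core dependency and recommendations as an optional one. All function names and the page shape are hypothetical. The point is only that a failure in the optional call produces a reduced page instead of an error.

```python
def render_product_page(product_id, fetch_product, fetch_recommendations):
    """Build a page where recommendations are optional but product data is not."""
    # Core dependency: if this fails, the page cannot be served, so let it raise.
    product = fetch_product(product_id)

    # Optional dependency: on failure, fall back to an empty list and flag it.
    try:
        recommendations = fetch_recommendations(product_id)
        degraded = False
    except Exception:
        recommendations = []
        degraded = True

    return {"product": product, "recommendations": recommendations, "degraded": degraded}


def fetch_product(product_id):
    return {"id": product_id, "name": "Example item"}


def broken_recommendations(product_id):
    raise TimeoutError("simulated recommendation outage")


page = render_product_page("sku-1", fetch_product, broken_recommendations)
print(page["product"]["name"])  # Example item
print(page["recommendations"])  # []
print(page["degraded"])         # True
```

Recording the `degraded` flag in metrics or logs keeps the fallback visible to engineers, so a silently degraded experience does not go unnoticed.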

Protect data correctness

Use transactions, idempotency, constraints, and validation for critical state changes.

Improve observability

Use logs, metrics, traces, dashboards, and alerts to detect and debug failures.

Test failure scenarios

Practice failover, dependency failures, bad deployments, and recovery procedures.

Automate safe deployments

Use canaries, rollbacks, feature flags, and deployment checks.

Reliability vs Resilience vs Fault Tolerance

These terms overlap, but they are useful to separate.

Reliability

The system consistently performs its intended function correctly.

Resilience

The system can absorb failures and recover without total collapse.

Fault tolerance

The system continues operating even when specific components fail.

A Senior-Level Interview Framing

A weaker answer says:

We need reliability, so we add more servers.

A stronger answer says:

I would first define the reliability target using user-focused SLIs and SLOs. Then I would identify the critical user flows, failure modes, and single points of failure. For each failure mode, I would decide whether to prevent it, tolerate it, recover from it, or degrade gracefully. Finally, I would measure reliability with metrics, alerts, error budgets, and incident reviews.

Common Reliability Questions to Ask

  • What is the most important user flow?
  • What does “working correctly” mean for this system?
  • What is the acceptable error rate?
  • What latency matters to users?
  • What data must never be lost?
  • What dependencies can fail?
  • Can the system recover automatically?
  • How do we know when users are impacted?
  • What happens during bad deployments?
  • What can degrade safely, and what cannot?

How to Answer This in an Interview

Reliability means the system performs its intended function correctly and consistently over time. I would define it using SLIs and SLOs, identify critical user journeys, remove single points of failure, protect data correctness, add timeouts/retries/circuit breakers, support graceful degradation, monitor the system, and use error budgets and postmortems to continuously improve.

Common Interview Follow-Ups

Is reliability the same as availability?

No. Availability asks whether the system is usable when needed. Reliability asks whether it performs its intended function correctly and consistently over time.

Can a system be available but unreliable?

Yes. An API may return 200 responses but still return wrong data, lose messages, double-charge users, or behave unpredictably.

What is an SLO?

An SLO is a measurable reliability target, such as 99.9% successful requests over 30 days.

What is an error budget?

An error budget is the amount of failure allowed while still meeting the SLO. It helps teams balance shipping features with improving reliability.

What is the biggest reliability mistake?

Focusing only on uptime while ignoring correctness, data loss, latency, dependency failures, bad deployments, and observability.

Final Takeaway

Reliability is not just “the service is up.” A reliable system does the right thing, consistently, under normal conditions and failure conditions. In interviews, define the target, identify failure modes, protect critical flows, measure user impact, and explain tradeoffs.