Reliability + System Design
Reliability concepts in system design
Reliability means a system performs its intended function correctly and consistently over time. It includes availability, correctness, durability, recovery, observability, and predictable behavior during failures.
The Short Answer
Reliability means the system performs its intended function correctly and consistently over time.
Availability asks, “Can users reach the system?” Reliability asks, “Does the system keep working correctly when users depend on it?”
Availability vs Reliability
Available But Not Reliable
Reliable System
Availability is part of reliability, but reliability also includes correctness, durability, predictable latency, recovery, and safe behavior during failures.
Simple Example
Imagine a payment system. If the payment API is reachable but occasionally double-charges customers, nobody would call it reliable.
Payment API is online
↓
User submits payment
↓
Network timeout happens
↓
Client retries
↓
User is charged twice
Available? Maybe.
Reliable? No.
This is why reliability is not just uptime. It is about whether the system behaves correctly under both normal and failure conditions.
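One common way to make the retry in the flow above safe is an idempotency key: the client sends the same key on every retry of one payment, and the server applies the charge at most once per key. The following is a minimal in-memory sketch, not a production design. `PaymentService`, `charge`, and the `ch_` and `cust_` identifiers are hypothetical names chosen for illustration; a real service would keep the keys in durable storage and make the check-and-record step atomic.

```python
import uuid


class PaymentService:
    """Toy in-memory payment service (hypothetical, for illustration only)."""

    def __init__(self):
        self.charges = []    # every charge that was actually applied
        self._results = {}   # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A retry with a key we have already processed returns the stored
        # result instead of charging the customer a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, customer_id, amount_cents))
        result = {"charge_id": charge_id, "status": "succeeded"}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # the client generates one key per payment

first = service.charge(key, "cust_42", 1999)
# Suppose the response was lost to a network timeout, so the client
# retries with the SAME key.
retry_result = service.charge(key, "cust_42", 1999)

print(len(service.charges))   # 1
print(first == retry_result)  # True
```

The system stays available through the timeout, and the retry no longer breaks the invariant that one payment produces one charge.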
What Reliability Includes
Correctness
The system returns the right result and preserves business invariants.
Availability
The system is usable when users need it.
Durability
Important data is not lost after crashes, restarts, or hardware failures.
Latency predictability
Response times stay within a predictable range, and tail latency does not suddenly spike for a large share of users.
Recoverability
The system can recover quickly after failures.
Fault tolerance
The system can continue operating when some components fail.
Observability
Engineers can detect, debug, and understand failures.
Operational safety
Deployments, config changes, and migrations do not frequently break production.
Reliability Is About Failure Assumptions
Reliable systems are designed with the assumption that things will fail.
- servers crash
- deployments contain bugs
- networks become slow
- databases fail over
- queues build up
- third-party APIs go down
- traffic spikes happen
- humans make mistakes
How Reliability Is Measured
Reliability should be measured from the user's point of view. Common measures include:
Success rate
What percentage of valid requests succeed?
Error rate
How often do users receive failed responses?
Latency
How long do users wait, especially at p95 and p99?
Durability
Was important data preserved correctly?
Recovery time
How long does it take to recover after failure?
Incident frequency
How often do serious production issues happen?
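As a rough illustration of the first few measures, the sketch below computes success rate and latency percentiles from a batch of request records. The sample data is invented, and `percentile` is a hypothetical helper that uses the nearest-rank method; monitoring systems typically compute these from histograms or streaming estimates instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented sample: (succeeded, latency in milliseconds) for 100 valid requests.
requests = [(True, 80)] * 90 + [(True, 400)] * 8 + [(False, 2000)] * 2

success_rate = sum(1 for ok, _ in requests if ok) / len(requests)
latencies = [ms for _, ms in requests]

print(f"success rate: {success_rate:.1%}")     # success rate: 98.0%
print(f"p50: {percentile(latencies, 50)} ms")  # p50: 80 ms
print(f"p95: {percentile(latencies, 95)} ms")  # p95: 400 ms
print(f"p99: {percentile(latencies, 99)} ms")  # p99: 2000 ms
```

In this sample the median looks healthy while p99 is 25 times slower, which is why tail percentiles matter more than typical-case numbers when judging the experience users actually get.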
SLI, SLO, and SLA
These terms show up constantly in reliability discussions.
SLI
Service Level Indicator
The measurement.
Example: request success rate.
SLO
Service Level Objective
The internal target.
Example: 99.9% success rate.
SLA
Service Level Agreement
The external promise.
Example: a contractual commitment to customers, often with penalties if it is missed.
In interviews, SLOs are especially useful because they turn vague reliability requirements into measurable targets.
Error Budgets
An error budget is the amount of unreliability the system is allowed to have while still meeting its SLO.
If SLO = 99.9% success rate
Then error budget = 0.1% failed valid requests
Error budgets are useful because they balance reliability and feature velocity.
Budget Available
Budget Burned
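The arithmetic behind an error budget can be sketched as follows, assuming the SLO is expressed as a success-rate fraction and the request counts are already known. `error_budget` and the traffic numbers are hypothetical; `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def error_budget(slo, total_requests, failed_requests):
    """Return (allowed failures, remaining failures, share of budget burned).

    slo is a success-rate target given as a string such as "0.999".
    """
    allowed = (1 - Fraction(slo)) * total_requests
    if allowed == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    remaining = allowed - failed_requests
    burned = Fraction(failed_requests) / allowed
    return allowed, remaining, burned


# Invented numbers: 1,000,000 valid requests in the window, 400 of them failed.
allowed, remaining, burned = error_budget("0.999", 1_000_000, 400)

print(int(allowed))             # 1000
print(int(remaining))           # 600
print(f"{float(burned):.0%}")   # 40%
```

A positive remainder means budget is still available; a negative one means the budget has been burned and the SLO is missed for that window.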
How to Improve Reliability
Reduce single points of failure
Use redundancy across instances, zones, services, and data stores where needed.
Use timeouts and retries carefully
Avoid waiting forever, but prevent retry storms with limits, backoff, and jitter.
Add circuit breakers
Stop repeatedly calling dependencies that are clearly failing.
Gracefully degrade
Keep core user flows working when optional dependencies fail.
Protect data correctness
Use transactions, idempotency, constraints, and validation for critical state changes.
Improve observability
Use logs, metrics, traces, dashboards, and alerts to detect and debug failures.
Test failure scenarios
Practice failover, dependency failures, bad deployments, and recovery procedures.
Automate safe deployments
Use canaries, rollbacks, feature flags, and deployment checks.
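Two of the techniques above, bounded retries with backoff and jitter and a circuit breaker, can be sketched together. This is a simplified illustration under stated assumptions, not a production implementation: `retry`, `CircuitBreaker`, `CircuitOpenError`, and all thresholds and delays are hypothetical choices, and the breaker is not thread-safe.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is failing; call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def retry(fn, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn, retrying on errors with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not keep retrying a dependency the breaker gave up on
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter spreads retries out over time


calls = {"n": 0}


def flaky():
    """Simulated dependency that times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record the delays instead of actually sleeping
print(retry(flaky, sleep=delays.append))  # ok
print(len(delays))                        # 2
```

The attempt limit and delay cap bound how long a caller waits, the random jitter keeps many clients from retrying in lockstep, and the breaker stops calls entirely once a dependency is clearly failing. Retrying is only safe for operations that are idempotent.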
Reliability vs Resilience vs Fault Tolerance
These terms overlap, but they are useful to separate.
Reliability
The system consistently performs its intended function correctly.
Resilience
The system can absorb failures and recover without total collapse.
Fault tolerance
The system continues operating even when specific components fail.
A Senior-Level Interview Framing
A weaker answer says:
We need reliability, so we add more servers.
A stronger answer says:
Common Reliability Questions to Ask
- What is the most important user flow?
- What does “working correctly” mean for this system?
- What is the acceptable error rate?
- What latency matters to users?
- What data must never be lost?
- What dependencies can fail?
- Can the system recover automatically?
- How do we know when users are impacted?
- What happens during bad deployments?
- What can degrade safely, and what cannot?
How to Answer This in an Interview
Common Interview Follow-Ups
Is reliability the same as availability?
No. Availability asks whether the system is usable when needed. Reliability asks whether it performs its intended function correctly and consistently over time.
Can a system be available but unreliable?
Yes. An API may return 200 responses but still return wrong data, lose messages, double-charge users, or behave unpredictably.
What is an SLO?
An SLO is a measurable reliability target, such as 99.9% successful requests over 30 days.
What is an error budget?
An error budget is the amount of failure allowed while still meeting the SLO. It helps teams balance shipping features with improving reliability.
What is the biggest reliability mistake?
Focusing only on uptime while ignoring correctness, data loss, latency, dependency failures, bad deployments, and observability.