System Design Fundamentals
Availability concepts in system design
Availability means the system can successfully serve users when they need it. This page explains how availability is measured, improved, and discussed in system design interviews.
The Short Answer
Availability means the system is able to serve users successfully when they need it.
In system design interviews, availability is usually about reducing downtime, avoiding single points of failure, and continuing to serve requests even when some machines, services, zones, or dependencies fail.
A Simple Example
Imagine an interview prep website with only one backend server.
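Single Server: if that one server crashes, the whole site is down.
Multiple Servers: if one server crashes, a load balancer can route traffic to the remaining healthy servers.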
This is the basic idea behind high availability: avoid depending on exactly one fragile thing.
Where Availability Matters
Availability matters whenever users, businesses, or other systems depend on your service being reachable.
- Payment systems
- Login and authentication services
- Ad serving systems
- Order management systems
- Messaging and notification systems
- Search and recommendation APIs
- Internal platforms used by many teams
In interviews, the expected availability depends on the product. A banking transfer API and a weekly analytics export do not need the same design.
Availability Is Usually Measured as a Percentage
Availability is often expressed as the percentage of time or requests where the service is working successfully.
Availability = successful service time / total expected service time
In request-based systems, teams often think in terms of successful requests:
Availability = successful requests / total valid requests
This is why people talk about “nines” of availability.
- 99%: ~7.2 hours downtime/month
- 99.9%: ~43 minutes downtime/month
- 99.99%: ~4.3 minutes downtime/month
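The downtime figures above follow directly from the percentage. Here is a minimal sketch of that arithmetic, assuming a 30-day month; the function name `allowed_downtime_minutes` and its parameters are illustrative, not taken from any library.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Return how many minutes of downtime fit inside an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable share of the period is (100 - target) percent.
    return total_minutes * (100 - availability_pct) / 100


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% allows about {minutes:.1f} minutes of downtime per 30 days")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why each step up tends to be much harder to achieve than the last.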
Availability, SLOs, and Error Budgets
A common production way to discuss availability is with an SLO, or Service Level Objective.
Example:
SLO: The API should successfully serve 99.9% of valid requests over 30 days.
The error budget is the allowed failure amount.
99.9% SLO = 0.1% error budget
This is useful because it turns availability into an engineering and product tradeoff.
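Healthy Error Budget: the service is comfortably within its SLO, so the team can keep shipping changes at its normal pace.
Burned Error Budget: the allowed failures are used up, so the team typically slows risky releases and prioritizes reliability work.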
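To make the error budget concrete, here is a small sketch that turns an SLO and request counts into a budget report. The function name, the dictionary keys, and the sample numbers are all hypothetical, chosen only to illustrate the arithmetic.

```python
def error_budget_report(slo_pct: float, total_requests: int, failed_requests: int) -> dict:
    """Compare observed failures against the failures an SLO allows."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # A 99.9% SLO leaves 0.1% of valid requests as the error budget.
    allowed_failures = total_requests * (100 - slo_pct) / 100
    return {
        "availability_pct": 100 * (total_requests - failed_requests) / total_requests,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_exhausted": failed_requests >= allowed_failures,
    }


# Hypothetical window: 1,000,000 valid requests with 400 failures.
report = error_budget_report(slo_pct=99.9, total_requests=1_000_000, failed_requests=400)
print(report)
```

With these sample numbers the budget is roughly 1,000 failed requests, so about 600 remain before the SLO is missed.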
How to Increase Availability
The big idea is to remove single points of failure and make recovery fast.
Run multiple app instances
If one instance crashes, the load balancer can route traffic to healthy instances.
Use health checks
Unhealthy instances should be automatically removed from traffic.
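One way to picture health checking is a pool that probes each instance and takes it out of rotation after several consecutive failures. The sketch below is an assumption-laden illustration: `HealthCheckedPool`, the `probe` callable, and the threshold of three failures are hypothetical, and a real load balancer would run probes on a timer over the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class InstanceState:
    consecutive_failures: int = 0
    in_rotation: bool = True


class HealthCheckedPool:
    """Tracks which instances should currently receive traffic."""

    def __init__(self, instances: Iterable[str], probe: Callable[[str], bool],
                 unhealthy_threshold: int = 3) -> None:
        self._probe = probe  # hypothetical callable: returns True when healthy
        self._threshold = unhealthy_threshold
        self._states: Dict[str, InstanceState] = {name: InstanceState() for name in instances}

    def run_checks(self) -> None:
        """Probe every instance once and update its rotation status."""
        for name, state in self._states.items():
            try:
                healthy = self._probe(name)
            except Exception:
                healthy = False  # a probe that raises counts as a failed check
            if healthy:
                state.consecutive_failures = 0
                state.in_rotation = True
            else:
                state.consecutive_failures += 1
                if state.consecutive_failures >= self._threshold:
                    state.in_rotation = False

    def healthy_instances(self) -> List[str]:
        return [name for name, state in self._states.items() if state.in_rotation]
```

Requiring several consecutive failures before removal is a common design choice because it avoids ejecting an instance over a single slow or dropped probe.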
Deploy across zones
If one availability zone has a power, cooling, or network issue, another zone can continue serving traffic.
Replicate data
Databases, caches, and queues often need replicas so one machine failure does not destroy availability.
Use timeouts and retries carefully
Timeouts prevent threads from waiting forever. Retries can help with transient failures, but aggressive retries can make outages worse by multiplying load on a dependency that is already struggling.
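A careful retry policy usually caps the number of attempts and spaces them out. The following is a minimal sketch using exponential backoff with random jitter. `TransientError` and `call_with_retries` are hypothetical names, and the sketch assumes `operation` enforces its own timeout and raises `TransientError` for failures that are worth retrying.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed on retry."""


def call_with_retries(operation: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 0.1, max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying transient failures a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up so the caller can fail fast or degrade
            # Exponential backoff, capped, with jitter so clients do not retry in lockstep.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, backoff))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the behavior can be exercised without real delays; non-transient exceptions are deliberately not retried.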
Graceful degradation
If recommendations fail, still show the main page. If analytics fail, still let users complete the core action.
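Graceful degradation can be sketched as treating the core data as required and the optional data as best effort. The function and parameter names below (`render_home_page`, `load_products`, `load_recommendations`) are hypothetical stand-ins for real service calls.

```python
from typing import Any, Callable, Dict, List


def render_home_page(load_products: Callable[[], List[Any]],
                     load_recommendations: Callable[[], List[Any]]) -> Dict[str, Any]:
    """Build the page data, keeping the core content even if an optional part fails."""
    # Core content: if this fails, the error propagates because the page has no value without it.
    products = load_products()
    try:
        recommendations = load_recommendations()
        degraded = False
    except Exception:
        # Optional content: fall back to an empty section instead of failing the whole page.
        recommendations = []
        degraded = True
    return {"products": products, "recommendations": recommendations, "degraded": degraded}
```

Returning a `degraded` flag is one possible design choice; it lets the caller log or count degraded responses so the failure stays visible to monitoring.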
Circuit breakers
Stop repeatedly calling a failing dependency so your service does not collapse waiting on it.
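The idea can be sketched as a small state machine: after enough consecutive failures the breaker opens and rejects calls immediately, then allows a trial call once a reset timeout has passed. This is a simplified, single-threaded illustration; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical and not taken from any specific library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of waiting on a dependency known to be failing.
                raise CircuitOpenError("dependency calls are temporarily blocked")
            # Reset timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and serve a fallback response, which pairs naturally with graceful degradation.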
Automated rollback
Bad deployments are a common outage source. Fast rollback improves practical availability.
Availability Zones: The Practical Cloud Building Block
In cloud systems, an availability zone is usually a physically separate datacenter grouping inside a region. The point is that one zone can fail without necessarily taking down every zone in that region.
- Zone A: App + DB replica
- Zone B: App + DB replica
- Zone C: App + DB replica
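A rough way to see why redundancy across zones helps is the standard formula for independent redundant copies: the service is down only when every copy is down at once. The sketch below assumes zone failures are independent, which is an idealization since real zones can share failure causes such as a bad deployment; the function name is illustrative.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures."""
    if not 0 <= single_availability <= 1:
        raise ValueError("single_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must fail simultaneously for the service to be unavailable.
    return 1 - (1 - single_availability) ** copies


for zones in (1, 2, 3):
    print(zones, "zone(s):", redundant_availability(0.99, zones))
```

Under this assumption, three zones that are each 99% available combine to roughly 99.9999%, though correlated failures make real-world numbers lower.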
This is why interviewers like hearing phrases such as multi-AZ deployment, health checks, load balancing, failover, and replication.
Availability Is Not Just “Add More Servers”
Adding more app servers helps, but the system is only as available as its weakest critical dependency.
For example, the app layer can look highly available with several servers behind a load balancer, while one database behind them is still a single point of failure.
Strong availability design looks at the full request path: load balancer, app servers, database, cache, queue, third-party services, DNS, deployment system, and monitoring.
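The "weakest dependency" point can be illustrated with the standard formula for components in series: if a request needs every component on its path, the path's availability is the product of the individual availabilities. The sketch assumes independent failures, and the component figures are made up for illustration.

```python
import math
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a request path that needs every component to be up."""
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    if any(not 0 <= value <= 1 for value in values):
        raise ValueError("each availability must be between 0 and 1")
    # With independent failures, the path works only when all components work.
    return math.prod(values)


# Hypothetical path: redundant app tier, redundant load balancer, single database.
path = {"load_balancer": 0.9999, "app_tier": 0.999999, "database": 0.99}
print(f"end-to-end availability: {serial_availability(path.values()):.6f}")
```

In this made-up example the end-to-end figure lands just under 99%, because the single database dominates no matter how redundant the app tier is.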
Availability Tradeoffs
Higher availability usually creates tradeoffs.
Cost
More replicas, more zones, and more automation cost money.
Complexity
Failover, replication, and distributed state make systems harder to reason about.
Consistency
Highly available distributed systems may need to accept eventual consistency for some operations.
Operational burden
You need monitoring, alerts, incident response, capacity planning, and regular failure testing.
How to Talk About Availability in an Interview
A strong interview answer usually does not start with random technologies. It starts with the requirement.
A good structure:
- What is the availability target?
- Which user flows must stay available?
- What are the single points of failure?
- Can we run multiple instances?
- Can we deploy across availability zones?
- How does failover happen?
- What happens when dependencies fail?
- How do we monitor and alert?
Common Interview Follow-Ups
Is availability the same as reliability?
No. They are related, but not identical. Availability asks whether the system is usable when needed. Reliability is more about operating correctly over time without failure.
Does adding more servers always improve availability?
Only if the rest of the system can handle failure too. If all servers depend on one database, one cache, or one third-party API, those can still be single points of failure.
What is graceful degradation?
Graceful degradation means the system continues to provide reduced but useful functionality when part of it fails. For example, an ecommerce site may hide recommendations but still allow checkout.
What is an error budget?
An error budget is the amount of failure allowed by an SLO. For example, a 99.9% availability target leaves a 0.1% failure budget.
What is the biggest mistake candidates make?
They say 'use load balancers and multiple servers' but forget databases, queues, caches, third-party dependencies, monitoring, failover, and operational tradeoffs.