System Design Fundamentals

Availability concepts in system design

Availability means the system can successfully serve users when they need it. This page explains how availability is measured, improved, and discussed in system design interviews.

The Short Answer

Availability means the system is able to serve users successfully when they need it.

In system design interviews, availability is usually about reducing downtime, avoiding single points of failure, and continuing to serve requests even when some machines, services, zones, or dependencies fail.

A highly available system is not a system that never fails. It is a system designed so that failures do not easily become user-visible outages.

A Simple Example

Imagine an interview prep website with only one backend server.

Single Server

Users → One App Server

If this server dies, the site is down.

Multiple Servers

Users → Load Balancer → App Server 1, App Server 2

If one server dies, traffic can go to the other.

This is the basic idea behind high availability: avoid depending on exactly one fragile thing.

Where Availability Matters

Availability matters whenever users, businesses, or other systems depend on your service being reachable.

Payment systems
Login and authentication services
Ad serving systems
Order management systems
Messaging and notification systems
Search and recommendation APIs
Internal platforms used by many teams

In interviews, the expected availability depends on the product. A banking transfer API and a weekly analytics export do not need the same design.

Availability Is Usually Measured as a Percentage

Availability is often expressed as the percentage of time during which the service works, or the percentage of requests it serves successfully.

Availability = successful service time / total expected service time

In request-based systems, teams often think in terms of successful requests:

Availability = successful requests / total valid requests

This is why people talk about “nines” of availability.

Availability    Approximate downtime per 30-day month
99%             ~7.2 hours
99.9%           ~43 minutes
99.99%          ~4.3 minutes

More nines are not free. Each extra nine usually requires more redundancy, automation, testing, monitoring, operational maturity, and cost.
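The downtime figures above follow directly from the availability formula. As a rough sketch, assuming a 30-day month and a hypothetical helper named `allowed_downtime_minutes`, the arithmetic looks like this:

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return the downtime allowed by an availability target, in minutes.

    Assumption: the period is measured in whole days and the target is a
    time-based percentage (not a request-based one).
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per 30 days")
```

Each extra nine divides the allowed downtime by ten, which is one way to see why the cost of each step rises so quickly.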

Availability, SLOs, and Error Budgets

A common production way to discuss availability is with an SLO, or Service Level Objective.

Example:

SLO: The API should successfully serve 99.9% of valid requests over 30 days.

The error budget is the allowed failure amount.

99.9% SLO = 0.1% error budget

This is useful because it turns availability into an engineering and product tradeoff.

Healthy Error Budget: few failures, so deployments can continue.

Burned Error Budget: too many failures, so slow down releases and improve reliability.
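To make the budget idea concrete, here is a minimal sketch of tracking an error budget for a request-based SLO. The class name `ErrorBudget`, its fields, and the two policy strings are illustrative assumptions, not a standard API; real teams usually define richer policies.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_percent: float     # e.g. 99.9
    total_requests: int    # valid requests seen in the SLO window
    failed_requests: int   # requests that did not succeed

    @property
    def allowed_failures(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return self.total_requests * (1 - self.slo_percent / 100)

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; zero or below means it is burned."""
        if self.allowed_failures == 0:
            return 0.0  # a 100% SLO (or an empty window) leaves no budget
        return 1 - self.failed_requests / self.allowed_failures

    def release_policy(self) -> str:
        """Hypothetical policy: keep shipping only while budget remains."""
        if self.remaining_fraction > 0:
            return "continue deployments"
        return "slow down releases"


if __name__ == "__main__":
    budget = ErrorBudget(slo_percent=99.9, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"remaining budget: {budget.remaining_fraction:.0%}")
    print(budget.release_policy())
```

With one million valid requests, a 99.9% SLO tolerates roughly 1,000 failures, so 400 failures leaves more than half of the budget unspent.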

How to Increase Availability

The big idea is to remove single points of failure and make recovery fast.

Run multiple app instances

If one instance crashes, the load balancer can route traffic to healthy instances.

Use health checks

Unhealthy instances should be automatically removed from traffic.
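As an illustration of how health checks and load balancing fit together, the sketch below rotates requests across instances and skips any that have been reported unhealthy. The `LoadBalancer` class and instance names are hypothetical; production load balancers also probe instances on a schedule, which is left out here.

```python
from itertools import count


class LoadBalancer:
    """Toy round-robin balancer that only routes to healthy instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)  # assume everything starts healthy
        self._counter = count()

    def report_health(self, instance, is_healthy):
        """Record the result of a health check (the check itself is not shown)."""
        if is_healthy:
            self.healthy.add(instance)
        else:
            self.healthy.discard(instance)

    def pick(self):
        """Return the next healthy instance, rotating through the healthy set."""
        candidates = [i for i in self.instances if i in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy instances available")
        return candidates[next(self._counter) % len(candidates)]


if __name__ == "__main__":
    lb = LoadBalancer(["app-1", "app-2"])
    lb.report_health("app-1", False)  # simulated failed health check
    print([lb.pick() for _ in range(3)])  # every request goes to app-2
```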

Deploy across zones

If one availability zone has a power, cooling, or network issue, another zone can continue serving traffic.

Replicate data

Databases, caches, and queues often need replicas so that the failure of one machine does not take the service down.

Use timeouts and retries carefully

Timeouts prevent threads from waiting forever. Retries can help with transient failures, but aggressive retries can make outages worse by adding load to a dependency that is already struggling.
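One common way to keep retries from piling onto a struggling dependency is capped exponential backoff with random jitter. The sketch below is one possible shape, not a definitive implementation: `TransientError`, `call_with_retries`, and the default values are all assumptions, and the operation passed in is expected to enforce its own timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(), retrying transient failures with jittered backoff.

    Assumption: operation() applies its own timeout, so each attempt is bounded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: a bounded retry count avoids retry storms
            # Exponential backoff, capped, with jitter so that many clients
            # do not retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the delay can be replaced in tests; callers would normally leave it at the default.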

Graceful degradation

If recommendations fail, still show the main page. If analytics fail, still let users complete the core action.
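In code, graceful degradation often amounts to treating optional dependencies as optional. A minimal sketch, assuming hypothetical loader functions `load_core_content` and `load_recommendations` that the caller supplies:

```python
def render_home_page(load_core_content, load_recommendations):
    """Build the page, falling back to no recommendations if that call fails."""
    page = {
        "content": load_core_content(),  # core path: failures here still propagate
        "recommendations": [],
        "degraded": False,
    }
    try:
        page["recommendations"] = load_recommendations()
    except Exception:
        # Optional feature failed: serve the main page without it and flag
        # the degradation so it can be monitored.
        page["degraded"] = True
    return page


if __name__ == "__main__":
    def broken_recommendations():
        raise TimeoutError("recommendation service timed out")

    print(render_home_page(lambda: "main feed", broken_recommendations))
```

The design choice is deciding which calls belong inside the `try` block: only features the user can live without should be allowed to fail silently.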

Circuit breakers

Stop repeatedly calling a failing dependency so your service does not collapse waiting on it.
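A circuit breaker can be sketched as a small state machine: after a number of consecutive failures it "opens" and rejects calls immediately, then allows a trial call once a cool-down has passed. The class below is a simplified illustration; the names `CircuitBreaker` and `CircuitOpenError` and the default thresholds are assumptions rather than any particular library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of tying up threads on a dependency that is unlikely to answer.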

Automated rollback

Bad deployments are a common outage source. Fast rollback improves practical availability.

Availability Zones: The Practical Cloud Building Block

In cloud systems, an availability zone is usually one or more physically separate data centers inside a region, typically with their own power, cooling, and networking. The point is that one zone can fail without necessarily taking down every zone in that region.

Zone A: App + DB replica

Zone B: App + DB replica

Zone C: App + DB replica

If Zone A fails, traffic can be routed to Zone B or Zone C.

This is why interviewers like hearing phrases such as multi-AZ deployment, health checks, load balancing, failover, and replication.

Availability Is Not Just “Add More Servers”

Adding more app servers helps, but the system is only as available as its weakest critical dependency.

App Layer Looks Available: App 1, App 2

But Database Is Single Point of Failure: One Primary Database

If this fails, the whole write path may fail.

Strong availability design looks at the full request path: load balancer, app servers, database, cache, queue, third-party services, DNS, deployment system, and monitoring.
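A back-of-the-envelope calculation shows why the weakest critical dependency dominates. The sketch below assumes component failures are independent, which real systems often violate, and the 99% figures are made-up inputs for illustration only.

```python
from math import prod


def serial_availability(components):
    """Availability of a request path where every component must work."""
    return prod(components)


def redundant_availability(single, replicas):
    """Availability of N interchangeable replicas where one survivor is enough.

    Assumption: replicas fail independently of each other.
    """
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    app_tier = redundant_availability(0.99, replicas=2)  # two app servers
    single_db = 0.99                                     # one primary database
    end_to_end = serial_availability([app_tier, single_db])
    print(f"app tier:   {app_tier:.4%}")
    print(f"end to end: {end_to_end:.4%}")
```

Under these assumptions the redundant app tier is better than either server alone, but the end-to-end figure still sits slightly below the single database's 99%, because a path in series can never be more available than its least available required component.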

Availability Tradeoffs

Higher availability usually creates tradeoffs.

Cost

More replicas, more zones, and more automation cost money.

Complexity

Failover, replication, and distributed state make systems harder to reason about.

Consistency

Highly available distributed systems may need to accept eventual consistency for some operations.

Operational burden

You need monitoring, alerts, incident response, capacity planning, and regular failure testing.

How to Talk About Availability in an Interview

A strong interview answer usually does not start with random technologies. It starts with the requirement.

First define how available the system needs to be. Then identify the critical request path. Then remove single points of failure. Then discuss failover, monitoring, graceful degradation, and tradeoffs.

A good structure:

What is the availability target?
Which user flows must stay available?
What are the single points of failure?
Can we run multiple instances?
Can we deploy across availability zones?
How does failover happen?
What happens when dependencies fail?
How do we monitor and alert?

Example Interview Answer

For high availability, I would first define the target, such as 99.9% or 99.99%, because that affects cost and complexity. Then I would avoid single points of failure by running multiple stateless app instances behind a load balancer, using health checks, deploying across availability zones, and replicating critical data. I would also add timeouts, retries with backoff, circuit breakers, monitoring, alerting, and graceful degradation so dependency failures do not automatically become full outages.

Common Interview Follow-Ups

Is availability the same as reliability?

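No. They are related, but not identical. Availability asks whether the system is usable when needed. Reliability is more about operating correctly over time without failure. For example, a service that crashes often but restarts within seconds can show high availability while still being unreliable.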

Does adding more servers always improve availability?

Only if the rest of the system can handle failure too. If all servers depend on one database, one cache, or one third-party API, those can still be single points of failure.

What is graceful degradation?

Graceful degradation means the system continues to provide reduced but useful functionality when part of it fails. For example, an ecommerce site may hide recommendations but still allow checkout.

What is an error budget?

An error budget is the amount of failure allowed by an SLO. For example, a 99.9% availability target leaves a 0.1% failure budget.

What is the biggest mistake candidates make?

They say “use load balancers and multiple servers” but forget databases, queues, caches, third-party dependencies, monitoring, failover, and operational tradeoffs.

Final Takeaway

Availability is about keeping the system usable despite failures. In interviews, mention the target, remove single points of failure, deploy redundancy across instances and zones, plan failover, handle dependency failures, and measure success with SLOs and error budgets.