System Design + Performance

What are tail latency and p99 latency?

In system design, latency is the time from when a client sends a request to when it receives the response from the server. Tail latency describes the slowest requests in a system. p99 latency is the value that 99% of requests complete at or below; the slowest 1% take at least that long.


The Short Answer

Tail latency is about the slowest requests in your system, not the typical request.

p99 latency means 99% of requests completed at or below that latency, and the slowest 1% took at least that long.

The key idea: average latency tells you what “usually” happens. p99 tells you what your unlucky users experience.

The Real Problem

Suppose your average latency is 80 ms. That sounds great.

But if 1% of requests take 2 seconds, then many real users still have a bad experience.

Google's SRE book makes the same point: an average can look stable while tail latency changes significantly, and p99 latency can give an early signal of saturation.

Source: Google SRE Book - Service Level Objectives

Tail latency matters because users do not experience your average. Each user experiences their own request.

p50, p95, p99: What Do They Mean?

Percentiles describe the distribution of request latencies.

| Metric | Meaning | How to Think About It |
|--------|---------|-----------------------|
| p50 | 50% of requests are faster than this | The typical request |
| p95 | 95% of requests are faster than this | A degraded minority |
| p99 | 99% of requests are faster than this | The slowest 1% of user experiences |

Small Example: Why Average Can Lie

Imagine 100 requests with the following latency distribution:

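| Request Group | Number of Requests | Latency | User Experience |
|---------------|--------------------|---------|-----------------|
| Fast | 90 | 50 ms | Excellent |
| Somewhat slow | 9 | 200 ms | Noticeable |
| Very slow | 1 | 2,000 ms | Bad |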

The average is:

(90 * 50ms + 9 * 200ms + 1 * 2000ms) / 100
= 83ms average latency

An 83 ms average looks healthy, but one user waited 2 seconds.

Average latency can hide the tail. p95 and p99 show whether a meaningful minority of users are suffering.

What p99 Means in That Example

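If we sort the 100 request latencies from fastest to slowest, p99 is the value at about the 99th position. Here that is roughly 200 ms, far above the 50 ms median. The single 2,000 ms request sits above p99 and shows up only in the maximum or in a higher percentile.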

Requests 1-90: 50 ms
Requests 91-99: 200 ms
Request 100: 2000 ms   <- tail lives here

The exact percentile calculation can vary slightly by tool, but the intuition is the same: p99 focuses on the slow end of the request distribution.
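To make the example concrete, here is a minimal Python sketch that computes the mean and a few percentiles for the same 100 requests. It assumes the nearest-rank definition of a percentile, which is only one of several in use; the helper name `nearest_rank_percentile` is illustrative and not taken from any monitoring tool.

```python
import math
from statistics import mean


def nearest_rank_percentile(samples, pct):
    """Smallest sample such that at least pct% of all samples are <= it."""
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to keep the arithmetic exact here.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# The 100 requests from the example, in milliseconds.
latencies_ms = [50] * 90 + [200] * 9 + [2000]

summary = {
    "mean": mean(latencies_ms),
    "p50": nearest_rank_percentile(latencies_ms, 50),
    "p95": nearest_rank_percentile(latencies_ms, 95),
    "p99": nearest_rank_percentile(latencies_ms, 99),
    "max": max(latencies_ms),
}
print(summary)
```

Under this definition the mean is 83 ms, p50 is 50 ms, and both p95 and p99 are 200 ms, while the 2,000 ms request appears only as the maximum. Tools that interpolate between neighboring samples will report a slightly higher p99 for the same data.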

Why Tail Latency Gets Worse in Distributed Systems

Tail latency becomes more important when one user request fans out to multiple services.

User request
    ↓
API service
    ↓
calls: user service, payment service, inventory service, recommendation service

Even if each dependency is usually fast, the overall request can be slowed down by one unlucky slow dependency.

Simple System

One request depends on one service. Fewer places for latency to spike.

Fan-Out System

One request depends on many services. One slow dependency can make the whole request slow.
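A rough way to see the fan-out effect is to ask how often at least one downstream call lands in its own slowest 1%. The sketch below assumes the request waits for every call and that the calls are slow independently of each other, which real systems only approximate; `prob_any_slow` is a hypothetical helper written for this illustration.

```python
def prob_any_slow(fan_out: int, per_call_slow_prob: float = 0.01) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1 - (1 - per_call_slow_prob) ** fan_out


for n in (1, 4, 10, 100):
    share = prob_any_slow(n)
    print(f"{n:>3} calls -> {share:.1%} of requests wait on a p99-slow call")
```

With the four services in the diagram above, about 3.9% of user requests wait on at least one call that is in its dependency's slowest 1%. At a fan-out of 100, that rises to roughly 63%, so the dependency's p99 becomes the common case for the user.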

Common Causes of Tail Latency

  • garbage collection pauses
  • database slow queries
  • connection pool exhaustion
  • thread pool saturation
  • queue buildup
  • cache misses
  • noisy neighbors on shared infrastructure
  • network retries or packet loss
  • large payloads
  • hot partitions or overloaded shards

Tail latency is often a symptom of saturation: some resource is running out, and only some requests hit the bad path.

Timeouts and p99

Tail latency also affects timeout settings. The Amazon Builders' Library describes choosing timeouts by starting from downstream latency metrics and selecting an acceptable false-timeout rate, then using the matching high percentile, such as p99.9 for a 0.1% rate.

Source: Amazon Builders' Library - Timeouts, Retries, and Backoff with Jitter

For example, if a downstream service has p99.9 latency of 300 ms, a client timeout of 50 ms may be too aggressive and cause unnecessary failures.

Timeouts should be based on real latency distributions, not guesses.
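As a sketch of that approach, the Python below picks a timeout from recorded downstream latencies for a chosen false-timeout rate, and then measures how many of those requests a given timeout would have cut off. The sample data is made up so that p99.9 comes out at 300 ms, and both function names are hypothetical.

```python
import math


def timeout_from_samples(latencies_ms, false_timeout_rate=0.001):
    """Smallest observed latency that at most `false_timeout_rate` of samples exceed."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 <= false_timeout_rate < 1:
        raise ValueError("false_timeout_rate must be in [0, 1)")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Number of samples allowed to exceed the timeout (epsilon guards float rounding).
    allowed = math.floor(false_timeout_rate * n + 1e-9)
    return ordered[max(n - allowed - 1, 0)]


def false_timeout_fraction(latencies_ms, timeout_ms):
    """Fraction of recorded requests that would have exceeded `timeout_ms`."""
    return sum(1 for ms in latencies_ms if ms > timeout_ms) / len(latencies_ms)


# Made-up downstream samples: mostly fast, with a small slow tail.
samples = [20] * 9_900 + [120] * 80 + [300] * 20

print(timeout_from_samples(samples))        # timeout for a 0.1% false-timeout rate
print(false_timeout_fraction(samples, 50))  # share cut off by a 50 ms timeout
```

For this data the 0.1% target yields a 300 ms timeout, while a 50 ms timeout would fail 1% of requests that would otherwise have succeeded. In practice you would likely add some headroom on top of the chosen percentile.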

How to Improve Tail Latency

Reduce Work

Optimize slow queries, reduce payload sizes, avoid unnecessary downstream calls, and cache expensive reads.

Control Queues

Watch thread pools, connection pools, Kafka consumers, and request queues. Long queues create long tails.
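A back-of-the-envelope estimate shows why queue depth translates directly into latency. The sketch below assumes every worker is busy and every request takes the same service time, which is a simplification; `queue_wait_s` is a hypothetical helper, not part of any library.

```python
def queue_wait_s(queued_requests: int, workers: int, service_time_s: float) -> float:
    """Rough wait for a request that arrives behind `queued_requests` others."""
    drain_rate = workers / service_time_s  # requests completed per second
    return queued_requests / drain_rate


# 10 workers at 50 ms per request drain about 200 requests per second.
print(queue_wait_s(queued_requests=0, workers=10, service_time_s=0.05))
print(queue_wait_s(queued_requests=400, workers=10, service_time_s=0.05))
```

With an empty queue the wait is zero, but a request arriving behind 400 others waits about 2 seconds before any work starts, even though each request only needs 50 ms of service time.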

Use Timeouts

Avoid waiting forever on slow dependencies. Set timeouts based on latency percentiles.

Use Retries Carefully

Retries can help transient failures, but they can also amplify load and make tail latency worse.

Add Jitter

Jitter prevents many clients from retrying at the same time.
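One common way to combine bounded retries with jitter is exponential backoff with "full jitter", where each wait is drawn uniformly between zero and an exponentially growing ceiling. The following is a minimal sketch under assumed defaults (four attempts, 100 ms base, 2 s cap); the function names are illustrative, and real code should retry only on errors known to be transient.

```python
import random
import time


def backoff_delays(max_attempts=4, base_s=0.1, cap_s=2.0, rng=random):
    """Yield one full-jitter sleep time for each retry after the first attempt."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        yield rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Run operation(); on failure, sleep with jitter and retry up to max_attempts."""
    delays = backoff_delays(max_attempts)
    while True:
        try:
            return operation()
        except Exception:  # sketch only: narrow this to transient errors
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted, surface the last error
            sleep(delay)


# Each run prints different delays, which is what spreads clients apart.
print([round(d, 3) for d in backoff_delays()])
```

Capping the number of attempts limits how much extra load retries can add, and the random delay keeps clients that failed together from retrying together.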

Reduce Fan-Out

Fewer downstream calls means fewer chances for one slow dependency to dominate the request.

The Interview-Friendly Explanation

Tail latency measures the slowest requests in a system. p99 means 99% of requests are faster than that value, while the slowest 1% are at or above it. Average latency can hide bad user experiences, especially in distributed systems where one slow dependency can slow down the whole request. Senior engineers monitor p95/p99, not only averages, and use them to guide timeouts, capacity planning, caching, queue control, and dependency optimization.

Common Interview Follow-Ups

Why is average latency misleading?

Because it can hide a small but important group of very slow requests. Users experience individual requests, not the average.

What does p99 latency mean?

p99 means 99% of requests completed at or below that latency, and roughly the slowest 1% took that long or longer.

Why does fan-out make tail latency worse?

If one request depends on many downstream calls, the slowest dependency can dominate the overall response time.

How can you reduce p99 latency?

Common approaches include fixing slow queries, reducing fan-out, caching, controlling queues, tuning thread pools, setting sane timeouts, and avoiding retry storms.

Why are retries dangerous for tail latency?

Retries can help transient failures, but under overload they add more traffic to an already struggling system, which can make latency and failure rates worse.

Final Takeaway

p99 is about empathy for the unlucky user. A system can look fast on average while still being painfully slow for thousands of real requests at scale.