What are tail latency and p99 latency?
In system design, latency is the time from when a client sends a request to when it receives the result back from the server. Tail latency describes the slowest requests in a system. p99 latency means 99% of requests are faster than this value, while the slowest 1% are at or above it.
The Short Answer
Tail latency is about the slowest requests in your system, not the typical request.
p99 latency means 99% of requests completed at or below that latency, and the slowest 1% took at least that long.
The Real Problem
Suppose your average latency is 80 ms. That sounds great.
But if 1% of requests take 2 seconds, then many real users still have a bad experience.
Google's SRE material gives exactly this warning: averages can look stable while tail latency changes significantly, and p99 latency can reveal saturation early.
Source: Google SRE Book - Service Level Objectives
p50, p95, p99: What Do They Mean?
Percentiles describe the distribution of request latencies.
| Metric | Meaning | How to Think About It |
|---|---|---|
| p50 | 50% of requests complete at or below this | The typical request |
| p95 | 95% of requests complete at or below this | Where the slowest 5% of requests begin |
| p99 | 99% of requests complete at or below this | Where the slowest 1% of requests begin |
Small Example: Why Average Can Lie
Imagine 100 requests with the following latency distribution:
| Request Group | Number of Requests | Latency | User Experience |
|---|---|---|---|
| Fast | 90 | 50 ms | Excellent |
| Somewhat slow | 9 | 200 ms | Noticeable |
| Very slow | 1 | 2,000 ms | Bad |
The average is:
(90 * 50 ms + 9 * 200 ms + 1 * 2000 ms) / 100
= 83 ms average latency
An 83 ms average looks healthy, but one user waited 2 seconds.
What p99 Means in That Example
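If we sort the 100 request latencies from fastest to slowest, p99 is the value at roughly the 99th position. Here that value is 200 ms, more than twice the average; the single 2,000 ms request only shows up at the maximum.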
The exact percentile calculation can vary slightly by tool, but the intuition is the same: p99 focuses on the slow end of the request distribution.
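To make the comparison concrete, here is a minimal Python sketch that rebuilds the 100 requests from the table and computes the mean alongside a few percentiles. The two helper functions are illustrative names, not any monitoring tool's API: one uses the nearest-rank definition and the other uses linear interpolation, which are two of the conventions tools commonly choose between.

```python
import math
import statistics

# The 100 requests from the example table: 90 fast, 9 somewhat slow, 1 very slow.
latencies_ms = [50] * 90 + [200] * 9 + [2000]


def nearest_rank_percentile(values, pct):
    """Return the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def interpolated_percentile(values, pct):
    """Linearly interpolate between the two samples closest to the percentile."""
    ordered = sorted(values)
    position = pct * (len(ordered) - 1) / 100  # 0-based fractional index
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


mean_ms = statistics.mean(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p95_ms = nearest_rank_percentile(latencies_ms, 95)
p99_ms = nearest_rank_percentile(latencies_ms, 99)
p99_interp_ms = interpolated_percentile(latencies_ms, 99)
max_ms = max(latencies_ms)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms} "
      f"p99(interpolated)={p99_interp_ms:.0f} max={max_ms}")
```

With this data the mean is 83 ms, p50 is 50 ms, and nearest-rank p95 and p99 are both 200 ms. The interpolated p99 comes out near 218 ms, which is the kind of small tool-to-tool difference mentioned above. The 2,000 ms request appears only in the maximum.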
Why Tail Latency Gets Worse in Distributed Systems
Tail latency becomes more important when one user request fans out to multiple services.
User request
↓
API service
↓
calls: user service, payment service, inventory service, recommendation service
Even if each dependency is usually fast, the overall request can be slowed down by one unlucky slow dependency.
Simple System
Fan-Out System
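The difference between a single call and a fan-out can be estimated with a little probability. The sketch below assumes every downstream call independently has a 1% chance of being slow (that is, of exceeding its own p99). Real dependencies are rarely fully independent, so treat the numbers as a rough guide. The function name and the fan-out values are illustrative.

```python
def chance_of_slow_request(fan_out: int, per_call_slow_probability: float = 0.01) -> float:
    """Probability that at least one of `fan_out` parallel calls is slow.

    Assumes the calls are independent, which real systems only approximate.
    """
    return 1 - (1 - per_call_slow_probability) ** fan_out


for fan_out in (1, 4, 10, 100):
    share = chance_of_slow_request(fan_out)
    print(f"fan-out {fan_out:>3}: {share:.1%} of requests wait on a slow call")
```

Under these assumptions a request that calls one service is slow about 1% of the time, a request that calls four services (as in the diagram above) about 3.9% of the time, and a request that calls 100 services about 63% of the time. The tail of each dependency becomes the common case for the caller.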
Common Causes of Tail Latency
- garbage collection pauses
- database slow queries
- connection pool exhaustion
- thread pool saturation
- queue buildup
- cache misses
- noisy neighbors on shared infrastructure
- network retries or packet loss
- large payloads
- hot partitions or overloaded shards
Timeouts and p99
Tail latency also affects timeout settings. The Amazon Builders' Library describes choosing timeouts by starting from downstream latency metrics and selecting an acceptable false-timeout rate, then using the corresponding high percentile, such as p99.9.
Source: Amazon Builders' Library - Timeouts, Retries, and Backoff with Jitter
For example, if a downstream service has p99.9 latency of 300 ms, a client timeout of 50 ms may be too aggressive and cause unnecessary failures.
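A rough Python sketch of that approach follows: pick an acceptable false-timeout rate, then read the matching percentile off recorded downstream latencies. The sample data is invented for illustration, and `pick_timeout_ms` and `false_timeout_rate` are hypothetical helper names, not part of any AWS tooling.

```python
import math
from fractions import Fraction


def pick_timeout_ms(latencies_ms, acceptable_false_timeout_rate=Fraction(1, 1000)):
    """Return the latency that all but the acceptable fraction of samples stay at or under."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: a 1/1000 false-timeout rate corresponds to p99.9.
    # Fraction keeps the rank arithmetic exact.
    rank = math.ceil((1 - acceptable_false_timeout_rate) * len(ordered))
    return ordered[max(rank, 1) - 1]


def false_timeout_rate(latencies_ms, timeout_ms):
    """Fraction of sampled calls that this timeout would have cut off."""
    timed_out = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return timed_out / len(latencies_ms)


# Hypothetical downstream samples, invented for illustration only.
downstream_ms = [20] * 9000 + [80] * 900 + [250] * 80 + [300] * 15 + [900] * 5

suggested_ms = pick_timeout_ms(downstream_ms)
print(f"suggested timeout: {suggested_ms} ms")
print(f"false timeouts at 50 ms: {false_timeout_rate(downstream_ms, 50):.2%}")
print(f"false timeouts at {suggested_ms} ms: {false_timeout_rate(downstream_ms, suggested_ms):.2%}")
```

On this made-up data the suggested timeout is 300 ms, which cuts off 0.05% of calls, while a 50 ms timeout would cut off 10% of calls that would otherwise have succeeded. A production setting would usually also add some headroom above the measured percentile.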
How to Improve Tail Latency
Reduce Work
Control Queues
Use Timeouts
Use Retries Carefully
Add Jitter
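One common way to add jitter is to randomize the delay between retries so that clients that failed together do not retry together. The sketch below uses the variant often called "full jitter": a uniformly random delay between zero and a capped exponential backoff. The base delay, cap, and function name are assumptions chosen for illustration.

```python
import random


def backoff_with_full_jitter(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Pick a random delay between 0 and the capped exponential backoff for this attempt."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(5):
    delay = backoff_with_full_jitter(attempt, rng=rng)
    print(f"attempt {attempt}: sleep {delay:.3f}s before retrying")
```

The cap keeps the wait bounded after many failures, and the randomness spreads retries over time instead of sending them in synchronized bursts.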
Reduce Fan-Out
The Interview-Friendly Explanation
Common Interview Follow-Ups
Why is average latency misleading?
Because it can hide a small but important group of very slow requests. Users experience individual requests, not the average.
What does p99 latency mean?
p99 means 99% of requests completed at or below that latency, and roughly the slowest 1% took that long or longer.
Why does fan-out make tail latency worse?
If one request depends on many downstream calls, the slowest dependency can dominate the overall response time.
How can you reduce p99 latency?
Common approaches include fixing slow queries, reducing fan-out, caching, controlling queues, tuning thread pools, setting sane timeouts, and avoiding retry storms.
Why are retries dangerous for tail latency?
Retries can help with transient failures, but under overload they add more traffic to an already struggling system, which can make latency and failure rates worse.
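A quick back-of-the-envelope calculation shows how much extra traffic retries can add. The sketch below assumes each attempt fails independently with the same probability, which is optimistic during a real overload, and that every layer in a call chain retries on its own. Both function names are illustrative.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average attempts per request when each failed attempt is retried up to max_retries times."""
    # Attempt k+1 happens only if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Load multiplier at the deepest layer when every layer uses all of its attempts."""
    return attempts_per_layer ** layers


for failure_rate in (0.01, 0.5, 0.9):
    attempts = expected_attempts(failure_rate, max_retries=3)
    print(f"failure rate {failure_rate:.0%}: {attempts:.2f} attempts per request")

print(f"3 layers x 4 attempts each: up to {worst_case_amplification(4, 3)}x load at the bottom")
```

When failures are rare, retries add about 1% more traffic. When 90% of attempts fail, the same policy sends roughly 3.4 times the traffic, and three stacked layers that each make up to four attempts can multiply load on the deepest service by as much as 64.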