Timeouts and retries in system design
Timeouts prevent services from waiting forever, while retries help recover from temporary failures. Used badly, retries can overload dependencies and cause cascading failures.
The Short Answer
Timeouts prevent your service from waiting forever on a slow dependency.
Retries help recover from temporary failures, such as a brief network issue or a dependency that is momentarily unavailable.
The Real Problem
Imagine an Order Service calling a Payment Service.
[Diagram: Customer, Order Service, Payment Service]
If the Payment Service slows down and the Order Service has no timeout, the Order Service keeps waiting. Threads stay occupied, request queues grow, latency increases, and eventually the Order Service may become unhealthy too.
Timeouts: Fail Instead of Waiting Forever
A timeout sets a maximum amount of time you are willing to wait for a dependency.
Order Service calls Payment Service
No timeout:
wait... wait... wait...
With timeout:
wait up to 300ms
then fail fast or use a fallback
This protects your own service resources. The goal is not to make the dependency healthy. The goal is to avoid letting one slow dependency consume all of your threads.
Simple Java Timeout Example
Here is a simple example using CompletableFuture. If the work does not complete within the timeout, the call fails.
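import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExample {
    public static void main(String[] args) {
        // Simulate a slow payment call that takes 2 seconds.
        CompletableFuture<String> paymentCall =
            CompletableFuture.supplyAsync(() -> {
                sleep(2_000);
                return "payment approved";
            });

        try {
            // orTimeout (Java 9+) fails the future if it is not done within 500 ms.
            // It stops the caller from waiting; it does not cancel the background task.
            String result = paymentCall
                .orTimeout(500, TimeUnit.MILLISECONDS)
                .join();
            System.out.println(result);
        } catch (CompletionException ex) {
            // join() wraps the underlying failure, so check the cause.
            if (ex.getCause() instanceof TimeoutException) {
                System.out.println("Payment service timed out");
            } else {
                throw ex;
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
In real systems, HTTP clients usually have their own connection and request timeout settings. The concept is the same: do not wait forever.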
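As one illustration of client-level timeout settings, here is a minimal sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+). The URL and the 200 ms and 500 ms values are assumptions chosen for the example, not recommendations.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class HttpTimeoutConfig {
    // Hypothetical endpoint, used only for illustration.
    static final URI PAYMENT_URI = URI.create("https://payments.example.com/charge");

    // Connection timeout: the longest we are willing to wait to establish a connection.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))
                .build();
    }

    // Request timeout: the longest we are willing to wait for a response to this request.
    static HttpRequest newPaymentRequest() {
        return HttpRequest.newBuilder(PAYMENT_URI)
                .timeout(Duration.ofMillis(500))
                .GET()
                .build();
    }
}
```

Sending the request with `newClient().send(newPaymentRequest(), HttpResponse.BodyHandlers.ofString())` would then fail with an `HttpTimeoutException` instead of waiting indefinitely when the response does not arrive in time.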
Retries: Useful, But Dangerous
A retry means: if the first call fails, try again.
Retries are useful for transient failures:
- brief network blip
- temporary service restart
- short database failover
- HTTP 429 throttling response
- temporary 503 from a dependency
But retries are dangerous when the downstream service is already overloaded.
[Diagram: retry traffic without a retry limit compared with bounded retries]
Bad Retry Example
This is the kind of retry logic you do not want in production.
while (true) {
    try {
        return paymentClient.charge(request);
    } catch (Exception ex) {
        // Try again immediately forever
    }
}
This can hammer an already struggling dependency. It also gives the caller no clear failure boundary.
Better Retry Strategy
A safer retry strategy usually has:
- a maximum number of attempts
- a timeout per attempt
- backoff between attempts
- jitter to avoid retry storms
- idempotency for operations that change state
Exponential Backoff
Exponential backoff means each retry waits longer than the previous retry.
Attempt 1: call immediately
Attempt 2: wait 100ms
Attempt 3: wait 200ms
Attempt 4: wait 400ms
Attempt 5: wait 800ms
This gives the downstream service time to recover instead of immediately flooding it with more requests.
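The schedule above can be computed with a small helper. This is a sketch: the class name, the method name, and the `maxDelayMillis` cap are assumptions added for illustration (a cap is a common safeguard so delays do not grow without bound).

```java
final class Backoff {
    // Delay to wait before the given attempt. Attempt 1 is the first call, so it waits 0 ms.
    // Each later attempt doubles the previous delay, up to maxDelayMillis.
    static long delayBeforeAttempt(int attempt, long baseDelayMillis, long maxDelayMillis) {
        if (attempt <= 1) {
            return 0;
        }
        // Clamp the exponent so the multiplication cannot overflow for small base delays.
        int exponent = Math.min(attempt - 2, 30);
        long delay = baseDelayMillis * (1L << exponent);
        return Math.min(delay, maxDelayMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("Attempt " + attempt + ": wait "
                    + delayBeforeAttempt(attempt, 100, 10_000) + "ms");
        }
    }
}
```

With a base delay of 100 ms this yields waits of 0, 100, 200, 400, and 800 ms for attempts 1 through 5, matching the schedule above.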
Jitter: Avoid Everyone Retrying Together
If thousands of clients all retry at exactly the same schedule, they can create waves of traffic.
Bad:
Client A retries after 100ms
Client B retries after 100ms
Client C retries after 100ms
Better:
Client A retries after 83ms
Client B retries after 141ms
Client C retries after 117ms
Jitter adds randomness to retry delays so clients do not all retry at the exact same time.
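There is more than one way to add jitter. One common variant, often called "full jitter", picks a random delay anywhere between zero and the current exponential delay. The sketch below assumes that variant; the class and method names are made up for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

final class FullJitter {
    // Returns a random delay in the range [0, exponentialDelayMillis].
    static long jitteredDelay(long exponentialDelayMillis) {
        if (exponentialDelayMillis < 0) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        // nextLong(origin, bound) excludes the bound, so add 1 to make the range inclusive.
        return ThreadLocalRandom.current().nextLong(0, exponentialDelayMillis + 1);
    }
}
```

Compared with adding a small random amount on top of a fixed delay, this spreads clients across the whole interval, at the cost of some retries happening sooner than the nominal backoff.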
Simple Java Retry With Backoff Example
This is a simplified copy-pasteable example. Real systems usually use libraries, HTTP client settings, or resilience frameworks, but the core idea is the same.
import java.util.Random;

public class RetryWithBackoffExample {
    private static final Random random = new Random();

    public static void main(String[] args) throws InterruptedException {
        String result = callWithRetries();
        System.out.println(result);
    }

    static String callWithRetries() throws InterruptedException {
        int maxAttempts = 4;
        long baseDelayMillis = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return callDependency();
            } catch (RuntimeException ex) {
                // Out of attempts: surface the failure to the caller.
                if (attempt == maxAttempts) {
                    throw ex;
                }
                // Backoff doubles after each failure: 100ms, 200ms, 400ms.
                long exponentialDelay =
                    baseDelayMillis * (1L << (attempt - 1));
                // Up to 49ms of random jitter. nextInt(bound) works on every Java version;
                // Random.nextLong(bound) needs Java 17 or later.
                long jitter =
                    random.nextInt(50);
                long sleepMillis =
                    exponentialDelay + jitter;
                Thread.sleep(sleepMillis);
            }
        }
        throw new IllegalStateException("unreachable");
    }

    // Simulated dependency that fails about half the time.
    static String callDependency() {
        if (random.nextBoolean()) {
            throw new RuntimeException("temporary failure");
        }
        return "success";
    }
}
Notice the important parts: retries are limited, delay increases, and jitter spreads retry traffic out.
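The checklist for a safer retry strategy also includes a timeout per attempt. The sketch below combines a bounded retry loop with a per-attempt timeout based on `CompletableFuture.orTimeout`. The class name, method name, and parameters are assumptions for illustration, and for brevity it retries every failure rather than only retryable ones.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BoundedRetry {
    // Calls the dependency up to maxAttempts times. Each attempt has its own timeout,
    // and failed attempts are separated by exponential backoff plus random jitter.
    static <T> T call(Supplier<T> dependency, int maxAttempts,
                      long attemptTimeoutMillis, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        CompletionException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // orTimeout stops the caller from waiting; it does not cancel the background work.
                return CompletableFuture.supplyAsync(dependency)
                        .orTimeout(attemptTimeoutMillis, TimeUnit.MILLISECONDS)
                        .join();
            } catch (CompletionException ex) {
                // Covers both timeouts and exceptions thrown by the dependency.
                lastFailure = ex;
            }
            if (attempt < maxAttempts) {
                long backoff = baseDelayMillis * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(baseDelayMillis / 2 + 1);
                sleep(backoff + jitter);
            }
        }
        throw lastFailure;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ex);
        }
    }
}
```

A call such as `BoundedRetry.call(() -> paymentClient.charge(request), 3, 250, 50)` would make at most three attempts of up to 250 ms each; `paymentClient` here is the same hypothetical client used in the bad retry example.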
Idempotency Matters
Retrying a read request is usually safer than retrying a write request.
GET /products/123
Usually safe to retry.
But retrying this can be dangerous:
POST /payments
If retried incorrectly, the user may be charged twice.
For operations that modify state, retries should usually be paired with idempotency keys.
POST /payments
Idempotency-Key: abc-123
Timeouts and Retries Work Together
Timeouts and retries should be designed together.
Bad:
Timeout = 10 seconds
Attempts = 3
Worst case:
request waits around 30 seconds before failing
A better approach is to think about the total request budget.
User-facing API budget: 800ms
Attempt 1 timeout: 250ms
Retry delay: 50ms
Attempt 2 timeout: 250ms
Retry delay: 100ms
Attempt 3 timeout: 150ms
Still fits inside the user's expected latency budget.
This is the senior-level point: retries are not free. They consume time, threads, network capacity, and downstream capacity.
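The budget arithmetic is easy to check in code. The helper below is a sketch with made-up names: it sums the worst case for a retry plan and clamps the next attempt's timeout to whatever budget remains.

```java
final class RetryBudget {
    // Worst case: every attempt runs until its timeout and every backoff delay is waited in full.
    static long worstCaseMillis(long[] attemptTimeoutsMillis, long[] delaysBetweenAttemptsMillis) {
        long total = 0;
        for (long timeout : attemptTimeoutsMillis) {
            total += timeout;
        }
        for (long delay : delaysBetweenAttemptsMillis) {
            total += delay;
        }
        return total;
    }

    // Timeout for the next attempt: the preferred value, but never more than the remaining budget.
    static long nextAttemptTimeoutMillis(long preferredTimeoutMillis, long remainingBudgetMillis) {
        return Math.max(0, Math.min(preferredTimeoutMillis, remainingBudgetMillis));
    }
}
```

For the plan above, the worst case is 250 + 50 + 250 + 100 + 150 = 800 ms, exactly the budget, while three 10-second attempts add up to 30,000 ms. After two full attempts and both delays, 150 ms of budget remains, which is why the third attempt is capped at 150 ms rather than 250 ms.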
What Should Be Retried?
- Usually retry: temporary network failures, timeouts, 429 throttling responses, and some 503 service unavailable responses.
- Usually do not retry: validation errors, authentication failures, authorization failures, and most 4xx client errors.
- Be careful retrying: payments, order creation, inventory changes, and anything with side effects.
- Use async retries: for background jobs, queues are often better than making the user wait through many retry attempts.
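These rules of thumb can be turned into a small classifier. The sketch below is deliberately coarse and rests on assumptions: it treats only 429 and 503 as retryable status codes and treats any `IOException` or `TimeoutException` as a temporary network failure. A real policy would likely be tuned per dependency.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

final class RetryPolicy {
    // 429 (throttled) and 503 (service unavailable) are often transient.
    // Everything else, including other 4xx client errors, is treated as not retryable here.
    static boolean isRetryableStatus(int statusCode) {
        return statusCode == 429 || statusCode == 503;
    }

    // Assumption: I/O failures and timeouts are temporary, while other exceptions
    // (for example validation errors) will fail the same way on every attempt.
    static boolean isRetryableFailure(Throwable failure) {
        return failure instanceof IOException || failure instanceof TimeoutException;
    }
}
```

Even when a failure is classified as retryable, operations with side effects still need the idempotency safeguards described earlier before they are retried.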
How This Connects to Circuit Breakers
If a dependency is clearly failing, retries may make the situation worse. That is where a circuit breaker helps.
Dependency starts failing
↓
Retries increase traffic
↓
Dependency gets more overloaded
↓
More failures
↓
Circuit breaker opens
↓
Fail fast instead of continuing to hammer dependency
Timeouts and retries are often the first layer. Circuit breakers are another layer that protects the system when failure is no longer temporary.
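To make the idea concrete, here is a deliberately minimal circuit breaker sketch. It opens after a number of consecutive failures, fails fast while open, and lets calls through again after a cool-down. The class, its parameters, and the injected clock are assumptions for illustration; production systems usually rely on a resilience library, which also limits how many trial calls run after the cool-down.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openDurationMillis;
    private final LongSupplier clockMillis; // injected so the cool-down can be tested
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long openDurationMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
        this.clockMillis = clockMillis;
    }

    <T> T call(Supplier<T> dependency) {
        if (!allowRequest()) {
            // Fail fast without touching the struggling dependency.
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = dependency.get();
            recordSuccess();
            return result;
        } catch (RuntimeException ex) {
            recordFailure();
            throw ex;
        }
    }

    private synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        // After the cool-down, let a trial call through to probe the dependency.
        return clockMillis.getAsLong() - openedAtMillis >= openDurationMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true; // opens, or re-opens after a failed trial call
            openedAtMillis = clockMillis.getAsLong();
        }
    }
}
```

In normal use the clock would be `System::currentTimeMillis`. A failed trial call restarts the cool-down, and a successful one closes the circuit again.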
Common Interview Follow-Ups
Why do we need timeouts?
Without timeouts, a service can wait indefinitely on a slow dependency. That can tie up threads, increase latency, and cause failures to spread.
Are retries always good?
No. Retries help with transient failures, but aggressive retries can overload a struggling dependency and make an outage worse.
What is exponential backoff?
Exponential backoff means each retry waits longer than the previous one, such as 100ms, 200ms, 400ms, and 800ms.
What is jitter?
Jitter adds randomness to retry delays so many clients do not retry at exactly the same time.
Why is idempotency important for retries?
If a retried operation changes state, the same request may be applied more than once. Idempotency prevents duplicate side effects, such as double-charging a payment.
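To make the idempotency answer concrete, here is a minimal server-side sketch: the handler remembers the result of the first request for each idempotency key and returns that stored result when the same key arrives again. The class and method names are hypothetical, and the in-memory map is an assumption for illustration; a real service would typically use a durable store with expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotentPaymentHandler {
    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
    private final AtomicInteger chargesMade = new AtomicInteger();

    // Handles POST /payments. A retried request with the same key gets the stored result back.
    String handle(String idempotencyKey, long amountCents) {
        // ConcurrentHashMap.computeIfAbsent runs the charge at most once per key,
        // even if retries with the same key arrive concurrently.
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> charge(amountCents));
    }

    // Stand-in for the real side effect of charging the customer.
    private String charge(long amountCents) {
        int chargeNumber = chargesMade.incrementAndGet();
        return "charge-" + chargeNumber + " for " + amountCents + " cents";
    }

    int chargesMade() {
        return chargesMade.get();
    }
}
```

Calling `handle("abc-123", 500)` twice performs a single charge and returns the same result both times, so a client retry after a timeout cannot double-charge the customer.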