Reliability + System Design

Timeouts and retries in system design

Timeouts prevent services from waiting forever, while retries help recover from temporary failures. Used badly, retries can overload dependencies and cause cascading failures.

Tags: Reliability, Timeouts, Retries, Backoff, System Design

The Short Answer

Timeouts prevent your service from waiting forever on a slow dependency.

Retries help recover from temporary failures, such as a brief network issue or a dependency that is momentarily unavailable.

The key idea: timeouts stop waiting forever, and retries give temporary failures another chance. But retries must be limited, delayed, and used carefully.

The Real Problem

Imagine an Order Service calling a Payment Service.

  • Order Service: receives a checkout request
  • Payment Service: becomes slow or unavailable
  • Customer: waits forever or gets an error

Without a timeout, the Order Service may keep waiting. Threads stay occupied, request queues grow, latency increases, and eventually the Order Service may become unhealthy too.

A slow dependency can become your outage if you keep waiting on it forever.

Timeouts: Fail Instead of Waiting Forever

A timeout sets a maximum amount of time you are willing to wait for a dependency.

text
Order Service calls Payment Service

No timeout:
    wait... wait... wait...

With timeout:
    wait up to 300ms
    then fail fast or use fallback

This protects your own service resources. The goal is not to make the dependency healthy. The goal is to avoid letting one slow dependency consume all of your threads.

Simple Java Timeout Example

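Here is a simple example using CompletableFuture. If the work does not complete within the timeout, orTimeout (available since Java 9) completes the future with a TimeoutException, which join() surfaces wrapped in a CompletionException. Note that the timeout only stops the caller from waiting; the underlying task keeps running until it finishes.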

java
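import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExample {
    public static void main(String[] args) {
        // Simulates a payment call that takes 2 seconds to respond.
        CompletableFuture<String> paymentCall =
                CompletableFuture.supplyAsync(() -> {
                    sleep(2_000);
                    return "payment approved";
                });

        try {
            // Give up after 500ms instead of waiting for the full 2 seconds.
            String result = paymentCall
                    .orTimeout(500, TimeUnit.MILLISECONDS)
                    .join();

            System.out.println(result);
        } catch (CompletionException ex) {
            // join() wraps the original failure, so inspect the cause.
            if (ex.getCause() instanceof TimeoutException) {
                System.out.println("Payment service timed out");
            } else {
                System.out.println("Payment call failed: " + ex.getCause());
            }
        }
    }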

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}

In real systems, HTTP clients usually have their own connection and request timeout settings. The concept is the same: do not wait forever.
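As a rough illustration of those two settings, here is a minimal sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+). The class name, the URL, the request body, and the 200ms and 300ms values are all assumptions chosen for the example, not recommendations.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class HttpTimeoutConfig {
    // Connect timeout: the longest we will wait to establish a connection.
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))
                .build();
    }

    // Request timeout: the longest we will wait for the response to this request.
    // The URL and body are placeholders, not a real endpoint.
    static HttpRequest buildChargeRequest() {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://payment-service.example/charge"))
                .timeout(Duration.ofMillis(300))
                .POST(HttpRequest.BodyPublishers.ofString("{\"orderId\":\"order-1\"}"))
                .build();
    }
}
```

Sending the request with `client.send(request, HttpResponse.BodyHandlers.ofString())` would then throw an `HttpTimeoutException` if no response arrives within the request timeout, which the caller can turn into a fast failure or a fallback.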

Retries: Useful, But Dangerous

A retry means: if the first call fails, try again.

Retries are useful for transient failures:

  • brief network blip
  • temporary service restart
  • short database failover
  • HTTP 429 throttling response
  • temporary 503 from a dependency

But retries are dangerous when the downstream service is already overloaded.

Without Retry Limit

Request fails
Retry immediately
Retry again
Dependency gets even more traffic

With Bounded Retries

Request fails
Wait briefly
Retry limited number of times
Stop before causing more damage

Bad Retry Example

This is the kind of retry logic you do not want in production.

java
while (true) {
    try {
        return paymentClient.charge(request);
    } catch (Exception ex) {
        // Try again immediately forever
    }
}

This can hammer an already struggling dependency. It also gives the caller no clear failure boundary.

Better Retry Strategy

A safer retry strategy usually has:

  • a maximum number of attempts
  • a timeout per attempt
  • backoff between attempts
  • jitter to avoid retry storms
  • idempotency for operations that change state

Retrying without limits can turn a small failure into a cascading failure.

Exponential Backoff

Exponential backoff means each retry waits longer than the previous one, typically doubling the delay each time.

text
Attempt 1: call immediately
Attempt 2: wait 100ms
Attempt 3: wait 200ms
Attempt 4: wait 400ms
Attempt 5: wait 800ms

This gives the downstream service time to recover instead of immediately flooding it with more requests.
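The schedule above can be computed with a small helper. This is a sketch, not a library API: the class name, the method name, and the upper cap on the delay are assumptions added for illustration (the cap keeps late retries from waiting unreasonably long).

```java
public class BackoffSchedule {
    // Delay to wait before the given retry. Retry 1 is the first retry after
    // the initial call. The delay doubles each time and never exceeds capMillis.
    static long delayMillis(int retry, long baseMillis, long capMillis) {
        // Limit the shift so the doubling stays well inside the range of a long
        // for typical base delays.
        int shift = Math.min(retry - 1, 20);
        return Math.min(capMillis, baseMillis * (1L << shift));
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 6; retry++) {
            System.out.println("retry " + retry + ": wait "
                    + delayMillis(retry, 100, 1_000) + "ms");
        }
    }
}
```

With a 100ms base and a 1,000ms cap, the first four retries wait 100ms, 200ms, 400ms, and 800ms, and every later retry waits 1,000ms.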

Jitter: Avoid Everyone Retrying Together

If thousands of clients all retry on exactly the same schedule, they can create waves of traffic.

text
Bad:
Client A retries after 100ms
Client B retries after 100ms
Client C retries after 100ms

Better:
Client A retries after 83ms
Client B retries after 141ms
Client C retries after 117ms

Jitter adds randomness to retry delays so clients do not all retry at the exact same time.
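One common way to do this is often called "full jitter": instead of adding a small random amount to the delay, pick the whole delay at random between zero and the exponential ceiling. The sketch below assumes that variant; the class name, method name, and cap parameter are illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;

public class FullJitter {
    // Returns a random delay in the range [0, ceiling], where ceiling is the
    // capped exponential delay for this retry (retry 1 is the first retry).
    static long delayMillis(int retry, long baseMillis, long capMillis) {
        int shift = Math.min(retry - 1, 20);
        long ceiling = Math.min(capMillis, baseMillis * (1L << shift));
        // nextLong(bound) excludes the bound, so add 1 to include the ceiling.
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

The trade-off is that some retries happen almost immediately, but across many clients the retry traffic is spread over the whole window instead of arriving in a single spike.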

Simple Java Retry With Backoff Example

This is a simplified copy-pasteable example. Real systems usually use libraries, HTTP client settings, or resilience frameworks, but the core idea is the same.

java
import java.util.Random;

public class RetryWithBackoffExample {
    private static final Random random = new Random();

    public static void main(String[] args) throws InterruptedException {
        String result = callWithRetries();

        System.out.println(result);
    }

    static String callWithRetries() throws InterruptedException {
        int maxAttempts = 4;
        long baseDelayMillis = 100;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return callDependency();
            } catch (RuntimeException ex) {
                if (attempt == maxAttempts) {
                    throw ex;
                }

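                // Exponential backoff: 100ms, 200ms, 400ms after attempts 1, 2, 3.
                long exponentialDelay =
                        baseDelayMillis * (1L << (attempt - 1));

                // Jitter: up to 49ms extra so clients do not retry in lockstep.
                // nextInt(bound) is used because Random.nextLong(bound) needs Java 17+.
                long jitter =
                        random.nextInt(50);

                long sleepMillis =
                        exponentialDelay + jitter;

                Thread.sleep(sleepMillis);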
            }
        }

        throw new IllegalStateException("unreachable");
    }

    static String callDependency() {
        if (random.nextBoolean()) {
            throw new RuntimeException("temporary failure");
        }

        return "success";
    }
}

Notice the important parts: retries are limited, delay increases, and jitter spreads retry traffic out.

Idempotency Matters

Retrying a read request is usually safer than retrying a write request.

text
GET /products/123

Usually safe to retry.

But retrying this can be dangerous:

text
POST /payments

If retried incorrectly, the user may be charged twice.

For operations that modify state, retries should usually be paired with idempotency keys.

text
POST /payments
Idempotency-Key: abc-123

Retry write operations only when you know the operation is idempotent or protected by an idempotency key.
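On the server side, honoring an idempotency key roughly means "if I have already processed this key, return the stored result instead of doing the work again." The sketch below shows that idea with an in-memory map. The class and method names are hypothetical, and a real payment service would need durable, shared storage and key expiry rather than a map inside one process.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotentPaymentHandler {
    // In-memory store for the sketch only; it is lost on restart.
    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
    private final AtomicInteger chargesMade = new AtomicInteger();

    // Charges at most once per idempotency key. A retry with the same key
    // gets the result of the first call back.
    public String handle(String idempotencyKey, long amountCents) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> charge(amountCents));
    }

    // Stand-in for the real side effect (charging a card).
    private String charge(long amountCents) {
        int chargeNumber = chargesMade.incrementAndGet();
        return "charge-" + chargeNumber + " for " + amountCents + " cents";
    }

    public int chargesMade() {
        return chargesMade.get();
    }
}
```

Calling `handle("abc-123", 500)` twice returns the same result both times and performs only one charge, which is what makes the client-side retry safe.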

Timeouts and Retries Work Together

Timeouts and retries should be designed together.

text
Bad:
Timeout = 10 seconds
Attempts = 3

Worst case:
request waits around 30 seconds (plus any backoff delays) before failing

A better approach is to think about the total request budget.

text
User-facing API budget: 800ms

Attempt 1 timeout: 250ms
Retry delay: 50ms
Attempt 2 timeout: 250ms
Retry delay: 100ms
Attempt 3 timeout: 150ms

Still fits inside the user's expected latency budget.

This is the senior-level point: retries are not free. They consume time, threads, network capacity, and downstream capacity.
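To make the budget arithmetic concrete, here is a small sketch that plans per-attempt timeouts so that attempts plus retry delays never exceed the total budget. It assumes the worst case where every attempt uses its full timeout; the class and method names are illustrative. A real client would more likely track a deadline against a clock and pass the remaining time to each attempt.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryBudgetPlanner {
    // Returns the timeout to use for each attempt. The last timeout is
    // shortened, or the attempt skipped entirely, when the budget runs out.
    static List<Long> planAttemptTimeouts(long budgetMillis,
                                          long perAttemptTimeoutMillis,
                                          long[] retryDelaysMillis) {
        List<Long> timeouts = new ArrayList<>();
        long remaining = budgetMillis;
        int maxAttempts = retryDelaysMillis.length + 1;

        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (attempt > 0) {
                // The wait before a retry also comes out of the budget.
                remaining -= retryDelaysMillis[attempt - 1];
            }
            if (remaining <= 0) {
                break; // no budget left for another attempt
            }
            long timeout = Math.min(perAttemptTimeoutMillis, remaining);
            timeouts.add(timeout);
            remaining -= timeout;
        }
        return timeouts;
    }
}
```

For example, `planAttemptTimeouts(800, 250, new long[] {50, 100})` returns `[250, 250, 150]`, the same plan as the 800ms budget above. With a 400ms budget the same call returns `[250, 100]`: the third attempt is dropped because nothing is left to spend on it.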

What Should Be Retried?

Usually retry

Temporary network failures, timeouts, 429 throttling responses (honoring any Retry-After header the server sends), and some 503 service unavailable responses.

Usually do not retry

Validation errors, authentication failures, authorization failures, and most 4xx client errors.

Be careful retrying

Payments, order creation, inventory changes, and anything with side effects.

Use async retries

For background jobs, queues are often better than making the user wait through many retry attempts.
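The rules above can be expressed as a small decision function. This is a deliberately conservative sketch: it treats only 429 and 503 as retryable statuses, and only retries when the method is one the HTTP specification defines as idempotent or the request carries an idempotency key. The class and method names are assumptions, and a real policy would likely handle more cases (for example, connection errors that never produce a status code).

```java
public class RetryPolicy {
    // Statuses that usually signal a temporary condition.
    static boolean isRetryableStatus(int status) {
        return status == 429 || status == 503;
    }

    // Methods that HTTP defines as idempotent (a subset, for brevity).
    static boolean isIdempotentMethod(String method) {
        return method.equals("GET") || method.equals("HEAD")
                || method.equals("PUT") || method.equals("DELETE");
    }

    // Retry only when the failure looks temporary AND repeating the request
    // cannot apply the same side effect twice.
    static boolean shouldRetry(String method, int status, boolean hasIdempotencyKey) {
        return isRetryableStatus(status)
                && (isIdempotentMethod(method) || hasIdempotencyKey);
    }
}
```

So a `GET` that returns 503 is retried, a `POST /payments` that returns 503 is retried only if it carries an idempotency key, and a 400 or 401 is never retried because the same request will fail the same way.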

How This Connects to Circuit Breakers

If a dependency is clearly failing, retries may make the situation worse. That is where a circuit breaker helps.

text
Dependency starts failing
Retries increase traffic
Dependency gets more overloaded
More failures
Circuit breaker opens
Fail fast instead of continuing to hammer dependency

Timeouts and retries are often the first layer. Circuit breakers are another layer that protects the system when failure is no longer temporary.
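To show the mechanics, here is a minimal circuit breaker sketch. It opens after a number of consecutive failures and rejects calls until a cool-down has passed. All names and thresholds are assumptions for illustration, and it is simplified: production breakers, such as those in resilience libraries, usually track failure rates over a window and allow only a limited number of probe calls after the cool-down.

```java
import java.util.function.LongSupplier;

public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private final LongSupplier clockMillis; // injectable so time can be controlled in tests

    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtMillis = 0;

    public SimpleCircuitBreaker(int failureThreshold, long coolDownMillis,
                                LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clockMillis = clockMillis;
    }

    // True if a call may be attempted right now.
    public synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        // After the cool-down, let calls through again to probe the dependency.
        return clockMillis.getAsLong() - openedAtMillis >= coolDownMillis;
    }

    // A successful call closes the circuit and resets the failure count.
    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    // Enough consecutive failures open (or re-open) the circuit.
    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtMillis = clockMillis.getAsLong();
        }
    }
}
```

A caller would check `allowRequest()` before each remote call, fail fast or use a fallback when it returns false, and report the outcome with `recordSuccess()` or `recordFailure()`. In production code the clock would typically be `System::currentTimeMillis`.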

How to Answer This in an Interview

I would use timeouts on all remote calls so my service does not wait forever. Then I would use retries only for transient failures, with a maximum attempt count, exponential backoff, jitter, and an overall request deadline. For write operations, I would make sure retries are safe using idempotency keys. I would also avoid retry storms and add circuit breakers or graceful degradation when a dependency is unhealthy.

Common Interview Follow-Ups

Why do we need timeouts?

Without timeouts, a service can wait indefinitely on a slow dependency. That can tie up threads, increase latency, and cause failures to spread.

Are retries always good?

No. Retries help with transient failures, but aggressive retries can overload a struggling dependency and make an outage worse.

What is exponential backoff?

Exponential backoff means each retry waits longer than the previous one, such as 100ms, 200ms, 400ms, and 800ms.

What is jitter?

Jitter adds randomness to retry delays so many clients do not retry at exactly the same time.

Why is idempotency important for retries?

If a retried operation changes state, the same request may be applied more than once. Idempotency prevents duplicate side effects, such as double-charging a payment.

Final Takeaway

Timeouts protect your service from waiting forever. Retries help with temporary failures. But production-safe retries need limits, backoff, jitter, total request budgets, and idempotency for side effects.