Graceful degradation in system design
Graceful degradation means keeping the most important user flows working even when optional features or dependencies fail.
The Short Answer
Graceful degradation means the system keeps the most important user flow working even when some part of the system is slow, unavailable, or failing.
The Real Problem
Modern systems depend on many things: databases, caches, search services, recommendation systems, payment providers, image services, analytics pipelines, third-party APIs, and more.
If every dependency failure causes the entire product to fail, the system is brittle.
[Diagram: brittle system vs. gracefully degraded system]
Simple Ecommerce Example
Suppose an ecommerce product page depends on these features:
- product details
- price
- inventory status
- recommendations
- reviews
- recently viewed items
- analytics tracking
These are not equally important. If analytics or recommendations are down, the user should still be able to view the product and check out.
Critical vs Non-Critical Features
Critical Path
Can Degrade
Common Graceful Degradation Techniques
Fallback response
Return a simpler response when a dependency fails.
Cached data
Show slightly stale data if fresh data is temporarily unavailable.
Hide non-critical features
Remove recommendations, reviews, or personalization while core flows continue.
Default values
Use safe defaults when optional data cannot be loaded.
Async processing
Queue non-critical work instead of blocking the user.
Feature flags / kill switches
Disable risky or broken features at runtime without redeploying.
Circuit breakers
Stop calling dependencies that are clearly failing.
Load shedding
Drop or reject less important work to protect critical work.
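As a rough illustration of the "cached data" technique above, the sketch below keeps the last good value per key and serves it when a fresh load fails. The class name `StaleOnErrorCache` and its `loader` parameter are hypothetical, and the sketch assumes the loader never returns null. A real cache would also bound its size and limit how old a stale value may be.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: wraps a loader and remembers the last good value per key.
// Assumption: the loader never returns null.
class StaleOnErrorCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    StaleOnErrorCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns fresh data when the loader succeeds, the last good value when
    // the loader throws, and an empty Optional when nothing was ever cached.
    Optional<V> get(K key) {
        try {
            V fresh = loader.apply(key);
            lastGood.put(key, fresh);
            return Optional.of(fresh);
        } catch (RuntimeException ex) {
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}
```

Callers decide what an empty result means: for an optional feature it usually means hiding the feature, while for a critical value it may mean failing the request.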
Example: Recommendation Service Fails
Without graceful degradation, the product page may fail because one optional feature failed.
Bad:
Product Page
needs Product API
needs Price API
needs Recommendation API
Recommendation API times out
↓
Entire Product Page returns 500
A better design treats recommendations as optional.
Better:
Product API succeeds
Price API succeeds
Recommendation API times out
↓
Show product
Show price
Hide recommendations
Still allow checkout
Simple Java Example: Fallback on Failure
This is a simplified example showing the idea. If recommendations fail, the page still returns with an empty recommendation list.
import java.util.List;

public class ProductPageService {
    private final ProductClient productClient;
    private final RecommendationClient recommendationClient;

    public ProductPageService(
            ProductClient productClient,
            RecommendationClient recommendationClient
    ) {
        this.productClient = productClient;
        this.recommendationClient = recommendationClient;
    }

    public ProductPage getProductPage(String productId) {
        // Critical dependency: if this call throws, the page cannot be built.
        Product product = productClient.getProduct(productId);

        // Optional dependency: on failure, fall back to an empty list.
        List<Product> recommendations;
        try {
            recommendations =
                    recommendationClient.getRecommendations(productId);
        } catch (RuntimeException ex) {
            recommendations = List.of();
        }

        return new ProductPage(product, recommendations);
    }

    record Product(String id, String name) {}

    record ProductPage(
            Product product,
            List<Product> recommendations
    ) {}

    interface ProductClient {
        Product getProduct(String productId);
    }

    interface RecommendationClient {
        List<Product> getRecommendations(String productId);
    }
}
In a real system, you would usually log the failure and emit a metric so the degraded mode is visible to operators.
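One possible way to make that visible is a small wrapper that counts and logs every fallback. This is only a sketch: `ObservedFallback` is a hypothetical name, the in-memory counter stands in for whatever metrics library the system actually uses, and logging goes through the JDK's built-in `java.util.logging`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper: runs an optional call and records every fallback.
class ObservedFallback {
    private static final Logger LOG =
            Logger.getLogger(ObservedFallback.class.getName());

    // Stand-in for a real metrics counter.
    private final AtomicLong fallbackCount = new AtomicLong();

    // Returns the primary result, or the fallback if the primary call throws.
    <T> T call(String dependency, Supplier<T> primary, T fallback) {
        try {
            return primary.get();
        } catch (RuntimeException ex) {
            fallbackCount.incrementAndGet();
            LOG.log(Level.WARNING, "Falling back for dependency " + dependency, ex);
            return fallback;
        }
    }

    long fallbackCount() {
        return fallbackCount.get();
    }
}
```

With a wrapper like this, the recommendation call could be written as `observed.call("recommendations", () -> client.getRecommendations(id), List.of())`, where `observed` and `client` are assumed instances.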
Example: Analytics Should Not Block Checkout
Analytics is useful, but it usually should not block the customer from completing an order.
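A minimal sketch of keeping analytics off the checkout path might hand the event to a background thread and return immediately. The names `CheckoutService` and `analyticsSink` are hypothetical, the order write is reduced to a comment, and the executor's unbounded queue is a simplification that a production system would probably replace with a bounded queue or a message broker.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Consumer;

// Hypothetical service: the order is the critical step, analytics is best effort.
class CheckoutService {
    // Daemon thread so pending analytics work never keeps the JVM alive.
    private final ExecutorService analyticsExecutor =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "analytics");
                t.setDaemon(true);
                return t;
            });
    private final Consumer<String> analyticsSink;

    CheckoutService(Consumer<String> analyticsSink) {
        this.analyticsSink = analyticsSink;
    }

    String checkout(String orderId) {
        // Critical step (omitted here): write the order synchronously.

        // Non-critical step: publish the event without waiting for it.
        try {
            analyticsExecutor.execute(() -> {
                try {
                    analyticsSink.accept("order_placed:" + orderId);
                } catch (RuntimeException ex) {
                    // Analytics failures are dropped; the order already succeeded.
                }
            });
        } catch (RejectedExecutionException ex) {
            // Executor was shut down; skip analytics rather than fail checkout.
        }
        return "OK:" + orderId;
    }
}
```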
Bad:
User clicks checkout
↓
Write order
↓
Send analytics event synchronously
↓
Analytics system is slow
↓
Checkout becomes slow or fails
Better:
User clicks checkout
↓
Write order
↓
Publish analytics event asynchronously
↓
Return success to user
Designing a Degradation Plan
Graceful degradation should be intentional. For each dependency, ask:
- Is this dependency required for the core user flow?
- What happens if it is slow?
- What happens if it is unavailable?
- Can we use cached data?
- Can we return partial results?
- Can we hide this feature temporarily?
- Can we queue the work for later?
- How will we know we are in degraded mode?
Graceful Degradation vs Fail Fast
These ideas are related, but they are not the same.
Fail fast
Stop waiting quickly when something is unlikely to succeed.
Graceful degradation
Continue offering useful reduced functionality after something fails.
Example: a timeout or circuit breaker may fail fast. The fallback behavior after that failure is graceful degradation.
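The two ideas can be combined in a few lines. In the sketch below, the timeout is the fail-fast part and the fallback value is the graceful-degradation part. `TimeoutFallback` and `callWithTimeout` are hypothetical names; the code relies on `CompletableFuture.completeOnTimeout`, which is available from Java 9. Note that the slow call is not cancelled here, it just stops being waited for.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class TimeoutFallback {
    // Fail fast: stop waiting for the call after timeoutMillis.
    // Degrade gracefully: return the fallback instead of an error,
    // both on timeout and when the call itself throws.
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(call)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> fallback)
                .join();
    }
}
```

A circuit breaker extends the same idea by skipping the call entirely once recent failures cross a threshold, so the fallback is returned without waiting for the timeout at all.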
What to Measure
Degraded mode should not be invisible. You want to know when users are getting a reduced experience.
Fallback rate
How often are we using fallback data or hiding features?
Dependency failure rate
Which service is causing degraded behavior?
User-visible impact
Are checkouts, logins, or core actions still succeeding?
Latency
Did degradation keep the core flow fast enough?
How to Answer This in an Interview
Common Interview Follow-Ups
Is graceful degradation the same as high availability?
No. Graceful degradation is one technique that can improve practical availability. High availability is the broader goal of keeping the system usable despite failures.
What is a good example?
If a recommendation service fails, an ecommerce site can hide recommendations but still allow product viewing, cart actions, and checkout.
What should not be degraded?
Critical correctness paths usually should not silently degrade. For example, payment authorization, account balance updates, and security checks need strict handling.
How do feature flags help?
Feature flags and kill switches let teams disable non-critical or risky functionality at runtime without redeploying.
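As a small illustration, a kill switch can be as simple as a thread-safe set of disabled feature names that is checked before optional work runs. `KillSwitches` is a hypothetical in-memory class; real systems typically back this with a configuration service so the switch can be flipped across all instances at once.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory kill switch registry; features are enabled by default.
class KillSwitches {
    private final Set<String> disabled = ConcurrentHashMap.newKeySet();

    void disable(String feature) {
        disabled.add(feature);
    }

    void enable(String feature) {
        disabled.remove(feature);
    }

    boolean isEnabled(String feature) {
        return !disabled.contains(feature);
    }

    // Runs the work only when the feature is enabled; otherwise returns
    // the given default without touching the dependency at all.
    <T> T ifEnabled(String feature, Supplier<T> work, T whenDisabled) {
        return isEnabled(feature) ? work.get() : whenDisabled;
    }
}
```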
What mistake do candidates make?
They say 'use fallback' but do not explain which features are critical, which are optional, and how degraded mode will be measured.