Designing Resilient APIs: Circuit Breakers, Bulkheads, and Graceful Degradation
AI Assisted Content — This article was written with the help of AI tools. It has been reviewed and curated by our team.
PraiseGod
Distributed systems fail. Not occasionally — continuously. Network partitions, dependency latency spikes, database connection exhaustion, third-party service degradation, and deployment-induced errors are not exceptional events. They are the normal operating conditions of any system with more than a few services.
The engineers who treat failure as a design input rather than an incident build systems that keep serving users under degraded conditions. The engineers who don't end up with systems that cascade a single upstream slowdown into a full-stack outage.
Why Naive Retry Makes Things Worse
The instinct when a request fails is to retry. For synchronous services under load, naive retry is actively harmful.
If a downstream service is responding slowly (not down, just slow), every retry holds a connection open for the full timeout duration. A 10-second timeout with 5 retries means a single request holds a connection for up to 50 seconds. Multiply by your concurrent request count, and you have connection pool exhaustion.
The fix is exponential backoff with jitter:
```python
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

# http_client is a shared httpx.AsyncClient; PaymentResult is a
# Pydantic model — both defined elsewhere in the service.

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10, jitter=2),
    reraise=True,
)
async def call_payment_service(payload: dict) -> PaymentResult:
    response = await http_client.post("/charge", json=payload, timeout=5.0)
    response.raise_for_status()
    return PaymentResult.model_validate(response.json())
```

Jitter is not optional. Without it, all retrying clients synchronize their retry schedules (a retry storm) and hit the recovering service with a load spike precisely when it is trying to recover. Full jitter, randomizing the wait within the backoff window, prevents synchronization.
Only retry on transient failures. A 400 Bad Request should not be retried — the request is malformed and will always fail. A 429 Too Many Requests should respect the Retry-After header. A 503 with a Retry-After header should wait. Network timeouts and 5xx responses without Retry-After are retry candidates.
Circuit Breaker Implementation
The circuit breaker sits between your service and a dependency. It tracks the failure rate and, when failures exceed a threshold, opens the circuit — rejecting all calls immediately without attempting the network request.
```python
import time
from enum import Enum
from dataclasses import dataclass, field
from threading import Lock

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    success_threshold: int = 2   # probes needed to close from half-open
    timeout: float = 30.0        # seconds before moving open -> half-open

    _state: State = State.CLOSED
    _failure_count: int = 0
    _success_count: int = 0
    _last_failure_time: float = 0.0
    _lock: Lock = field(default_factory=Lock)

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                if time.time() - self._last_failure_time > self.timeout:
                    self._state = State.HALF_OPEN
                    self._success_count = 0
                else:
                    raise CircuitOpenError("Circuit is open: failing fast")

        # Run the call outside the lock: threading.Lock is not reentrant,
        # and _on_success/_on_failure acquire it again.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self._failure_count = 0
            if self._state == State.HALF_OPEN:
                self._success_count += 1
                if self._success_count >= self.success_threshold:
                    self._state = State.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._failure_count >= self.failure_threshold or self._state == State.HALF_OPEN:
                self._state = State.OPEN
```

Use the pybreaker or circuitbreaker libraries in production rather than rolling your own; they handle thread safety and sliding-window calculations correctly. The pattern above illustrates the state machine; production implementations add sliding-window failure-rate calculation (not an absolute count), per-exception-type configuration, and metrics emission.
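Production breakers judge a failure rate over recent calls rather than an absolute count. A minimal sketch of that sliding window, with illustrative names and thresholds:

```python
from collections import deque

class SlidingWindowTracker:
    """Tracks the failure rate over the last `window_size` call outcomes."""

    def __init__(self, window_size: int = 50, failure_rate_threshold: float = 0.5):
        self._outcomes: deque = deque(maxlen=window_size)  # True = success
        self._threshold = failure_rate_threshold

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_open(self) -> bool:
        # Require a half-full window before judging, so the first couple of
        # failures don't open the circuit on a 100%-failure-rate fluke.
        return (
            len(self._outcomes) >= self._outcomes.maxlen // 2
            and self.failure_rate() >= self._threshold
        )
```

A breaker would call record() from its success/failure handlers and consult should_open() instead of comparing a raw counter against failure_threshold.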
What to emit when the circuit opens:
- Increment a circuit_breaker_open_total counter (Prometheus label: dependency name)
- Log at ERROR level with the dependency name, failure count, and threshold
- Alert on sustained open state (> 60 seconds) via PagerDuty / OpsGenie
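All three emissions can hang off a single hook. A sketch; `metrics` is a plain stand-in for a real client such as prometheus_client's Counter, and the function and label names are illustrative:

```python
import logging
from collections import Counter

logger = logging.getLogger("resilience")

# Stand-in for a real metrics client (e.g. prometheus_client); the key
# plays the role of the metric name plus its `dependency` label.
metrics: Counter = Counter()

def on_circuit_open(dependency: str, failure_count: int, threshold: int) -> None:
    """Hook invoked whenever a breaker transitions to OPEN."""
    metrics[f"circuit_breaker_open_total{{dependency={dependency}}}"] += 1
    logger.error(
        "Circuit opened for %s: %d failures (threshold %d)",
        dependency, failure_count, threshold,
    )

on_circuit_open("payment-service", failure_count=5, threshold=5)
```

The sustained-open alert then becomes a query over the counter (e.g. the circuit has been open for more than 60 seconds), evaluated by your alerting system rather than by the breaker itself.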
Bulkhead Pattern: Isolating Resource Pools
The bulkhead pattern partitions resources so that a failure in one partition cannot exhaust resources across the entire system. Named after ship bulkheads that prevent a flooded compartment from sinking the entire vessel.
In practice, this means separate thread pools or connection pools for different categories of downstream dependencies:
```python
import asyncio
from asyncio import Semaphore

class BulkheadPool:
    """Caps concurrency so one dependency cannot exhaust shared resources."""

    def __init__(self, max_concurrent: int):
        self._semaphore = Semaphore(max_concurrent)

    async def execute(self, coro):
        async with self._semaphore:
            return await coro

# Separate pools for separate concerns
payment_pool = BulkheadPool(max_concurrent=20)
inventory_pool = BulkheadPool(max_concurrent=50)
notification_pool = BulkheadPool(max_concurrent=10)

async def process_order(order: Order):
    # Inventory and payment run concurrently from isolated pools
    inventory_task = asyncio.create_task(
        inventory_pool.execute(reserve_inventory(order))
    )
    payment_task = asyncio.create_task(
        payment_pool.execute(charge_payment(order))
    )

    inventory, payment = await asyncio.gather(inventory_task, payment_task)

    # Notification is non-critical; failure is acceptable
    asyncio.create_task(
        notification_pool.execute(send_confirmation(order, payment))
    )
    return {"order_id": order.id, "status": "confirmed"}
```

The notification pool exhausting does not block order processing. The payment pool reaching capacity causes backpressure on payment operations but does not affect inventory queries. Failure domains are isolated.
Timeout Hierarchy
Every external call must have a timeout. "Set timeouts" is advice engineers routinely ignore until they spend a Saturday debugging why their API was holding 400 open connections for 90 seconds.
Design a timeout hierarchy from the outermost to innermost call:
| Layer | Timeout | Purpose |
|---|---|---|
| Client → API Gateway | 30s | Maximum user-facing wait |
| API Gateway → Service | 25s | Service must respond before gateway times out |
| Service → DB (read) | 5s | Query must complete |
| Service → DB (write) | 10s | Transaction must complete |
| Service → External API | 8s | Third-party response |
| Service → Cache | 1s | Cache miss should be fast |
Connect timeout, read timeout, write timeout, and pool-acquisition timeout are distinct budgets. Set all of them explicitly:

```python
import httpx

timeout = httpx.Timeout(
    connect=2.0,  # TCP connection establishment
    read=8.0,     # time between bytes once connected
    write=3.0,    # time to send the request body
    pool=1.0,     # time to acquire a connection from the pool
)
```

A service that takes 25 seconds to respond is effectively down. A tight timeout converts that into a fast failure that you can handle with a fallback.
Fallback Strategies by Criticality
Not every operation has the same criticality. Design fallbacks proportional to the business impact of failure:
Critical path (order processing, authentication): No fallback — surface a clear error to the user. Do not silently serve stale or incorrect data on the critical path.
User experience enhancement (personalization, recommendations): Serve a default/generic response. A homepage recommendation widget that shows generic popular items when the personalization service is unavailable is better than a blank section or a 500 error.
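A sketch of that degradation path. The personalization call here is a stub that simulates an outage, and the names and the 800 ms budget are illustrative:

```python
import asyncio

FALLBACK_RECOMMENDATIONS = [
    {"sku": "POPULAR-1", "title": "Bestseller one"},
    {"sku": "POPULAR-2", "title": "Bestseller two"},
]

async def fetch_personalized(user_id: str) -> list:
    # Stand-in for the real personalization service; simulate an outage.
    raise ConnectionError("personalization service unreachable")

async def get_recommendations(user_id: str) -> list:
    """Personalized when the service answers in time, generic otherwise."""
    try:
        return await asyncio.wait_for(fetch_personalized(user_id), timeout=0.8)
    except (asyncio.TimeoutError, ConnectionError):
        # Degrade, don't fail: generic items beat a blank widget or a 500.
        return FALLBACK_RECOMMENDATIONS

result = asyncio.run(get_recommendations("user-42"))
print(result[0]["sku"])  # POPULAR-1
```

The tight per-call budget matters as much as the fallback itself: a recommendation widget that waits 8 seconds for personalization has already ruined the page it was meant to enhance.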
Analytics and logging: Fire and forget. If the analytics pipeline is degraded, log locally and do not impact the user-facing response. Background retry or queue the event locally.
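A sketch of the queue-and-drain shape; the exporter is a stub, and the names and sizes are illustrative:

```python
import asyncio

analytics_queue: asyncio.Queue = asyncio.Queue(maxsize=1000)

def track_event(event: dict) -> None:
    """Fire and forget: enqueue the event, never block the request path."""
    try:
        analytics_queue.put_nowait(event)
    except asyncio.QueueFull:
        # Dropping an analytics event beats slowing the user response.
        pass

async def ship_to_pipeline(event: dict) -> None:
    # Stand-in for the real exporter (e.g. an HTTP post to the pipeline).
    ...

async def drain_analytics() -> None:
    """Background task: ship queued events, requeueing on failure."""
    while True:
        event = await analytics_queue.get()
        try:
            await ship_to_pipeline(event)
        except Exception:
            await asyncio.sleep(1.0)   # back off, then retry later
            track_event(event)
```

Request handlers call track_event() synchronously and return immediately; drain_analytics() runs as a long-lived background task started at application startup.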
Read replicas and caching: Serve stale cached data with an explicit degraded-mode indicator, such as a custom response header or a stale-while-revalidate caching policy. Stale product prices displayed with "Prices may be slightly out of date" are better than a 503.
Health Checks: Liveness vs. Readiness
Two distinct probes — do not conflate them:
Liveness probe (/healthz): Is the process alive? Should never fail in normal operation. A failing liveness probe causes Kubernetes to kill and restart the container. Return 200 with no external dependencies checked.
Readiness probe (/readyz): Is this instance ready to accept traffic? Check database connectivity, cache connectivity, any dependency required to serve requests. A failing readiness probe removes the instance from the load balancer rotation without killing it.
```python
from fastapi.responses import JSONResponse
from sqlalchemy import text

# DbSession, Redis, get_db, and get_cache are application-specific
# dependencies defined elsewhere.

@app.get("/readyz")
async def readiness(db: DbSession = Depends(get_db), cache: Redis = Depends(get_cache)):
    results = {}
    try:
        await db.execute(text("SELECT 1"))
        results["database"] = "ok"
    except Exception as e:
        results["database"] = f"error: {e}"

    try:
        await cache.ping()
        results["cache"] = "ok"
    except Exception as e:
        results["cache"] = f"error: {e}"

    all_ok = all(v == "ok" for v in results.values())
    return JSONResponse(content=results, status_code=200 if all_ok else 503)
```

Resilience is not about preventing failures; that is impossible in distributed systems. It is about converting failures from catastrophic system-wide events into bounded, recoverable, observable incidents.