Designing Resilient APIs: Circuit Breakers, Bulkheads, and Graceful Degradation
API Design · February 18, 2026

AI Assisted Content — This article was written with the help of AI tools. It has been reviewed and curated by our team.

PraiseGod

13 min read

Distributed systems fail. Not occasionally — continuously. Network partitions, dependency latency spikes, database connection exhaustion, third-party service degradation, and deployment-induced errors are not exceptional events. They are the normal operating conditions of any system with more than a few services.

The engineers who treat failure as a design input — rather than an incident — build systems that keep serving users under degraded conditions. The engineers who don't end up with systems where a single upstream slowdown cascades into a full-stack outage.

Why Naive Retry Makes Things Worse

The instinct when a request fails is to retry. For synchronous services under load, naive retry is actively harmful.

If a downstream service is responding slowly (not down, just slow), every attempt holds a connection open for the full timeout duration. A 10-second timeout with 5 attempts means a single logical request can hold a connection for up to 50 seconds. Multiply that by your concurrent request count, and you have connection pool exhaustion.

The fix is exponential backoff with jitter:

```python
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

# http_client is a shared httpx.AsyncClient; PaymentResult is a Pydantic model
# defined elsewhere in the service.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=10, jitter=2),
    reraise=True,
)
async def call_payment_service(payload: dict) -> PaymentResult:
    response = await http_client.post("/charge", json=payload, timeout=5.0)
    response.raise_for_status()
    return PaymentResult.model_validate(response.json())
```

Jitter is not optional. Without it, all retrying clients synchronize their retry schedule (retry storm) and hit the recovering service with a load spike precisely when it is trying to recover. Full jitter — randomizing the wait within the backoff window — prevents synchronization.
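Full jitter reduces to a one-line delay function. This is an illustrative sketch (the function name and defaults are mine, not from any library): the delay is drawn uniformly from the whole backoff window, so two clients that failed at the same instant almost never retry at the same instant.

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 10.0) -> float:
    """Full jitter: a uniform random delay within the exponential backoff window."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Attempt 0 sleeps somewhere in [0, 1], attempt 3 in [0, 8], attempt 10 in [0, 10].
delays = [full_jitter_delay(a) for a in (0, 3, 10)]
```

Libraries like tenacity implement this for you; the point is that the randomness spans the entire window, not a small offset added to a fixed delay.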

Only retry on transient failures. A 400 Bad Request should not be retried — the request is malformed and will always fail. A 429 Too Many Requests should respect the Retry-After header. A 503 with a Retry-After header should wait. Network timeouts and 5xx responses without Retry-After are retry candidates.
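That classification can be captured in a small helper. This is a simplified sketch — the function name is mine, it only covers status codes (network timeouts are retryable too), and a real client should also parse HTTP-date forms of Retry-After, not just integer seconds:

```python
from typing import Optional, Tuple

def retry_decision(status_code: int, retry_after: Optional[str] = None) -> Tuple[bool, Optional[float]]:
    """Return (should_retry, server_mandated_wait_in_seconds)."""
    if status_code in (429, 503) and retry_after is not None:
        return True, float(retry_after)   # honor the server's Retry-After
    if status_code >= 500:
        return True, None                 # transient: back off with jitter
    return False, None                    # 4xx: the request itself is wrong

# A 400 never retries; a 429 with Retry-After waits exactly as instructed.
```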

Circuit Breaker Implementation

The circuit breaker sits between your service and a dependency. It tracks the failure rate and, when failures exceed a threshold, opens the circuit — rejecting all calls immediately without attempting the network request.

```python
import time
from enum import Enum
from dataclasses import dataclass, field
from threading import Lock

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    success_threshold: int = 2  # probes needed to close from half-open
    timeout: float = 30.0       # seconds before moving open → half-open

    _state: State = State.CLOSED
    _failure_count: int = 0
    _success_count: int = 0
    _last_failure_time: float = 0.0
    _lock: Lock = field(default_factory=Lock)

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                if time.time() - self._last_failure_time > self.timeout:
                    self._state = State.HALF_OPEN
                    self._success_count = 0
                else:
                    raise CircuitOpenError("Circuit is open — fast-failing")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self._failure_count = 0
            if self._state == State.HALF_OPEN:
                self._success_count += 1
                if self._success_count >= self.success_threshold:
                    self._state = State.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._failure_count >= self.failure_threshold or self._state == State.HALF_OPEN:
                self._state = State.OPEN
```

In production, use a library such as pybreaker or circuitbreaker rather than rolling your own — they handle thread safety and sliding-window calculations correctly. The pattern above illustrates the state machine; production implementations add sliding-window failure-rate calculation (rather than an absolute count), per-exception-type configuration, and metrics emission.

What to emit when the circuit opens:

  • Increment a circuit_breaker_open_total counter (Prometheus label: dependency name)
  • Log at ERROR level with the dependency name, failure count, and threshold
  • Alert on sustained open state (> 60 seconds) via PagerDuty / OpsGenie

Bulkhead Pattern: Isolating Resource Pools

The bulkhead pattern partitions resources so that a failure in one partition cannot exhaust resources across the entire system. It is named after the ship bulkheads that prevent a flooded compartment from sinking the entire vessel.

In practice, this means separate thread pools or connection pools for different categories of downstream dependencies:

```python
import asyncio

class BulkheadPool:
    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)

    async def execute(self, coro):
        async with self._semaphore:
            return await coro

# Separate pools for separate concerns
payment_pool = BulkheadPool(max_concurrent=20)
inventory_pool = BulkheadPool(max_concurrent=50)
notification_pool = BulkheadPool(max_concurrent=10)

async def process_order(order: Order):
    # Inventory and payment run concurrently from isolated pools
    inventory_task = asyncio.create_task(
        inventory_pool.execute(reserve_inventory(order))
    )
    payment_task = asyncio.create_task(
        payment_pool.execute(charge_payment(order))
    )

    inventory, payment = await asyncio.gather(inventory_task, payment_task)

    # Notification is non-critical — failure is acceptable. (In production,
    # keep a reference to the task so it is not garbage-collected mid-flight.)
    asyncio.create_task(
        notification_pool.execute(send_confirmation(order, payment))
    )
    return {"order_id": order.id, "status": "confirmed"}
```

The notification pool exhausting does not block order processing. The payment pool reaching capacity causes backpressure on payment operations but does not affect inventory queries. Failure domains are isolated.
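The isolation is easy to verify. This instrumented variant of the pool (the class and helper names here are illustrative) records peak concurrency — even with ten tasks submitted at once, no more than the configured limit ever run simultaneously:

```python
import asyncio

class InstrumentedBulkhead:
    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._active = 0
        self.peak = 0  # highest observed concurrency

    async def execute(self, coro):
        async with self._semaphore:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await coro
            finally:
                self._active -= 1

async def main():
    pool = InstrumentedBulkhead(max_concurrent=3)

    async def work(i: int) -> int:
        await asyncio.sleep(0.01)  # stands in for downstream I/O
        return i

    # Submit 10 tasks at once; the semaphore admits at most 3 at a time
    results = await asyncio.gather(*(pool.execute(work(i)) for i in range(10)))
    return pool.peak, results

peak, results = asyncio.run(main())
```

Excess tasks queue on the semaphore rather than consuming connections or threads — that queuing is the backpressure described above.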

Timeout Hierarchy

Every external call must have a timeout. "Set timeouts" is advice engineers routinely ignore until they spend a Saturday debugging why their API was holding 400 open connections for 90 seconds.

Design a timeout hierarchy from the outermost to innermost call:

  • Client → API Gateway: 30s (maximum user-facing wait)
  • API Gateway → Service: 25s (service must respond before the gateway times out)
  • Service → DB (read): 5s (query must complete)
  • Service → DB (write): 10s (transaction must complete)
  • Service → External API: 8s (third-party response)
  • Service → Cache: 1s (a cache miss should be fast)

Connect timeout vs. read timeout vs. total timeout — use all three:

```python
import httpx

# Passed to httpx.AsyncClient(timeout=timeout)
timeout = httpx.Timeout(
    connect=2.0,  # TCP connection establishment
    read=8.0,     # time between bytes once connected
    write=3.0,    # time to send the request body
    pool=1.0,     # time to acquire a connection from the pool
)
```

A service that takes 25 seconds to respond is effectively down. A tight timeout converts that to a fast failure that you can handle with a fallback.
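The effect is easy to demonstrate: wrapping a would-be 30-second call in asyncio.wait_for bounds the caller's wait to the timeout, not the dependency's latency (the function names below are placeholders):

```python
import asyncio
import time

async def slow_dependency():
    await asyncio.sleep(30)  # stands in for a hung downstream service

async def main() -> float:
    start = time.monotonic()
    try:
        await asyncio.wait_for(slow_dependency(), timeout=0.1)
    except asyncio.TimeoutError:
        pass  # fast failure: fall back or surface an error immediately
    return time.monotonic() - start

elapsed = asyncio.run(main())
# elapsed is on the order of 0.1s, not 30s
```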

Fallback Strategies by Criticality

Not every operation has the same criticality. Design fallbacks proportional to the business impact of failure:

Critical path (order processing, authentication): No fallback — surface a clear error to the user. Do not silently serve stale or incorrect data on the critical path.

User experience enhancement (personalization, recommendations): Serve a default/generic response. A homepage recommendation widget that shows generic popular items when the personalization service is unavailable is better than a blank section or a 500 error.

Analytics and logging: Fire and forget. If the analytics pipeline is degraded, log locally and do not impact the user-facing response. Retry in the background or queue the event locally.

Read replicas and caching: Serve stale cached data (signaled with the stale-while-revalidate / stale-if-error Cache-Control extensions from RFC 5861) along with a degraded-mode indicator. Stale product prices displayed with "Prices may be slightly out of date" beat a 503.
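For the "user experience enhancement" tier, the fallback can be a thin wrapper. A sketch, where recommendations, GENERIC_ITEMS, and the 0.8s timeout are all illustrative choices:

```python
import asyncio

GENERIC_ITEMS = ["popular-1", "popular-2", "popular-3"]  # placeholder defaults

async def recommendations(user_id: str, fetch_personalized, timeout: float = 0.8):
    """Personalized items if the service answers in time; generic items otherwise."""
    try:
        return await asyncio.wait_for(fetch_personalized(user_id), timeout=timeout)
    except (asyncio.TimeoutError, ConnectionError):
        return GENERIC_ITEMS  # degraded but useful — never a blank section

async def degraded(user_id):
    # Simulates an unreachable personalization service
    raise ConnectionError("personalization service unreachable")

items = asyncio.run(recommendations("u123", degraded))
```

The caller never sees the dependency failure; it sees a slightly less relevant widget.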

Health Checks: Liveness vs. Readiness

Two distinct probes — do not conflate them:

Liveness probe (/healthz): Is the process alive? Should never fail in normal operation. A failing liveness probe causes Kubernetes to kill and restart the container. Return 200 with no external dependencies checked.

Readiness probe (/readyz): Is this instance ready to accept traffic? Check database connectivity, cache connectivity, any dependency required to serve requests. A failing readiness probe removes the instance from the load balancer rotation without killing it.

```python
from fastapi import Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text

# app, DbSession, Redis, get_db, and get_cache are defined elsewhere in the service
@app.get("/readyz")
async def readiness(db: DbSession = Depends(get_db), cache: Redis = Depends(get_cache)):
    results = {}
    try:
        await db.execute(text("SELECT 1"))
        results["database"] = "ok"
    except Exception as e:
        results["database"] = f"error: {e}"

    try:
        await cache.ping()
        results["cache"] = "ok"
    except Exception as e:
        results["cache"] = f"error: {e}"

    all_ok = all(v == "ok" for v in results.values())
    return JSONResponse(content=results, status_code=200 if all_ok else 503)
```

Resilience is not about preventing failures — that is impossible in distributed systems. It is about converting failures from catastrophic system-wide events into bounded, recoverable, observable incidents.
