Retry Patterns Reference

Consult this file when implementing retry logic, backoff strategies, circuit breakers, or dead letter queue handling.


Exponential Backoff Formulas

Basic Exponential Backoff

delay = base * (multiplier ^ attempt)
  • base: Starting delay (e.g., 100ms)
  • multiplier: Growth factor (typically 2)
  • attempt: Zero-indexed attempt number

Example: 100ms, 200ms, 400ms, 800ms, 1600ms...
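As a minimal sketch of the formula above, the schedule can be computed like this. The function name `exponentialDelay` is hypothetical, and the defaults match the example values (100ms base, multiplier 2).

```typescript
// Hypothetical helper: delay in milliseconds for a zero-indexed attempt.
function exponentialDelay(attempt: number, base = 100, multiplier = 2): number {
  return base * Math.pow(multiplier, attempt);
}

// Delays for the first five attempts with the default base and multiplier.
const schedule = [0, 1, 2, 3, 4].map((attempt) => exponentialDelay(attempt));
console.log(schedule); // [ 100, 200, 400, 800, 1600 ]
```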

Problem: All retrying clients synchronize on the same delay, causing thundering herd when a service recovers.

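Full Jitter

delay = random(0, min(cap, base * (2 ^ attempt)))

Randomizes the delay across the entire range from 0 to the capped exponential value. Because retries are spread evenly over the window instead of clustering at one instant, this usually gives the least contention when many clients retry simultaneously.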

function retryDelay(attempt: number, base = 100, cap = 30_000): number {
  const exponential = Math.min(cap, base * Math.pow(2, attempt));
  return Math.random() * exponential; // full jitter
}

Equal Jitter

temp = min(cap, base * (2 ^ attempt))
delay = temp/2 + random(0, temp/2)

Guarantees a minimum wait (temp/2) while adding jitter. Useful when you want some delay guaranteed but still want distribution.
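A possible TypeScript version of equal jitter, written as a sketch. The name `equalJitterDelay` and the injectable `rand` parameter are assumptions added here so the function can be exercised deterministically; by default it uses `Math.random`.

```typescript
// Equal jitter: half of the capped exponential delay is fixed, half is random.
// `rand` is a hypothetical injection point; it must return a value in [0, 1).
function equalJitterDelay(
  attempt: number,
  base = 100,
  cap = 30_000,
  rand: () => number = Math.random,
): number {
  const temp = Math.min(cap, base * Math.pow(2, attempt));
  return temp / 2 + rand() * (temp / 2);
}
```

With the defaults, attempt 3 yields a delay between 400ms and 800ms, so a client never retries sooner than half of the exponential value.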

Decorrelated Jitter (AWS Recommendation)

delay = min(cap, random(base, prev_delay * 3))

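Each delay is drawn from a range based on the previous delay rather than on the attempt number, so clients that started in lockstep drift apart quickly. Best when clients have different base delays or retry independently.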

function* decorrelatedBackoff(base = 100, cap = 30_000) {
  let prev = base;
  while (true) {
    const next = Math.min(cap, base + Math.random() * (prev * 3 - base));
    prev = next;
    yield next;
  }
}

Retry Decision Matrix

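| Error Type            | HTTP Status   | Retry? | Strategy                              |
|-----------------------|---------------|--------|---------------------------------------|
| Transient network     | 0, ECONNRESET | Yes    | Full jitter backoff                   |
| Rate limit            | 429           | Yes    | Use Retry-After header                |
| Service unavailable   | 503           | Yes    | Full jitter backoff                   |
| Gateway timeout       | 504           | Yes    | Full jitter backoff                   |
| Bad request           | 400           | No     | Permanent failure                     |
| Unauthorized          | 401           | Once   | Refresh token first, then retry once  |
| Forbidden             | 403           | No     | Permanent failure                     |
| Not found             | 404           | No     | Permanent failure                     |
| Conflict              | 409           | Maybe  | Depends on idempotency                |
| Internal server error | 500           | Maybe  | Once with delay; escalate if persists |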
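The matrix can be encoded as a small lookup, sketched below. The names `decideRetry`, `Strategy`, and `parseRetryAfterMs` are hypothetical. Two choices here are assumptions rather than rules from the matrix: statuses that the matrix does not list are treated as non-retryable, and a 409 is retried only when the caller states that the operation is idempotent. A 401 is modeled as a single retry after a token refresh, following the Strategy column.

```typescript
type Strategy =
  | 'full-jitter'
  | 'retry-after'
  | 'refresh-token-then-once'
  | 'once-with-delay'
  | 'none';

interface RetryDecision {
  retry: boolean;
  strategy: Strategy;
}

// Maps an HTTP status (0 = no response, e.g. ECONNRESET) to a retry decision.
function decideRetry(status: number, idempotent = false): RetryDecision {
  switch (status) {
    case 0:
    case 503:
    case 504:
      return { retry: true, strategy: 'full-jitter' };
    case 429:
      return { retry: true, strategy: 'retry-after' };
    case 401:
      return { retry: true, strategy: 'refresh-token-then-once' };
    case 409:
      // Assumption: only retry a conflict when the operation is idempotent.
      return idempotent
        ? { retry: true, strategy: 'once-with-delay' }
        : { retry: false, strategy: 'none' };
    case 500:
      return { retry: true, strategy: 'once-with-delay' };
    default:
      // Assumption: anything not listed in the matrix is a permanent failure.
      return { retry: false, strategy: 'none' };
  }
}

// Retry-After is either a number of seconds or an HTTP date.
// Returns the wait in milliseconds, or null if the value cannot be parsed.
function parseRetryAfterMs(header: string, nowMs: number = Date.now()): number | null {
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}
```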

Circuit Breaker Pattern

The circuit breaker prevents cascading failures by stopping requests to a failing dependency before they overwhelm it or pile up in queues waiting on timeouts.

States

CLOSED ──(failure threshold exceeded)──► OPEN
  ▲                                        │
  │                                        │ (timeout elapsed)
  └──(probe succeeds)──── HALF-OPEN ◄──────┘
                              │
                        (probe fails)
                              │
                              ▼
                     OPEN (reset timeout)

Implementation (TypeScript)

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface CircuitBreakerConfig {
  failureThreshold: number; // consecutive failures before opening
  successThreshold: number; // successes in HALF_OPEN before closing
  openTimeoutMs: number; // how long to stay OPEN before probing
  requestTimeoutMs: number; // individual request timeout
}

// Thrown when a call is rejected because the circuit is OPEN.
class CircuitOpenError extends Error {}

// Thrown when a request exceeds requestTimeoutMs.
class TimeoutError extends Error {}

class CircuitBreaker<T> {
  private state: CircuitState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt: number | null = null;

  constructor(
    private fn: () => Promise<T>,
    private config: CircuitBreakerConfig
  ) {}

  async call(): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt! < this.config.openTimeoutMs) {
        throw new CircuitOpenError('Circuit breaker is OPEN');
      }
      // Open timeout elapsed: let probe requests through.
      this.state = 'HALF_OPEN';
      this.successes = 0;
    }

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new TimeoutError('Request timed out')),
        this.config.requestTimeoutMs
      );
    });

    try {
      const result = await Promise.race([this.fn(), timeout]);
      this.onSuccess();
      return result;
    } catch (e) {
      this.onFailure();
      throw e;
    } finally {
      clearTimeout(timer); // do not leave a pending timer behind after each call
    }
  }

  private onSuccess() {
    this.failures = 0;
    if (this.state === 'HALF_OPEN') {
      this.successes++;
      if (this.successes >= this.config.successThreshold) {
        this.state = 'CLOSED';
      }
    }
  }

  private onFailure() {
    this.failures++;
    // A failed probe, or too many failures while CLOSED, (re)opens the circuit.
    if (this.state === 'HALF_OPEN' || this.failures >= this.config.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}

Configuration Guidelines

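| Dependency Type       | Failure Threshold | Open Timeout | Notes                            |
|-----------------------|-------------------|--------------|----------------------------------|
| External payment API  | 5                 | 60s          | Low tolerance, long recovery     |
| Internal microservice | 10                | 30s          | Higher tolerance                 |
| Database primary      | 3                 | 10s          | Fail fast, failover fast         |
| Cache (Redis)         | 20                | 5s           | Degrade gracefully without cache |
| Email provider        | 5                 | 120s         | External SLA, long cool-down     |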
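One way to keep these guidelines next to the code is a typed preset map, sketched below. The names `BreakerPreset`, `DependencyType`, and `BREAKER_PRESETS` are hypothetical. The guidelines give no values for `successThreshold` or `requestTimeoutMs`, so the presets cover only the two documented fields and the rest must be supplied per service.

```typescript
// Only the two fields the guidelines specify; other settings are left to the caller.
interface BreakerPreset {
  failureThreshold: number;
  openTimeoutMs: number;
}

type DependencyType =
  | 'externalPaymentApi'
  | 'internalMicroservice'
  | 'databasePrimary'
  | 'redisCache'
  | 'emailProvider';

const BREAKER_PRESETS: Record<DependencyType, BreakerPreset> = {
  externalPaymentApi: { failureThreshold: 5, openTimeoutMs: 60_000 },
  internalMicroservice: { failureThreshold: 10, openTimeoutMs: 30_000 },
  databasePrimary: { failureThreshold: 3, openTimeoutMs: 10_000 },
  redisCache: { failureThreshold: 20, openTimeoutMs: 5_000 },
  emailProvider: { failureThreshold: 5, openTimeoutMs: 120_000 },
};
```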

Dead Letter Queue (DLQ) Patterns

Messages that have exhausted retries go to a DLQ rather than being discarded.

When to Use DLQ

  • Background job failed all retries (permanent or unknown failure)
  • Message processing is non-idempotent and failed partway through
  • You need auditability of all failures
  • Human review or manual replay may be needed later

DLQ Message Schema

interface DLQMessage<T> {
  // Original message
  originalMessage: T;
  originalQueue: string;

  // Failure metadata
  failureReason: string;
  failureCode: string;
  lastAttemptAt: string; // ISO 8601
  totalAttempts: number;
  errorStack?: string; // sanitized, no secrets

  // Routing metadata
  messageId: string;
  correlationId: string;
  enqueuedAt: string;
  dlqEnqueuedAt: string;

  // Replay support
  replayable: boolean; // false if side effects partially applied
  replayInstructions?: string; // human-readable notes for operators
}

DLQ Strategies

Alarm on DLQ depth: Alert when the DLQ grows beyond its expected volume. A DLQ that grows without triggering an alert means failures are accumulating unnoticed.

Replay pipeline: Build a separate process to inspect DLQ messages and replay them to the original queue after root cause is fixed. Never replay blindly — check idempotency first.

DLQ TTL: Set expiration on DLQ messages (7-30 days). After expiry, log a final failure metric and discard. Indefinite DLQ retention causes storage bloat and operational debt.

Separate DLQs per severity: High-value failures (payment processing) → monitored DLQ with pager. Low-value (analytics events) → silent DLQ with daily review.
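The replay and TTL strategies above could be combined into a triage step, sketched here under stated assumptions. `DlqEnvelope` is a reduced, hypothetical shape carrying only the fields this sketch needs, and `triageDlq` is a hypothetical name. Actual replay still requires an idempotency check before anything is sent back to the original queue.

```typescript
// Reduced, hypothetical envelope: only the fields this sketch reads.
interface DlqEnvelope<T> {
  originalMessage: T;
  dlqEnqueuedAt: string; // ISO 8601
  replayable: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// True when the message has been in the DLQ longer than the TTL.
function isExpired<T>(msg: DlqEnvelope<T>, ttlDays: number, nowMs: number): boolean {
  return nowMs - Date.parse(msg.dlqEnqueuedAt) > ttlDays * DAY_MS;
}

// Splits DLQ contents into replay candidates, messages needing human review,
// and expired messages that should be logged as final failures and discarded.
function triageDlq<T>(messages: DlqEnvelope<T>[], ttlDays: number, nowMs: number = Date.now()) {
  const replay: DlqEnvelope<T>[] = [];
  const review: DlqEnvelope<T>[] = [];
  const expired: DlqEnvelope<T>[] = [];
  for (const msg of messages) {
    if (isExpired(msg, ttlDays, nowMs)) expired.push(msg);
    else if (msg.replayable) replay.push(msg);
    else review.push(msg);
  }
  return { replay, review, expired };
}
```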


Python Retry Implementation

import asyncio
import random
import logging
from functools import wraps
from typing import TypeVar, Callable, Awaitable

T = TypeVar('T')
logger = logging.getLogger(__name__)


def with_retry(
    max_attempts: int = 3,
    base_delay: float = 0.1,  # seconds
    max_delay: float = 30.0,
    retryable_exceptions: tuple = (Exception,),
    jitter: bool = True,
):
    """Decorator for async functions with exponential backoff + full jitter."""
    def decorator(fn: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(fn)
        async def wrapper(*args, **kwargs) -> T:
            for attempt in range(max_attempts):
                try:
                    return await fn(*args, **kwargs)
                except retryable_exceptions as e:
                    # Last attempt: log and re-raise the original exception.
                    if attempt == max_attempts - 1:
                        logger.error(
                            "Max retries exceeded",
                            extra={"function": fn.__name__, "attempts": attempt + 1, "error": str(e)}
                        )
                        raise

                    # Capped exponential delay, optionally randomized over [0, delay].
                    delay = min(max_delay, base_delay * (2 ** attempt))
                    if jitter:
                        delay = random.uniform(0, delay)

                    logger.warning(
                        "Retrying after error",
                        extra={"function": fn.__name__, "attempt": attempt + 1, "delay": delay, "error": str(e)}
                    )
                    await asyncio.sleep(delay)
            # Only reached if max_attempts < 1; also satisfies the type checker.
            raise RuntimeError("Unreachable")
        return wrapper
    return decorator


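# Usage
import aiohttp

API_BASE_URL = "https://api.example.com"  # placeholder; aiohttp needs an absolute URL


# Note: raise_for_status() raises ClientResponseError, a ClientError subclass,
# so 4xx responses are retried too unless you filter by status.
@with_retry(max_attempts=3, retryable_exceptions=(aiohttp.ClientError, asyncio.TimeoutError))
async def fetch_user(user_id: str) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{API_BASE_URL}/api/users/{user_id}", timeout=aiohttp.ClientTimeout(total=5)) as resp:
            resp.raise_for_status()
            return await resp.json()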

Idempotency and Retry Safety

Before retrying any operation, confirm it is idempotent or make it idempotent:

Naturally idempotent: GET, PUT (full replacement), DELETE (on already-deleted resource)

Not idempotent by default: POST (creates new record), PATCH (incremental update), financial debits

Making POST idempotent: Idempotency keys. Send a client-generated UUID with every request. Server stores the key and returns the same response for duplicate requests within a TTL window.

// Client: generate the key once per logical operation and reuse it on every retry
const idempotencyKey = crypto.randomUUID();
await fetch('/api/payments', {
  method: 'POST',
  headers: {
    'Idempotency-Key': idempotencyKey,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(paymentData),
});

// Server: check the key in Redis before processing
// (`key` is the Idempotency-Key header value from the incoming request)
const existing = await redis.get(`idempotency:${key}`);
if (existing) return JSON.parse(existing); // replay cached response

Store idempotency results for 24-48 hours. Use a key namespace that includes the operation type to prevent cross-operation collisions.
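The storage rules above can be illustrated with an in-memory stand-in for the Redis lookup. This is a sketch under assumptions, not a production design: `IdempotencyStore` and its method names are hypothetical, the handler is synchronous to keep the example short, and duplicate requests that arrive while the first is still in flight are not handled.

```typescript
// In-memory sketch of an idempotency cache with a TTL and a namespaced key.
class IdempotencyStore<R> {
  private entries = new Map<string, { response: R; expiresAt: number }>();
  private ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs; // default: 24 hours
  }

  // Runs `handler` at most once per (operation, key) pair within the TTL window.
  execute(operation: string, key: string, handler: () => R, nowMs: number = Date.now()): R {
    // The operation type is part of the key to prevent cross-operation collisions.
    const storageKey = `idempotency:${operation}:${key}`;
    const cached = this.entries.get(storageKey);
    if (cached && cached.expiresAt > nowMs) {
      return cached.response; // duplicate request: replay the stored response
    }
    const response = handler();
    this.entries.set(storageKey, { response, expiresAt: nowMs + this.ttlMs });
    return response;
  }
}
```

A retried payment request that carries the same key gets the stored response back, while the same key under a different operation name is processed separately.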