RETRY_DISCIPLINE_FOR_REAL_INCIDENTS

Retry storms are one of the most predictable causes of extended outages, and one of the least often prevented. The pattern is consistent: a partial failure causes clients to retry at high frequency, which saturates the already degraded system, which causes more failures, which causes more retries. The incident that started with one failing component ends with several.

THE_CORE_RULE

Every retry attempt draws from a finite capacity budget. When clients are designed as if retries are free, they will spend that budget in ways that extend incidents rather than resolve them.

HOW_STORMS_START

The failure usually does not start with the retry logic. It starts with saturation: a database queue fills, a cache shard stalls, a network segment degrades. The storm starts when every upstream layer independently decides to retry the same request class against the same already-overloaded dependency.

Three patterns dominate postmortems. Identical timeout values across services — a legacy of copy-pasted service templates that no one owns — cause all clients to fire simultaneously when the timeout triggers. Retries stacked at multiple layers of the call graph without coordination turn one user request into dozens of backend requests within seconds. Exponential backoff with no jitter causes synchronized retry bursts where all clients wait the same duration and then hit the service together.
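The stacking pattern is worth doing the arithmetic on. A minimal sketch of worst-case amplification, assuming a three-layer call graph where every layer is configured with three retries (both numbers are illustrative, not taken from a specific incident):

  # Worst-case request amplification when every layer retries independently.
  # Each layer that makes (1 + retries) attempts multiplies the attempts made
  # by the layer below it.

  def worst_case_attempts(retries_per_layer):
      """Attempts reaching the deepest dependency if every call times out."""
      attempts = 1
      for retries in retries_per_layer:
          attempts *= 1 + retries
      return attempts

  # Edge, API gateway, and service layer each configured with 3 retries:
  # one user request becomes up to (1 + 3)^3 = 64 attempts at the database.
  print(worst_case_attempts([3, 3, 3]))  # 64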

"Retries are load. Treat them as a controlled spend, not a free second chance."

BUDGET_THE_RETRIES

The right model is to allocate timeout budget from the top-level SLA downward. If the user-facing endpoint has a 1200ms SLA, the downstream call cannot quietly consume 1100ms and leave 100ms for retries, rendering, and error handling. Per-hop timeouts should be derived from the total budget, not inherited from defaults copied across every service in the stack.
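One way to make that explicit is to carry an absolute deadline with the request and derive every hop's timeout from the time remaining. A minimal sketch, assuming the 1200ms SLA above, an assumed 200ms reserve for rendering and error handling, and an assumed 400ms per-hop cap:

  import time

  TOP_LEVEL_SLA_S = 1.2       # 1200ms user-facing SLA
  RESPONSE_RESERVE_S = 0.2    # assumed reserve for rendering and error handling

  class Deadline:
      """Absolute deadline derived once from the top-level budget."""
      def __init__(self, budget_s):
          self.expires_at = time.monotonic() + budget_s

      def remaining(self):
          return max(0.0, self.expires_at - time.monotonic())

  def downstream_timeout(deadline, per_hop_cap_s=0.4):
      # Per-hop timeout is the smaller of the hop's own cap and whatever is
      # left of the overall budget, never a library default.
      timeout = min(per_hop_cap_s, deadline.remaining())
      if timeout <= 0:
          raise TimeoutError("budget exhausted before the call was made")
      return timeout

  deadline = Deadline(TOP_LEVEL_SLA_S - RESPONSE_RESERVE_S)
  print(downstream_timeout(deadline))  # e.g. pass as requests.get(url, timeout=...)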

  • Set per-hop timeouts proportionally from the top-level latency budget — not from default values copied at service creation and never revisited.
  • Permit retries at exactly one layer of the call stack and disable implicit retry behavior in SDKs and client libraries at every other layer.
  • Add full jitter to backoff intervals: randomize the wait time across the full range rather than applying a fixed multiplier that keeps clients synchronized (see the sketch after this list).
  • Use idempotency keys for all mutation endpoints so that retries on checkout, billing, and provisioning do not produce duplicate effects.
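The jitter bullet is the one most often implemented halfway, so here is a minimal sketch of the full-jitter form: the wait is drawn uniformly from zero up to the capped exponential value, so clients that failed together do not come back together. The base delay, cap, and attempt budget are assumed values:

  import random
  import time

  class TransientError(Exception):
      """Stand-in for an error class the client knows is safe to retry."""

  BASE_DELAY_S = 0.1    # assumed starting backoff
  MAX_DELAY_S = 10.0    # assumed ceiling
  MAX_ATTEMPTS = 4      # assumed attempt budget at the single retrying layer

  def full_jitter_delay(attempt):
      # Draw uniformly from [0, min(cap, base * 2**attempt)] instead of
      # sleeping exactly base * 2**attempt like every other client.
      ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
      return random.uniform(0.0, ceiling)

  def call_with_retries(do_request):
      last_error = None
      for attempt in range(MAX_ATTEMPTS):
          try:
              return do_request()
          except TransientError as err:
              last_error = err
              time.sleep(full_jitter_delay(attempt))
      raise last_error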

OVERLOAD_SIGNALS_AND_EMERGENCY_CONTROLS

Well-behaved services emit overload signals that clients should act on. A 429 Too Many Requests response with a Retry-After header is the clearest signal, but many services return generic 500 responses under overload. Clients that do not distinguish between transient errors and overload will keep retrying at full rate regardless of what the server returns.
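A minimal sketch of a client-side decision that separates overload signals from ordinary transient failures; the status-code handling follows standard HTTP semantics, while the backoff values and the Response stand-in are assumptions:

  from dataclasses import dataclass, field

  @dataclass
  class Response:
      """Stand-in for an HTTP response from whatever client library is in use."""
      status_code: int
      headers: dict = field(default_factory=dict)

  OVERLOAD_STATUSES = {429, 503}
  MIN_OVERLOAD_BACKOFF_S = 5.0   # assumed floor when the server signals overload
  GENERIC_5XX_BACKOFF_S = 1.0    # assumed base for ordinary transient failures

  def retry_delay(response):
      """Seconds to wait before the next attempt, or None to stop retrying."""
      if response.status_code in OVERLOAD_STATUSES:
          retry_after = response.headers.get("Retry-After")
          if retry_after is not None:
              try:
                  # Respect the server's hint, but never go below the floor.
                  return max(float(retry_after), MIN_OVERLOAD_BACKOFF_S)
              except ValueError:
                  pass   # Retry-After can also be an HTTP date; fall back to the floor
          return MIN_OVERLOAD_BACKOFF_S
      if 500 <= response.status_code < 600:
          return GENERIC_5XX_BACKOFF_S   # still subject to the jittered attempt budget
      return None   # success or a non-retryable client error

  print(retry_delay(Response(429, {"Retry-After": "30"})))   # 30.0
  print(retry_delay(Response(500)))                          # 1.0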

The fastest intervention during a live storm is usually not a code change. It is deploying a configuration change that reduces retry limits and increases backoff floors. If that configuration does not exist before the incident, it is much harder to produce and deploy under pressure.

  • Publish a retry-emergency profile as a config flag, deployable without a code change or full redeploy (see the sketch after this list).
  • Log retry-attempt counts alongside error rates so incident responders can identify storm sources within minutes rather than only in the postmortem.
  • Test overload behavior explicitly: does your service emit useful signals under load, and do your clients correctly interpret and act on them?
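One hedged way to shape that emergency profile: a small config document that the single retrying layer re-reads at runtime, so responders can tighten retry behavior without shipping code. All names, values, and the file path here are illustrative:

  import json

  DEFAULT_PROFILE = {"max_attempts": 4, "base_delay_s": 0.1, "backoff_floor_s": 0.0}
  EMERGENCY_PROFILE = {"max_attempts": 1, "base_delay_s": 2.0, "backoff_floor_s": 5.0}

  def load_retry_profile(path="retry_profile.json"):
      """Read the active profile from config; fall back to defaults if absent."""
      try:
          with open(path) as f:
              return {**DEFAULT_PROFILE, **json.load(f)}
      except FileNotFoundError:
          return DEFAULT_PROFILE

  # During an incident, responders write EMERGENCY_PROFILE to the config path
  # (or flip the equivalent flag in the config service); clients pick it up on
  # their next attempt, cutting retry volume without a redeploy.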
