RETRY_DISCIPLINE_FOR_REAL_INCIDENTS

Retry storms are one of the most predictable causes of extended outages, and one of the least often prevented. The pattern is consistent: a partial failure causes clients to retry at high frequency, which saturates the already degraded system, which causes more failures, which causes more retries. The incident that started with one failing component ends with several.

THE_CORE_RULE

Every retry attempt draws from a finite capacity budget. When clients are designed as if retries are free, they will spend that budget in ways that extend incidents rather than resolve them.

HOW_STORMS_START

The failure usually does not start with the retry logic. It starts with saturation: a database queue fills, a cache shard stalls, a network segment degrades. The storm starts when every upstream layer independently decides to retry the same request class against the same already-overloaded dependency.

Three patterns dominate postmortems. Identical timeout values across services — a legacy of copy-pasted service templates that no one owns — cause all clients to fire simultaneously when the timeout triggers. Retries stacked at multiple layers of the call graph without coordination turn one user request into dozens of backend requests within seconds. Exponential backoff with no jitter causes synchronized retry bursts where all clients wait the same duration and then hit the service together.

"Retries are load. Treat them as a controlled spend, not a free second chance."

BUDGET_THE_RETRIES

The right model is to allocate timeout budget from the top-level SLA downward. If the user-facing endpoint has a 1200ms SLA, the downstream call cannot quietly consume 1100ms and leave 100ms for retries, rendering, and error handling. Per-hop timeouts should be derived from the total budget, not inherited from defaults copied across every service in the stack.

  • Set per-hop timeouts proportionally from the top-level latency budget — not from default values copied at service creation and never revisited.
  • Permit retries at exactly one layer of the call stack and disable implicit retry behavior in SDKs and client libraries at every other layer.
  • Add full jitter to backoff intervals: randomize the wait time across the full range rather than applying a fixed multiplier that keeps clients synchronized.
  • Use idempotency keys for all mutation endpoints so that retries on checkout, billing, and provisioning do not produce duplicate effects.

OVERLOAD_SIGNALS_AND_EMERGENCY_CONTROLS

Well-behaved services emit overload signals that clients should act on. A 429 Too Many Requests response with a Retry-After header is the clearest signal, but many services return generic 500 responses under overload. Clients that do not distinguish between transient errors and overload will keep retrying at full rate regardless of what the server returns.

The fastest intervention during a live storm is usually not a code change. It is deploying a configuration change that reduces retry limits and increases backoff floors. If that configuration does not exist before the incident, it is much harder to produce and deploy under pressure.

  • Publish a retry-emergency profile as a config flag, deployable without a code change or full redeploy.
  • Log retry-attempt counts alongside error rates so incident responders can identify storm sources within minutes rather than after a postmortem.
  • Test overload behavior explicitly: does your service emit useful signals under load, and do your clients correctly interpret and act on them?

CIRCUIT_BREAKERS_AND_THE_HALF_OPEN_STATE

A circuit breaker is a state machine between a caller and a dependency. In the closed state, calls pass through normally. When failures exceed a threshold, the circuit trips to open and fails fast without contacting the dependency at all. After a configurable wait, it enters half-open and allows a single probe request. If the probe succeeds the circuit closes; if it fails it opens again with a fresh timer.

The half-open state is where most circuit breaker implementations diverge in practice. Some allow a configurable number of simultaneous probes; others block until a single probe resolves. The right behavior depends on whether the dependency can handle burst traffic immediately after recovery, and whether a failed probe during half-open should reset the open timer or apply a separate backoff multiplier. Neither answer is wrong in every case, but the defaults in most libraries were chosen for the library's maintainer's use case — not yours. Verify what your implementation does before relying on it under pressure.

  • Set open circuit duration to match the typical recovery time of the dependency, not a constant copied from a tutorial.
  • Log circuit state transitions with timestamps so responders can correlate circuit trips with upstream failure events in the same timeline.
  • Make circuit breaker state visible in your operations dashboards — a tripped circuit that nobody can see is not helping anyone.
  • Test half-open behavior explicitly in a staging environment under realistic load. Probe failures during half-open can cause oscillation rather than recovery if the timer resets incorrectly.

INSTRUMENTATION_THAT_REVEALS_STORM_RISK

Retry behavior is invisible unless you make it visible. The minimum useful instrumentation is retry attempt count per request logged alongside the final outcome, and a circuit breaker state metric exported to your metrics pipeline. Without these, incident responders diagnosing a storm spend most of their time establishing baseline behavior that should already be known.

Retry-after compliance is a particularly useful signal. When clients obey server-specified backoff windows, you see it in the request rate distribution — traffic drops after a 429 and resumes after the specified interval. When clients ignore the header or the server is not emitting one, synchronized retry bursts appear in the request histogram. That pattern is distinctive and easy to identify once you know what to look for.

The most actionable metric for storm risk is retry rate as a fraction of total requests. Under normal conditions this should be low and stable. A rising retry fraction ahead of a full degradation event is an early warning that something in the dependency graph is straining. Alerting on retry fraction gives operations teams a leading indicator rather than a lagging one — the difference between intervening before a storm and diagnosing one after it has run its course.

  • Export retry attempt count as a labeled metric, not just a log line. You need to aggregate and alert on it, not search for it during an incident.
  • Set an alert on rising retry fraction — not just error rate — so the signal reaches on-call before the customer-facing impact is visible.
  • Include retry context in distributed trace spans: which attempt number, what the previous outcome was, and what backoff interval was applied.

MAKING_THE_CONFIGURATION_DEPLOYABLE

The fastest intervention during a live retry storm is rarely a code change. It is deploying a configuration change that reduces retry limits, increases backoff floors, and widens jitter ranges. If that configuration does not exist as a deployable artifact before the incident starts, it is much harder to produce, review, and push to production under the time and communication pressure of a live event.

The operational preparation is to define a "retry emergency profile" as a first-class config artifact — one that has been reviewed, tested in staging, and is deployable without a full code release pipeline. Knowing that this lever exists and where it lives reduces the time between "we are in a storm" and "we have reduced the load on the degraded dependency" from tens of minutes to under five.

PRIMARY_REFERENCES