Rate Limiting

Controlling the rate of requests or operations to prevent abuse and ensure fair resource usage.

Core Idea

Rate limiting protects systems from overload by capping how many requests a client can make in a time window. Implementations include token bucket, sliding window, and leaky bucket algorithms.

Common Algorithms

Algorithm               Best for                   Strength                      Weakness
Fixed window            Simple APIs                Easy to implement             Burst at window edges
Sliding window log      Precise fairness           Accurate                      Expensive at scale
Sliding window counter  Public APIs                Good accuracy/cost tradeoff   Approximation
Token bucket            User-facing traffic        Allows short bursts           Slightly harder to reason about
Leaky bucket            Smoothing background work  Stable output rate            Less flexible for bursts
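
To make the "approximation" row concrete, here is a minimal sketch of a sliding window counter: the previous fixed window's count is weighted by how much of it still overlaps the sliding window. The class and field names are illustrative, not from any particular library.

```python
from time import monotonic

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's
    count by the fraction of it still inside the sliding window."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window = window_sec
        self.prev_count = 0
        self.curr_count = 0
        self.curr_start = monotonic()

    def allow(self) -> bool:
        now = monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # Roll the window forward; if more than one full window
            # passed, the previous count is stale and drops to zero.
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start += (elapsed // self.window) * self.window
            elapsed = now - self.curr_start
        # Fraction of the previous window still overlapping.
        overlap = max(0.0, 1.0 - elapsed / self.window)
        estimated = self.prev_count * overlap + self.curr_count
        if estimated >= self.limit:
            return False
        self.curr_count += 1
        return True
```

The approximation assumes requests in the previous window were evenly spread, which is why the table calls it a tradeoff rather than exact.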

Token bucket is the default choice for most production APIs. Tokens refill at a steady rate and each request spends one (or more, for expensive operations). A capacity larger than one second's refill permits short bursts, while the refill rate still caps the sustained average, so a client cannot keep bursting indefinitely.

from time import monotonic

class TokenBucket:
    """Single-process token bucket; not thread-safe without a lock."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec   # tokens added per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start full so early bursts pass
        self.updated_at = monotonic()

    def allow(self, cost: int = 1) -> bool:
        # Lazily refill based on the time elapsed since the last call.
        now = monotonic()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True

What You Rate Limit

  • Requests per IP or API key
  • Login attempts per account and per subnet
  • Expensive work units, not just HTTP requests
  • Background jobs pulled from a queue
  • Internal dependencies where overload cascades downstream

The important design choice is the key. Rate limiting by IP is cheap but unfair behind NAT. Rate limiting by account is fairer but does not protect anonymous endpoints. Many systems combine both.

Distributed Rate Limiting

Single-process counters work only when traffic hits one node. Once the service scales horizontally, you need shared state or consistent routing.

Common approaches:

  1. Store counters in Redis with expiry.
  2. Use a token bucket in each node plus a global coarse limit.
  3. Apply rate limiting at the edge gateway and keep stricter per-user controls in the app.
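
Approach 1 is commonly built on Redis INCR with a TTL. The counter logic can be sketched as below, with the shared store simulated by an in-process dict; in production the increment and expiry must be atomic on the real store (for example INCR plus EXPIRE in a pipeline or a Lua script).

```python
import time

# Simulated shared store: key -> (count, expiry_timestamp).
# Stands in for Redis; INCR + EXPIRE would replace these operations.
store: dict[str, tuple[int, float]] = {}

def incr_with_expiry(key: str, ttl: float) -> int:
    """Increment a counter that resets itself after `ttl` seconds."""
    now = time.monotonic()
    count, expires = store.get(key, (0, now + ttl))
    if now >= expires:
        count, expires = 0, now + ttl  # old window expired
    count += 1
    store[key] = (count, expires)
    return count

def allow(key: str, limit: int, window_sec: float) -> bool:
    # One counter per key; the expiry clears it automatically,
    # so no background cleanup job is needed.
    return incr_with_expiry(key, window_sec) <= limit
```

This gives fixed-window semantics across all nodes at the cost of one round trip to the store per request.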

Decide failure behavior explicitly: if the limiter backend fails open, abuse gets through during the outage; if it fails closed, healthy traffic is dropped. Neither default is right for every endpoint, so do not leave the choice to whatever exception the client library happens to raise.
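
That decision can be encoded as an explicit flag rather than an accident of error handling. A minimal sketch, with `fail_open` as a hypothetical per-endpoint configuration value:

```python
def check_limit(limiter_call, fail_open: bool) -> bool:
    """Wrap a limiter backend call so outage behavior is a deliberate
    policy. `limiter_call` returns True when the request is allowed."""
    try:
        return limiter_call()
    except Exception:
        # Backend unreachable: admit traffic (fail open) or shed it
        # (fail closed), chosen per endpoint.
        return fail_open
```

A login form might fail closed while a public read endpoint fails open, matching what each one is protecting.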

Practical Rules

  • Return 429 Too Many Requests with Retry-After when the caller can back off.
  • Separate read limits, write limits, and auth limits.
  • Rate limit expensive endpoints more aggressively than cheap ones.
  • Track allow and deny counts per limiter key so you can see whether thresholds are sensible.
  • If overload is already happening, combine rate limiting with queues, timeouts, and backpressure.
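
For the first rule, the Retry-After value can be derived from the limiter state itself rather than hard-coded. A sketch assuming a token-bucket shape like the class above; the `tokens` and `rate_per_sec` parameters mirror that design and are not a fixed API:

```python
import math

def retry_after_seconds(tokens: float, rate_per_sec: float, cost: int = 1) -> int:
    """Seconds until enough tokens accumulate to cover `cost`,
    rounded up so the header never promises too early a retry."""
    deficit = cost - tokens
    if deficit <= 0:
        return 0
    return math.ceil(deficit / rate_per_sec)

# A 429 response would then carry:
#   429 Too Many Requests
#   Retry-After: <retry_after_seconds(bucket.tokens, bucket.rate)>
```

Rounding up matters: a client that retries exactly on time should find a token waiting, not be rejected again.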

Engineering Checklist

  • What exactly is being protected: CPU, DB, third-party API, login form?
  • Is burstiness acceptable or should traffic be smoothed?
  • Can one bad tenant starve everyone else?
  • What happens if the limiter store is unavailable?

Related Topics

  • Heaps — priority scheduling and fairness tradeoffs
  • Message Queues — backpressure after admission control