Rate Limiting
Controlling the rate of requests or operations to prevent abuse and ensure fair resource usage.
Core Idea
Rate limiting protects systems from overload by capping how many requests a client can make in a time window. Implementations include token bucket, sliding window, and leaky bucket algorithms.
Common Algorithms
| Algorithm | Best for | Strength | Weakness |
|---|---|---|---|
| Fixed window | Simple APIs | Easy to implement | Burst at window edges |
| Sliding window log | Precise fairness | Accurate | Expensive at scale |
| Sliding window counter | Public APIs | Good accuracy/cost tradeoff | Approximation |
| Token bucket | User-facing traffic | Allows short bursts | Slightly harder to reason about |
| Leaky bucket | Smoothing background work | Stable output rate | Less flexible for bursts |
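The simplest row of the table, a fixed-window counter, can be sketched in a few lines (class name and limits here are illustrative):

```python
from time import monotonic

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_sec`-second window."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.window_start = monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = monotonic()
        # Reset the counter whenever a new window begins.
        if now - self.window_start >= self.window_sec:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True
```

This also shows the weakness the table lists: a client can spend its full limit at the end of one window and again at the start of the next, passing up to twice the limit across the boundary.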
Token bucket is the default choice for most production APIs. You refill tokens at a steady rate and spend one token per request. A capacity larger than the per-second refill rate allows short bursts without letting a client sustain a higher rate.
```python
from time import monotonic

class TokenBucket:
    """Refill `rate_per_sec` tokens per second up to `capacity`; spend one per request."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so new clients can burst
        self.updated_at = monotonic()

    def allow(self, cost: int = 1) -> bool:
        now = monotonic()
        elapsed = now - self.updated_at
        # Lazily refill based on time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

What You Rate Limit
- Requests per IP or API key
- Login attempts per account and per subnet
- Expensive work units, not just HTTP requests
- Background jobs pulled from a queue
- Internal dependencies where overload cascades downstream
The important design choice is the key. Rate limiting by IP is cheap but unfair behind NAT. Rate limiting by account is fairer but does not protect anonymous endpoints. Many systems combine both.
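Combining keys amounts to checking several limiters and denying if any one denies. A minimal sketch, reusing the token bucket from above (repeated here so the block runs standalone; the per-IP and per-account limits are illustrative):

```python
from collections import defaultdict
from time import monotonic
from typing import Optional

class TokenBucket:
    # Same as the sketch above, repeated so this block is self-contained.
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate, self.capacity = rate_per_sec, capacity
        self.tokens, self.updated_at = float(capacity), monotonic()

    def allow(self, cost: int = 1) -> bool:
        now = monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True

class CompositeLimiter:
    """Deny when either the per-IP or the per-account bucket is empty."""

    def __init__(self):
        # IP limits catch anonymous abuse (but are unfair behind NAT);
        # account limits are fairer but only cover authenticated traffic.
        self.by_ip = defaultdict(lambda: TokenBucket(rate_per_sec=10.0, capacity=20))
        self.by_account = defaultdict(lambda: TokenBucket(rate_per_sec=5.0, capacity=10))

    def allow(self, ip: str, account: Optional[str] = None) -> bool:
        if not self.by_ip[ip].allow():
            return False
        if account is not None and not self.by_account[account].allow():
            return False
        return True
```

Checking the IP bucket first means anonymous and authenticated traffic share one outer limit, while the account bucket adds a stricter inner limit for logged-in users.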
Distributed Rate Limiting
Single-process counters work only when traffic hits one node. Once the service scales horizontally, you need shared state or consistent routing.
Common approaches:
- Store counters in Redis with expiry.
- Use a token bucket in each node plus a global coarse limit.
- Apply rate limiting at the edge gateway and keep stricter per-user controls in the app.
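The Redis approach is usually a fixed-window counter per key: `INCR` the key, set an expiry on the first increment, and deny once the count exceeds the limit. A sketch of that pattern, shown against a minimal in-memory stub standing in for a Redis client (a real deployment would use an actual client such as redis-py with the same two commands):

```python
class FakeRedis:
    """In-memory stand-in exposing only the two commands this pattern needs."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, ttl):
        pass  # a real client would schedule deletion after `ttl` seconds

def allow(client, key: str, limit: int, window_sec: int) -> bool:
    count = client.incr(key)            # atomic in real Redis
    if count == 1:
        client.expire(key, window_sec)  # start the window on the first hit
    return count <= limit
```

Note that `INCR` and `EXPIRE` are two separate commands: if the client dies between them, the key never expires. Production code typically wraps the pair in a Lua script or pipeline to keep the window atomic.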
Failure behavior matters: if the limiter backend fails open, abuse gets through while it is down; if it fails closed, healthy traffic is dropped. Decide which tradeoff you want explicitly rather than inheriting whatever the client library does.
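That decision can be made explicit in code instead of being left to whatever exception the client library raises. A sketch, with `fail_open` as an illustrative flag:

```python
def allow_with_policy(limiter_check, key: str, fail_open: bool) -> bool:
    """Wrap a limiter call so backend outages have a deliberate outcome.

    `limiter_check` is any callable that may raise when its backend
    (e.g. a shared Redis) is unreachable.
    """
    try:
        return limiter_check(key)
    except Exception:
        # Backend down: failing open admits everything (risking abuse),
        # failing closed drops everything (risking an outage of our own).
        return fail_open
```

A common compromise is to fail open for ordinary endpoints but fail closed for abuse-sensitive ones such as login.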
Practical Rules
- Return `429 Too Many Requests` with a `Retry-After` header when the caller can back off.
- Separate read limits, write limits, and auth limits.
- Rate limit expensive endpoints more aggressively than cheap ones.
- Track allow and deny counts per limiter key so you can see whether thresholds are sensible.
- If overload is already happening, combine rate limiting with queues, timeouts, and backpressure.
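The 429 rule above pairs naturally with the token bucket: when a request is denied, the bucket's state tells you how long until enough tokens accumulate, which is exactly the Retry-After value. A sketch (the `tokens` and `rate_per_sec` arguments correspond to the TokenBucket fields above; the helper name is illustrative):

```python
import math

def retry_after_seconds(tokens: float, rate_per_sec: float, cost: int = 1) -> int:
    """Whole seconds until a bucket with `tokens` remaining can afford `cost`."""
    deficit = max(0.0, cost - tokens)
    return math.ceil(deficit / rate_per_sec)
```

The value is rounded up because `Retry-After` takes whole seconds, and telling the client to wait slightly too long is safer than inviting an immediately-denied retry.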
Engineering Checklist
- What exactly is being protected: CPU, DB, third-party API, login form?
- Is burstiness acceptable or should traffic be smoothed?
- Can one bad tenant starve everyone else?
- What happens if the limiter store is unavailable?
Links
- Heaps — priority scheduling and fairness tradeoffs
- Message Queues — backpressure after admission control