
API Rate Limiting That Doesn't Make Your Users Hate You

Zyptr Admin
8 July 2024
8 min read

Most Rate Limiting Is User-Hostile

The standard approach: return HTTP 429 with a "Retry-After" header. The user gets a vague error, their request fails, and they have no idea what they did wrong or how to avoid it. This is technically correct and practically terrible. We've built rate limiting into eight production APIs and we've iterated our approach significantly based on user feedback and support tickets.

Here's how we think about rate limiting now.

Communicate Before You Block

Every API response from our systems includes rate limit headers: X-RateLimit-Limit (total allowed requests), X-RateLimit-Remaining (requests left in the current window), and X-RateLimit-Reset (when the window resets). This lets well-behaved clients self-throttle before hitting the limit. We also send a warning webhook or email when a customer reaches 80% of their rate limit, giving them time to optimize or upgrade.
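A minimal sketch of how those three headers can be computed per response (the reset value is given as a Unix timestamp here; some APIs send seconds-until-reset instead):

```python
import time

def rate_limit_headers(limit, remaining, window_seconds, now=None):
    """Build the three rate limit headers for an API response.

    X-RateLimit-Reset is the Unix timestamp at which the current
    fixed window ends.
    """
    now = time.time() if now is None else now
    # Align the window to wall-clock boundaries.
    window_start = int(now // window_seconds) * window_seconds
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(window_start + window_seconds),
    }
```

A client that reads X-RateLimit-Remaining on every response can back off before ever seeing a 429.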

When we do return a 429, the response body includes: the specific limit that was exceeded (was it the per-second limit? per-minute? per-day?), the exact reset time, and a link to documentation explaining how to handle rate limiting. Compare this to APIs that return a bare 429 with no body. The extra information saves hours of debugging for the API consumer and reduces support tickets for us.
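A 429 body with those three pieces of information might look like this (the field names and docs URL are illustrative, not our actual schema):

```python
import json

def too_many_requests_body(limit_name, max_requests, reset_at):
    """Build an informative 429 response body.

    limit_name identifies which limit was hit (e.g. "per-minute"),
    reset_at is the exact reset time as an ISO 8601 string.
    """
    return json.dumps({
        "error": "rate_limit_exceeded",
        "limit": limit_name,
        "max_requests": max_requests,
        "reset_at": reset_at,
        "docs": "https://example.com/docs/rate-limits",  # placeholder URL
    })
```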

The Algorithm Matters: Token Bucket vs Sliding Window

We use the token bucket algorithm for most rate limiting. It's more forgiving than fixed windows because it allows bursts of traffic up to the bucket size while maintaining an average rate. A user with a 100 requests/minute limit can send 50 requests in one second (if their bucket is full) without being rate-limited, as long as they don't sustain that rate.

The implementation: we use Redis with a Lua script that atomically checks and updates the token count. The Lua script ensures that the check-and-decrement is atomic, preventing race conditions that could allow users to exceed their limit. Each user/API key has a Redis key that stores their current token count and last refill timestamp. Tokens refill at a constant rate up to the bucket maximum.
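The core refill-and-decrement logic can be sketched in-process like this; in production the same steps run inside a Redis Lua script against the per-key token count and refill timestamp, so the check-and-decrement is atomic:

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Reference implementation of the token bucket algorithm.

    A production version stores (tokens, last_refill) in Redis and
    executes this logic in a Lua script for atomicity.
    """
    capacity: float        # bucket size = maximum burst
    refill_rate: float     # tokens added per second
    tokens: float = None
    last_refill: float = None

    def __post_init__(self):
        self.tokens = self.capacity if self.tokens is None else self.tokens
        self.last_refill = time.time() if self.last_refill is None else self.last_refill

    def allow(self, cost=1.0, now=None):
        now = time.time() if now is None else now
        # Refill at a constant rate, capped at the bucket maximum.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity 100 and a refill rate of 100/60 tokens per second, a full bucket absorbs a burst of 100 requests, then recovers at roughly 1.67 requests per second.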

We chose token bucket over sliding window log (which is more precise but uses more memory — O(n) per user where n is the number of requests in the window) and over fixed window counters (which have the thundering herd problem at window boundaries). Token bucket is O(1) memory per user and handles bursts gracefully.

Different Limits for Different Things

A single global rate limit is too blunt, so we implement tiered limits. Per-second limits protect against accidental infinite loops and DoS (usually 10-50 req/s depending on the plan). Per-minute limits prevent sustained abuse while allowing reasonable bursts (usually 5-10x the per-second limit). Per-day limits enforce plan quotas (the total number of API calls included in the plan). Finally, per-endpoint limits throttle expensive operations (AI inference endpoints get tighter limits than simple CRUD endpoints).
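One way to structure the tiers: check every window and let the most restrictive one win. The numbers below are illustrative, not our actual plan limits:

```python
# Illustrative tier configuration for a single plan.
TIERS = {
    "per-second": 20,
    "per-minute": 120,     # ~6x the per-second limit
    "per-day": 100_000,    # plan quota
}

def check_tiers(counts):
    """Return the name of the first exceeded tier, or None if allowed.

    `counts` maps tier name -> requests already made in that window;
    a real implementation reads each count from its own bucket.
    """
    for tier, limit in TIERS.items():
        if counts.get(tier, 0) >= limit:
            return tier
    return None
```

Returning the tier name (rather than a bare boolean) is what lets the 429 body say exactly which limit was exceeded.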

We also implement cost-based rate limiting for some APIs. An endpoint that triggers an LLM call "costs" 10 rate limit tokens. A simple database read costs 1 token. This prevents a user from burning through their rate limit on expensive operations while having headroom for cheap ones. The cost is communicated via an X-RateLimit-Cost header on each response.
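A sketch of the cost-based accounting, assuming a hypothetical cost table keyed by endpoint (the routes and costs are examples, not our actual API):

```python
# Hypothetical cost table; real costs depend on the API.
ENDPOINT_COSTS = {
    "POST /v1/inference": 10,  # triggers an LLM call
    "GET /v1/items": 1,        # simple database read
}

def charge(remaining_tokens, endpoint):
    """Cost-based rate limiting: return (allowed, new_remaining, cost).

    `cost` is also what gets echoed back to the client in the
    X-RateLimit-Cost response header.
    """
    cost = ENDPOINT_COSTS.get(endpoint, 1)  # unknown endpoints cost 1
    if remaining_tokens >= cost:
        return True, remaining_tokens - cost, cost
    return False, remaining_tokens, cost
```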

Graceful Degradation Instead of Hard Blocking

For some use cases, returning a 429 is too aggressive. For our monitoring product, if a customer exceeds their check frequency limit, we don't stop monitoring their endpoints (that would be terrible for an uptime monitor). Instead, we reduce the check frequency to the base tier rate and send a notification. The customer's endpoints are still monitored, just less frequently. We call this "soft rate limiting" and it's significantly better for customer experience than a hard block.
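The decision logic for that degradation is simple; here is a minimal sketch (intervals in seconds, names hypothetical):

```python
def effective_check_interval(requested_interval_s, base_tier_interval_s, over_limit):
    """Soft rate limiting for an uptime monitor.

    Instead of rejecting checks when the customer exceeds their
    frequency limit, degrade to the base tier interval and flag
    that a notification should be sent.
    Returns (interval_to_use, should_notify).
    """
    if over_limit and requested_interval_s < base_tier_interval_s:
        return base_tier_interval_s, True   # slower checks, plus a heads-up
    return requested_interval_s, False
```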

Testing Rate Limiting Is Tricky

You need to test both the happy path (requests within limits succeed) and the edge cases: exactly at the limit, one request over the limit, simultaneous requests from multiple clients, behavior at window boundaries, and behavior during Redis failover. For failover we fail open: if Redis is down, requests are allowed through, because blocking all API access is worse than a brief period without rate limiting.
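The fail-open behavior reduces to a small wrapper around the limiter call. This sketch models a backend outage with a stand-in exception; a real implementation would catch the Redis client's connection errors instead:

```python
class BackendDown(Exception):
    """Stand-in for a Redis connection failure."""

def is_allowed_fail_open(check):
    """Fail open: if the rate limit backend errors, allow the request.

    `check` is any callable that returns True/False, or raises when
    the backend is unreachable.
    """
    try:
        return check()
    except BackendDown:
        # Blocking all API traffic is worse than briefly not limiting.
        return True
```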

We use k6 for load testing rate limiting. A typical test scenario: ramp up to 2x the rate limit, verify that responses transition from 200 to 429 at the expected threshold, verify that the rate limit headers are accurate throughout, and verify that requests resume after the window resets. These tests run in our staging environment before every deployment that touches the rate limiting code.

Tags: rate-limiting, api, redis, infrastructure