Rate Limiting and Quota Management for Multi-Tenant MCP Servers
Essential strategies for protecting MCP server resources and ensuring fair usage across multiple AI agent clients
Multi-tenant MCP (Model Context Protocol) servers face a critical challenge: how do you prevent a single misbehaving or resource-hungry agent from starving others? Without proper rate limiting and quota management, a client that pounds your server with requests can degrade service for everyone.
In this post, we'll explore practical strategies for implementing rate limiting and quotas in MCP servers, keeping them fair, performant, and resilient.
Why Rate Limiting Matters for MCP Servers
MCP servers expose tools and resources to multiple AI agents. Unlike traditional APIs where you control the client code, MCP servers often integrate with agents you don't fully control—they might retry aggressively, spawn parallel requests, or simply have different resource needs than you anticipated.
Without limits:
- A single misconfigured agent exhausts connection pools
- A spike in traffic from one client delays responses for others
- Resource-intensive tools (file I/O, computation) can become bottlenecks
- You lose visibility into who's consuming what
Rate limiting and quotas solve this by enforcing fair-share resource allocation and making resource consumption predictable.
Core Strategies
Token Bucket Algorithm
The token bucket is the workhorse of rate limiting. Each client gets a bucket that refills at a fixed rate (e.g., 100 requests per minute). Each request costs tokens; when the bucket is empty, new requests wait or fail.
class TokenBucket {
  capacity: number;
  tokens: number;
  refillRate: number; // tokens per second
  lastRefill: number;

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate;
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }

  tryConsume(cost = 1): boolean {
    this.refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
Simple, effective, and works well for per-client rate limiting.
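To see the burst-then-throttle behavior concretely, here's a condensed, self-contained sketch (the class is repeated in abbreviated form so the snippet runs on its own; the numbers are arbitrary):

```typescript
// Condensed, self-contained copy of the TokenBucket above, to demo the
// burst-then-throttle behavior.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity;
  }

  tryConsume(cost = 1): boolean {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

// Capacity 5, refilling 1 token/sec: a tight loop gets an initial burst
// of 5 requests through, then is throttled until tokens accrue.
const bucket = new TokenBucket(5, 1);
let allowed = 0;
for (let i = 0; i < 8; i++) {
  if (bucket.tryConsume()) allowed++;
}
console.log(allowed); // → 5 (the burst); the remaining 3 attempts are blocked
```

Note how the capacity doubles as the permitted burst size: a client can spend its full bucket at once, then is held to the refill rate.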
Per-Client Quotas
Quotas differ from rate limits—they measure total usage over longer windows (hourly, daily, monthly). Combine them with rate limits for multi-layer protection:
interface ClientQuota {
  clientId: string;
  dailyLimit: number;
  hourlyLimit: number;
  requestsToday: number;
  requestsThisHour: number;
  hourStarted: number;
  dayStarted: number;
}

function checkQuota(quota: ClientQuota): { allowed: boolean; reason?: string } {
  const now = Date.now();
  // Reset each counter once its window has fully elapsed
  if (now - quota.hourStarted >= 3_600_000) {
    quota.requestsThisHour = 0;
    quota.hourStarted = now;
  }
  if (now - quota.dayStarted >= 86_400_000) {
    quota.requestsToday = 0;
    quota.dayStarted = now;
  }
  if (quota.requestsThisHour >= quota.hourlyLimit) {
    return { allowed: false, reason: "hourly quota exceeded" };
  }
  if (quota.requestsToday >= quota.dailyLimit) {
    return { allowed: false, reason: "daily quota exceeded" };
  }
  quota.requestsThisHour++;
  quota.requestsToday++;
  return { allowed: true };
}
Tool-Specific Costs
Not all requests are equal. Reading from a cache is cheap; running a large computation is expensive. Assign costs to operations:
const toolCosts: Record<string, number> = {
  "list-files": 1,
  "compute-similarity": 50,
  "fetch-from-api": 10,
  "run-expensive-analysis": 100,
};

function applyRateLimit(clientId: string, toolName: string): boolean {
  const cost = toolCosts[toolName] ?? 1; // unknown tools default to cheapest
  const bucket = getOrCreateBucket(clientId);
  return bucket.tryConsume(cost);
}
This lets you protect expensive operations without over-restricting cheap ones.
Response Headers for Transparency
Tell clients their remaining quota so they can make smart decisions:
function attachRateLimitHeaders(
  response: Response,
  clientId: string,
  toolName: string
) {
  const bucket = getBucket(clientId);
  const quota = getQuota(clientId);
  const cost = toolCosts[toolName] ?? 1;
  bucket.refill(); // Ensure current state
  // Time until the bucket is full again, based on the current token deficit
  const msUntilFull = ((bucket.capacity - bucket.tokens) / bucket.refillRate) * 1000;
  response.headers.set("X-RateLimit-Limit", bucket.capacity.toString());
  response.headers.set("X-RateLimit-Remaining", Math.floor(bucket.tokens).toString());
  response.headers.set("X-RateLimit-Reset", new Date(Date.now() + msUntilFull).toISOString());
  response.headers.set("X-Quota-Daily-Remaining", (quota.dailyLimit - quota.requestsToday).toString());
  response.headers.set("X-Quota-Cost", cost.toString());
  return response;
}
Clients that honor these headers will self-regulate and retry intelligently.
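On the client side, a small helper can turn those headers into a wait time before the next request. This `suggestedDelayMs` function is hypothetical (the header names match the ones set above; the 1-second fallback when no reset time is available is an arbitrary conservative choice):

```typescript
// Hypothetical client-side helper: decide how long to wait before the next
// request, based on the rate-limit headers the server attached.
function suggestedDelayMs(headers: Record<string, string>): number {
  const remaining = Number(headers["X-RateLimit-Remaining"] ?? "1");
  const cost = Number(headers["X-Quota-Cost"] ?? "1");
  if (remaining >= cost) return 0; // enough budget left: send immediately
  const reset = Date.parse(headers["X-RateLimit-Reset"] ?? "");
  if (Number.isNaN(reset)) return 1000; // no reset info: conservative 1s pause
  return Math.max(0, reset - Date.now()); // wait until the bucket refills
}
```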
Implementation Considerations
Distributed State
If your MCP server runs on multiple instances, in-memory buckets won't work—a request routed to instance A sees different state than one going to instance B. Use Redis or a similar cache:
const redis = new Redis(); // assuming the ioredis client

async function consumeToken(clientId: string, cost: number, capacity = 100) {
  const key = `bucket:${clientId}`;
  // Atomic check-and-decrement: a missing counter is initialized to full
  // capacity, and `cost` is consumed only if enough tokens remain. The 60s
  // expiry acts as a coarse fixed-window refill.
  const result = await redis.eval(
    `
    local current = tonumber(redis.call('get', KEYS[1]) or ARGV[1])
    local cost = tonumber(ARGV[2])
    if current >= cost then
      redis.call('set', KEYS[1], current - cost, 'EX', 60)
      return 1
    end
    return 0
    `,
    1,
    key,
    capacity,
    cost
  );
  return result === 1;
}
This ensures consistent state across all server instances.
Backpressure Handling
When a client hits a limit, decide: reject with a 429 (Too Many Requests) status, or queue the request? For async tools, queuing with exponential backoff is friendlier:
async function executeWithBackoff(
  clientId: string,
  toolName: string,
  fn: () => Promise<any>
) {
  let attempt = 0;
  const maxAttempts = 5;
  while (attempt < maxAttempts) {
    if (applyRateLimit(clientId, toolName)) {
      return await fn();
    }
    const delay = Math.pow(2, attempt) * 100; // 100ms, 200ms, 400ms, ...
    await new Promise((r) => setTimeout(r, delay));
    attempt++;
  }
  throw new Error("Rate limit exceeded after retries");
}
Monitoring and Alerting
Track quota usage per client over time. Alert when a client consistently approaches limits—they may need a higher tier or you may have a bug:
// Assumes `quotas` is a record mapping clientId → ClientQuota
setInterval(() => {
  for (const [clientId, quota] of Object.entries(quotas)) {
    const usage = quota.requestsToday / quota.dailyLimit;
    if (usage > 0.9) {
      console.warn(`⚠️ Client ${clientId} at ${(usage * 100).toFixed(1)}% daily quota`);
    }
  }
}, 60000); // Every minute
Tiered Access
Offer different quota tiers so power users don't hit limits:
const tiers = {
  free: { dailyLimit: 1000, hourlyLimit: 100, refillRate: 5 },
  pro: { dailyLimit: 100000, hourlyLimit: 10000, refillRate: 500 },
  enterprise: { dailyLimit: Infinity, hourlyLimit: Infinity, refillRate: Infinity },
};
Tiers let you monetize fairly while protecting the free tier from abuse.
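One way to wire tiers into bucket creation is to derive the bucket parameters from the tier's refill rate. This is a sketch: the tier table is hardcoded (in practice the lookup would come from your auth or billing layer), and the "ten seconds of refill" burst capacity is an assumed heuristic, not a rule:

```typescript
type Tier = { dailyLimit: number; hourlyLimit: number; refillRate: number };

// Hardcoded for illustration; normally resolved from auth/billing data.
const tierTable: Record<string, Tier> = {
  free: { dailyLimit: 1_000, hourlyLimit: 100, refillRate: 5 },
  pro: { dailyLimit: 100_000, hourlyLimit: 10_000, refillRate: 500 },
};

function bucketParamsFor(tierName: string | undefined): { capacity: number; refillRate: number } {
  // Unknown or missing tiers fall back to the most restrictive one
  const tier = tierTable[tierName ?? "free"] ?? tierTable["free"];
  // Allow a burst worth roughly 10 seconds of refill
  return { capacity: tier.refillRate * 10, refillRate: tier.refillRate };
}
```

An unlimited enterprise tier is omitted here: an infinite refill rate effectively disables the bucket, so it's simpler to skip rate limiting entirely for those clients.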
Testing Your Limits
Before deploying, stress-test your rate limiting:
async function testRateLimits() {
  const clientId = "test-client";
  let allowed = 0,
    blocked = 0;
  // Fire 200 rapid requests against a bucket with capacity 50
  const start = Date.now();
  for (let i = 0; i < 200; i++) {
    if (applyRateLimit(clientId, "list-files")) {
      allowed++;
    } else {
      blocked++;
    }
  }
  const elapsed = Date.now() - start;
  console.log(`Allowed: ${allowed}, Blocked: ${blocked}, Elapsed: ${elapsed}ms`);
  // The loop finishes in milliseconds, well before any meaningful refill,
  // so expect roughly the bucket's capacity (~50) allowed as an initial
  // burst and the remaining ~150 blocked
}
Wrap-Up
Rate limiting and quotas are essential infrastructure for multi-tenant MCP servers. They protect your resources, ensure fairness, and give clients visibility into their usage. Start with simple token bucket rate limiting, add tiered quotas for different use cases, and monitor closely.
The goal isn't to reject requests—it's to shape demand so your server stays responsive for everyone.