Rate Limiting and Quota Management for Multi-Tenant MCP Servers
Essential strategies for protecting MCP server resources and ensuring fair usage across multiple AI agent clients
Multi-tenant MCP (Model Context Protocol) servers face a critical challenge: how do you prevent a single misbehaving or resource-hungry agent from starving others? Without proper rate limiting and quota management, a client that pounds your server with requests can degrade service for everyone.
In this post, we'll explore practical strategies for implementing rate limiting and quotas in MCP servers, keeping them fair, performant, and resilient.
Why Rate Limiting Matters for MCP Servers
MCP servers expose tools and resources to multiple AI agents. Unlike traditional APIs where you control the client code, MCP servers often integrate with agents you don't fully control—they might retry aggressively, spawn parallel requests, or simply have different resource needs than you anticipated.
Without limits:
- A single misconfigured agent exhausts connection pools
- A spike in traffic from one client delays responses for others
- Resource-intensive tools (file I/O, computation) can become bottlenecks
- You lose visibility into who's consuming what
Rate limiting and quotas solve this by enforcing fair-share resource allocation and making resource consumption predictable.
Core Strategies
Token Bucket Algorithm
The token bucket is the workhorse of rate limiting. Each client gets a bucket that refills at a fixed rate (e.g., 100 requests per minute). Each request costs tokens; when the bucket is empty, new requests wait or fail.
class TokenBucket {
  capacity: number;
  tokens: number;
  refillRate: number; // tokens per second
  lastRefill: number;

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate;
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }

  tryConsume(cost = 1): boolean {
    this.refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
Simple, effective, and works well for per-client rate limiting.
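To see the burst-then-throttle behavior concretely, here's a condensed, self-contained sketch (the class is repeated in abbreviated form so the snippet runs on its own; the numbers are arbitrary):

```typescript
// Condensed, self-contained copy of the TokenBucket above, to demo the
// burst-then-throttle behavior.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity;
  }

  tryConsume(cost = 1): boolean {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

// Capacity 5, refilling 1 token/sec: a tight loop gets an initial burst
// of 5 requests through, then is throttled until tokens accrue.
const bucket = new TokenBucket(5, 1);
let allowed = 0;
for (let i = 0; i < 8; i++) {
  if (bucket.tryConsume()) allowed++;
}
console.log(allowed); // → 5 (the burst); the remaining 3 attempts are blocked
```

Note how the capacity doubles as the permitted burst size: a client can spend its full bucket at once, then is held to the refill rate.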
Per-Client Quotas
Quotas differ from rate limits—they measure total usage over longer windows (hourly, daily, monthly). Combine them with rate limits for multi-layer protection:
interface ClientQuota {
  clientId: string;
  dailyLimit: number;
  hourlyLimit: number;
  requestsToday: number;
  requestsThisHour: number;
  hourStarted: number;
  dayStarted: number;
}

function checkQuota(quota: ClientQuota): { allowed: boolean; reason?: string } {
  const now = Date.now();
  // Reset each counter once its window has fully elapsed
  if (now - quota.hourStarted >= 3_600_000) {
    quota.requestsThisHour = 0;
    quota.hourStarted = now;
  }
  if (now - quota.dayStarted >= 86_400_000) {
    quota.requestsToday = 0;
    quota.dayStarted = now;
  }
  if (quota.requestsThisHour >= quota.hourlyLimit) {
    return { allowed: false, reason: "hourly quota exceeded" };
  }
  if (quota.requestsToday >= quota.dailyLimit) {
    return { allowed: false, reason: "daily quota exceeded" };
  }
  quota.requestsThisHour++;
  quota.requestsToday++;
  return { allowed: true };
}
Tool-Specific Costs
Not all requests are equal. Reading from a cache is cheap; running a large computation is expensive. Assign costs to operations:
const toolCosts: Record<string, number> = {
  "list-files": 1,
  "compute-similarity": 50,
  "fetch-from-api": 10,
  "run-expensive-analysis": 100,
};

function applyRateLimit(clientId: string, toolName: string): boolean {
  const cost = toolCosts[toolName] ?? 1; // unknown tools default to cheapest
  const bucket = getOrCreateBucket(clientId);
  return bucket.tryConsume(cost);
}
This lets you protect expensive operations without over-restricting cheap ones.
Response Headers for Transparency
Tell clients their remaining quota so they can make smart decisions:
function attachRateLimitHeaders(
  response: Response,
  clientId: string,
  toolName: string
) {
  const bucket = getBucket(clientId);
  const quota = getQuota(clientId);
  const cost = toolCosts[toolName] ?? 1;
  bucket.refill(); // Ensure current state
  // Time until the bucket is full again, based on the current token deficit
  const msUntilFull = ((bucket.capacity - bucket.tokens) / bucket.refillRate) * 1000;
  response.headers.set("X-RateLimit-Limit", bucket.capacity.toString());
  response.headers.set("X-RateLimit-Remaining", Math.floor(bucket.tokens).toString());
  response.headers.set("X-RateLimit-Reset", new Date(Date.now() + msUntilFull).toISOString());
  response.headers.set("X-Quota-Daily-Remaining", (quota.dailyLimit - quota.requestsToday).toString());
  response.headers.set("X-Quota-Cost", cost.toString());
  return response;
}
Clients that honor these headers will self-regulate and retry intelligently.
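On the client side, a small helper can turn those headers into a wait time before the next request. This `suggestedDelayMs` function is hypothetical (the header names match the ones set above; the 1-second fallback when no reset time is available is an arbitrary conservative choice):

```typescript
// Hypothetical client-side helper: decide how long to wait before the next
// request, based on the rate-limit headers the server attached.
function suggestedDelayMs(headers: Record<string, string>): number {
  const remaining = Number(headers["X-RateLimit-Remaining"] ?? "1");
  const cost = Number(headers["X-Quota-Cost"] ?? "1");
  if (remaining >= cost) return 0; // enough budget left: send immediately
  const reset = Date.parse(headers["X-RateLimit-Reset"] ?? "");
  if (Number.isNaN(reset)) return 1000; // no reset info: conservative 1s pause
  return Math.max(0, reset - Date.now()); // wait until the bucket refills
}
```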
Implementation Considerations
Distributed State
If your MCP server runs on multiple instances, in-memory buckets won't work—a request routed to instance A sees different state than one going to instance B. Use Redis or a similar cache:
const redis = new Redis(); // assuming the ioredis client

async function consumeToken(clientId: string, cost: number, capacity = 100) {
  const key = `bucket:${clientId}`;
  // Atomic check-and-decrement: a missing counter is initialized to full
  // capacity, and `cost` is consumed only if enough tokens remain. The 60s
  // expiry acts as a coarse fixed-window refill.
  const result = await redis.eval(
    `
    local current = tonumber(redis.call('get', KEYS[1]) or ARGV[1])
    local cost = tonumber(ARGV[2])
    if current >= cost then
      redis.call('set', KEYS[1], current - cost, 'EX', 60)
      return 1
    end
    return 0
    `,
    1,
    key,
    capacity,
    cost
  );
  return result === 1;
}
This ensures consistent state across all server instances.
Backpressure Handling
When a client hits a limit, decide: reject with a 429 (Too Many Requests) status, or queue the request? For async tools, queuing with exponential backoff is friendlier:
async function executeWithBackoff(
  clientId: string,
  toolName: string,
  fn: () => Promise<any>
) {
  let attempt = 0;
  const maxAttempts = 5;
  while (attempt < maxAttempts) {
    if (applyRateLimit(clientId, toolName)) {
      return await fn();
    }
    const delay = Math.pow(2, attempt) * 100; // 100ms, 200ms, 400ms, ...
    await new Promise((r) => setTimeout(r, delay));
    attempt++;
  }
  throw new Error("Rate limit exceeded after retries");
}
Monitoring and Alerting
Track quota usage per client over time. Alert when a client consistently approaches limits—they may need a higher tier or you may have a bug:
// Assumes `quotas` is a record mapping clientId → ClientQuota
setInterval(() => {
  for (const [clientId, quota] of Object.entries(quotas)) {
    const usage = quota.requestsToday / quota.dailyLimit;
    if (usage > 0.9) {
      console.warn(`⚠️ Client ${clientId} at ${(usage * 100).toFixed(1)}% daily quota`);
    }
  }
}, 60000); // Every minute
Tiered Access
Offer different quota tiers so power users don't hit limits:
const tiers = {
  free: { dailyLimit: 1000, hourlyLimit: 100, refillRate: 5 },
  pro: { dailyLimit: 100000, hourlyLimit: 10000, refillRate: 500 },
  enterprise: { dailyLimit: Infinity, hourlyLimit: Infinity, refillRate: Infinity },
};
Tiers let you monetize fairly while protecting the free tier from abuse.
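One way to wire tiers into bucket creation is to derive the bucket parameters from the tier's refill rate. This is a sketch: the tier table is hardcoded (in practice the lookup would come from your auth or billing layer), and the "ten seconds of refill" burst capacity is an assumed heuristic, not a rule:

```typescript
type Tier = { dailyLimit: number; hourlyLimit: number; refillRate: number };

// Hardcoded for illustration; normally resolved from auth/billing data.
const tierTable: Record<string, Tier> = {
  free: { dailyLimit: 1_000, hourlyLimit: 100, refillRate: 5 },
  pro: { dailyLimit: 100_000, hourlyLimit: 10_000, refillRate: 500 },
};

function bucketParamsFor(tierName: string | undefined): { capacity: number; refillRate: number } {
  // Unknown or missing tiers fall back to the most restrictive one
  const tier = tierTable[tierName ?? "free"] ?? tierTable["free"];
  // Allow a burst worth roughly 10 seconds of refill
  return { capacity: tier.refillRate * 10, refillRate: tier.refillRate };
}
```

An unlimited enterprise tier is omitted here: an infinite refill rate effectively disables the bucket, so it's simpler to skip rate limiting entirely for those clients.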
Testing Your Limits
Before deploying, stress-test your rate limiting:
async function testRateLimits() {
  const clientId = "test-client";
  let allowed = 0,
    blocked = 0;
  // Fire 200 rapid requests against a bucket with capacity 50
  const start = Date.now();
  for (let i = 0; i < 200; i++) {
    if (applyRateLimit(clientId, "list-files")) {
      allowed++;
    } else {
      blocked++;
    }
  }
  const elapsed = Date.now() - start;
  console.log(`Allowed: ${allowed}, Blocked: ${blocked}, Elapsed: ${elapsed}ms`);
  // The loop finishes in milliseconds, well before any meaningful refill,
  // so expect roughly the bucket's capacity (~50) allowed as an initial
  // burst and the remaining ~150 blocked
}
Wrap-Up
Rate limiting and quotas are essential infrastructure for multi-tenant MCP servers. They protect your resources, ensure fairness, and give clients visibility into their usage. Start with simple token bucket rate limiting, add tiered quotas for different use cases, and monitor closely.
The goal isn't to reject requests—it's to shape demand so your server stays responsive for everyone.