·9 min read·BitAtlas

Designing MCP Servers for Streaming Responses and Async Request Handling

Learn how to architect MCP servers that handle large responses efficiently through streaming and async patterns, reducing memory overhead and improving end-to-end latency.

Tags: MCP server, streaming, async, performance, large responses

When building Model Context Protocol (MCP) servers at scale, you'll inevitably encounter requests that produce large responses: vector search results, paginated data exports, or real-time log streams. Naive implementations that buffer the entire response in memory before sending will hit memory limits, time out, or tank latency for your LLM clients. This post explores streaming and async patterns that let you serve large datasets efficiently.

The Buffer Problem

Let's start with what goes wrong:

// ❌ Naive approach: buffer everything
async function handleRequest(query) {
  const results = [];
  for (let i = 0; i < 100000; i++) {
    results.push(await fetchItem(i));
  }
  return results; // Send entire array at once
}

This approach:

  • Allocates massive arrays in memory before sending a single byte
  • Fetches items strictly one at a time, so total time is the sum of every fetchItem call
  • Delays the first byte until all data is ready (poor perceived latency)
  • Keeps the full result set in RAM for as long as a slow client takes to read it, which hurts tail latency

For a vector database returning 10K embeddings stored as 64-bit floats, you're looking at 10,000 × 1,536 dimensions × 8 bytes ≈ 123 MB sitting in RAM per concurrent request, before any JSON serialization overhead. With 10 concurrent clients that is over 1.2 GB, enough to blow a typical memory budget.
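The arithmetic is easy to rerun for your own workload. The helper below is a minimal sketch; bufferedBytes is a hypothetical name, and the 8-byte default assumes values are held as JavaScript numbers (64-bit floats) rather than a packed float32 array.

```javascript
// Back-of-envelope estimate of RAM held by a fully buffered embedding result.
// Assumption: each value occupies bytesPerValue bytes (8 for a 64-bit float).
function bufferedBytes(vectors, dimensions, bytesPerValue = 8) {
  return vectors * dimensions * bytesPerValue;
}

const toMB = (bytes) => bytes / 1e6;

const perRequest = bufferedBytes(10_000, 1536);
console.log(toMB(perRequest).toFixed(1));      // 122.9
console.log(toMB(perRequest * 10).toFixed(0)); // 1229 (10 concurrent requests)
```

Passing 4 as bytesPerValue models float32 storage and halves both figures.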

Streaming: Send-As-You-Go

An MCP server can produce results incrementally instead of materializing one large object. In JavaScript, an async generator is the natural building block: it yields partial results as they become ready.

// ✅ Streaming approach: yield results as they're ready
async function* streamResults(query) {
  for (let i = 0; i < 100000; i++) {
    const item = await fetchItem(i);
    yield item; // Emit as soon as it's ready
  }
}

The client receives results incrementally and can begin processing immediately, and each item becomes eligible for garbage collection once the consumer is done with it. For a vector search with 10K results:

  • Memory footprint: O(batch_size) instead of O(total_results)
  • Time to first result: milliseconds instead of seconds
  • Pipelining: the client processes early results while the rest are still in transit
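The pull-based nature of async generators is what keeps memory bounded: nothing is fetched until the consumer asks for the next item. The sketch below illustrates this with a consumer that stops early. The fetchItem stub and the takeFirst helper are hypothetical stand-ins, not part of any MCP SDK.

```javascript
// Hypothetical stand-in for fetchItem: resolves after a short delay.
const fetchItem = (i) =>
  new Promise((resolve) => setTimeout(() => resolve({ id: i }), 1));

let fetched = 0; // Counts how many items were actually fetched

async function* streamResults(total) {
  for (let i = 0; i < total; i++) {
    fetched++;
    yield await fetchItem(i); // Runs only when the consumer pulls
  }
}

// Pull items one at a time and stop after the first n.
async function takeFirst(source, n) {
  const taken = [];
  if (n <= 0) return taken;
  for await (const item of source) {
    taken.push(item);
    if (taken.length >= n) break; // break also closes the generator
  }
  return taken;
}
```

Calling takeFirst(streamResults(100000), 3) fetches exactly three items; the remaining 99,997 are never requested.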

Implementing Streaming in MCP

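MCP's resources/read request takes a URI and returns the resource contents in a single response; the protocol itself does not define byte ranges. One way to approximate chunked delivery is to have the tool return a resource URI and encode paging (offset and limit) in that URI, so the client reads one page at a time. The handler map below is a simplified, SDK-agnostic sketch of that convention, not an SDK API:

// Illustrative handler map. Register these functions with your MCP SDK's
// tool-call and resource-read handlers; the object shape is not an SDK API.
const PAGE_SIZE = 100;

const handlers = {
  tools: {
    search_vectors: {
      description: "Search embeddings and return a pageable resource URI",
      inputSchema: {
        type: "object",
        properties: {
          query: { type: "array", items: { type: "number" } },
          limit: { type: "integer", default: 1000 }
        },
        required: ["query"]
      },
      execute: async (args) => {
        // Return a resource URI instead of the full result.
        // JSON-encode the query vector so it survives the round trip.
        const q = encodeURIComponent(JSON.stringify(args.query));
        const limit = args.limit ?? 1000;
        return {
          type: "resource",
          uri: `mcp://vector-search/${Date.now()}/${q}?offset=0&limit=${limit}`,
          mimeType: "application/jsonlines"
        };
      }
    }
  },
  resources: {
    read_resource: async (uri) => {
      const match = uri.match(/vector-search\/(\d+)\/([^?]+)(?:\?(.*))?$/);
      if (!match) throw new Error(`Unrecognized resource URI: ${uri}`);
      const [, , queryStr, paramStr] = match;
      const query = JSON.parse(decodeURIComponent(queryStr));

      const params = new URLSearchParams(paramStr ?? "");
      const offset = Number(params.get("offset") ?? 0);
      const limit = Number(params.get("limit") ?? 1000);
      const end = Math.min(offset + PAGE_SIZE, limit);

      // Build one page of JSON Lines. Never write results to stdout:
      // the stdio transport reserves it for protocol messages.
      const lines = [];
      for (let i = offset; i < end; i++) {
        const result = await vectorDb.search(query, i);
        lines.push(JSON.stringify(result));
      }

      return {
        contents: [
          { uri, mimeType: "application/jsonlines", text: lines.join("\n") }
        ],
        // Hypothetical convention: URI of the next page, or null when done
        next: end < limit ? uri.replace(/offset=\d+/, `offset=${end}`) : null
      };
    }
  }
};

The client reads the resource one page at a time, following the next-page URI until it is null, and receives JSONL-formatted results. Each line is a complete JSON object that can be parsed independently.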
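Because each JSONL line is a complete JSON object, a client can parse records as chunks arrive, even when a chunk boundary falls in the middle of a line. The following is a minimal sketch of such a parser; createJsonlParser and its push/end methods are hypothetical names chosen for illustration.

```javascript
// Incremental JSON Lines parser: accepts arbitrary text chunks and calls
// onRecord once per complete line, buffering any trailing partial line.
function createJsonlParser(onRecord) {
  let tail = ""; // Unterminated text carried over from the previous chunk

  return {
    push(chunk) {
      const lines = (tail + chunk).split("\n");
      tail = lines.pop(); // Last piece is incomplete (or "" after a newline)
      for (const line of lines) {
        if (line.trim() !== "") onRecord(JSON.parse(line));
      }
    },
    end() {
      // Flush a final line that had no trailing newline
      if (tail.trim() !== "") onRecord(JSON.parse(tail));
      tail = "";
    }
  };
}
```

Only one partial line is ever buffered, so client memory stays proportional to the longest record rather than to the whole response.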

Async Request Handling

Streaming handles the output side. On the input side, you need async patterns for concurrency.

Concurrent Request Handlers

A handler that awaits each downstream call inside a loop processes them sequentially: finish one request, start the next. For I/O-bound operations (database queries, API calls), this is wasteful:

// ❌ Sequential (one at a time)
for (const request of requests) {
  const result = await slowDatabase.query(request);
}

Use Promise.all or Promise.allSettled:

// ✅ Concurrent
const results = await Promise.all(
  requests.map(r => slowDatabase.query(r))
);

// ✅ Resilient (partial failures don't crash)
const outcomes = await Promise.allSettled(
  requests.map(r => slowDatabase.query(r))
);
const successes = outcomes
  .filter(o => o.status === 'fulfilled')
  .map(o => o.value);

For 100 requests that each take 50ms, the sequential loop needs about 5 seconds. Run concurrently, the batch finishes in roughly the time of the slowest request (about 50ms), provided the database can actually serve all 100 in parallel.
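Filtering for fulfilled outcomes discards the information about which requests failed. A small variation, sketched below, pairs every outcome with its originating request so failures can be reported or retried. The queryAll name and the shape of the returned object are assumptions made for this example.

```javascript
// Run all queries concurrently and split outcomes into successes and
// failures, keeping each one paired with the request that produced it.
async function queryAll(requests, queryFn) {
  const outcomes = await Promise.allSettled(
    // The async wrapper turns synchronous throws into rejections
    requests.map(async (request) => queryFn(request))
  );

  const successes = [];
  const failures = [];
  outcomes.forEach((outcome, index) => {
    const request = requests[index]; // allSettled preserves input order
    if (outcome.status === "fulfilled") {
      successes.push({ request, value: outcome.value });
    } else {
      const reason = outcome.reason?.message ?? String(outcome.reason);
      failures.push({ request, reason });
    }
  });
  return { successes, failures };
}
```

A caller can then return the successes immediately and decide separately whether the failures are worth a retry.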

Bounded Concurrency and Backpressure

Async concurrency can overwhelm downstream services if not throttled:

// ✅ Bounded concurrency with p-queue or similar
import PQueue from 'p-queue';

const queue = new PQueue({ concurrency: 10 });

const results = await Promise.all(
  requests.map(req => queue.add(() => vectorDb.query(req)))
);

This limits simultaneous work to 10 at a time, preventing memory blowout and respecting database connection pools.
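To see what a concurrency limit does under the hood, the same idea can be sketched by hand with a fixed pool of workers pulling from a shared index. This is an illustrative approximation, not how p-queue is implemented; mapWithConcurrency is a hypothetical helper.

```javascript
// Apply fn to every item with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency(items, limit, fn) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array(items.length);
  let next = 0; // Index of the next unclaimed item

  async function worker() {
    // Claiming an index is synchronous, so two workers never take the same item
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

If any call rejects, the returned promise rejects with that error while the other workers finish their current item; wrap fn in a try/catch if partial results are acceptable.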

Timeouts for Slow Operations

Long operations should have explicit timeouts:

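async function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });
  try {
    // Whichever settles first wins; the losing promise is not cancelled
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // Don't leave the timer pending after the race settles
  }
}

// Usage
async function searchWithTimeout(query) {
  try {
    return await withTimeout(
      vectorDb.search(query),
      5000 // 5 second timeout
    );
  } catch (e) {
    // Return partial results or fail gracefully
    return { error: e.message, partial: [] };
  }
}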
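A Promise.race timeout stops the caller from waiting, but the underlying operation keeps running. Where the operation accepts an AbortSignal, the timeout can cancel the work itself. The sketch below assumes such an operation; withAbortTimeout and slowOperation are hypothetical names, and slowOperation merely simulates a cancellable call.

```javascript
// Run operation(signal) and abort it if it has not settled within ms.
// This only helps if the operation actually honors the signal.
async function withAbortTimeout(operation, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(new Error("Timeout")), ms);
  try {
    return await operation(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Simulated cancellable work: resolves after `ms` unless aborted first.
function slowOperation(ms, signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("done"), ms);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer); // Stop the work, not just the wait
        reject(signal.reason);
      },
      { once: true }
    );
  });
}
```

Usage follows the same shape as before: withAbortTimeout((signal) => slowOperation(1000, signal), 20) rejects with the "Timeout" error and clears the simulated work's timer.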

Combining Streaming + Async

For maximum efficiency, combine both patterns:

  1. Stream output so results arrive incrementally
  2. Parallelize input processing with bounded concurrency
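  3. Add backpressure handling to avoid overwhelming the server

async function* streamLargeSearch(queries) {
  const queue = new PQueue({ concurrency: 5 });

  // Enqueue every query up front so up to 5 searches run at once.
  // Awaiting each queue.add() inside the loop would run them one at a time.
  // Each task resolves to a result or an error record and never rejects.
  const pending = queries.map((query) =>
    queue.add(() =>
      withTimeout(vectorDb.search(query), 3000).catch((e) => ({
        error: e.message,
        query
      }))
    )
  );

  // Yield in input order as each task settles
  for (const task of pending) {
    yield await task;
  }
}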
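One way to get bounded concurrency and consumer-driven backpressure in a single generator is to keep a fixed number of tasks in flight and start a new one only when the consumer pulls the next result. The sketch below illustrates the idea under stated assumptions: streamBounded and its parameters are hypothetical names, and results are yielded in completion order rather than input order.

```javascript
// Yield fn(input) results as they complete, with at most `limit` in flight.
// New work starts only when the consumer asks for the next item.
async function* streamBounded(inputs, limit, fn) {
  const inFlight = new Map(); // id -> promise resolving to { id, value }
  const iterator = inputs[Symbol.iterator]();
  let nextId = 0;

  // Start the next input, if any. Failures become error records.
  const launch = () => {
    const { value: input, done } = iterator.next();
    if (done) return false;
    const id = nextId++;
    const task = Promise.resolve()
      .then(() => fn(input))
      .then(
        (value) => ({ id, value }),
        (error) => ({ id, value: { error: error.message, input } })
      );
    inFlight.set(id, task);
    return true;
  };

  // Fill the initial window
  for (let i = 0; i < limit; i++) {
    if (!launch()) break;
  }

  while (inFlight.size > 0) {
    // Wait for whichever in-flight task settles first
    const { id, value } = await Promise.race(inFlight.values());
    inFlight.delete(id);
    launch(); // Refill the freed slot
    yield value; // Pauses here until the consumer pulls again
  }
}
```

Because the generator is suspended at yield until the consumer calls for more, a slow client naturally slows the rate at which new searches are started.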

Monitoring and Observability

Streaming servers need different metrics than buffered ones:

  • Peak memory per request, not total buffered data
  • Time to first byte (TTFB) and bytes per second throughput
  • Chunk boundaries and batch sizes to tune backpressure
  • Timeout rates to detect downstream bottlenecks

Instrument your streaming handlers:

async function* streamWithMetrics(source) {
  let chunkCount = 0;
  const startTime = Date.now();
  
  for await (const item of source) {
    chunkCount++;
    if (chunkCount % 1000 === 0) {
      const elapsed = Date.now() - startTime;
      // Log to stderr: on the stdio transport, stdout carries protocol messages
      console.error(`Streamed ${chunkCount} items in ${elapsed}ms`);
    }
    yield item;
  }
  
  console.error(`Total chunks: ${chunkCount}, duration: ${Date.now() - startTime}ms`);
}
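Time to first result and throughput from the list above can be captured with a similar wrapper. The following is a sketch rather than a prescribed metrics API: withStreamStats, the onDone callback, and the injectable now clock are all hypothetical, with the clock parameter there mainly to make the wrapper easy to test.

```javascript
// Wrap an async iterable and report timing stats once it finishes or the
// consumer stops early. Stats are passed to onDone rather than logged.
async function* withStreamStats(source, onDone, now = () => performance.now()) {
  const start = now();
  let firstItemAt = null;
  let count = 0;

  try {
    for await (const item of source) {
      if (firstItemAt === null) firstItemAt = now(); // Time to first item
      count++;
      yield item;
    }
  } finally {
    // Runs on normal completion, on error, and on early consumer exit
    const elapsedMs = now() - start;
    onDone({
      count,
      timeToFirstItemMs: firstItemAt === null ? null : firstItemAt - start,
      elapsedMs,
      itemsPerSecond: elapsedMs > 0 ? (count / elapsedMs) * 1000 : 0
    });
  }
}
```

The onDone callback is a convenient place to forward the numbers to whatever metrics system the server already uses.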

Takeaways

  • Buffer only what you must. Streaming decouples the response size from server memory.
  • Parallelize I/O. Bounded concurrency keeps your server responsive under load.
  • Set timeouts. Long operations without limits degrade into hangs and cascading failures.
  • Measure TTFB and throughput. Buffering hides latency; streaming exposes real performance.

With these patterns, your MCP servers can handle arbitrarily large datasets—vector searches, analytics exports, real-time feeds—without sacrificing latency or memory efficiency. The result is an MCP deployment that scales with your LLM workload, not against it.
