LLM Sampling with FastMCP: Using Client LLMs for Scalable AI Workflows in Your MCP Server


Building on our previous discussions about creating your first MCP server and automating MCP server creation, today we're exploring a powerful capability that addresses one of the most significant challenges in MCP server architecture: scaling language model inference.

When building AI-powered MCP servers, you often need to make language model calls within your server-side functions. But what if your server could delegate those LLM requests to the client instead of handling them itself? This is exactly what FastMCP's LLM sampling feature enables.

By offloading language model processing to the client side, your MCP server remains lightweight and scalable, even as the number of concurrent users grows. This approach eliminates server-side bottlenecks for LLM inference while still giving your server functions access to powerful AI capabilities. Furthermore, this provides a way to scale the economics of your MCP server: You can implement advanced AI workflows on the server, but the client takes responsibility for the actual LLM calls and therefore the costs associated with them.

What is LLM Sampling?

LLM sampling in MCP servers refers to the ability of server-side functions to request language model completions from the client's language model rather than using a server-side LLM. This creates a unique inverted architecture where the server sends prompt instructions to the client, which then performs the LLM inference and returns the results back to the server.

In traditional AI architectures, the server hosts both the application logic and the language model integration. With MCP's sampling approach, your server-side code can make requests like "generate a summary of this text" or "analyze the sentiment of this comment," but the actual language model processing happens on the client side.

The flow works as follows:

  1. Your MCP server function needs LLM capabilities (e.g., to classify text or generate content)
  2. The function sends a prompt to the client
  3. The client receives this request and processes it using its own LLM (which could be a local model or another API)
  4. The client returns the generated text back to your server function
  5. Your server function continues processing with the LLM-generated response - almost as if the server had performed the LLM inference itself

This architecture offers significant advantages for MCP applications:

  • Scalability: Your server avoids the compute-intensive LLM inference, allowing it to handle more concurrent users
  • Cost efficiency: The processing costs are distributed across clients rather than centralized on your server
  • Flexibility: Clients can use different models based on their capabilities or preferences
  • Reduced latency: For some applications, eliminating server-side queuing for LLM inference can improve response times

LLM sampling acts as a bridge between your server's processing capabilities and the language model intelligence accessible through the client, creating a more distributed and efficient AI system.

How to Use LLM Sampling with FastMCP

To implement LLM sampling in your FastMCP server, you can use the Context object of FastMCP. Let's walk through a simple example of how to set this up.

First, create the FastMCP server object.

from fastmcp import FastMCP, Context

mcp = FastMCP(
    name="Document Assistant",
    instructions="This server allows you to fetch, summarize and update documents related to PONDHOUSE DATA DOCS.",
)

Next, define a tool that will be used to summarize a document. Note that the following is a simplified example. Document summarization normally involves multiple steps.

Note: LLM sampling is available via the Context object. Simply add a parameter with the type hint Context to your tool function, and FastMCP will inject the context object for you.

To use LLM sampling instead of a server-side LLM, call the ctx.sample function. Provide a message (either a string or a list of strings), a system prompt, and model parameters such as temperature and max tokens.

@mcp.tool()
async def summarize_document(document_text: str, ctx: Context) -> str:
    """Generate a summary of a document"""
    response = await ctx.sample(
        messages=f"Summarize the following document:\n{document_text}",
        system_prompt="You are a document assistant. First extract the key ideas of the document. Then summarize it in a few sentences.",
        temperature=0.7,
        max_tokens=300,
    )

    summary = response.text
    return f"Summary:\n{summary}"

Note: You can also provide a list of SamplingMessage objects to the messages parameter. This is helpful if you want to provide the client's LLM with the context of a previous conversation between a user and an assistant.

# Use this if you want to provide the client LLM with the context of a previous
# conversation between a user and an assistant. Stick to the simpler string
# version if you don't need this.
from mcp.types import SamplingMessage, TextContent

sampling_messages = [
    SamplingMessage(
        role="user", content=TextContent(type="text", text="How are you?")
    ),
    SamplingMessage(
        role="assistant", content=TextContent(type="text", text="I am fine.")
    ),
]
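
For illustration, here is a minimal sketch of how such a message list could be passed to ctx.sample. The tool name and prompts are hypothetical and not part of the example above.

@mcp.tool()
async def continue_conversation(ctx: Context) -> str:
    """Hypothetical tool: let the client's LLM continue the conversation above"""
    response = await ctx.sample(
        messages=sampling_messages,  # list of SamplingMessage objects instead of a plain string
        system_prompt="You are a friendly assistant.",
        temperature=0.7,
        max_tokens=100,
    )
    return response.text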

That's actually all you need to do. Run your FastMCP server as described in our previous blog post and you're done. Keep in mind that the client side needs to implement support for LLM sampling as well.
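
For completeness, starting the server with the SSE transport could look like the following sketch. The transport, host, and port are assumptions here; use whatever matches your deployment.

if __name__ == "__main__":
    # Serve over SSE so the client example below can connect to http://localhost:8000/sse
    mcp.run(transport="sse", host="127.0.0.1", port=8000)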

Implement LLM Sampling in MCP Clients Using FastMCP

To demonstrate how LLM sampling works on the client side, let's use the FastMCP client library. To support LLM sampling, you simply need to implement a sampling handler. As the name implies, this is a function that takes the incoming sampling request from the server and returns the generated text.

from fastmcp.client import Client
from fastmcp.client.sampling import (
    SamplingMessage,
    SamplingParams,
    RequestContext,
)

# Note the `/sse` path suffix for SSE servers.
# For all transports, see https://gofastmcp.com/clients/client#transports
sse_url = "http://localhost:8000/sse"


async def sampling_handler(
    messages: list[SamplingMessage],
    params: SamplingParams,
    context: RequestContext,
) -> str:
    # Convert the incoming MCP messages into a simple chat format
    chat_messages = [{"role": m.role, "content": m.content.text} for m in messages]
    system_prompt = params.systemPrompt
    temperature = params.temperature
    max_tokens = params.maxTokens

    # Here you can implement your own LLM calling logic,
    # for example using OpenAI's API or any other LLM provider.
    # For this example, we just return a dummy response.
    return f"Generated text for prompt: {chat_messages[-1]['content']}"


client = Client(sse_url, sampling_handler=sampling_handler)
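
To see the full round trip, you could connect the client and call the server's tool roughly like this. The document text is just a placeholder.

import asyncio


async def main():
    async with client:
        # Calling the tool triggers a sampling request back to our sampling_handler
        result = await client.call_tool(
            "summarize_document",
            {"document_text": "FastMCP lets server tools delegate LLM calls to the client."},
        )
        print(result)


asyncio.run(main())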

And we're done. As mentioned above, LLM sampling is conceptually simple and easy to implement, as the low-level MCP protocol handles the complexity of the communication between server and client.

In summary: your server-side tools request completions through ctx.sample, the client fulfills those requests with its own LLM via a sampling handler, and the compute and costs of inference shift from your server to the client.
