LLM Sampling with FastMCP: Using Client LLMs for Scalable AI Workflows in Your MCP Server

Building on our previous discussions about creating your first MCP server and automating MCP server creation, today we're exploring a powerful capability that addresses one of the most significant challenges in MCP server architecture: scaling language model inference.
When building AI-powered MCP servers, you often need to make language model calls within your server-side functions. But what if your server could delegate those LLM requests to the client instead of handling them itself? This is exactly what FastMCP's LLM sampling feature enables.
By offloading language model processing to the client side, your MCP server remains lightweight and scalable, even as the number of concurrent users grows. This approach eliminates server-side bottlenecks for LLM inference while still giving your server functions access to powerful AI capabilities. Furthermore, this provides a way to scale the economics of your MCP server: You can implement advanced AI workflows on the server, but the client takes responsibility for the actual LLM calls and therefore the costs associated with them.
What is LLM Sampling?
LLM sampling in MCP servers refers to the ability of server-side functions to request language model completions from the client's language model rather than using a server-side LLM. This creates a unique inverted architecture where the server sends prompt instructions to the client, which then performs the LLM inference and returns the results back to the server.
In traditional AI architectures, the server hosts both the application logic and the language model integration. With MCP's sampling approach, your server-side code can make requests like "generate a summary of this text" or "analyze the sentiment of this comment," but the actual language model processing happens on the client side.
The flow works as follows:
- Your MCP server function needs LLM capabilities (e.g., to classify text or generate content)
- The function sends a prompt to the client
- The client receives this request and processes it using its own LLM (which could be a local model or another API)
- The client returns the generated text back to your server function
- Your server function continues processing with the LLM-generated response - almost as if the server had performed the LLM inference itself
This architecture offers significant advantages for MCP applications:
- Scalability: Your server avoids the compute-intensive LLM inference, allowing it to handle more concurrent users
- Cost efficiency: The processing costs are distributed across clients rather than centralized on your server
- Flexibility: Clients can use different models based on their capabilities or preferences
- Reduced latency: For some applications, eliminating server-side queuing for LLM inference can improve response times
LLM sampling acts as a bridge between your server's processing capabilities and the language model intelligence accessible through the client, creating a more distributed and efficient AI system.
How to Use LLM Sampling with FastMCP
To implement LLM sampling in your FastMCP server, you can use the Context object of FastMCP. Let's walk through a simple example of how to set this up.
First, create the FastMCP server object.
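A minimal sketch of this step might look like the following; the server name is just a placeholder and the import path assumes a recent FastMCP release:

```python
from fastmcp import FastMCP

# Create the FastMCP server object; the name is arbitrary
mcp = FastMCP("Document Summarizer")
```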
Next, define a tool that will be used to summarize a document. Note that the following is a simplified example. Document summarization normally involves multiple steps.
Note: LLM sampling is available through the Context object. Simply add a parameter with the type hint Context to your tool function, and FastMCP will inject the context object for you.
To use LLM sampling instead of a server-side LLM, use the ctx.sample function. Provide a message (this can be a string or a list of strings), a system prompt, and the model parameters, such as temperature and max tokens.
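Putting these pieces together, a simplified sketch of such a summarization tool could look like this. The tool name, prompt text, and parameter values are illustrative, and the example assumes the sampling result is a text content block exposing a .text attribute:

```python
from fastmcp import Context

@mcp.tool()
async def summarize_document(text: str, ctx: Context) -> str:
    """Summarize a document using the client's LLM via sampling."""
    # ctx.sample sends the prompt to the client, which performs the LLM inference
    response = await ctx.sample(
        messages=f"Summarize the following document in a few sentences:\n\n{text}",
        system_prompt="You are a precise assistant that writes concise summaries.",
        temperature=0.3,
        max_tokens=300,
    )
    # For text completions, the returned content block carries the generated text
    return response.text
```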
Note: You can also provide a list of SamplingMessage objects to the messages parameter. This is helpful if you want to give the client's LLM the context of a previous conversation between a user and an assistant.
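A sketch of this variant is shown below. It assumes SamplingMessage and TextContent from the underlying MCP Python SDK (mcp.types); the conversation content is purely hypothetical:

```python
from fastmcp import Context
from mcp.types import SamplingMessage, TextContent

@mcp.tool()
async def summarize_conversation(ctx: Context) -> str:
    """Ask the client's LLM to summarize a prior conversation."""
    # Each SamplingMessage carries a role and a content block
    history = [
        SamplingMessage(role="user", content=TextContent(type="text", text="What does the refund policy say?")),
        SamplingMessage(role="assistant", content=TextContent(type="text", text="Refunds are possible within 30 days of purchase.")),
        SamplingMessage(role="user", content=TextContent(type="text", text="Please summarize this conversation in one sentence.")),
    ]
    response = await ctx.sample(messages=history)
    return response.text
```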
That's actually all you need to do. Run your FastMCP server as described in our previous blog post and you're done. Keep in mind that the client side needs to implement support for LLM sampling as well.
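For completeness, a typical entry point for the server script looks like this, using FastMCP's default stdio transport:

```python
if __name__ == "__main__":
    # Start the server (stdio transport by default)
    mcp.run()
```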
Implement LLM Sampling on MCP Clients using FastMCP
To demonstrate how LLM sampling works on the client side, let's use the FastMCP MCP client library. To support LLM sampling, you simply need to implement a sampling handler. As the name implies, this is a function which takes the incoming sampling request from the server and returns the generated text.
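Here is a minimal sketch of such a client, assuming FastMCP 2.x import paths. The handler below only returns a placeholder string; in practice you would call your own LLM (a local model or an external API) inside it. The server path "my_server.py" and the tool name being called are hypothetical:

```python
import asyncio

from fastmcp import Client
from fastmcp.client.sampling import SamplingMessage, SamplingParams, RequestContext


async def sampling_handler(
    messages: list[SamplingMessage],
    params: SamplingParams,
    context: RequestContext,
) -> str:
    # Collect the text content of the incoming sampling messages
    prompt = "\n".join(
        m.content.text for m in messages if hasattr(m.content, "text")
    )
    # Call your own LLM here and return its output.
    # For demonstration purposes we return a placeholder string.
    return f"[Client-side LLM response to: {prompt[:60]}...]"


async def main():
    # "my_server.py" is a placeholder for your FastMCP server script
    async with Client("my_server.py", sampling_handler=sampling_handler) as client:
        result = await client.call_tool(
            "summarize_document",
            {"text": "FastMCP makes building MCP servers straightforward."},
        )
        print(result)


if __name__ == "__main__":
    asyncio.run(main())
```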
And we're done. As mentioned above, the concept of LLM sampling is rather simple and also quite easy to implement, as the low-level MCP protocol handles the complexity of the communication between server and client.
In summary:
- On the server side, you define a tool that uses the ctx.sample function to request LLM inference from the client
- On the client side, you implement a sampling handler that processes the incoming sampling request and returns the generated text