Run, Test, and Evaluate Models and MCP Locally with Docker + Promptfoo

December 8, 2025 · 883 words · 5 min

is an open-source CLI and library for evaluating LLM apps. makes it easy to manage, run, and deplo

is an open-source CLI and library for evaluating LLM apps. makes it easy to manage, run, and deploy AI models using Docker. The is a local gateway that lets you set up, manage, and run containerized MCP servers and connect them to AI agents.  Together, these tools let you compare models, evaluate MCP servers, and even perform LLM red-teaming from the comfort of your own dev machine. Let’s look at a few examples to see it in action. Before jumping into the examples, we’ll first need to , , pull a few models with docker model, and install promptfoo. 1. Docker MCP Toolkit in Docker Desktop. 2. Docker Model Runner in Docker Desktop. 3. Use the Docker Model Runner CLI to pull the following models 4. Promptfoo With the prerequisites complete, we can get into our first example. Does your prompt and context require paying for tokens from an AI cloud provider or will an open source model provide 80% of the value for a fraction of the cost? How will you systematically re-assess this dilemma every month when your prompt changes, a new model drops, or token costs change? With the in promptfoo, it’s easy to set up a Promptfoo eval to compare a prompt across local and cloud models. In this example, we’ll compare & grade running locally with DMR to Claude Opus 4.1 with a simple prompt about whales.  Promptfoo provides a host of to assess and grade model output.  These assertions range from traditional deterministic evals, such as contains, to model-assisted evals, such as llm-rubric.  By default, the model-assisted evals use Open AI models, but in this example, we’ll use local models powered by DMR.  Specifically, we’ve configured smollm3:Q4_K_M to judge the output and mxbai-embed-large:335M-F16 to perform embedding to check the output semantics. We’ll run the eval and view the results: Figure 1: Evaluating LLM performance in prompfoo and Docker Model Runner Reviewing the results, the smollm3 model judged both responses as passing with similar scores, suggesting that locally running Gemma3 is sufficient for our contrived & simplistic use-case.  For real-world production use-cases, we would employ a richer set of assertions.  MCP servers are sprouting up everywhere, but how do you find the right MCP tools for your use cases, run them, and then assess them for quality and safety?  And again, how do you reassess tools, models, and prompt configurations with every new development in the AI space? The is a centralized, trusted registry for discovering, sharing, and running MCP servers. You can easily add any MCP server in the catalog to the MCP Toolkit running in Docker Desktop.  And it’s straightforward to connect promptfoo to the MCP Toolkit to evaluate each tool. Let’s look at an example of direct MCP testing.  Direct MCP testing is helpful to validate how the server handles authentication, authorization, and input validation.  First, we’ll quickly enable the Fetch, GitHub, and Playwright MCP servers in Docker Desktop with the MCP Toolkit.  Only the GitHub MCP server requires authentication, but the MCP Toolkit makes it straightforward to quickly configure it with the built-in OAuth provider. Figure 2: Enabling the Fetch, GitHub, and Playwright MCP servers in Docker MCP Toolkit with one click Next, we’ll configure the MCP Toolkit as a Promptfoo provider.  Additionally, it’s straightforward to run & connect containerized MCP servers, so we’ll also manually enable the mcp/youtube-transcript MCP server to be launched with a simple docker run command. With the MCP provider configured, we can declare some tests to validate the MCP server tools are available, authenticated, and functional. We can run this eval with the promptfoo eval command. Direct testing of MCP tools is helpful, but how do we evaluate the entire MCP stack for privacy, safety, and accuracy?  Enter Promptfoo servers.  And the Docker MCP Toolkit makes it very straightforward to integrate Promptfoo with agent applications using MCP servers. In this example, we evaluate an agent that summarizes GitHub repositories with the GitHub MCP server.  We’ll start by configuring the provider with Claude Opus 4.1 connected to Docker MCP toolkit with the GitHub MCP server.  The GitHub MCP server will be authenticated with the built-in OAuth integration in Docker Desktop. Next, we’ll define a prompt for the application agent. And then we’ll define a prompt for the red-team agent along with plugins and strategies for evaluating the MCP application. Next, we’ll use the promptfoo redteam run command to generate and run a plan.  The test plan, including synthetic test cases and data, is written to redteam.yaml. You can use promptfoo view to launch the evaluation results in the browser. After reviewing the results, we can see that our agent is vulnerable to Tool Discovery, so we’ll update our application prompt to include the following guideline and re-run the red-team to validate that the new guideline sufficiently mitigates the vulnerability. Figure 3: Red-team Results Summary with Tool Discovery failures Figure 4: Red-team Tool Discovery Failure And that’s a wrap. Promptfoo, Docker Model Runner, and Docker MCP Toolkit enable teams to evaluate prompts with different models, directly test MCP tools, and perform AI-assisted red-team tests of agentic MCP applications. If you’re interested in test driving these examples yourself, clone the repository to run them.