
Even the best AI agents are thwarted by this protocol – what can be done




ZDNET’s key takeaways

  • Even the best AI models are challenged to carry out tasks via MCP.
  • New benchmarks show models struggle when tasks become more complex.
  • AI models need additional training that is specific to MCP use.

An emerging category of artificial intelligence middleware known as Model Context Protocol is meant to make generative AI programs such as chatbots more powerful by letting them connect with various resources, including packaged software such as databases.

Multiple studies, however, reveal that even the best AI models struggle to use Model Context Protocol. Top AI models such as Google’s Gemini 2.5 Pro require many, many rounds of interaction with the external programs, leading to long delays before the AI models complete their tasks.

Also: What is Model Context Protocol? The emerging standard bridging AI and data, explained

“Even state-of-the-art models struggle with different capabilities,” writes Zhenting Wang and team at consulting firm Accenture, the MIT-IBM Watson AI Lab, and the University of California at Berkeley in an August work that introduced MCP-Bench, a set of 250 tasks for AI agents employing MCP.

“Performance generally declines as tasks transition from Single Server to Multi Server scopes,” writes Zikang Guo and team at the University of Science and Technology of China in a paper last month describing tests of several AI models on their own benchmark, MCP-AgentBench.


Even the best models today, including OpenAI’s GPT-5, have “failure cases” arising from “repetitive or exploratory interactions that fail to make meaningful progress,” writes lead author Zijian Wu and team at the National University of Singapore and collaborating institutions in the paper announcing their benchmark, MCPMark, last month.

Where an AI model can go wrong with MCP

MCP is a kind of middleware that structures an AI program’s access to outside software as client-server interactions. It was introduced last year by gen AI startup Anthropic (makers of the Claude family of large language models and chatbots) as a secure, industry-standard way to connect LLMs and AI agents to external software resources such as databases and customer relationship management software.
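In concrete terms, an MCP client and server exchange JSON-RPC 2.0 messages. The Python sketch below shows the rough shape of those messages; the tools/list and tools/call method names come from the MCP specification, while the tool name and arguments shown are hypothetical.

```python
import json

# A minimal sketch of the JSON-RPC 2.0 messages an MCP client exchanges with
# a server. The method names come from the MCP specification; the tool name
# and arguments here are hypothetical.

# 1. Ask the server which tools it exposes.
list_tools = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# 2. Invoke one of those tools with structured arguments.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "query_database",          # hypothetical tool name
        "arguments": {"table": "customers", "limit": 10},
    },
}

# The model's job is to emit output that the MCP client can turn into
# well-formed messages like these, sent over stdio or HTTP transport.
print(json.dumps(list_tools, indent=2))
print(json.dumps(call_tool, indent=2))
```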

As ZDNET’s Steven Vaughan-Nichols explains, middleware like MCP can reduce the number of connections that an AI program has to initiate to connect to multiple external resources. 

Also: ChatGPT can now connect to MCP servers – here’s how, and what to watch for

However, having a standard does not mean that an AI model, whose functionality includes a heavy dose of chance (“probability” in technical terms), will faithfully implement MCP.

An AI model plugged into MCP has to generate output that achieves several things: formulating a plan to answer a query, choosing which external resources to access and in what order to contact the MCP servers that lead to those external applications, and then structuring a series of requests for information whose results can be assembled into a final answer.
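Stripped down, that workflow is a loop: the model plans a step, the MCP client relays the tool call, and the result informs the next decision. The following Python sketch shows only that control flow, with stub functions standing in for the model and the MCP client; it is not any benchmark’s actual harness.

```python
# Schematic agent loop. llm() and mcp_call() are stand-in stubs so the sketch
# runs; a real system would call a model API and an MCP client library.

def llm(history, tools):
    """Stub model: returns a final answer immediately. A real model would
    decide each turn whether to call another tool or finish."""
    return {"type": "final_answer", "content": "stub answer"}

def mcp_call(server, tool, arguments):
    """Stub dispatcher standing in for a real MCP client request."""
    return f"{server}.{tool}({arguments}) -> stub result"

def answer_query(query, servers, max_turns=20):
    history = [{"role": "user", "content": query}]
    for _ in range(max_turns):              # each loop iteration is one "turn"
        step = llm(history, tools=servers)  # the model plans its next action
        if step["type"] == "final_answer":
            return step["content"]          # the model decides it is done
        # Otherwise the model asked for a tool: route it to the chosen server.
        result = mcp_call(step["server"], step["tool"], step["arguments"])
        history.append({"role": "tool", "content": result})
    return "turn budget exhausted"          # the failure mode the papers flag

print(answer_query("Plan a week-long hiking loop from Denver",
                   {"nps": ["findParks", "getAlerts"]}))
```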

The various studies show that while top-of-the-line models such as Gemini 2.5 Pro and GPT-5 do better than less-capable programs, all models are still limited in their ability to manage all those challenges. Issues across all the models include taking an excessive number of steps to retrieve the information, even when the language model’s plan of approach was sound to begin with.

What the benchmarks tell us 

(Image: MCP-Bench workflow. Credit: UC Berkeley, Accenture, IBM)

All the benchmark tests take a similar approach: They assemble a group of challenging information queries, a collection of MCP servers the AI models can access, and the information resources to which those MCP servers grant access.

The resources in these tests are often publicly available resources such as Google Search, Wikipedia, or some other widely available repository of information. 

(Image: An example MCP-Bench task. Credit: UC Berkeley, Accenture, IBM)

An example problem from the Accenture work of Wang and team was to retrieve online information to plan a week-long hiking and camping trip. The prompt began with “I’m trying to plan a week-long hiking and camping loop that starts and ends in Denver, and I’m hoping you can really nerd out with me on the details,” and then went on to specify several requirements, such as which parks to visit, visitor hours, chances of rain, etc.

The request was to be sent to multiple MCP server-enabled information services, including Google Maps and the US national park websites, and to specific tools such as “findParks, getParkDetails, getAlerts, getVisitorCenters, getCampgrounds, getEvents.”
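A hedged sketch of how a model might map the prompt’s requirements onto those tools is below; the tool names come from the benchmark description, while the requirement wording, argument fields, and park code are illustrative assumptions.

```python
# Illustrative decomposition of the hiking-trip prompt into tool calls.
# Tool names are from the benchmark description above; the requirement
# strings, argument fields, and "romo" park code are assumptions.
requirement_to_tool = {
    "which parks fit a loop from Denver": ("findParks",         {"stateCode": "CO"}),
    "visitor center hours":               ("getVisitorCenters", {"parkCode": "romo"}),
    "active closures or alerts":          ("getAlerts",         {"parkCode": "romo"}),
    "where to camp each night":           ("getCampgrounds",    {"parkCode": "romo"}),
    "ranger events during the trip":      ("getEvents",         {"parkCode": "romo"}),
}

for need, (tool, args) in requirement_to_tool.items():
    print(f"{need} -> {tool}{args}")
```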

Also: Anthropic now lets developers use Claude Code with any remote MCP server

All of the benchmarks are meant to push the measurement of AI models beyond simple function-calling challenges. The benchmarks require the AI models to achieve multiple requirements, including turning the natural-language prompt into tool requests that respect the schema, meaning the required structure of MCP messages and tool arguments, which is specified in the JSON format on which MCP is built.
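To make that concrete, here is a minimal sketch of a tool definition and of arguments that do and do not conform to it. The name, description, and inputSchema fields follow the MCP tool format; the forecast tool itself and its parameters are hypothetical.

```python
# A hypothetical MCP tool definition. The "name"/"description"/"inputSchema"
# fields follow the MCP tool format; the tool and its parameters are made up.
tool_definition = {
    "name": "getForecast",
    "description": "Return the chance of rain for a park on a given date.",
    "inputSchema": {                        # standard JSON Schema
        "type": "object",
        "properties": {
            "parkCode": {"type": "string"},
            "date":     {"type": "string", "format": "date"},
        },
        "required": ["parkCode", "date"],
    },
}

# The model must produce arguments that conform to that schema.
good_arguments = {"parkCode": "romo", "date": "2026-06-14"}
bad_arguments  = {"park": "Rocky Mountain"}     # wrong key, missing "date"

required = set(tool_definition["inputSchema"]["required"])
print(required.issubset(good_arguments))   # True: required fields present
print(required.issubset(bad_arguments))    # False: required fields missing
```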

Respecting schema is just the lowest level of achievement. At a higher level, “agents must identify the correct tools from large, heterogeneous tool spaces when confronted with ambiguous or underspecified task descriptions,” writes Wang and team. “This requires disambiguating semantic variants, coping with naming inconsistencies, and avoiding traps posed by superficially plausible but irrelevant tools.” 

The benchmarks typically measure how many different resources a program will tap into, and how many “turns” are required, a measure of the efficiency with which an AI model uses those resources.

Also: Is AI even worth it for your business? 5 expert tips to help prove ROI

As Wang and team describe it, MCP-Bench “measures structural coherence, dependency awareness, parallelism efficiency, and reflective adaptation. Tasks include not only linear workflows but also complex compositions requiring concurrent interactions across multiple servers with multiple objectives.” All of which is taken as a sign of a model’s greater or lesser ability to engage in what’s called “long-horizon planning.”

If an AI model has to take more and more turns to get the information it needs from an MCP server, that may suggest the model is unable to properly plan how to use the available resources.
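As a rough illustration of that idea (not any paper’s actual metric), one could compare the turns a model used against the minimum number it needed; the task names and numbers below are placeholders.

```python
# Placeholder illustration of a turn-efficiency comparison. The tasks, turn
# counts, and the simple ratio are made up; each paper defines its own rubric.
def turn_efficiency(minimum_turns: int, turns_used: int) -> float:
    """Fraction of the model's turns that were strictly necessary."""
    return minimum_turns / turns_used

examples = [
    ("single-server lookup",   3,  4),    # (task, minimum turns, turns used)
    ("multi-server itinerary", 8, 19),
]

for task, minimum, used in examples:
    print(f"{task}: {turn_efficiency(minimum, used):.0%} of turns were needed")
```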

All of these benchmarks test multiple large language models to compare how the current crop of offerings performs on a relative basis.

(Image: MCP-Bench scores across models. Credit: UC Berkeley, Accenture, IBM)

The good news is that all three studies mentioned here reported that bigger, more powerful AI models scored better than smaller models. That suggests that as models get better in many respects, they can also improve on MCP-related challenges. 

(Image: MCPMark outline. Credit: National University of Singapore)