
Even the best AI agents are thwarted by this protocol – what can be done




ZDNET’s key takeaways

  • Even the best AI models are challenged to carry out tasks via MCP.
  • New benchmarks show models struggle when tasks become more complex.
  • AI models need additional training that is specific to MCP use.

An emerging category of artificial intelligence middleware known as Model Context Protocol is meant to make generative AI programs such as chatbots more powerful by letting them connect with various resources, including packaged software such as databases.

Multiple studies, however, reveal that even the best AI models struggle to use Model Context Protocol. Top AI models such as Google’s Gemini 2.5 Pro require many, many rounds of interaction with the external programs, leading to long delays before the AI models complete their tasks.

Also: What is Model Context Protocol? The emerging standard bridging AI and data, explained

“Even state-of-the-art models struggle with different capabilities,” writes Zhenting Wang and team at consulting firm Accenture, the MIT-IBM Watson AI Lab, and the University of California at Berkeley in an August work that introduced MCP-Bench, a set of 250 tasks for AI agents employing MCP.

“Performance generally declines as tasks transition from Single Server to Multi Server scopes,” writes Zikang Guo and team at the University of Science and Technology of China in a paper last month describing tests of several AI models on their own benchmark, MCP-AgentBench.


Even the best models today, including OpenAI’s GPT-5, have “failure cases” arising from “repetitive or exploratory interactions that fail to make meaningful progress,” writes lead author Zijian Wu and team at the National University of Singapore and collaborating institutions in the paper announcing their benchmark, MCPMark, last month.

Where an AI model can go wrong with MCP

MCP is a kind of middleware that structures an AI program’s access to outside software as client-server interactions. It was introduced last year by gen AI startup Anthropic (makers of the Claude family of large language models and chatbots) as a secure, industry-standard way to connect LLMs and AI agents to external software resources such as databases and customer relationship management software.
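In concrete terms, an MCP client and server exchange JSON-RPC 2.0 messages. The Python sketch below shows the rough shape of those messages; the tools/list and tools/call method names come from the MCP specification, while the tool name and arguments shown are hypothetical.

```python
import json

# A minimal sketch of the JSON-RPC 2.0 messages an MCP client exchanges with
# a server. The method names come from the MCP specification; the tool name
# and arguments here are hypothetical.

# 1. Ask the server which tools it exposes.
list_tools = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# 2. Invoke one of those tools with structured arguments.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "query_database",          # hypothetical tool name
        "arguments": {"table": "customers", "limit": 10},
    },
}

# The model's job is to emit output that the MCP client can turn into
# well-formed messages like these, sent over stdio or HTTP transport.
print(json.dumps(list_tools, indent=2))
print(json.dumps(call_tool, indent=2))
```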

As ZDNET’s Steven Vaughan-Nichols explains, middleware like MCP can reduce the number of connections that an AI program has to initiate to connect to multiple external resources. 

Also: ChatGPT can now connect to MCP servers – here’s how, and what to watch for

However, having a standard does not mean that an AI model, whose functionality includes a heavy dose of chance (“probability” in technical terms), will faithfully implement MCP.

An AI model plugged into MCP has to generate output that achieves several things: formulating a plan to answer a query, choosing which external resources to access and in what order to contact the MCP servers that lead to those external applications, and then structuring a series of requests for information whose results can be assembled into a final answer.
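Stripped down, that workflow is a loop: the model plans a step, the MCP client relays the tool call, and the result informs the next decision. The following Python sketch shows only that control flow, with stub functions standing in for the model and the MCP client; it is not any benchmark’s actual harness.

```python
# Schematic agent loop. llm() and mcp_call() are stand-in stubs so the sketch
# runs; a real system would call a model API and an MCP client library.

def llm(history, tools):
    """Stub model: returns a final answer immediately. A real model would
    decide each turn whether to call another tool or finish."""
    return {"type": "final_answer", "content": "stub answer"}

def mcp_call(server, tool, arguments):
    """Stub dispatcher standing in for a real MCP client request."""
    return f"{server}.{tool}({arguments}) -> stub result"

def answer_query(query, servers, max_turns=20):
    history = [{"role": "user", "content": query}]
    for _ in range(max_turns):              # each loop iteration is one "turn"
        step = llm(history, tools=servers)  # the model plans its next action
        if step["type"] == "final_answer":
            return step["content"]          # the model decides it is done
        # Otherwise the model asked for a tool: route it to the chosen server.
        result = mcp_call(step["server"], step["tool"], step["arguments"])
        history.append({"role": "tool", "content": result})
    return "turn budget exhausted"          # the failure mode the papers flag

print(answer_query("Plan a week-long hiking loop from Denver",
                   {"nps": ["findParks", "getAlerts"]}))
```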

The various studies show that while top-of-the-line models such as Gemini 2.5 Pro and GPT-5 do better than less-capable programs, all models are still limited in their ability to manage all those challenges. Issues across all the models include taking an excessive number of steps to retrieve the information, even when the language model’s plan of approach was sound to begin with.

What the benchmarks tell us 

(Image: MCP-Bench workflow. Credit: UC Berkeley, Accenture, IBM)

All the benchmark tests take a similar approach: They assemble a group of challenging information queries, a collection of MCP servers the AI models can access, and the information resources to which those MCP servers grant access.

The resources in these tests are often publicly available resources such as Google Search, Wikipedia, or some other widely available repository of information. 

(Image: An example MCP-Bench task. Credit: UC Berkeley, Accenture, IBM)

An example problem from the Accenture work of Wang and team was to retrieve online information to plan a week-long hiking and camping trip. The prompt began with “I’m trying to plan a week-long hiking and camping loop that starts and ends in Denver, and I’m hoping you can really nerd out with me on the details,” and then went on to specify several requirements, such as which parks to visit, visitor hours, chances of rain, etc.

The request was to be sent to multiple MCP server-enabled information services, including Google Maps and the US national park websites, and to specific tools such as “findParks, getParkDetails, getAlerts, getVisitorCenters, getCampgrounds, getEvents.”
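A hedged sketch of how a model might map the prompt’s requirements onto those tools is below; the tool names come from the benchmark description, while the requirement wording, argument fields, and park code are illustrative assumptions.

```python
# Illustrative decomposition of the hiking-trip prompt into tool calls.
# Tool names are from the benchmark description above; the requirement
# strings, argument fields, and "romo" park code are assumptions.
requirement_to_tool = {
    "which parks fit a loop from Denver": ("findParks",         {"stateCode": "CO"}),
    "visitor center hours":               ("getVisitorCenters", {"parkCode": "romo"}),
    "active closures or alerts":          ("getAlerts",         {"parkCode": "romo"}),
    "where to camp each night":           ("getCampgrounds",    {"parkCode": "romo"}),
    "ranger events during the trip":      ("getEvents",         {"parkCode": "romo"}),
}

for need, (tool, args) in requirement_to_tool.items():
    print(f"{need} -> {tool}{args}")
```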

Also: Anthropic now lets developers use Claude Code with any remote MCP server

All of the benchmarks are meant to push the measurement of AI models beyond simple function-calling challenges. The benchmarks require the AI models to achieve multiple requirements, including turning the natural-language prompt into tool requests that respect the schema, meaning the required structure of MCP messages and tool arguments, which is specified in the JSON format on which MCP is built.
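To make that concrete, here is a minimal sketch of a tool definition and of arguments that do and do not conform to it. The name, description, and inputSchema fields follow the MCP tool format; the forecast tool itself and its parameters are hypothetical.

```python
# A hypothetical MCP tool definition. The "name"/"description"/"inputSchema"
# fields follow the MCP tool format; the tool and its parameters are made up.
tool_definition = {
    "name": "getForecast",
    "description": "Return the chance of rain for a park on a given date.",
    "inputSchema": {                        # standard JSON Schema
        "type": "object",
        "properties": {
            "parkCode": {"type": "string"},
            "date":     {"type": "string", "format": "date"},
        },
        "required": ["parkCode", "date"],
    },
}

# The model must produce arguments that conform to that schema.
good_arguments = {"parkCode": "romo", "date": "2026-06-14"}
bad_arguments  = {"park": "Rocky Mountain"}     # wrong key, missing "date"

required = set(tool_definition["inputSchema"]["required"])
print(required.issubset(good_arguments))   # True: required fields present
print(required.issubset(bad_arguments))    # False: required fields missing
```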

Respecting schema is just the lowest level of achievement. At a higher level, “agents must identify the correct tools from large, heterogeneous tool spaces when confronted with ambiguous or underspecified task descriptions,” writes Wang and team. “This requires disambiguating semantic variants, coping with naming inconsistencies, and avoiding traps posed by superficially plausible but irrelevant tools.” 

The benchmarks typically measure how many different resources a program will tap into, and how many “turns” are required, a measure of the efficiency with which an AI model uses those resources.

Also: Is AI even worth it for your business? 5 expert tips to help prove ROI

As Wang and team describe it, MCP-Bench “measures structural coherence, dependency awareness, parallelism efficiency, and reflective adaptation. Tasks include not only linear workflows but also complex compositions requiring concurrent interactions across multiple servers with multiple objectives.” All of which is taken as a sign of a model’s greater or lesser ability to engage in what’s called “long-horizon planning.”

If an AI model has to take more and more turns to get the information it needs from an MCP server, that may suggest the model is unable to properly plan how to use the available resources.
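As a rough illustration of that idea (not any paper’s actual metric), one could compare the turns a model used against the minimum number it needed; the task names and numbers below are placeholders.

```python
# Placeholder illustration of a turn-efficiency comparison. The tasks, turn
# counts, and the simple ratio are made up; each paper defines its own rubric.
def turn_efficiency(minimum_turns: int, turns_used: int) -> float:
    """Fraction of the model's turns that were strictly necessary."""
    return minimum_turns / turns_used

examples = [
    ("single-server lookup",   3,  4),    # (task, minimum turns, turns used)
    ("multi-server itinerary", 8, 19),
]

for task, minimum, used in examples:
    print(f"{task}: {turn_efficiency(minimum, used):.0%} of turns were needed")
```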

All of these benchmarks test multiple large language models to compare how the current crop of offerings performs on a relative basis.

(Image: MCP-Bench scores across models. Credit: UC Berkeley, Accenture, IBM)

The good news is that all three studies mentioned here reported that bigger, more powerful AI models scored better than smaller models. That suggests that as models get better in many respects, they can also improve on MCP-related challenges. 

(Image: MCPMark outline. Credit: National University of Singapore)