Tool Chaos No More: How We’re Measuring Model-Tool Accuracy in the Age of MCP

Introduction
Picture this scenario: you’ve built an AI agent, given it access to dozens of tools, and deployed it to handle a complex workflow. But instead of executing queries crisply, it’s making redundant tool calls, burning API credits needlessly, and overcomplicating straightforward processes.
This isn’t just an edge-case bug – it’s a fundamental challenge in the era of agentic AI. With the Model Context Protocol (MCP) making it easier than ever to connect models to a broad tool ecosystem, the real test is whether these models can reliably choose the right tool when it matters. That’s why we set out to benchmark tool call accuracy across leading SOTA models, comparing their performance as the number of available tools and the amount of context fed to the model vary. The results reveal what’s working, what’s not, and why smart tool selection is now at the heart of effective AI automation.
Experiment Setup: Models, Tools, and Evaluation
To benchmark tool call accuracy, we evaluated five leading models – Claude Sonnet 4, Claude Opus 4, Claude 3.7 Sonnet, Gemini 2.5 Pro, and GPT 4.1 – using a suite of GitHub and Notion tools, all exposed via our own MCP servers. We observed how tool call accuracy is affected by two factors: the number of tools provided and the amount of context fed to the model. In the first experiment, each model was given access to 48 distinct tools and tasked with performing a range of real-world actions involving both reading from and writing to GitHub and Notion. We then repeated the process with a reduced set of 25 tools to observe how tool count affects model performance.
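To make the setup concrete, here is a minimal sketch of the kind of harness involved: it lists the tools exposed by an MCP server and hands them to a model’s tool-calling API. This is an illustrative outline rather than our exact benchmark code; the server command, model id, and query are placeholders.

```python
import asyncio

from anthropic import Anthropic
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def run_query(query: str) -> None:
    # Placeholder command for a locally launched MCP server (e.g. a GitHub tool server).
    server = StdioServerParameters(command="my-github-mcp-server", args=[])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools the MCP server exposes.
            listed = await session.list_tools()

            # Convert MCP tool definitions into the shape the model API expects.
            tools = [
                {
                    "name": t.name,
                    "description": t.description or "",
                    "input_schema": t.inputSchema,
                }
                for t in listed.tools
            ]

            # Ask the model which tool(s) it would call for this query.
            client = Anthropic()
            response = client.messages.create(
                model="claude-sonnet-4-20250514",  # illustrative model id
                max_tokens=1024,
                tools=tools,
                messages=[{"role": "user", "content": query}],
            )

            # Record the tool calls the model decided to make.
            for block in response.content:
                if block.type == "tool_use":
                    print(block.name, block.input)


asyncio.run(run_query("List the open issues in my repository"))
```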
For each scenario, we used our Tool Call Accuracy Evaluator, which compares the model’s tool usage against a defined set of expected tool calls for every query. To keep the results meaningful and comparable, we designed each query to have a minimal set of correct tool calls - ideally just one - eliminating ambiguity in cases where multiple tool combinations could have worked. Where some flexibility is needed, block types such as anyOf and inAnyOrder allow the evaluator to accept any one of several specified tool combinations, or any sequence in which the tools are called, respectively. This approach allowed us to isolate and measure each model’s ability to select and use the right tools through the standardized interface provided by MCP.
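To illustrate how such a spec can stay flexible, here is a simplified, hypothetical example with a tiny matcher. This is not Maxim’s actual evaluator schema or implementation; the block names mirror the concepts above, while the tool names and structure are our own.

```python
# Hypothetical expected-tool-call spec for one query (not Maxim's actual schema).
# anyOf: any one of the listed alternatives satisfies the step.
# inAnyOrder: all listed steps must be satisfied, in any sequence.
expected = {
    "inAnyOrder": [
        {"tool": "github_list_issues"},
        {
            "anyOf": [
                {"tool": "notion_create_page"},
                {"tool": "notion_append_block"},
            ]
        },
    ]
}


def matches(actual_calls: list[dict], block: dict) -> bool:
    """Minimal matcher sketch: checks the model's actual tool calls against a spec block."""
    if "anyOf" in block:
        return any(matches(actual_calls, alt) for alt in block["anyOf"])
    if "inAnyOrder" in block:
        return all(matches(actual_calls, step) for step in block["inAnyOrder"])
    # Leaf block: the named tool must appear somewhere in the actual calls.
    return any(call["tool"] == block["tool"] for call in actual_calls)


# Example: the model listed issues first, then created a Notion page.
actual = [{"tool": "github_list_issues"}, {"tool": "notion_create_page"}]
print(matches(actual, expected))  # True
```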
Results: Tool Call Accuracy Across Models and Toolsets
(i) Experiment 1: Variation in the Number of Tools Provided to the Model
- Fewer tools, better accuracy: Reducing the number of available tools from 48 to 25 led to improved accuracy across all models. With fewer options, the models were less likely to hallucinate or mistakenly invoke irrelevant tools, resulting in more reliable task completion.
- Claude models are generally more accurate: Claude Sonnet 4 and Claude 3.7 Sonnet consistently achieved the highest tool call accuracy in our benchmarks. However, Claude Sonnet 4 showed a tendency to overuse tools, making more calls than strictly necessary; it still led on accuracy, but this could hurt efficiency and cost in real-world scenarios.
Model | Accuracy (48 Tools) | Accuracy (25 Tools) |
---|---|---|
Claude Sonnet 4 | 66.67% | 73.33% |
Claude Opus 4 | 66.67% | 66.67% |
Claude 3.7 Sonnet | 66.67% | 73.33% |
GPT 4.1 | 46.67% | 53.33% |
Gemini 2.5 Pro | 53.33% | 60.00% |
- GPT 4.1 struggles with schema understanding but is extremely fast: GPT 4.1 often misinterpreted tool schemas, leading to lower accuracy than the Claude models. However, it completed tasks exceptionally quickly - up to 22.5 times faster - which may partly explain its schema misinterpretations. Even so, the speed-versus-accuracy trade-off was poor.
- Gemini 2.5 Pro delivers average performance: Gemini 2.5 Pro’s accuracy consistently landed in the middle of the pack. It showed moderate improvements with fewer tools but failed to match the top-performing Claude models.
- Lower temperature yields slight accuracy gains: In additional tests, decreasing the model temperature led to a small but measurable increase in tool call accuracy, as models became more conservative and less prone to speculative tool use.
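For reference, temperature is just a per-request parameter. Reusing the client, tools, and query names from the earlier sketch, a lower value can be set like this (the value and model id are illustrative):

```python
# Same tool-selection request as before, but with a lower temperature so the
# model is more deterministic and less prone to speculative tool use.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    temperature=0.2,
    tools=tools,
    messages=[{"role": "user", "content": query}],
)
```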
Next, we wanted to see how adding conversation history as context would affect tool call accuracy. Using a similar set of queries, we introduced a conversation-history column and re-ran the evaluations to analyze model performance with richer contextual inputs.
(ii) Experiment 2: Increase in Context Provided (25 Tools Provided)
- More conversation history increased tool call accuracy: Accuracy scores rose for most models, as a longer conversation history meant that relevant details such as previous tool calls and their responses were included in the model’s context window. With this richer context and a better grasp of user intent, models are more likely to pick the right tools and parameters (a minimal sketch of how this history is threaded into a request appears after the table below).
Dataset Info | Model | Accuracy |
---|---|---|
15 GitHub + Notion Samples | Claude Sonnet 4 | 80.00% |
15 GitHub + Notion Samples | Claude Opus 4 | 73.33% |
15 GitHub + Notion Samples | Claude 3.7 Sonnet | 66.67% |
15 GitHub + Notion Samples | GPT 4.1 | 80.00% |
15 GitHub + Notion Samples | Gemini 2.5 Pro | 73.33% |
However, providing too much context can make a model less accurate: past a point, the extra detail leads it to hallucinate or misuse tools rather than improving performance.
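To make the richer-context setup concrete, here is a minimal sketch of how an earlier tool call and its result can be threaded back into the next request. It reuses the client and tools names from the earlier sketch and assumes the Anthropic Messages API; the ids, tool names, and contents are illustrative.

```python
# Conversation history carrying a previous tool call and its result as context
# (ids, tool names, and contents are illustrative).
history = [
    {"role": "user", "content": "Which issues are open in repo X?"},
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_use",
                "id": "toolu_01",
                "name": "github_list_issues",
                "input": {"repo": "X", "state": "open"},
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": "toolu_01",
                "content": '[{"number": 42, "title": "Fix login bug"}]',
            },
            # The follow-up query arrives with the earlier call and result in context.
            {"type": "text", "text": "Create a Notion page summarizing those issues."},
        ],
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    tools=tools,
    messages=history,
)
```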
Practical Implications
So, the answer to the crucial question - “How many tools are too many to give a model through MCP?” - depends entirely on the use case and workflow. If you give your agent two toolsets via MCP with overlapping functionality, the model can easily become confused and pick the wrong tool. To improve accuracy, carefully select and provide only the tools that are truly necessary for the task at hand.
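One lightweight way to do this is to filter the discovered tool list down to an explicit per-workflow allowlist before it ever reaches the model. Building on the tools list from the earlier sketch, with illustrative tool names:

```python
# Expose only the tools this workflow actually needs (names are illustrative).
ALLOWED_TOOLS = {"github_list_issues", "github_create_issue", "notion_create_page"}

filtered_tools = [t for t in tools if t["name"] in ALLOWED_TOOLS]
```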
Model Selection: Claude Sonnet 4 was the strongest tool caller among the models we tested, but in scenarios where speed is essential and the toolset is small with simple schemas, GPT 4.1 can be a viable option. Additionally, models tend to make more accurate tool calls after a few interaction turns, as the added context helps them better understand user intent and task requirements.
Monitoring and Observability: Testing tool usage patterns before production deployment is absolutely critical. AI agents can behave unpredictably, make redundant tool calls that unnecessarily consume API credits, and degrade system performance.
Evaluating tool call accuracy provides essential feedback that helps teams identify edge cases, optimize tool configurations, and prevent performance issues before they reach end users.
For more information on our experiment, please go through this document.
Ready to Optimize Your Agent’s Tool Calls?
Whether you’re building vertical agents for a specific industry or designing a workflow automation system, optimizing tool call accuracy is essential before deploying your agent in the real world.
Maxim’s suite of evaluators - such as Tool Call Accuracy, Agent Trajectory, Step Utility, and others - helps you not only track what actions your agents take, but also understand how reliably and efficiently they’re selecting the right tools. By analyzing precisely what failed and why, you unlock smarter, more cost-effective AI automation.
Get started today:
- ⚡ Quick Start: Sign up for free evaluation credits
- 🔧 Easy Integration: RESTful APIs & SDKs with comprehensive documentation
- 📊 Instant Insights: Real-time AI quality assessments and monitoring
- 💡 Expert Support: Our team helps optimize your evaluation strategy