Model Benchmarking Suite¶
Overview¶
Different LLMs have different propensities to hallucinate tool calls. agentguard's benchmarking suite lets you systematically measure and compare models on tool-call accuracy.
Why Benchmark Tool-Call Accuracy?¶
Published benchmarks (MMLU, HumanEval, etc.) don't measure tool-call reliability. A model that scores 90% on MMLU might hallucinate 30% of tool calls when under context pressure. agentguard's benchmark measures:
- Tool call accuracy — does the model call the right tool with the right parameters?
- Hallucination rate — how often does the model fabricate responses?
- Error recovery — does the model fix its tool calls after a validation error?
- Consistency — does the model produce the same calls for the same prompts?
Quick Start¶
from agentguard.benchmarking import ToolCallBenchmark, BenchmarkSuite
# Define your tools and test cases
tools = [get_weather, search_web, query_database]
test_cases = [
{
"prompt": "What's the weather in London?",
"expected_tool": "get_weather",
"expected_args": {"city": "London"},
},
{
"prompt": "Search for recent AI papers",
"expected_tool": "search_web",
"expected_args": {"query": ...}, # Fuzzy match
},
]
suite = BenchmarkSuite(tools=tools, test_cases=test_cases)
# Run across multiple models
results = suite.run(
models=[
"gpt-4o",
"gpt-4o-mini",
"claude-3-5-sonnet-20241022",
"llama-3.3-70b-versatile",
],
runs_per_case=5, # Average over 5 runs per case for consistency
)
suite.print_report(results)
Output:
Model Accuracy Hallucination% Error Recovery Consistency
gpt-4o 94.2% 5.8% 78.3% 96.1%
gpt-4o-mini 87.1% 12.9% 61.2% 89.4%
claude-3-5-sonnet-20241022 91.8% 8.2% 71.4% 93.2%
llama-3.3-70b-versatile 82.3% 17.7% 54.1% 85.7%
Benchmark Metrics¶
Accuracy¶
Fraction of test cases where the model called the correct tool with the correct arguments.
"Correct" means: correct tool name + all required arguments present with correct types.
Hallucination Rate¶
Fraction of calls where agentguard's HallucinationDetector flagged the response.
Error Recovery Rate¶
When the model makes a bad call (wrong tool, wrong args, validation failure), how often does it fix the call in the next turn?
Consistency¶
Given the same prompt, how often does the model make the same tool call across multiple runs?
Custom Test Cases¶
from agentguard.benchmarking import TestCase, FuzzyArg
cases = [
TestCase(
prompt="Get the stock price for Apple",
expected_tool="get_stock_price",
expected_args={
"ticker": FuzzyArg(["AAPL", "Apple", "apple"]), # Any of these is correct
},
# Optional: test that the model corrects itself after an error
inject_error=True,
error_message="ValidationError: ticker must be a stock symbol like 'AAPL'",
),
TestCase(
prompt="Search for Python async tutorials",
expected_tool="search_web",
expected_args={
"query": FuzzyArg(contains="async"), # Must contain "async"
},
),
]
Saving Benchmark Results¶
suite.save_results(results, "benchmarks/results_2026_04.json")
# Compare with previous run
suite.compare("benchmarks/results_2026_03.json", "benchmarks/results_2026_04.json")
Using Benchmark Results to Choose Models¶
After benchmarking, choose your deployment model based on your priorities:
| Priority | Metric to optimise | Recommendation |
|---|---|---|
| Lowest hallucination rate | hallucination_rate |
Choose lowest |
| Best error recovery | error_recovery_rate |
Choose highest |
| Most consistent | consistency |
Choose highest |
| Balanced | Weighted score | See suite.rank_models(results, weights=...) |
rankings = suite.rank_models(
results,
weights={
"accuracy": 0.4,
"hallucination": -0.3, # Negative — lower is better
"error_recovery": 0.2,
"consistency": 0.1,
}
)
Running Benchmarks in CI¶
Add benchmarking to your CI pipeline to catch model regressions:
# .github/workflows/benchmark.yml
name: Model Benchmark
on:
schedule:
- cron: '0 9 * * 1' # Weekly on Monday
workflow_dispatch:
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- run: pip install awesome-agentguard[benchmark]
- run: python benchmarks/run_benchmark.py --output results.json
- run: python benchmarks/check_regression.py --baseline benchmarks/baseline.json --current results.json --threshold 0.05
The regression check fails if any metric degrades by more than threshold (5% in this example).