# Benchmarking (64 Scenarios)
Compare LLMs head-to-head on tool-calling accuracy with 64 built-in test scenarios.
## Overview
agentguard's benchmarking suite measures how accurately different LLMs handle tool calling. Compare models head-to-head on 64 built-in test scenarios that cover edge cases, error handling, multi-step tool use, and argument parsing.
```bash
# Quick benchmark from the CLI
agentguard benchmark --models gpt-4o,claude-sonnet-4-20250514,llama-3.1-70b

# Run all scenarios and save results to JSON
agentguard benchmark --models gpt-4o --scenarios all --output results.json
```
## Programmatic Usage
```python
from agentguard.benchmark import Benchmark, ModelConfig

bench = Benchmark(
    models=[
        ModelConfig("gpt-4o", provider="openai"),
        ModelConfig("claude-sonnet-4-20250514", provider="anthropic"),
        ModelConfig("llama-3.1-70b", provider="groq"),
    ],
    scenarios="all",        # or a list of scenario names
    runs_per_scenario=3,    # average over 3 runs
)

results = bench.run()
results.print_table()
results.save("benchmark-results.json")
```
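The saved JSON can be post-processed without agentguard. Here is a minimal sketch that ranks models by accuracy, assuming a hypothetical results schema (the real file layout may differ, so inspect `benchmark-results.json` first):

```python
import json

# Hypothetical shape of benchmark-results.json -- the real schema may differ.
# In practice: data = json.load(open("benchmark-results.json"))
data = {
    "models": [
        {"name": "gpt-4o", "accuracy": 0.944, "avg_ms": 1234},
        {"name": "claude-sonnet-4-20250514", "accuracy": 0.922, "avg_ms": 1567},
        {"name": "llama-3.1-70b", "accuracy": 0.878, "avg_ms": 456},
    ]
}

# Rank by accuracy (descending), breaking ties on latency (ascending)
ranked = sorted(data["models"], key=lambda m: (-m["accuracy"], m["avg_ms"]))
best = ranked[0]["name"]
```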
## Sample Output
```text
┌──────────────────────────┬────────┬──────────┬──────────┐
│ Model                    │ Acc.   │ Avg. ms  │ Cost/run │
├──────────────────────────┼────────┼──────────┼──────────┤
│ gpt-4o                   │ 94.4%  │ 1,234    │ $0.042   │
│ claude-sonnet-4-20250514 │ 92.2%  │ 1,567    │ $0.038   │
│ llama-3.1-70b (groq)     │ 87.8%  │ 456      │ $0.003   │
└──────────────────────────┴────────┴──────────┴──────────┘
```
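The table invites a cost/accuracy trade-off. Accuracy per dollar, computed from the sample numbers above, shows why a cheaper model can still win on efficiency even with lower raw accuracy:

```python
# Accuracy and cost-per-run copied from the sample table above
rows = {
    "gpt-4o": (0.944, 0.042),
    "claude-sonnet-4-20250514": (0.922, 0.038),
    "llama-3.1-70b": (0.878, 0.003),
}

# Accuracy points per dollar spent per run
efficiency = {model: acc / cost for model, (acc, cost) in rows.items()}
most_efficient = max(efficiency, key=efficiency.get)
```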
## Built-in Scenarios
| Category | Scenarios | Tests |
|---|---|---|
| Basic Calling | Single tool, multi-tool, optional params | Correct arg parsing and function selection |
| Type Handling | Nested objects, arrays, enums, unions | Complex type serialization accuracy |
| Error Recovery | Invalid args, missing params, wrong types | Graceful error handling and retry behavior |
| Multi-Step | Sequential tools, parallel calls, conditionals | Multi-turn tool orchestration accuracy |
| Edge Cases | Empty results, large payloads, unicode, timeouts | Robustness under unusual conditions |
## Custom Scenarios
```python
from agentguard.benchmark import Scenario, ToolDef

my_scenario = Scenario(
    name="custom_search",
    description="Test search with filters",
    tools=[
        ToolDef(
            name="search",
            params={"query": "str", "filters": "dict"},
            returns="list",
        )
    ],
    prompt="Search for Python tutorials from 2024",
    expected_calls=[
        {"name": "search", "args": {"query": "Python tutorials", "filters": {"year": 2024}}}
    ],
)

bench = Benchmark(models=[...], scenarios=[my_scenario])
```
> 💡 **Install the benchmark extra**
> The benchmarking suite requires: `pip install "awesome-agentguard[benchmark]"` (the quotes prevent shells like zsh from treating the brackets as a glob pattern).
## 64 Built-in Scenarios
The benchmark suite includes 64 scenarios across 6 categories, covering the main dimensions of tool-calling accuracy:
| Category | Scenarios | What it tests |
|---|---|---|
| `basic` | 13 | Simple single-tool calls with correct parameters |
| `hallucination_resistance` | 13 | Model should NOT call tools when the answer is already known |
| `parameter_extraction` | 12 | Extracting correct parameter values from natural language |
| `tool_selection` | 10 | Choosing the right tool from multiple options |
| `multi_tool` | 10 | Calling multiple tools in the correct order |
| `error_handling` | 6 | Handling ambiguous, missing, or conflicting information |
## Tool Types

Scenarios use 6 tool definitions: `get_weather`, `search_web`, `calculate`, `get_time`, `convert_currency`, and `get_directions`.
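As an illustration of what these tool definitions look like, here is a sketch of `convert_currency` in OpenAI function-calling format. The parameter names are assumptions, so check the shipped definitions before writing scenarios against them:

```python
# Hypothetical sketch of one built-in tool definition, in OpenAI
# function-calling format. Parameter names are assumptions.
convert_currency = {
    "type": "function",
    "function": {
        "name": "convert_currency",
        "description": "Convert an amount from one currency to another",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "from_currency": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
                "to_currency": {"type": "string", "description": "ISO 4217 code, e.g. EUR"},
            },
            "required": ["amount", "from_currency", "to_currency"],
        },
    },
}
```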
## Running Benchmarks
```python
from agentguard.benchmark import BenchmarkRunner, BuiltinScenarios

runner = BenchmarkRunner(temperature=0.0, verbose=True)
runner.add_scenarios(BuiltinScenarios.BASIC_TOOL_CALLING)
runner.add_scenarios(BuiltinScenarios.MULTI_TOOL_SELECTION)
runner.add_scenarios(BuiltinScenarios.HALLUCINATION_RESISTANCE)
runner.add_scenarios(BuiltinScenarios.PARAMETER_EXTRACTION)
runner.add_scenarios(BuiltinScenarios.TOOL_SELECTION)
runner.add_scenarios(BuiltinScenarios.ERROR_HANDLING)

results = runner.run(
    model="openai/gpt-4o-mini",
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-...",
)
print(results.summary())
```
## GPT-4o-mini Results
```text
Model: openai/gpt-4o-mini
Scenarios: 64 total, 61 passed, 3 failed
Tool call accuracy: 95.3%
Parameter accuracy: 92.1%
Hallucination rate: 7.7%
Avg latency: 842 ms
Total tokens: 48,291

By category:
  basic:                    13/13 (100.0%)
  error_handling:            5/6  (83.3%)
  hallucination_resistance: 12/13 (92.3%)
  multi_tool:                9/10 (90.0%)
  parameter_extraction:     12/12 (100.0%)
  tool_selection:           10/10 (100.0%)
```
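The headline numbers follow directly from the per-category counts; the arithmetic below reproduces them:

```python
# Per-category (passed, total) counts from the report above
by_category = {
    "basic": (13, 13),
    "error_handling": (5, 6),
    "hallucination_resistance": (12, 13),
    "multi_tool": (9, 10),
    "parameter_extraction": (12, 12),
    "tool_selection": (10, 10),
}

passed = sum(p for p, _ in by_category.values())      # 61
total = sum(t for _, t in by_category.values())       # 64
accuracy = round(100 * passed / total, 1)             # 95.3
# One hallucination_resistance scenario failed: 1/13 is roughly 7.7%
hallucination_rate = round(100 * (13 - 12) / 13, 1)   # 7.7
```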
## Writing Custom Scenarios
```python
from agentguard.benchmark.scenarios import BenchmarkScenario

custom = BenchmarkScenario(
    name="custom_weather_check",
    description="Model should call get_weather for London",
    category="custom",
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    messages=[
        {"role": "user", "content": "What's the weather like in London?"},
    ],
    expected_tool_calls=[
        {"name": "get_weather", "arguments": {"city": "London"}},
    ],
)

runner.add_scenario(custom)
```
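A scenario passes when the model's emitted call matches `expected_tool_calls`. agentguard's exact scoring logic isn't shown here, but a plausible matcher might look like the sketch below (providers often return arguments as a JSON string, so normalise before comparing):

```python
import json

def matches_expected(actual: dict, expected: dict) -> bool:
    """Sketch of scenario scoring: exact name match, arguments compared
    after JSON normalisation. The real scorer may be more lenient."""
    if actual["name"] != expected["name"]:
        return False
    args = actual["arguments"]
    if isinstance(args, str):  # e.g. OpenAI returns arguments as a JSON string
        args = json.loads(args)
    return args == expected["arguments"]

# The raw call a model might emit vs. the scenario's expectation
actual_call = {"name": "get_weather", "arguments": '{"city": "London"}'}
expected_call = {"name": "get_weather", "arguments": {"city": "London"}}
```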
## Comparing Models
```python
results_gpt4o = runner.run(model="openai/gpt-4o-mini", ...)
results_claude = runner.run(model="anthropic/claude-3.5-haiku", ...)

comparison = runner.compare([results_gpt4o, results_claude])
print(comparison.summary())
print(f"Winner: {comparison.winner()}")
```