Benchmarking (64 Scenarios)

Compare LLMs on tool-calling accuracy with 64 built-in test scenarios.

Overview

agentguard's benchmarking suite measures how accurately different LLMs handle tool calling. Compare models head-to-head on 64 built-in test scenarios covering edge cases, error handling, multi-step tool use, and argument parsing.

bash
# Quick benchmark from CLI
agentguard benchmark --models gpt-4o,claude-sonnet-4-20250514,llama-3.1-70b
agentguard benchmark --models gpt-4o --scenarios all --output results.json

Programmatic Usage

python
from agentguard.benchmark import Benchmark, ModelConfig

bench = Benchmark(
    models=[
        ModelConfig("gpt-4o", provider="openai"),
        ModelConfig("claude-sonnet-4-20250514", provider="anthropic"),
        ModelConfig("llama-3.1-70b", provider="groq"),
    ],
    scenarios="all",      # or list of scenario names
    runs_per_scenario=3,  # Average over 3 runs
)

results = bench.run()
results.print_table()
results.save("benchmark-results.json")
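With `runs_per_scenario=3`, each scenario's pass/fail outcomes are averaged before being rolled into the headline accuracy. A minimal sketch of that aggregation (illustrative only, not agentguard's actual implementation):

```python
# Sketch of per-model accuracy aggregation when runs_per_scenario > 1.
# Hypothetical helper -- not agentguard's real code.
def aggregate_accuracy(run_results: dict[str, list[bool]]) -> float:
    """run_results maps scenario name -> pass/fail outcome of each run."""
    per_scenario = [sum(runs) / len(runs) for runs in run_results.values()]
    return sum(per_scenario) / len(per_scenario)

runs = {
    "single_tool": [True, True, True],      # 3 runs, all passed
    "nested_objects": [True, False, True],  # 2 of 3 passed
}
print(f"{aggregate_accuracy(runs):.1%}")  # 83.3%
```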

Sample Output

text
┌──────────────────────────┬────────┬──────────┬──────────┐
│ Model                    │ Acc.   │ Avg. ms  │ Cost/run │
├──────────────────────────┼────────┼──────────┼──────────┤
│ gpt-4o                   │ 94.4%  │ 1,234    │ $0.042   │
│ claude-sonnet-4-20250514 │ 92.2%  │ 1,567    │ $0.038   │
│ llama-3.1-70b (groq)     │ 87.8%  │ 456      │ $0.003   │
└──────────────────────────┴────────┴──────────┴──────────┘

Built-in Scenarios

| Category | Scenarios | Tests |
|---|---|---|
| Basic Calling | Single tool, multi-tool, optional params | Correct arg parsing and function selection |
| Type Handling | Nested objects, arrays, enums, unions | Complex type serialization accuracy |
| Error Recovery | Invalid args, missing params, wrong types | Graceful error handling and retry behavior |
| Multi-Step | Sequential tools, parallel calls, conditionals | Multi-turn tool orchestration accuracy |
| Edge Cases | Empty results, large payloads, unicode, timeouts | Robustness under unusual conditions |

Custom Scenarios

python
from agentguard.benchmark import Scenario, ToolDef

my_scenario = Scenario(
    name="custom_search",
    description="Test search with filters",
    tools=[
        ToolDef(
            name="search",
            params={"query": "str", "filters": "dict"},
            returns="list",
        )
    ],
    prompt="Search for Python tutorials from 2024",
    expected_calls=[
        {"name": "search", "args": {"query": "Python tutorials", "filters": {"year": 2024}}}
    ],
)

bench = Benchmark(models=[...], scenarios=[my_scenario])

💡 Install the benchmark extra

The benchmarking suite requires: pip install awesome-agentguard[benchmark]

64 Built-in Scenarios

The benchmark suite includes 64 scenarios across 6 categories, each stressing a different aspect of tool-calling accuracy:

| Category | Scenarios | What it tests |
|---|---|---|
| basic | 13 | Simple single-tool calls with correct parameters |
| hallucination_resistance | 13 | Model should NOT call tools when the answer is already known |
| parameter_extraction | 12 | Extracting correct parameter values from natural language |
| tool_selection | 10 | Choosing the right tool from multiple options |
| multi_tool | 10 | Calling multiple tools in the correct order |
| error_handling | 6 | Handling ambiguous, missing, or conflicting information |

Tool Types

Scenarios use 6 tool definitions: get_weather, search_web, calculate, get_time, convert_currency, and get_directions.
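Built-in tools are defined in the standard OpenAI function-calling schema, the same format used in the custom-scenario example later on this page. A hypothetical rendering of convert_currency (the parameter names here are an assumption, not the library's exact definition):

```python
# Hypothetical JSON-schema definition for the built-in convert_currency tool.
# Parameter names are illustrative assumptions.
convert_currency = {
    "type": "function",
    "function": {
        "name": "convert_currency",
        "description": "Convert an amount between two currencies",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "from_currency": {"type": "string"},
                "to_currency": {"type": "string"},
            },
            "required": ["amount", "from_currency", "to_currency"],
        },
    },
}
```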

Running Benchmarks

python
from agentguard.benchmark import BenchmarkRunner, BuiltinScenarios

runner = BenchmarkRunner(temperature=0.0, verbose=True)
runner.add_scenarios(BuiltinScenarios.BASIC_TOOL_CALLING)
runner.add_scenarios(BuiltinScenarios.MULTI_TOOL)
runner.add_scenarios(BuiltinScenarios.HALLUCINATION_RESISTANCE)
runner.add_scenarios(BuiltinScenarios.PARAMETER_EXTRACTION)
runner.add_scenarios(BuiltinScenarios.TOOL_SELECTION)
runner.add_scenarios(BuiltinScenarios.ERROR_HANDLING)

results = runner.run(
    model="openai/gpt-4o-mini",
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-...",
)

print(results.summary())

GPT-4o-mini Results

text
Model: openai/gpt-4o-mini
Scenarios: 64 total, 61 passed, 3 failed
Tool call accuracy: 95.3%
Parameter accuracy: 92.1%
Hallucination rate: 7.7%
Avg latency: 842 ms
Total tokens: 48,291
By category:
  basic: 13/13 (100.0%)
  error_handling: 5/6 (83.3%)
  hallucination_resistance: 12/13 (92.3%)
  multi_tool: 9/10 (90.0%)
  parameter_extraction: 12/12 (100.0%)
  tool_selection: 10/10 (100.0%)
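The headline numbers follow directly from the per-category counts; a quick arithmetic check:

```python
# Recompute the summary's aggregates from the per-category results above.
by_category = {
    "basic": (13, 13),  # (passed, total)
    "error_handling": (5, 6),
    "hallucination_resistance": (12, 13),
    "multi_tool": (9, 10),
    "parameter_extraction": (12, 12),
    "tool_selection": (10, 10),
}
passed = sum(p for p, _ in by_category.values())
total = sum(t for _, t in by_category.values())
print(f"Tool call accuracy: {passed}/{total} = {passed / total:.1%}")  # 61/64 = 95.3%
# Hallucination rate: failed hallucination_resistance scenarios out of 13
print(f"Hallucination rate: {(13 - 12) / 13:.1%}")  # 7.7%
```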

Writing Custom Scenarios

python
from agentguard.benchmark.scenarios import BenchmarkScenario

custom = BenchmarkScenario(
    name="custom_weather_check",
    description="Model should call get_weather for London",
    category="custom",
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    messages=[
        {"role": "user", "content": "What's the weather like in London?"},
    ],
    expected_tool_calls=[
        {"name": "get_weather", "arguments": {"city": "London"}},
    ],
)

runner.add_scenario(custom)
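Scoring presumably compares the tool calls the model actually emitted against `expected_tool_calls`. A minimal sketch of such a matcher (agentguard's real grader may normalize types or score parameters partially; this is not its actual code):

```python
import json

def calls_match(expected: list[dict], actual: list[dict]) -> bool:
    """Order-sensitive comparison of tool-call names and arguments.
    Illustrative only -- not agentguard's actual grading logic."""
    if len(expected) != len(actual):
        return False
    return all(
        e["name"] == a["name"] and e["arguments"] == a["arguments"]
        for e, a in zip(expected, actual)
    )

expected = [{"name": "get_weather", "arguments": {"city": "London"}}]
# Most provider APIs return arguments as a JSON string, so parse first
raw = '{"city": "London"}'
actual = [{"name": "get_weather", "arguments": json.loads(raw)}]
print(calls_match(expected, actual))  # True
```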

Comparing Models

python
results_gpt4o = runner.run(model="openai/gpt-4o-mini", ...)
results_claude = runner.run(model="anthropic/claude-3.5-haiku", ...)

comparison = runner.compare([results_gpt4o, results_claude])
print(comparison.summary())
print(f"Winner: {comparison.winner()}")
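A plausible implementation of winner() is simply the model with the highest tool-call accuracy; a sketch with a stand-in result type (Result and its accuracy field are assumptions, not agentguard's API):

```python
from dataclasses import dataclass

@dataclass
class Result:
    """Stand-in for agentguard's per-model result object."""
    model: str
    accuracy: float

def pick_winner(results: list[Result]) -> str:
    # Highest accuracy wins; ties resolve to the earliest entry.
    return max(results, key=lambda r: r.accuracy).model

print(pick_winner([
    Result("openai/gpt-4o-mini", 0.953),
    Result("anthropic/claude-3.5-haiku", 0.931),
]))  # openai/gpt-4o-mini
```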