Benchmarking (64 Scenarios)

Compare LLMs on tool-calling accuracy with 64 built-in test scenarios.

Overview

agentguard's benchmarking suite measures how accurately different LLMs handle tool calling. Compare models head-to-head on 64 built-in test scenarios covering edge cases, error handling, multi-step tool use, and argument parsing.

bash
# Quick benchmark from CLI
agentguard benchmark --models gpt-4o,claude-sonnet-4-20250514,llama-3.1-70b
agentguard benchmark --models gpt-4o --scenarios all --output results.json

Programmatic Usage

python
from agentguard.benchmark import Benchmark, ModelConfig

bench = Benchmark(
    models=[
        ModelConfig("gpt-4o", provider="openai"),
        ModelConfig("claude-sonnet-4-20250514", provider="anthropic"),
        ModelConfig("llama-3.1-70b", provider="groq"),
    ],
    scenarios="all",      # or list of scenario names
    runs_per_scenario=3,  # Average over 3 runs
)

results = bench.run()
results.print_table()
results.save("benchmark-results.json")
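With `runs_per_scenario=3`, each scenario's pass/fail outcomes are averaged before being rolled into the headline accuracy. A minimal sketch of that aggregation (illustrative only, not agentguard's actual implementation):

```python
# Sketch of per-model accuracy aggregation when runs_per_scenario > 1.
# Hypothetical helper -- not agentguard's real code.
def aggregate_accuracy(run_results: dict[str, list[bool]]) -> float:
    """run_results maps scenario name -> pass/fail outcome of each run."""
    per_scenario = [sum(runs) / len(runs) for runs in run_results.values()]
    return sum(per_scenario) / len(per_scenario)

runs = {
    "single_tool": [True, True, True],      # 3 runs, all passed
    "nested_objects": [True, False, True],  # 2 of 3 passed
}
print(f"{aggregate_accuracy(runs):.1%}")  # 83.3%
```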

Sample Output

text
┌──────────────────────────┬────────┬──────────┬──────────┐
│ Model                    │ Acc.   │ Avg. ms  │ Cost/run │
├──────────────────────────┼────────┼──────────┼──────────┤
│ gpt-4o                   │ 94.4%  │ 1,234    │ $0.042   │
│ claude-sonnet-4-20250514 │ 92.2%  │ 1,567    │ $0.038   │
│ llama-3.1-70b (groq)     │ 87.8%  │ 456      │ $0.003   │
└──────────────────────────┴────────┴──────────┴──────────┘

Built-in Scenarios

| Category | Scenarios | Tests |
|---|---|---|
| Basic Calling | Single tool, multi-tool, optional params | Correct arg parsing and function selection |
| Type Handling | Nested objects, arrays, enums, unions | Complex type serialization accuracy |
| Error Recovery | Invalid args, missing params, wrong types | Graceful error handling and retry behavior |
| Multi-Step | Sequential tools, parallel calls, conditionals | Multi-turn tool orchestration accuracy |
| Edge Cases | Empty results, large payloads, unicode, timeouts | Robustness under unusual conditions |

Custom Scenarios

python
from agentguard.benchmark import Scenario, ToolDef

my_scenario = Scenario(
    name="custom_search",
    description="Test search with filters",
    tools=[
        ToolDef(
            name="search",
            params={"query": "str", "filters": "dict"},
            returns="list",
        )
    ],
    prompt="Search for Python tutorials from 2024",
    expected_calls=[
        {"name": "search", "args": {"query": "Python tutorials", "filters": {"year": 2024}}}
    ],
)

bench = Benchmark(models=[...], scenarios=[my_scenario])

💡 Install the benchmark extra

The benchmarking suite requires: pip install awesome-agentguard[benchmark]

64 Built-in Scenarios

The benchmark suite includes 64 scenarios across 6 categories, each stressing a different aspect of tool-calling accuracy:

| Category | Scenarios | What it tests |
|---|---|---|
| basic | 13 | Simple single-tool calls with correct parameters |
| hallucination_resistance | 13 | Model should NOT call tools when the answer is already known |
| parameter_extraction | 12 | Extracting correct parameter values from natural language |
| tool_selection | 10 | Choosing the right tool from multiple options |
| multi_tool | 10 | Calling multiple tools in the correct order |
| error_handling | 6 | Handling ambiguous, missing, or conflicting information |

Tool Types

Scenarios use 6 tool definitions: get_weather, search_web, calculate, get_time, convert_currency, and get_directions.
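Built-in tools are defined in the standard OpenAI function-calling schema, the same format used in the custom-scenario example later on this page. A hypothetical rendering of convert_currency (the parameter names here are an assumption, not the library's exact definition):

```python
# Hypothetical JSON-schema definition for the built-in convert_currency tool.
# Parameter names are illustrative assumptions.
convert_currency = {
    "type": "function",
    "function": {
        "name": "convert_currency",
        "description": "Convert an amount between two currencies",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "from_currency": {"type": "string"},
                "to_currency": {"type": "string"},
            },
            "required": ["amount", "from_currency", "to_currency"],
        },
    },
}
```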

Running Benchmarks

python
from agentguard.benchmark import BenchmarkRunner, BuiltinScenarios

runner = BenchmarkRunner(temperature=0.0, verbose=True)
runner.add_scenarios(BuiltinScenarios.BASIC_TOOL_CALLING)
runner.add_scenarios(BuiltinScenarios.MULTI_TOOL)
runner.add_scenarios(BuiltinScenarios.HALLUCINATION_RESISTANCE)
runner.add_scenarios(BuiltinScenarios.PARAMETER_EXTRACTION)
runner.add_scenarios(BuiltinScenarios.TOOL_SELECTION)
runner.add_scenarios(BuiltinScenarios.ERROR_HANDLING)

results = runner.run(
    model="openai/gpt-4o-mini",
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-...",
)

print(results.summary())

GPT-4o-mini Results

text
Model: openai/gpt-4o-mini
Scenarios: 64 total, 61 passed, 3 failed
Tool call accuracy: 95.3%
Parameter accuracy: 92.1%
Hallucination rate: 7.7%
Avg latency: 842 ms
Total tokens: 48,291
By category:
  basic: 13/13 (100.0%)
  error_handling: 5/6 (83.3%)
  hallucination_resistance: 12/13 (92.3%)
  multi_tool: 9/10 (90.0%)
  parameter_extraction: 12/12 (100.0%)
  tool_selection: 10/10 (100.0%)
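The headline numbers follow directly from the per-category counts; a quick arithmetic check:

```python
# Recompute the summary's aggregates from the per-category results above.
by_category = {
    "basic": (13, 13),  # (passed, total)
    "error_handling": (5, 6),
    "hallucination_resistance": (12, 13),
    "multi_tool": (9, 10),
    "parameter_extraction": (12, 12),
    "tool_selection": (10, 10),
}
passed = sum(p for p, _ in by_category.values())
total = sum(t for _, t in by_category.values())
print(f"Tool call accuracy: {passed}/{total} = {passed / total:.1%}")  # 61/64 = 95.3%
# Hallucination rate: failed hallucination_resistance scenarios out of 13
print(f"Hallucination rate: {(13 - 12) / 13:.1%}")  # 7.7%
```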

Writing Custom Scenarios

python
from agentguard.benchmark.scenarios import BenchmarkScenario

custom = BenchmarkScenario(
    name="custom_weather_check",
    description="Model should call get_weather for London",
    category="custom",
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    messages=[
        {"role": "user", "content": "What's the weather like in London?"},
    ],
    expected_tool_calls=[
        {"name": "get_weather", "arguments": {"city": "London"}},
    ],
)

runner.add_scenario(custom)
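Scoring presumably compares the tool calls the model actually emitted against `expected_tool_calls`. A minimal sketch of such a matcher (agentguard's real grader may normalize types or score parameters partially; this is not its actual code):

```python
import json

def calls_match(expected: list[dict], actual: list[dict]) -> bool:
    """Order-sensitive comparison of tool-call names and arguments.
    Illustrative only -- not agentguard's actual grading logic."""
    if len(expected) != len(actual):
        return False
    return all(
        e["name"] == a["name"] and e["arguments"] == a["arguments"]
        for e, a in zip(expected, actual)
    )

expected = [{"name": "get_weather", "arguments": {"city": "London"}}]
# Most provider APIs return arguments as a JSON string, so parse first
raw = '{"city": "London"}'
actual = [{"name": "get_weather", "arguments": json.loads(raw)}]
print(calls_match(expected, actual))  # True
```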

Comparing Models

python
results_gpt4o = runner.run(model="openai/gpt-4o-mini", ...)
results_claude = runner.run(model="anthropic/claude-3.5-haiku", ...)

comparison = runner.compare([results_gpt4o, results_claude])
print(comparison.summary())
print(f"Winner: {comparison.winner()}")
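A plausible implementation of winner() is simply the model with the highest tool-call accuracy; a sketch with a stand-in result type (Result and its accuracy field are assumptions, not agentguard's API):

```python
from dataclasses import dataclass

@dataclass
class Result:
    """Stand-in for agentguard's per-model result object."""
    model: str
    accuracy: float

def pick_winner(results: list[Result]) -> str:
    # Highest accuracy wins; ties resolve to the earliest entry.
    return max(results, key=lambda r: r.accuracy).model

print(pick_winner([
    Result("openai/gpt-4o-mini", 0.953),
    Result("anthropic/claude-3.5-haiku", 0.931),
]))  # openai/gpt-4o-mini
```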