Testing¶
Overview¶
Testing AI agent tool calls is hard. The inputs are non-deterministic, external services are unreliable, and manually writing assertions for every possible response is tedious.
agentguard provides three testing utilities:
assert_tool_call— fluent assertion builder for trace entriesTestGenerator— auto-generate pytest tests from production tracesTraceReplayer— replay recorded traces to detect regressions
assert_tool_call — Fluent Assertions¶
assert_tool_call provides a readable, chainable assertion API for inspecting TraceEntry objects:
from agentguard import assert_tool_call
from agentguard import record_session
with record_session("./traces", backend="sqlite") as recorder:
result = get_weather("San Francisco")
entry = recorder.entries[-1]
assert_tool_call(entry) \
.succeeded() \
.within_ms(5000) \
.returned_dict() \
.has_keys("temperature", "humidity") \
.no_retries()
Available assertions¶
| Method | Description |
|---|---|
.succeeded() |
Asserts status == SUCCESS |
.failed() |
Asserts status is a failure state |
.within_ms(n) |
Asserts execution_time_ms <= n |
.returned_dict() |
Asserts return value is a dict |
.returned_list() |
Asserts return value is a list |
.returned_str() |
Asserts return value is a str |
.has_keys(*keys) |
Asserts all keys present in the return dict |
.no_retries() |
Asserts retry_count == 0 |
.retried(n) |
Asserts retry_count == n |
.not_hallucinated() |
Asserts hallucination detection passed |
.status(s) |
Asserts specific ToolCallStatus |
Full example¶
import pytest
from agentguard import guard, assert_tool_call, record_session
from agentguard.core.types import TraceEntry
@guard(validate_input=True, max_retries=2, record=True, trace_backend="sqlite", trace_dir="./test_traces")
def get_weather(city: str) -> dict:
return {"city": city, "temperature": 72, "humidity": 60}
def test_get_weather_response_structure():
with record_session("./test_traces", backend="sqlite") as recorder:
get_weather("Tokyo")
entry: TraceEntry = recorder.entries[-1]
assert_tool_call(entry) \
.succeeded() \
.within_ms(1000) \
.returned_dict() \
.has_keys("city", "temperature", "humidity") \
.no_retries()
TestGenerator — Generate Tests from Traces¶
After running your agent in production (or staging), generate a regression test suite automatically:
from agentguard.testing import TestGenerator
generator = TestGenerator(traces_dir="./production_traces")
generator.generate_tests(output="tests/test_generated.py")
What gets generated¶
"""Auto-generated test suite from agentguard production traces.
Generated: 2026-04-08T14:30:22Z
Source: ./production_traces
DO NOT EDIT — regenerate with: agentguard generate ./production_traces
"""
def test_get_weather_0():
"""Recorded: get_weather('San Francisco') → success in 312.1ms"""
result = get_weather("San Francisco")
assert isinstance(result, dict)
assert "temperature" in result
assert "humidity" in result
def test_search_web_0():
"""Recorded: search_web('Python tutorials') → success in 234.1ms"""
result = search_web("Python tutorials")
assert isinstance(result, str)
assert len(result) > 0
def test_query_database_0():
"""Recorded: query_database('SELECT * FROM users LIMIT 5') → success in 12.1ms"""
result = query_database("SELECT * FROM users LIMIT 5")
assert isinstance(result, list)
CLI equivalent¶
Customising generated tests¶
generator = TestGenerator(
traces_dir="./traces",
include_tools=["get_weather", "search_web"], # Only these tools
exclude_tools=["debug_tool"], # Exclude these
max_tests_per_tool=10, # Cap per tool
)
generator.generate_tests("tests/test_generated.py")
TraceReplayer — Regression Detection¶
TraceReplayer re-runs recorded calls against your current implementation and compares results. This catches regressions after refactoring:
from agentguard.testing import TraceReplayer
replayer = TraceReplayer(traces_dir="./traces")
results = replayer.replay_all(
tools={
"get_weather": get_weather,
"search_web": search_web,
}
)
for r in results:
status = "PASS" if r["match"] else "FAIL"
print(f"[{status}] {r['tool_name']}({r['call_kwargs']})")
if not r["match"]:
print(f" Expected type: {r['expected_type']}")
print(f" Got type: {r['actual_type']}")
What counts as a "match"¶
By default, the replayer checks:
- Both old and new calls succeed (or both fail)
- Return type is the same
- Dict keys are the same (for dict responses)
The replayer doesn't require exact value equality, since real API responses naturally vary. Override the comparison:
def my_comparator(expected, actual) -> bool:
if isinstance(expected, dict) and isinstance(actual, dict):
return set(expected.keys()) == set(actual.keys())
return type(expected) == type(actual)
results = replayer.replay_all(
tools={"get_weather": get_weather},
comparator=my_comparator,
)
Replaying a single session¶
results = replayer.replay_session(
session_id="user_abc_session_42",
tools={"get_weather": get_weather},
)
Writing Tests for Guarded Tools¶
Testing with mocked tools¶
from unittest.mock import patch
import pytest
from agentguard import guard, GuardConfig
config = GuardConfig(validate_input=True, max_retries=2)
@guard(config=config)
def get_weather(city: str) -> dict:
import requests
return requests.get(f"https://wttr.in/{city}?format=j1").json()
def test_get_weather_success():
with patch("requests.get") as mock_get:
mock_get.return_value.json.return_value = {
"temperature": 72, "humidity": 60, "conditions": "sunny"
}
result = get_weather("London")
assert result["temperature"] == 72
assert result["humidity"] == 60
def test_get_weather_retry():
"""Verify retries are attempted on failure."""
call_count = 0
def flaky_request(*args, **kwargs):
nonlocal call_count
call_count += 1
if call_count < 3:
raise ConnectionError("Network error")
mock = patch("requests.Response")()
mock.json.return_value = {"temperature": 72}
return mock
with patch("requests.get", side_effect=flaky_request):
result = get_weather("London")
assert call_count == 3
assert result["temperature"] == 72
Testing validation¶
import pytest
from agentguard.core.types import ValidationError
@guard(validate_input=True)
def add(a: int, b: int) -> int:
return a + b
def test_valid_input():
assert add(a=2, b=3) == 5
def test_invalid_input():
with pytest.raises(Exception):
add(a="two", b=3) # Should fail validation
Testing circuit breaker¶
from agentguard import guard
from agentguard.core.types import CircuitBreakerConfig, CircuitOpenError
@guard(circuit_breaker=CircuitBreakerConfig(
failure_threshold=2,
recovery_timeout=9999, # Never recover during tests
))
def unreliable_api(x: int) -> int:
if x < 0:
raise ValueError("Negative input")
return x * 2
def test_circuit_opens_after_threshold():
for _ in range(2):
try:
unreliable_api(-1)
except ValueError:
pass
with pytest.raises(CircuitOpenError):
unreliable_api(1) # Circuit is now open
Integration with pytest¶
Using record_session in fixtures¶
# conftest.py
import pytest
from agentguard import record_session
@pytest.fixture
def trace_recorder(tmp_path):
"""Record all tool calls during a test."""
with record_session(str(tmp_path), backend="sqlite") as recorder:
yield recorder
# In tests:
def test_agent_flow(trace_recorder):
run_agent("Summarise today's weather")
# Inspect what the agent called
tool_names = [e.tool_name for e in trace_recorder.entries]
assert "get_weather" in tool_names
assert "summarise_text" in tool_names
Parameterised tests from traces¶
import pytest
from agentguard.core.trace import TraceStore
store = TraceStore(directory="./test_fixtures", backend="sqlite")
all_entries = store.read_all()
@pytest.mark.parametrize("entry", all_entries, ids=[e.call_id[:8] for e in all_entries])
def test_replay_recorded_call(entry):
"""Replay each recorded call and verify success."""
tool = get_tool_by_name(entry.tool_name)
result = tool(**entry.call.kwargs)
# Basic type check — same return type as recorded
assert type(result) == type(entry.result.return_value)
Best Practices¶
-
Generate tests from real traffic — synthetic test data misses the edge cases your agent discovers in production.
-
Version exported fixtures intentionally — keep JSONL exports only when you need portable replay fixtures; otherwise prefer SQLite for day-to-day recording.
-
Run
TraceReplayerin CI — add trace replay to your CI pipeline to catch regressions before deployment. -
Keep generated tests short-lived — regenerate them regularly from fresh traces. Old generated tests test old behavior; new ones test current behavior.
-
Use
assert_tool_callfor structure, not values — tool responses change; their structure rarely does. Assert that"temperature" in resultrather thanresult["temperature"] == 72.