Runtime budget control and tool-call reliability for AI agents

Hard spend caps · Real LLM cost tracking · Tool validation · Response verification · Shared multi-agent budgets · Tracing and tests

pip install awesome-agentguard

Quick Start

Start with the two things teams actually need in production: keep runs inside budget and make tool calls trustworthy.

python
import os

import requests
from openai import OpenAI

from agentguard import TokenBudget, guard
from agentguard.integrations import guard_openai_client

# 1. Put a hard cap on model spend
budget = TokenBudget(max_cost_per_session=5.00, max_calls_per_session=100)
client = guard_openai_client(
    OpenAI(api_key=os.getenv("OPENAI_API_KEY")),
    budget=budget,
)

# 2. Guard the tools your agent depends on
@guard(validate_input=True, verify_response=True, max_retries=2)
def search_web(query: str) -> dict:
    return requests.get(f"https://api.search.com?q={query}").json()

Zero config works too

Use @guard with no arguments for basic wrapping, then layer in budgets and response profiles as your agent gets more expensive or more critical.
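For context, here is a minimal sketch of what a guard decorator of this kind typically does under the hood. `simple_guard` is a hypothetical stand-in written against the stdlib, not agentguard's implementation: it type-checks arguments from Python type hints and retries on failure, the two behaviors the Quick Start enables.

```python
import functools
import inspect

def simple_guard(func=None, *, max_retries=0, validate_input=False):
    """Toy guard decorator: type-check args from hints, retry on failure."""
    def decorate(fn):
        hints = fn.__annotations__
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if validate_input:
                # Reject wrong parameter types before the tool ever runs
                bound = sig.bind(*args, **kwargs)
                for name, value in bound.arguments.items():
                    expected = hints.get(name)
                    if isinstance(expected, type) and not isinstance(value, expected):
                        raise TypeError(f"{name} should be {expected.__name__}")
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise
        return wrapper

    # Support both @simple_guard and @simple_guard(...)
    return decorate if func is None else decorate(func)

@simple_guard(validate_input=True, max_retries=2)
def add(a: int, b: int) -> int:
    return a + b
```

The real library layers budgets, verification, and tracing on top of this same wrapping pattern.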

Installation

Install agentguard from PyPI:

bash
pip install awesome-agentguard

Optional Dependencies

bash
pip install awesome-agentguard[all]        # OpenAI + Anthropic + LangChain integrations
pip install awesome-agentguard[costs]      # LiteLLM-backed real LLM cost tracking
pip install awesome-agentguard[rich]       # Rich terminal output

Requirements: Python 3.10+ · The only core dependency is pydantic>=2.0.

Want the full technical documentation, API reference, and deeper guides? Visit the detailed docs at rigvedrs.github.io/agentguard/.

Features

Budget Enforcement
Cap spend per call, per session, or across multiple agents. Block or warn before a loop turns into a bill.
Real LLM Cost Tracking
Wrap OpenAI, Anthropic, and compatible clients to read provider usage, resolve pricing, and record spend automatically.
Circuit Breakers
CLOSED → OPEN → HALF_OPEN state machine. Prevent cascading failures when tools go down.
Rate Limiting
Token bucket algorithm. Per-second, per-minute, or per-hour limits with burst allowance.
Verification Engine
Bayesian multi-signal fusion with calibrated likelihood ratios, SPC baselines, and adaptive thresholds for tool-result verification.
Retry with Backoff
Exponential backoff with jitter. Configurable max retries, delays, backoff factor.
Trace Recording
SQLite-backed trace storage with JSONL import/export, local dashboard inspection, and generated tests from real runs.
64-Scenario Benchmarks
Compare LLM models on tool-calling accuracy across 6 categories. GPT-4o-mini: 95.3% accuracy.
Input/Output Validation
Automatic type validation from Python type hints. Catch wrong parameters before execution and reject malformed outputs early.
Multi-Agent State
SharedBudget and SharedCircuitBreaker across agents. Thread-safe, registry-based state sharing.
Middleware Pipeline
Composable middleware chain around every tool call. Auth, logging, timing — sync or async.
Policy-as-Code
Define guard rules in TOML/YAML files. Version-controlled safety policies across teams.
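Several of the features above are classic reliability patterns. The circuit breaker's CLOSED → OPEN → HALF_OPEN state machine, for instance, can be illustrated in a few lines; this is a generic sketch, not agentguard's internals, and the parameter names are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN breaker (illustrative only)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one probe call through
            else:
                raise RuntimeError("circuit open: call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result
```

Rejecting calls while OPEN is what prevents a failing tool from being hammered by every agent in a loop.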
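The token bucket behind rate limiting is similarly compact. This sketch is generic (the `rate`/`capacity` names are assumptions): tokens refill continuously at `rate` per second, and `capacity` sets the burst allowance.

```python
import time

class TokenBucket:
    """Illustrative token bucket: `rate` tokens/sec refill, `capacity` burst."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Per-minute or per-hour limits are the same mechanism with a scaled refill rate.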

Why This Shape

agentguard is pitched around the two problems teams feel first in production: runaway spend and untrustworthy tool execution. The verification engine tackles the second problem, and it draws on a substantial base of established techniques.

  • Latency-as-proof. What it is: use runtime as a sanity check (a real network/database call can't consistently finish in ~0–2 ms). Why: it catches "tool-result hallucinations" early, when a tool claims it ran but the timing makes that essentially impossible.
  • Log-odds Bayesian fusion. What it is: convert each weak signal (latency, schema validity, past accuracy, etc.) into a log-likelihood ratio, then add them up. Why: combining evidence becomes stable and interpretable, since each signal contributes a clear "push" toward trust or distrust.
  • Western Electric SPC rules. What it is: classic Statistical Process Control heuristics (a small set of rules) for spotting when a metric's behavior shifts. Why: they detect regressions like "this tool suddenly got flaky" without needing a perfect model of the tool.
  • Cross-session consistency. What it is: compare today's tool outputs to historical outputs for the same tool + args pattern. Why: it flags surprising deviations (often a bug, stale data, or fabrication) when the "same question" starts returning incompatible answers.
  • Adaptive thresholds. What it is: update decision thresholds over time with an Exponential Moving Average (EMA) of real feedback. Why: this reduces false alarms and improves detection as your environment changes (new infra, new data, new models).
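To make the fusion idea concrete, here is a self-contained sketch. The function name and the likelihood-ratio values are illustrative assumptions, not agentguard's calibrated implementation: each ratio is P(signal | genuine) / P(signal | fabricated), and adding their logs to the prior log-odds yields a combined trust probability.

```python
import math

def fuse_log_odds(prior_p, likelihood_ratios):
    """Combine independent evidence signals in log-odds space."""
    log_odds = math.log(prior_p / (1 - prior_p))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)       # each signal adds its own "push"
    return 1 / (1 + math.exp(-log_odds))  # sigmoid: back to a probability

# Hypothetical signals for one tool call:
#  - latency-as-proof: a "network" call returning in 1 ms is suspicious (LR < 1)
#  - schema validity: output matched the expected schema (LR > 1)
p = fuse_log_odds(prior_p=0.9, likelihood_ratios=[0.05, 3.0])
```

Even with a trusting 0.9 prior, the implausible latency drags the combined probability down to roughly 0.57, which is the stability-and-interpretability property the bullet describes.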
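The adaptive-threshold bullet reduces to a one-line EMA update. This sketch is generic; the class name, `alpha`, and the feedback semantics are assumptions for illustration:

```python
class AdaptiveThreshold:
    """Decision threshold updated by an Exponential Moving Average."""

    def __init__(self, initial=0.5, alpha=0.1):
        self.value = initial
        self.alpha = alpha  # how fast feedback moves the threshold

    def update(self, feedback_score):
        # new = (1 - alpha) * old + alpha * observed:
        # recent feedback nudges the threshold, old history decays smoothly
        self.value = (1 - self.alpha) * self.value + self.alpha * feedback_score
        return self.value
```

A small `alpha` keeps the threshold stable against noise; a larger one adapts faster when the environment genuinely shifts.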

Framework Support

Works with every major AI framework and provider out of the box, while keeping one consistent runtime layer for both budget control and tool-call safety.

⚡ OpenAI 🔮 Anthropic 🔀 OpenRouter 🚀 Groq 🤝 Together AI 🔥 Fireworks AI 🦜 LangChain 👥 CrewAI 🤖 AutoGen 🔌 MCP

agentguard also supports real response-based LLM cost tracking for OpenAI, Anthropic, and OpenAI-compatible providers via optional LiteLLM-backed pricing.
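The arithmetic behind cost tracking is simple once usage and pricing are known; the hard part (which LiteLLM handles) is keeping the price table current. A sketch with an illustrative, not authoritative, price table:

```python
# Illustrative per-1M-token prices in USD; real tracking resolves
# current pricing via LiteLLM rather than a hardcoded table.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def cost_from_usage(model, prompt_tokens, completion_tokens):
    """Turn provider-reported token usage into a dollar figure."""
    p = PRICES[model]
    return (prompt_tokens * p["input"]
            + completion_tokens * p["output"]) / 1_000_000
```

Provider responses report `prompt_tokens` and `completion_tokens` in their usage field, so spend can be recorded per call and accumulated against a TokenBudget-style cap.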
