Runtime budget control and tool-call reliability for AI agents

Hard spend caps · Real LLM cost tracking · Tool validation · Response verification · Shared multi-agent budgets · Tracing and tests

pip install awesome-agentguard

Quick Start

Start with the two things teams actually need in production: keep runs inside budget and make tool calls trustworthy.

python
import os

import requests
from openai import OpenAI

from agentguard import TokenBudget, guard
from agentguard.integrations import guard_openai_client

# 1. Put a hard cap on model spend
budget = TokenBudget(max_cost_per_session=5.00, max_calls_per_session=100)
client = guard_openai_client(
    OpenAI(api_key=os.getenv("OPENAI_API_KEY")),
    budget=budget,
)

# 2. Guard the tools your agent depends on
@guard(validate_input=True, verify_response=True, max_retries=2)
def search_web(query: str) -> dict:
    return requests.get(f"https://api.search.com?q={query}").json()

Zero config works too

Use @guard with no arguments for basic wrapping, then layer in budgets and response profiles as your agent gets more expensive or more critical.
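For context, here is a minimal sketch of what a guard decorator of this kind typically does under the hood. `simple_guard` is a hypothetical stand-in written against the stdlib, not agentguard's implementation: it type-checks arguments from Python type hints and retries on failure, the two behaviors the Quick Start enables.

```python
import functools
import inspect

def simple_guard(func=None, *, max_retries=0, validate_input=False):
    """Toy guard decorator: type-check args from hints, retry on failure."""
    def decorate(fn):
        hints = fn.__annotations__
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if validate_input:
                # Reject wrong parameter types before the tool ever runs
                bound = sig.bind(*args, **kwargs)
                for name, value in bound.arguments.items():
                    expected = hints.get(name)
                    if isinstance(expected, type) and not isinstance(value, expected):
                        raise TypeError(f"{name} should be {expected.__name__}")
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise
        return wrapper

    # Support both @simple_guard and @simple_guard(...)
    return decorate if func is None else decorate(func)

@simple_guard(validate_input=True, max_retries=2)
def add(a: int, b: int) -> int:
    return a + b
```

The real library layers budgets, verification, and tracing on top of this same wrapping pattern.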

Installation

Install agentguard from PyPI:

bash
pip install awesome-agentguard

Optional Dependencies

bash
pip install awesome-agentguard[all]        # OpenAI + Anthropic + LangChain integrations
pip install awesome-agentguard[costs]      # LiteLLM-backed real LLM cost tracking
pip install awesome-agentguard[rich]       # Rich terminal output

Requirements: Python 3.10+ · The only core dependency is pydantic>=2.0.

Want the full technical documentation, API reference, and deeper guides? Visit the detailed docs at rigvedrs.github.io/agentguard/.

Features

Budget Enforcement
Cap spend per call, per session, or across multiple agents. Block or warn before a loop turns into a bill.
Real LLM Cost Tracking
Wrap OpenAI, Anthropic, and compatible clients to read provider usage, resolve pricing, and record spend automatically.
Circuit Breakers
CLOSED → OPEN → HALF_OPEN state machine. Prevent cascading failures when tools go down.
Rate Limiting
Token bucket algorithm. Per-second, per-minute, or per-hour limits with burst allowance.
Verification Engine
Bayesian multi-signal fusion with calibrated likelihood ratios, SPC baselines, and adaptive thresholds for tool-result verification.
Retry with Backoff
Exponential backoff with jitter. Configurable max retries, delays, backoff factor.
Trace Recording
SQLite-backed trace storage with JSONL import/export, local dashboard inspection, and generated tests from real runs.
64-Scenario Benchmarks
Compare LLM models on tool-calling accuracy across 6 categories. GPT-4o-mini: 95.3% accuracy.
Input/Output Validation
Automatic type validation from Python type hints. Catch wrong parameters before execution and reject malformed outputs early.
Multi-Agent State
SharedBudget and SharedCircuitBreaker across agents. Thread-safe, registry-based state sharing.
Middleware Pipeline
Composable middleware chain around every tool call. Auth, logging, timing — sync or async.
Policy-as-Code
Define guard rules in TOML/YAML files. Version-controlled safety policies across teams.
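Several of the features above are classic reliability patterns. The circuit breaker's CLOSED → OPEN → HALF_OPEN state machine, for instance, can be illustrated in a few lines; this is a generic sketch, not agentguard's internals, and the parameter names are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN breaker (illustrative only)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one probe call through
            else:
                raise RuntimeError("circuit open: call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result
```

Rejecting calls while OPEN is what prevents a failing tool from being hammered by every agent in a loop.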
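The token bucket behind rate limiting is similarly compact. This sketch is generic (the `rate`/`capacity` names are assumptions): tokens refill continuously at `rate` per second, and `capacity` sets the burst allowance.

```python
import time

class TokenBucket:
    """Illustrative token bucket: `rate` tokens/sec refill, `capacity` burst."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Per-minute or per-hour limits are the same mechanism with a scaled refill rate.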

Why This Shape

agentguard is pitched around the two problems teams feel first in production: runaway spend and untrustworthy tool execution. The verification engine tackles the second problem, and it draws on a substantial base of established techniques.

  • Latency-as-proof. What it is: use runtime as a sanity check (a real network/database call can't consistently finish in ~0–2 ms). Why: it catches "tool-result hallucinations" early, when a tool claims it ran but the timing makes that essentially impossible.
  • Log-odds Bayesian fusion. What it is: convert each weak signal (latency, schema validity, past accuracy, etc.) into a log-likelihood ratio, then add them up. Why: combining evidence becomes stable and interpretable, since each signal contributes a clear "push" toward trust or distrust.
  • Western Electric SPC rules. What it is: classic Statistical Process Control heuristics (a small set of rules) for spotting when a metric's behavior shifts. Why: they detect regressions like "this tool suddenly got flaky" without needing a perfect model of the tool.
  • Cross-session consistency. What it is: compare today's tool outputs to historical outputs for the same tool + args pattern. Why: it flags surprising deviations (often a bug, stale data, or fabrication) when the "same question" starts returning incompatible answers.
  • Adaptive thresholds. What it is: update decision thresholds over time with an Exponential Moving Average (EMA) of real feedback. Why: this reduces false alarms and improves detection as your environment changes (new infra, new data, new models).
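To make the fusion idea concrete, here is a self-contained sketch. The function name and the likelihood-ratio values are illustrative assumptions, not agentguard's calibrated implementation: each ratio is P(signal | genuine) / P(signal | fabricated), and adding their logs to the prior log-odds yields a combined trust probability.

```python
import math

def fuse_log_odds(prior_p, likelihood_ratios):
    """Combine independent evidence signals in log-odds space."""
    log_odds = math.log(prior_p / (1 - prior_p))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)       # each signal adds its own "push"
    return 1 / (1 + math.exp(-log_odds))  # sigmoid: back to a probability

# Hypothetical signals for one tool call:
#  - latency-as-proof: a "network" call returning in 1 ms is suspicious (LR < 1)
#  - schema validity: output matched the expected schema (LR > 1)
p = fuse_log_odds(prior_p=0.9, likelihood_ratios=[0.05, 3.0])
```

Even with a trusting 0.9 prior, the implausible latency drags the combined probability down to roughly 0.57, which is the stability-and-interpretability property the bullet describes.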
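The adaptive-threshold bullet reduces to a one-line EMA update. This sketch is generic; the class name, `alpha`, and the feedback semantics are assumptions for illustration:

```python
class AdaptiveThreshold:
    """Decision threshold updated by an Exponential Moving Average."""

    def __init__(self, initial=0.5, alpha=0.1):
        self.value = initial
        self.alpha = alpha  # how fast feedback moves the threshold

    def update(self, feedback_score):
        # new = (1 - alpha) * old + alpha * observed:
        # recent feedback nudges the threshold, old history decays smoothly
        self.value = (1 - self.alpha) * self.value + self.alpha * feedback_score
        return self.value
```

A small `alpha` keeps the threshold stable against noise; a larger one adapts faster when the environment genuinely shifts.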

Framework Support

Works with every major AI framework and provider out of the box, while keeping one consistent runtime layer for both budget control and tool-call safety.

⚡ OpenAI 🔮 Anthropic 🔀 OpenRouter 🚀 Groq 🤝 Together AI 🔥 Fireworks AI 🦜 LangChain 👥 CrewAI 🤖 AutoGen 🔌 MCP

agentguard also supports real response-based LLM cost tracking for OpenAI, Anthropic, and OpenAI-compatible providers via optional LiteLLM-backed pricing.
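The arithmetic behind cost tracking is simple once usage and pricing are known; the hard part (which LiteLLM handles) is keeping the price table current. A sketch with an illustrative, not authoritative, price table:

```python
# Illustrative per-1M-token prices in USD; real tracking resolves
# current pricing via LiteLLM rather than a hardcoded table.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def cost_from_usage(model, prompt_tokens, completion_tokens):
    """Turn provider-reported token usage into a dollar figure."""
    p = PRICES[model]
    return (prompt_tokens * p["input"]
            + completion_tokens * p["output"]) / 1_000_000
```

Provider responses report `prompt_tokens` and `completion_tokens` in their usage field, so spend can be recorded per call and accumulated against a TokenBudget-style cap.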
