Rate Limiting¶
The Problem¶
AI agents are prolific tool callers. A single `while True` loop in an agent that calls a search API can exceed your API provider's rate limits in seconds, resulting in HTTP 429 errors, blocked accounts, or surprise bills.
agentguard's rate limiter uses the token bucket algorithm to enforce configurable per-second, per-minute, and per-hour limits.
Token Bucket Algorithm¶
The token bucket works like this:
- The bucket starts full with `burst` tokens
- Tokens are replenished continuously at the configured rate
- Each function call consumes 1 token
- If the bucket is empty, the call is blocked (or warned, depending on config)
This gives you both a steady-state rate limit and a burst allowance for legitimate spikes.
```text
Bucket capacity: 5 tokens (burst)
Refill rate:     2 tokens/second (calls_per_second=2.0)

t=0:   [●●●●●] 5 tokens — up to 5 burst calls allowed immediately
t=1:   [●●●●●] still full (refill is capped at capacity)
t=1.5: call → [●●●●] 4 tokens remaining
t=2:   [●●●●●] refilled to 5 (0.5 s × 2 tokens/s = 1 token)
```
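The refill-and-consume logic traced above can be sketched in a few lines. This is a toy illustration of the algorithm, not agentguard's actual implementation:

```python
import time


class TokenBucket:
    """Toy token bucket: refills continuously, each call consumes 1 token."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate             # tokens added per second
        self.capacity = burst        # maximum (and starting) token count
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=2.0, burst=5)
results = [bucket.try_acquire() for _ in range(6)]
# The 5 burst calls succeed immediately; the 6th must wait for a refill
```

Six back-to-back calls demonstrate both halves of the behaviour: the burst allowance admits the first five, and the empty bucket rejects the sixth until tokens accumulate again.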
Basic Usage¶
```python
from agentguard import guard, RateLimiter

@guard(rate_limit=RateLimiter(calls_per_minute=30).config)
def search_api(query: str) -> list[dict]:
    """Search — limited to 30 calls/minute."""
    import requests
    return requests.get("https://search.api.com", params={"q": query}).json()
```
Or use RateLimitConfig directly for full control:
```python
from agentguard.core.types import RateLimitConfig, GuardAction

@guard(rate_limit=RateLimitConfig(
    calls_per_second=2.0,
    calls_per_minute=60.0,
    calls_per_hour=500.0,
    burst=5,
    on_limit=GuardAction.BLOCK,
    shared_key=None,
))
def search_api(query: str) -> list[dict]: ...
```
Configuration Reference¶
calls_per_second¶
Maximum sustained rate in calls per second. The bucket refills at this rate.
calls_per_minute¶
Convenience shorthand — converted to calls_per_second = calls_per_minute / 60.
calls_per_hour¶
For very slow-refilling buckets.
You can combine limits. The most restrictive is applied:
```python
RateLimitConfig(
    calls_per_second=5.0,   # No more than 5/second
    calls_per_hour=1000.0,  # No more than 1000/hour
)
```
burst¶
Maximum tokens in the bucket (also the starting token count). Allows short bursts above the sustained rate.
on_limit¶
What to do when the bucket is empty:
```python
from agentguard.core.types import GuardAction

on_limit=GuardAction.BLOCK  # Raise RateLimitError (default)
on_limit=GuardAction.WARN   # Log warning, return None
on_limit=GuardAction.LOG    # Silently record, return None
```
shared_key¶
Controls whether buckets are shared across guarded tool instances:
```python
RateLimitConfig(shared_key=None)          # Default: share by tool name
RateLimitConfig(shared_key="")            # Per-instance bucket
RateLimitConfig(shared_key="provider-x")  # Explicit shared group
```
If multiple tools register the same effective shared key with different rate limit settings, the first config wins and agentguard emits a warning.
Handling RateLimitError¶
When on_limit=GuardAction.BLOCK (the default), exceeding the rate limit raises RateLimitError:
```python
import time

from agentguard.core.types import RateLimitError

def search_with_backoff(query: str) -> list[dict]:
    while True:
        try:
            return search_api(query)
        except RateLimitError as e:
            print(f"Rate limited — retrying in {e.retry_after:.1f}s")
            time.sleep(e.retry_after)
```
RateLimitError.retry_after is the number of seconds until enough tokens are available for one call.
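For a token bucket, that wait time follows directly from the bucket state: with `tokens` currently in the bucket and a refill rate of `rate` tokens/second, one token becomes available after `(1 - tokens) / rate` seconds. A small sketch of that arithmetic (an illustration, not agentguard's code):

```python
def seconds_until_token(tokens: float, rate: float) -> float:
    """Seconds until the bucket holds >= 1 token (0 if one is already available)."""
    return max(0.0, (1.0 - tokens) / rate)


# An empty bucket refilling at 2 tokens/second: half a second until the next call
wait = seconds_until_token(tokens=0.0, rate=2.0)
```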
Common Patterns¶
Match your API provider's limits¶
```python
# OpenAI GPT-4: 500 RPM on tier 1
@guard(rate_limit=RateLimitConfig(calls_per_minute=490, burst=10))
def call_gpt4(prompt: str) -> str: ...

# Anthropic Claude: 1000 RPM
@guard(rate_limit=RateLimitConfig(calls_per_minute=990, burst=20))
def call_claude(prompt: str) -> str: ...

# SerpAPI: 100 searches/month (very slow refill)
@guard(rate_limit=RateLimitConfig(
    calls_per_hour=0.14,  # 100/month ÷ ~730 hours ≈ 0.14/hour
    burst=1,
))
def serpapi_search(query: str) -> dict: ...
```
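Monthly quotas are the easiest to get wrong, since they have to be converted down to a per-hour refill rate. A small helper makes the conversion explicit (`monthly_quota_to_hourly` is an illustrative name, not part of agentguard):

```python
def monthly_quota_to_hourly(calls_per_month: float, days_per_month: float = 30.4) -> float:
    """Spread a monthly quota evenly across the month (~730 hours)."""
    return calls_per_month / (days_per_month * 24)


rate = monthly_quota_to_hourly(100)  # ≈ 0.14 calls/hour
```

Rounding this value down rather than up keeps you safely under the provider's quota.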
Per-user rate limiting¶
By default, rate limits are shared by tool name across GuardedTool
instances. If you want isolated buckets per user, set shared_key="" when
creating each instance:
```python
from agentguard import GuardConfig
from agentguard.core.guard import GuardedTool
from agentguard.core.types import RateLimitConfig

def create_user_tools(user_id: str) -> dict:
    config = GuardConfig(
        rate_limit=RateLimitConfig(calls_per_minute=10, shared_key=""),
        session_id=user_id,
    )
    return {
        "search": GuardedTool(search_fn, config=config),
        "query_db": GuardedTool(query_db_fn, config=config),
    }
```
Shared quota across different tools¶
Use a custom shared_key when different tool names consume the same upstream
provider quota:
```python
provider_limit = RateLimitConfig(calls_per_minute=100, shared_key="serpapi")

@guard(rate_limit=provider_limit)
def web_search(query: str) -> dict: ...

@guard(rate_limit=provider_limit)
def news_search(query: str) -> dict: ...
```
Graceful degradation instead of error¶
```python
from agentguard.core.types import GuardAction

@guard(rate_limit=RateLimitConfig(
    calls_per_minute=60,
    on_limit=GuardAction.WARN,  # Return None instead of raising
))
def enrich_lead(email: str) -> dict | None:
    """Enrich a lead — may return None if rate limited."""
    ...

email = "user@example.com"
result = enrich_lead(email)
if result is None:
    # Rate limited — skip enrichment for this lead
    result = {"email": email, "enriched": False}
```
Rate Limiting vs Circuit Breaking¶
These are complementary, not alternatives:
| | Rate Limiter | Circuit Breaker |
|---|---|---|
| Purpose | Prevent your agent from calling too fast | Prevent calls to a failing downstream service |
| Triggers on | Call volume | Failure count |
| Recovery | Automatic (bucket refills) | Timed probe |
| Use when | You have an API quota | The service is unreliable |
Use both together:
```python
@guard(
    rate_limit=RateLimitConfig(calls_per_minute=60),
    circuit_breaker=CircuitBreakerConfig(failure_threshold=5),
)
def call_external_api(): ...
```
Troubleshooting¶
RateLimitError in tests¶
Tests that call guarded tools rapidly will hit rate limits. Use a permissive config in tests:
```python
# conftest.py
import pytest

from agentguard import GuardConfig

@pytest.fixture
def no_limits():
    return GuardConfig()  # No rate limit

def test_my_tool(no_limits):
    from agentguard.core.guard import GuardedTool

    guarded = GuardedTool(my_fn, config=no_limits)
    for _ in range(100):  # Won't be rate limited
        guarded(arg="value")
```
Rate limit not enforced¶
Check that you're using the same GuardedTool instance across calls. Each @guard application creates a fresh token bucket. If you wrap the function twice, each wrapper has its own bucket.
Burst calls all succeed, then rate limit kicks in¶
This is expected behaviour. The burst allows an initial flurry of calls, then enforces the sustained rate. If you want stricter control with no burst, set burst=1.
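The difference is easy to see in a toy replay of timestamped calls against a simulated bucket (again an illustration, not agentguard's internals):

```python
def simulate(burst: int, rate: float, call_times: list[float]) -> list[bool]:
    """Replay timestamped calls against a token bucket; True = call allowed."""
    tokens, last = float(burst), 0.0
    allowed = []
    for t in call_times:
        tokens = min(burst, tokens + (t - last) * rate)  # refill since last call
        last = t
        if tokens >= 1:
            tokens -= 1
            allowed.append(True)
        else:
            allowed.append(False)
    return allowed


calls = [0.0] * 6  # six calls fired at the same instant
with_burst = simulate(burst=5, rate=2.0, call_times=calls)  # first 5 pass
no_burst = simulate(burst=1, rate=2.0, call_times=calls)    # only 1 passes
```

With `burst=1` the very first call drains the bucket, so traffic immediately settles at the sustained rate with no initial flurry.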