Rate Limiting
Token bucket rate limiting with per-second, per-minute, and per-hour controls.
Token Bucket Algorithm
agentguard uses the token bucket algorithm for rate limiting. Think of a bucket that holds tokens: each call consumes one token, and tokens refill at a steady rate. Bursts of traffic are allowed up to the bucket capacity.
How It Works
- Bucket starts full (capacity = burst allowance)
- Each call removes 1 token
- Tokens refill at a constant rate
- If the bucket is empty, the call is rejected with an error, or made to wait when block is False
- Bucket never exceeds its capacity
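The steps above can be sketched in a few lines of Python. This is an illustrative model only, not agentguard's internal implementation: the `TokenBucket` class, its `try_acquire` method, and the injectable `clock` parameter are all hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Illustrative token bucket; not agentguard's internal implementation."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec       # tokens added per second
        self.capacity = capacity       # maximum tokens (burst allowance)
        self.tokens = float(capacity)  # bucket starts full
        self.clock = clock             # injectable clock, handy for tests
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Under this model, `TokenBucket(rate_per_sec=1.0, capacity=10)` would admit 10 rapid calls and then roughly one call per second after that.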
Configuration
```python
from agentguard import guard
from agentguard.config import RateLimitConfig

# Note: api, db, and openai below are placeholders for your own client objects.

# Per-minute rate limit
@guard(
    rate_limit=RateLimitConfig(
        calls_per_minute=60,  # 60 calls per minute = 1/sec sustained
        burst=10,             # Allow burst of 10 rapid calls
    )
)
def search(query: str) -> dict:
    return api.search(query)

# Per-second rate limit (strict)
@guard(
    rate_limit=RateLimitConfig(
        calls_per_second=5,  # Max 5 calls per second
        burst=5,             # No burst beyond rate
    )
)
def write_db(data: dict) -> bool:
    return db.insert(data)

# Per-hour rate limit (cost control)
@guard(
    rate_limit=RateLimitConfig(
        calls_per_hour=1000,  # 1000 calls per hour
        burst=50,             # Allow short bursts
    )
)
def llm_call(prompt: str) -> str:
    return openai.chat(prompt)
```
Configuration Fields
| Field | Type | Default | Description |
|---|---|---|---|
| calls_per_second | float | None | Maximum sustained calls per second |
| calls_per_minute | float | None | Maximum sustained calls per minute |
| calls_per_hour | float | None | Maximum sustained calls per hour |
| burst | int | 10 | Maximum burst size (bucket capacity) |
| block | bool | True | If True, raise an error when the bucket is empty. If False, wait until a token is available. |
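To compare the three rate fields, it can help to express each as tokens per second. The helper below is a hypothetical sketch, not part of agentguard. It assumes only one rate field is set at a time, and the order in which it checks the fields is an arbitrary choice for this example. How agentguard treats several fields set together is not covered here.

```python
from typing import Optional


def refill_rate_per_second(
    calls_per_second: Optional[float] = None,
    calls_per_minute: Optional[float] = None,
    calls_per_hour: Optional[float] = None,
) -> Optional[float]:
    """Hypothetical helper: express a configured rate as tokens per second."""
    if calls_per_second is not None:
        return calls_per_second
    if calls_per_minute is not None:
        return calls_per_minute / 60.0
    if calls_per_hour is not None:
        return calls_per_hour / 3600.0
    return None  # no rate configured
```

For instance, `calls_per_hour=1000` works out to about 0.28 tokens per second, or one token every 3.6 seconds once the burst is spent.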
Handling Rate Limit Errors
```python
from agentguard.errors import RateLimitExceeded

try:
    result = search("hello")
except RateLimitExceeded as e:
    # e.retry_after gives you the wait time in seconds
    print(f"Rate limited. Retry after {e.retry_after:.1f}s")
```
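A common follow-up is to wait for `retry_after` and try again. The sketch below shows one way to do that. So that it runs without agentguard installed, it defines a stand-in `RateLimitExceeded` class whose constructor signature is an assumption. In real code you would import the exception from `agentguard.errors` instead. The `call_with_retry` helper and its `max_attempts` and `sleep` parameters are hypothetical.

```python
import time


class RateLimitExceeded(Exception):
    """Stand-in for agentguard.errors.RateLimitExceeded so the sketch runs alone."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after:.1f}s")
        self.retry_after = retry_after


def call_with_retry(fn, *args, max_attempts: int = 3, sleep=time.sleep, **kwargs):
    """Call fn, sleeping for e.retry_after between rate-limited attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except RateLimitExceeded as e:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(e.retry_after)
```

Capping the number of attempts keeps a persistently throttled call from looping forever. The final failure is re-raised so the caller can decide what to do next.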
✅ Tip: Use burst wisely
Set burst equal to your rate for no burst allowance, or higher to accommodate natural traffic spikes. For interactive agents, a burst of 2-3x the sustained rate works well.
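When sizing burst for cost control, note that in the standard token bucket model the burst adds to the calls admitted in a window, on top of the sustained rate. The helper below is a hypothetical sketch of that upper bound. It assumes agentguard's bucket behaves like the textbook model with continuous refill.

```python
import math


def max_calls_in_window(calls: float, period_sec: float, burst: int, window_sec: float) -> int:
    """Upper bound on calls a token bucket admits in any window of window_sec.

    calls / period_sec is the sustained rate; burst is the bucket capacity.
    """
    return math.floor(burst + calls * window_sec / period_sec)


# calls_per_hour=1000 with burst=50, measured over one hour
hourly_bound = max_calls_in_window(1000, 3600, 50, 3600)

# calls_per_minute=60 with burst=10, measured over ten seconds
ten_second_bound = max_calls_in_window(60, 60, 10, 10)
```

Under these assumptions, the per-hour configuration could admit up to 1050 calls in a single hour, so budget for the burst as well as the sustained rate.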