Rate Limiting

Token bucket rate limiting with per-second, per-minute, and per-hour controls.

Token Bucket Algorithm

agentguard uses the token bucket algorithm for rate limiting. Think of a bucket that holds tokens — each call consumes one token, and tokens refill at a steady rate. Burst traffic is allowed up to the bucket capacity.

How It Works
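The refill-and-consume loop can be sketched in a few lines of plain Python. This is an illustration of the algorithm, not agentguard's actual implementation; the class name and method are hypothetical:

```python
import time

class TokenBucket:
    """Illustrative token bucket: capacity = burst, refilled at `rate` tokens/sec."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum tokens the bucket can hold
        self.tokens = float(burst)  # start full, so an idle bucket allows a burst
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, burst=3)
print([bucket.try_acquire() for _ in range(5)])  # → [True, True, True, False, False]
```

The first three calls drain the initial burst; the next calls fail until roughly a second of refill has elapsed.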

Configuration

```python
from agentguard import guard
from agentguard.config import RateLimitConfig

# Per-minute rate limit
@guard(
    rate_limit=RateLimitConfig(
        calls_per_minute=60,    # 60 calls per minute = 1/sec sustained
        burst=10,               # Allow a burst of 10 rapid calls
    )
)
def search(query: str) -> dict:
    return api.search(query)

# Per-second rate limit (strict)
@guard(
    rate_limit=RateLimitConfig(
        calls_per_second=5,     # Max 5 calls per second
        burst=5,                # No burst beyond the sustained rate
    )
)
def write_db(data: dict) -> bool:
    return db.insert(data)

# Per-hour rate limit (cost control)
@guard(
    rate_limit=RateLimitConfig(
        calls_per_hour=1000,    # 1000 calls per hour
        burst=50,               # Allow short bursts
    )
)
def llm_call(prompt: str) -> str:
    return openai.chat(prompt)
```

Configuration Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `calls_per_second` | float | None | Maximum sustained calls per second |
| `calls_per_minute` | float | None | Maximum sustained calls per minute |
| `calls_per_hour` | float | None | Maximum sustained calls per hour |
| `burst` | int | 10 | Maximum burst size (bucket capacity) |
| `block` | bool | True | If True, raise an error. If False, wait. |

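Since `block` defaults to True, a guarded function raises when the limit is hit; setting it to False makes the call wait for the next token instead. A hedged sketch (the function body and `http.get` are placeholders, not part of agentguard):

```python
from agentguard import guard
from agentguard.config import RateLimitConfig

# block=False: wait for a token instead of raising a rate-limit error
@guard(rate_limit=RateLimitConfig(calls_per_second=5, burst=5, block=False))
def fetch(url: str) -> bytes:
    return http.get(url)  # hypothetical downstream call
```

Blocking mode trades latency for simplicity: callers never see a rate-limit error, but slow down under load.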
Handling Rate Limit Errors

```python
from agentguard.errors import RateLimitExceeded

try:
    result = search("hello")
except RateLimitExceeded as e:
    # e.retry_after gives the wait time in seconds
    print(f"Rate limited. Retry after {e.retry_after:.1f}s")
```
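For intuition about where that wait time comes from: in a token bucket, it is the token deficit divided by the refill rate. A sketch of the arithmetic only (the function name is hypothetical; in practice you read the value from the exception):

```python
def retry_after_estimate(tokens: float, refill_rate: float) -> float:
    """Seconds until one full token is available (illustrative arithmetic only)."""
    return max(0.0, (1.0 - tokens) / refill_rate)

# A fully drained bucket refilling at 1 token/sec needs a full second
print(retry_after_estimate(tokens=0.0, refill_rate=1.0))   # → 1.0
# Half a token already refilled, at 2 tokens/sec: 0.25s remain
print(retry_after_estimate(tokens=0.5, refill_rate=2.0))   # → 0.25
```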
✅ Tip: Use burst wisely

Set `burst` equal to your per-second rate for no burst beyond the sustained rate, or higher to absorb natural traffic spikes. For interactive agents, a burst of 2-3x the sustained rate works well.
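As a quick sanity check of how burst and sustained rate combine, using the per-minute numbers from the configuration example above (plain arithmetic, not agentguard code):

```python
# calls_per_minute=60 refills 1 token per second; burst=10 caps the bucket.
refill_per_sec = 60 / 60          # sustained rate: 1 call/sec
burst = 10                        # bucket capacity

# An agent idle long enough to fill the bucket can fire `burst` calls
# immediately, then refill_per_sec additional calls each second.
calls_in_first_5s = burst + refill_per_sec * 5
print(calls_in_first_5s)  # → 15.0
```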
