

When to Use This

You’re running a production integration that:
  • Makes many requests per minute (near or above your tier limit)
  • Uses concurrent workers (background jobs, async handlers)
  • Hits endpoint-specific limits (e.g. Mave max 10 concurrent, Focus Groups max 5)
  • Needs to avoid 429s proactively instead of only reacting with retries
This cookbook covers:
  • Proactive throttling using X-RateLimit-* headers
  • Token bucket style rate limiting to smooth request rate
  • Concurrency limiting (semaphores) for endpoint-specific caps
  • Request queuing when you must process many items without bursting
  • Production checklist and metrics to monitor

Rate Limit Recap

| Tier | Requests/minute |
| --- | --- |
| Starter | 60 |
| Basic | 120 |
| Professional | 240 |
| Enterprise | 600 |

Endpoint-specific concurrency limits:

| Endpoint | Max concurrent |
| --- | --- |
| /mave/chat | 10 |
| /focus-groups | 5 |
| /video-analyses | 3 |

Every response includes:

| Header | Meaning |
| --- | --- |
| X-RateLimit-Limit | Max requests per minute for your tier |
| X-RateLimit-Remaining | Requests left in the current 60s window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | (On 429) seconds to wait before retrying |

Pattern 1: Proactive Throttling from Headers

Don’t burst to the limit. After each request, read X-RateLimit-Remaining. If it’s low (e.g. < 10), slow down before you hit 429.
import time
import requests

API_KEY = "mvra_live_your_key_here"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = "https://app.mavera.io/api/v1"

# Safety margin: start backing off when remaining drops below this
LOW_REMAINING_THRESHOLD = 10
MIN_INTERVAL = 0.5  # Minimum seconds between requests

def request_with_throttle(method: str, path: str, **kwargs) -> requests.Response:
    """Make request and throttle if we're approaching the limit."""
    resp = requests.request(method, f"{BASE}{path}", headers=HEADERS, **kwargs)

    remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
    limit = int(resp.headers.get("X-RateLimit-Limit", 60))

    if remaining < LOW_REMAINING_THRESHOLD and remaining > 0:
        # Spread remaining requests over the reset window
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        wait = max(0, (reset - time.time()) / max(1, remaining))
        wait = min(wait, 30)  # Cap at 30s
        time.sleep(wait)

    return resp
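The spread-over-window arithmetic is easy to unit test in isolation. A minimal sketch; `throttle_wait` is a hypothetical helper mirroring the logic of `request_with_throttle`, not part of any SDK:

```python
def throttle_wait(remaining: int, reset_ts: float, now: float,
                  threshold: int = 10, cap: float = 30.0) -> float:
    """Seconds to sleep after a response: spread the remaining
    requests evenly across the rest of the 60s window."""
    if remaining <= 0 or remaining >= threshold:
        # Plenty of headroom, or already exhausted (let 429 handling take over)
        return 0.0
    return min(cap, max(0.0, (reset_ts - now) / remaining))
```

For example, 5 requests left with 30 seconds until reset yields one request every 6 seconds.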

Pattern 2: Token Bucket (Smooth Request Rate)

A token bucket lets you maintain a steady request rate instead of bursting. Refill tokens over time; consume one per request. If no tokens, wait.
import time
import threading

class TokenBucket:
    """Thread-safe token bucket for rate limiting."""

    def __init__(self, rate: float, capacity: int = None):
        """
        rate: tokens per second (e.g. 2.0 for 120/min)
        capacity: max tokens (defaults to rate * 60 for 1 minute burst)
        """
        self.rate = rate
        self.capacity = capacity or int(rate * 60)
        self.tokens = float(self.capacity)
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self, tokens: int = 1) -> float:
        """Consume tokens; returns seconds the caller should wait."""
        with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now

            # Always deduct, even past zero: a negative balance means the
            # caller owes a wait, and it keeps concurrent callers from
            # double-spending tokens that are already promised.
            self.tokens -= tokens
            if self.tokens >= 0:
                return 0.0
            return -self.tokens / self.rate

    def wait_and_acquire(self, tokens: int = 1):
        """Block until tokens available, then consume."""
        wait = self.acquire(tokens)
        if wait > 0:
            time.sleep(wait)


# Usage: 2 req/sec ≈ 120/min (Basic tier)
bucket = TokenBucket(rate=2.0)

def chat_with_bucket(messages, persona_id):
    bucket.wait_and_acquire()
    return client.responses.create(
        model="mavera-1",
        input=messages,
        extra_body={"persona_id": persona_id},
    )

Pattern 3: Concurrency Limiting (Semaphore)

For endpoints with max concurrent limits (Mave: 10, Focus Groups: 5, Video: 3), use a semaphore so you never exceed that many in-flight requests.
import asyncio
import httpx

# Mave allows max 10 concurrent
MAVE_SEMAPHORE = asyncio.Semaphore(10)

async def mave_chat_with_concurrency_limit(message: str, thread_id: str = None):
    async with MAVE_SEMAPHORE:
        async with httpx.AsyncClient(timeout=120.0) as client:
            payload = {"message": message}
            if thread_id:
                payload["thread_id"] = thread_id
            resp = await client.post(
                "https://app.mavera.io/api/v1/mave/chat",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=payload,
            )
            resp.raise_for_status()
            return resp.json()
Combine semaphore with token bucket: semaphore for concurrency, token bucket for overall request rate. E.g. 10 concurrent Mave requests, but only 4 new Mave requests per minute across all workers.
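That combination can be sketched in one helper (threading-based to match the TokenBucket above; `RateLimitedPool` and its parameter names are illustrative, not SDK names):

```python
import threading
import time

class RateLimitedPool:
    """Cap both in-flight concurrency (semaphore) and sustained
    request rate (token bucket) in one place."""

    def __init__(self, max_concurrent: int, rate_per_sec: float):
        self._sem = threading.Semaphore(max_concurrent)
        self._rate = rate_per_sec
        self._tokens = float(max_concurrent)   # small initial burst allowance
        self._capacity = float(max_concurrent)
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def _token_wait(self) -> float:
        """Deduct one token; return how long to sleep if we overdrew."""
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            self._tokens -= 1.0
            return 0.0 if self._tokens >= 0 else -self._tokens / self._rate

    def run(self, fn, *args, **kwargs):
        wait = self._token_wait()
        if wait > 0:
            time.sleep(wait)
        with self._sem:  # never more than max_concurrent in flight
            return fn(*args, **kwargs)
```

For the Mave example above, `RateLimitedPool(max_concurrent=10, rate_per_sec=4 / 60)` would allow 10 in-flight requests while starting at most 4 new ones per minute.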

Pattern 4: Request Queue for Batch Processing

When you have a list of items to process (e.g. 500 chat requests), push them into a queue and process at a controlled rate. Prevents spikes and respects limits.
import queue
import threading
import time

def process_queue_sync(items, process_one, rate_per_minute=60, max_workers=4):
    """
    Process items through a queue with rate limiting.
    process_one(item) -> result for each item.
    """
    q = queue.Queue()
    for item in items:
        q.put(item)

    # Divide the budget across workers: each worker paces itself, so the
    # combined throughput stays near rate_per_minute
    interval = 60.0 * max_workers / rate_per_minute
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                break

            start = time.time()
            try:
                out = process_one(item)
                with lock:
                    results.append(out)
            except Exception as e:
                with lock:
                    results.append({"error": str(e)})

            elapsed = time.time() - start
            sleep_for = max(0, interval - elapsed)
            time.sleep(sleep_for)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return results
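The same pattern translates to asyncio; a sketch assuming `process_one` is a coroutine. Note the per-worker interval is multiplied by `max_workers` so the combined throughput stays near `rate_per_minute`:

```python
import asyncio

async def process_queue_async(items, process_one, rate_per_minute=60, max_workers=4):
    """Async variant: workers pull from an asyncio.Queue and pace themselves."""
    q = asyncio.Queue()
    for item in items:
        q.put_nowait(item)

    results = []
    interval = 60.0 * max_workers / rate_per_minute  # per-worker spacing

    async def worker():
        while True:
            try:
                item = q.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results.append(await process_one(item))
            except Exception as e:
                results.append({"error": str(e)})
            await asyncio.sleep(interval)

    await asyncio.gather(*(worker() for _ in range(max_workers)))
    return results
```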

Pattern 5: Exponential Backoff on 429 (Reactive)

When you do get a 429, respect Retry-After and use exponential backoff. Combine with jitter to avoid thundering herd.
import time
import random

import requests

def backoff_on_429(func, max_retries=5):
    """Retry on 429 with exponential backoff seeded by Retry-After.
    func must raise requests.HTTPError on failure (e.g. via raise_for_status)."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except requests.HTTPError as e:
            if e.response.status_code != 429 or attempt == max_retries:
                raise

            retry_after = int(e.response.headers.get("Retry-After", 30))
            base = retry_after * (2 ** attempt)
            jitter = random.uniform(0, 5)
            wait = min(base + jitter, 300)  # Cap 5 min
            time.sleep(wait)
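With jitter set aside, the wait sequence the loop produces is deterministic; a hypothetical helper just to make the schedule visible:

```python
def backoff_schedule(retry_after: int, max_retries: int = 5, cap: float = 300.0):
    """Wait (in seconds) before each retry for a fixed Retry-After,
    before jitter: retry_after doubled per attempt, capped at 5 minutes."""
    return [min(float(retry_after * (2 ** attempt)), cap)
            for attempt in range(max_retries)]
```

A `Retry-After: 30` thus yields waits of 30, 60, 120, 240, and 300 seconds across five retries.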

Production Checklist

  • Know your tier limit: confirm your rate limit (60/120/240/600 rpm) and design throttling for 80–90% of it to leave headroom.
  • Respect endpoint concurrency caps: Mave (10), Focus Groups (5), Video (3). Use semaphores or equivalent.
  • Monitor headers: log X-RateLimit-Remaining and X-RateLimit-Reset periodically. Alert when remaining is frequently < 5.
  • Batch where possible: send a single chat with full conversation history instead of many single-message calls.
  • Load test: run at 90% of your limit and verify you receive the headers and throttle correctly.
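The header-logging step can be sketched as a small helper you call after every response (names here are illustrative, not SDK APIs):

```python
import logging

log = logging.getLogger("mavera.ratelimit")

def record_rate_headers(headers, alert_below: int = 5):
    """Extract the rate-limit headers and warn when headroom is low.
    Returns (remaining, reset) for feeding a metrics system."""
    remaining = int(headers.get("X-RateLimit-Remaining", -1))
    reset = int(headers.get("X-RateLimit-Reset", 0))
    if 0 <= remaining < alert_below:
        log.warning("rate limit nearly exhausted: remaining=%d reset=%d",
                    remaining, reset)
    return remaining, reset
```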

Metrics to Monitor

| Metric | What to track |
| --- | --- |
| rate_limit_remaining | From X-RateLimit-Remaining; alert if often < 5 |
| rate_limit_reset | From X-RateLimit-Reset; for dashboards |
| requests_per_minute | Your actual throughput |
| 429_count | Number of rate limit errors; should be near 0 with good throttling |
| retry_count | Retries due to 429; indicates throttling may need tuning |

See Also

  • Rate Limits Guide: tiers, headers, and basic handling
  • Error Handling Patterns: retry logic and backoff
  • Credits: credit usage (separate from rate limits)
  • Contact Sales: higher limits for Enterprise