

When to Use This

You’re running a production integration that:
  • Makes many requests per minute (near or above your tier limit)
  • Uses concurrent workers (background jobs, async handlers)
  • Hits endpoint-specific limits (e.g. Mave max 10 concurrent, Focus Groups max 5)
  • Needs to avoid 429s proactively instead of only reacting with retries
This cookbook covers:
  • Proactive throttling using X-RateLimit-* headers
  • Token bucket style rate limiting to smooth request rate
  • Concurrency limiting (semaphores) for endpoint-specific caps
  • Request queuing when you must process many items without bursting
  • Production checklist and metrics to monitor

Rate Limit Recap

| Tier | Requests/minute |
| --- | --- |
| Starter | 60 |
| Basic | 120 |
| Professional | 240 |
| Enterprise | 600 |

Endpoint-specific concurrency limits:

| Endpoint | Max concurrent |
| --- | --- |
| /mave/chat | 10 |
| /focus-groups | 5 |
| /video-analyses | 3 |

Every response includes:

| Header | Meaning |
| --- | --- |
| X-RateLimit-Limit | Max requests per minute for your tier |
| X-RateLimit-Remaining | Requests left in the current 60s window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | (On 429) seconds to wait before retrying |

Pattern 1: Proactive Throttling from Headers

Don’t burst to the limit. After each request, read X-RateLimit-Remaining. If it’s low (e.g. < 10), slow down before you hit 429.
import time
import requests

API_KEY = "mvra_live_your_key_here"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = "https://app.mavera.io/api/v1"

# Safety margin: start backing off when remaining drops below this
LOW_REMAINING_THRESHOLD = 10
MIN_INTERVAL = 0.5  # Minimum seconds between requests

def request_with_throttle(method: str, path: str, **kwargs) -> requests.Response:
    """Make request and throttle if we're approaching the limit."""
    resp = requests.request(method, f"{BASE}{path}", headers=HEADERS, **kwargs)

    remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
    limit = int(resp.headers.get("X-RateLimit-Limit", 60))

    if remaining < LOW_REMAINING_THRESHOLD and remaining > 0:
        # Spread remaining requests over the reset window
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        wait = max(0, (reset - time.time()) / max(1, remaining))
        wait = min(wait, 30)  # Cap at 30s
        time.sleep(wait)

    return resp
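The spread-over-window arithmetic is easy to unit test in isolation. A minimal sketch; `throttle_wait` is a hypothetical helper mirroring the logic of `request_with_throttle`, not part of any SDK:

```python
def throttle_wait(remaining: int, reset_ts: float, now: float,
                  threshold: int = 10, cap: float = 30.0) -> float:
    """Seconds to sleep after a response: spread the remaining
    requests evenly across the rest of the 60s window."""
    if remaining <= 0 or remaining >= threshold:
        # Plenty of headroom, or already exhausted (let 429 handling take over)
        return 0.0
    return min(cap, max(0.0, (reset_ts - now) / remaining))
```

For example, 5 requests left with 30 seconds until reset yields one request every 6 seconds.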

Pattern 2: Token Bucket (Smooth Request Rate)

A token bucket lets you maintain a steady request rate instead of bursting. Refill tokens over time; consume one per request. If no tokens, wait.
import time
import threading

class TokenBucket:
    """Thread-safe token bucket for rate limiting."""

    def __init__(self, rate: float, capacity: int = None):
        """
        rate: tokens per second (e.g. 2.0 for 120/min)
        capacity: max tokens (defaults to rate * 60 for 1 minute burst)
        """
        self.rate = rate
        self.capacity = capacity or int(rate * 60)
        self.tokens = float(self.capacity)
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self, tokens: int = 1) -> float:
        """Consume tokens; returns seconds the caller should wait."""
        with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now

            # Always deduct, even past zero: a negative balance means the
            # caller owes a wait, and it keeps concurrent callers from
            # double-spending tokens that are already promised.
            self.tokens -= tokens
            if self.tokens >= 0:
                return 0.0
            return -self.tokens / self.rate

    def wait_and_acquire(self, tokens: int = 1):
        """Block until tokens available, then consume."""
        wait = self.acquire(tokens)
        if wait > 0:
            time.sleep(wait)


# Usage: 2 req/sec ≈ 120/min (Basic tier)
bucket = TokenBucket(rate=2.0)

def chat_with_bucket(messages, persona_id):
    bucket.wait_and_acquire()
    return client.responses.create(
        model="mavera-1",
        input=messages,
        extra_body={"persona_id": persona_id},
    )

Pattern 3: Concurrency Limiting (Semaphore)

For endpoints with max concurrent limits (Mave: 10, Focus Groups: 5, Video: 3), use a semaphore so you never exceed that many in-flight requests.
import asyncio
import httpx

# Mave allows max 10 concurrent
MAVE_SEMAPHORE = asyncio.Semaphore(10)

async def mave_chat_with_concurrency_limit(message: str, thread_id: str = None):
    async with MAVE_SEMAPHORE:
        async with httpx.AsyncClient(timeout=120.0) as client:
            payload = {"message": message}
            if thread_id:
                payload["thread_id"] = thread_id
            resp = await client.post(
                "https://app.mavera.io/api/v1/mave/chat",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=payload,
            )
            resp.raise_for_status()
            return resp.json()
Combine semaphore with token bucket: semaphore for concurrency, token bucket for overall request rate. E.g. 10 concurrent Mave requests, but only 4 new Mave requests per minute across all workers.
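That combination can be sketched in one helper (threading-based to match the TokenBucket above; `RateLimitedPool` and its parameter names are illustrative, not SDK names):

```python
import threading
import time

class RateLimitedPool:
    """Cap both in-flight concurrency (semaphore) and sustained
    request rate (token bucket) in one place."""

    def __init__(self, max_concurrent: int, rate_per_sec: float):
        self._sem = threading.Semaphore(max_concurrent)
        self._rate = rate_per_sec
        self._tokens = float(max_concurrent)   # small initial burst allowance
        self._capacity = float(max_concurrent)
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def _token_wait(self) -> float:
        """Deduct one token; return how long to sleep if we overdrew."""
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            self._tokens -= 1.0
            return 0.0 if self._tokens >= 0 else -self._tokens / self._rate

    def run(self, fn, *args, **kwargs):
        wait = self._token_wait()
        if wait > 0:
            time.sleep(wait)
        with self._sem:  # never more than max_concurrent in flight
            return fn(*args, **kwargs)
```

For the Mave example above, `RateLimitedPool(max_concurrent=10, rate_per_sec=4 / 60)` would allow 10 in-flight requests while starting at most 4 new ones per minute.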

Pattern 4: Request Queue for Batch Processing

When you have a list of items to process (e.g. 500 chat requests), push them into a queue and process at a controlled rate. Prevents spikes and respects limits.
import queue
import threading
import time

def process_queue_sync(items, process_one, rate_per_minute=60, max_workers=4):
    """
    Process items through a queue with rate limiting.
    process_one(item) -> result for each item.
    """
    q = queue.Queue()
    for item in items:
        q.put(item)

    # Divide the budget across workers: each worker paces itself, so the
    # combined throughput stays near rate_per_minute
    interval = 60.0 * max_workers / rate_per_minute
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                break

            start = time.time()
            try:
                out = process_one(item)
                with lock:
                    results.append(out)
            except Exception as e:
                with lock:
                    results.append({"error": str(e)})

            elapsed = time.time() - start
            sleep_for = max(0, interval - elapsed)
            time.sleep(sleep_for)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return results
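The same pattern translates to asyncio; a sketch assuming `process_one` is a coroutine. Note the per-worker interval is multiplied by `max_workers` so the combined throughput stays near `rate_per_minute`:

```python
import asyncio

async def process_queue_async(items, process_one, rate_per_minute=60, max_workers=4):
    """Async variant: workers pull from an asyncio.Queue and pace themselves."""
    q = asyncio.Queue()
    for item in items:
        q.put_nowait(item)

    results = []
    interval = 60.0 * max_workers / rate_per_minute  # per-worker spacing

    async def worker():
        while True:
            try:
                item = q.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results.append(await process_one(item))
            except Exception as e:
                results.append({"error": str(e)})
            await asyncio.sleep(interval)

    await asyncio.gather(*(worker() for _ in range(max_workers)))
    return results
```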

Pattern 5: Exponential Backoff on 429 (Reactive)

When you do get a 429, respect Retry-After and use exponential backoff. Combine with jitter to avoid thundering herd.
import time
import random

import requests

def backoff_on_429(func, max_retries=5):
    """Retry on 429 with exponential backoff seeded by Retry-After.
    func must raise requests.HTTPError on failure (e.g. via raise_for_status)."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except requests.HTTPError as e:
            if e.response.status_code != 429 or attempt == max_retries:
                raise

            retry_after = int(e.response.headers.get("Retry-After", 30))
            base = retry_after * (2 ** attempt)
            jitter = random.uniform(0, 5)
            wait = min(base + jitter, 300)  # Cap 5 min
            time.sleep(wait)
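With jitter set aside, the wait sequence the loop produces is deterministic; a hypothetical helper just to make the schedule visible:

```python
def backoff_schedule(retry_after: int, max_retries: int = 5, cap: float = 300.0):
    """Wait (in seconds) before each retry for a fixed Retry-After,
    before jitter: retry_after doubled per attempt, capped at 5 minutes."""
    return [min(float(retry_after * (2 ** attempt)), cap)
            for attempt in range(max_retries)]
```

A `Retry-After: 30` thus yields waits of 30, 60, 120, 240, and 300 seconds across five retries.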

Production Checklist

  • Know your tier limit: confirm your rate limit (60/120/240/600 rpm) and design throttling for 80–90% of it to leave headroom.
  • Respect endpoint concurrency caps: Mave (10), Focus Groups (5), Video (3). Use semaphores or equivalent.
  • Monitor headers: log X-RateLimit-Remaining and X-RateLimit-Reset periodically. Alert when remaining is frequently < 5.
  • Batch where possible: send a single chat with full conversation history instead of many single-message calls.
  • Load test: run at 90% of your limit and verify you receive the headers and throttle correctly.
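The header-logging step can be sketched as a small helper you call after every response (names here are illustrative, not SDK APIs):

```python
import logging

log = logging.getLogger("mavera.ratelimit")

def record_rate_headers(headers, alert_below: int = 5):
    """Extract the rate-limit headers and warn when headroom is low.
    Returns (remaining, reset) for feeding a metrics system."""
    remaining = int(headers.get("X-RateLimit-Remaining", -1))
    reset = int(headers.get("X-RateLimit-Reset", 0))
    if 0 <= remaining < alert_below:
        log.warning("rate limit nearly exhausted: remaining=%d reset=%d",
                    remaining, reset)
    return remaining, reset
```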

Metrics to Monitor

| Metric | What to track |
| --- | --- |
| rate_limit_remaining | From X-RateLimit-Remaining; alert if often < 5 |
| rate_limit_reset | From X-RateLimit-Reset; for dashboards |
| requests_per_minute | Your actual throughput |
| 429_count | Number of rate limit errors; should be near 0 with good throttling |
| retry_count | Retries due to 429; indicates throttling may need tuning |

See Also

  • Rate Limits Guide: tiers, headers, and basic handling
  • Error Handling Patterns: retry logic and backoff
  • Credits: credit usage (separate from rate limits)
  • Contact Sales: higher limits for Enterprise