Rate Limiting
Rate limiting controls how frequently requests can be made to your AI agent, preventing abuse and helping you manage API budgets.
To demonstrate using the Rate Limiter component, there is an example implementation you can run yourself.
It rate limits the number of messages a user can send in a given time period, as well as the total token usage for a user. When a limit is exceeded, the client can reactively tell the user how long to wait (even if they exceeded the limit in another browser tab!).
For general usage tracking, see Usage Tracking.
Overview
The rate limiting example demonstrates two types of rate limiting:
- Message Rate Limiting: Prevents users from sending messages too frequently
- Token Usage Rate Limiting: Controls AI model token consumption over time
Running the Example
git clone https://github.com/get-convex/agent.git
cd agent
npm run setup
npm run example
Try sending multiple questions quickly to see the rate limiting in action!
Rate Limiting Strategy
Below we'll go through each configuration. You can also see the full example implementation in rateLimiting.ts.
import { MINUTE, RateLimiter, SECOND } from "@convex-dev/rate-limiter";
import { components } from "./_generated/api";

export const rateLimiter = new RateLimiter(components.rateLimiter, {
  sendMessage: {
    kind: "fixed window",
    period: 5 * SECOND,
    rate: 1,
    capacity: 2,
  },
  globalSendMessage: { kind: "token bucket", period: MINUTE, rate: 1_000 },
  tokenUsagePerUser: {
    kind: "token bucket",
    period: MINUTE,
    rate: 2000,
    capacity: 10000,
  },
  globalTokenUsage: { kind: "token bucket", period: MINUTE, rate: 100_000 },
});
1. Fixed Window Rate Limiting for Messages
// export const rateLimiter = new RateLimiter(components.rateLimiter, {
sendMessage: { kind: "fixed window", period: 5 * SECOND, rate: 1, capacity: 2 }
- Allows 1 message every 5 seconds per user.
- Prevents spam and rapid-fire requests.
- Allows up to a 2-message burst within 5 seconds via capacity, if the user has unused allowance left over from the previous 5-second window.
Global limit:
globalSendMessage: { kind: "token bucket", period: MINUTE, rate: 1_000 },
- Allows 1,000 messages per minute globally, to stay under the API limit.
- As a token bucket, it continuously accrues capacity at 1,000 per minute, capping out at 1,000. All available capacity can be used in quick succession (see the sketch below).
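To see how these limits behave at runtime, here is a minimal sketch (not taken from the example) of consuming the sendMessage and globalSendMessage limits without throws: true, so the returned status can be inspected instead of catching an error. The trySendMessage mutation and its userId argument are hypothetical; the real example derives the user from auth and uses throws: true, surfacing a catchable error to the client instead.
// Hypothetical mutation: consume the per-user and global message limits and
// return a status instead of throwing.
import { v } from "convex/values";
import { mutation } from "./_generated/server";
import { rateLimiter } from "./rateLimiting";

export const trySendMessage = mutation({
  args: { userId: v.string() },
  handler: async (ctx, { userId }) => {
    // Without `throws: true`, `limit` returns { ok, retryAfter }.
    const perUser = await rateLimiter.limit(ctx, "sendMessage", { key: userId });
    if (!perUser.ok) {
      // retryAfter: how many milliseconds until the next message is allowed.
      return { ok: false, retryAfter: perUser.retryAfter };
    }
    const global = await rateLimiter.limit(ctx, "globalSendMessage", {});
    return global.ok ? { ok: true } : { ok: false, retryAfter: global.retryAfter };
  },
});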
2. Token Bucket Rate Limiting for Token Usage
tokenUsagePerUser: { kind: "token bucket", period: MINUTE, rate: 2000, capacity: 10000 }
globalTokenUsage: { kind: "token bucket", period: MINUTE, rate: 100_000 },
- Allows 2,000 tokens per minute per user (a userId is provided as the key), with bursts of up to 10,000 tokens, and 100k tokens per minute globally.
- Provides burst capacity while controlling overall usage. If the bucket hasn't been used in a while, you can consume all of its tokens at once, but you then need to wait for tokens to gradually accrue before making more requests (see the sketch after this list for the arithmetic).
- Having a per-user limit is useful to prevent a single user from hogging all of the token bandwidth you have available with your LLM provider, while the global limit helps stay under the provider's API limit without throwing an error midway through a potentially long multi-step request.
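To make the accrual concrete, here is a small worked sketch of the token bucket arithmetic. The rate and capacity values match the config above; the minutesUntilAvailable helper is purely illustrative and not part of the example.
// Token bucket for tokenUsagePerUser: 2,000 tokens/minute, capacity 10,000.
const RATE_PER_MINUTE = 2000;

// How long until `needed` tokens are available, given the current balance?
function minutesUntilAvailable(currentBalance: number, needed: number): number {
  const deficit = Math.max(0, needed - currentBalance);
  return deficit / RATE_PER_MINUTE;
}

// A 10,000-token request can succeed immediately from a full bucket, but the
// next 10,000-token request then has to wait 5 minutes, so sustained usage
// averages out to at most 2,000 tokens per minute.
minutesUntilAvailable(0, 10_000); // 5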
How It Works
Step 1: Pre-flight Rate Limit Checks
Before processing a question, the system:
- Checks if the user can send another message (frequency limit)
- Estimates token usage for the question
- Verifies the user has sufficient token allowance
- Throws an error if either limit would be exceeded
- If the rate limits aren't exceeded, the LLM request is made.
See rateLimiting.ts for the full implementation.
// In the mutation that would start generating a message.
await rateLimiter.limit(ctx, "sendMessage", { key: userId, throws: true });
// Also check global limit.
await rateLimiter.limit(ctx, "globalSendMessage", { throws: true });
// A heuristic based on the previous token usage in the thread + the question.
const count = await estimateTokens(ctx, args.threadId, args.question);
// Check token usage, but don't consume the tokens yet.
await rateLimiter.check(ctx, "tokenUsage", {
key: userId,
count: estimateTokens(args.question),
throws: true,
});
// Also check global limit.
await rateLimiter.check(ctx, "globalTokenUsage", {
count,
reserve: true,
throws: true,
});
If there is not enough allowance, the rate limiter will throw an error that the client can catch and prompt the user to wait a bit before trying again.
The difference between limit and check is that limit consumes the tokens immediately, while check only verifies whether the limit would be exceeded. We actually mark the tokens as used once the request is complete, using the total usage.
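As a small sketch of that distinction (estimatedCount here is a placeholder; the example's real pre-flight code is shown above), note that only limit changes stored state:
const estimatedCount = 500; // placeholder estimate
// `check` only reports whether the tokens would fit under the limit.
const preview = await rateLimiter.check(ctx, "tokenUsagePerUser", { key: userId, count: estimatedCount });
// `limit` actually deducts the tokens from the user's bucket (without
// `throws`, it returns { ok, retryAfter } rather than throwing).
const consumed = await rateLimiter.limit(ctx, "tokenUsagePerUser", { key: userId, count: estimatedCount });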
Step 2: Post-generation Usage Tracking
While rate limiting how often messages can be sent prevents many messages from being sent in a short period, a single message can still generate a very long response or pull in a lot of context tokens. For this reason we also track token usage as its own rate limit.
After the AI generates a response, we mark the tokens as used, based on the total usage. We use reserve: true to allow a (temporarily) negative balance, in case the generation used more tokens than estimated. A "reservation" here means allocating tokens beyond what is allowed. Typically this is done ahead of time, to "reserve" capacity for a big request that can be scheduled in advance. In this case, we're marking capacity that has already been consumed. This prevents future requests from starting until the "debt" is paid off.
await rateLimiter.limit(ctx, "tokenUsage", {
key: userId,
count: usage.totalTokens,
reserve: true, // because of this, it will never fail
});
The "trick" here is that, while a user can make a request that exceeds the limit for a single request, they then have to wait longer to accrue the tokens for another request. So averaged over time they can't consume more than the rate limit.
This balances pragmatism of trying to prevent requests ahead of time with an estimate, while also rate limiting the actual usage.
Client-side Handling
See RateLimiting.tsx for the client-side code.
While the client isn't the final authority on whether a request should be allowed, it can proactively show a waiting message while the user is rate limited, and an error message when a request is rejected. This prevents the user from making attempts that are likely to fail.
It makes use of the useRateLimit hook to check the rate limits. See the full Rate Limiting docs here.
import { useRateLimit } from "@convex-dev/rate-limiter/react";
//...
const { status } = useRateLimit(api.example.getRateLimit);
In convex/example.ts we expose getRateLimit:
export const { getRateLimit, getServerTime } = rateLimiter.hookAPI<DataModel>(
  "sendMessage",
  { key: (ctx) => getAuthUserId(ctx) },
);
Showing a waiting message while the user is rate limited:
{status && !status.ok && (
  <div className="text-xs text-gray-500 text-center">
    <p>Message sending rate limit exceeded.</p>
    <p>
      Try again after <Countdown ts={status.retryAt} />
    </p>
  </div>
)}
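The Countdown component comes from the example app's UI and isn't shown here. A minimal sketch of such a component, assuming status.retryAt is an epoch timestamp in milliseconds, might look like:
// Hypothetical countdown display; the example ships its own implementation.
import { useEffect, useState } from "react";

function Countdown({ ts }: { ts: number }) {
  const [now, setNow] = useState(Date.now());
  useEffect(() => {
    const id = setInterval(() => setNow(Date.now()), 1000);
    return () => clearInterval(id);
  }, []);
  const seconds = Math.max(0, Math.ceil((ts - now) / 1000));
  return <span>{seconds}s</span>;
}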
Showing an error message when the rate limit is exceeded:
import { isRateLimitError } from "@convex-dev/rate-limiter";

// In a button handler
await submitQuestion({ question, threadId }).catch((e) => {
  if (isRateLimitError(e)) {
    toast({
      title: "Rate limit exceeded",
      description: `Rate limit exceeded for ${e.data.name}.
        Try again after ${getRelativeTime(Date.now() + e.data.retryAfter)}`,
    });
  }
});
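The getRelativeTime helper is part of the example app and not shown in this doc; a purely illustrative sketch of one way to implement it:
// Hypothetical formatter for "try again after ..." messages.
function getRelativeTime(timestamp: number): string {
  const seconds = Math.max(0, Math.round((timestamp - Date.now()) / 1000));
  if (seconds < 60) return `in ${seconds} second${seconds === 1 ? "" : "s"}`;
  const minutes = Math.ceil(seconds / 60);
  return `in ${minutes} minute${minutes === 1 ? "" : "s"}`;
}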
Token Estimation
The example includes a simple token estimation function:
import { QueryCtx } from "./_generated/server";
import { fetchContextMessages } from "@convex-dev/agent";
import { components } from "./_generated/api";

// This is a rough estimate of the tokens that will be used.
// It's not perfect, but it's a good enough estimate for a pre-generation check.
export async function estimateTokens(
  ctx: QueryCtx,
  threadId: string | undefined,
  question: string,
) {
  // Assume roughly 4 characters per token.
  const promptTokens = question.length / 4;
  // Assume a longer non-zero reply.
  const estimatedOutputTokens = promptTokens * 3 + 1;
  const latestMessages = await fetchContextMessages(ctx, components.agent, {
    threadId,
    messages: [{ role: "user" as const, content: question }],
    contextOptions: { recentMessages: 2 },
  });
  // Our new usage will roughly be the previous tokens + the question.
  // The previous tokens include the tokens for the full message history and
  // output tokens, which will be part of our new history.
  const lastUsageMessage = latestMessages
    .reverse()
    .find((message) => message.usage);
  const lastPromptTokens = lastUsageMessage?.usage?.totalTokens ?? 1;
  return lastPromptTokens + promptTokens + estimatedOutputTokens;
}