LLM Token Limits — Context Windows, Truncation, and Budgeting

💡 An LLM token limit is the maximum combined input and output size, measured in tokens, that a model can process in one request. If your prompt is too large, the provider either rejects the call or truncates the context. Use ToolDock Token Counter to estimate request size before you send it to an API.

Key Concepts

Oversized transcript

❌ Wrong

prompt = fullTranscript + question

✅ Fixed

prompt = summarize(fullTranscript) + lastMessages + question

Summarizing the older turns and passing only the most recent messages verbatim keeps the working context small while preserving recent detail.
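A minimal sketch of the fixed pattern, assuming the transcript is a list of message strings. The names `build_prompt`, `keep_last`, and `naive_summarize` are hypothetical, and the summarizer is only a placeholder for whatever summarization step you actually use.

```python
from typing import Callable, List


def build_prompt(
    transcript: List[str],
    question: str,
    summarize: Callable[[List[str]], str],
    keep_last: int = 4,
) -> str:
    """Summarize older turns and keep only the most recent ones verbatim."""
    split = max(len(transcript) - keep_last, 0)
    older, recent = transcript[:split], transcript[split:]
    parts = []
    if older:
        parts.append("Summary of earlier conversation: " + summarize(older))
    parts.extend(recent)
    parts.append(question)
    return "\n".join(parts)


def naive_summarize(turns: List[str]) -> str:
    # Placeholder only: a real system would call a summarization model here.
    return f"{len(turns)} earlier turns omitted."
```

Passing the summarizer in as a function keeps the budgeting logic independent of how the summary is produced.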

No output reserve

❌ Wrong

max_input = full_context_window

✅ Fixed

reserveTokensForOutput = 1000
max_input = full_context_window - reserveTokensForOutput

You need room for the model response, not only the prompt.
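The arithmetic can be sketched as a small helper. The function name and the 8192-token window are illustrative assumptions; real limits vary by model and provider.

```python
def max_input_tokens(context_window: int, reserve_for_output: int) -> int:
    """Return how many tokens the prompt may use after reserving output room."""
    if reserve_for_output < 0:
        raise ValueError("output reserve cannot be negative")
    if reserve_for_output >= context_window:
        raise ValueError("output reserve must be smaller than the context window")
    return context_window - reserve_for_output


# Example numbers only; check your model's documented limits.
budget = max_input_tokens(context_window=8192, reserve_for_output=1000)
print(budget)  # 7192
```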

Too many retrieved chunks

❌ Wrong

chunks = top50Documents

✅ Fixed

chunks = topK(rerank(top50Documents), 8)

RAG systems usually work better with fewer, better chunks.
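One way to picture "fewer, better chunks" is to score every candidate and keep only the best few. In this sketch, `overlap_score` is a toy stand-in for a real reranking model, and both function names are hypothetical.

```python
from typing import List


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)


def top_k_chunks(query: str, candidates: List[str], k: int = 8) -> List[str]:
    """Score every candidate, sort best-first, and keep only the top k."""
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:k]
```

In practice the scoring function would likely be a cross-encoder or similar reranker. The selection step stays the same.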

Single giant diff

❌ Wrong

review(fullRepositoryDiff)

✅ Fixed

for each fileDiff in splitDiffByFile(fullRepositoryDiff): review(fileDiff)

Chunking reduces token pressure and often improves answer quality.
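A rough sketch of splitting by file, assuming the input is a git-style unified diff in which each file section starts with a `diff --git a/<path> b/<path>` header. The function name is hypothetical, and the path parsing is simplified: it would misread paths that themselves contain ` b/`.

```python
from typing import Dict, List


def split_diff_by_file(diff_text: str) -> Dict[str, str]:
    """Split a git-style unified diff into one diff string per file."""
    per_file: Dict[str, List[str]] = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/path b/path
            current = line.split(" b/", 1)[-1]
            per_file[current] = []
        if current is not None:
            per_file[current].append(line)
    return {path: "\n".join(lines) for path, lines in per_file.items()}
```

Each value can then be reviewed in its own request, so no single call has to carry the whole repository diff.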

Real-World Context

Long support transcript

messages = history + latestQuestion

Chat history can silently push a request over the model context window.
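A common guard is to drop the oldest messages until the request fits. The sketch below assumes messages are plain strings and uses a crude estimate of about 4 characters per token. That estimate is only an approximation, so use the model's own tokenizer for real counts. `estimate_tokens` and `trim_history` are hypothetical names.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def trim_history(history: List[str], latest_question: str, max_input: int) -> List[str]:
    """Drop the oldest messages until history plus the question fits max_input."""
    kept = list(history)
    used = estimate_tokens(latest_question) + sum(estimate_tokens(m) for m in kept)
    while kept and used > max_input:
        used -= estimate_tokens(kept.pop(0))
    return kept + [latest_question]
```

The latest question is always kept, even if it alone exceeds the budget, so a caller may still want a final size check before sending.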

RAG pipeline

prompt = instructions + retrievedChunks.join('\n\n')

Retrieved passages often need ranking and trimming to fit the budget.
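Trimming to a budget can be sketched as a greedy pass over chunks that are already sorted best-first. `pack_chunks` is a hypothetical name, and `count_tokens` is whatever counting function you supply. The sketch ignores the few tokens used by the separators.

```python
from typing import Callable, List


def pack_chunks(
    instructions: str,
    ranked_chunks: List[str],
    max_input: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Add chunks in rank order, skipping any that would exceed max_input."""
    parts = [instructions]
    used = count_tokens(instructions)
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_input:
            continue  # too large for the remaining budget; a later chunk may fit
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)
```

Skipping an oversized chunk instead of stopping lets smaller, lower-ranked passages fill the remaining space. Stopping at the first overflow is a reasonable alternative if strict rank order matters more.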

Code review bot

prompt = diff + fileContents + rubric

Large diffs can exceed limits unless you chunk the review.

💡 All tools run in your browser. No data is sent to any server.

Frequently Asked Questions

What is an LLM token limit?

An LLM token limit is the size cap for the combined prompt and generated output in a single request. Providers expose it as a context window or maximum token count.

What happens if I exceed the token limit?

The provider may reject the request, truncate context, or force you to reduce output size. The exact behavior depends on the API and model.
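Because provider behavior differs, it can help to check the budget locally before sending anything. This is a sketch under assumed names (`TokenBudgetError`, `check_request`), and it expects token counts you have already measured.

```python
class TokenBudgetError(ValueError):
    """Raised locally when a request would not fit the context window."""


def check_request(prompt_tokens: int, max_output_tokens: int, context_window: int) -> int:
    """Return the remaining headroom, or raise before the request is sent."""
    total = prompt_tokens + max_output_tokens
    if total > context_window:
        raise TokenBudgetError(
            f"request needs {total} tokens but the window is {context_window}"
        )
    return context_window - total
```

Failing early in your own code gives a clear error message instead of depending on how a particular API reports or handles the overflow.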

Why should I reserve output tokens?

If you use the entire context window for the input, the model has no room left to answer. Reserving output tokens prevents incomplete or failed generations.
