M02: Tokens — The Atoms of AI Communication
Every interaction with Claude — every cost, every limit, every performance decision — traces back to tokens. This module gives you a hands-on understanding of what tokens are and why they're the single most important unit of measurement in agent engineering.
Learning Objectives
- Explain what tokens are and how text is split into them via byte-pair encoding
- Calculate the cost of an API call based on input and output token counts
- Describe the context window and how it's shared between input and output
- Use the Anthropic SDK to count tokens programmatically
- Build a token budget calculator that prevents context window overflow
What Are Tokens?
BEFORE: Imagine trying to teach a child to read by showing them entire paragraphs at once — no letters, no syllables, just raw walls of text. The child has no way to break the language into manageable pieces, so learning never gets off the ground.
PAIN: Computers face the same problem with raw text. A string like "understanding" is just a sequence of bytes to a machine — it has no inherent structure, no meaning, and no efficient way to represent the millions of possible words across every language.
MAPPING: Tokens solve this exactly the way syllables solve reading for children: they break text into reusable chunks of just the right size. When you read "understanding," your brain splits it into syllables: un-der-stand-ing. When Claude reads it, the tokenizer (a preprocessing algorithm that converts raw text into a sequence of token IDs before the language model can process it; different models use different tokenizers) splits it into tokens: "understand" + "ing". Common words like "the" are a single token. Rare words get split into smaller pieces. An emoji might be 2–3 tokens. This chunking lets the model represent any text using a fixed vocabulary of ~100K pieces: efficient, consistent, and universal.
What this actually looks like in practice: When you send the sentence "Claude is helpful" to the API, the tokenizer converts it into a sequence of numeric IDs before the model ever sees it. Here's the before and after:
# What you send (text):
"Claude is helpful"
# What the model actually receives (token IDs; the values here are illustrative):
[51423, 318, 10950]
# Three tokens: "Claude" → 51423, "is" → 318, "helpful" → 10950
# The model works ONLY with these numbers, never with raw text.
These chunks are created during training by a process called byte-pair encoding (BPE), a compression algorithm that iteratively merges the most common pairs of characters or character sequences in the training data. Here is how BPE works in plain English: the algorithm scans billions of words of training text and finds the character pairs that appear most often (like "t" + "h"). It merges those into a single piece ("th"), then repeats, merging "th" + "e" into "the," and so on. After thousands of merges, you get a vocabulary (the complete set of tokens a model knows) of ~100K reusable subword pieces. Any text can be represented as a sequence of tokens from this fixed set.
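The merge loop described above can be sketched in a few lines. This is a toy, character-level illustration on a five-word corpus, not Claude's actual tokenizer: production tokenizers work on bytes and far larger corpora, and the function names here (most_frequent_pair, merge_pair, train_bpe) are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("the the then this that", num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
print(words)
```

On this corpus the first merge is "t" + "h" (the most frequent pair) and the second is "th" + "e", mirroring the example in the text. Each learned merge becomes one entry in the vocabulary.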
Why this approach? Because a fixed vocabulary means the model always works with the same set of building blocks. Each token maps to a numeric ID (for example, "the" = 487). Before Claude reads your message, the tokenizer converts your text into a sequence of these IDs. After Claude generates a response, the IDs are converted back into text. The model never sees raw characters — only token IDs, from start to finish.
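Here's what that means in dollars. A customer-support agent that sends 2,000 input tokens and receives 500 output tokens per interaction costs roughly $0.014 per call on Claude Sonnet 4.6 (a figure consistent with rates of $3 per million input tokens and $15 per million output tokens; check current pricing). At 50,000 interactions per day, that adds up to roughly $700/day. A 20% reduction in prompt tokens, achieved by trimming verbose system prompts, cuts only the input share of that bill: 400 fewer input tokens per call saves about $60/day, or $1,800/month, at those rates.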
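The arithmetic behind an estimate like this can be sketched as a small function. The prices below ($3 per million input tokens, $15 per million output tokens) are assumptions chosen because they reproduce the roughly $0.014-per-call figure; check the current pricing page before relying on them. The function name call_cost is hypothetical.

```python
def call_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.00,    # assumed price, USD per million input tokens
    output_price_per_mtok: float = 15.00,  # assumed price, USD per million output tokens
) -> float:
    """Return the cost in USD of one API call."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


CALLS_PER_DAY = 50_000

per_call = call_cost(2_000, 500)
daily = per_call * CALLS_PER_DAY

# Trimming the prompt by 20% removes 400 input tokens; output is unchanged.
trimmed_per_call = call_cost(1_600, 500)
daily_savings = (per_call - trimmed_per_call) * CALLS_PER_DAY

print(f"Per call:      ${per_call:.4f}")
print(f"Per day:       ${daily:,.2f}")
print(f"Saved per day: ${daily_savings:,.2f}")
```

Note that trimming prompt tokens reduces only the input part of the bill, so a 20% shorter prompt saves less than 20% of the total cost.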
Tokens also determine whether your call succeeds or fails. If your agent loop accumulates 190K tokens of conversation history in a 200K context window, Claude has only ~10K tokens left for its response — about 7,500 words. Exceed that limit, and the API returns a hard error. There is no graceful fallback, no automatic truncation. The call simply fails.
"Tokens are just words, right?" — Not quite. Common words like "the" are a single token, but longer words get split: "understanding" becomes "understand" + "ing" (2 tokens). And short words can merge with surrounding punctuation. The mapping is learned statistically by BPE, not based on dictionary words.
"One token = one character?" — No. In English, one token averages about 4 characters. But this varies wildly: a space is 1 token (1 char), while "Claude" is 1 token (6 chars). Non-English text and emojis use far more tokens per visible character — a single emoji might cost 2–3 tokens.
"Tokens only matter for billing." — Billing is the most visible impact, but tokens also determine whether your call succeeds (context window limits), how fast the response arrives (more tokens = more latency), and how well Claude can reason (information buried in a sea of tokens gets less attention).
"I can just use a bigger context window if I run out." — All current Claude models share the same 200K token limit. There is no "upgrade to a bigger window" option. When you hit the wall, you must actively manage what goes in: summarize history, trim system prompts, or prune irrelevant context.
"Token counts are the same across all models." — Different model families can use different tokenizers. A sentence that's 10 tokens on Claude might be 12 tokens on GPT-4. Always measure with the tokenizer for the specific model you're calling — don't assume counts transfer across providers.
Why Tokens Matter
Tokens impact three things you'll manage constantly as an agent engineer:
- Cost: You pay per token. Output tokens (the tokens Claude generates in its response) cost more than input tokens (the tokens you send, including the system prompt and conversation history). Input is processed in parallel by the attention mechanism, while output is generated autoregressively: each new token depends on all previously generated tokens, so Claude must run a full computation for every single output token.
- Limits: The context window (the maximum total number of tokens that can fit in a single API call) is measured in tokens. Everything must fit: the system prompt (a developer-provided instruction sent with every API call that sets Claude's role, behavior, and constraints; it's invisible to the end user but counts toward your input tokens), history, your message, and Claude's response.
- Performance: More tokens = slower response time + higher cost. Token efficiency is a core engineering skill.
These three concerns are deeply interconnected. When you optimize for cost (by writing shorter prompts), you also free up context window space. And shorter prompts mean fewer tokens to process, so responses arrive faster too. In other words, one optimization improves all three dimensions at once.
The reverse is also true. A bloated system prompt costs more per call. It also eats into your context window, leaving less room for conversation history. And it adds latency, because every extra token takes time to process. This is why tokens are the single most important unit of measurement in agent engineering — they're the common currency behind cost, capacity, and speed.
The Context Window
BEFORE: Imagine working on a project where you could spread out unlimited papers across an infinitely large desk — every email, every note, every reference document visible at once. You would never have to choose what to keep in front of you because everything just fits.
PAIN: In reality, your desk is finite. When it fills up, you cannot just pile more papers on top — the stack topples, you lose track of critical documents, and your work grinds to a halt. You have to actively decide what stays on the desk and what goes back in the filing cabinet.
MAPPING: The context window is Claude's desk, and it has a hard size limit (200K tokens for current Claude models). Everything — your system prompt, the conversation history, the current message, AND Claude's response — has to fit on this desk simultaneously. When the desk is full, something has to go. And unlike a human who can glance at a filing cabinet, Claude can only work with what is on the desk right now — there is no "memory" outside the context window unless you build one yourself.
What this actually looks like: Here's a real API call. Notice how every piece — system prompt, history, user message — shares the same 200K-token budget. Claude's response has to fit in whatever space is left over. And here's the error you'll get when the desk overflows:
# When you exceed the context window, you get this error — no graceful fallback:
# anthropic.BadRequestError: 400 - {"error": {"type": "invalid_request_error",
# "message": "prompt is too long: 203847 tokens > 200000 maximum"}}
Now here's the normal case — everything fits on the desk:
# Everything inside this call shares ONE 200K-token context window:
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,                                    # ← caps Claude's response
    system="You are a helpful assistant.",              # ← ~10 tokens
    messages=[
        {"role": "user", "content": "Hi!"},             # ← ~3 tokens
        {"role": "assistant", "content": "Hello!..."},  # ← ~5 tokens
        {"role": "user", "content": "Explain RAG."},    # ← ~8 tokens
    ],
    # Total input: ~26 tokens → 199,974 tokens remain for response
    # But if history were 195K tokens → only 5K left for response!
)
The key insight is that this window is shared between everything in the call. Your system prompt, the full conversation history, the user's latest message, and Claude's response all draw from the same 200K-token pool:
[system prompt] + [conversation history] + [user message] + [response]. This is why the budget formula later in this module matters so much: every token you spend on input is a token Claude cannot use for output.
What happens if you exceed the limit? The API returns an error immediately. It does not silently truncate your messages or summarize old turns for you. Managing what fits in the window is entirely your responsibility as the developer, and it is one of the most common sources of bugs in agent systems.
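One simple way to take on that responsibility is to drop the oldest turns once the conversation passes a utilization target. The sketch below is a minimal illustration, not a recommended production policy: it assumes message content is a plain string, uses the rough 4-characters-per-token heuristic instead of exact counts, and the names estimate_tokens and trim_history are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token in English prose."""
    return (len(text) + 3) // 4


def trim_history(messages, system_prompt="", max_context=200_000, target_utilization=0.6):
    """Drop the oldest user/assistant pair until the estimate fits the target.

    Assumes `messages` alternates user/assistant and starts with a user turn,
    so removing two at a time keeps a user message first.
    """
    budget = int(max_context * target_utilization)
    kept = list(messages)  # copy so the caller's list is not modified

    def total_tokens(msgs):
        return estimate_tokens(system_prompt) + sum(
            estimate_tokens(m["content"]) for m in msgs
        )

    while len(kept) > 2 and total_tokens(kept) > budget:
        del kept[:2]
    return kept


# Five turns of ~100 estimated tokens each, squeezed into a tiny demo window.
history = [
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "x" * 400},
]
trimmed = trim_history(history, max_context=600, target_utilization=0.5)
print(len(history), "->", len(trimmed))  # 5 -> 3
```

Dropping turns loses information, which is why summarization and retrieval are often layered on top of a policy like this.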
How is this different from human memory? When you have a conversation, you naturally forget some details but retain the gist, because your brain compresses automatically. Claude has no such mechanism. It remembers everything inside the context window with perfect fidelity, but has zero memory of anything outside it. There is no "I vaguely remember"; it's all or nothing. This is why developers build explicit memory systems (summarization, retrieval, external databases) on top of the context window, which you'll learn in M08 and M11.
"The response gets its own separate context window." No. Input and output share the same 200K budget. If you've used 195K on input, Claude only has 5K tokens (~3,750 words) left for its response. max_tokens is a required API parameter that caps how many tokens Claude can generate in its response; if the response reaches this limit, it's cut off mid-sentence, and you pay only for tokens actually generated, not the full max_tokens budget. If you set max_tokens higher than the remaining space, you'll get an error.
"Claude remembers our previous conversations." — It doesn't. Each API call is completely independent. The only reason Claude seems to "remember" is that your application sends the full conversation history as input every time. No history in the request = no memory of past turns.
"The system prompt doesn't count toward the token limit." — It absolutely does. A 2,000-token system prompt is 2,000 tokens of your 200K budget, consumed on every single call. Over a 50-turn conversation, that system prompt is sent 50 times — costing you 100K tokens of billing even though the text never changes.
"I'll just summarize when I get close to the limit." — By then it's often too late. Summarization itself requires an API call that also needs context window space. If you're at 195K tokens, you don't have room to ask Claude to summarize. Start managing your budget proactively at 60–70% utilization, not at 95%.
Token Counting in Practice
To manage your token budget, you need to measure token usage. There are two ways to do this: estimation (fast but approximate) and exact counting via the API (slower but precise).
Estimation is useful when you need a quick sanity check. The "1 token ≈ 4 characters" rule works well for English prose, but it breaks down in predictable ways. Code tends to use more tokens per character because of short variable names and heavy punctuation. Non-English languages — especially CJK scripts — can use 2–3x more tokens per word than English. Emojis and special Unicode characters are surprisingly expensive, sometimes costing 2–3 tokens for a single visible character.
For production systems, estimation isn't enough. The Anthropic SDK gives you two sources of exact counts: the usage field on every API response (which tells you what you just spent) and the dedicated count_tokens endpoint (which tells you what you're about to spend, before making the call). The code walkthrough below shows both approaches.
Rules of Thumb
- ~1 token = ~4 characters in English (or ~0.75 words)
- Code is typically more token-dense than prose (more punctuation, short variable names)
- Non-English text and special characters use more tokens per word
- The usage field in every API response gives you exact counts
The Token Budget Formula
available_for_response = context_window - system_prompt_tokens - history_tokens - user_message_tokens
# Example: 200,000 - 800 - 50,000 - 1,200 = 148,000 tokens available for response
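As a quick check of the formula, here is the same arithmetic as a small function, using the example numbers above. The function names and the idea of clamping max_tokens to the remaining space are illustrative assumptions, not part of the SDK.

```python
def available_for_response(
    context_window: int,
    system_prompt_tokens: int,
    history_tokens: int,
    user_message_tokens: int,
) -> int:
    """Apply the budget formula; a negative result means the input alone overflows."""
    return context_window - system_prompt_tokens - history_tokens - user_message_tokens


def safe_max_tokens(desired: int, available: int) -> int:
    """Clamp the requested response size to what actually fits (never below 0)."""
    return max(0, min(desired, available))


available = available_for_response(200_000, 800, 50_000, 1_200)
print(available)                          # 148000
print(safe_max_tokens(4_096, available))  # 4096
print(safe_max_tokens(4_096, 3_000))      # 3000
```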
Code Walkthrough: Token Counting & Budget Management
Counting Tokens with the SDK
Let's start with the simplest approach: make an API call and read the token counts from the response. Every response from Claude includes a usage field — for free, on every single call — that tells you exactly how many input and output tokens were consumed. No separate endpoint needed, no extra cost. This is the fastest way to monitor spending and debug budget issues in real time.
One thing that trips people up: the usage.input_tokens count includes your system prompt, not just the user message. If your input count seems higher than expected, that's almost certainly why. Check your system prompt length first before looking for other causes.
# Python
# pip install "anthropic>=0.30.0"
import anthropic

client = anthropic.Anthropic()

try:
    # Make a call and inspect token usage
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=[
            {"role": "user", "content": "Explain what tokens are in 2 sentences."}
        ],
    )
    # The usage field tells you exact token counts
    print(f"Input tokens: {message.usage.input_tokens}")
    print(f"Output tokens: {message.usage.output_tokens}")
    print(f"Total tokens: {message.usage.input_tokens + message.usage.output_tokens}")
    print(f"\nResponse: {message.content[0].text}")
except anthropic.APIStatusError as e:
    # HTTP errors (4xx/5xx) carry a status code
    print(f"API error: {e.status_code} - {e.message}")
except anthropic.APIError as e:
    # Other API errors, such as connection failures, have no status code
    print(f"API error: {e.message}")
// Node.js
// npm install @anthropic-ai/sdk@^0.30.0
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant.',
    messages: [
      { role: 'user', content: 'Explain what tokens are in 2 sentences.' }
    ]
  });
  // The usage field tells you exact token counts
  console.log(`Input tokens: ${message.usage.input_tokens}`);
  console.log(`Output tokens: ${message.usage.output_tokens}`);
  console.log(`Total tokens: ${message.usage.input_tokens + message.usage.output_tokens}`);
  console.log(`\nResponse: ${message.content[0].text}`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}
The usage field shows that, in this example, the short prompt used 26 input tokens and Claude's two-sentence response used 42 output tokens, 68 in total (your exact counts will vary). Notice the ratio: 26 tokens of input produced 42 tokens of output. In a real agent loop with tool calls, you might send 5,000 input tokens and get 2,000 output tokens per step; across 6 steps, that is 42,000 tokens for a single user request. This is why monitoring usage on every call matters.
Building a Token Budget Calculator
Now for the more interesting part: checking the budget before you make a call. The previous example told you what you spent after the fact — useful for monitoring, but it can't prevent a failure. In production agents, you need to know whether the call will even fit before you send it. If you blindly send a request that overflows the context window, the API returns an error, and your user sees a failure with no explanation.
The function below uses the SDK's count_tokens endpoint to measure a full conversation (system prompt + message history), then calculates how much room is left. Think of it as a pre-flight check for every API call — the same way a pilot checks fuel before takeoff, not after landing.
Here's a tradeoff worth knowing about: count_tokens is a separate API call, which adds ~50–100ms of latency. For high-throughput systems, that overhead on every call adds up. A practical compromise is to use the rough "1 token ≈ 4 characters" estimate for quick checks. Then only call count_tokens when your estimate suggests you're above 70% utilization — that's when precision matters most and when getting it wrong means a crash.
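The compromise described above might look like the sketch below. To keep it runnable without an API key, the exact counter is passed in as a function; in a real system it would wrap the SDK's count_tokens call. The names estimate_tokens and input_tokens_for_budget are hypothetical, and the sketch assumes every message's content is a plain string.

```python
from typing import Callable


def estimate_tokens(system_prompt: str, messages: list[dict]) -> int:
    """Rough heuristic: about 4 characters per token in English prose."""
    chars = len(system_prompt) + sum(len(m["content"]) for m in messages)
    return chars // 4


def input_tokens_for_budget(
    system_prompt: str,
    messages: list[dict],
    exact_counter: Callable[[str, list[dict]], int],
    max_context: int = 200_000,
    exact_threshold: float = 0.7,
) -> tuple[int, str]:
    """Use the cheap estimate until it suggests high utilization, then count exactly."""
    estimate = estimate_tokens(system_prompt, messages)
    if estimate / max_context < exact_threshold:
        return estimate, "estimate"
    return exact_counter(system_prompt, messages), "exact"


def fake_exact_counter(system_prompt: str, messages: list[dict]) -> int:
    """Stand-in for a wrapper around the count_tokens endpoint (returns a fixed value)."""
    return 151_234


small = [{"role": "user", "content": "Hello, how are you?"}]
large = [{"role": "user", "content": "x" * 600_000}]

print(input_tokens_for_budget("You are helpful.", small, fake_exact_counter))
print(input_tokens_for_budget("", large, fake_exact_counter))
```

The first call stays on the estimate path; the second estimates 150,000 tokens (75% of the window), so it falls through to the exact counter.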
# Python
import anthropic

client = anthropic.Anthropic()


def check_token_budget(
    system_prompt: str,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    max_context: int = 200_000,
    desired_response_tokens: int = 4096,
) -> dict:
    """Check if a conversation fits within the token budget."""
    try:
        # Count tokens for the full request using the API
        count_response = client.messages.count_tokens(
            model=model,
            system=system_prompt,
            messages=messages,
        )
        input_tokens = count_response.input_tokens
    except anthropic.APIError as e:
        return {"error": f"Token counting failed: {e.message}"}

    available = max_context - input_tokens
    fits = available >= desired_response_tokens
    return {
        "input_tokens": input_tokens,
        "available_for_response": available,
        "desired_response_tokens": desired_response_tokens,
        "fits": fits,
        "utilization_pct": round((input_tokens / max_context) * 100, 1),
        "warning": None if fits else (
            f"Only {available} tokens left for response, "
            f"but {desired_response_tokens} requested. "
            f"Trim history or reduce max_tokens."
        ),
    }


# Usage example
system = "You are a helpful coding assistant. Always provide complete, runnable examples."
conversation = [
    {"role": "user", "content": "Write a Python function to sort a list."},
    {"role": "assistant", "content": "def sort_list(items): return sorted(items)"},
    {"role": "user", "content": "Now make it handle None values."},
]
budget = check_token_budget(system, conversation)
if "error" in budget:
    # Counting failed, so the other keys are absent
    print(budget["error"])
else:
    print(f"Input tokens: {budget['input_tokens']}")
    print(f"Available: {budget['available_for_response']}")
    print(f"Utilization: {budget['utilization_pct']}%")
    print(f"Fits: {budget['fits']}")
    if budget.get("warning"):
        print(f"WARNING: {budget['warning']}")
// Node.js
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function checkTokenBudget({
  systemPrompt,
  messages,
  model = 'claude-sonnet-4-6',
  maxContext = 200_000,
  desiredResponseTokens = 4096,
}) {
  try {
    // Count tokens for the full request using the API
    const countResponse = await client.messages.countTokens({
      model,
      system: systemPrompt,
      messages,
    });
    const inputTokens = countResponse.input_tokens;
    const available = maxContext - inputTokens;
    const fits = available >= desiredResponseTokens;
    return {
      inputTokens,
      availableForResponse: available,
      desiredResponseTokens,
      fits,
      utilizationPct: ((inputTokens / maxContext) * 100).toFixed(1),
      warning: fits ? null : (
        `Only ${available} tokens left for response, ` +
        `but ${desiredResponseTokens} requested. ` +
        `Trim history or reduce max_tokens.`
      ),
    };
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      return { error: `Token counting failed: ${error.message}` };
    }
    throw error;
  }
}

// Usage example
const budget = await checkTokenBudget({
  systemPrompt: 'You are a helpful coding assistant. Always provide complete, runnable examples.',
  messages: [
    { role: 'user', content: 'Write a function to sort a list.' },
    { role: 'assistant', content: 'function sortList(items) { return [...items].sort(); }' },
    { role: 'user', content: 'Now make it handle null values.' },
  ],
});
console.log(`Input tokens: ${budget.inputTokens}`);
console.log(`Available: ${budget.availableForResponse}`);
console.log(`Utilization: ${budget.utilizationPct}%`);
console.log(`Fits: ${budget.fits}`);
if (budget.warning) console.log(`WARNING: ${budget.warning}`);
If the remaining space is smaller than the response you asked for, the warning field in the return value catches exactly that scenario. In the modules ahead (especially M08: Conversation Management), you will wire this function into your agent loop so it runs automatically before every API call.
Hands-On Exercise: Build a Token-Aware Prompt Function
What You'll Build
A reusable safe_chat() function that checks your token budget before every API call and warns you when space is running low. You'll use this utility throughout the rest of the course.
Time estimate: 20–30 minutes • Prerequisites: Completed M01 lab (API key set, SDK installed) • Files you'll create: token_tools.py (or token_tools.mjs)
Environment Setup
If you completed the M01 lab, you're already set up. Otherwise, run this first:
# Python
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY="your-key-here"
# Node.js
npm install @anthropic-ai/sdk@^0.30.0
export ANTHROPIC_API_KEY="your-key-here"
Step 1: Count Tokens for a System Prompt
Before you can manage a budget, you need to measure what you're spending. This step builds a small helper that takes a system prompt and a list of messages, and returns the exact token count using the SDK's count_tokens endpoint. This is the foundation for all budget logic.
Create a new file called token_tools.py (or token_tools.mjs):
# Python
import anthropic

client = anthropic.Anthropic()


def count_input_tokens(
    system_prompt: str,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
) -> int:
    """Count the exact number of input tokens for a request."""
    result = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=messages,
    )
    return result.input_tokens


# Test it
system = "You are a helpful assistant."
msgs = [{"role": "user", "content": "Hello, how are you?"}]
count = count_input_tokens(system, msgs)
print(f"Input tokens: {count}")
// Node.js
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function countInputTokens(systemPrompt, messages, model = 'claude-sonnet-4-6') {
  const result = await client.messages.countTokens({
    model,
    system: systemPrompt,
    messages,
  });
  return result.input_tokens;
}

// Test it
const count = await countInputTokens(
  'You are a helpful assistant.',
  [{ role: 'user', content: 'Hello, how are you?' }]
);
console.log(`Input tokens: ${count}`);
Run it: python token_tools.py (or node token_tools.mjs)
Troubleshooting — Step 1
- ModuleNotFoundError: No module named 'anthropic'. Run pip install "anthropic>=0.30.0" (Python) or npm install @anthropic-ai/sdk (Node.js).
- AuthenticationError: Invalid API Key. Check that your ANTHROPIC_API_KEY environment variable is set. Run echo $ANTHROPIC_API_KEY to verify it's not empty.
- AttributeError: 'Messages' object has no attribute 'count_tokens'. Your SDK is too old. Run pip install --upgrade anthropic to get version 0.30.0+.
Step 2: Calculate Remaining Budget
Now let's build the budget calculator. This function uses the token count from Step 1 and subtracts it from the 200K context window to tell you how much room is left for Claude's response. This is the "pre-flight check" you'll wire into every agent loop.
Add the following function to token_tools.py (after the count_input_tokens function from Step 1):
# Python
def check_budget(
    system_prompt: str,
    messages: list[dict],
    desired_output: int = 4096,
    max_context: int = 200_000,
) -> dict:
    """Check if a conversation fits within the token budget."""
    input_tokens = count_input_tokens(system_prompt, messages)
    available = max_context - input_tokens
    fits = available >= desired_output
    utilization = round((input_tokens / max_context) * 100, 1)
    return {
        "input_tokens": input_tokens,
        "available": available,
        "fits": fits,
        "utilization_pct": utilization,
        "warning": None if fits else (
            f"Only {available} tokens left, but {desired_output} requested. "
            f"Trim history or reduce max_tokens."
        ),
    }


# Test with a short conversation
budget = check_budget(
    "You are a helpful assistant.",
    [
        {"role": "user", "content": "What are tokens?"},
        {"role": "assistant", "content": "Tokens are subword units that LLMs process."},
        {"role": "user", "content": "How do they affect cost?"},
    ],
)
print(f"Input tokens: {budget['input_tokens']}")
print(f"Available: {budget['available']}")
print(f"Utilization: {budget['utilization_pct']}%")
print(f"Fits: {budget['fits']}")
// Node.js
async function checkBudget(systemPrompt, messages, desiredOutput = 4096, maxContext = 200_000) {
  const inputTokens = await countInputTokens(systemPrompt, messages);
  const available = maxContext - inputTokens;
  const fits = available >= desiredOutput;
  const utilization = ((inputTokens / maxContext) * 100).toFixed(1);
  return {
    inputTokens,
    available,
    fits,
    utilizationPct: utilization,
    warning: fits ? null : (
      `Only ${available} tokens left, but ${desiredOutput} requested. ` +
      `Trim history or reduce max_tokens.`
    ),
  };
}

// Test with a short conversation
const budget = await checkBudget(
  'You are a helpful assistant.',
  [
    { role: 'user', content: 'What are tokens?' },
    { role: 'assistant', content: 'Tokens are subword units that LLMs process.' },
    { role: 'user', content: 'How do they affect cost?' },
  ]
);
console.log(`Input tokens: ${budget.inputTokens}`);
console.log(`Available: ${budget.available}`);
console.log(`Utilization: ${budget.utilizationPct}%`);
console.log(`Fits: ${budget.fits}`);
Run it: python token_tools.py (or node token_tools.mjs)
If you see Fits: True with a low utilization percentage, Step 2 is working. The 3-turn conversation uses less than 0.1% of the context window, which leaves plenty of room. But imagine this at turn 200 of a long support chat.
Troubleshooting — Step 2
- NameError: name 'count_input_tokens' is not defined. Make sure both count_input_tokens (from Step 1) and check_budget are in the same file, and that Step 1's code appears above Step 2's.
- Output shows Fits: False unexpectedly. Check that your max_context parameter is 200,000 (not 200). Python uses underscores in number literals: 200_000.
- Token count seems too high. Remember that count_tokens includes the system prompt tokens, not just the messages. A system prompt of "You are a helpful assistant." adds ~10 tokens.
Step 3: Build the Safe Chat Function
Now let's combine everything into a single safe_chat() function that checks the budget, warns if space is tight, and refuses to make the call if it would overflow. This is the utility you'll reuse throughout the course. It uses check_budget from Step 2.
Add this function to token_tools.py:
# Python
def safe_chat(
    system_prompt: str,
    messages: list[dict],
    max_tokens: int = 1024,
    warn_threshold: float = 0.7,
) -> str:
    """Make an API call with automatic token budget checking."""
    budget = check_budget(system_prompt, messages, desired_output=max_tokens)

    # Warn if utilization is high
    if budget["utilization_pct"] > warn_threshold * 100:
        print(f"⚠️ Token budget warning: {budget['utilization_pct']}% used")
        print(f"   Only {budget['available']} tokens left for response")

    # Refuse if it won't fit
    if not budget["fits"]:
        raise ValueError(
            f"Token budget exceeded: {budget['input_tokens']} input tokens, "
            f"only {budget['available']} left. {budget['warning']}"
        )

    # Make the call
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    )
    print(f"📊 Tokens: {response.usage.input_tokens} in, {response.usage.output_tokens} out "
          f"({budget['utilization_pct']}% window used)")
    return response.content[0].text


# Test it
result = safe_chat(
    "You are a helpful assistant. Keep answers brief.",
    [{"role": "user", "content": "What's the tallest mountain on Earth?"}],
)
print(f"\nResponse: {result}")
// Node.js
async function safeChat(systemPrompt, messages, maxTokens = 1024, warnThreshold = 0.7) {
  const budget = await checkBudget(systemPrompt, messages, maxTokens);

  // Warn if utilization is high
  if (parseFloat(budget.utilizationPct) > warnThreshold * 100) {
    console.log(`⚠️ Token budget warning: ${budget.utilizationPct}% used`);
    console.log(`   Only ${budget.available} tokens left for response`);
  }

  // Refuse if it won't fit
  if (!budget.fits) {
    throw new Error(
      `Token budget exceeded: ${budget.inputTokens} input tokens, ` +
      `only ${budget.available} left. ${budget.warning}`
    );
  }

  // Make the call
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: maxTokens,
    system: systemPrompt,
    messages,
  });
  console.log(`📊 Tokens: ${response.usage.input_tokens} in, ${response.usage.output_tokens} out ` +
    `(${budget.utilizationPct}% window used)`);
  return response.content[0].text;
}

// Test it
const result = await safeChat(
  'You are a helpful assistant. Keep answers brief.',
  [{ role: 'user', content: "What's the tallest mountain on Earth?" }],
);
console.log(`\nResponse: ${result}`);
Run it: python token_tools.py (or node token_tools.mjs)
Your safe_chat function now automatically checks the budget, warns at 70% utilization, and refuses calls that would overflow the context window.
Troubleshooting
- NameError: name 'count_input_tokens' is not defined. Make sure all three functions (count_input_tokens, check_budget, safe_chat) are in the same file, and the client = anthropic.Anthropic() line is at the top.
- AttributeError: 'Messages' object has no attribute 'count_tokens'. You need SDK version 0.30.0 or later. Run pip install --upgrade anthropic (or npm install @anthropic-ai/sdk@latest).
- Warning shows even for small conversations. Check that your warn_threshold is 0.7 (70%), not 0.007. It's a fraction, not a percentage.
Step 4 (Stretch Goal): Fill the Context Window
Nothing teaches token budget management like watching it happen in real time. The code below runs a loop that sends messages to Claude, one after another, and prints the token count after each turn. You'll see input_tokens climb with every turn. Because the answers are capped at about 50 words, 100 turns will probably stay well below the 150K warning threshold in the code; lengthen the answers or raise the turn count if you want to reach it.
The interesting part is the growth pattern. Each turn adds both the new user message and Claude's previous response to the conversation history, so the input tokens per call grow roughly linearly with the turn number. But because the whole history is resent on every call, the cumulative input tokens you are billed for grow roughly quadratically. By turn 30, each call might carry several times as many input tokens as turn 5 did.
A fair warning: this exercise makes many API calls in a loop, so it will cost real money (~$0.50–$2.00 depending on how far it runs). If you'd rather observe the pattern without the cost, just read the expected output below instead of running the code.
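If you want to see the shape of the growth without spending anything, it can be simulated offline. The per-turn sizes below (20 tokens of system prompt, 12 per user message, 60 per reply) are assumed round numbers, not measurements, and simulate_growth is a hypothetical helper.

```python
def simulate_growth(turns, system_tokens=20, user_tokens=12, reply_tokens=60):
    """Return (turn, input_tokens_this_call, cumulative_input_tokens) per turn.

    Every call resends the system prompt and the full history so far.
    """
    rows = []
    history = 0      # tokens of earlier user messages and replies
    cumulative = 0   # total input tokens billed across all calls
    for turn in range(1, turns + 1):
        input_tokens = system_tokens + history + user_tokens
        cumulative += input_tokens
        rows.append((turn, input_tokens, cumulative))
        history += user_tokens + reply_tokens
    return rows


rows = simulate_growth(30)
for turn, per_call, total in (rows[0], rows[4], rows[29]):
    print(f"Turn {turn:>2}: {per_call:>5} input tokens this call, {total:>6} cumulative")
# Turn  1:    32 input tokens this call,     32 cumulative
# Turn  5:   320 input tokens this call,    880 cumulative
# Turn 30:  2120 input tokens this call,  32280 cumulative
```

Under these assumptions the per-call input grows by a constant 72 tokens each turn, while the cumulative total grows much faster because every call pays for the whole history again.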
Save this as a separate file (e.g., fill_window.py or fill_window.mjs) — it has its own import and client setup, so do not paste it into token_tools.py from the previous steps.
import anthropic

client = anthropic.Anthropic()

# Build a conversation that grows until the budget runs out
conversation = []
system = "You are a helpful assistant. Keep your answers brief (under 50 words)."

for i in range(1, 101):  # 100 turns, matching the JavaScript version
    conversation.append({
        "role": "user",
        "content": f"Tell me fact #{i} about space exploration."
    })
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            system=system,
            messages=conversation,
        )
        # Store the reply so the next call resends the full history
        assistant_msg = response.content[0].text
        conversation.append({"role": "assistant", "content": assistant_msg})
        total = response.usage.input_tokens + response.usage.output_tokens
        print(f"Turn {i}: {response.usage.input_tokens} in + {response.usage.output_tokens} out = {total} total")
        # Watch the input tokens grow with each turn
        if response.usage.input_tokens > 150_000:
            print(f"\n--- Budget warning! Input alone is {response.usage.input_tokens} tokens ---")
            break
    except anthropic.APIError as e:
        print(f"\nHit the limit at turn {i}: {e.message}")
        break
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const conversation = [];
const system = 'You are a helpful assistant. Keep your answers brief (under 50 words).';

for (let i = 1; i <= 100; i++) {
  conversation.push({
    role: 'user',
    content: `Tell me fact #${i} about space exploration.`,
  });
  try {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 200,
      system,
      messages: conversation,
    });
    const assistantMsg = response.content[0].text;
    conversation.push({ role: 'assistant', content: assistantMsg });
    const total = response.usage.input_tokens + response.usage.output_tokens;
    console.log(`Turn ${i}: ${response.usage.input_tokens} in + ${response.usage.output_tokens} out = ${total} total`);
    if (response.usage.input_tokens > 150_000) {
      console.log(`\n--- Budget warning! Input alone is ${response.usage.input_tokens} tokens ---`);
      break;
    }
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.log(`\nHit the limit at turn ${i}: ${error.message}`);
    } else {
      throw error;
    }
    break;
  }
}
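The cost warning for this exercise can be sanity-checked offline before running the loop. The sketch below assumes the input grows by a fixed amount each turn. The function name, the token sizes, and especially the two prices are placeholders chosen for illustration, not quoted rates, so substitute current pricing for your model.

```python
# Placeholder prices in USD per million tokens. These are assumptions for
# illustration, not quoted rates; substitute the current prices for your model.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_loop_cost(turns, first_input, growth_per_turn, output_per_turn,
                       input_rate=INPUT_USD_PER_MTOK, output_rate=OUTPUT_USD_PER_MTOK):
    """Estimate total USD for a loop whose input grows by a fixed amount each turn."""
    # Call t (0-based) sends first_input + growth_per_turn * t input tokens.
    total_input = sum(first_input + growth_per_turn * t for t in range(turns))
    total_output = output_per_turn * turns
    return (total_input * input_rate + total_output * output_rate) / 1_000_000


if __name__ == "__main__":
    # Hypothetical sizes: 35 input tokens on the first call, 85 more per turn,
    # and about 70 output tokens per reply.
    cost = estimate_loop_cost(turns=100, first_input=35, growth_per_turn=85, output_per_turn=70)
    print(f"Estimated cost for 100 turns: ${cost:.2f}")
```

With these placeholder numbers the estimate comes to about $1.38 for 100 turns. Almost all of it is input cost, because every call resends the whole history.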
Verify Everything Works
Run your complete token_tools.py file to confirm all three functions work together:
python token_tools.py # or: node token_tools.mjs
You should see output from all three steps: a token count, a budget check, and a successful safe_chat response with usage stats.
You now have a safe_chat() function that prevents context window overflow. You'll use this pattern in M08 (Conversation Management) and in every capstone project. The key skill you've built: never make an API call without knowing your token budget first.
Knowledge Check
Test your understanding of tokens, costs, and context windows.
Q1: Approximately how many tokens is the sentence "Hello, how are you today?"
Q2: Why are output tokens more expensive than input tokens?
Q3: Your system prompt is 1,000 tokens, conversation history is 150,000 tokens, and your user message is 2,000 tokens. Using a 200K context window, how many tokens are available for Claude's response?
Q4: What happens when your total tokens (input + requested output) exceed the context window?
Q5: Which of these strategies helps manage token budget? (Select the best answer)
Q6: Fill in the blank to access the token count from an API response:
response = client.messages.create(...)
input_count = response._______.input_tokens
- tokens
- metadata
- usage
- content
Answer: usage. The usage field on the response object contains input_tokens and output_tokens. This is the same pattern you used in M01's first API call.
Module Summary
Key Takeaways
- Tokens are subword units — produced by byte-pair encoding, they're the atomic unit of LLM input and output.
- Three impacts: cost, limits, performance — every token costs money, fills the context window, and adds latency.
- Output tokens cost more — because autoregressive generation is more compute-intensive than parallel input processing.
- The context window is shared — system prompt + history + message + response must all fit. Overflow = error.
- Token budget management is essential — count before you call, and you'll never be surprised by overflow or runaway costs.
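The shared-window takeaway reduces to one subtraction. The sketch below is a minimal illustration with made-up numbers; the function names are hypothetical, and the 200K window is the size used in the Knowledge Check.

```python
CONTEXT_WINDOW = 200_000  # window size used in the Knowledge Check


def available_for_response(system_tokens, history_tokens, message_tokens,
                           context_window=CONTEXT_WINDOW):
    """Tokens left for the response once all input is counted (never negative)."""
    used = system_tokens + history_tokens + message_tokens
    return max(context_window - used, 0)


def fits(system_tokens, history_tokens, message_tokens, max_tokens,
         context_window=CONTEXT_WINDOW):
    """True if the requested max_tokens fits in what the input leaves over."""
    remaining = available_for_response(
        system_tokens, history_tokens, message_tokens, context_window
    )
    return max_tokens <= remaining


if __name__ == "__main__":
    # Hypothetical conversation: 2,000 system + 120,000 history + 3,000 message
    print(available_for_response(2_000, 120_000, 3_000))
    print(fits(2_000, 120_000, 3_000, max_tokens=4_096))
```

In this example the input uses 125,000 tokens, leaving 75,000 for the response, so a max_tokens of 4,096 fits comfortably.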
Next Module Preview: M03 — Prompts
Now that you understand tokens and context windows, you're ready to learn how to use that space effectively. In Module 3, you'll master prompt engineering — system prompts, few-shot examples, chain-of-thought reasoning, and the art of getting Claude to do exactly what you need.