M02: Tokens — The Atoms of AI Communication

Every interaction with Claude — every cost, every limit, every performance decision — traces back to tokens. This module gives you a hands-on understanding of what tokens are and why they're the single most important unit of measurement in agent engineering.

Learning Objectives

  • Explain what tokens are and how text is split into them via byte-pair encoding
  • Calculate the cost of an API call based on input and output token counts
  • Describe the context window and how it's shared between input and output
  • Use the Anthropic SDK to count tokens programmatically
  • Build a token budget calculator that prevents context window overflow

What Are Tokens?

Everyday Analogy

BEFORE: Imagine trying to teach a child to read by showing them entire paragraphs at once — no letters, no syllables, just raw walls of text. The child has no way to break the language into manageable pieces, so learning never gets off the ground.

PAIN: Computers face the same problem with raw text. A string like "understanding" is just a sequence of bytes to a machine — it has no inherent structure, no meaning, and no efficient way to represent the millions of possible words across every language.

MAPPING: Tokens solve this exactly the way syllables solve reading for children: they break text into reusable chunks of just the right size. When you read "understanding," your brain splits it into syllables: un-der-stand-ing. When Claude reads it, the tokenizer splits it into tokens: "understand" + "ing". (A tokenizer is a preprocessing algorithm that converts raw text into a sequence of token IDs before the language model can process it; different models use different tokenizers.) Common words like "the" are a single token. Rare words get split into smaller pieces. An emoji might be 2–3 tokens. This chunking lets the model represent any text using a fixed vocabulary of ~100K pieces: efficient, consistent, and universal.

What this actually looks like in practice: When you send the sentence "Claude is helpful" to the API, the tokenizer converts it into a sequence of numeric IDs before the model ever sees it. Here's the before and after:

# What you send (text):
"Claude is helpful"

# What the model actually receives (token IDs):
[51423, 318, 10950]

# Three tokens: "Claude" → 51423, "is" → 318, "helpful" → 10950
# The model works ONLY with these numbers — never with raw text.
Technical Definition: A token is the smallest unit of text that an LLM processes, a piece of text that is smaller than (or equal to) a whole word. Tokens are subword units: they can be whole words, word fragments, or individual characters, and each token maps to a numeric ID in the model's vocabulary. Tokenizers split rare or long words into subword pieces so that every text can be represented using a fixed-size vocabulary. For example, "playing" might become two tokens: "play" and "ing."

These chunks are created during training by a process called byte-pair encoding (BPE), a compression algorithm that iteratively merges the most common pairs of characters or character sequences in the training data. Here is how BPE works in plain English: the algorithm scans billions of words of training text and finds the character pairs that appear most often (like "t" + "h"). It merges those into a single piece ("th"), then repeats, merging "th" + "e" into "the," and so on. After thousands of merges, you get a vocabulary: the complete, fixed set of ~100K reusable tokens the model knows, from which any text can be represented efficiently as a sequence of tokens.
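The merge loop described above can be sketched in a few lines of Python. This is a toy illustration on a made-up corpus, not Claude's actual tokenizer: the function names (`count_pairs`, `merge_pair`, `learn_merges`) and the sample text are assumptions chosen for the demo, and production tokenizers run many thousands of merges over far larger corpora.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus, num_merges):
    """Start from single characters and apply the most frequent merge repeatedly."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Toy corpus (hypothetical): "t"+"h" is the most common pair, then "th"+"e".
merges, words = learn_merges("the then they the other the that", 2)
print(merges)
print(words)
```

After two merges on this corpus, "the" has collapsed into a single symbol while "that" is still split into three pieces, which mirrors how common words end up as whole tokens and rarer ones stay fragmented.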

Why this approach? Because a fixed vocabulary means the model always works with the same set of building blocks. Each token maps to a numeric ID (for example, "the" = 487). Before Claude reads your message, the tokenizer converts your text into a sequence of these IDs. After Claude generates a response, the IDs are converted back into text. The model never sees raw characters — only token IDs, from start to finish.
Diagram: How a Sentence Becomes Tokens
[Diagram: the tokenizer splits "Understanding tokenization is fundamental" into six tokens: "Understand" (ID 48583) + "ing" (ID 278), "token" (ID 5765) + "ization" (ID 2065), "is" (ID 318), and "fundamental" (ID 14363). 4 words become 6 tokens: common words stay whole, long words get split. Legend: whole-word token, subword token, single-char token (rare words, emoji).]
Why It Matters As you learned in M01, Claude predicts the next token. So tokens aren't just a billing unit — they're the literal building blocks of every response.

Here's what that means in dollars. A customer-support agent that sends 2,000 input tokens and receives 500 output tokens per interaction costs roughly $0.0135 per call on Claude Sonnet 4.6 ($0.006 for input plus $0.0075 for output). At 50,000 interactions per day, that adds up to about $675/day. A 20% reduction in prompt tokens, achieved by trimming verbose system prompts, removes 400 input tokens per call and saves about $60/day, or $1,800/month.

Tokens also determine whether your call succeeds or fails. If your agent loop accumulates 190K tokens of conversation history in a 200K context window, Claude has only ~10K tokens left for its response — about 7,500 words. Exceed that limit, and the API returns a hard error. There is no graceful fallback, no automatic truncation. The call simply fails.
⚠️ Common Misconceptions

"Tokens are just words, right?" — Not quite. Common words like "the" are a single token, but longer words get split: "understanding" becomes "understand" + "ing" (2 tokens). And short words can merge with surrounding punctuation. The mapping is learned statistically by BPE, not based on dictionary words.

"One token = one character?" — No. In English, one token averages about 4 characters. But this varies wildly: a space is 1 token (1 char), while "Claude" is 1 token (6 chars). Non-English text and emojis use far more tokens per visible character — a single emoji might cost 2–3 tokens.

"Tokens only matter for billing." — Billing is the most visible impact, but tokens also determine whether your call succeeds (context window limits), how fast the response arrives (more tokens = more latency), and how well Claude can reason (information buried in a sea of tokens gets less attention).

"I can just use a bigger context window if I run out." — All current Claude models share the same 200K token limit. There is no "upgrade to a bigger window" option. When you hit the wall, you must actively manage what goes in: summarize history, trim system prompts, or prune irrelevant context.

"Token counts are the same across all models." — Different model families can use different tokenizers. A sentence that's 10 tokens on Claude might be 12 tokens on GPT-4. Always measure with the tokenizer for the specific model you're calling — don't assume counts transfer across providers.

🎓 Cert Tip — Domain 5.1 The certification exam tests your understanding of the "lost in the middle" effect: information placed in the middle of a long context window gets lower recall than information at the start or end. When managing tokens, position critical data (case facts, key instructions) at the beginning of the context (system prompt) or at the end (latest user message). You'll explore this in depth in M08.
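One plausible reason behind the "one token = one character" misconception failing for emoji and non-English text is that a single visible character can occupy several bytes. The sketch below only measures UTF-8 byte lengths; it assumes a tokenizer that works from bytes (the module describes raw text as "a sequence of bytes"), and the sample characters are arbitrary choices. It does not report real token counts, which you should always measure with the model's own tokenizer.

```python
# Sample characters (arbitrary): ASCII letter, accented letter, CJK character, emoji
samples = ["a", "é", "語", "🙂"]

# UTF-8 byte length of each visible character
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_counts.items():
    print(f"{ch!r}: {n} byte(s)")
```

The more bytes a character needs, the more raw material a byte-based tokenizer has to cover, which is consistent with an emoji costing 2–3 tokens while a common English word costs one.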

Why Tokens Matter

Tokens impact three things you'll manage constantly as an agent engineer:

  1. Cost — You pay per token. Output tokens (the tokens Claude generates in its response) cost more than input tokens (the tokens you send, including the system prompt and conversation history) because generation is computationally harder than reading. Input tokens are processed in parallel by the attention mechanism, while output is autoregressive: each new token depends on all previously generated tokens, so Claude must run a full computation for every single output token.
  2. Limits — The context window is measured in tokens. It is the maximum total number of tokens that can fit in a single API call, and everything must fit: system prompt, history, your message, and Claude's response. (The system prompt is a developer-provided instruction sent with every API call that sets Claude's role, behavior, and constraints; it's invisible to the end user but counts toward your input tokens.) Claude Sonnet 4.6 has a 200K-token context window.
  3. Performance — More tokens = slower response time + higher cost. Token efficiency is a core engineering skill.

These three concerns are deeply interconnected. When you optimize for cost (by writing shorter prompts), you also free up context window space. And shorter prompts mean fewer tokens to process, so responses arrive faster too. In other words, one optimization improves all three dimensions at once.

The reverse is also true. A bloated system prompt costs more per call. It also eats into your context window, leaving less room for conversation history. And it adds latency, because every extra token takes time to process. This is why tokens are the single most important unit of measurement in agent engineering — they're the common currency behind cost, capacity, and speed.

Diagram: How Token Costs Add Up
Input tokens (2,000 tokens/call) + output tokens (500 tokens/call) = total cost per call
Prices are per 1M tokens (input / output). Example: 2K in + 500 out per call, 1,500 calls/day.

Model        Input $/1M   Output $/1M   Cost/call   Cost/day
Haiku 4.5    $1.00        $5.00         $0.0045     $6.75
Sonnet 4.6   $3.00        $15.00        $0.0135     $20.25
Opus 4.7     $15.00       $75.00        $0.0675     $101.25

At the prices shown, each output token costs 5x as much as an input token.
Key insight: Output tokens dominate cost despite being fewer. Reducing response length (via max_tokens or prompt instructions) saves more per token than trimming input.
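The arithmetic behind the diagram above can be reproduced with a small helper. This is a minimal sketch: the prices are copied from the diagram and may not match current pricing, and the dictionary keys and the `call_cost` function name are hypothetical labels for this example, not API model identifiers.

```python
# (input $/1M tokens, output $/1M tokens), copied from the diagram above
PRICES_PER_MILLION = {
    "haiku": (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call: each side is billed at its own per-token rate."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Reproduce the example: 2,000 input + 500 output tokens, 1,500 calls per day
costs = {model: call_cost(model, 2_000, 500) for model in PRICES_PER_MILLION}
for model, per_call in costs.items():
    print(f"{model:7s} ${per_call:.4f}/call  ${per_call * 1_500:.2f}/day")
```

Keeping the prices in one table makes it easy to compare models for the same workload before committing to one.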

Cost Insight: A typical agent loop makes 3–8 API calls per user request. That includes tool calls, retries, and validation steps. If each call averages 1,000 input tokens and 500 output tokens on Sonnet (about $0.0105 per call), a single user interaction costs roughly $0.03–$0.08. That sounds tiny until you multiply it out. At 10,000 user requests per day, that's roughly $315–$840/day, or about $9,500–$25,000/month. Token management isn't premature optimization; it's table-stakes engineering.

The Context Window

Everyday Analogy

BEFORE: Imagine working on a project where you could spread out unlimited papers across an infinitely large desk — every email, every note, every reference document visible at once. You would never have to choose what to keep in front of you because everything just fits.

PAIN: In reality, your desk is finite. When it fills up, you cannot just pile more papers on top — the stack topples, you lose track of critical documents, and your work grinds to a halt. You have to actively decide what stays on the desk and what goes back in the filing cabinet.

MAPPING: The context window is Claude's desk, and it has a hard size limit (200K tokens for current Claude models). Everything — your system prompt, the conversation history, the current message, AND Claude's response — has to fit on this desk simultaneously. When the desk is full, something has to go. And unlike a human who can glance at a filing cabinet, Claude can only work with what is on the desk right now — there is no "memory" outside the context window unless you build one yourself.

What this actually looks like: every piece of a request (system prompt, history, user message) shares the same 200K-token budget, and Claude's response has to fit in whatever space is left over. First, here's the error you'll get when the desk overflows:

# When you exceed the context window, you get this error — no graceful fallback:
# anthropic.BadRequestError: 400 - {"error": {"type": "invalid_request_error",
#   "message": "prompt is too long: 203847 tokens > 200000 maximum"}}

Now here's the normal case — everything fits on the desk:

# Everything inside this call shares ONE 200K-token context window:
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,                           # ← caps Claude's response
    system="You are a helpful assistant.",      # ← ~10 tokens
    messages=[
        {"role": "user", "content": "Hi!"},           # ← ~3 tokens
        {"role": "assistant", "content": "Hello!..."},  # ← ~5 tokens
        {"role": "user", "content": "Explain RAG."},   # ← ~8 tokens
    ]
    # Total input: ~26 tokens → 199,974 tokens remain for response
    # But if history were 195K tokens → only 5K left for response!
)
Technical Definition The context window is the total number of tokens that Claude can "see" in a single API call — think of it as the model's working memory for that request. All current Claude models (Haiku 4.5, Sonnet 4.6, and Opus 4.7) have a 200K-token context window.

The key insight is that this window is shared between everything in the call. Your system prompt, the full conversation history, the user's latest message, and Claude's response all draw from the same 200K-token pool: [system prompt] + [conversation history] + [user message] + [response]. This is why the token budget formula later in this module matters so much: every token you spend on input is a token Claude cannot use for output.

What happens if you exceed the limit? The API returns an error immediately. It does not silently truncate your messages or summarize old turns for you. Managing what fits in the window is entirely your responsibility as the developer, and it is one of the most common sources of bugs in agent systems.
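Because the API will not truncate for you, one simple way to take on that responsibility is to drop the oldest turns until the history fits. The sketch below is one possible approach under stated assumptions, not a recommended production strategy: `estimate_tokens` and `trim_history` are hypothetical helpers, token counts use the rough 4-characters-per-token average rather than exact counts, and message contents are assumed to be plain strings.

```python
def estimate_tokens(text):
    """Rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, budget_tokens):
    """Drop the oldest messages until the estimated total fits the budget.

    Also keeps dropping while the first remaining message is an assistant
    turn, so the trimmed history still starts with a user message.
    """
    kept = list(messages)
    total = sum(estimate_tokens(m["content"]) for m in kept)
    while len(kept) > 1 and (total > budget_tokens or kept[0]["role"] != "user"):
        total -= estimate_tokens(kept.pop(0)["content"])
    return kept, total

# Made-up conversation: estimated at 100 + 100 + 50 + 50 + 10 = 310 tokens
history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 200},
    {"role": "assistant", "content": "d" * 200},
    {"role": "user", "content": "e" * 40},
]
kept, total = trim_history(history, budget_tokens=120)
print(len(kept), total)
```

Dropping turns loses information outright; summarizing old turns instead trades some tokens for retained context, which is the kind of explicit memory system the module returns to later.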

How is this different from human memory? When you have a conversation, you naturally forget some details but retain the gist — your brain compresses automatically. Claude has no such mechanism. It remembers everything inside the context window with perfect fidelity, but has zero memory of anything outside it. There is no "I vaguely remember" — it's all or nothing. This is why developers build explicit memory systems (summarization, retrieval, external databases) on top of the context window, which you'll learn in M08 and M11.
Animation: Context Window Allocation (200K-token context window)
  • System Prompt: 800 tokens
  • History: 12,000 tokens
  • User Message: 1,200 tokens
  • Response Space: 4,096 tokens
Total: 18,096 / 200,000 tokens used
⚠️ Common Misconceptions

"The response gets its own separate context window." — No. Input and output share the same 200K budget. If you've used 195K on input, Claude only has 5K tokens (~3,750 words) left for its response. If you set max_tokens higher than the remaining space, you'll get an error. (max_tokens is a required API parameter that caps how many tokens Claude can generate in its response. If the response reaches this limit, it's cut off mid-sentence. You only pay for tokens actually generated, not the max_tokens budget.)

"Claude remembers our previous conversations." — It doesn't. Each API call is completely independent. The only reason Claude seems to "remember" is that your application sends the full conversation history as input every time. No history in the request = no memory of past turns.

"The system prompt doesn't count toward the token limit." — It absolutely does. A 2,000-token system prompt is 2,000 tokens of your 200K budget, consumed on every single call. Over a 50-turn conversation, that system prompt is sent 50 times — costing you 100K tokens of billing even though the text never changes.

"I'll just summarize when I get close to the limit." — By then it's often too late. Summarization itself requires an API call that also needs context window space. If you're at 195K tokens, you don't have room to ask Claude to summarize. Start managing your budget proactively at 60–70% utilization, not at 95%.

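The point about the system prompt being resent on every call generalizes to the whole history: because each request carries all earlier turns, billed input grows much faster than the conversation itself. The sketch below works through that arithmetic. The per-turn sizes (100-token user messages, 300-token replies) are made-up illustration values, and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(system_tokens, user_tokens, reply_tokens, turns):
    """Total input tokens billed when the full history is resent every turn."""
    total = 0
    system_share = 0
    history = 0  # tokens from all earlier user messages and replies
    for _ in range(turns):
        # Each call sends: system prompt + everything so far + the new user message
        total += system_tokens + history + user_tokens
        system_share += system_tokens
        # After the call, both the user message and the reply join the history
        history += user_tokens + reply_tokens
    return {"total": total, "system_share": system_share}

result = cumulative_input_tokens(
    system_tokens=2_000, user_tokens=100, reply_tokens=300, turns=50
)
print(result)
```

With these assumed sizes, the unchanged system prompt accounts for 100,000 billed input tokens over 50 turns, and the resent history accounts for several times more, which is why proactive trimming pays off early.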
Token Counting in Practice

To manage your token budget, you need to measure token usage. There are two ways to do this: estimation (fast but approximate) and exact counting via the API (slower but precise).

Estimation is useful when you need a quick sanity check. The "1 token ≈ 4 characters" rule works well for English prose, but it breaks down in predictable ways. Code tends to use more tokens per character because of short variable names and heavy punctuation. Non-English languages — especially CJK scripts — can use 2–3x more tokens per word than English. Emojis and special Unicode characters are surprisingly expensive, sometimes costing 2–3 tokens for a single visible character.

For production systems, estimation isn't enough. The Anthropic SDK gives you two sources of exact counts: the usage field on every API response (which tells you what you just spent) and the dedicated count_tokens endpoint (which tells you what you're about to spend, before making the call). The code walkthrough below shows both approaches.

Rules of Thumb

  • ~1 token = ~4 characters in English (or ~0.75 words)
  • Code is typically more token-dense than prose (more punctuation, short variable names)
  • Non-English text and special characters use more tokens per word
  • The usage field in every API response gives you exact counts
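The first rule of thumb can be turned into a quick estimator. This is a rough sketch for English prose only: `estimate_range` is a hypothetical helper that reports both the character-based and word-based estimates, and neither replaces the exact counts the API returns.

```python
def estimate_by_chars(text):
    """About 4 characters per token."""
    return round(len(text) / 4)

def estimate_by_words(text):
    """About 0.75 words per token."""
    return round(len(text.split()) / 0.75)

def estimate_range(text):
    """Return (low, high) from the two rules of thumb."""
    a, b = estimate_by_chars(text), estimate_by_words(text)
    return min(a, b), max(a, b)

# Synthetic inputs so the arithmetic is easy to follow
by_chars = estimate_by_chars("a" * 400)      # 400 characters
by_words = estimate_by_words("word " * 75)   # 75 words
low, high = estimate_range("Tokens are the atoms of AI communication.")
print(by_chars, by_words, (low, high))
```

When the two estimates disagree noticeably, that is a hint the text is not typical English prose (code, another language, heavy punctuation) and an exact count is worth the extra call.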

The Token Budget Formula

available_for_response = context_window - system_prompt_tokens - history_tokens - user_message_tokens
# Example: 200,000 - 800 - 50,000 - 1,200 = 148,000 tokens available for response

Code Walkthrough: Token Counting & Budget Management

From concept to code: You now know the rules of thumb (1 token ~ 4 characters) and the budget formula (available = context_window - input). But rules of thumb are approximations — they can be off by 10-20% depending on language, punctuation, and formatting. The code below shows you how to get exact token counts from the Anthropic SDK, so you can make precise budget decisions in production rather than guessing.

Counting Tokens with the SDK

Let's start with the simplest approach: make an API call and read the token counts from the response. Every response from Claude includes a usage field — for free, on every single call — that tells you exactly how many input and output tokens were consumed. No separate endpoint needed, no extra cost. This is the fastest way to monitor spending and debug budget issues in real time.

One thing that trips people up: the usage.input_tokens count includes your system prompt, not just the user message. If your input count seems higher than expected, that's almost certainly why. Check your system prompt length first before looking for other causes.

# pip install "anthropic>=0.30.0"
import anthropic

client = anthropic.Anthropic()

try:
    # Make a call and inspect token usage
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=[
            {"role": "user", "content": "Explain what tokens are in 2 sentences."}
        ]
    )

    # The usage field tells you exact token counts
    print(f"Input tokens:  {message.usage.input_tokens}")
    print(f"Output tokens: {message.usage.output_tokens}")
    print(f"Total tokens:  {message.usage.input_tokens + message.usage.output_tokens}")
    print(f"\nResponse: {message.content[0].text}")

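except anthropic.APIStatusError as e:
    # The API answered with an error status (for example 400 or 429)
    print(f"API error: {e.status_code} - {e.message}")
except anthropic.APIError as e:
    # Other SDK errors, such as connection failures, carry no status code
    print(f"API error: {e}")

// Node.js equivalent
// npm install @anthropic-ai/sdk@^0.30.0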
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant.',
    messages: [
      { role: 'user', content: 'Explain what tokens are in 2 sentences.' }
    ]
  });

  // The usage field tells you exact token counts
  console.log(`Input tokens:  ${message.usage.input_tokens}`);
  console.log(`Output tokens: ${message.usage.output_tokens}`);
  console.log(`Total tokens:  ${message.usage.input_tokens + message.usage.output_tokens}`);
  console.log(`\nResponse: ${message.content[0].text}`);

} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}
Expected Output:
Input tokens:  26
Output tokens: 42
Total tokens:  68

Response: Tokens are the smallest units of text that a language model processes, similar to subword fragments that can represent whole words, parts of words, or punctuation. They serve as the fundamental building blocks through which AI models read input and generate output.
What Just Happened? You made an API call and inspected the usage field to see that the short prompt used 26 input tokens and Claude's two-sentence response used 42 output tokens — 68 total. Notice the ratio: 26 tokens of input produced 42 tokens of output. In a real agent loop with tool calls, you might send 5,000 input tokens and get 2,000 output tokens per step — across 6 steps, that is 42,000 tokens for a single user request. This is why monitoring usage on every call matters.
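To monitor usage across a whole agent loop rather than one call, you can accumulate the counts from each response. The sketch below assumes a hypothetical `UsageTracker` class; it accepts any object with `input_tokens` and `output_tokens` attributes, which is the shape of the `usage` field shown above. The demo uses stand-in objects instead of real API responses so it runs without an API key.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class UsageTracker:
    """Running totals of token usage across several API calls."""
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0

    def record(self, usage):
        """Add one response's usage (for example, message.usage)."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        self.calls += 1

    @property
    def total_tokens(self):
        return self.input_tokens + self.output_tokens

# Stand-in for six agent steps of 5,000 input and 2,000 output tokens each
tracker = UsageTracker()
for _ in range(6):
    tracker.record(SimpleNamespace(input_tokens=5_000, output_tokens=2_000))

print(tracker.calls, tracker.input_tokens, tracker.output_tokens, tracker.total_tokens)
```

In a real loop you would call `tracker.record(message.usage)` after each response and log or alert on the totals per user request.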

Building a Token Budget Calculator

Now for the more interesting part: checking the budget before you make a call. The previous example told you what you spent after the fact — useful for monitoring, but it can't prevent a failure. In production agents, you need to know whether the call will even fit before you send it. If you blindly send a request that overflows the context window, the API returns an error, and your user sees a failure with no explanation.

The function below uses the SDK's count_tokens endpoint to measure a full conversation (system prompt + message history), then calculates how much room is left. Think of it as a pre-flight check for every API call — the same way a pilot checks fuel before takeoff, not after landing.

Here's a tradeoff worth knowing about: count_tokens is a separate API call, which adds ~50–100ms of latency. For high-throughput systems, that overhead on every call adds up. A practical compromise is to use the rough "1 token ≈ 4 characters" estimate for quick checks. Then only call count_tokens when your estimate suggests you're above 70% utilization — that's when precision matters most and when getting it wrong means a crash.
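That compromise can be sketched as a two-tier counter: estimate first, and pay for the exact count only near the threshold. This is an illustrative sketch under assumptions: `count_with_fallback` is a hypothetical helper, `exact_counter` stands for any function you supply that returns an exact count (for example a wrapper around count_tokens), and message contents are assumed to be plain strings.

```python
def count_with_fallback(system_prompt, messages, exact_counter,
                        max_context=200_000, threshold=0.7):
    """Return (token_count, source), where source is "estimate" or "exact"."""
    # Cheap pass: roughly 4 characters per token, no network call
    chars = len(system_prompt) + sum(len(m["content"]) for m in messages)
    rough = chars // 4
    if rough / max_context < threshold:
        return rough, "estimate"
    # Close to the limit: precision matters, so ask for the exact count
    return exact_counter(system_prompt, messages), "exact"

# Stand-in exact counter so the sketch runs offline; it records each call
exact_calls = []
def fake_exact_counter(system_prompt, messages):
    exact_calls.append(len(messages))
    return 151_234  # made-up value

short = count_with_fallback("You are a helpful assistant.",
                            [{"role": "user", "content": "Hi!"}],
                            fake_exact_counter)
long = count_with_fallback("You are a helpful assistant.",
                           [{"role": "user", "content": "x" * 600_000}],
                           fake_exact_counter)
print(short, long, len(exact_calls))
```

Since the rough estimate can be off by a meaningful margin, choosing a threshold well below 100% leaves room for that error before the exact check kicks in.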

import anthropic

client = anthropic.Anthropic()

def check_token_budget(
    system_prompt: str,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    max_context: int = 200_000,
    desired_response_tokens: int = 4096,
) -> dict:
    """Check if a conversation fits within the token budget."""

    try:
        # Count tokens for the full request using the API
        count_response = client.messages.count_tokens(
            model=model,
            system=system_prompt,
            messages=messages,
        )
        input_tokens = count_response.input_tokens
    except anthropic.APIError as e:
        return {"error": f"Token counting failed: {e.message}"}

    available = max_context - input_tokens
    fits = available >= desired_response_tokens

    return {
        "input_tokens": input_tokens,
        "available_for_response": available,
        "desired_response_tokens": desired_response_tokens,
        "fits": fits,
        "utilization_pct": round((input_tokens / max_context) * 100, 1),
        "warning": None if fits else (
            f"Only {available} tokens left for response, "
            f"but {desired_response_tokens} requested. "
            f"Trim history or reduce max_tokens."
        ),
    }

# Usage example
system = "You are a helpful coding assistant. Always provide complete, runnable examples."
conversation = [
    {"role": "user", "content": "Write a Python function to sort a list."},
    {"role": "assistant", "content": "def sort_list(items): return sorted(items)"},
    {"role": "user", "content": "Now make it handle None values."},
]

budget = check_token_budget(system, conversation)
if "error" in budget:
    # count_tokens failed, so the budget keys below are not present
    print(budget["error"])
else:
    print(f"Input tokens:    {budget['input_tokens']}")
    print(f"Available:       {budget['available_for_response']}")
    print(f"Utilization:     {budget['utilization_pct']}%")
    print(f"Fits:            {budget['fits']}")
    if budget["warning"]:
        print(f"WARNING: {budget['warning']}")

// Node.js equivalent
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function checkTokenBudget({
  systemPrompt,
  messages,
  model = 'claude-sonnet-4-6',
  maxContext = 200_000,
  desiredResponseTokens = 4096,
}) {
  try {
    // Count tokens for the full request using the API
    const countResponse = await client.messages.countTokens({
      model,
      system: systemPrompt,
      messages,
    });
    const inputTokens = countResponse.input_tokens;

    const available = maxContext - inputTokens;
    const fits = available >= desiredResponseTokens;

    return {
      inputTokens,
      availableForResponse: available,
      desiredResponseTokens,
      fits,
      utilizationPct: ((inputTokens / maxContext) * 100).toFixed(1),
      warning: fits ? null : (
        `Only ${available} tokens left for response, ` +
        `but ${desiredResponseTokens} requested. ` +
        `Trim history or reduce max_tokens.`
      ),
    };
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      return { error: `Token counting failed: ${error.message}` };
    }
    throw error;
  }
}

// Usage example
const budget = await checkTokenBudget({
  systemPrompt: 'You are a helpful coding assistant. Always provide complete, runnable examples.',
  messages: [
    { role: 'user', content: 'Write a function to sort a list.' },
    { role: 'assistant', content: 'function sortList(items) { return [...items].sort(); }' },
    { role: 'user', content: 'Now make it handle null values.' },
  ],
});

console.log(`Input tokens:    ${budget.inputTokens}`);
console.log(`Available:       ${budget.availableForResponse}`);
console.log(`Utilization:     ${budget.utilizationPct}%`);
console.log(`Fits:            ${budget.fits}`);
if (budget.warning) console.log(`WARNING: ${budget.warning}`);
Expected Output:
Input tokens:    58
Available:       199942
Utilization:     0.0%
Fits:            true   (Python prints True)
What Just Happened? You built a pre-flight budget checker. The short 3-turn conversation used only 58 input tokens (0.03% of the 200K window), leaving 199,942 tokens available — plenty of room. But imagine this was turn 200 of a long support conversation: the input could easily be 150K tokens, leaving only 50K for the response. The warning field in the return value catches exactly that scenario. In the modules ahead (especially M08: Conversation Management), you will wire this function into your agent loop so it runs automatically before every API call.

Hands-On Exercise: Build a Token-Aware Prompt Function

What You'll Build

A reusable safe_chat() function that checks your token budget before every API call and warns you when space is running low. You'll use this utility throughout the rest of the course.

Time estimate: 20–30 minutes • Prerequisites: Completed M01 lab (API key set, SDK installed) • Files you'll create: token_tools.py (or token_tools.mjs)

Environment Setup

If you completed the M01 lab, you're already set up. Otherwise, run this first:

# Python
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY="your-key-here"

# Node.js
npm install @anthropic-ai/sdk@^0.30.0
export ANTHROPIC_API_KEY="your-key-here"

Step 1: Count Tokens for a System Prompt

Before you can manage a budget, you need to measure what you're spending. This step builds a small helper that takes a system prompt and a list of messages, and returns the exact token count using the SDK's count_tokens endpoint. This is the foundation for all budget logic.

Create a new file called token_tools.py (or token_tools.mjs):

import anthropic

client = anthropic.Anthropic()

def count_input_tokens(
    system_prompt: str,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
) -> int:
    """Count the exact number of input tokens for a request."""
    result = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=messages,
    )
    return result.input_tokens

# Test it
system = "You are a helpful assistant."
msgs = [{"role": "user", "content": "Hello, how are you?"}]
count = count_input_tokens(system, msgs)
print(f"Input tokens: {count}")

// Node.js equivalent (token_tools.mjs)
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function countInputTokens(systemPrompt, messages, model = 'claude-sonnet-4-6') {
  const result = await client.messages.countTokens({
    model,
    system: systemPrompt,
    messages,
  });
  return result.input_tokens;
}

// Test it
const count = await countInputTokens(
  'You are a helpful assistant.',
  [{ role: 'user', content: 'Hello, how are you?' }]
);
console.log(`Input tokens: ${count}`);

Run it: python token_tools.py (or node token_tools.mjs)

Expected Output:
Input tokens: 18
✅ Checkpoint: If you see a number between 15 and 25, Step 1 is working. The exact count may vary slightly depending on the model version, but a short system prompt plus a short message should be under 30 tokens.
Troubleshooting — Step 1
  • ModuleNotFoundError: No module named 'anthropic' — Run pip install "anthropic>=0.30.0" (Python) or npm install @anthropic-ai/sdk (Node.js).
  • AuthenticationError: Invalid API Key — Check that your ANTHROPIC_API_KEY environment variable is set. Run echo $ANTHROPIC_API_KEY to verify it's not empty.
  • AttributeError: 'Messages' object has no attribute 'count_tokens' — Your installed SDK predates the count_tokens method. Run pip install --upgrade anthropic to get a current release, then rerun the script.

Step 2: Calculate Remaining Budget

Now let's build the budget calculator. This function uses the token count from Step 1 and subtracts it from the 200K context window to tell you how much room is left for Claude's response. This is the "pre-flight check" you'll wire into every agent loop.

Add the following function to token_tools.py (after the count_input_tokens function from Step 1):

def check_budget(
    system_prompt: str,
    messages: list[dict],
    desired_output: int = 4096,
    max_context: int = 200_000,
) -> dict:
    """Check if a conversation fits within the token budget."""
    input_tokens = count_input_tokens(system_prompt, messages)
    available = max_context - input_tokens
    fits = available >= desired_output
    utilization = round((input_tokens / max_context) * 100, 1)

    return {
        "input_tokens": input_tokens,
        "available": available,
        "fits": fits,
        "utilization_pct": utilization,
        "warning": None if fits else (
            f"Only {available} tokens left, but {desired_output} requested. "
            f"Trim history or reduce max_tokens."
        ),
    }

# Test with a short conversation
budget = check_budget(
    "You are a helpful assistant.",
    [
        {"role": "user", "content": "What are tokens?"},
        {"role": "assistant", "content": "Tokens are subword units that LLMs process."},
        {"role": "user", "content": "How do they affect cost?"},
    ]
)
print(f"Input tokens:  {budget['input_tokens']}")
print(f"Available:     {budget['available']}")
print(f"Utilization:   {budget['utilization_pct']}%")
print(f"Fits:          {budget['fits']}")

// Node.js equivalent (add to token_tools.mjs)
async function checkBudget(systemPrompt, messages, desiredOutput = 4096, maxContext = 200_000) {
  const inputTokens = await countInputTokens(systemPrompt, messages);
  const available = maxContext - inputTokens;
  const fits = available >= desiredOutput;
  const utilization = ((inputTokens / maxContext) * 100).toFixed(1);

  return {
    inputTokens,
    available,
    fits,
    utilizationPct: utilization,
    warning: fits ? null : (
      `Only ${available} tokens left, but ${desiredOutput} requested. ` +
      `Trim history or reduce max_tokens.`
    ),
  };
}

// Test with a short conversation
const budget = await checkBudget(
  'You are a helpful assistant.',
  [
    { role: 'user', content: 'What are tokens?' },
    { role: 'assistant', content: 'Tokens are subword units that LLMs process.' },
    { role: 'user', content: 'How do they affect cost?' },
  ]
);
console.log(`Input tokens:  ${budget.inputTokens}`);
console.log(`Available:     ${budget.available}`);
console.log(`Utilization:   ${budget.utilizationPct}%`);
console.log(`Fits:          ${budget.fits}`);

Run it: python token_tools.py (or node token_tools.mjs)

Expected Output:
Input tokens:  42
Available:     199958
Utilization:   0.0%
Fits:          True
✅ Checkpoint: If you see Fits: True with a low utilization percentage, Step 2 is working. The 3-turn conversation uses less than 0.1% of the context window — plenty of room. But imagine this at turn 200 of a long support chat.
Troubleshooting — Step 2
  • NameError: name 'count_input_tokens' is not defined — Make sure both count_input_tokens (from Step 1) and check_budget are in the same file, and that Step 1's code appears above Step 2's.
  • Output shows Fits: False unexpectedly — Check that your max_context parameter is 200,000 (not 200). Python uses underscores in number literals: 200_000.
  • Token count seems too high — Remember that count_tokens includes the system prompt tokens, not just the messages. A system prompt of "You are a helpful assistant." adds ~10 tokens.
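
If you want to sanity-check the budget arithmetic without spending an API call, the sketch below isolates it from the token counting. The helper name budget_from_count is hypothetical (it is not part of token_tools.py); it assumes you already have an input token count from Step 1 and applies the same formula as check_budget.

```python
def budget_from_count(
    input_tokens: int,
    desired_output: int = 4096,
    max_context: int = 200_000,
) -> dict:
    """Same arithmetic as check_budget, minus the API call (hypothetical helper)."""
    available = max_context - input_tokens
    return {
        "input_tokens": input_tokens,
        "available": available,
        "fits": available >= desired_output,
        "utilization_pct": round((input_tokens / max_context) * 100, 1),
    }


# A short chat versus a long session close to the limit
print(budget_from_count(42))
print(budget_from_count(197_000))
```

The second call reports fits as False: 3,000 tokens remain, which is less than the 4,096 requested for the response.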

Step 3: Build the Safe Chat Function

Now let's combine everything into a single safe_chat() function that checks the budget, warns if space is tight, and refuses to make the call if it would overflow. This is the utility you'll reuse throughout the course. It uses check_budget from Step 2.

Add this function to token_tools.py:

def safe_chat(
    system_prompt: str,
    messages: list[dict],
    max_tokens: int = 1024,
    warn_threshold: float = 0.7,
) -> str:
    """Make an API call with automatic token budget checking."""
    budget = check_budget(system_prompt, messages, desired_output=max_tokens)

    # Warn if utilization is high
    if budget["utilization_pct"] > warn_threshold * 100:
        print(f"⚠️  Token budget warning: {budget['utilization_pct']}% used")
        print(f"   Only {budget['available']} tokens left for response")

    # Refuse if it won't fit
    if not budget["fits"]:
        raise ValueError(
            f"Token budget exceeded: {budget['input_tokens']} input tokens, "
            f"only {budget['available']} left. {budget['warning']}"
        )

    # Make the call
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    )

    print(f"📊 Tokens: {response.usage.input_tokens} in, {response.usage.output_tokens} out "
          f"({budget['utilization_pct']}% window used)")

    return response.content[0].text

# Test it
result = safe_chat(
    "You are a helpful assistant. Keep answers brief.",
    [{"role": "user", "content": "What's the tallest mountain on Earth?"}],
)
print(f"\nResponse: {result}")

// JavaScript version (token_tools.mjs)
async function safeChat(systemPrompt, messages, maxTokens = 1024, warnThreshold = 0.7) {
  const budget = await checkBudget(systemPrompt, messages, maxTokens);

  // Warn if utilization is high
  if (parseFloat(budget.utilizationPct) > warnThreshold * 100) {
    console.log(`⚠️  Token budget warning: ${budget.utilizationPct}% used`);
    console.log(`   Only ${budget.available} tokens left for response`);
  }

  // Refuse if it won't fit
  if (!budget.fits) {
    throw new Error(
      `Token budget exceeded: ${budget.inputTokens} input tokens, ` +
      `only ${budget.available} left. ${budget.warning}`
    );
  }

  // Make the call
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: maxTokens,
    system: systemPrompt,
    messages,
  });

  console.log(`📊 Tokens: ${response.usage.input_tokens} in, ${response.usage.output_tokens} out ` +
    `(${budget.utilizationPct}% window used)`);

  return response.content[0].text;
}

// Test it
const result = await safeChat(
  'You are a helpful assistant. Keep answers brief.',
  [{ role: 'user', content: "What's the tallest mountain on Earth?" }],
);
console.log(`\nResponse: ${result}`);

Run it: python token_tools.py (or node token_tools.mjs)

Expected Output:
📊 Tokens: 22 in, 35 out (0.0% window used)

Response: Mount Everest is the tallest mountain on Earth, standing at 8,849 meters (29,032 feet) above sea level.
✅ Checkpoint: If you see the token usage line followed by a response, Step 3 is working. The safe_chat function now automatically checks the budget, warns at 70% utilization, and refuses calls that would overflow the context window.
Troubleshooting
  • NameError: name 'count_input_tokens' is not defined — Make sure all three functions (count_input_tokens, check_budget, safe_chat) are in the same file, and the client = anthropic.Anthropic() line is at the top.
  • AttributeError: 'Messages' object has no attribute 'count_tokens' — Your installed SDK predates the token counting endpoint. Run pip install --upgrade anthropic (or npm install @anthropic-ai/sdk@latest).
  • Warning shows even for small conversations — Check that your warn_threshold is 0.7 (70%), not 0.007. It's a fraction, not a percentage.
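
The budget warning suggests two remedies: trim history or reduce max_tokens. As a minimal sketch of the second remedy, the hypothetical helper below (clamp_max_tokens and its min_useful parameter are illustrative assumptions, not part of the SDK or of token_tools.py) shrinks the requested output to whatever room is left, and gives up only when too little remains to be useful.

```python
def clamp_max_tokens(
    input_tokens: int,
    requested: int,
    max_context: int = 200_000,
    min_useful: int = 256,  # assumed floor below which a reply is not worth requesting
) -> int:
    """Return the largest max_tokens value that still fits in the window."""
    available = max_context - input_tokens
    if available < min_useful:
        raise ValueError(f"Only {available} tokens left; trim history first.")
    return min(requested, available)


print(clamp_max_tokens(1_000, 1024))    # plenty of room, request unchanged
print(clamp_max_tokens(199_500, 1024))  # tight, request reduced to what fits
```

Whether to clamp or to refuse outright, as safe_chat does, is a design choice: clamping keeps the agent running, but a shortened reply may be cut off mid-answer.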

Step 4 (Stretch Goal): Fill the Context Window

Nothing teaches token budget management like watching it happen in real time. The code below runs a loop that sends messages to Claude, one after another, and prints the token count after each turn. You'll see input_tokens climb steadily with every turn. The loop also includes a 150K warning threshold, although with answers this short it will most likely run out of turns well before reaching it.

The interesting part is the growth pattern. Each turn adds both the new user message and Claude's previous response to the conversation history, so the input for each call grows by a roughly constant amount: linear growth per call. Because every call resends the entire history, though, the cumulative input tokens you pay for across the whole conversation grow roughly quadratically. By turn 30, each call carries roughly six times the input it did at turn 5.
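
To see the growth pattern without making any API calls, here is a small offline simulation. It is only a sketch: the per-message sizes (20 tokens for the system prompt, 15 per user message, 60 per reply) are assumed round numbers, not measurements, and simulate_growth is a hypothetical helper.

```python
def simulate_growth(
    turns: int,
    system_tokens: int = 20,  # assumed size of the system prompt
    user_tokens: int = 15,    # assumed size of each user message
    reply_tokens: int = 60,   # assumed size of each brief reply
):
    """Yield (turn, input tokens for this call, cumulative input tokens)."""
    history = 0
    cumulative = 0
    for turn in range(1, turns + 1):
        call_input = system_tokens + history + user_tokens
        cumulative += call_input
        yield turn, call_input, cumulative
        # The next call resends this turn's user message and reply
        history += user_tokens + reply_tokens


for turn, call_input, cumulative in simulate_growth(100):
    if turn in (1, 5, 30, 100):
        print(f"Turn {turn:>3}: {call_input:>5} in this call, {cumulative:>7} cumulative")
```

Under these assumptions the input per call reaches 7,460 tokens at turn 100, while the cumulative input across all 100 calls is 374,750 tokens. The per-call figure rises in a straight line; the running total is what curves upward.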

A fair warning: this exercise makes many API calls in a loop, so it will cost real money (~$0.50–$2.00 depending on how far it runs). If you'd rather observe the pattern without the cost, just read the expected output below instead of running the code.

Save the code below as a separate file (e.g., fill_window.py or fill_window.mjs). It has its own import and client setup, so do not paste it into token_tools.py from the previous steps.

import anthropic

client = anthropic.Anthropic()

# Build a conversation that grows until the budget runs out
conversation = []
system = "You are a helpful assistant. Keep your answers brief (under 50 words)."

for i in range(1, 101):  # 100 turns, matching the JavaScript version
    conversation.append({
        "role": "user",
        "content": f"Tell me fact #{i} about space exploration."
    })

    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            system=system,
            messages=conversation,
        )
        assistant_msg = response.content[0].text
        conversation.append({"role": "assistant", "content": assistant_msg})

        total = response.usage.input_tokens + response.usage.output_tokens
        print(f"Turn {i}: {response.usage.input_tokens} in + {response.usage.output_tokens} out = {total} total")

        # Watch the input tokens grow with each turn
        if response.usage.input_tokens > 150_000:
            print(f"\n--- Budget warning! Input alone is {response.usage.input_tokens} tokens ---")
            break

    except anthropic.APIError as e:
        print(f"\nHit the limit at turn {i}: {e.message}")
        break

// JavaScript version (fill_window.mjs)
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const conversation = [];
const system = 'You are a helpful assistant. Keep your answers brief (under 50 words).';

for (let i = 1; i <= 100; i++) {
  conversation.push({
    role: 'user',
    content: `Tell me fact #${i} about space exploration.`,
  });

  try {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 200,
      system,
      messages: conversation,
    });
    const assistantMsg = response.content[0].text;
    conversation.push({ role: 'assistant', content: assistantMsg });

    const total = response.usage.input_tokens + response.usage.output_tokens;
    console.log(`Turn ${i}: ${response.usage.input_tokens} in + ${response.usage.output_tokens} out = ${total} total`);

    if (response.usage.input_tokens > 150_000) {
      console.log(`\n--- Budget warning! Input alone is ${response.usage.input_tokens} tokens ---`);
      break;
    }
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.log(`\nHit the limit at turn ${i}: ${error.message}`);
    } else {
      throw error;
    }
    break;
  }
}
What Just Happened? You watched a conversation grow turn by turn. Each turn adds the new user message plus Claude's previous response to the history, so the input for each call grows roughly linearly, while the cumulative input billed across all calls grows roughly quadratically. With replies capped at about 50 words, the loop will usually run out of turns long before the input reaches 150K tokens; swap the short prompts for long documents or tool results and the same loop can hit the warning in far fewer turns. This is exactly the scenario that real-world agents face in long customer support sessions, multi-step coding tasks, or research workflows. The takeaway: without active conversation management (summarization, pruning, or sliding windows), any agent will hit the context wall surprisingly fast.
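
As a rough illustration of the pruning option, the sketch below drops the oldest messages until the history fits a budget. Both helpers are hypothetical: estimate_tokens uses the ~4 characters per token rule of thumb rather than a real tokenizer, and in practice you would swap in count_input_tokens from Step 1 for accurate numbers.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate via the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)


def prune_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest messages until the estimated total fits the budget."""
    kept = list(messages)
    while len(kept) > 1 and sum(estimate_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)
    # Assumption: keep the history starting on a user turn, the
    # conventional shape for a conversation sent to the Messages API
    while len(kept) > 1 and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
print(len(prune_history(history, budget=1000)))  # everything fits
print(len(prune_history(history, budget=120)))   # only the newest user turn survives
```

Pruning is the simplest strategy but also the lossiest: whatever falls out of the window is gone. Summarizing old turns before dropping them preserves more context at the cost of an extra API call.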

Verify Everything Works

Run your complete token_tools.py file to confirm all three functions work together:

python token_tools.py    # or: node token_tools.mjs

You should see output from all three steps: a token count, a budget check, and a successful safe_chat response with usage stats.

🎉 Congratulations! You now have a reusable safe_chat() function that prevents context window overflow. You'll use this pattern in M08 (Conversation Management) and in every capstone project. The key skill you've built: never make an API call without knowing your token budget first.

Knowledge Check

Test your understanding of tokens, costs, and context windows.

Q1: Approximately how many tokens is the sentence "Hello, how are you today?"

A. 2 tokens
B. 5 tokens (one per word)
C. 7-8 tokens (words + punctuation + spaces)
D. 25 tokens (one per character)
Answer: C. Tokens don't map 1:1 to words or characters. Common words like "Hello" and "how" are single tokens, but punctuation and less common words may be separate tokens. The rule of thumb is ~1 token per 4 characters.

Q2: Why are output tokens more expensive than input tokens?

A. Output tokens are larger and take up more memory
B. Output generation is autoregressive (each token depends on all previous), requiring more computation than parallel input processing
C. It's just Anthropic's pricing strategy; the computation is the same
D. Output tokens need to be higher quality than input tokens
Answer: B. As you learned in M01, input is processed in parallel via attention, but output is generated one token at a time (autoregressive). Each output token requires a full forward pass through the model, making generation significantly more compute-intensive.

Q3: Your system prompt is 1,000 tokens, conversation history is 150,000 tokens, and your user message is 2,000 tokens. Using a 200K context window, how many tokens are available for Claude's response?

A. 200,000 tokens (the response has its own window)
B. 50,000 tokens
C. 48,000 tokens
D. 47,000 tokens (200K - 1K - 150K - 2K)
Answer: D. The context window is shared: 200,000 - 1,000 (system) - 150,000 (history) - 2,000 (user message) = 47,000 tokens available for the response. This is the token budget formula in action.

Q4: What happens when your total tokens (input + requested output) exceed the context window?

A. The API returns an error
B. Claude silently truncates the oldest messages to make room
C. Claude summarizes the conversation automatically
D. The response quality degrades but the call succeeds
Answer: A. The API returns an error when the input tokens plus the requested max_tokens exceed the context window limit. Claude does NOT silently truncate or summarize the conversation; managing that is YOUR job. This is why the token budget calculator from this module is so important.

Q5: Which of these strategies helps manage token budget? (Select the best answer)

A. Always set max_tokens to the maximum possible value
B. Count tokens before each call, summarize old history when the budget gets tight
C. Use the longest possible system prompt to give Claude more context
D. There's no need to manage budget; 200K tokens is more than enough
Answer: B. Proactive token counting + conversation summarization is the standard pattern for long-running agents. You'll build this exact system in M08 (Conversation Management). Setting max_tokens too high wastes budget, and even 200K fills up fast in agent loops with tool calls.

Q6: Fill in the blank to access the token count from an API response:

response = client.messages.create(...)
input_count = response._______.input_tokens
A. tokens
B. metadata
C. usage
D. content
Answer: C. The usage field on the response object contains input_tokens and output_tokens. This is the same pattern you used in M01's first API call.

Module Summary

Key Takeaways

  • Tokens are subword units — produced by byte-pair encoding, they're the atomic unit of LLM input and output.
  • Three impacts: cost, limits, performance — every token costs money, fills the context window, and adds latency.
  • Output tokens cost more — because autoregressive generation is more compute-intensive than parallel input processing.
  • The context window is shared — system prompt + history + message + response must all fit. Overflow = error.
  • Token budget management is essential — count before you call, and you'll never be surprised by overflow or runaway costs.

Next Module Preview: M03 — Prompts

Now that you understand tokens and context windows, you're ready to learn how to use that space effectively. In Module 3, you'll master prompt engineering — system prompts, few-shot examples, chain-of-thought reasoning, and the art of getting Claude to do exactly what you need.