
M02: Tokens — The Atoms of AI Communication

Every interaction with Claude — every cost, every limit, every performance decision — traces back to tokens. This module gives you a hands-on understanding of what tokens are and why they're the single most important unit of measurement in agent engineering.

Learning Objectives

Explain what tokens are and how text is split into them via byte-pair encoding
Calculate the cost of an API call based on input and output token counts
Describe the context window and how it's shared between input and output
Use the Anthropic SDK to count tokens programmatically
Build a token budget calculator that prevents context window overflow

What Are Tokens?

Everyday Analogy

BEFORE: Imagine trying to teach a child to read by showing them entire paragraphs at once — no letters, no syllables, just raw walls of text. The child has no way to break the language into manageable pieces, so learning never gets off the ground.

PAIN: Computers face the same problem with raw text. A string like "understanding" is just a sequence of bytes to a machine — it has no inherent structure, no meaning, and no efficient way to represent the millions of possible words across every language.

MAPPING: Tokens solve this exactly the way syllables solve reading for children: they break text into reusable chunks of just the right size. When you read "understanding," your brain splits it into syllables: un-der-stand-ing. When Claude reads it, the tokenizer splits it into tokens: "understand" + "ing". Common words like "the" are a single token. Rare words get split into smaller pieces. An emoji might be 2–3 tokens. This chunking lets the model represent any text using a fixed vocabulary of ~100K pieces — efficient, consistent, and universal.

What this actually looks like in practice: When you send the sentence "Claude is helpful" to the API, the tokenizer converts it into a sequence of numeric IDs before the model ever sees it. Here's the before and after:

# What you send (text):
"Claude is helpful"

# What the model actually receives (token IDs; the values here are illustrative):
[51423, 318, 10950]

# Three tokens: "Claude" → 51423, "is" → 318, "helpful" → 10950
# The model works ONLY with these numbers — never with raw text.

Technical Definition A token is a piece of text that is smaller than (or equal to) a whole word. Think of it as a chunk: "playing" might become two tokens, "play" and "ing."

These chunks are created during training by a process called byte-pair encoding (BPE). Here is how BPE works in plain English: the algorithm scans billions of words of training text and finds the character pairs that appear most often (like "t" + "h"). It merges those into a single piece ("th"), then repeats, merging "th" + "e" into "the," and so on. After thousands of merges, you get a vocabulary of ~100K reusable pieces that can represent any text efficiently.
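
The merge loop described above can be sketched in a few lines. This is a character-level toy on a four-word corpus, not Claude's actual tokenizer: the function name learn_bpe_merges and the sample words are made up for illustration, and production tokenizers work on bytes over vastly larger corpora.

```python
from collections import Counter


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus (toy, character-level)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge

        # The most frequent pair becomes a new vocabulary piece.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite the corpus with that pair fused into one symbol.
        new_corpus: Counter = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus

    return merges


merges = learn_bpe_merges(["the", "the", "then", "this"], num_merges=2)
print(merges)
```

On this corpus the first merge fuses "t" + "h" and the second fuses "th" + "e", mirroring the example in the text.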

Why this approach? Because a fixed vocabulary means the model always works with the same set of building blocks. Each token maps to a numeric ID (for example, "the" = 487). Before Claude reads your message, the tokenizer converts your text into a sequence of these IDs. After Claude generates a response, the IDs are converted back into text. The model never sees raw characters — only token IDs, from start to finish.
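
To make the text-to-IDs round trip concrete, here is a minimal sketch with a five-piece vocabulary. Everything in it is hypothetical: the vocabulary, the IDs, and the greedy longest-match strategy (a simplification; a real BPE tokenizer applies its learned merges in order).

```python
# Hypothetical toy vocabulary. Real vocabularies hold on the order of 100K pieces.
TOY_VOCAB = ["understand", "play", "ing", "the", " "]
TOKEN_TO_ID = {piece: i for i, piece in enumerate(TOY_VOCAB)}
ID_TO_TOKEN = {i: piece for piece, i in TOKEN_TO_ID.items()}


def encode(text: str) -> list[int]:
    """Convert text to token IDs by greedy longest-match against the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary piece that matches at the current position.
        match = max(
            (piece for piece in TOY_VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"No vocabulary piece matches at position {pos}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Convert token IDs back to text by concatenating their pieces."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


text = "understanding the playing"
ids = encode(text)
print(ids)
print(decode(ids) == text)
```

The model only ever sees the list of integers; the strings exist on either side of it.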

Diagram: How a Sentence Becomes Tokens

Why It Matters As you learned in M01, Claude predicts the next token. So tokens aren't just a billing unit — they're the literal building blocks of every response.

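Here's what that means in dollars. A customer-support agent that sends 2,000 input tokens and receives 500 output tokens per interaction costs roughly $0.014 per call on Claude Sonnet 4.6 (about $0.006 for input plus $0.0075 for output, which corresponds to rates of $3 per million input tokens and $15 per million output tokens). At 50,000 interactions per day, that adds up to about $700/day. A 20% reduction in prompt tokens, achieved by trimming verbose system prompts, cuts only the input share of the bill: roughly $60/day, or about $1,800/month.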
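
The arithmetic above is easy to reproduce. The sketch below assumes prices of $3 per million input tokens and $15 per million output tokens, which are the rates that yield the roughly $0.014 per-call figure; treat them as placeholders and check current pricing. The function name call_cost is illustrative.

```python
# Assumed prices in USD per million tokens (placeholders, not authoritative).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def call_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_MTOK,
    output_price: float = OUTPUT_PRICE_PER_MTOK,
) -> float:
    """Return the USD cost of one API call."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


CALLS_PER_DAY = 50_000

per_call = call_cost(2_000, 500)
daily = per_call * CALLS_PER_DAY

# A 20% shorter prompt only shrinks the input side of the bill.
trimmed_daily = call_cost(1_600, 500) * CALLS_PER_DAY

print(f"Per call:     ${per_call:.4f}")
print(f"Per day:      ${daily:,.2f}")
print(f"Daily saving: ${daily - trimmed_daily:,.2f}")
```

Note that prompt trimming reduces only the input term, so the saving is 20% of the input cost rather than 20% of the total.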

Tokens also determine whether your call succeeds or fails. If your agent loop accumulates 190K tokens of conversation history in a 200K context window, Claude has only ~10K tokens left for its response — about 7,500 words. Exceed that limit, and the API returns a hard error. There is no graceful fallback, no automatic truncation. The call simply fails.

Why Tokens Matter

Tokens impact three things you'll manage constantly as an agent engineer:

Cost — You pay per token. Output tokens cost more than input tokens because generation is computationally harder than reading.
Limits — The context window is measured in tokens. Everything must fit: system prompt, history, your message, and Claude's response.
Performance — More tokens = slower response time + higher cost. Token efficiency is a core engineering skill.

These three concerns are deeply interconnected. When you optimize for cost (by writing shorter prompts), you also free up context window space. And shorter prompts mean fewer tokens to process, so responses arrive faster too. In other words, one optimization improves all three dimensions at once.

The reverse is also true. A bloated system prompt costs more per call. It also eats into your context window, leaving less room for conversation history. And it adds latency, because every extra token takes time to process. This is why tokens are the single most important unit of measurement in agent engineering — they're the common currency behind cost, capacity, and speed.

Diagram: How Token Costs Add Up

Cost Insight A typical agent loop makes 3–8 API calls per user request. That includes tool calls, retries, and validation steps. If each call averages 1,000 input tokens and 500 output tokens on Sonnet, each call costs about $0.01 (assuming $3 per million input tokens and $15 per million output tokens), so a single user interaction costs roughly $0.03–$0.08. That sounds tiny until you multiply it out. At 10,000 users per day, that's roughly $300–$840/day, or $9,000–$25,000/month. Token management isn't premature optimization; it's table-stakes engineering.

The Context Window

Everyday Analogy

BEFORE: Imagine working on a project where you could spread out unlimited papers across an infinitely large desk — every email, every note, every reference document visible at once. You would never have to choose what to keep in front of you because everything just fits.

PAIN: In reality, your desk is finite. When it fills up, you cannot just pile more papers on top — the stack topples, you lose track of critical documents, and your work grinds to a halt. You have to actively decide what stays on the desk and what goes back in the filing cabinet.

MAPPING: The context window is Claude's desk, and it has a hard size limit (200K tokens for current Claude models). Everything — your system prompt, the conversation history, the current message, AND Claude's response — has to fit on this desk simultaneously. When the desk is full, something has to go. And unlike a human who can glance at a filing cabinet, Claude can only work with what is on the desk right now — there is no "memory" outside the context window unless you build one yourself.

What this actually looks like: every piece of an API call (system prompt, history, user message) shares the same 200K-token budget, and Claude's response has to fit in whatever space is left over. Here's the error you get when the desk overflows:

# When you exceed the context window, you get this error — no graceful fallback:
# anthropic.BadRequestError: 400 - {"error": {"type": "invalid_request_error",
#   "message": "prompt is too long: 203847 tokens > 200000 maximum"}}

Now here's the normal case — everything fits on the desk:

# Everything inside this call shares ONE 200K-token context window:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,                           # ← caps Claude's response
    system="You are a helpful assistant.",      # ← ~10 tokens
    messages=[
        {"role": "user", "content": "Hi!"},           # ← ~3 tokens
        {"role": "assistant", "content": "Hello!..."},  # ← ~5 tokens
        {"role": "user", "content": "Explain RAG."},   # ← ~8 tokens
    ]
    # Total input: ~26 tokens → ~199,974 tokens of window remain,
    # though max_tokens caps this particular response at 4,096.
    # But if history were 195K tokens → only 5K left for response!
)

Technical Definition The context window is the total number of tokens that Claude can "see" in a single API call — think of it as the model's working memory for that request. All current Claude models (Haiku 4.5, Sonnet 4.6, and Opus 4.7) have a 200K-token context window.

The key insight is that this window is shared between everything in the call. Your system prompt, the full conversation history, the user's latest message, and Claude's response all draw from the same 200K-token pool: [system prompt] + [conversation history] + [user message] + [response]. This is why the budget formula later in this module matters so much: every token you spend on input is a token Claude cannot use for output.

What happens if you exceed the limit? The API returns an error immediately. It does not silently truncate your messages or summarize old turns for you. Managing what fits in the window is entirely your responsibility as the developer, and it is one of the most common sources of bugs in agent systems.
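
Because input and output draw from one pool, one simple guard is to cap the requested response size at whatever the window has left. The helper below is a sketch of that idea; clamp_max_tokens is a hypothetical name, and the 200,000 default is the window size quoted in this module.

```python
def clamp_max_tokens(
    input_tokens: int,
    requested_max_tokens: int,
    context_window: int = 200_000,
) -> int:
    """Return a max_tokens value that cannot overflow the shared context window."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        # Nothing left for a response: the caller must trim input first.
        raise ValueError(
            f"Input uses {input_tokens} tokens, leaving no room in a "
            f"{context_window}-token window."
        )
    return min(requested_max_tokens, remaining)


print(clamp_max_tokens(26, 4_096))        # plenty of room, request is unchanged
print(clamp_max_tokens(195_000, 8_192))   # tight window, request is reduced
```

Clamping keeps the call valid, but a heavily reduced max_tokens may truncate the answer, so it complements trimming history rather than replacing it.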

How is this different from human memory? When you have a conversation, you naturally forget some details but retain the gist — your brain compresses automatically. Claude has no such mechanism. It remembers everything inside the context window with perfect fidelity, but has zero memory of anything outside it. There is no "I vaguely remember" — it's all or nothing. This is why developers build explicit memory systems (summarization, retrieval, external databases) on top of the context window, which you'll learn in M08 and M11.
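
The simplest explicit memory strategy is a sliding window: drop the oldest turns once the conversation passes a utilization target. The sketch below is one way to do that under stated assumptions: it uses the rough 4-characters-per-token estimate instead of exact counts, assumes every message's content is a plain string, and the names rough_tokens and trim_history are illustrative.

```python
def rough_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English prose."""
    return max(1, len(text) // 4)


def trim_history(
    messages: list[dict],
    system_prompt: str,
    context_window: int = 200_000,
    target_utilization: float = 0.6,
) -> list[dict]:
    """Drop the oldest turns until the estimated input fits the target budget."""
    budget = int(context_window * target_utilization)

    def total(msgs: list[dict]) -> int:
        return rough_tokens(system_prompt) + sum(rough_tokens(m["content"]) for m in msgs)

    trimmed = list(messages)  # copy, so the caller's history is untouched
    # Always keep at least the latest message.
    while len(trimmed) > 1 and total(trimmed) > budget:
        trimmed.pop(0)
    # Keep the conversation starting on a user turn.
    while len(trimmed) > 1 and trimmed[0]["role"] != "user":
        trimmed.pop(0)
    return trimmed


# Five 100-token turns against a deliberately tiny 1,000-token window.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 400}
    for i in range(5)
]
kept = trim_history(history, "", context_window=1_000, target_utilization=0.5)
print(len(kept), kept[0]["role"])
```

Dropping turns loses information outright; summarizing the removed turns before discarding them is the usual next refinement.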

Animation: Context Window Allocation

Context window: 200K tokens. System prompt: 800 tokens · History: 12,000 tokens · User message: 1,200 tokens · Response space: 4,096 tokens. Total: 18,096 / 200,000 tokens used.

⚠️ Common Misconceptions

"The response gets its own separate context window." — No. Input and output share the same 200K budget. If you've used 195K on input, Claude only has 5K tokens (~3,750 words) left for its response. If you set max_tokens higher than the remaining space, you'll get an error.

"Claude remembers our previous conversations." — It doesn't. Each API call is completely independent. The only reason Claude seems to "remember" is that your application sends the full conversation history as input every time. No history in the request = no memory of past turns.

"The system prompt doesn't count toward the token limit." — It absolutely does. A 2,000-token system prompt is 2,000 tokens of your 200K budget, consumed on every single call. Over a 50-turn conversation, that system prompt is sent 50 times — costing you 100K tokens of billing even though the text never changes.

"I'll just summarize when I get close to the limit." — By then it's often too late. Summarization itself requires an API call that also needs context window space. If you're at 195K tokens, you don't have room to ask Claude to summarize. Start managing your budget proactively at 60–70% utilization, not at 95%.

Token Counting in Practice

To manage your token budget, you need to measure token usage. There are two ways to do this: estimation (fast but approximate) and exact counting via the API (slower but precise).

Estimation is useful when you need a quick sanity check. The "1 token ≈ 4 characters" rule works well for English prose, but it breaks down in predictable ways. Code tends to use more tokens per character because of short variable names and heavy punctuation. Non-English languages — especially CJK scripts — can use 2–3x more tokens per word than English. Emojis and special Unicode characters are surprisingly expensive, sometimes costing 2–3 tokens for a single visible character.

For production systems, estimation isn't enough. The Anthropic SDK gives you two sources of exact counts: the usage field on every API response (which tells you what you just spent) and the dedicated count_tokens endpoint (which tells you what you're about to spend, before making the call). The code walkthrough below shows both approaches.

Rules of Thumb

~1 token = ~4 characters in English (or ~0.75 words)
Code is typically more token-dense than prose (more punctuation, short variable names)
Non-English text and special characters use more tokens per word
The usage field in every API response gives you exact counts
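
The first rule of thumb translates directly into a quick estimator. This is a sketch for sanity checks only; estimate_tokens is a hypothetical helper, and both ratios are the English-prose approximations listed above, so expect larger errors on code and non-English text.

```python
def estimate_tokens(text: str) -> dict[str, int]:
    """Estimate token count two ways: by characters and by words."""
    by_chars = len(text) / 4            # ~4 characters per token
    by_words = len(text.split()) / 0.75  # ~0.75 words per token
    return {"by_chars": round(by_chars), "by_words": round(by_words)}


sample = "Tokens are the atoms of AI communication."
estimate = estimate_tokens(sample)
print(estimate)
```

When the two estimates disagree sharply, the text is probably not ordinary English prose, which is a hint to use an exact count instead.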

The Token Budget Formula

available_for_response = context_window - system_prompt_tokens - history_tokens - user_message_tokens
# Example: 200,000 - 800 - 50,000 - 1,200 = 148,000 tokens available for response

Code Walkthrough: Token Counting & Budget Management

From concept to code: You now know the rules of thumb (1 token ~ 4 characters) and the budget formula (available = context_window - input). But rules of thumb are approximations — they can be off by 10-20% depending on language, punctuation, and formatting. The code below shows you how to get exact token counts from the Anthropic SDK, so you can make precise budget decisions in production rather than guessing.

Counting Tokens with the SDK

Let's start with the simplest approach: make an API call and read the token counts from the response. Every response from Claude includes a usage field — for free, on every single call — that tells you exactly how many input and output tokens were consumed. No separate endpoint needed, no extra cost. This is the fastest way to monitor spending and debug budget issues in real time.

One thing that trips people up: the usage.input_tokens count includes your system prompt, not just the user message. If your input count seems higher than expected, that's almost certainly why. Check your system prompt length first before looking for other causes.

# pip install "anthropic>=0.30.0"
import anthropic

client = anthropic.Anthropic()

try:
    # Make a call and inspect token usage
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=[
            {"role": "user", "content": "Explain what tokens are in 2 sentences."}
        ]
    )

    # The usage field tells you exact token counts
    print(f"Input tokens:  {message.usage.input_tokens}")
    print(f"Output tokens: {message.usage.output_tokens}")
    print(f"Total tokens:  {message.usage.input_tokens + message.usage.output_tokens}")
    print(f"\nResponse: {message.content[0].text}")

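except anthropic.APIStatusError as e:
    # The API answered with an error status (bad request, rate limit, ...)
    print(f"API error: {e.status_code} - {e.message}")
except anthropic.APIConnectionError as e:
    # No response was received, so there is no status code to report
    print(f"Connection error: {e}")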

// npm install @anthropic-ai/sdk@^0.30.0
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant.',
    messages: [
      { role: 'user', content: 'Explain what tokens are in 2 sentences.' }
    ]
  });

  // The usage field tells you exact token counts
  console.log(`Input tokens:  ${message.usage.input_tokens}`);
  console.log(`Output tokens: ${message.usage.output_tokens}`);
  console.log(`Total tokens:  ${message.usage.input_tokens + message.usage.output_tokens}`);
  console.log(`\nResponse: ${message.content[0].text}`);

} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}

Expected Output:

Input tokens:  26
Output tokens: 42
Total tokens:  68

Response: Tokens are the smallest units of text that a language model processes, similar to subword fragments that can represent whole words, parts of words, or punctuation. They serve as the fundamental building blocks through which AI models read input and generate output.

What Just Happened? You made an API call and inspected the usage field to see that the short prompt used 26 input tokens and Claude's two-sentence response used 42 output tokens — 68 total. Notice the ratio: 26 tokens of input produced 42 tokens of output. In a real agent loop with tool calls, you might send 5,000 input tokens and get 2,000 output tokens per step — across 6 steps, that is 42,000 tokens for a single user request. This is why monitoring usage on every call matters.
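
To monitor usage across a whole agent loop rather than one call, you can accumulate the usage field from each response. The sketch below uses a small hypothetical UsageTracker class; the stand-in objects imitate the input_tokens and output_tokens attributes of message.usage so the example runs without an API key.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTracker:
    """Running totals of token usage across several API calls."""
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0

    def record(self, usage) -> None:
        """Add one response's usage (any object with input_tokens/output_tokens)."""
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        self.calls += 1

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


tracker = UsageTracker()

# Stand-in for six agent steps; in real code, pass message.usage after each call.
for _ in range(6):
    tracker.record(SimpleNamespace(input_tokens=5_000, output_tokens=2_000))

print(f"{tracker.calls} calls, {tracker.total_tokens} tokens total")
```

With the six-step example from the text, the tracker reports the same 42,000-token total.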

Building a Token Budget Calculator

Now for the more interesting part: checking the budget before you make a call. The previous example told you what you spent after the fact — useful for monitoring, but it can't prevent a failure. In production agents, you need to know whether the call will even fit before you send it. If you blindly send a request that overflows the context window, the API returns an error, and your user sees a failure with no explanation.

The function below uses the SDK's count_tokens endpoint to measure a full conversation (system prompt + message history), then calculates how much room is left. Think of it as a pre-flight check for every API call — the same way a pilot checks fuel before takeoff, not after landing.

Here's a tradeoff worth knowing about: count_tokens is a separate API call, which adds ~50–100ms of latency. For high-throughput systems, that overhead on every call adds up. A practical compromise is to use the rough "1 token ≈ 4 characters" estimate for quick checks. Then only call count_tokens when your estimate suggests you're above 70% utilization — that's when precision matters most and when getting it wrong means a crash.

import anthropic

client = anthropic.Anthropic()

def check_token_budget(
    system_prompt: str,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
    max_context: int = 200_000,
    desired_response_tokens: int = 4096,
) -> dict:
    """Check if a conversation fits within the token budget."""

    try:
        # Count tokens for the full request using the API
        count_response = client.messages.count_tokens(
            model=model,
            system=system_prompt,
            messages=messages,
        )
        input_tokens = count_response.input_tokens
    except anthropic.APIError as e:
        return {"error": f"Token counting failed: {e.message}"}

    available = max_context - input_tokens
    fits = available >= desired_response_tokens

    return {
        "input_tokens": input_tokens,
        "available_for_response": available,
        "desired_response_tokens": desired_response_tokens,
        "fits": fits,
        "utilization_pct": round((input_tokens / max_context) * 100, 1),
        "warning": None if fits else (
            f"Only {available} tokens left for response, "
            f"but {desired_response_tokens} requested. "
            f"Trim history or reduce max_tokens."
        ),
    }

# Usage example
system = "You are a helpful coding assistant. Always provide complete, runnable examples."
conversation = [
    {"role": "user", "content": "Write a Python function to sort a list."},
    {"role": "assistant", "content": "def sort_list(items): return sorted(items)"},
    {"role": "user", "content": "Now make it handle None values."},
]

budget = check_token_budget(system, conversation)
if "error" in budget:
    # Counting failed, so the budget keys below are absent
    print(budget["error"])
else:
    print(f"Input tokens:    {budget['input_tokens']}")
    print(f"Available:       {budget['available_for_response']}")
    print(f"Utilization:     {budget['utilization_pct']}%")
    print(f"Fits:            {budget['fits']}")
    if budget["warning"]:
        print(f"WARNING: {budget['warning']}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function checkTokenBudget({
  systemPrompt,
  messages,
  model = 'claude-sonnet-4-6',
  maxContext = 200_000,
  desiredResponseTokens = 4096,
}) {
  try {
    // Count tokens for the full request using the API
    const countResponse = await client.messages.countTokens({
      model,
      system: systemPrompt,
      messages,
    });
    const inputTokens = countResponse.input_tokens;

    const available = maxContext - inputTokens;
    const fits = available >= desiredResponseTokens;

    return {
      inputTokens,
      availableForResponse: available,
      desiredResponseTokens,
      fits,
      utilizationPct: ((inputTokens / maxContext) * 100).toFixed(1),
      warning: fits ? null : (
        `Only ${available} tokens left for response, ` +
        `but ${desiredResponseTokens} requested. ` +
        `Trim history or reduce max_tokens.`
      ),
    };
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      return { error: `Token counting failed: ${error.message}` };
    }
    throw error;
  }
}

// Usage example
const budget = await checkTokenBudget({
  systemPrompt: 'You are a helpful coding assistant. Always provide complete, runnable examples.',
  messages: [
    { role: 'user', content: 'Write a function to sort a list.' },
    { role: 'assistant', content: 'function sortList(items) { return [...items].sort(); }' },
    { role: 'user', content: 'Now make it handle null values.' },
  ],
});

if (budget.error) {
  // Counting failed, so the budget fields below are absent
  console.error(budget.error);
} else {
  console.log(`Input tokens:    ${budget.inputTokens}`);
  console.log(`Available:       ${budget.availableForResponse}`);
  console.log(`Utilization:     ${budget.utilizationPct}%`);
  console.log(`Fits:            ${budget.fits}`);
  if (budget.warning) console.log(`WARNING: ${budget.warning}`);
}

Expected Output:

Input tokens:    58
Available:       199942
Utilization:     0.0%
Fits:            True

(Node.js prints true instead of True.)

What Just Happened? You built a pre-flight budget checker. The short 3-turn conversation used only 58 input tokens (0.03% of the 200K window), leaving 199,942 tokens available — plenty of room. But imagine this was turn 200 of a long support conversation: the input could easily be 150K tokens, leaving only 50K for the response. The warning field in the return value catches exactly that scenario. In the modules ahead (especially M08: Conversation Management), you will wire this function into your agent loop so it runs automatically before every API call.
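
The estimate-first compromise mentioned earlier (use the cheap heuristic, and pay for an exact count only near the limit) can be sketched as follows. The function count_with_fallback and its exact_counter parameter are hypothetical; in production the callable would wrap client.messages.count_tokens(...).input_tokens, and here a lambda stands in so the example runs offline. Message content is assumed to be plain strings.

```python
from typing import Callable


def count_with_fallback(
    system_prompt: str,
    messages: list[dict],
    exact_counter: Callable[[], int],
    context_window: int = 200_000,
    precise_above: float = 0.7,
) -> tuple[int, str]:
    """Return (token_count, source), calling exact_counter only when close to the limit."""
    chars = len(system_prompt) + sum(len(m["content"]) for m in messages)
    estimate = chars // 4  # rough: ~4 characters per token

    if estimate / context_window < precise_above:
        return estimate, "estimate"   # far from the limit, the heuristic is enough
    return exact_counter(), "exact"   # near the limit, pay for precision


short_chat = [{"role": "user", "content": "Hi"}]
long_chat = [{"role": "user", "content": "x" * 600_000}]

# The lambdas stand in for a real count_tokens call.
print(count_with_fallback("sys", short_chat, lambda: 999))
print(count_with_fallback("", long_chat, lambda: 151_234))
```

Because the heuristic can undercount code and non-English text, a lower precise_above threshold is the safer choice for those workloads.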

Hands-On Exercise: Build a Token-Aware Prompt Function

What You'll Build

A reusable safe_chat() function that checks your token budget before every API call and warns you when space is running low. You'll use this utility throughout the rest of the course.

Time estimate: 20–30 minutes • Prerequisites: Completed M01 lab (API key set, SDK installed) • Files you'll create: token_tools.py (or token_tools.mjs)

Environment Setup

If you completed the M01 lab, you're already set up. Otherwise, run this first:

# Python
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY="your-key-here"

# Node.js
npm install @anthropic-ai/sdk@^0.30.0
export ANTHROPIC_API_KEY="your-key-here"

Step 1: Count Tokens for a System Prompt

Before you can manage a budget, you need to measure what you're spending. This step builds a small helper that takes a system prompt and a list of messages, and returns the exact token count using the SDK's count_tokens endpoint. This is the foundation for all budget logic.

Create a new file called token_tools.py (or token_tools.mjs):

import anthropic

client = anthropic.Anthropic()

def count_input_tokens(
    system_prompt: str,
    messages: list[dict],
    model: str = "claude-sonnet-4-6",
) -> int:
    """Count the exact number of input tokens for a request."""
    result = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=messages,
    )
    return result.input_tokens

# Test it
system = "You are a helpful assistant."
msgs = [{"role": "user", "content": "Hello, how are you?"}]
count = count_input_tokens(system, msgs)
print(f"Input tokens: {count}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function countInputTokens(systemPrompt, messages, model = 'claude-sonnet-4-6') {
  const result = await client.messages.countTokens({
    model,
    system: systemPrompt,
    messages,
  });
  return result.input_tokens;
}

// Test it
const count = await countInputTokens(
  'You are a helpful assistant.',
  [{ role: 'user', content: 'Hello, how are you?' }]
);
console.log(`Input tokens: ${count}`);

Run it: python token_tools.py (or node token_tools.mjs)

Expected Output:

Input tokens: 18

✅ Checkpoint: If you see a number between 15 and 25, Step 1 is working. The exact count may vary slightly depending on the model version, but a short system prompt plus a short message should be under 30 tokens.

Troubleshooting — Step 1

ModuleNotFoundError: No module named 'anthropic' — Run pip install "anthropic>=0.30.0" (Python) or npm install @anthropic-ai/sdk (Node.js).
AuthenticationError: Invalid API Key — Check that your ANTHROPIC_API_KEY environment variable is set. Run echo $ANTHROPIC_API_KEY to verify it's not empty.
AttributeError: 'Messages' object has no attribute 'count_tokens' — Your SDK is too old to include this method. Run pip install --upgrade anthropic to get the latest release.

Step 2: Calculate Remaining Budget

Now let's build the budget calculator. This function uses the token count from Step 1 and subtracts it from the 200K context window to tell you how much room is left for Claude's response. This is the "pre-flight check" you'll wire into every agent loop.

Add the following function to token_tools.py (after the count_input_tokens function from Step 1):

def check_budget(
    system_prompt: str,
    messages: list[dict],
    desired_output: int = 4096,
    max_context: int = 200_000,
) -> dict:
    """Check if a conversation fits within the token budget."""
    input_tokens = count_input_tokens(system_prompt, messages)
    available = max_context - input_tokens
    fits = available >= desired_output
    utilization = round((input_tokens / max_context) * 100, 1)

    return {
        "input_tokens": input_tokens,
        "available": available,
        "fits": fits,
        "utilization_pct": utilization,
        "warning": None if fits else (
            f"Only {available} tokens left, but {desired_output} requested. "
            f"Trim history or reduce max_tokens."
        ),
    }

# Test with a short conversation
budget = check_budget(
    "You are a helpful assistant.",
    [
        {"role": "user", "content": "What are tokens?"},
        {"role": "assistant", "content": "Tokens are subword units that LLMs process."},
        {"role": "user", "content": "How do they affect cost?"},
    ]
)
print(f"Input tokens:  {budget['input_tokens']}")
print(f"Available:     {budget['available']}")
print(f"Utilization:   {budget['utilization_pct']}%")
print(f"Fits:          {budget['fits']}")

async function checkBudget(systemPrompt, messages, desiredOutput = 4096, maxContext = 200_000) {
  const inputTokens = await countInputTokens(systemPrompt, messages);
  const available = maxContext - inputTokens;
  const fits = available >= desiredOutput;
  const utilization = ((inputTokens / maxContext) * 100).toFixed(1);

  return {
    inputTokens,
    available,
    fits,
    utilizationPct: utilization,
    warning: fits ? null : (
      `Only ${available} tokens left, but ${desiredOutput} requested. ` +
      `Trim history or reduce max_tokens.`
    ),
  };
}

// Test with a short conversation
const budget = await checkBudget(
  'You are a helpful assistant.',
  [
    { role: 'user', content: 'What are tokens?' },
    { role: 'assistant', content: 'Tokens are subword units that LLMs process.' },
    { role: 'user', content: 'How do they affect cost?' },
  ]
);
console.log(`Input tokens:  ${budget.inputTokens}`);
console.log(`Available:     ${budget.available}`);
console.log(`Utilization:   ${budget.utilizationPct}%`);
console.log(`Fits:          ${budget.fits}`);

Run it: python token_tools.py (or node token_tools.mjs)

Expected Output:

Input tokens:  42
Available:     199958
Utilization:   0.0%
Fits:          True

✅ Checkpoint: If you see Fits: True with a low utilization percentage, Step 2 is working. The 3-turn conversation uses less than 0.1% of the context window — plenty of room. But imagine this at turn 200 of a long support chat.

Troubleshooting — Step 2

NameError: name 'count_input_tokens' is not defined — Make sure both count_input_tokens (from Step 1) and check_budget are in the same file, and that Step 1's code appears above Step 2's.
Output shows Fits: False unexpectedly — Check that your max_context parameter is 200,000 (not 200). Python uses underscores in number literals: 200_000.
Token count seems too high — Remember that count_tokens includes the system prompt tokens, not just the messages. A system prompt of "You are a helpful assistant." adds ~10 tokens.

Step 3: Build the Safe Chat Function

Now let's combine everything into a single safe_chat() function that checks the budget, warns if space is tight, and refuses to make the call if it would overflow. This is the utility you'll reuse throughout the course. It uses check_budget from Step 2.

Add this function to token_tools.py:

def safe_chat(
    system_prompt: str,
    messages: list[dict],
    max_tokens: int = 1024,
    warn_threshold: float = 0.7,
) -> str:
    """Make an API call with automatic token budget checking."""
    budget = check_budget(system_prompt, messages, desired_output=max_tokens)

    # Warn if utilization is high
    if budget["utilization_pct"] > warn_threshold * 100:
        print(f"⚠️  Token budget warning: {budget['utilization_pct']}% used")
        print(f"   Only {budget['available']} tokens left for response")

    # Refuse if it won't fit
    if not budget["fits"]:
        raise ValueError(
            f"Token budget exceeded: {budget['input_tokens']} input tokens, "
            f"only {budget['available']} left. {budget['warning']}"
        )

    # Make the call
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    )

    print(f"📊 Tokens: {response.usage.input_tokens} in, {response.usage.output_tokens} out "
          f"({budget['utilization_pct']}% window used)")

    return response.content[0].text

# Test it
result = safe_chat(
    "You are a helpful assistant. Keep answers brief.",
    [{"role": "user", "content": "What's the tallest mountain on Earth?"}],
)
print(f"\nResponse: {result}")

async function safeChat(systemPrompt, messages, maxTokens = 1024, warnThreshold = 0.7) {
  const budget = await checkBudget(systemPrompt, messages, maxTokens);

  // Warn if utilization is high
  if (parseFloat(budget.utilizationPct) > warnThreshold * 100) {
    console.log(`⚠️  Token budget warning: ${budget.utilizationPct}% used`);
    console.log(`   Only ${budget.available} tokens left for response`);
  }

  // Refuse if it won't fit
  if (!budget.fits) {
    throw new Error(
      `Token budget exceeded: ${budget.inputTokens} input tokens, ` +
      `only ${budget.available} left. ${budget.warning}`
    );
  }

  // Make the call
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: maxTokens,
    system: systemPrompt,
    messages,
  });

  console.log(`📊 Tokens: ${response.usage.input_tokens} in, ${response.usage.output_tokens} out ` +
    `(${budget.utilizationPct}% window used)`);

  return response.content[0].text;
}

// Test it
const result = await safeChat(
  'You are a helpful assistant. Keep answers brief.',
  [{ role: 'user', content: "What's the tallest mountain on Earth?" }],
);
console.log(`\nResponse: ${result}`);

Run it: python token_tools.py (or node token_tools.mjs)

Expected Output:

📊 Tokens: 22 in, 35 out (0.0% window used)

Response: Mount Everest is the tallest mountain on Earth, standing at 8,849 meters (29,032 feet) above sea level.

✅ Checkpoint: If you see the token usage line followed by a response, Step 3 is working. The safe_chat function now automatically checks the budget, warns at 70% utilization, and refuses calls that would overflow the context window.

Troubleshooting

NameError: name 'count_input_tokens' is not defined — Make sure all three functions (count_input_tokens, check_budget, safe_chat) are in the same file, and the client = anthropic.Anthropic() line is at the top.
AttributeError: 'Messages' object has no attribute 'count_tokens' — You need SDK version 0.30.0 or later. Run pip install --upgrade anthropic (or npm install @anthropic-ai/sdk@latest).
Warning shows even for small conversations — Check that your warn_threshold is 0.7 (70%), not 0.007. It's a fraction, not a percentage.
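
The fraction-versus-percentage mix-up in the last item is easier to see in isolation. The sketch below assumes, as in safeChat above, that utilization is reported as a percentage while the threshold is a fraction; should_warn is a hypothetical helper written for illustration, not part of token_tools.py.

```python
def should_warn(utilization_pct: float, warn_threshold: float = 0.7) -> bool:
    """Return True when utilization (a percentage) exceeds the threshold (a fraction)."""
    # Scale the fraction up to a percentage before comparing
    return utilization_pct > warn_threshold * 100

# A small conversation using 5% of the window
print(should_warn(5.0))         # threshold 0.7 means 70%: no warning
print(should_warn(5.0, 0.007))  # threshold 0.007 means 0.7%: warns on almost anything
```

If the warning fires on tiny conversations, a threshold that was divided by 100 one time too many is the likely cause.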

Step 4 (Stretch Goal): Fill the Context Window

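Nothing teaches token budget management like watching it happen in real time. The code below runs a loop that sends messages to Claude, one after another, and prints the token count after each turn. You'll see input_tokens climb on every turn, from a few dozen tokens at the start to several thousand by the end of the loop. The loop also stops early if input ever passes a 150K warning threshold. With replies capped at 200 tokens, about 100 turns add roughly 20K tokens of history at most, so don't expect that guard to fire unless you raise max_tokens or the turn count.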

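The interesting part is the growth pattern. Each turn adds both the new user message and Claude's previous response to the conversation history, so the input tokens per call grow roughly linearly with the turn number. Because every call resends that whole history, the total input tokens you pay for across the conversation grow roughly quadratically. By turn 30, each call carries about six times as much history as it did at turn 5, and you have already paid for every smaller call before it.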
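
The growth pattern can be previewed offline, with no API calls and no cost. The sketch below tracks the input tokens of each call and the running total separately. It assumes about 20 tokens for the system prompt, 15 per user message, and 70 per reply; these sizes are illustrative guesses rather than measurements, and simulate_growth is a hypothetical helper.

```python
def simulate_growth(turns, system_tokens=20, user_tokens=15, reply_tokens=70):
    """Return (per_call_input, cumulative_input) lists for a growing conversation.

    All token sizes are illustrative assumptions, not measured values.
    """
    per_call, cumulative = [], []
    history = 0   # tokens of earlier user messages and replies
    running = 0   # total input tokens sent across all calls so far
    for _ in range(turns):
        history += user_tokens               # new user message joins the history
        call_input = system_tokens + history
        running += call_input                # the whole history is resent on each call
        per_call.append(call_input)
        cumulative.append(running)
        history += reply_tokens              # the reply is resent from the next turn on
    return per_call, cumulative

per_call, cumulative = simulate_growth(30)
for turn in (5, 10, 20, 30):
    print(f"Turn {turn:>2}: {per_call[turn - 1]:>5} input tokens this call, "
          f"{cumulative[turn - 1]:>6} sent in total")
```

Under these assumptions the per-call figure rises by the same 85 tokens every turn, while the running total grows with the square of the turn count. The running total is what drives the bill.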

A fair warning: this exercise makes many API calls in a loop, so it will cost real money (~$0.50–$2.00 depending on how far it runs). If you'd rather observe the pattern without the cost, just read the expected output below instead of running the code.

Save this as a separate file (e.g., fill_window.py or fill_window.mjs) — it has its own import and client setup, so do not paste it into token_tools.py from the previous steps.

import anthropic

client = anthropic.Anthropic()

# Build a conversation that grows until the budget runs out
conversation = []
system = "You are a helpful assistant. Keep your answers brief (under 50 words)."

for i in range(1, 101):  # turns 1 through 100
    conversation.append({
        "role": "user",
        "content": f"Tell me fact #{i} about space exploration."
    })

    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            system=system,
            messages=conversation,
        )
        assistant_msg = response.content[0].text
        conversation.append({"role": "assistant", "content": assistant_msg})

        total = response.usage.input_tokens + response.usage.output_tokens
        print(f"Turn {i}: {response.usage.input_tokens} in + {response.usage.output_tokens} out = {total} total")

        # Watch the input tokens grow with each turn
        if response.usage.input_tokens > 150_000:
            print(f"\n--- Budget warning! Input alone is {response.usage.input_tokens} tokens ---")
            break

    except anthropic.APIError as e:
        print(f"\nHit the limit at turn {i}: {e.message}")
        break

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const conversation = [];
const system = 'You are a helpful assistant. Keep your answers brief (under 50 words).';

for (let i = 1; i <= 100; i++) {
  conversation.push({
    role: 'user',
    content: `Tell me fact #${i} about space exploration.`,
  });

  try {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 200,
      system,
      messages: conversation,
    });
    const assistantMsg = response.content[0].text;
    conversation.push({ role: 'assistant', content: assistantMsg });

    const total = response.usage.input_tokens + response.usage.output_tokens;
    console.log(`Turn ${i}: ${response.usage.input_tokens} in + ${response.usage.output_tokens} out = ${total} total`);

    if (response.usage.input_tokens > 150_000) {
      console.log(`\n--- Budget warning! Input alone is ${response.usage.input_tokens} tokens ---`);
      break;
    }
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.log(`\nHit the limit at turn ${i}: ${error.message}`);
    } else {
      throw error;
    }
    break;
  }
}

What Just Happened? You watched a conversation grow turn by turn. Each turn adds the new user message plus Claude's previous response to the history, so input tokens per call grow roughly linearly, and the total tokens sent across the whole conversation grow roughly quadratically. With replies this brief, the loop ends long before the 150K guard fires. Real-world agents get there much faster: long customer support sessions, multi-step coding tasks, and research workflows add tool results and documents that can run to thousands of tokens per turn. The takeaway: without active conversation management (summarization, pruning, or sliding windows), any agent will hit the context wall surprisingly fast.
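
Of the management strategies just named, a sliding window is the simplest to sketch. The example below drops the oldest messages until the history fits a budget. It estimates size with the rough rule of about 4 characters per token instead of a real count, so treat the numbers as approximate. estimate_tokens and prune_to_budget are hypothetical helpers, not SDK functions.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def prune_to_budget(messages, max_history_tokens):
    """Drop the oldest messages until the estimated history fits the budget.

    Keeps the most recent messages and always retains at least the last one.
    """
    kept = list(messages)
    total = sum(estimate_tokens(m["content"]) for m in kept)
    while len(kept) > 1 and total > max_history_tokens:
        dropped = kept.pop(0)                # oldest message goes first
        total -= estimate_tokens(dropped["content"])
    # Keep the history starting on a user turn
    while len(kept) > 1 and kept[0]["role"] != "user":
        kept.pop(0)
    return kept

# Build a 10-turn history with made-up 200-character replies
history = []
for i in range(1, 11):
    history.append({"role": "user", "content": f"Tell me fact #{i} about space exploration."})
    history.append({"role": "assistant", "content": "x" * 200})

pruned = prune_to_budget(history, max_history_tokens=200)
print(len(history), "->", len(pruned), "messages")  # prints: 20 -> 6 messages
```

A sliding window is cheap but forgets everything it drops. Summarizing the dropped messages instead trades an extra API call for retained context.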

Verify Everything Works

Run your complete token_tools.py file to confirm all three functions work together:

python token_tools.py    # or: node token_tools.mjs

You should see output from all three steps: a token count, a budget check, and a successful safe_chat response with usage stats.

🎉 Congratulations! You now have a reusable safe_chat() function that prevents context window overflow. You'll use this pattern in M08 (Conversation Management) and in every capstone project. The key skill you've built: never make an API call without knowing your token budget first.

Knowledge Check

Test your understanding of tokens, costs, and context windows.

Q1: Approximately how many tokens is the sentence "Hello, how are you today?"

A. 2 tokens

B. 5 tokens (one per word)

C. 7-8 tokens (words + punctuation + spaces)

D. 25 tokens (one per character)

Correct answer: C. Tokens don't map 1:1 to words or characters. Common words like "Hello" and "how" are single tokens, but punctuation and less common words may be separate tokens. The rule of thumb is ~1 token per 4 characters.
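
The rule of thumb can be applied in one line. This is only a ballpark sketch: estimate_tokens is a hypothetical helper, and the real count always comes from the tokenizer, which for short, punctuation-heavy sentences tends to land a little above the estimate.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the ~4 characters per token rule of thumb."""
    return round(len(text) / 4)

sentence = "Hello, how are you today?"
print(len(sentence), "characters ->", estimate_tokens(sentence), "tokens (estimated)")
```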

Q2: Why are output tokens more expensive than input tokens?

A. Output tokens are larger and take up more memory

B. Output generation is autoregressive (each token depends on all previous), requiring more computation than parallel input processing

C. It's just Anthropic's pricing strategy; the computation is the same

D. Output tokens need to be higher quality than input tokens

Correct answer: B. As you learned in M01, input is processed in parallel via attention, but output is generated one token at a time (autoregressive). Each output token requires a full forward pass through the model, making generation significantly more compute-intensive.

Q3: Your system prompt is 1,000 tokens, conversation history is 150,000 tokens, and your user message is 2,000 tokens. Using a 200K context window, how many tokens are available for Claude's response?

A. 200,000 tokens (the response has its own window)

B. 50,000 tokens

C. 48,000 tokens

D. 47,000 tokens (200K - 1K - 150K - 2K)

Correct answer: D. The context window is shared: 200,000 - 1,000 (system) - 150,000 (history) - 2,000 (user message) = 47,000 tokens available for the response. This is the token budget formula in action.
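
The same arithmetic as a small function, in case you want to check other scenarios. This is a minimal sketch: available_for_response is a hypothetical helper, and the 200K default is simply the window size used in this question.

```python
CONTEXT_WINDOW = 200_000  # window size used in the question

def available_for_response(system_tokens, history_tokens, message_tokens,
                           context_window=CONTEXT_WINDOW):
    """Tokens left for the response after all input is accounted for."""
    used = system_tokens + history_tokens + message_tokens
    # Never report a negative budget when the input already overflows
    return max(0, context_window - used)

print(available_for_response(1_000, 150_000, 2_000))  # prints: 47000
```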

Q4: What happens when your total tokens (input + requested output) exceed the context window?

A. The API returns an error

B. Claude silently truncates the oldest messages to make room

C. Claude summarizes the conversation automatically

D. The response quality degrades but the call succeeds

Correct answer: A. The API will return an error if input tokens exceed the context window limit. Claude does NOT silently truncate or summarize; that's YOUR job to manage. This is why the token budget calculator from this module is so important.

Q5: Which of these strategies helps manage token budget? (Select the best answer)

A. Always set max_tokens to the maximum possible value

B. Count tokens before each call, summarize old history when the budget gets tight

C. Use the longest possible system prompt to give Claude more context

D. There's no need to manage budget; 200K tokens is more than enough

Correct answer: B. Proactive token counting + conversation summarization is the standard pattern for long-running agents. You'll build this exact system in M08 (Conversation Management). Setting max_tokens too high wastes budget, and even 200K fills up fast in agent loops with tool calls.

Q6: Fill in the blank to access the token count from an API response:

response = client.messages.create(...)
input_count = response._______.input_tokens

A. tokens

B. metadata

C. usage

D. content

Correct answer: C. The usage field on the response object contains input_tokens and output_tokens. This is the same pattern you used in M01's first API call.

Module Summary

Key Takeaways

Tokens are subword units — produced by byte-pair encoding, they're the atomic unit of LLM input and output.
Three impacts: cost, limits, performance — every token costs money, fills the context window, and adds latency.
Output tokens cost more — because autoregressive generation is more compute-intensive than parallel input processing.
The context window is shared — system prompt + history + message + response must all fit. Overflow = error.
Token budget management is essential — count before you call, and you'll never be surprised by overflow or runaway costs.

Next Module Preview: M03 — Prompts

Now that you understand tokens and context windows, you're ready to learn how to use that space effectively. In Module 3, you'll master prompt engineering — system prompts, few-shot examples, chain-of-thought reasoning, and the art of getting Claude to do exactly what you need.