
M01: The LLM Mental Model

Understand what a Large Language Model really is, how Claude processes your text, and why the right mental model changes everything you'll build in this course.

Learning Objectives

Explain what a Large Language Model is and how it generates text one token at a time
Describe the difference between how Claude reads input (all at once) and writes output (one token at a time)
Use temperature, top-p, and top-k controls and predict how they change output
Make your first Claude API call using Python and Node.js
Adopt the "thinker, not calculator" mental model for working with LLMs

What Is a Large Language Model?

Everyday Analogy

BEFORE: Before LLMs, if you wanted a computer to answer a question, someone had to explicitly program every possible answer — think of old-school chatbots with giant lists of if/then rules, or search engines that could only find pages containing your exact keywords.

PAIN: That approach broke down constantly. Ask the chatbot something the programmer didn't anticipate, and you'd get "I don't understand." Ask a search engine a nuanced question, and you'd sift through ten blue links hoping one had your answer.

MAPPING: An LLM like Claude is the world's most well-read autocomplete. Your phone's autocomplete has read your messages; Claude has read billions of documents — books, code, conversations, scientific papers — and uses all of that to predict what comes next. Instead of following hand-written rules, it learned patterns from that mountain of text, so it can handle questions and tasks nobody explicitly programmed it for. It's autocomplete that went to every university, read every manual, and practiced every writing style.

What this actually looks like: When you type "The capital of France is", the model doesn't look up "France" in a table. Instead, it computes a probability for every possible next token. Here's a simplified version of what that prediction looks like internally:

Simplified next-token probabilities

Input: "The capital of France is"

Next token predictions:
  " Paris"    → 0.92  (92% probability)
  " the"      → 0.03  (3%)
  " located"  → 0.02  (2%)
  " a"        → 0.01  (1%)
  " Lyon"     → 0.005 (0.5%)
  ... thousands more tokens with tiny probabilities

Technical Definition A Large Language Model (LLM) is a neural network trained on vast amounts of text data. Let's unpack that piece by piece.

First, "neural network" means a mathematical system that learns by example rather than by following hand-written rules. You show it billions of text samples, and it gradually adjusts billions of internal numbers (called parameters) until it gets good at one specific job.

That one job is: given a sequence of tokens (small chunks of text — roughly words or word-pieces), predict the most likely next token. That's it. Every impressive thing an LLM does — writing code, answering questions, translating languages — is a side effect of getting extremely good at next-token prediction.

Now for the "Large" part. Models in Claude's class are estimated to have hundreds of billions of parameters (Anthropic does not publish exact figures) and are trained on terabytes of text. They don't "understand" language the way humans do; they find statistical patterns at a scale that produces remarkably useful results. The reason this matters for you as a builder is that the model's power and its failure modes both come from this prediction mechanism.
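To make next-token prediction concrete, here is a minimal sketch that reuses the simplified probabilities from the "The capital of France is" example. The dictionary and the helper name most_likely_token are illustrative assumptions; a real model scores every token in its vocabulary and does not expose a table like this.

```python
# Toy next-token distribution copied from the simplified example above.
# A real model scores every token in its vocabulary; five entries are enough here.
next_token_probs = {
    " Paris": 0.92,
    " the": 0.03,
    " located": 0.02,
    " a": 0.01,
    " Lyon": 0.005,
}

def most_likely_token(probs: dict[str, float]) -> str:
    """Return the token with the highest probability (the greedy choice)."""
    return max(probs, key=probs.get)

prompt = "The capital of France is"
print(prompt + most_likely_token(next_token_probs))
```

Nothing in this sketch looks up "France" anywhere: the answer falls out of which continuation carries the most probability.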

Diagram: How an LLM Generates Text

Animation: Token-by-Token Generation

Why It Matters Understanding that LLMs predict tokens — not "think" about answers — is the foundation for everything in this course. Here's a concrete example: a production agent processing 10,000 customer support tickets per day at temperature 0.0 will still produce roughly 2–5% hallucinated facts (200–500 tickets with incorrect information) if it has no tool access or guardrails. When your agent gives a wrong answer, it's because token prediction chose a plausible-but-incorrect continuation, not because it's "confused." This distinction shapes how you write prompts, design tools, and build guardrails — and it's why Modules 5 (Tool Use) and 16–17 (Guardrails) exist in this course.

How Claude Processes Text

Everyday Analogy

BEFORE: Before transformer-based models, older AI systems (like recurrent neural networks) had to read text word-by-word in order, like a person reading a sentence while covering up the words ahead of them. This made them slow and forgetful — by the time they reached the end of a long paragraph, they'd already started "forgetting" the beginning.

PAIN: That sequential reading created a bottleneck: longer inputs meant worse comprehension, because the model couldn't hold everything in mind at once.

MAPPING: Claude works like a speed reader who absorbs an entire page in one glance — every word attended to simultaneously, related to every other word. Then it writes its response one word at a time, each word influenced by everything it read plus everything it has written so far. The reading is instant and parallel; the writing is careful and sequential.

What this actually looks like: Here's a real API response showing the timing asymmetry. Notice how the input (your prompt) is processed almost instantly, but the output tokens trickle out one by one:

API timing example

Request: 850 tokens of input
Response: 120 tokens of output

Timeline:
  0ms    → Request sent
  180ms  → First output token arrives (all 850 input tokens processed)
  180ms  → "The"
  210ms  → " best"
  240ms  → " approach"
  ...      (each token ~30ms apart)
  3780ms → Final token generated

Input processing:  180ms  (850 tokens, all at once)
Output generation: 3600ms (120 tokens, one at a time)

Technical Definition Claude processes input using parallel attention through the transformer architecture. Here's what that means in plain English: every token in your prompt is compared against every other token at the same time. The model doesn't read left-to-right; it sees the whole message at once.

This is why sending a 1,000-token prompt to Claude is nearly as fast as sending a 100-token prompt: the input processing step happens in parallel.

Output generation, however, works completely differently. It's autoregressive, meaning "each step feeds into the next." Each new token is predicted based on all input tokens plus all previously generated output tokens. This is why output is the slow part — each token must wait for the previous one to be generated first.
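The autoregressive loop can be sketched in a few lines. This is a toy under stated assumptions: TOY_MODEL is a hypothetical lookup table standing in for the model, and it looks only at the most recent token, whereas a real model conditions on the entire sequence.

```python
# Hypothetical lookup table standing in for the model: it maps the most
# recent token to the token the "model" predicts next.
TOY_MODEL = {
    "The": " best",
    " best": " approach",
    " approach": " is",
    " is": " practice",
    " practice": "<end>",
}

def generate(first_token: str, max_tokens: int = 10) -> list[str]:
    """Autoregressive loop: every new token is predicted from the tokens so far."""
    sequence = [first_token]
    for _ in range(max_tokens):
        next_token = TOY_MODEL.get(sequence[-1], "<end>")
        if next_token == "<end>":    # a stop token ends generation early
            break
        sequence.append(next_token)  # the output is fed back in as input
    return sequence

print("".join(generate("The")))
```

Each pass through the loop has to wait for the previous one, which is why output cannot be produced in parallel the way input is read.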

Animation: Input Processing vs. Output Generation

Your Message → Tokenize → Attention (all at once) → Generate (one by one) → Response

How Inference Actually Works

Everyday Analogy

BEFORE: You probably picture an LLM as a function: question goes in, full answer comes out. That mental model is wrong in a specific way that matters once you start optimizing latency and cost.

PAIN: Without the right picture, you can’t reason about why the first token takes 800 ms but the next 200 tokens stream out at 60 ms each. You can’t explain why a 50K-token prompt is expensive even before Claude writes a single word back. You can’t budget for production.

MAPPING: Inference is autoregressive — Claude generates output one token at a time, and each new token is conditioned on every token before it (yours and its own). Picture someone typing a long reply on a phone: they read what they’ve typed so far, pick the most likely next letter, append it, re-read, pick the next letter, append, and so on. That’s inference. There’s no “full answer” sitting in the model waiting to be unwrapped — the answer is constructed token-by-token, in real time, in the same call.

Technical Definition

Inference is what happens when Claude is running, not learning. Every call to the API kicks off two distinct phases:

1. Prefill (a.k.a. the “forward pass” over your prompt). Your prompt — system + messages + tool definitions — is tokenized (M02), converted to embeddings, and pushed through every transformer layer in parallel. The model computes attention across all input tokens at once, which is fast per-token but heavy: cost is roughly O(N²) in prompt length. The output of prefill is one set of logits — a probability distribution over the entire vocabulary — for the next token to generate.

2. Decode (a.k.a. token-by-token generation). The model samples one token from the prefill logits, appends it to the running sequence, and runs just that new token through the transformer (reusing cached attention values for everything before it — the KV cache). That produces the next set of logits. Sample, append, repeat — until the model emits a stop token or hits max_tokens. Decode is sequential by construction: token N can’t start until token N−1 exists.

Two phases, two costs. Prefill latency is paid once per request and dominates “time to first token.” Decode latency is paid per output token and dominates “tokens per second.” This split is why a 50K-token prompt with a 100-token answer feels slow to start but finishes quickly — and why a 200-token prompt with a 4000-token answer feels snappy at first but takes forever.
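A back-of-the-envelope estimator shows how the two costs trade off. The function name and both default rates are assumptions chosen for illustration, and prefill is modeled as linear in prompt length for simplicity even though attention cost grows faster than that.

```python
def estimate_latency_ms(prompt_tokens, output_tokens,
                        prefill_ms_per_1k=50.0, decode_tokens_per_s=50.0):
    """Return (time_to_first_token_ms, decode_ms) under a simple linear model.

    Both rates are illustrative placeholders, not measured values.
    """
    ttft_ms = prompt_tokens * prefill_ms_per_1k / 1000
    decode_ms = output_tokens * 1000 / decode_tokens_per_s
    return ttft_ms, decode_ms

for label, prompt, output in [("long prompt, short answer", 50_000, 100),
                              ("short prompt, long answer", 200, 4_000)]:
    ttft, decode = estimate_latency_ms(prompt, output)
    print(f"{label}: TTFT ~ {ttft:.0f} ms, decode ~ {decode:.0f} ms")
```

Swapping in rates measured from your own traffic turns this into a rough budgeting tool; it ignores network and queue time.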

Prefill (once) → Decode (token-by-token, once per output token)

1. PREFILL (runs once)

Tokenize prompt (N tokens)
Embed all N tokens
Run through every transformer layer in parallel
Build the KV cache for every token
Output: logits for token N+1

Cost: ~O(N²) compute, but parallelizable. Latency: drives time-to-first-token.

2. DECODE (runs once per output token)

Sample one token from current logits
Append it to the running sequence
Run just the new token through layers, reading KV cache
Get logits for the next token
Stop if token == <end> or budget hit; else repeat from the sampling step

Cost: ~O(1) model compute per token plus a read of the KV cache (memory-bound). Latency: drives tokens-per-second.

Sampling: How Logits Become a Token

The transformer doesn’t pick a token; it outputs logits — a raw score per token in the vocabulary (~200K entries for Claude). To turn logits into a chosen token, three steps run:

Softmax with temperature. Divide each logit by the temperature T, then apply softmax to get a probability distribution. T = 0 is treated as the greedy limit (always pick the argmax, since dividing by zero is undefined); T = 1 keeps the model’s native distribution; T > 1 flattens it (more surprise).
Top-k / Top-p filter. Optionally truncate the distribution to either the top-k highest-probability tokens, or the smallest set whose cumulative probability exceeds p (nucleus sampling). This blocks the long tail of nonsense tokens.
Sample. Draw one token from the (renormalized) filtered distribution. This single token becomes the “next token,” gets appended, and the loop continues.

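Important consequence: the model’s output is non-deterministic at any temperature above 0 because of the sampling step. Set temperature=0 when you need outputs to be as repeatable as possible: that picks the argmax with no sampling, although small run-to-run differences can still occur for the reasons listed under Common Misconceptions below. The Temperature, Top-p & Top-k section goes deeper on when to use each.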
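The temperature and sampling steps can be sketched in plain Python. This is a simplified illustration, not how any production inference server is implemented: the three-token vocabulary and its logits are made up, and the optional top-k / top-p filter is left out to keep the example short.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by temperature."""
    if temperature == 0:
        # Greedy limit: put all probability on the argmax instead of dividing by zero.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical three-token vocabulary with made-up logits.
tokens = [" to", " by", " through"]
logits = [2.0, 1.0, 0.5]
rng = random.Random(0)  # seeded so the demo is repeatable

for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    rounded = [round(p, 3) for p in probs]
    print(f"T={t}: probs={rounded}, sampled={sample_token(tokens, logits, t, rng)!r}")
```

Running it shows the top token's share shrinking as T rises, while T = 0 always returns the same token.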

Latency Anatomy — Where Your Seconds Go

Real numbers, on Claude Sonnet 4.6 (typical 2026 production load, single-region):

Phase	What dominates	Order of magnitude
Network	TLS, request routing, region distance	30–150 ms
Prefill (TTFT)	Prompt length; quadratic-ish in N	~50 ms / 1K tokens (uncached); <5 ms / 1K cached
Decode	Output length; linear in tokens generated	~50–100 tokens/s (Sonnet); higher with speculative decoding
Server queue	Concurrent traffic, rate-limit tier	0–500 ms p99

Practical Consequences

Time-to-first-token (TTFT) is set almost entirely by prompt length and whether the prefix is cached. Prompt caching (M22) turns “5 seconds of prefill” into “200 ms of prefill” on repeat-heavy prompts.
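Tokens-per-second (TPS) is set by the decode step. Output length is the cost driver here: generating twice as many tokens takes roughly twice as long, regardless of how long the prompt was. Note that max_tokens is only a cap, so raising it changes decode time only if the model actually writes more.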
Streaming (next subsection) doesn’t change total wall-clock latency; it just delivers tokens as they’re produced so users see something happening at TTFT instead of waiting for the full decode.
Extended thinking (M22) and reasoning models add a hidden third phase — thinking tokens are decoded before any visible response. We’ll connect the dots in M22.

Streaming vs Batch — Same Inference, Different Delivery

One request, two ways to receive the answer:

Non-streaming — you wait for the full decode, then receive the entire response in one chunk. Simple; the agent code in M05–M07 uses this shape.
Streaming — the server sends each decoded token (or small group) as Server-Sent Events. Same total time, but the user sees the first tokens immediately. Essential for chat UIs and any agent loop where you want to surface progress before the response is complete.

from anthropic import Anthropic
import time

client = Anthropic()

# Streaming: each token (or small chunk) arrives as it’s decoded.
t0 = time.perf_counter()
first_token_t = None
total_tokens = 0

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain inference in one paragraph."}],
) as stream:
    for chunk in stream.text_stream:
        if first_token_t is None:
            first_token_t = time.perf_counter() - t0
            print(f"[TTFT: {first_token_t*1000:.0f} ms]")
        print(chunk, end="", flush=True)
        total_tokens += 1   # text chunks, not perfect token counts — close enough for ops

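elapsed = time.perf_counter() - t0
# Guard against an empty stream: first_token_t stays None if no text chunk arrived.
decode_s = max(elapsed - (first_token_t or elapsed), 0.001)
print(f"\n[total: {elapsed:.2f}s, decode rate ~ {total_tokens/decode_s:.0f} chunks/s]")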

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const t0 = performance.now();
let firstTokenMs: number | null = null;
let chunks = 0;

const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  messages: [{ role: "user", content: "Explain inference in one paragraph." }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    if (firstTokenMs === null) {
      firstTokenMs = performance.now() - t0;
      console.log(`[TTFT: ${firstTokenMs.toFixed(0)} ms]`);
    }
    process.stdout.write(event.delta.text);
    chunks++;
  }
}

const elapsed = performance.now() - t0;
console.log(`\n[total: ${(elapsed / 1000).toFixed(2)}s, decode rate ~ ${(chunks / Math.max((elapsed - (firstTokenMs ?? 0)) / 1000, 0.001)).toFixed(0)} chunks/s]`);

✔ What Just Happened?

You watched the two-phase model in real time. The TTFT print fires when the first decoded token arrives, which is prefill time plus network and queue overhead; the “decode rate” reports how fast subsequent chunks stream in. Run this against a short prompt and a 10K-token prompt back-to-back: TTFT should grow with prompt length while the decode rate stays roughly constant. That’s the prefill/decode split made visible.

Common Misconceptions

“Inference and training are the same thing.” — They’re not. Training updates billions of weights using gradient descent across millions of examples; inference reads those frozen weights to predict the next token. You only ever do inference when calling the API; training happened at Anthropic before the model shipped.

“Streaming is faster than non-streaming.” — Same total time. Streaming just shows tokens as they decode rather than buffering them. Use streaming for UX (perceived latency); use non-streaming when you need the full response before doing anything (parsing JSON, tool dispatch).

“Big prompts are slow because Claude has to read them all.” — Half-right. Prefill processes the prompt in parallel, but the work scales roughly with N² due to attention. Long prompts hurt TTFT, not decode rate. Prompt caching (M22) collapses that cost on repeat-heavy prompts.

“temperature=0 means deterministic.” — Mostly true, with caveats. At 0 the sampling step becomes argmax (no randomness in the model). But tie-breaking, server-side batching, and floating-point non-determinism on GPUs can still produce different outputs across runs. For strict reproducibility, also pin the model snapshot and seed if the API supports it.

Now that you know how Claude turns logits into tokens, the next section explains why the temperature knob exists and when to turn it.

Temperature, Top-p & Top-k

Everyday Analogy

BEFORE: Without sampling controls, a language model would always pick the single highest-probability next word — like a restaurant that only ever serves the most popular dish, regardless of what you're in the mood for. Every sentence would sound the same, mechanical and repetitive.

PAIN: That's terrible for creative tasks (bland writing), but also bad for technical tasks where multiple phrasings are equally correct — the model would get stuck in ruts, always producing identical outputs.

MAPPING: Temperature is a creativity dial. At 0, Claude always picks the safest, most predictable next word — like a cautious writer sticking to cliches. At 1.0, Claude is willing to take risks and surprise you, choosing less-probable but more interesting words. Top-p and top-k are like narrowing the menu of options Claude considers before making a choice — top-p says "only consider words that make up the top 90% of the probability," and top-k says "only consider the top 50 most likely words."

What this actually looks like: Here's the same set of next-token probabilities at three different temperature settings. Watch how the distribution shifts:

Temperature effect on probabilities

Prompt: "The best way to learn programming is"

Temperature 0.0 (greedy: always pick the top word):
  " to"       → 100%   ← always chosen
  " by"       →   0%
  " through"  →   0%

Temperature 0.5 (moderate — top words dominate but others have a chance):
  " to"       → 58%
  " by"       → 22%
  " through"  → 12%
  " with"     →  5%
  " from"     →  3%

Temperature 1.0 (creative — spread across many options):
  " to"       → 30%
  " by"       → 22%
  " through"  → 15%
  " with"     → 10%
  " from"     →  6%
  " when"     →  4%
  ... more words now have a real chance

Technical Definition Here's how these three controls work under the hood, step by step.

Step 1 — Temperature: The model produces logits — raw prediction scores, think of them as "confidence points" for every possible next word. Temperature divides all logits by the temperature value before they're converted to probabilities via softmax (a function that turns numbers into percentages that add up to 100%). A low temperature like 0.1 makes the top word's probability dominate (say, 95%). A high temperature like 1.0 keeps the distribution spread out (maybe 30%, 20%, 15%...).

Step 2 — Top-p: Top-p (also called nucleus sampling) then trims the menu. It sorts words by probability and keeps only the smallest set whose probabilities add up to p. For example, 0.9 means "keep the top 90% of probability mass, discard the rest."

Step 3 — Top-k: Top-k is a simpler filter — it limits consideration to the k most likely tokens regardless of their probabilities. For example, top-k=50 means only the top 50 words can be chosen.

In practice, temperature is the one you'll adjust most often. Top-p and top-k are fine-tuning knobs for when you need precise control.
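The two filters are easy to compare side by side. The sketch below is a minimal illustration on a made-up distribution; the function names top_k_filter and top_p_filter are hypothetical helpers, not part of any SDK.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens, then renormalize so they sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept = []
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Made-up probabilities for illustration; they sum to 1.
distribution = {" to": 0.50, " by": 0.25, " through": 0.15, " with": 0.07, " from": 0.03}

print("top-k (k=2):  ", list(top_k_filter(distribution, 2)))
print("top-p (p=0.8):", list(top_p_filter(distribution, 0.8)))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the model is unsure and fewer when one token dominates.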


Why It Matters For agent work, you'll almost always want low temperature (0.0–0.3). Here's a real scenario: imagine an agent that routes 5,000 customer emails per day to 8 different tool functions (refund, escalate, FAQ lookup, etc.). At temperature 0.0, the same email consistently triggers the same tool — you get reproducible behavior that's easy to test and debug. At temperature 0.8, about 3–8% of borderline emails will randomly route to different tools on each run (150–400 unpredictable decisions per day), making your system unreliable and your logs a nightmare to audit. Save higher temperatures for creative writing tasks where variety is the goal, not a bug.
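A small simulation illustrates why borderline cases flip at higher temperature. It rests on loud assumptions: the tool names and scores are invented, and a routing decision is modeled as a single sampling step, whereas a real agent's tool choice emerges from many generated tokens.

```python
import math
import random
from collections import Counter

# Hypothetical scores for one borderline email: "refund" barely beats "escalate".
TOOL_LOGITS = {"refund": 1.0, "escalate": 0.9, "faq_lookup": 0.2}

def choose_tool(tool_logits, temperature, rng):
    """Pick a tool greedily at temperature 0, otherwise sample from the softmax."""
    tools = list(tool_logits)
    if temperature == 0:
        return max(tools, key=tool_logits.get)
    # random.choices normalizes the weights, so unnormalized exp() values are enough.
    weights = [math.exp(tool_logits[t] / temperature) for t in tools]
    return rng.choices(tools, weights=weights, k=1)[0]

def route_many(temperature, runs=1000, seed=0):
    """Route the same email repeatedly and count which tool was chosen."""
    rng = random.Random(seed)
    return Counter(choose_tool(TOOL_LOGITS, temperature, rng) for _ in range(runs))

print("T=0.0:", dict(route_many(0.0)))
print("T=0.8:", dict(route_many(0.8)))
```

At temperature 0 every run lands on the same tool; at 0.8 the same email is spread across several tools, which is the audit problem described above.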

The "Calculator vs. Thinker" Mental Model

Everyday Analogy

BEFORE: Before LLMs, most software was deterministic — a calculator gives you the same answer every time: 2 + 2 = 4, always. Developers built their entire workflow around this certainty: write code, run tests, expect exact outputs.

PAIN: When teams first adopt LLMs, they instinctively treat the model like a calculator. They write a prompt, get a great answer, ship it — then are shocked when the same prompt gives a subtly different (or wrong) answer the next day. Bug reports pile up, tests fail intermittently, and trust erodes.

MAPPING: The fix is a mental model shift: Claude is a thinker, not a calculator. A thinker gives you their best answer, which can vary, can be wrong, and can surprise you with insight. Treat it like a very knowledgeable colleague who sometimes needs to be double-checked, not a database that returns exact facts. Once you internalize this, you'll naturally build verification steps, add guardrails, and design your agent for graceful handling of imperfect outputs.

What this actually looks like: Here's a real-world example of why the "thinker" model matters. Same prompt, two runs, both at temperature 0.0:

Non-deterministic behavior example

Prompt: "What is the population of Tokyo?"

Run 1: "Tokyo has a population of approximately 13.96 million
        people in the city proper as of 2023."

Run 2: "The population of Tokyo is about 14 million in the
        city proper, or roughly 37 million in the greater
        metropolitan area."

Both are reasonable. Neither is "wrong." But a calculator
would give you the exact same answer every time.
An agent that routes decisions based on this output
needs to handle BOTH variations gracefully.

This mental model is the single most important idea in this course. In plain English: a calculator always gives you 2 + 2 = 4. A thinker gives you their best reasoning, which is usually excellent but occasionally off. When you build an agent, you're building around a thinker — so you design for "usually right" rather than "always right."

How does this work internally? When Claude generates a response, it's making thousands of probabilistic choices (one per token). Each choice has some chance of going a different direction. Even at temperature 0.0, server-side implementation details like floating-point rounding and batch scheduling can cause tiny variations. The result is that outputs are highly consistent but not identical — and when the model is uncertain (borderline cases, ambiguous questions, math), those small variations can compound into meaningfully different answers.

How is this different from traditional software? In conventional programming, if a function returns the wrong result, it's a bug — you fix the code and it works. With LLMs, variation isn't a bug; it's a fundamental property of how the system works. This means your job as an agent builder shifts from "make it correct" to "make it reliably useful despite occasional imperfection." That's a completely different engineering discipline, and it's what the rest of this course teaches.

Here's how this mental model affects every decision you'll make:

Prompts (M03): You're giving instructions to a thinker, not writing code for a machine
Tool Use (M05): You give the thinker tools to compensate for what prediction can't do (real-time data, calculations, database lookups)
Guardrails (M16–M17): You build checks because thinkers can make mistakes
Evaluation (M18): You measure quality probabilistically, not as pass/fail
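Designing for "usually right" often starts with a check between the model's text and the action your code takes. The sketch below is one possible shape, assuming a hypothetical set of route names and a hypothetical parse_route helper; it is not an API from the Anthropic SDK.

```python
ALLOWED_ROUTES = {"refund", "escalate", "faq_lookup"}  # hypothetical tool names

def parse_route(model_output: str, fallback: str = "escalate") -> str:
    """Normalize a model's free-text answer to a known route, or fall back."""
    candidate = model_output.strip().strip(".\"'").lower().replace(" ", "_")
    return candidate if candidate in ALLOWED_ROUTES else fallback

# The same decision phrased three ways, plus an answer that is not a route at all.
for raw in ["refund", "Refund.", " FAQ lookup ", "I think this is a billing issue"]:
    print(f"{raw!r} -> {parse_route(raw)}")
```

The fallback route is a design choice: sending anything unrecognized to a human keeps an unexpected phrasing from turning into an unexpected action.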

⚠️ Common Misconceptions

"LLMs are basically a smarter search engine / database, right?" — No. A database stores facts and retrieves them exactly. Claude doesn't store or retrieve anything — it generates new text by predicting tokens. When it gives you a correct fact, it's because its training patterns lead to that prediction, not because it "looked it up." This is why it can produce plausible-sounding facts that are completely wrong — there's no lookup step to fail; it just predicts what sounds right.

"If I use temperature 0, the output is deterministic." — Almost, but not quite. Temperature 0 makes Claude pick the highest-probability token each time, which is highly consistent. But in practice, minor server-side differences (floating-point math, batching) can occasionally produce slightly different outputs. Design your agents for "extremely consistent," not "bit-for-bit identical every time."

"Claude understands what I'm saying." — It's more accurate to say Claude is extremely good at pattern matching over language. It processes the statistical relationships between tokens in ways that produce remarkably useful results, but it has no internal model of truth, no beliefs, and no comprehension in the human sense. This matters because it means Claude can confidently produce incorrect information — it doesn't "know" it's wrong.

"More parameters = more accurate." — Bigger models are generally more capable, but "capable" and "accurate" are different things. A larger model can handle more complex reasoning and nuanced prompts, but it can still hallucinate, and it may do so more convincingly. Size doesn't eliminate the need for verification and guardrails.

"If Claude gets something wrong, I should just ask again." — Retrying the same prompt is a lottery, not a strategy. If the model's training patterns lead it toward a wrong answer, it will likely give the same wrong answer again. The fix is to change the approach: rephrase the prompt, provide examples, add context, or use a tool to fetch the correct information. You'll learn all these techniques in Modules 3 through 5.

Code Walkthrough: Your First Claude API Call

Concept → Code Bridge: You now understand the theory — LLMs predict tokens, process input in parallel, and generate output sequentially. The next step is to see this in action. The code below sends a prompt to Claude's API and gets back a response, letting you observe token prediction firsthand (including how many tokens went in and came out).

Let's make your first call to the Claude API. You'll need an API key from console.anthropic.com.

Setup: API Key as Environment Variable

Security: Never Hardcode API Keys Store your key as an environment variable. Never put it directly in code, never commit it to Git.

bash

# Set your API key as an environment variable
export ANTHROPIC_API_KEY="your-key-here"

Making the Call

Let's start with the simplest possible Claude API call. The code below creates an Anthropic client, sends a single message, and prints the response. This three-step pattern — create client, call messages.create, read the content — is the foundation for every API call you'll make in this course. Once you internalize this structure, adding tools, streaming, and multi-turn conversations in later modules will feel like natural extensions.

Here's the one thing that trips up almost everyone on their first try: the response text is message.content[0].text, not message.text. Why the extra [0]? Because content is a list of content blocks: a single response can contain text blocks alongside tool-use requests (and thinking blocks when extended thinking is enabled). Even for a simple text reply, you need [0] to grab the first block. Forget this and you'll hit an AttributeError in Python or get undefined in JavaScript instead of a string.

# pip install "anthropic>=0.30.0"   (quote it so the shell doesn't treat >= as a redirect)
import anthropic

# The client reads ANTHROPIC_API_KEY from your environment
client = anthropic.Anthropic()

try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant who explains things clearly.",
        messages=[
            {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
        ]
    )
    # The response content is a list of content blocks
    print(message.content[0].text)
    print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.AuthenticationError:
    print("Invalid API key. Check your ANTHROPIC_API_KEY environment variable.")
except anthropic.APIError as e:
    print(f"API error: {e.message}")

// npm install @anthropic-ai/sdk@^0.30.0
import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from your environment
const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant who explains things clearly.',
    messages: [
      { role: 'user', content: 'What is a large language model? Explain in 2 sentences.' }
    ]
  });

  // The response content is a list of content blocks
  console.log(message.content[0].text);
  console.log(`\nTokens used: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}

curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a helpful assistant who explains things clearly.",
    "messages": [
      {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
    ]
  }'

Expected Output:

A large language model is a neural network trained on vast amounts of text data to predict and generate human-like text. It works by learning statistical patterns in language, enabling it to answer questions, write content, and assist with a wide range of language tasks.

Tokens used: 35 in, 48 out

What Just Happened? You sent 35 tokens of input (your prompt + system message) to Claude. Claude's transformer read all 35 tokens in parallel, then generated 48 tokens of output one at a time. Each output token was sampled from the model's predicted distribution given everything before it (no temperature was set, so the API default of 1.0 applied). The usage object tells you exactly how many tokens were consumed; this will matter for cost tracking (Module 2) and context window management (Module 4).
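As a preview of cost tracking, the usage counts can be turned into a rough dollar figure. The prices below are placeholders assumed for illustration, and estimate_cost_usd is a hypothetical helper; check the current pricing for whichever model you call.

```python
# Placeholder prices in USD per million tokens. These are assumptions for
# illustration only; look up the current rates for the model you use.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its usage counts."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# Usage counts from the expected output above: 35 tokens in, 48 tokens out.
cost = estimate_cost_usd(35, 48)
print(f"Estimated cost: ${cost:.6f}")
```

In a real script you would pass message.usage.input_tokens and message.usage.output_tokens instead of the literal numbers.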

Diagram: The Context Window

Experimenting with Temperature

The interesting part of this next example is that it sends the exact same prompt three times, each with a different temperature value (0.0, 0.5, and 1.0). The results will be noticeably different — that's the whole point. Seeing identical input produce varied output is the fastest way to feel that LLMs are stochastic, not deterministic. This is the kind of experiment worth running yourself, because reading about probability distributions is one thing — watching Claude give you three different answers to the same question makes it click.

import anthropic

client = anthropic.Anthropic()

prompt = "Write a one-sentence description of the moon."

for temp in [0.0, 0.5, 1.0]:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=100,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"Temperature {temp}: {message.content[0].text}")
    except anthropic.APIError as e:
        print(f"Error at temperature {temp}: {e.message}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const prompt = 'Write a one-sentence description of the moon.';

for (const temp of [0.0, 0.5, 1.0]) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 100,
      temperature: temp,
      messages: [{ role: 'user', content: prompt }]
    });
    console.log(`Temperature ${temp}: ${message.content[0].text}`);
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.error(`Error at temp ${temp}: ${error.message}`);
    } else {
      throw error;
    }
  }
}

Expected Output (results will vary at higher temps):

Temperature 0.0: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.
Temperature 0.5: The moon is a celestial companion to Earth, illuminating our night sky as it reflects the sun's light.
Temperature 1.0: Our luminous neighbor drifts through the cosmic dark, a pale stone mirror catching sunbeams for the sleeping world below.

What Just Happened? At temperature 0.0, Claude picked the highest-probability token at every step, producing a factual, encyclopedic sentence. At 0.5, it occasionally chose second- or third-ranked tokens, yielding a slightly more varied but still grounded response. At 1.0, Claude sampled from the model's probability distribution without sharpening it toward the top choice, so lower-probability but more creative words were selected more often, producing poetic and surprising language. Same model, same prompt, three very different outputs, all controlled by one number.
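To make the effect of that one number concrete, here is a minimal sketch of temperature scaling applied to a toy distribution. The candidate tokens and their scores in `toy_logits` are invented for illustration, and `softmax_with_temperature` is a hypothetical helper, not part of the Anthropic SDK. Real models score vocabularies of many thousands of tokens, and the API's exact sampling internals are not shown here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or softened by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score.
        top = max(logits.values())
        winners = [tok for tok, score in logits.items() if score == top]
        return {tok: (1.0 / len(winners) if tok in winners else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the word after "The moon is" (assumed values)
toy_logits = {"Earth's": 3.0, "a": 2.0, "our": 1.0, "luminous": 0.0}

for temp in [0.0, 0.5, 1.0]:
    probs = softmax_with_temperature(toy_logits, temp)
    summary = ", ".join(f"{tok}={p:.2f}" for tok, p in probs.items())
    print(f"Temperature {temp}: {summary}")
```

In this toy setup, the top token holds all of the probability at 0.0, most of it at 0.5, and a smaller share at 1.0, which leaves more room for the lower-ranked words to be picked.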

Hands-On Exercise: Hello Claude

What You'll Build

A series of small scripts that call Claude's API, experiment with temperature, and culminate in a working CLI chatbot. By the end you'll have made your first API call, observed how temperature changes output, and built a multi-turn conversation loop.

Time estimate: 20–30 minutes • Prerequisites: Python 3.9+ or Node.js 18+ • An Anthropic API key from console.anthropic.com

Environment Setup

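Copy the block for your language into your terminal (Python first, then Node.js) to create a project folder and install the SDK. On Windows, use the commands shown in the comments in place of the macOS/Linux ones: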

mkdir hello-claude && cd hello-claude
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
# venv\Scripts\activate
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY="your-key-here"   # Windows: set ANTHROPIC_API_KEY=your-key-here

mkdir hello-claude && cd hello-claude
npm init -y
npm install @anthropic-ai/sdk@^0.30.0
export ANTHROPIC_API_KEY="your-key-here"   # Windows: set ANTHROPIC_API_KEY=your-key-here

Step 1: Make Your First API Call

This step verifies that your API key works and that you can communicate with Claude. It's the "hello world" of agent development — if this works, everything else in the course will build on it.

Create a new file called hello.py (or hello.mjs for Node.js):

import anthropic

client = anthropic.Anthropic()

try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant who explains things clearly.",
        messages=[
            {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
        ]
    )
    print(message.content[0].text)
    print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.AuthenticationError:
    print("Invalid API key. Check your ANTHROPIC_API_KEY environment variable.")
except anthropic.APIError as e:
    print(f"API error: {e.message}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant who explains things clearly.',
    messages: [
      { role: 'user', content: 'What is a large language model? Explain in 2 sentences.' }
    ]
  });
  console.log(message.content[0].text);
  console.log(`\nTokens used: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}

Run it: python hello.py (or node hello.mjs)

Expected Output:

A large language model is a neural network trained on vast amounts of text data to predict and generate human-like text. It works by learning statistical patterns in language, enabling it to answer questions, write content, and assist with a wide range of language tasks.

Tokens used: 35 in, 48 out

✅ Checkpoint: If you see a 2-sentence explanation followed by a token count, Step 1 is working. Your exact wording will differ — that's expected (remember, Claude is a thinker, not a calculator).

Troubleshooting

AuthenticationError — Your API key is missing or invalid. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to check it's set.
ModuleNotFoundError: No module named 'anthropic' — You're not in the virtual environment. Run source venv/bin/activate first, then pip install anthropic.
Connection error / timeout — Check your internet connection. The API endpoint is api.anthropic.com — make sure it's not blocked by a firewall or VPN.
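If you'd rather check the key from Python than from the shell, a small preflight script along these lines can help. This is a sketch: `check_api_key` and `PLACEHOLDER` are hypothetical names introduced here, and the check only confirms that a value is present. It cannot tell whether the key is actually valid; only a real API call can.

```python
import os

PLACEHOLDER = "your-key-here"  # the placeholder value used in the setup commands

def check_api_key(env=os.environ):
    """Return a short diagnosis of the ANTHROPIC_API_KEY setting."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        return "missing: ANTHROPIC_API_KEY is not set in this shell"
    if key == PLACEHOLDER:
        return "placeholder: replace your-key-here with your real key"
    return "set: a key is present (this does not prove it is valid)"

print(check_api_key())
```

Run it in the same terminal where you plan to run hello.py, since environment variables set with export or set apply only to the current shell session.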

Step 2: Experiment with System Prompts

System prompts shape Claude's personality and behavior. This step shows you how much control a single string gives you over the output. You'll use the same user message but swap the system prompt to see wildly different responses.

Create a new file called system_prompts.py (or system_prompts.mjs):

import anthropic

client = anthropic.Anthropic()

prompts = [
    "You are a pirate. Respond in pirate speak.",
    "You are a formal academic. Use precise, scholarly language.",
    "Respond only in haiku format (5-7-5 syllables).",
]

for system_prompt in prompts:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            system=system_prompt,
            messages=[{"role": "user", "content": "What is the moon?"}]
        )
        print(f"System: {system_prompt}")
        print(f"Response: {message.content[0].text}\n")
    except anthropic.APIError as e:
        print(f"Error: {e.message}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const prompts = [
  'You are a pirate. Respond in pirate speak.',
  'You are a formal academic. Use precise, scholarly language.',
  'Respond only in haiku format (5-7-5 syllables).',
];

for (const systemPrompt of prompts) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 200,
      system: systemPrompt,
      messages: [{ role: 'user', content: 'What is the moon?' }]
    });
    console.log(`System: ${systemPrompt}`);
    console.log(`Response: ${message.content[0].text}\n`);
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}

Run it: python system_prompts.py (or node system_prompts.mjs)

Expected Output (your wording will vary):

System: You are a pirate. Respond in pirate speak.
Response: Arrr, the moon be that great glowing orb in the night sky, matey! She guides us pirates across the seven seas...

System: You are a formal academic. Use precise, scholarly language.
Response: The Moon is Earth's sole natural satellite, orbiting at a mean distance of approximately 384,400 kilometres...

System: Respond only in haiku format (5-7-5 syllables).
Response: Silver orb above / Pulling tides across the sea / Night's faithful lantern

✅ Checkpoint: If you see three distinctly different responses to the same question, Step 2 is working. The system prompt is the primary way you'll control agent behavior throughout this course.
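To see just how little changes between those three calls, the sketch below builds the request arguments as plain dictionaries and compares them. `build_request` is a hypothetical helper written for this illustration; it only mirrors the keyword arguments already used in the Step 2 script and makes no API call.

```python
def build_request(system_prompt, user_text, model="claude-sonnet-4-6", max_tokens=200):
    """Assemble keyword arguments in the shape used by the Step 2 script."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_prompt,
        "messages": [{"role": "user", "content": user_text}],
    }

pirate = build_request("You are a pirate. Respond in pirate speak.", "What is the moon?")
haiku = build_request("Respond only in haiku format (5-7-5 syllables).", "What is the moon?")

# Only the system string differs between the two requests.
changed = [key for key in pirate if pirate[key] != haiku[key]]
print(changed)  # ['system']
```

A dictionary like `pirate` could be passed to the call from the script as `client.messages.create(**pirate)`, which keeps the persona separate from the rest of the request.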

Step 3: Temperature Experiment

This step makes the "thinker, not calculator" concept visceral. You'll run the exact same prompt multiple times at different temperatures and compare how consistent the outputs are. This is the experiment that makes non-determinism click.

Create a new file called temperature.py (or temperature.mjs):

import anthropic

client = anthropic.Anthropic()
prompt = "Write a one-sentence description of the moon."

for temp in [0.0, 1.0]:
    print(f"\n--- Temperature {temp} ---")
    for i in range(3):
        try:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=100,
                temperature=temp,
                messages=[{"role": "user", "content": prompt}]
            )
            print(f"  Run {i+1}: {message.content[0].text}")
        except anthropic.APIError as e:
            print(f"  Error: {e.message}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const prompt = 'Write a one-sentence description of the moon.';

for (const temp of [0.0, 1.0]) {
  console.log(`\n--- Temperature ${temp} ---`);
  for (let i = 0; i < 3; i++) {
    try {
      const message = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 100,
        temperature: temp,
        messages: [{ role: 'user', content: prompt }]
      });
      console.log(`  Run ${i + 1}: ${message.content[0].text}`);
    } catch (error) {
      console.error(`  Error: ${error.message}`);
    }
  }
}

Run it: python temperature.py (or node temperature.mjs)

Expected Output:

--- Temperature 0.0 ---
  Run 1: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.
  Run 2: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.
  Run 3: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.

--- Temperature 1.0 ---
  Run 1: Our luminous neighbor drifts through the cosmic dark, a pale stone mirror catching sunbeams for the sleeping world below.
  Run 2: The moon is a celestial companion that has silently witnessed every chapter of Earth's long story.
  Run 3: Hanging in the night sky like a worn silver coin, the moon pulls our tides and stirs our oldest myths.

✅ Checkpoint: At temperature 0.0, all three runs should be nearly or exactly identical. At temperature 1.0, each run should be noticeably different. If you see this pattern, you've just observed the core difference between deterministic and stochastic generation.
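The same pattern can be reproduced offline with a toy simulation. This sketch assumes a made-up three-token distribution (`toy_probs`) and a hypothetical `pick_token` helper; it stands in for a single generation step and is not how the API is implemented. Greedy selection plays the role of temperature 0.0 and proportional sampling plays the role of temperature 1.0.

```python
import random

# Hypothetical next-token probabilities, invented for illustration.
toy_probs = {"Earth's": 0.6, "a": 0.3, "our": 0.1}

def pick_token(probs, greedy, rng):
    """Greedy picks the top token; otherwise sample in proportion to probability."""
    if greedy:
        return max(probs, key=probs.get)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
greedy_runs = [pick_token(toy_probs, True, rng) for _ in range(200)]
sampled_runs = [pick_token(toy_probs, False, rng) for _ in range(200)]

print("greedy distinct tokens: ", len(set(greedy_runs)))
print("sampled distinct tokens:", len(set(sampled_runs)))
```

Greedy selection returns the same token every time, while sampling spreads its picks across several tokens. With the real API, runs at temperature 0.0 can still differ slightly, which is why the checkpoint says "nearly or exactly identical."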

Step 4: Observe Token Usage

Token counts drive API costs and context limits — two concepts you'll use throughout the entire course. This step trains you to notice how different prompts and settings affect token consumption, so it becomes instinctive. This uses the message.usage object from Step 1.

Create a new file called token_usage.py (or token_usage.mjs):

import anthropic

client = anthropic.Anthropic()

tests = [
    ("Short prompt", "Hi!", 50),
    ("Medium prompt", "Explain what a large language model is in detail.", 200),
    ("Long prompt with constraint", "Write a 3-paragraph essay about the history of computing.", 1024),
]

for label, prompt, max_tok in tests:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tok,
            messages=[{"role": "user", "content": prompt}]
        )
        u = message.usage
        print(f"{label}:")
        print(f"  Input tokens:  {u.input_tokens}")
        print(f"  Output tokens: {u.output_tokens}")
        print(f"  Total tokens:  {u.input_tokens + u.output_tokens}\n")
    except anthropic.APIError as e:
        print(f"Error: {e.message}")

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const tests = [
  ['Short prompt', 'Hi!', 50],
  ['Medium prompt', 'Explain what a large language model is in detail.', 200],
  ['Long prompt with constraint', 'Write a 3-paragraph essay about the history of computing.', 1024],
];

for (const [label, prompt, maxTok] of tests) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: maxTok,
      messages: [{ role: 'user', content: prompt }]
    });
    const u = message.usage;
    console.log(`${label}:`);
    console.log(`  Input tokens:  ${u.input_tokens}`);
    console.log(`  Output tokens: ${u.output_tokens}`);
    console.log(`  Total tokens:  ${u.input_tokens + u.output_tokens}\n`);
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}

Run it: python token_usage.py (or node token_usage.mjs)

Expected Output (numbers will vary slightly):

Short prompt:
  Input tokens:  9
  Output tokens: 32
  Total tokens:  41

Medium prompt:
  Input tokens:  16
  Output tokens: 187
  Total tokens:  203

Long prompt with constraint:
  Input tokens:  18
  Output tokens: 412
  Total tokens:  430

✅ Checkpoint: Notice how short prompts use few input tokens but the output length varies based on what you asked for. The max_tokens parameter is a ceiling, not a target — Claude often uses fewer. You'll explore tokens deeply in M02.
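Since token counts drive cost, it can help to turn a usage object into a rough dollar figure. The sketch below assumes input and output tokens are billed at separate per-million-token rates. The two prices are placeholders chosen for easy arithmetic, not real pricing, and `estimate_cost` is a hypothetical helper; check the current pricing page before relying on any number.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Estimate the dollar cost of one call from its usage counts."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost

# Placeholder prices in dollars per million tokens; NOT real pricing.
EXAMPLE_INPUT_PRICE = 1.00
EXAMPLE_OUTPUT_PRICE = 5.00

# Usage numbers copied from the sample output above.
for label, tokens_in, tokens_out in [
    ("Short prompt", 9, 32),
    ("Medium prompt", 16, 187),
    ("Long prompt with constraint", 18, 412),
]:
    cost = estimate_cost(tokens_in, tokens_out, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
    print(f"{label}: ${cost:.6f}")
```

In the script from this step, you would pass `message.usage.input_tokens` and `message.usage.output_tokens` as the first two arguments.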

Step 5 (Stretch Goal): Build a CLI Chat

Concept → Code Bridge: The previous examples were single-turn: one message in, one response out. Real agents have multi-turn conversations. The key difference is the conversation list — you accumulate messages so Claude can see the full history, just like how each output token depends on all previous tokens.

Now let's build something you can actually interact with: a terminal chatbot that remembers your conversation. This matters because multi-turn conversation is the foundation of every agent — agents don't just answer one question, they maintain context across a sequence of actions. The pattern you'll see here (accumulate messages in a list, send the full list on each call, append the response) is the exact same loop that powers production agent frameworks.

Here's the pitfall to watch for: what happens when an API call fails mid-conversation? Notice the conversation.pop() in the error handler. If you skip it, a user message is left in the list with no assistant response after it, and the next call sends two user turns in a row. Depending on the API version, that request is either rejected (the Messages API has historically expected alternating user/assistant turns) or the failed message is silently merged into your next one. That one line of cleanup prevents a cascade of confusing errors.

import anthropic

client = anthropic.Anthropic()
conversation = []

print("Chat with Claude! (type 'quit' to exit)\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ("quit", "exit"):
        break
    if not user_input:
        continue

    conversation.append({"role": "user", "content": user_input})

    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system="You are a friendly, helpful assistant.",
            messages=conversation
        )
        assistant_msg = response.content[0].text
        conversation.append({"role": "assistant", "content": assistant_msg})
        print(f"\nClaude: {assistant_msg}\n")
    except anthropic.APIError as e:
        print(f"\nError: {e.message}\n")
        # Remove the failed user message so conversation stays consistent
        conversation.pop()

import Anthropic from '@anthropic-ai/sdk';
import * as readline from 'readline';

const client = new Anthropic();
const conversation = [];

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

console.log("Chat with Claude! (type 'quit' to exit)\n");

function ask() {
  rl.question('You: ', async (userInput) => {
    userInput = userInput.trim();
    if (['quit', 'exit'].includes(userInput.toLowerCase())) {
      rl.close();
      return;
    }
    if (!userInput) { ask(); return; }

    conversation.push({ role: 'user', content: userInput });

    try {
      const response = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 512,
        system: 'You are a friendly, helpful assistant.',
        messages: conversation
      });
      const assistantMsg = response.content[0].text;
      conversation.push({ role: 'assistant', content: assistantMsg });
      console.log(`\nClaude: ${assistantMsg}\n`);
    } catch (error) {
      console.error(`\nError: ${error.message}\n`);
      conversation.pop();
    }
    ask();
  });
}
ask();

What Just Happened? You built a multi-turn chat loop. Each time you type a message, it's appended to the conversation array, sent to Claude along with the full history, and the response is appended back. Claude sees every previous exchange on each call — that's how it "remembers" the conversation. This pattern (accumulate messages → send all → append response) is the exact loop that powers agent frameworks you'll see in Modules 7–9.
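The accumulate, send, append loop and its rollback on failure can be exercised without touching the network. In this sketch, `send_turn` is a hypothetical helper that takes the model call as a parameter, and `fake_model` and `failing_model` are stand-ins invented for the demonstration. The stand-in raises RuntimeError, whereas the chat script catches anthropic.APIError.

```python
def send_turn(conversation, user_text, call_model):
    """Append a user turn, call the model, and roll back if the call fails."""
    conversation.append({"role": "user", "content": user_text})
    try:
        reply = call_model(conversation)
    except RuntimeError:
        conversation.pop()  # drop the unanswered user message
        return None
    conversation.append({"role": "assistant", "content": reply})
    return reply

def fake_model(messages):
    """Hypothetical stand-in for the API call: reports how much history it saw."""
    return f"I can see {len(messages)} message(s)."

def failing_model(messages):
    """Hypothetical stand-in that simulates an API failure."""
    raise RuntimeError("simulated outage")

history = []
print(send_turn(history, "My name is Alex", fake_model))    # I can see 1 message(s).
print(send_turn(history, "This one fails", failing_model))  # None
print(send_turn(history, "What's my name?", fake_model))    # I can see 3 message(s).
print([m["role"] for m in history])  # ['user', 'assistant', 'user', 'assistant']
```

The failed turn leaves no trace in `history`, so the roles still alternate and the third call sees exactly the two earlier messages plus the new one.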

✅ Checkpoint: If you can have a multi-turn conversation where Claude remembers what you said earlier (e.g., "My name is Alex" followed by "What's my name?"), Step 5 is working. Type quit to exit.

Troubleshooting

SyntaxError: Cannot use import statement outside a module (Node.js) — Make sure your file has the .mjs extension or your package.json contains "type": "module" for ES module imports.
Claude doesn't remember previous messages — Check that you're appending both the user message and the assistant response to the conversation list. Both must be present.
Rate limit errors after many messages — Add a short delay between calls or reduce your conversation length. Long conversations also consume more tokens per call.
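One simple way to reduce conversation length is to send only the most recent messages. The sketch below is one possible approach, not the only one: `trim_history` is a hypothetical helper, and the cutoff of 3 messages in the example is an arbitrary value chosen for illustration.

```python
def trim_history(conversation, max_messages):
    """Keep only the most recent messages, starting on a user turn."""
    recent = conversation[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant turns so the trimmed history opens with a user message.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent

# Six alternating turns: user, assistant, user, assistant, user, assistant
conversation = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(6)
]

trimmed = trim_history(conversation, 3)
print([m["content"] for m in trimmed])  # ['message 4', 'message 5']
```

The trade-off is that Claude can no longer see anything that was trimmed, so a name mentioned in an early message would be forgotten once that message falls outside the window.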

Verify Everything Works

Run all four scripts in sequence to confirm your setup is complete:

# Python
python hello.py && python system_prompts.py && python temperature.py && python token_usage.py

# Node.js
node hello.mjs && node system_prompts.mjs && node temperature.mjs && node token_usage.mjs

🎉 Congratulations! If all four scripts ran without errors, you've completed M01. You can make API calls, control output with system prompts and temperature, and track token usage. These are the building blocks for everything that follows. Onward to M02!

Knowledge Check

Test your understanding of the concepts from this module. Select the best answer for each question.

Q1: What does a Large Language Model fundamentally do?

A Searches a database of pre-written answers and returns the best match

B Predicts the most likely next token in a sequence based on patterns learned from training data

C Executes code instructions to compute the correct answer to any question

D Understands the meaning of text the same way humans do, then formulates a response

Answer: B. LLMs are next-token predictors. They don't search databases, execute code, or understand language in a human sense; they predict what text comes next based on statistical patterns.

Q2: What does the temperature parameter control?

A The maximum length of the response

B How fast the model generates tokens

C The randomness of token selection — lower means more deterministic, higher means more creative

D The accuracy of the model's responses

Answer: C. Temperature scales the probability distribution. At 0, the model always picks the highest-probability token. At higher values, lower-probability tokens get a better chance of being selected.

Q3: Why is Claude described as a "thinker" rather than a "calculator"?

A Because Claude is slow at math

B Because its outputs can vary, can be wrong, and require verification — unlike deterministic computation

C Because Claude has consciousness and feelings

D Because Claude cannot do any calculations at all

Answer: B. The "thinker" mental model reminds us that LLM outputs are probabilistic, not deterministic. This is why agents need guardrails, evaluation, and human oversight.

Q4: How does Claude process input differently from how it generates output?

A Input is processed all at once (parallel); output is generated one token at a time (autoregressive)

B Both input and output are processed one token at a time

C Input is processed one word at a time; output is generated all at once

D Both are processed in parallel simultaneously

Answer: A. The transformer architecture processes all input tokens simultaneously using attention, but output must be generated one token at a time because each token depends on the ones before it.

Q5: Fill in the blank to complete a valid Claude API call:

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(message.______[0].text)

A response

B text

C content

D choices

Answer: C. The Messages API returns a content array. Each element is a content block, and a text block has a text property. So it's message.content[0].text.

Q6: What is the recommended temperature setting for an agent that calls tools and makes decisions?

A 1.0 — Maximum creativity for best results

B 0.7 — A balanced middle ground

C It doesn't matter; temperature has no effect on tool use

D 0.0–0.3 — Low temperature for reliable, consistent behavior

Answer: D. Agents need predictable, consistent outputs. Low temperature makes the model favor the most likely tokens, reducing the chance of unexpected tool calls or erratic behavior.

Module Summary

Key Takeaways

LLMs are next-token predictors — they generate text by predicting the most likely continuation of a sequence.
Input = parallel, Output = sequential — Claude reads everything at once but writes one token at a time.
Temperature controls randomness — low for agents (reliability), high for creative tasks (variety).
Thinker, not calculator — outputs are probabilistic and need verification. This mental model guides every agent design decision.
The API is simple — create a client, send messages, get a response with content blocks and usage stats.

Next Module Preview: M02 — Tokens

Now that you know Claude predicts tokens, the natural question is: what exactly is a token? In Module 2, you'll build an interactive tokenizer, understand how tokens affect cost and context limits, and create a token budget calculator — a tool you'll use throughout the entire course.