M01: The LLM Mental Model

Understand what a Large Language Model really is, how Claude processes your text, and why the right mental model changes everything you'll build in this course.

Learning Objectives

  • Explain what a Large Language Model is and how it generates text one token at a time
  • Describe the difference between how Claude reads input (all at once) and writes output (one token at a time)
  • Use temperature, top-p, and top-k controls and predict how they change output
  • Make your first Claude API call using Python and Node.js
  • Adopt the "thinker, not calculator" mental model for working with LLMs

What Is a Large Language Model?

Everyday Analogy

BEFORE: Before LLMs, if you wanted a computer to answer a question, someone had to explicitly program every possible answer — think of old-school chatbots with giant lists of if/then rules, or search engines that could only find pages containing your exact keywords.

PAIN: That approach broke down constantly. Ask the chatbot something the programmer didn't anticipate, and you'd get "I don't understand." Ask a search engine a nuanced question, and you'd sift through ten blue links hoping one had your answer.

MAPPING: An LLM like Claude is the world's most well-read autocomplete. Your phone's autocomplete has read your messages; Claude has read billions of documents — books, code, conversations, scientific papers — and uses all of that to predict what comes next. Instead of following hand-written rules, it learned patterns from that mountain of text, so it can handle questions and tasks nobody explicitly programmed it for. It's autocomplete that went to every university, read every manual, and practiced every writing style.

What this actually looks like: When you type "The capital of France is", the model doesn't look up "France" in a table. Instead, it computes a probability for every possible next token. Here's a simplified version of what that prediction looks like internally:

Simplified next-token probabilities
Input: "The capital of France is"

Next token predictions:
  " Paris"    → 0.92  (92% probability)
  " the"      → 0.03  (3%)
  " located"  → 0.02  (2%)
  " a"        → 0.01  (1%)
  " Lyon"     → 0.005 (0.5%)
  ... thousands more tokens with tiny probabilities

Technical Definition: A Large Language Model (LLM) is a neural network trained on vast amounts of text data. Let's unpack that piece by piece.

  • Large Language Model (LLM): A neural network with billions of parameters, trained on massive text datasets to predict the next token in a sequence. "Large" refers to both the training data and the number of parameters.
  • Neural network: A computing system inspired by biological neurons. It learns patterns by adjusting millions or billions of numerical weights during training.

First, "neural network" means a mathematical system that learns by example rather than by following hand-written rules. You show it billions of text samples, and it gradually adjusts billions of internal numbers (called parameters) until it gets good at one specific job.

That one job is: given a sequence of tokens (small chunks of text, roughly words or word-pieces), predict the most likely next token. That's it. Every impressive thing an LLM does (writing code, answering questions, translating languages) is a side effect of getting extremely good at next-token prediction.

  • Token: The smallest unit of text that an LLM works with. A token can be a word, part of a word, or a punctuation mark. We'll explore tokens deeply in Module 2.

Now for the "Large" part. Frontier models such as Claude have billions of parameters (Anthropic does not publish exact counts) and are trained on enormous amounts of text. They don't "understand" language the way humans do; they find statistical patterns at a scale that produces remarkably useful results. This matters for you as a builder because the model's power and its failure modes both come from this prediction mechanism.
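
To make the predict-append-repeat idea concrete, here is a minimal sketch of greedy next-token generation. The lookup table `TOY_MODEL` and its probabilities are made up for illustration (a real model computes these distributions with a neural network), and `generate` is a hypothetical helper, not part of any SDK.

```python
# Toy "model": maps the text so far to a made-up distribution over next tokens.
# A real LLM computes this distribution with billions of parameters.
TOY_MODEL = {
    "The capital of France is": {" Paris": 0.92, " the": 0.03, " located": 0.02},
    "The capital of France is Paris": {".": 0.97, ",": 0.03},
    "The capital of France is Paris.": {"<end>": 1.0},
}


def generate(prompt: str, max_tokens: int = 10) -> str:
    """Greedy autoregressive decoding: pick the top token, append it, repeat."""
    text = prompt
    for _ in range(max_tokens):
        # Unknown contexts simply end generation in this toy example.
        distribution = TOY_MODEL.get(text, {"<end>": 1.0})
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break
        text += next_token  # the output feeds back in as part of the next input
    return text


print(generate("The capital of France is"))
# prints: The capital of France is Paris.
```

Notice that nothing in the loop "looks up" France; the answer appears only because " Paris" carries the highest probability at that step.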

Diagram: How an LLM Generates Text
  Step 1: Input tokens ("The capital of")
  Step 2: Transformer layers (attention + feed-forward)
  Step 3: Probability distribution (" Paris" 92%, " the" 3%, " located" 2%)
  Step 4: Output token (" Paris"), which feeds back as the next input (autoregressive loop)
  Repeat until stop.

Why It Matters: Understanding that LLMs predict tokens, rather than "think" about answers, is the foundation for everything in this course. Here's an illustrative example: a production agent processing 10,000 customer support tickets per day with no tool access or guardrails will still state some incorrect facts, even at temperature 0.0. If the hallucination rate were 2–5% (an illustrative figure, not a measurement), that would mean 200–500 tickets per day with incorrect information. When your agent gives a wrong answer, it's because token prediction chose a plausible-but-incorrect continuation, not because it's "confused." This distinction shapes how you write prompts, design tools, and build guardrails, and it's why Modules 5 (Tool Use) and 16–17 (Guardrails) exist in this course.

How Claude Processes Text

Everyday Analogy

BEFORE: Before transformer-based models, older AI systems (like recurrent neural networks) had to read text word-by-word in order, like a person reading a sentence while covering up the words ahead of them. This made them slow and forgetful — by the time they reached the end of a long paragraph, they'd already started "forgetting" the beginning.

PAIN: That sequential reading created a bottleneck: longer inputs meant worse comprehension, because the model couldn't hold everything in mind at once.

MAPPING: Claude works like a speed reader who absorbs an entire page in one glance — every word attended to simultaneously, related to every other word. Then it writes its response one word at a time, each word influenced by everything it read plus everything it has written so far. The reading is instant and parallel; the writing is careful and sequential.

What this actually looks like: Here's an illustrative timeline showing the timing asymmetry. Notice how the input (your prompt) is processed almost instantly, but the output tokens trickle out one by one:

API timing example
Request: 850 tokens of input
Response: 120 tokens of output

Timeline:
  0ms    → Request sent
  180ms  → First output token arrives (all 850 input tokens processed)
  180ms  → "The"
  210ms  → " best"
  240ms  → " approach"
  ...      (each token ~30ms apart)
  3780ms → Final token generated

Input processing:  180ms  (850 tokens, all at once)
Output generation: 3600ms (120 tokens, one at a time)

Technical Definition: Claude processes input using parallel attention through the transformer architecture. Here's what that means in plain English: every token in your prompt is compared against every other token at the same time. The model doesn't read left-to-right; it sees the whole message at once.

  • Transformer architecture: The neural network design behind all modern LLMs. It uses an "attention mechanism" to relate every word to every other word in the input. Introduced in the 2017 paper "Attention Is All You Need."

This is why, for short prompts, sending 1,000 tokens to Claude is nearly as fast as sending 100: the input processing step happens in parallel. (Very long prompts do add noticeable time before the first token, as the next section explains.)

Output generation, however, works completely differently. It's autoregressive, meaning "each step feeds into the next." Each new token is predicted based on all input tokens plus all previously generated output tokens. This is why output is the slow part: each token must wait for the previous one to be generated first.

  • Autoregressive: A process where each output depends on all previous outputs. Claude generates token 5 by looking at tokens 1-4 plus the entire input. This is why generation is slower than reading.

Animation: Input Processing vs. Output Generation
  📄 Your Message → Tokenize → 🧠 Attention (all at once) → Generate (one by one) → 💬 Response

How Inference Actually Works

Everyday Analogy

BEFORE: You probably picture an LLM as a function: question goes in, full answer comes out. That mental model is wrong in a specific way that matters once you start optimizing latency and cost.

PAIN: Without the right picture, you can’t reason about why the first token takes 800 ms but the next 200 tokens stream out at 60 ms each. You can’t explain why a 50K-token prompt is expensive even before Claude writes a single word back. You can’t budget for production.

MAPPING: Inference is autoregressive — Claude generates output one token at a time, and each new token is conditioned on every token before it (yours and its own). Picture someone typing a long reply on a phone: they read what they’ve typed so far, pick the most likely next letter, append it, re-read, pick the next letter, append, and so on. That’s inference. There’s no “full answer” sitting in the model waiting to be unwrapped — the answer is constructed token-by-token, in real time, in the same call.

Technical Definition

Inference is what happens when Claude is running, not learning. Every call to the API kicks off two distinct phases:

  • Inference: The process of running a trained LLM to produce output. As opposed to training (which adjusts the model’s weights), inference uses the frozen weights to predict tokens. The cost you pay per API call is inference cost.

1. Prefill (a.k.a. the “forward pass” over your prompt). Your prompt (system + messages + tool definitions) is tokenized (M02), converted to embeddings, and pushed through every transformer layer in parallel. The model computes attention across all input tokens at once, which is fast per-token but heavy: cost is roughly O(N²) in prompt length. The output of prefill is one set of logits (raw scores over the entire vocabulary, which the sampling step turns into a probability distribution) for the next token to generate.

  • Embeddings: High-dimensional vectors (4096 numbers, in many models) that represent a token in a meaning-space. Tokens that are semantically related land near each other in that space. We’ll dig into embeddings in M09.

2. Decode (a.k.a. token-by-token generation). The model samples one token from the prefill logits, appends it to the running sequence, and runs just that new token through the transformer, reusing cached attention values (the KV cache) for everything before it. That produces the next set of logits. Sample, append, repeat, until the model emits a stop token or hits max_tokens. Decode is sequential by construction: token N can’t start until token N−1 exists.

  • KV cache: Key/Value cache. During decode, the attention computations for already-processed tokens are kept in memory so each new token only needs to compute its own attention against the cache, not re-run the whole prompt. This is what makes streaming cheap per-token after the first one.

Two phases, two costs. Prefill latency is paid once per request and dominates “time to first token.” Decode latency is paid per output token and dominates “tokens per second.” This split is why a 50K-token prompt with a 100-token answer feels slow to start but finishes quickly — and why a 200-token prompt with a 4000-token answer feels snappy at first but takes forever.

Prefill (once) → Decode (token-by-token, once per output token)
1. PREFILL (runs once)
  1. Tokenize prompt (N tokens)
  2. Embed all N tokens
  3. Run through every transformer layer in parallel
  4. Build the KV cache for every token
  5. Output: logits for token N+1

Cost: ~O(N²) compute, but parallelizable. Latency: drives time-to-first-token.

2. DECODE (runs once per output token, not N times: N is the prompt length)
  1. Sample one token from current logits
  2. Append it to the running sequence
  3. Run just the new token through layers, reading KV cache
  4. Get logits for the next token
  5. Stop if token == <end> or budget hit; else go to step 1

Cost: one token of compute per step, though each step still reads the whole KV cache, so per-token work grows with sequence length (memory-bound). Latency: drives tokens-per-second.
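
The saving from the KV cache can be sketched with a toy single-head attention in plain Python. Everything here is an assumption for illustration: `project` stands in for learned key/value projections, the 2-dimensional token vectors are made up, and every token is treated as a decode step (a real system would fill the cache for the prompt during prefill). The point is only that caching produces the same output with far fewer projections.

```python
import math


def project(token_vec, scale):
    """Stand-in for a learned key/value projection (here it just scales the vector)."""
    return [scale * x for x in token_vec]


def attend(query, keys, values):
    """Softmax-weighted average of values, weighted by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole sequence at every step."""
    projections, out = 0, None
    for end in range(1, len(tokens) + 1):
        keys = [project(t, 0.5) for t in tokens[:end]]
        values = [project(t, 2.0) for t in tokens[:end]]
        projections += 2 * end
        out = attend(tokens[end - 1], keys, values)
    return out, projections


def decode_with_cache(tokens):
    """Compute each token's key and value once and keep them in the cache."""
    keys, values = [], []
    projections, out = 0, None
    for t in tokens:
        keys.append(project(t, 0.5))
        values.append(project(t, 2.0))
        projections += 2
        out = attend(t, keys, values)
    return out, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full_out, full_cost = decode_without_cache(tokens)
cached_out, cached_cost = decode_with_cache(tokens)
print(f"projections without cache: {full_cost}, with cache: {cached_cost}")
# prints: projections without cache: 20, with cache: 8
```

Even with the cache, `attend` still loops over every cached key, which is why per-token decode work is not strictly constant as the sequence grows.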

Sampling — What “Logits” Become a Token

The transformer doesn’t pick a token; it outputs logits, a raw score per token in the vocabulary (on the order of 100K entries or more in modern models; Anthropic does not publish Claude’s exact vocabulary size). To turn logits into a chosen token, three steps run:

  1. Softmax with temperature. Divide every logit by the temperature T, then apply softmax to get a probability distribution. As T approaches 0 the distribution collapses to one-hot (always pick the argmax; T = 0 itself is handled as a special case, since you can’t divide by zero); T = 1 keeps the model’s native distribution; T > 1 flattens it (more surprise).
  2. Top-k / Top-p filter. Optionally truncate the distribution to either the top-k highest-probability tokens, or the smallest set whose cumulative probability exceeds p (nucleus sampling). This blocks the long tail of nonsense tokens.
  3. Sample. Draw one token from the (renormalized) filtered distribution. This single token becomes the “next token,” gets appended, and the loop continues.
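
Steps 1 and 3 above can be sketched in a few lines of plain Python. This is a minimal illustration, not Claude's actual implementation: the logits and the four-token vocabulary are made up, the function names are hypothetical, and the optional filter from step 2 is left out to keep the sketch short.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then softmax. Requires T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy decoding: no randomness, always the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


vocab = [" Paris", " the", " located", " Lyon"]
logits = [5.0, 1.6, 1.2, -0.2]  # hypothetical raw scores

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)  # seeded so repeated runs of this script match
print(vocab[sample_token(logits, 0, rng)])    # always " Paris"
print(vocab[sample_token(logits, 1.0, rng)])  # usually " Paris", but not guaranteed
```

Running the loop shows the top token's share shrinking as T rises, which is the flattening described in step 1.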

Important consequence: the model’s output is non-deterministic at any temperature above 0 because of the sampling step. Set temperature=0 if you need (near-)reproducibility; that picks the argmax with no sampling. The Temperature, Top-p & Top-k section below goes deeper on when to use each.

Latency Anatomy — Where Your Seconds Go

Real numbers, on Claude Sonnet 4.6 (typical 2026 production load, single-region):

Phase            What dominates                              Order of magnitude
Network          TLS, request routing, region distance       30–150 ms
Prefill (TTFT)   Prompt length; quadratic-ish in N           ~50 ms / 1K tokens (uncached); <5 ms / 1K cached
Decode           Output length; linear in tokens generated   ~50–100 tokens/s (Sonnet); higher with speculative decoding
Server queue     Concurrent traffic, rate-limit tier         0–500 ms p99

Practical Consequences
  • Time-to-first-token (TTFT) is set almost entirely by prompt length and whether the prefix is cached. Prompt caching (M22) turns “5 seconds of prefill” into “200 ms of prefill” on repeat-heavy prompts.
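  • Tokens-per-second (TPS) is set by the decode step. Output length is the cost driver here: generating twice as many tokens takes roughly twice the decode time, regardless of how long the prompt was. (max_tokens is only a ceiling; raising it costs nothing unless the model actually writes more.)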
  • Streaming (next subsection) doesn’t change total wall-clock latency; it just delivers tokens as they’re produced so users see something happening at TTFT instead of waiting for the full decode.
  • Extended thinking (M22) and reasoning models add a hidden third phase — thinking tokens are decoded before any visible response. We’ll connect the dots in M22.
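
A back-of-the-envelope estimator makes these consequences easy to play with. This is a rough sketch under stated assumptions: the default figures are taken from the order-of-magnitude table above, prefill is modeled as linear in prompt length (ignoring the quadratic term and caching), and `estimate_latency_ms` is a hypothetical helper, so treat the results as ballpark only.

```python
def estimate_latency_ms(
    prompt_tokens,
    output_tokens,
    *,
    network_ms=100.0,           # assumed round-trip overhead
    prefill_ms_per_1k=50.0,     # assumed uncached prefill cost per 1K prompt tokens
    decode_tokens_per_s=75.0,   # assumed decode speed
):
    """Estimate time-to-first-token and total latency for one request."""
    prefill_ms = prompt_tokens / 1000 * prefill_ms_per_1k
    decode_ms = output_tokens / decode_tokens_per_s * 1000
    ttft_ms = network_ms + prefill_ms
    return {"ttft_ms": ttft_ms, "total_ms": ttft_ms + decode_ms}


# Long prompt, short answer: slow to start, quick to finish.
long_prompt = estimate_latency_ms(50_000, 100)
# Short prompt, long answer: starts fast, takes a long time overall.
long_answer = estimate_latency_ms(200, 4_000)

for name, result in (("50K in / 100 out", long_prompt), ("200 in / 4000 out", long_answer)):
    print(f"{name}: TTFT ~{result['ttft_ms']:.0f} ms, total ~{result['total_ms'] / 1000:.1f} s")
```

Swapping in your own measured TTFT and decode rate (for example from the streaming script in the next subsection) gives a more trustworthy budget than these defaults.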

Streaming vs Batch — Same Inference, Different Delivery

One request, two ways to receive the answer:

  • Non-streaming — you wait for the full decode, then receive the entire response in one chunk. Simple; the agent code in M05–M07 uses this shape.
  • Streaming — the server sends each decoded token (or small group) as Server-Sent Events. Same total time, but the user sees the first tokens immediately. Essential for chat UIs and any agent loop where you want to surface progress before the response is complete.

python
from anthropic import Anthropic
import time

client = Anthropic()

# Streaming: each token (or small chunk) arrives as it’s decoded.
t0 = time.perf_counter()
first_token_t = None
total_tokens = 0

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain inference in one paragraph."}],
) as stream:
    for chunk in stream.text_stream:
        if first_token_t is None:
            first_token_t = time.perf_counter() - t0
            print(f"[TTFT: {first_token_t*1000:.0f} ms]")
        print(chunk, end="", flush=True)
        total_tokens += 1   # text chunks, not perfect token counts — close enough for ops

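elapsed = time.perf_counter() - t0
# Guard against an empty stream, where first_token_t would still be None.
decode_seconds = max(elapsed - (first_token_t or 0.0), 0.001)
print(f"\n[total: {elapsed:.2f}s, decode rate ~ {total_tokens / decode_seconds:.0f} chunks/s]")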

typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const t0 = performance.now();
let firstTokenMs: number | null = null;
let chunks = 0;

const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  messages: [{ role: "user", content: "Explain inference in one paragraph." }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    if (firstTokenMs === null) {
      firstTokenMs = performance.now() - t0;
      console.log(`[TTFT: ${firstTokenMs.toFixed(0)} ms]`);
    }
    process.stdout.write(event.delta.text);
    chunks++;
  }
}

const elapsed = performance.now() - t0;
console.log(`\n[total: ${(elapsed / 1000).toFixed(2)}s, decode rate ~ ${(chunks / Math.max((elapsed - (firstTokenMs ?? 0)) / 1000, 0.001)).toFixed(0)} chunks/s]`);
✔ What Just Happened?

You watched the two-phase model in real time. The TTFT print fires when the first decoded text arrives, which is just after prefill finishes (plus network time); the “decode rate” reports how fast subsequent chunks stream in. Run this against a short prompt and a 10K-token prompt back-to-back and you’ll see TTFT scale roughly with prompt length while decode rate stays roughly constant. That’s the prefill/decode split made visible.

Common Misconceptions

“Inference and training are the same thing.” — They’re not. Training updates billions of weights using gradient descent across millions of examples; inference reads those frozen weights to predict the next token. You only ever do inference when calling the API; training happened at Anthropic before the model shipped.

“Streaming is faster than non-streaming.” — Same total time. Streaming just shows tokens as they decode rather than buffering them. Use streaming for UX (perceived latency); use non-streaming when you need the full response before doing anything (parsing JSON, tool dispatch).

“Big prompts are slow because Claude has to read them all.” — Half-right. Prefill processes the prompt in parallel, but the work scales roughly with N² due to attention. Long prompts hurt TTFT, not decode rate. Prompt caching (M22) collapses that cost on repeat-heavy prompts.

“temperature=0 means deterministic.” Mostly true, with caveats. At 0 the sampling step becomes argmax (no randomness in the model). But tie-breaking, server-side batching, and floating-point non-determinism on GPUs can still produce different outputs across runs. For strict reproducibility, also pin the model snapshot and seed if the API supports it.

Now that you know how Claude turns logits into tokens, the next section explains why the temperature knob exists and when to turn it.

Temperature, Top-p & Top-k

Everyday Analogy

BEFORE: Without sampling controls, a language model would always pick the single highest-probability next word — like a restaurant that only ever serves the most popular dish, regardless of what you're in the mood for. Every sentence would sound the same, mechanical and repetitive.

PAIN: That's terrible for creative tasks (bland writing), but also bad for technical tasks where multiple phrasings are equally correct — the model would get stuck in ruts, always producing identical outputs.

MAPPING: Temperature is a creativity dial. At 0, Claude always picks the safest, most predictable next word — like a cautious writer sticking to cliches. At 1.0, Claude is willing to take risks and surprise you, choosing less-probable but more interesting words. Top-p and top-k are like narrowing the menu of options Claude considers before making a choice — top-p says "only consider words that make up the top 90% of the probability," and top-k says "only consider the top 50 most likely words."

What this actually looks like: Here's the same set of next-token probabilities at three different temperature settings. Watch how the distribution shifts:

Temperature effect on probabilities
Prompt: "The best way to learn programming is"

Temperature 0.0 (greedy: always pick the top word):
  " to"       → 100%   ← always chosen
  " by"       →   0%
  " through"  →   0%

Temperature 0.5 (moderate — top words dominate but others have a chance):
  " to"       → 58%
  " by"       → 22%
  " through"  → 12%
  " with"     →  5%
  " from"     →  3%

Temperature 1.0 (creative — spread across many options):
  " to"       → 30%
  " by"       → 22%
  " through"  → 15%
  " with"     → 10%
  " from"     →  6%
  " when"     →  4%
  ... more words now have a real chance

Technical Definition: Here's how these three controls work under the hood, step by step.

Step 1 (Temperature): The model produces logits: raw prediction scores, think of them as "confidence points" for every possible next word. Temperature divides all logits by the temperature value before they're converted to probabilities via softmax (a function that turns numbers into percentages that add up to 100%). A low temperature like 0.1 makes the top word's probability dominate (say, 95%). A high temperature like 1.0 keeps the distribution spread out (maybe 30%, 20%, 15%...).

  • Logits: The raw, unnormalized scores the model assigns to every possible next token before converting them into probabilities. Higher logits = higher probability.
  • Temperature: A number (0 to 1) that scales the model's confidence scores before picking the next token. Lower = more deterministic, higher = more random/creative.
  • Softmax: A mathematical function that converts a list of numbers into probabilities (all positive, summing to 1). The bigger the input number, the bigger its share of the probability.

Step 2 (Top-p): Top-p (also called nucleus sampling) then trims the menu. It sorts words by probability and keeps only the smallest set whose probabilities add up to p. For example, 0.9 means "keep the top 90% of probability mass, discard the rest."

  • Top-p: Also called nucleus sampling. Instead of considering all possible next tokens, Claude only considers the smallest set whose combined probability exceeds p (e.g., 0.9 = top 90% of probability mass).

Step 3 (Top-k): Top-k is a simpler filter: it limits consideration to the k most likely tokens regardless of their probabilities. For example, top-k=50 means only the top 50 words can be chosen.

  • Top-k: Limits the model to only consider the k most likely next tokens. For example, top-k=50 means Claude only picks from its top 50 predictions, ignoring all others.

In practice, temperature is the one you'll adjust most often. Top-p and top-k are fine-tuning knobs for when you need precise control.
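
The two filters can be sketched as small functions over a probability table. This is an illustrative sketch only: the distribution is made up, the function names are hypothetical, and the nucleus cut-off here uses "reaches or exceeds p", which is one common convention rather than a statement about Claude's internals.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens, then renormalize so they sum to 1."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[token] for token in keep)
    return {token: probs[token] / total for token in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept[token] = probs[token]
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: value / total for token, value in kept.items()}


# Hypothetical next-token distribution (already temperature-scaled).
probs = {" to": 0.50, " by": 0.25, " through": 0.15, " with": 0.06, " from": 0.04}

print(list(top_k_filter(probs, 2)))    # [' to', ' by']
print(list(top_p_filter(probs, 0.8)))  # [' to', ' by', ' through']
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the model is unsure and fewer when one token dominates.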

Why It Matters: For agent work, you'll almost always want low temperature (0.0–0.3). Here's an illustrative scenario: imagine an agent that routes 5,000 customer emails per day to 8 different tool functions (refund, escalate, FAQ lookup, etc.). At temperature 0.0, the same email almost always triggers the same tool, so you get reproducible behavior that's easy to test and debug. At temperature 0.8, borderline emails can route to different tools on each run; if that affected 3–8% of traffic (an illustrative figure), you would face 150–400 unpredictable decisions per day, making your system unreliable and your logs a nightmare to audit. Save higher temperatures for creative writing tasks where variety is the goal, not a bug.

The "Calculator vs. Thinker" Mental Model

Everyday Analogy

BEFORE: Before LLMs, most software was deterministic — a calculator gives you the same answer every time: 2 + 2 = 4, always. Developers built their entire workflow around this certainty: write code, run tests, expect exact outputs.

PAIN: When teams first adopt LLMs, they instinctively treat the model like a calculator. They write a prompt, get a great answer, ship it — then are shocked when the same prompt gives a subtly different (or wrong) answer the next day. Bug reports pile up, tests fail intermittently, and trust erodes.

MAPPING: The fix is a mental model shift: Claude is a thinker, not a calculator. A thinker gives you their best answer, which can vary, can be wrong, and can surprise you with insight. Treat it like a very knowledgeable colleague who sometimes needs to be double-checked, not a database that returns exact facts. Once you internalize this, you'll naturally build verification steps, add guardrails, and design your agent for graceful handling of imperfect outputs.

What this actually looks like: Here's a real-world example of why the "thinker" model matters. Same prompt, two runs, both at temperature 0.0:

Non-deterministic behavior example
Prompt: "What is the population of Tokyo?"

Run 1: "Tokyo has a population of approximately 13.96 million
        people in the city proper as of 2023."

Run 2: "The population of Tokyo is about 14 million in the
        city proper, or roughly 37 million in the greater
        metropolitan area."

Both are reasonable. Neither is "wrong." But a calculator
would give you the exact same answer every time.
An agent that routes decisions based on this output
needs to handle BOTH variations gracefully.

This mental model is the single most important idea in this course. In plain English: a calculator always gives you 2 + 2 = 4. A thinker gives you their best reasoning, which is usually excellent but occasionally off. When you build an agent, you're building around a thinker — so you design for "usually right" rather than "always right."

How does this work internally? When Claude generates a response, it's making thousands of probabilistic choices (one per token). Each choice has some chance of going a different direction. Even at temperature 0.0, server-side implementation details like floating-point rounding and batch scheduling can cause tiny variations. The result is that outputs are highly consistent but not identical — and when the model is uncertain (borderline cases, ambiguous questions, math), those small variations can compound into meaningfully different answers.

How is this different from traditional software? In conventional programming, if a function returns the wrong result, it's a bug — you fix the code and it works. With LLMs, variation isn't a bug; it's a fundamental property of how the system works. This means your job as an agent builder shifts from "make it correct" to "make it reliably useful despite occasional imperfection." That's a completely different engineering discipline, and it's what the rest of this course teaches.

Here's how this mental model affects every decision you'll make:

  • Prompts (M03): You're giving instructions to a thinker, not writing code for a machine
  • Tool Use (M05): You give the thinker tools to compensate for what prediction can't do (real-time data, calculations, database lookups)
  • Guardrails (M16–M17): You build checks because thinkers can make mistakes
  • Evaluation (M18): You measure quality probabilistically, not as pass/fail
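
Measuring quality probabilistically can be as simple as running the same check many times and reporting a pass rate. The sketch below is an assumption-heavy illustration: `flaky_model` is a stand-in for a real LLM call (it is right about 90% of the time by construction), and `pass_rate` is a hypothetical helper rather than part of any evaluation library.

```python
import random


def flaky_model(prompt, rng):
    """Stand-in for an LLM call: returns the right answer about 90% of the time."""
    return "Paris" if rng.random() < 0.9 else "Lyon"


def pass_rate(model, prompt, expected, runs, rng):
    """Fraction of runs whose output contains the expected answer."""
    passes = sum(1 for _ in range(runs) if expected in model(prompt, rng))
    return passes / runs


rng = random.Random(42)  # seeded so the experiment is repeatable
rate = pass_rate(flaky_model, "The capital of France is", "Paris", 200, rng)
print(f"pass rate over 200 runs: {rate:.1%}")
```

A single run would report either pass or fail; the rate over many runs is the number you can track, compare across prompt versions, and hold to a threshold.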
⚠️ Common Misconceptions

"LLMs are basically a smarter search engine / database, right?" — No. A database stores facts and retrieves them exactly. Claude doesn't store or retrieve anything — it generates new text by predicting tokens. When it gives you a correct fact, it's because its training patterns lead to that prediction, not because it "looked it up." This is why it can produce plausible-sounding facts that are completely wrong — there's no lookup step to fail; it just predicts what sounds right.

"If I use temperature 0, the output is deterministic." — Almost, but not quite. Temperature 0 makes Claude pick the highest-probability token each time, which is highly consistent. But in practice, minor server-side differences (floating-point math, batching) can occasionally produce slightly different outputs. Design your agents for "extremely consistent," not "bit-for-bit identical every time."

"Claude understands what I'm saying." — It's more accurate to say Claude is extremely good at pattern matching over language. It processes the statistical relationships between tokens in ways that produce remarkably useful results, but it has no internal model of truth, no beliefs, and no comprehension in the human sense. This matters because it means Claude can confidently produce incorrect information — it doesn't "know" it's wrong.

"More parameters = more accurate." — Bigger models are generally more capable, but "capable" and "accurate" are different things. A larger model can handle more complex reasoning and nuanced prompts, but it can still hallucinate, and it may do so more convincingly. Size doesn't eliminate the need for verification and guardrails.

"If Claude gets something wrong, I should just ask again." — Retrying the same prompt is a lottery, not a strategy. If the model's training patterns lead it toward a wrong answer, it will likely give the same wrong answer again. The fix is to change the approach: rephrase the prompt, provide examples, add context, or use a tool to fetch the correct information. You'll learn all these techniques in Modules 3 through 5.

Code Walkthrough: Your First Claude API Call

Concept → Code Bridge: You now understand the theory — LLMs predict tokens, process input in parallel, and generate output sequentially. The next step is to see this in action. The code below sends a prompt to Claude's API and gets back a response, letting you observe token prediction firsthand (including how many tokens went in and came out).

Let's make your first call to the Claude API. You'll need an API key from console.anthropic.com.

  • API key: A secret string that authenticates your application with Anthropic's servers. Never share it, never commit it to code. Always store it as an environment variable.

Setup: API Key as Environment Variable

Security: Never Hardcode API Keys. Store your key as an environment variable. Never put it directly in code, never commit it to Git.
bash
# Set your API key as an environment variable
export ANTHROPIC_API_KEY="your-key-here"
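
Before making a call, it can help to fail fast with a clear message if the variable is missing. The helper below is a small sketch: `load_api_key` is a hypothetical name, and the SDK client already reads `ANTHROPIC_API_KEY` on its own, so this check is optional and only improves the error you see.

```python
import os


def load_api_key(env=os.environ):
    """Return the API key from the environment, or raise a readable error."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set. Export it in your shell before running."
        )
    return key


# Demonstration with a fake environment, so no real key is needed or printed.
fake_env = {"ANTHROPIC_API_KEY": "example-key-not-real"}
print("key loaded:", bool(load_api_key(fake_env)))
# prints: key loaded: True
```

In a real script you would call `load_api_key()` with no arguments once at startup, and never print or log the returned value.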

Making the Call

Let's start with the simplest possible Claude API call. The code below creates an Anthropic client, sends a single message, and prints the response. This three-step pattern — create client, call messages.create, read the content — is the foundation for every API call you'll make in this course. Once you internalize this structure, adding tools, streaming, and multi-turn conversations in later modules will feel like natural extensions.

Here's the one thing that trips up almost everyone on their first try: the response is message.content[0].text, not message.text. Why the extra [0]? Because content is a list of content blocks: a single response can contain several blocks, such as text and tool-use requests. Even for a simple text reply, you need [0] to grab the first block. Forget this and you'll get an attribute error (from message.text) or a confusing list object (from message.content) instead of a string.

python
# pip install "anthropic>=0.30.0"   (quoted so the shell does not treat > as a redirect)
import anthropic

# The client reads ANTHROPIC_API_KEY from your environment
client = anthropic.Anthropic()

try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant who explains things clearly.",
        messages=[
            {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
        ]
    )
    # The response content is a list of content blocks
    print(message.content[0].text)
    print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.AuthenticationError:
    print("Invalid API key. Check your ANTHROPIC_API_KEY environment variable.")
except anthropic.APIError as e:
    print(f"API error: {e.message}")

javascript
// npm install @anthropic-ai/sdk@^0.30.0
import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from your environment
const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant who explains things clearly.',
    messages: [
      { role: 'user', content: 'What is a large language model? Explain in 2 sentences.' }
    ]
  });

  // The response content is a list of content blocks
  console.log(message.content[0].text);
  console.log(`\nTokens used: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}

bash
curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a helpful assistant who explains things clearly.",
    "messages": [
      {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
    ]
  }'
Expected Output:
A large language model is a neural network trained on vast amounts of text data to predict and generate human-like text. It works by learning statistical patterns in language, enabling it to answer questions, write content, and assist with a wide range of language tasks.

Tokens used: 35 in, 48 out
What Just Happened? You sent 35 tokens of input (your prompt + system message) to Claude. Claude's transformer read all 35 tokens in parallel, then generated 48 tokens of output one at a time. Each output token was sampled from the model's probability distribution over possible next tokens, given everything before it. The usage object tells you exactly how many tokens were consumed, which will matter for cost tracking (Module 2) and context window management (Module 4).
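Because content is a list of blocks, indexing with [0] only works when the first block happens to be text. The following is a minimal sketch of a more defensive helper, not part of the SDK: collect_text is a hypothetical name, and the blocks are stand-in objects built with SimpleNamespace so the example runs without an API key. It assumes each block exposes a type attribute, with text blocks also carrying a text attribute.

```python
from types import SimpleNamespace


def collect_text(content_blocks):
    """Join the text of every block whose type is "text", skipping other block types."""
    parts = [
        block.text
        for block in content_blocks
        if getattr(block, "type", None) == "text"
    ]
    return "".join(parts)


# Stand-in objects shaped like response content blocks (hypothetical data).
fake_content = [
    SimpleNamespace(type="text", text="A large language model predicts text. "),
    SimpleNamespace(type="tool_use", name="lookup", input={}),  # hypothetical tool call
    SimpleNamespace(type="text", text="It generates one token at a time."),
]

print(collect_text(fake_content))
```

With a real response you would pass message.content to the same helper. For the simple text-only replies in this module, message.content[0].text remains the shortest route.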
Diagram: The Context Window
Context Window: 200,000 tokens total
  • System Prompt (~500 tokens): role, rules, persona (sent every call)
  • Conversation History (variable): user → assistant → user → assistant ... Grows with each turn and must be managed!
  • Current User Message (~35 tokens)
  • Response Headroom (max_tokens) (~1,024 tokens)
If these sections exceed 200K tokens total, the API returns an error.
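The arithmetic behind the diagram can be sketched in a few lines. This is an illustrative helper only: remaining_budget is a hypothetical name, the 150,000-token history is an invented figure, and real token counts should come from the API's usage data rather than from guesses.

```python
CONTEXT_WINDOW = 200_000  # total from the diagram above


def remaining_budget(system_tokens, history_tokens, message_tokens, max_tokens,
                     context_window=CONTEXT_WINDOW):
    """Return how many tokens are left after reserving every section of the window."""
    used = system_tokens + history_tokens + message_tokens + max_tokens
    return context_window - used


# Figures from the diagram, plus a hypothetical 150,000-token conversation history.
left = remaining_budget(
    system_tokens=500,
    history_tokens=150_000,
    message_tokens=35,
    max_tokens=1_024,
)
print(f"Tokens left: {left}")
print("fits" if left >= 0 else "over budget")
```

A negative result means the request would not fit, so the conversation history is the section to shrink, since it is the only one that grows on its own.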

Experimenting with Temperature

This next example sends the exact same prompt three times, each with a different temperature value (0.0, 0.5, and 1.0). The results will be noticeably different, and that's the whole point. Seeing identical input produce varied output is the fastest way to feel that LLMs are stochastic, not deterministic. This experiment is worth running yourself: reading about probability distributions is one thing, but watching Claude give you three different answers to the same question makes it click.

import anthropic

client = anthropic.Anthropic()

prompt = "Write a one-sentence description of the moon."

for temp in [0.0, 0.5, 1.0]:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=100,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"Temperature {temp}: {message.content[0].text}")
    except anthropic.APIError as e:
        print(f"Error at temperature {temp}: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const prompt = 'Write a one-sentence description of the moon.';

for (const temp of [0.0, 0.5, 1.0]) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 100,
      temperature: temp,
      messages: [{ role: 'user', content: prompt }]
    });
    console.log(`Temperature ${temp}: ${message.content[0].text}`);
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.error(`Error at temp ${temp}: ${error.message}`);
    } else {
      throw error;
    }
  }
}
Expected Output (results will vary at higher temps):
Temperature 0.0: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.
Temperature 0.5: The moon is a celestial companion to Earth, illuminating our night sky as it reflects the sun's light.
Temperature 1.0: Our luminous neighbor drifts through the cosmic dark, a pale stone mirror catching sunbeams for the sleeping world below.
What Just Happened? At temperature 0.0, Claude effectively picked the highest-probability token at each step, producing a factual, encyclopedic sentence. At 0.5, the distribution was still sharpened toward the top candidates, but second- or third-ranked tokens were occasionally chosen, yielding a slightly more varied but still grounded response. At 1.0, tokens were sampled from the model's distribution without that sharpening, so lower-probability (but more creative) words got selected, producing poetic and surprising language. Same model, same prompt, three very different outputs, all controlled by one number.
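To see why one number has this effect, here is a sketch of the textbook formulation of temperature: divide each raw score by the temperature, then apply softmax. The token names and scores are hypothetical, and this is an assumption about the general technique, not a description of how Claude's API implements sampling internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by temperature first."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
tokens = ["satellite", "companion", "lantern"]
logits = [3.0, 2.0, 1.0]

for temp in [0.0, 0.5, 1.0]:
    probs = softmax_with_temperature(logits, temp)
    summary = ", ".join(f"{t}={p:.2f}" for t, p in zip(tokens, probs))
    print(f"Temperature {temp}: {summary}")
```

Lower temperatures concentrate probability on the top-ranked token, while 1.0 leaves the original proportions untouched, which is why the less likely words start appearing.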

Hands-On Exercise: Hello Claude

What You'll Build

A series of small scripts that call Claude's API, experiment with temperature, and culminate in a working CLI chatbot. By the end you'll have made your first API call, observed how temperature changes output, and built a multi-turn conversation loop.

Time estimate: 20–30 minutes • Prerequisites: Python 3.9+ or Node.js 18+ • An Anthropic API key from console.anthropic.com

Environment Setup

Copy and paste the block for your language into your terminal to create a project folder and install the SDK:

# Python
mkdir hello-claude && cd hello-claude
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
# venv\Scripts\activate
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY="your-key-here"   # Windows: set ANTHROPIC_API_KEY=your-key-here

# Node.js
mkdir hello-claude && cd hello-claude
npm init -y
npm install @anthropic-ai/sdk@^0.30.0
export ANTHROPIC_API_KEY="your-key-here"   # Windows: set ANTHROPIC_API_KEY=your-key-here

Step 1: Make Your First API Call

This step verifies that your API key works and that you can communicate with Claude. It's the "hello world" of agent development — if this works, everything else in the course will build on it.

Create a new file called hello.py (or hello.mjs for Node.js):

import anthropic

client = anthropic.Anthropic()

try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant who explains things clearly.",
        messages=[
            {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
        ]
    )
    print(message.content[0].text)
    print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.AuthenticationError:
    print("Invalid API key. Check your ANTHROPIC_API_KEY environment variable.")
except anthropic.APIError as e:
    print(f"API error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant who explains things clearly.',
    messages: [
      { role: 'user', content: 'What is a large language model? Explain in 2 sentences.' }
    ]
  });
  console.log(message.content[0].text);
  console.log(`\nTokens used: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}

Run it: python hello.py (or node hello.mjs)

Expected Output:
A large language model is a neural network trained on vast amounts of text data to predict and generate human-like text. It works by learning statistical patterns in language, enabling it to answer questions, write content, and assist with a wide range of language tasks.

Tokens used: 35 in, 48 out
✅ Checkpoint: If you see a 2-sentence explanation followed by a token count, Step 1 is working. Your exact wording will differ — that's expected (remember, Claude is a thinker, not a calculator).
Troubleshooting
  • AuthenticationError — Your API key is missing or invalid. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to check it's set.
  • ModuleNotFoundError: No module named 'anthropic' — You're not in the virtual environment. Run source venv/bin/activate first, then pip install anthropic.
  • Connection error / timeout — Check your internet connection. The API endpoint is api.anthropic.com — make sure it's not blocked by a firewall or VPN.
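If you'd rather check the key from Python than from the shell, a small pre-flight script along these lines may help. It is only a sketch: check_api_key is a hypothetical helper, and the checks reflect common setup mistakes (unset variable, stray whitespace, the placeholder from the setup block), not any validation the SDK performs.

```python
import os


def check_api_key(env=os.environ):
    """Return a short diagnostic string describing the state of ANTHROPIC_API_KEY."""
    key = env.get("ANTHROPIC_API_KEY", "")
    if not key:
        return "missing"
    if key != key.strip():
        return "has surrounding whitespace"
    if key == "your-key-here":
        return "still the placeholder"
    return "set"


# Reports only the state of the variable, never the key itself.
print(f"ANTHROPIC_API_KEY is {check_api_key()}")
```

A result of "set" only means the variable exists in this process; whether the key is valid is still decided by the API when you run hello.py.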

Step 2: Experiment with System Prompts

System prompts shape Claude's personality and behavior. This step shows you how much control a single string gives you over the output. You'll use the same user message but swap the system prompt to see wildly different responses.

Create a new file called system_prompts.py (or system_prompts.mjs):

import anthropic

client = anthropic.Anthropic()

prompts = [
    "You are a pirate. Respond in pirate speak.",
    "You are a formal academic. Use precise, scholarly language.",
    "Respond only in haiku format (5-7-5 syllables).",
]

for system_prompt in prompts:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            system=system_prompt,
            messages=[{"role": "user", "content": "What is the moon?"}]
        )
        print(f"System: {system_prompt}")
        print(f"Response: {message.content[0].text}\n")
    except anthropic.APIError as e:
        print(f"Error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const prompts = [
  'You are a pirate. Respond in pirate speak.',
  'You are a formal academic. Use precise, scholarly language.',
  'Respond only in haiku format (5-7-5 syllables).',
];

for (const systemPrompt of prompts) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 200,
      system: systemPrompt,
      messages: [{ role: 'user', content: 'What is the moon?' }]
    });
    console.log(`System: ${systemPrompt}`);
    console.log(`Response: ${message.content[0].text}\n`);
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}

Run it: python system_prompts.py (or node system_prompts.mjs)

Expected Output (your wording will vary):
System: You are a pirate. Respond in pirate speak.
Response: Arrr, the moon be that great glowing orb in the night sky, matey! She guides us pirates across the seven seas...

System: You are a formal academic. Use precise, scholarly language.
Response: The Moon is Earth's sole natural satellite, orbiting at a mean distance of approximately 384,400 kilometres...

System: Respond only in haiku format (5-7-5 syllables).
Response: Silver orb above / Pulling tides across the sea / Night's faithful lantern
✅ Checkpoint: If you see three distinctly different responses to the same question, Step 2 is working. The system prompt is the primary way you'll control agent behavior throughout this course.

Step 3: Temperature Experiment

This step makes the "thinker, not calculator" concept visceral. You'll run the exact same prompt multiple times at different temperatures and compare how consistent the outputs are. This is the experiment that makes non-determinism click.

Create a new file called temperature.py (or temperature.mjs):

import anthropic

client = anthropic.Anthropic()
prompt = "Write a one-sentence description of the moon."

for temp in [0.0, 1.0]:
    print(f"\n--- Temperature {temp} ---")
    for i in range(3):
        try:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=100,
                temperature=temp,
                messages=[{"role": "user", "content": prompt}]
            )
            print(f"  Run {i+1}: {message.content[0].text}")
        except anthropic.APIError as e:
            print(f"  Error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const prompt = 'Write a one-sentence description of the moon.';

for (const temp of [0.0, 1.0]) {
  console.log(`\n--- Temperature ${temp} ---`);
  for (let i = 0; i < 3; i++) {
    try {
      const message = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 100,
        temperature: temp,
        messages: [{ role: 'user', content: prompt }]
      });
      console.log(`  Run ${i + 1}: ${message.content[0].text}`);
    } catch (error) {
      console.error(`  Error: ${error.message}`);
    }
  }
}

Run it: python temperature.py (or node temperature.mjs)

Expected Output:
--- Temperature 0.0 ---
  Run 1: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.
  Run 2: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.
  Run 3: The moon is Earth's only natural satellite, orbiting at an average distance of about 384,400 kilometers.

--- Temperature 1.0 ---
  Run 1: Our luminous neighbor drifts through the cosmic dark, a pale stone mirror catching sunbeams for the sleeping world below.
  Run 2: The moon is a celestial companion that has silently witnessed every chapter of Earth's long story.
  Run 3: Hanging in the night sky like a worn silver coin, the moon pulls our tides and stirs our oldest myths.
✅ Checkpoint: At temperature 0.0, all three runs should be nearly or exactly identical. At temperature 1.0, each run should be noticeably different. If you see this pattern, you've just observed the core difference between deterministic and stochastic generation.
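If you want a number instead of eyeballing the runs, you could collect the outputs in a list and summarize them. The helper below is a hypothetical sketch (summarize_runs is not part of the SDK) and uses placeholder strings in place of real responses, so it runs without an API key.

```python
from collections import Counter


def summarize_runs(outputs):
    """Return (number of distinct outputs, share of runs matching the most common one)."""
    if not outputs:
        raise ValueError("need at least one output to summarize")
    counts = Counter(text.strip() for text in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return len(counts), most_common_count / len(outputs)


# Placeholder outputs standing in for three runs at each temperature.
runs_at_zero = ["The moon is Earth's only natural satellite."] * 3
runs_at_one = [
    "Our luminous neighbor drifts through the cosmic dark.",
    "The moon is a celestial companion to Earth.",
    "Hanging in the night sky like a worn silver coin.",
]

for label, runs in [("0.0", runs_at_zero), ("1.0", runs_at_one)]:
    distinct, agreement = summarize_runs(runs)
    print(f"Temperature {label}: {distinct} distinct output(s), {agreement:.0%} agreement")
```

In temperature.py you would append each message.content[0].text to a list inside the inner loop and pass that list to the helper.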

Step 4: Observe Token Usage

Token counts drive API costs and context limits — two concepts you'll use throughout the entire course. This step trains you to notice how different prompts and settings affect token consumption, so it becomes instinctive. This uses the message.usage object from Step 1.

Create a new file called token_usage.py (or token_usage.mjs):

import anthropic

client = anthropic.Anthropic()

tests = [
    ("Short prompt", "Hi!", 50),
    ("Medium prompt", "Explain what a large language model is in detail.", 200),
    ("Long prompt with constraint", "Write a 3-paragraph essay about the history of computing.", 1024),
]

for label, prompt, max_tok in tests:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tok,
            messages=[{"role": "user", "content": prompt}]
        )
        u = message.usage
        print(f"{label}:")
        print(f"  Input tokens:  {u.input_tokens}")
        print(f"  Output tokens: {u.output_tokens}")
        print(f"  Total tokens:  {u.input_tokens + u.output_tokens}\n")
    except anthropic.APIError as e:
        print(f"Error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const tests = [
  ['Short prompt', 'Hi!', 50],
  ['Medium prompt', 'Explain what a large language model is in detail.', 200],
  ['Long prompt with constraint', 'Write a 3-paragraph essay about the history of computing.', 1024],
];

for (const [label, prompt, maxTok] of tests) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: maxTok,
      messages: [{ role: 'user', content: prompt }]
    });
    const u = message.usage;
    console.log(`${label}:`);
    console.log(`  Input tokens:  ${u.input_tokens}`);
    console.log(`  Output tokens: ${u.output_tokens}`);
    console.log(`  Total tokens:  ${u.input_tokens + u.output_tokens}\n`);
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}

Run it: python token_usage.py (or node token_usage.mjs)

Expected Output (numbers will vary slightly):
Short prompt:
  Input tokens:  9
  Output tokens: 32
  Total tokens:  41

Medium prompt:
  Input tokens:  16
  Output tokens: 187
  Total tokens:  203

Long prompt with constraint:
  Input tokens:  18
  Output tokens: 412
  Total tokens:  430
✅ Checkpoint: Notice how short prompts use few input tokens but the output length varies based on what you asked for. The max_tokens parameter is a ceiling, not a target — Claude often uses fewer. You'll explore tokens deeply in M02.
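Since max_tokens is a ceiling, it helps to know when a reply actually hit it. Responses from the Messages API carry a stop_reason field, and the value "max_tokens" indicates the output was cut off at the limit. The sketch below uses stand-in objects instead of real responses so it runs offline; hit_token_ceiling is a hypothetical helper name.

```python
from types import SimpleNamespace


def hit_token_ceiling(message):
    """Return True when the reply stopped because it reached max_tokens."""
    return message.stop_reason == "max_tokens"


# Stand-ins shaped like API responses (hypothetical data).
finished = SimpleNamespace(stop_reason="end_turn",
                           usage=SimpleNamespace(output_tokens=32))
cut_off = SimpleNamespace(stop_reason="max_tokens",
                          usage=SimpleNamespace(output_tokens=50))

for label, msg in [("finished", finished), ("cut_off", cut_off)]:
    print(f"{label}: {msg.usage.output_tokens} output tokens, "
          f"truncated={hit_token_ceiling(msg)}")
```

In token_usage.py, the same check on message would tell you whether a test case needs a larger max_tok value.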

Step 5 (Stretch Goal): Build a CLI Chat

Concept → Code Bridge: The previous examples were single-turn: one message in, one response out. Real agents have multi-turn conversations. The key difference is the conversation list — you accumulate messages so Claude can see the full history, just like how each output token depends on all previous tokens.

Now let's build something you can actually interact with: a terminal chatbot that remembers your conversation. This matters because multi-turn conversation is the foundation of every agent — agents don't just answer one question, they maintain context across a sequence of actions. The pattern you'll see here (accumulate messages in a list, send the full list on each call, append the response) is the exact same loop that powers production agent frameworks.

Here's the pitfall to watch for: what happens when an API call fails mid-conversation? Notice the conversation.pop() in the error handler. If you skip it, a user message is left sitting in the list with no assistant response after it, and your next message gets appended right behind it. Claude's API is built around alternating user/assistant turns, so that stray message would at best be sent again as part of your next turn and at worst cause the request to be rejected. That one line of cleanup prevents a cascade of confusing errors.

import anthropic

client = anthropic.Anthropic()
conversation = []

print("Chat with Claude! (type 'quit' to exit)\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ("quit", "exit"):
        break
    if not user_input:
        continue

    conversation.append({"role": "user", "content": user_input})

    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system="You are a friendly, helpful assistant.",
            messages=conversation
        )
        assistant_msg = response.content[0].text
        conversation.append({"role": "assistant", "content": assistant_msg})
        print(f"\nClaude: {assistant_msg}\n")
    except anthropic.APIError as e:
        print(f"\nError: {e.message}\n")
        # Remove the failed user message so conversation stays consistent
        conversation.pop()
import Anthropic from '@anthropic-ai/sdk';
import * as readline from 'readline';

const client = new Anthropic();
const conversation = [];

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

console.log("Chat with Claude! (type 'quit' to exit)\n");

function ask() {
  rl.question('You: ', async (userInput) => {
    userInput = userInput.trim();
    if (['quit', 'exit'].includes(userInput.toLowerCase())) {
      rl.close();
      return;
    }
    if (!userInput) { ask(); return; }

    conversation.push({ role: 'user', content: userInput });

    try {
      const response = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 512,
        system: 'You are a friendly, helpful assistant.',
        messages: conversation
      });
      const assistantMsg = response.content[0].text;
      conversation.push({ role: 'assistant', content: assistantMsg });
      console.log(`\nClaude: ${assistantMsg}\n`);
    } catch (error) {
      console.error(`\nError: ${error.message}\n`);
      conversation.pop();
    }
    ask();
  });
}
ask();
What Just Happened? You built a multi-turn chat loop. Each time you type a message, it's appended to the conversation array, sent to Claude along with the full history, and the response is appended back. Claude sees every previous exchange on each call — that's how it "remembers" the conversation. This pattern (accumulate messages → send all → append response) is the exact loop that powers agent frameworks you'll see in Modules 7–9.
✅ Checkpoint: If you can have a multi-turn conversation where Claude remembers what you said earlier (e.g., "My name is Alex" followed by "What's my name?"), Step 5 is working. Type quit to exit.
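The append, call, roll-back-on-failure pattern can be isolated and tested without touching the network. In this sketch, send_turn and echo_model are hypothetical names, and echo_model is a stand-in for the API that fails on purpose. The helper catches a bare Exception for brevity, whereas the chat script above catches anthropic.APIError.

```python
def send_turn(conversation, user_text, call_model):
    """Append a user turn, call the model, and roll back if the call fails.

    Returns the assistant's reply, or None when the call raised an exception.
    """
    conversation.append({"role": "user", "content": user_text})
    try:
        reply = call_model(conversation)
    except Exception:
        conversation.pop()  # drop the unanswered user message
        return None
    conversation.append({"role": "assistant", "content": reply})
    return reply


def echo_model(conversation):
    """Hypothetical stand-in for the API: fails when the message contains 'boom'."""
    last = conversation[-1]["content"]
    if "boom" in last:
        raise RuntimeError("simulated API failure")
    return f"You said: {last}"


history = []
send_turn(history, "My name is Alex", echo_model)
send_turn(history, "boom", echo_model)  # fails and is rolled back
send_turn(history, "What's my name?", echo_model)
print([m["role"] for m in history])
```

After the failed turn is rolled back, the history still alternates cleanly between user and assistant, which is the property the cleanup line protects.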
Troubleshooting
  • SyntaxError: Cannot use import statement outside a module (Node.js): Make sure your file has the .mjs extension or your package.json contains "type": "module" for ES module imports.
  • Claude doesn't remember previous messages — Check that you're appending both the user message and the assistant response to the conversation list. Both must be present.
  • Rate limit errors after many messages — Add a short delay between calls or reduce your conversation length. Long conversations also consume more tokens per call.
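One simple way to reduce conversation length is to keep only the most recent messages. The sketch below is an assumption-laden illustration, not a recommended production strategy: trim_history is a hypothetical helper, the cutoff is counted in messages (not tokens), and it drops any leading assistant message so the kept slice still opens with a user turn.

```python
def trim_history(conversation, max_messages):
    """Keep the most recent messages, making sure the result starts with a user turn."""
    if max_messages <= 0:
        return []
    trimmed = conversation[-max_messages:]
    # Drop leading assistant messages so the slice opens with a user message.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


# Hypothetical six-message history: three user/assistant exchanges.
history = []
for n in range(1, 4):
    history.append({"role": "user", "content": f"question {n}"})
    history.append({"role": "assistant", "content": f"answer {n}"})

recent = trim_history(history, 3)
print([m["content"] for m in recent])
```

Trimming discards what Claude can see of earlier turns, so anything that must persist (such as the user's name) would need to live in the system prompt or be restated.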

Verify Everything Works

Run all four scripts in sequence to confirm your setup is complete:

# Python
python hello.py && python system_prompts.py && python temperature.py && python token_usage.py

# Node.js
node hello.mjs && node system_prompts.mjs && node temperature.mjs && node token_usage.mjs
🎉 Congratulations! If all four scripts ran without errors, you've completed M01. You can make API calls, control output with system prompts and temperature, and track token usage. These are the building blocks for everything that follows. Onward to M02!

Knowledge Check

Test your understanding of the concepts from this module. Select the best answer for each question.

Q1: What does a Large Language Model fundamentally do?

A Searches a database of pre-written answers and returns the best match
B Predicts the most likely next token in a sequence based on patterns learned from training data
C Executes code instructions to compute the correct answer to any question
D Understands the meaning of text the same way humans do, then formulates a response
Answer: B. LLMs are next-token predictors. They don't search databases, execute code, or understand language in a human sense; they predict what text comes next based on statistical patterns.

Q2: What does the temperature parameter control?

A The maximum length of the response
B How fast the model generates tokens
C The randomness of token selection — lower means more deterministic, higher means more creative
D The accuracy of the model's responses
Answer: C. Temperature rescales the probability distribution before a token is sampled. At 0, the model effectively always picks the highest-probability token. At higher values, lower-probability tokens get a better chance of being selected.

Q3: Why is Claude described as a "thinker" rather than a "calculator"?

A Because Claude is slow at math
B Because its outputs can vary, can be wrong, and require verification — unlike deterministic computation
C Because Claude has consciousness and feelings
D Because Claude cannot do any calculations at all
Answer: B. The "thinker" mental model reminds us that LLM outputs are probabilistic, not deterministic. This is why agents need guardrails, evaluation, and human oversight.

Q4: How does Claude process input differently from how it generates output?

A Input is processed all at once (parallel); output is generated one token at a time (autoregressive)
B Both input and output are processed one token at a time
C Input is processed one word at a time; output is generated all at once
D Both are processed in parallel simultaneously
Answer: A. The transformer architecture processes all input tokens simultaneously using attention, but output must be generated one token at a time because each token depends on the ones before it.

Q5: Fill in the blank to complete a valid Claude API call:

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(message.______[0].text)
A response
B text
C content
D choices
Answer: C. The Messages API returns a content array. For a text reply, each element is a content block with a text property. So it's message.content[0].text.

Q6: What is the recommended temperature setting for an agent that calls tools and makes decisions?

A 1.0 — Maximum creativity for best results
B 0.7 — A balanced middle ground
C It doesn't matter; temperature has no effect on tool use
D 0.0–0.3 — Low temperature for reliable, consistent behavior
Answer: D. Agents need predictable, consistent outputs. Low temperature keeps the model close to its most likely tokens, reducing the chance of unexpected tool calls or erratic behavior.

Module Summary

Key Takeaways

  • LLMs are next-token predictors — they generate text by predicting the most likely continuation of a sequence.
  • Input = parallel, Output = sequential — Claude reads everything at once but writes one token at a time.
  • Temperature controls randomness — low for agents (reliability), high for creative tasks (variety).
  • Thinker, not calculator — outputs are probabilistic and need verification. This mental model guides every agent design decision.
  • The API is simple — create a client, send messages, get a response with content blocks and usage stats.

Next Module Preview: M02 — Tokens

Now that you know Claude predicts tokens, the natural question is: what exactly is a token? In Module 2, you'll build an interactive tokenizer, understand how tokens affect cost and context limits, and create a token budget calculator — a tool you'll use throughout the entire course.