M01: The LLM Mental Model
Understand what a Large Language Model really is, how Claude processes your text, and why the right mental model changes everything you'll build in this course.
Learning Objectives
- Explain what a Large Language Model is and how it generates text one token at a time
- Describe the difference between how Claude reads input (all at once) and writes output (one token at a time)
- Use temperature, top-p, and top-k controls and predict how they change output
- Make your first Claude API call using Python and Node.js
- Adopt the "thinker, not calculator" mental model for working with LLMs
What Is a Large Language Model?
BEFORE: Before LLMs, if you wanted a computer to answer a question, someone had to explicitly program every possible answer — think of old-school chatbots with giant lists of if/then rules, or search engines that could only find pages containing your exact keywords.
PAIN: That approach broke down constantly. Ask the chatbot something the programmer didn't anticipate, and you'd get "I don't understand." Ask a search engine a nuanced question, and you'd sift through ten blue links hoping one had your answer.
MAPPING: An LLM like Claude is the world's most well-read autocomplete. Your phone's autocomplete has read your messages; Claude has read billions of documents — books, code, conversations, scientific papers — and uses all of that to predict what comes next. Instead of following hand-written rules, it learned patterns from that mountain of text, so it can handle questions and tasks nobody explicitly programmed it for. It's autocomplete that went to every university, read every manual, and practiced every writing style.
What this actually looks like: When you type "The capital of France is", the model doesn't look up "France" in a table. Instead, it computes a probability for every possible next token. Here's a simplified version of what that prediction looks like internally:
Input: "The capital of France is"
Next token predictions:
" Paris" → 0.92 (92% probability)
" the" → 0.03 (3%)
" located" → 0.02 (2%)
" a" → 0.01 (1%)
" Lyon" → 0.005 (0.5%)
... thousands more tokens with tiny probabilities
First, "neural network" means a mathematical system that learns by example rather than by following hand-written rules. You show it billions of text samples, and it gradually adjusts billions of internal numbers (called parameters) until it gets good at one specific job.
That one job is: given a sequence of tokens (small chunks of text, roughly words or word-pieces), predict the most likely next token. That's it. Every impressive thing an LLM does (writing code, answering questions, translating languages) is a side effect of getting extremely good at next-token prediction. (Token: the smallest unit of text that an LLM works with. A token can be a word, part of a word, or a punctuation mark. We'll explore tokens deeply in Module 2.)
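To make next-token prediction concrete, here is a minimal sketch in plain Python. The candidate tokens and their scores are made up for illustration (they are not real model outputs); only the mechanism, turning scores into probabilities and reading off the most likely token, mirrors the description above.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token that follows "The capital of France is".
toy_scores = {" Paris": 9.0, " the": 5.6, " located": 5.2, " a": 4.5, " Lyon": 3.8}

probs = softmax(toy_scores)
next_token = max(probs, key=probs.get)  # the single most likely continuation

for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {p:.3f}")
print("Predicted next token:", repr(next_token))
```

A real model scores every token in its vocabulary, not five, but the shape of the computation is the same.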
Now for the "Large" part. Frontier models are generally understood to have billions to hundreds of billions of parameters (Anthropic does not publish Claude's parameter count) and are trained on terabytes of text. Such a model doesn't "understand" language the way humans do; it finds statistical patterns at a scale that produces remarkably useful results. The reason this matters for you as a builder is that the model's power and its failure modes both come from this prediction mechanism.
How Claude Processes Text
BEFORE: Before transformer-based models, older AI systems (like recurrent neural networks) had to read text word-by-word in order, like a person reading a sentence while covering up the words ahead of them. This made them slow and forgetful — by the time they reached the end of a long paragraph, they'd already started "forgetting" the beginning.
PAIN: That sequential reading created a bottleneck: longer inputs meant worse comprehension, because the model couldn't hold everything in mind at once.
MAPPING: Claude works like a speed reader who absorbs an entire page in one glance — every word attended to simultaneously, related to every other word. Then it writes its response one word at a time, each word influenced by everything it read plus everything it has written so far. The reading is instant and parallel; the writing is careful and sequential.
What this actually looks like: Here's a real API response showing the timing asymmetry. Notice how the input (your prompt) is processed almost instantly, but the output tokens trickle out one by one:
Request: 850 tokens of input
Response: 120 tokens of output
Timeline:
0ms → Request sent
180ms → First output token arrives (all 850 input tokens processed)
180ms → "The"
210ms → " best"
240ms → " approach"
... (each token ~30ms apart)
3780ms → Final token generated
Input processing: 180ms (850 tokens, all at once)
Output generation: 3600ms (120 tokens, one at a time)
This is why sending a 1,000-token prompt to Claude is nearly as fast as sending a 100-token prompt: the input processing step happens in parallel.
Output generation, however, works completely differently. It's autoregressive, meaning "each step feeds into the next." Each new token is predicted based on all input tokens plus all previously generated output tokens: Claude generates token 5 by looking at tokens 1-4 plus the entire input. This is why output is the slow part: each token must wait for the previous one to be generated first.
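The read-all-at-once, write-one-at-a-time loop can be sketched in a few lines. This is a toy under stated assumptions: `toy_model` is a hypothetical stand-in that looks only at the last token, whereas a real LLM conditions on the whole context through attention.

```python
def toy_model(context):
    """Hypothetical stand-in for a model: return the next token for a context."""
    rules = {"is": "Paris", "Paris": ".", ".": "<end>"}
    return rules.get(context[-1], "<end>")

def generate(prompt_tokens, max_tokens=10):
    context = list(prompt_tokens)  # the input is available all at once
    output = []
    for _ in range(max_tokens):    # the output is produced one token at a time
        token = toy_model(context)
        if token == "<end>":
            break
        output.append(token)
        context.append(token)      # each new token feeds the next prediction
    return output

print(generate(["The", "capital", "of", "France", "is"]))
```

Note that step N cannot begin until step N-1 has appended its token, which is exactly why generation cannot be parallelized the way reading can.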
How Inference Actually Works
BEFORE: You probably picture an LLM as a function: question goes in, full answer comes out. That mental model is wrong in a specific way that matters once you start optimizing latency and cost.
PAIN: Without the right picture, you can’t reason about why the first token takes 800 ms but the next 200 tokens stream out at 60 ms each. You can’t explain why a 50K-token prompt is expensive even before Claude writes a single word back. You can’t budget for production.
MAPPING: Inference is autoregressive — Claude generates output one token at a time, and each new token is conditioned on every token before it (yours and its own). Picture someone typing a long reply on a phone: they read what they’ve typed so far, pick the most likely next letter, append it, re-read, pick the next letter, append, and so on. That’s inference. There’s no “full answer” sitting in the model waiting to be unwrapped — the answer is constructed token-by-token, in real time, in the same call.
Inference is what happens when Claude is running, not learning: it is the process of running a trained LLM to produce output. As opposed to training (which adjusts the model’s weights), inference uses the frozen weights to predict tokens. The cost you pay per API call is inference cost. Every call to the API kicks off two distinct phases:
1. Prefill (a.k.a. the “forward pass” over your prompt). Your prompt (system + messages + tool definitions) is tokenized (M02), converted to embeddings, and pushed through every transformer layer in parallel. Embeddings are high-dimensional vectors (4096 numbers, in many models) that represent a token in a meaning-space; tokens that are semantically related land near each other in that space. We’ll dig into embeddings in M09. The model computes attention across all input tokens at once, which is fast per-token but heavy: cost is roughly O(N²) in prompt length. The output of prefill is one set of logits (raw scores over the entire vocabulary, later turned into a probability distribution) for the next token to generate.
2. Decode (a.k.a. token-by-token generation). The model samples one token from the prefill logits, appends it to the running sequence, and runs just that new token through the transformer, reusing cached attention values for everything before it (the KV cache). That produces the next set of logits. Sample, append, repeat, until the model emits a stop token or hits max_tokens. Decode is sequential by construction: token N can’t start until token N−1 exists. (KV cache: Key/Value cache. During decode, the attention keys and values for already-processed tokens are kept in memory so each new token only needs to compute its own attention against the cache, not re-run the whole prompt. This is what makes streaming cheap per-token after the first one.)
Two phases, two costs. Prefill latency is paid once per request and dominates “time to first token.” Decode latency is paid per output token and dominates “tokens per second.” This split is why a 50K-token prompt with a 100-token answer feels slow to start but finishes quickly — and why a 200-token prompt with a 4000-token answer feels snappy at first but takes forever.
Prefill (once per request):
- Tokenize prompt (N tokens)
- Embed all N tokens
- Run through every transformer layer in parallel
- Build the KV cache for every token
- Output: logits for token N+1
Cost: ~O(N²) compute, but parallelizable. Latency: drives time-to-first-token.
Decode (repeated for each output token):
1. Sample one token from current logits
2. Append it to the running sequence
3. Run just the new token through layers, reading KV cache
4. Get logits for the next token
5. Stop if token == <end> or budget hit; else go to step 1
Cost: roughly constant layer compute per token, plus a KV-cache read that grows with sequence length (memory-bound). Latency: drives tokens-per-second.
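To see why the KV cache matters, here is a back-of-the-envelope sketch. The cost model is an assumption made for illustration: it counts one unit of work for each token position pushed through the layers during decode, and ignores the cost of reading the cache itself.

```python
def decode_work(prompt_len, new_tokens, use_cache):
    """Count token positions run through the layers while decoding.

    Simplified cost model (an assumption): one unit per token per step.
    """
    work = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        # Without a cache the whole sequence is re-run at every step;
        # with a cache only the newest token is.
        work += 1 if use_cache else seq_len
        seq_len += 1
    return work

without_cache = decode_work(prompt_len=1000, new_tokens=100, use_cache=False)
with_cache = decode_work(prompt_len=1000, new_tokens=100, use_cache=True)
print(f"no cache: {without_cache} units, with cache: {with_cache} units")
```

Under this toy model the uncached loop does about a thousand times more work for the same 100 output tokens, which is the intuition behind "decode is cheap per token once prefill is done."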
Sampling — What “Logits” Become a Token
The transformer doesn’t pick a token; it outputs logits — a raw score per token in the vocabulary (~200K entries for Claude). To turn logits into a chosen token, three steps run:
- Softmax with temperature. Divide the logits by temperature T, then apply softmax to get a probability distribution. T = 0 is handled as a special case (always pick the argmax, since dividing by zero is undefined); T = 1 keeps the model’s native distribution; T > 1 flattens it (more surprise).
- Top-k / Top-p filter. Optionally truncate the distribution to either the top-k highest-probability tokens, or the smallest set whose cumulative probability exceeds p (nucleus sampling). This blocks the long tail of nonsense tokens.
- Sample. Draw one token from the (renormalized) filtered distribution. This single token becomes the “next token,” gets appended, and the loop continues.
Important consequence: the model’s output is non-deterministic at any temperature above 0, because the sampling step draws at random. Set temperature=0 if you need reproducibility; that picks the argmax with no sampling. The next section, on Temperature, Top-p & Top-k, goes deeper on when to use each.
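The three steps above can be sketched in plain Python. This is a minimal illustration, not Claude's actual sampler: the function name, the order of the two filters, and the toy logits are all assumptions, and `top_k` is expected to be at least 1 when given.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a dict of {token: logit}."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: argmax, no randomness

    # 1. Softmax with temperature.
    scaled = {t: s / temperature for t, s in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(s - peak) for t, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    # 2. Top-k, then top-p (nucleus) filtering.
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # 3. Sample; random.choices treats the weights as relative, so the
    #    filtered distribution is effectively renormalized.
    tokens, weights = zip(*ranked)
    return rng.choices(tokens, weights=weights, k=1)[0]

toy_logits = {" to": 3.0, " by": 1.0, " through": 0.5}  # made-up scores
print(sample_next_token(toy_logits, temperature=0))
print(sample_next_token(toy_logits, temperature=1.0, top_k=2))
```

Passing a seeded `random.Random(...)` as `rng` makes the toy sampler repeatable, which is handy when experimenting.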
Latency Anatomy — Where Your Seconds Go
Real numbers, on Claude Sonnet 4.6 (typical 2026 production load, single-region):
| Phase | What dominates | Order of magnitude |
|---|---|---|
| Network | TLS, request routing, region distance | 30–150 ms |
| Prefill (TTFT) | Prompt length; quadratic-ish in N | ~50 ms / 1K tokens (uncached); <5 ms / 1K cached |
| Decode | Output length; linear in tokens generated | ~50–100 tokens/s (Sonnet); higher with speculative decoding |
| Server queue | Concurrent traffic, rate-limit tier | 0–500 ms p99 |
- Time-to-first-token (TTFT) is set almost entirely by prompt length and whether the prefix is cached. Prompt caching (M22) turns “5 seconds of prefill” into “200 ms of prefill” on repeat-heavy prompts.
- Tokens-per-second (TPS) is set by the decode step. Output length is the cost driver here: doubling the number of generated tokens doubles decode time, regardless of how long the prompt was (max_tokens only caps that number).
- Streaming (next subsection) doesn’t change total wall-clock latency; it just delivers tokens as they’re produced so users see something happening at TTFT instead of waiting for the full decode.
- Extended thinking (M22) and reasoning models add a hidden third phase — thinking tokens are decoded before any visible response. We’ll connect the dots in M22.
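For budgeting, the table can be turned into a rough estimator. This is a sketch under stated assumptions: the default rates are picked from the ranges in the table above, prefill is approximated as linear in prompt length, and queueing and caching are ignored. The function name and parameters are hypothetical.

```python
def estimate_latency_ms(prompt_tokens, output_tokens,
                        prefill_ms_per_1k=50.0,    # assumed, from the table
                        decode_tokens_per_s=75.0,  # assumed midpoint of 50-100
                        network_ms=50.0):          # assumed, within 30-150 ms
    """Rough time-to-first-token, decode time, and total, in milliseconds."""
    ttft = network_ms + prompt_tokens / 1000 * prefill_ms_per_1k
    decode = output_tokens / decode_tokens_per_s * 1000
    return {"ttft_ms": ttft, "decode_ms": decode, "total_ms": ttft + decode}

big_prompt = estimate_latency_ms(prompt_tokens=50_000, output_tokens=100)
long_answer = estimate_latency_ms(prompt_tokens=200, output_tokens=4000)
for name, r in [("50K in / 100 out", big_prompt), ("200 in / 4000 out", long_answer)]:
    print(f"{name}: TTFT {r['ttft_ms']:.0f} ms, decode {r['decode_ms']:.0f} ms")
```

The two scenarios reproduce the pattern described earlier: the large prompt is slow to start, the long answer is slow to finish. Measure your own workload before relying on any of these constants.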
Streaming vs Batch — Same Inference, Different Delivery
One request, two ways to receive the answer:
- Non-streaming — you wait for the full decode, then receive the entire response in one chunk. Simple; the agent code in M05–M07 uses this shape.
- Streaming — the server sends each decoded token (or small group) as Server-Sent Events. Same total time, but the user sees the first tokens immediately. Essential for chat UIs and any agent loop where you want to surface progress before the response is complete.
from anthropic import Anthropic
import time
client = Anthropic()
# Streaming: each token (or small chunk) arrives as it's decoded.
t0 = time.perf_counter()
first_token_t = None
total_tokens = 0
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain inference in one paragraph."}],
) as stream:
    for chunk in stream.text_stream:
        if first_token_t is None:
            # First chunk: prefill is done and decoding has started.
            first_token_t = time.perf_counter() - t0
            print(f"[TTFT: {first_token_t*1000:.0f} ms]")
        print(chunk, end="", flush=True)
        total_tokens += 1  # text chunks, not perfect token counts; close enough for ops
elapsed = time.perf_counter() - t0
# Fall back to 0 if no chunk ever arrived, so the subtraction cannot fail on None.
print(f"\n[total: {elapsed:.2f}s, decode rate ~ {total_tokens/max(elapsed-(first_token_t or 0.0), 0.001):.0f} chunks/s]")
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const t0 = performance.now();
let firstTokenMs: number | null = null;
let chunks = 0;
const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  messages: [{ role: "user", content: "Explain inference in one paragraph." }],
});
for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    if (firstTokenMs === null) {
      firstTokenMs = performance.now() - t0;
      console.log(`[TTFT: ${firstTokenMs.toFixed(0)} ms]`);
    }
    process.stdout.write(event.delta.text);
    chunks++;
  }
}
const elapsed = performance.now() - t0;
console.log(`\n[total: ${(elapsed / 1000).toFixed(2)}s, decode rate ~ ${(chunks / Math.max((elapsed - (firstTokenMs ?? 0)) / 1000, 0.001)).toFixed(0)} chunks/s]`);
You watched the two-phase model in real time. The TTFT print fires exactly when prefill finishes (and the first decoded token lands); the “decode rate” reports how fast subsequent tokens stream in. Run this against a short prompt and a 10K-token prompt back-to-back — you’ll see TTFT scale roughly with prompt length while decode rate stays roughly constant. That’s the prefill/decode split made visible.
“Inference and training are the same thing.” — They’re not. Training updates billions of weights using gradient descent across millions of examples; inference reads those frozen weights to predict the next token. You only ever do inference when calling the API; training happened at Anthropic before the model shipped.
“Streaming is faster than non-streaming.” — Same total time. Streaming just shows tokens as they decode rather than buffering them. Use streaming for UX (perceived latency); use non-streaming when you need the full response before doing anything (parsing JSON, tool dispatch).
“Big prompts are slow because Claude has to read them all.” — Half-right. Prefill processes the prompt in parallel, but the work scales roughly with N² due to attention. Long prompts hurt TTFT, not decode rate. Prompt caching (M22) collapses that cost on repeat-heavy prompts.
“temperature=0 means deterministic.” — Mostly true, with caveats. At 0 the sampling step becomes argmax (no randomness in the model). But tie-breaking, server-side batching, and floating-point non-determinism on GPUs can still produce different outputs across runs. For strict reproducibility, also pin the model snapshot and seed if the API supports it.
Temperature, Top-p & Top-k
BEFORE: Without sampling controls, a language model would always pick the single highest-probability next word — like a restaurant that only ever serves the most popular dish, regardless of what you're in the mood for. Every sentence would sound the same, mechanical and repetitive.
PAIN: That's terrible for creative tasks (bland writing), but also bad for technical tasks where multiple phrasings are equally correct — the model would get stuck in ruts, always producing identical outputs.
MAPPING: Temperature is a creativity dial. At 0, Claude always picks the safest, most predictable next word — like a cautious writer sticking to cliches. At 1.0, Claude is willing to take risks and surprise you, choosing less-probable but more interesting words. Top-p and top-k are like narrowing the menu of options Claude considers before making a choice — top-p says "only consider words that make up the top 90% of the probability," and top-k says "only consider the top 50 most likely words."
What this actually looks like: Here's the same set of next-token probabilities at three different temperature settings. Watch how the distribution shifts:
Prompt: "The best way to learn programming is"
Temperature 0.0 (greedy — always pick the top word):
" to" → 99.8% ← always chosen
" by" → 0.1%
" through" → 0.1%
Temperature 0.5 (moderate — top words dominate but others have a chance):
" to" → 58%
" by" → 22%
" through" → 12%
" with" → 5%
" from" → 3%
Temperature 1.0 (creative — spread across many options):
" to" → 30%
" by" → 22%
" through" → 15%
" with" → 10%
" from" → 6%
" when" → 4%
... more words now have a real chance
Step 1 (Temperature): The model produces logits: raw, unnormalized prediction scores for every possible next token. Think of them as "confidence points" (higher logit = higher probability). Temperature, a number from 0 to 1, divides all logits by the temperature value before they're converted to probabilities via softmax, a function that turns a list of numbers into percentages that add up to 100% (the bigger the input number, the bigger its share of the probability). Lower temperature is more deterministic; higher is more random/creative. A low temperature like 0.1 makes the top word's probability dominate (say, 95%). A high temperature like 1.0 keeps the distribution spread out (maybe 30%, 20%, 15%...).
Step 2 (Top-p): Top-p (also called nucleus sampling) then trims the menu. Instead of considering all possible next tokens, it sorts them by probability and keeps only the smallest set whose combined probability reaches p. For example, 0.9 means "keep the top 90% of probability mass, discard the rest."
Step 3 (Top-k): Top-k is a simpler filter: it limits consideration to the k most likely tokens regardless of their probabilities. For example, top-k=50 means Claude only picks from its top 50 predictions, ignoring all others.
In practice, temperature is the one you'll adjust most often. Top-p and top-k are fine-tuning knobs for when you need precise control.
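In the Messages API these controls are passed as the `temperature`, `top_p`, and `top_k` request parameters. The sketch below shows one way to wire them in; the preset names and values are hypothetical starting points, not recommendations from Anthropic, and `ask` is an illustrative helper. It is commonly advised to adjust temperature or top_p rather than both, so the presets avoid combining them.

```python
def sampling_params(task):
    """Return hypothetical starting-point sampling settings for a kind of task."""
    presets = {
        "extraction": {"temperature": 0.0},              # as repeatable as possible
        "general": {"temperature": 0.5},
        "brainstorm": {"temperature": 1.0, "top_k": 50}, # varied, minus the long tail
    }
    return presets[task]

def ask(prompt, task="general"):
    """Send one prompt using the preset for the given task (needs an API key)."""
    import anthropic  # imported here so the presets can be inspected without the SDK

    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
        **sampling_params(task),
    )
    return message.content[0].text

print(sampling_params("brainstorm"))
```

With a key set, `ask("Suggest a name for a coffee shop.", task="brainstorm")` should return a different suggestion on most runs, while the "extraction" preset should stay nearly constant.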
The "Calculator vs. Thinker" Mental Model
BEFORE: Before LLMs, most software was deterministic — a calculator gives you the same answer every time: 2 + 2 = 4, always. Developers built their entire workflow around this certainty: write code, run tests, expect exact outputs.
PAIN: When teams first adopt LLMs, they instinctively treat the model like a calculator. They write a prompt, get a great answer, ship it — then are shocked when the same prompt gives a subtly different (or wrong) answer the next day. Bug reports pile up, tests fail intermittently, and trust erodes.
MAPPING: The fix is a mental model shift: Claude is a thinker, not a calculator. A thinker gives you their best answer, which can vary, can be wrong, and can surprise you with insight. Treat it like a very knowledgeable colleague who sometimes needs to be double-checked, not a database that returns exact facts. Once you internalize this, you'll naturally build verification steps, add guardrails, and design your agent for graceful handling of imperfect outputs.
What this actually looks like: Here's a real-world example of why the "thinker" model matters. Same prompt, two runs, both at temperature 0.0:
Prompt: "What is the population of Tokyo?"
Run 1: "Tokyo has a population of approximately 13.96 million
people in the city proper as of 2023."
Run 2: "The population of Tokyo is about 14 million in the
city proper, or roughly 37 million in the greater
metropolitan area."
Both are reasonable. Neither is "wrong." But a calculator
would give you the exact same answer every time.
An agent that routes decisions based on this output
needs to handle BOTH variations gracefully.
This mental model is the single most important idea in this course. In plain English: a calculator always gives you 2 + 2 = 4. A thinker gives you their best reasoning, which is usually excellent but occasionally off. When you build an agent, you're building around a thinker — so you design for "usually right" rather than "always right."
How does this work internally? When Claude generates a response, it's making thousands of probabilistic choices (one per token). Each choice has some chance of going a different direction. Even at temperature 0.0, server-side implementation details like floating-point rounding and batch scheduling can cause tiny variations. The result is that outputs are highly consistent but not identical — and when the model is uncertain (borderline cases, ambiguous questions, math), those small variations can compound into meaningfully different answers.
How is this different from traditional software? In conventional programming, if a function returns the wrong result, it's a bug — you fix the code and it works. With LLMs, variation isn't a bug; it's a fundamental property of how the system works. This means your job as an agent builder shifts from "make it correct" to "make it reliably useful despite occasional imperfection." That's a completely different engineering discipline, and it's what the rest of this course teaches.
Here's how this mental model affects every decision you'll make:
- Prompts (M03): You're giving instructions to a thinker, not writing code for a machine
- Tool Use (M05): You give the thinker tools to compensate for what prediction can't do (real-time data, calculations, database lookups)
- Guardrails (M16–M17): You build checks because thinkers can make mistakes
- Evaluation (M18): You measure quality probabilistically, not as pass/fail
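Designing for "usually right" often means parsing the answer tolerantly instead of comparing strings. As a minimal sketch, the check below accepts both Tokyo answers shown earlier. The helper names, the expected value, and the tolerance are assumptions chosen for this example.

```python
import re

def extract_millions(text):
    """Return every '<number> million' figure found in the text, as floats."""
    return [float(n) for n in re.findall(r"(\d+(?:\.\d+)?)\s*million", text)]

def population_ok(text, expected=14.0, tolerance=0.5):
    """Accept the answer if any figure is within tolerance of the expected value."""
    return any(abs(v - expected) <= tolerance for v in extract_millions(text))

run_1 = ("Tokyo has a population of approximately 13.96 million "
         "people in the city proper as of 2023.")
run_2 = ("The population of Tokyo is about 14 million in the city proper, "
         "or roughly 37 million in the greater metropolitan area.")

for label, text in [("Run 1", run_1), ("Run 2", run_2)]:
    print(label, extract_millions(text), population_ok(text))
```

An exact string comparison would treat the two runs as different results; the tolerant check treats them as the same answer, which is what a downstream agent usually wants.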
"LLMs are basically a smarter search engine / database, right?" — No. A database stores facts and retrieves them exactly. Claude doesn't store or retrieve anything — it generates new text by predicting tokens. When it gives you a correct fact, it's because its training patterns lead to that prediction, not because it "looked it up." This is why it can produce plausible-sounding facts that are completely wrong — there's no lookup step to fail; it just predicts what sounds right.
"If I use temperature 0, the output is deterministic." — Almost, but not quite. Temperature 0 makes Claude pick the highest-probability token each time, which is highly consistent. But in practice, minor server-side differences (floating-point math, batching) can occasionally produce slightly different outputs. Design your agents for "extremely consistent," not "bit-for-bit identical every time."
"Claude understands what I'm saying." — It's more accurate to say Claude is extremely good at pattern matching over language. It processes the statistical relationships between tokens in ways that produce remarkably useful results, but it has no internal model of truth, no beliefs, and no comprehension in the human sense. This matters because it means Claude can confidently produce incorrect information — it doesn't "know" it's wrong.
"More parameters = more accurate." — Bigger models are generally more capable, but "capable" and "accurate" are different things. A larger model can handle more complex reasoning and nuanced prompts, but it can still hallucinate, and it may do so more convincingly. Size doesn't eliminate the need for verification and guardrails.
"If Claude gets something wrong, I should just ask again." — Retrying the same prompt is a lottery, not a strategy. If the model's training patterns lead it toward a wrong answer, it will likely give the same wrong answer again. The fix is to change the approach: rephrase the prompt, provide examples, add context, or use a tool to fetch the correct information. You'll learn all these techniques in Modules 3 through 5.
Code Walkthrough: Your First Claude API Call
Let's make your first call to the Claude API. You'll need an API key from console.anthropic.com. An API key is a secret string that authenticates your application with Anthropic's servers. Never share it and never commit it to code; always store it as an environment variable.
Setup: API Key as Environment Variable
# Set your API key as an environment variable
export ANTHROPIC_API_KEY="your-key-here"
Making the Call
Let's start with the simplest possible Claude API call. The code below creates an Anthropic client, sends a single message, and prints the response. This three-step pattern — create client, call messages.create, read the content — is the foundation for every API call you'll make in this course. Once you internalize this structure, adding tools, streaming, and multi-turn conversations in later modules will feel like natural extensions.
Here's the one thing that trips up almost everyone on their first try: the response text is message.content[0].text, not message.text. Why the extra [0]? Because content is a list of content blocks: a single response can contain several blocks, such as text and tool calls. Even for a simple text reply, you need [0] to grab the first block. Forget this and you'll get an AttributeError (from message.text) or a list of block objects (from message.content) instead of a string.
# pip install "anthropic>=0.30.0"
import anthropic
# The client reads ANTHROPIC_API_KEY from your environment
client = anthropic.Anthropic()
try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant who explains things clearly.",
        messages=[
            {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
        ]
    )
    # The response content is a list of content blocks
    print(message.content[0].text)
    print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.AuthenticationError:
    print("Invalid API key. Check your ANTHROPIC_API_KEY environment variable.")
except anthropic.APIError as e:
    print(f"API error: {e.message}")
// npm install @anthropic-ai/sdk@^0.30.0
import Anthropic from '@anthropic-ai/sdk';
// The client reads ANTHROPIC_API_KEY from your environment
const client = new Anthropic();
try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant who explains things clearly.',
    messages: [
      { role: 'user', content: 'What is a large language model? Explain in 2 sentences.' }
    ]
  });
  // The response content is a list of content blocks
  console.log(message.content[0].text);
  console.log(`\nTokens used: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}
curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a helpful assistant who explains things clearly.",
    "messages": [
      {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
    ]
  }'
The usage object tells you exactly how many tokens were consumed. This will matter for cost tracking (Module 2) and context window management (Module 4).
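Because content is a list of blocks, a small helper that joins every text block is often safer than indexing [0]. The sketch below is one way to write it; `extract_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for an SDK response (with made-up values) so the example runs without an API key.

```python
from types import SimpleNamespace

def extract_text(message):
    """Join the text of every text block, skipping other block types."""
    return "".join(block.text for block in message.content if block.type == "text")

# Stand-in for an SDK response: two text blocks around a tool call.
fake_message = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="Let me check the weather. "),
    SimpleNamespace(type="tool_use", name="get_weather", input={"city": "Paris"}),
    SimpleNamespace(type="text", text="One moment."),
])

print(extract_text(fake_message))
```

For a plain text reply this returns the same string as message.content[0].text, and it keeps working once tool calls appear in later modules.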
Experimenting with Temperature
The interesting part of this next example is that it sends the exact same prompt three times, each with a different temperature value (0.0, 0.5, and 1.0). The results will be noticeably different — that's the whole point. Seeing identical input produce varied output is the fastest way to feel that LLMs are stochastic, not deterministic. This is the kind of experiment worth running yourself, because reading about probability distributions is one thing — watching Claude give you three different answers to the same question makes it click.
import anthropic
client = anthropic.Anthropic()
prompt = "Write a one-sentence description of the moon."
# Same prompt, three temperatures: compare how much the wording changes.
for temp in [0.0, 0.5, 1.0]:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=100,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"Temperature {temp}: {message.content[0].text}")
    except anthropic.APIError as e:
        print(f"Error at temperature {temp}: {e.message}")
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
const prompt = 'Write a one-sentence description of the moon.';
for (const temp of [0.0, 0.5, 1.0]) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 100,
      temperature: temp,
      messages: [{ role: 'user', content: prompt }]
    });
    console.log(`Temperature ${temp}: ${message.content[0].text}`);
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.error(`Error at temp ${temp}: ${error.message}`);
    } else {
      throw error;
    }
  }
}
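Once you have collected several outputs for the same prompt, it helps to quantify how much they differ instead of eyeballing them. This is a rough sketch using only the standard library; the `consistency` helper and the sample strings are invented for illustration and are not real model outputs.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency(outputs):
    """Return (number of distinct outputs, mean pairwise similarity in [0, 1])."""
    distinct = len(set(outputs))
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return distinct, 1.0
    mean_sim = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    return distinct, mean_sim

# Made-up sample outputs standing in for three runs of the same prompt.
samples = [
    "The moon is Earth's only natural satellite.",
    "The moon is Earth's only natural satellite.",
    "A pale, cratered world, the moon circles Earth every 27 days.",
]
distinct, similarity = consistency(samples)
print(f"{distinct} distinct outputs, mean similarity {similarity:.2f}")
```

Running the temperature loop several times per setting and feeding each batch into a helper like this should show the similarity score falling as temperature rises.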
Hands-On Exercise: Hello Claude
What You'll Build
A series of small scripts that call Claude's API, experiment with temperature, and culminate in a working CLI chatbot. By the end you'll have made your first API call, observed how temperature changes output, and built a multi-turn conversation loop.
Time estimate: 20–30 minutes • Prerequisites: Python 3.9+ or Node.js 18+ • An Anthropic API key from console.anthropic.com
Environment Setup
Copy and paste this entire block into your terminal to create a project folder and install the SDK:
mkdir hello-claude && cd hello-claude
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
# venv\Scripts\activate
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY="your-key-here" # Windows: set ANTHROPIC_API_KEY=your-key-here
mkdir hello-claude && cd hello-claude
npm init -y
npm install @anthropic-ai/sdk@^0.30.0
export ANTHROPIC_API_KEY="your-key-here" # Windows: set ANTHROPIC_API_KEY=your-key-here
Step 1: Make Your First API Call
This step verifies that your API key works and that you can communicate with Claude. It's the "hello world" of agent development — if this works, everything else in the course will build on it.
Create a new file called hello.py (or hello.mjs for Node.js):
import anthropic
client = anthropic.Anthropic()
try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant who explains things clearly.",
        messages=[
            {"role": "user", "content": "What is a large language model? Explain in 2 sentences."}
        ]
    )
    print(message.content[0].text)
    print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.AuthenticationError:
    print("Invalid API key. Check your ANTHROPIC_API_KEY environment variable.")
except anthropic.APIError as e:
    print(f"API error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: 'You are a helpful assistant who explains things clearly.',
    messages: [
      { role: 'user', content: 'What is a large language model? Explain in 2 sentences.' }
    ]
  });
  console.log(message.content[0].text);
  console.log(`\nTokens used: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}
Run it: python hello.py (or node hello.mjs)
Troubleshooting
- AuthenticationError: Your API key is missing or invalid. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to check it's set.
- ModuleNotFoundError (No module named 'anthropic'): You're not in the virtual environment. Run source venv/bin/activate first, then pip install anthropic.
- Connection error / timeout: Check your internet connection. The API endpoint is api.anthropic.com; make sure it's not blocked by a firewall or VPN.
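The first two checks above can be automated before you spend time debugging. Here is a small preflight sketch using only the standard library; `preflight` is a hypothetical helper name, and it only detects a missing key or package, not an invalid key.

```python
import importlib.util
import os

def preflight(env=os.environ):
    """Return a list of setup problems; an empty list means the basics look fine."""
    problems = []
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    if importlib.util.find_spec("anthropic") is None:
        problems.append("the 'anthropic' package is not installed in this environment")
    return problems

issues = preflight()
if issues:
    for issue in issues:
        print("Problem:", issue)
else:
    print("Environment looks ready.")
```

Save it as a separate file (for example check_env.py, a name chosen here for illustration) and run it inside the virtual environment before hello.py.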
Step 2: Experiment with System Prompts
System prompts shape Claude's personality and behavior. This step shows you how much control a single string gives you over the output. You'll use the same user message but swap the system prompt to see wildly different responses.
Create a new file called system_prompts.py (or system_prompts.mjs):
# Python
import anthropic

client = anthropic.Anthropic()

# Same user message each time; only the system prompt changes.
prompts = [
    "You are a pirate. Respond in pirate speak.",
    "You are a formal academic. Use precise, scholarly language.",
    "Respond only in haiku format (5-7-5 syllables).",
]

for system_prompt in prompts:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            system=system_prompt,
            messages=[{"role": "user", "content": "What is the moon?"}]
        )
        print(f"System: {system_prompt}")
        print(f"Response: {message.content[0].text}\n")
    except anthropic.APIError as e:
        print(f"Error: {e.message}")

// Node.js
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Same user message each time; only the system prompt changes.
const prompts = [
  'You are a pirate. Respond in pirate speak.',
  'You are a formal academic. Use precise, scholarly language.',
  'Respond only in haiku format (5-7-5 syllables).',
];

for (const systemPrompt of prompts) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 200,
      system: systemPrompt,
      messages: [{ role: 'user', content: 'What is the moon?' }]
    });
    console.log(`System: ${systemPrompt}`);
    console.log(`Response: ${message.content[0].text}\n`);
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}
Run it: python system_prompts.py (or node system_prompts.mjs)
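To see that the system prompt really is the only thing changing between runs, it can help to look at the request as plain data. The sketch below runs offline and makes no API call; build_request is a hypothetical helper that assembles the same keyword arguments Step 2 passes to client.messages.create().

```python
def build_request(system_prompt, user_text, model="claude-sonnet-4-6", max_tokens=200):
    """Assemble keyword arguments in the shape Step 2 passes to client.messages.create()."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_prompt,
        "messages": [{"role": "user", "content": user_text}],
    }


pirate = build_request("You are a pirate. Respond in pirate speak.", "What is the moon?")
academic = build_request("You are a formal academic. Use precise, scholarly language.", "What is the moon?")

# Compare the two requests key by key: only the system string differs.
changed = [key for key in pirate if pirate[key] != academic[key]]
print(changed)  # ['system']
```

With a real client, a dictionary like this could be sent with client.messages.create(**pirate).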
Step 3: Temperature Experiment
This step makes the "thinker, not calculator" concept visceral. You'll run the exact same prompt multiple times at different temperatures and compare how consistent the outputs are. This is the experiment that makes non-determinism click.
Create a new file called temperature.py (or temperature.mjs):
# Python
import anthropic

client = anthropic.Anthropic()
prompt = "Write a one-sentence description of the moon."

# Run the same prompt three times at each temperature and compare.
for temp in [0.0, 1.0]:
    print(f"\n--- Temperature {temp} ---")
    for i in range(3):
        try:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=100,
                temperature=temp,
                messages=[{"role": "user", "content": prompt}]
            )
            print(f"  Run {i+1}: {message.content[0].text}")
        except anthropic.APIError as e:
            print(f"  Error: {e.message}")

// Node.js
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const prompt = 'Write a one-sentence description of the moon.';

// Run the same prompt three times at each temperature and compare.
for (const temp of [0.0, 1.0]) {
  console.log(`\n--- Temperature ${temp} ---`);
  for (let i = 0; i < 3; i++) {
    try {
      const message = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 100,
        temperature: temp,
        messages: [{ role: 'user', content: prompt }]
      });
      console.log(`  Run ${i + 1}: ${message.content[0].text}`);
    } catch (error) {
      console.error(`  Error: ${error.message}`);
    }
  }
}
Run it: python temperature.py (or node temperature.mjs)
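If you want intuition for why the low-temperature runs look so similar, a toy model of the sampling step may help. The sketch below is an illustration under stated assumptions, not Claude's actual internals: the scores are made up, and it uses the textbook approach of dividing scores by the temperature before applying softmax, with temperature 0 treated as "always pick the top score". The real API may still show small variations at temperature 0.

```python
import math


def next_token_probs(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature.

    Toy model: divide each score by the temperature, then apply softmax.
    Temperature 0 is treated as greedy selection of the highest score.
    """
    if temperature == 0:
        best = max(logits, key=logits.get)
        return {token: (1.0 if token == best else 0.0) for token in logits}
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical scores for the token that follows "The moon is"
logits = {"bright": 2.0, "Earth's": 1.5, "cheese": 0.1}

for temp in [0.0, 0.5, 1.0]:
    probs = next_token_probs(logits, temp)
    summary = ", ".join(f"{token}={p:.2f}" for token, p in probs.items())
    print(f"temperature {temp}: {summary}")
```

As the temperature rises, probability mass spreads from the top-scoring token to the alternatives, which is why repeated runs start to diverge.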
Step 4: Observe Token Usage
Token counts drive API costs and context limits — two concepts you'll use throughout the entire course. This step trains you to notice how different prompts and settings affect token consumption, so it becomes instinctive. This uses the message.usage object from Step 1.
Create a new file called token_usage.py (or token_usage.mjs):
# Python
import anthropic

client = anthropic.Anthropic()

# Each test is (label, prompt, max_tokens).
tests = [
    ("Short prompt", "Hi!", 50),
    ("Medium prompt", "Explain what a large language model is in detail.", 200),
    ("Long prompt with constraint", "Write a 3-paragraph essay about the history of computing.", 1024),
]

for label, prompt, max_tok in tests:
    try:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tok,
            messages=[{"role": "user", "content": prompt}]
        )
        u = message.usage
        print(f"{label}:")
        print(f"  Input tokens: {u.input_tokens}")
        print(f"  Output tokens: {u.output_tokens}")
        print(f"  Total tokens: {u.input_tokens + u.output_tokens}\n")
    except anthropic.APIError as e:
        print(f"Error: {e.message}")

// Node.js
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Each test is [label, prompt, max_tokens].
const tests = [
  ['Short prompt', 'Hi!', 50],
  ['Medium prompt', 'Explain what a large language model is in detail.', 200],
  ['Long prompt with constraint', 'Write a 3-paragraph essay about the history of computing.', 1024],
];

for (const [label, prompt, maxTok] of tests) {
  try {
    const message = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: maxTok,
      messages: [{ role: 'user', content: prompt }]
    });
    const u = message.usage;
    console.log(`${label}:`);
    console.log(`  Input tokens: ${u.input_tokens}`);
    console.log(`  Output tokens: ${u.output_tokens}`);
    console.log(`  Total tokens: ${u.input_tokens + u.output_tokens}\n`);
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}
Run it: python token_usage.py (or node token_usage.mjs)
The max_tokens parameter is a ceiling, not a target: Claude often uses fewer tokens than the limit allows. You'll explore tokens in depth in M02.
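Since token counts drive cost, the usage numbers printed above can be turned into a rough dollar estimate. The sketch below assumes input and output tokens are priced separately per million tokens. The function name estimate_cost and the prices used are placeholders for illustration, not real rates, so substitute the values from the current pricing page for your model.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Estimate the cost of one call in dollars from its token counts.

    Prices are expressed per million tokens. The values passed in below are
    placeholders, not real prices.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Hypothetical prices: $1 per million input tokens, $5 per million output tokens.
cost = estimate_cost(
    input_tokens=20,
    output_tokens=180,
    input_price_per_mtok=1.0,
    output_price_per_mtok=5.0,
)
print(f"Estimated cost: ${cost:.6f}")
```

In the Step 4 script, the two token arguments would come from message.usage.input_tokens and message.usage.output_tokens.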
Step 5 (Stretch Goal): Build a CLI Chat
The key piece in this step is the conversation list: you accumulate messages so Claude can see the full history, just as each output token depends on all previous tokens.
Now let's build something you can actually interact with: a terminal chatbot that remembers your conversation. This matters because multi-turn conversation is the foundation of every agent — agents don't just answer one question, they maintain context across a sequence of actions. The pattern you'll see here (accumulate messages in a list, send the full list on each call, append the response) is the exact same loop that powers production agent frameworks.
Here's the failure case to watch for: what happens when an API call fails mid-conversation? Notice the conversation.pop() in the error handler. If you skip it, a user message is left in the list with no assistant response after it, so your next input produces two user messages in a row. That breaks the alternating user/assistant turn structure the Messages API is built around: depending on how the API treats consecutive same-role messages, the request is either rejected or the two messages are merged into a single turn. That one line of cleanup prevents a cascade of confusing errors.
# Python
import anthropic

client = anthropic.Anthropic()
conversation = []  # full message history, resent on every call

print("Chat with Claude! (type 'quit' to exit)\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ("quit", "exit"):
        break
    if not user_input:
        continue

    conversation.append({"role": "user", "content": user_input})

    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system="You are a friendly, helpful assistant.",
            messages=conversation
        )
        assistant_msg = response.content[0].text
        conversation.append({"role": "assistant", "content": assistant_msg})
        print(f"\nClaude: {assistant_msg}\n")
    except anthropic.APIError as e:
        print(f"\nError: {e.message}\n")
        # Remove the failed user message so conversation stays consistent
        conversation.pop()

// Node.js
import Anthropic from '@anthropic-ai/sdk';
import * as readline from 'readline';

const client = new Anthropic();
const conversation = []; // full message history, resent on every call

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

console.log("Chat with Claude! (type 'quit' to exit)\n");

function ask() {
  rl.question('You: ', async (userInput) => {
    userInput = userInput.trim();
    if (['quit', 'exit'].includes(userInput.toLowerCase())) {
      rl.close();
      return;
    }
    if (!userInput) { ask(); return; }

    conversation.push({ role: 'user', content: userInput });

    try {
      const response = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 512,
        system: 'You are a friendly, helpful assistant.',
        messages: conversation
      });
      const assistantMsg = response.content[0].text;
      conversation.push({ role: 'assistant', content: assistantMsg });
      console.log(`\nClaude: ${assistantMsg}\n`);
    } catch (error) {
      console.error(`\nError: ${error.message}\n`);
      // Remove the failed user message so conversation stays consistent
      conversation.pop();
    }
    ask();
  });
}

ask();
Each user message is appended to the conversation array, sent to Claude along with the full history, and the response is appended back. Claude sees every previous exchange on each call, and that is how it "remembers" the conversation. This pattern (accumulate messages, send all, append response) is the exact loop that powers the agent frameworks you'll see in Modules 7–9.
Type quit to exit.
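The accumulate, send, append loop and its rollback on failure can be exercised without touching the network. The sketch below is a simplified offline model of the chat script: fake_send and FakeAPIError are hypothetical stand-ins for client.messages.create() and anthropic.APIError, and take_turn is an illustrative name for one pass through the loop.

```python
class FakeAPIError(Exception):
    """Stand-in for anthropic.APIError so the sketch runs offline."""


def fake_send(conversation, fail=False):
    """Hypothetical replacement for client.messages.create(): echoes the last user message."""
    if fail:
        raise FakeAPIError("simulated outage")
    return f"You said: {conversation[-1]['content']}"


def take_turn(conversation, user_input, fail=False):
    """Run one accumulate -> send all -> append cycle, rolling back on failure."""
    conversation.append({"role": "user", "content": user_input})
    try:
        reply = fake_send(conversation, fail=fail)
    except FakeAPIError:
        conversation.pop()  # drop the unanswered user message
        return None
    conversation.append({"role": "assistant", "content": reply})
    return reply


conversation = []
take_turn(conversation, "Hello")
take_turn(conversation, "Are you there?", fail=True)  # fails, list is rolled back
take_turn(conversation, "Still here?")

print([m["role"] for m in conversation])
# ['user', 'assistant', 'user', 'assistant']
```

Because the failed turn is popped, the roles still alternate after the simulated outage. Removing the pop() call would leave two user messages back to back.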
Troubleshooting
- TypeError: Cannot read properties of undefined (Node.js): Make sure your file has the .mjs extension or your package.json contains "type": "module" for ES module imports.
- Claude doesn't remember previous messages: Check that you're appending both the user message and the assistant response to the conversation list. Both must be present.
- Rate limit errors after many messages: Add a short delay between calls or reduce your conversation length. Long conversations also consume more tokens per call.
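One simple way to reduce conversation length is to keep only the most recent messages. The sketch below is an assumption about how you might do that, not something the SDK provides: trim_history is a hypothetical helper that keeps the last few messages and makes sure the trimmed list still starts with a user turn. Note the trade-off: anything trimmed is forgotten, since Claude only sees what you send.

```python
def trim_history(conversation, max_messages):
    """Keep at most `max_messages` recent messages, starting on a user turn.

    Slicing from the end can leave an assistant message first, so any
    leading assistant messages are dropped as well.
    """
    if max_messages <= 0:
        return []
    trimmed = conversation[-max_messages:]
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


# Three complete exchanges: u1/a1, u2/a2, u3/a3
history = []
for n in range(1, 4):
    history.append({"role": "user", "content": f"u{n}"})
    history.append({"role": "assistant", "content": f"a{n}"})

recent = trim_history(history, 3)
print([m["content"] for m in recent])
# ['u3', 'a3']
```

Asking for three messages returns only two here, because the slice would otherwise begin with an assistant reply whose question had been cut off.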
Verify Everything Works
Run all four scripts in sequence to confirm your setup is complete:
# Python
python hello.py && python system_prompts.py && python temperature.py && python token_usage.py
# Node.js
node hello.mjs && node system_prompts.mjs && node temperature.mjs && node token_usage.mjs
Knowledge Check
Test your understanding of the concepts from this module. Select the best answer for each question.
Q1: What does a Large Language Model fundamentally do?
Q2: What does the temperature parameter control?
Q3: Why is Claude described as a "thinker" rather than a "calculator"?
Q4: How does Claude process input differently from how it generates output?
Q5: Fill in the blank to complete a valid Claude API call:
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(message.______[0].text)
- response
- text
- content
- choices
Answer: content. The response carries a content array. Each element is a content block with a text property, so it's message.content[0].text.
Q6: What is the recommended temperature setting for an agent that calls tools and makes decisions?
Module Summary
Key Takeaways
- LLMs are next-token predictors: they generate text by repeatedly computing a probability for every possible next token and picking one.
- Input = parallel, Output = sequential — Claude reads everything at once but writes one token at a time.
- Temperature controls randomness — low for agents (reliability), high for creative tasks (variety).
- Thinker, not calculator — outputs are probabilistic and need verification. This mental model guides every agent design decision.
- The API is simple — create a client, send messages, get a response with content blocks and usage stats.
Next Module Preview: M02 — Tokens
Now that you know Claude predicts tokens, the natural question is: what exactly is a token? In Module 2, you'll build an interactive tokenizer, understand how tokens affect cost and context limits, and create a token budget calculator — a tool you'll use throughout the entire course.