CC1: API Essentials for CLI Power Users
Everything Claude Code hides from you — models, the Messages API, system prompts, temperature, streaming, structured output — and why writing skills, subagents, and hooks gets a lot easier once you've seen these primitives in plain Python and TypeScript.
Learning Objectives
- Pick the right Claude model (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) for a given job.
- Make a Messages API request from Python and TypeScript with proper error handling.
- Hold a multi-turn conversation by structuring the messages array correctly.
- Distinguish system prompts from user messages, and recognize how CLAUDE.md becomes one.
- Tune temperature for deterministic vs creative output.
- Stream responses for low-latency UX (and know when streaming hurts you).
- Coerce JSON output reliably — via prompting, prefilling, and tool use.
Why CLI Users Need the API
Driving a car without ever opening the hood works fine until something goes wrong. The day the engine knocks, you don't need to be a mechanic, but knowing what an engine, a spark plug, and an oil pump are turns a panic call into a 30-second diagnosis: "smells like a misfire."
The pain without that mental model: you treat every symptom as random and only the dealership can help. Every conversation starts at zero context.
Claude Code is the same. You can ship for months using only the CLI. But the moment you write a skill that hangs, a subagent that won't follow instructions, or a hook whose JSON looks wrong — you need to know what a model, a system prompt, and a messages array are. This module is the engine bay.
Three places where API knowledge becomes load-bearing for CLI work:
- Skills ship a system-prompt fragment that gets injected when the skill is invoked. Knowing what a system prompt does — vs a user message — is the difference between a skill that steers Claude and one that's politely ignored.
- Subagents run with their own model selection and their own system prompt. Picking Haiku for a fast linter subagent vs Opus for a deep reviewer is a cost decision that compounds across hundreds of invocations.
- Hooks sometimes need to call Claude themselves (e.g. an HTTP hook that summarizes a long stack trace before deciding to block). That's a raw Messages API call — no CLI in sight.
Every Claude Code feature you'll use in CC3–CC14 is a thin wrapper over the Messages API. If you understand one API call deeply, the rest of the course feels like configuration, not magic.
The Claude Model Lineup
Anthropic's current generation is Claude 4.x, which comes in three sizes. Picking the right one for a given task is the single highest-leverage cost decision in any Claude-powered system.
| Model | ID | When to pick it | Relative speed |
|---|---|---|---|
| Opus 4.7 | claude-opus-4-7 | Hard reasoning, multi-file refactors, deep code review, planning agents. | Slowest |
| Sonnet 4.6 | claude-sonnet-4-6 | Default. Production code-gen, RAG, tool use. Great quality / speed balance. | Medium |
| Haiku 4.5 | claude-haiku-4-5-20251001 | Classification, routing, lightweight tool calls, latency-sensitive subagents. | Fastest |
A model ID is the string you pass in the model field of a Messages API request. Anthropic ships dated and undated aliases. claude-sonnet-4-6 is an undated alias that always points to the latest Sonnet 4.6 snapshot; claude-haiku-4-5-20251001 is a pinned version that won't move under you. Pin in production; use undated for prototyping.
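One way to enforce the pin-in-production habit is a small check at startup. The sketch below assumes, based only on the two IDs shown above, that a pinned ID ends in an eight-digit date suffix; `is_pinned` and `require_pinned` are hypothetical helper names, not part of any SDK.

```python
import re

# Assumption: a pinned snapshot ID ends in -YYYYMMDD, as in the Haiku ID above.
_DATE_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id: str) -> bool:
    """Return True when the model ID carries a date suffix."""
    return bool(_DATE_SUFFIX.search(model_id))


def require_pinned(model_id: str, env: str) -> str:
    """Reject undated aliases in production; allow anything elsewhere."""
    if env == "production" and not is_pinned(model_id):
        raise ValueError(f"{model_id!r} is an undated alias; pin a dated snapshot")
    return model_id


print(is_pinned("claude-haiku-4-5-20251001"))  # True
print(is_pinned("claude-sonnet-4-6"))          # False
```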
The 3-question decision tree
- Is this a deep-reasoning task (architecture, multi-file refactor, hard debugging)? → Opus 4.7
- Is this latency-critical or volume-heavy (a hook running on every save, a router subagent)? → Haiku 4.5
- Otherwise → Sonnet 4.6 (this is 80% of the time)
Opus is roughly 5× the price of Sonnet, which is roughly 3× the price of Haiku. A subagent that runs 1000 times a day on Opus costs ~15× what it would on Haiku — if Haiku can do the job, that's a meaningful savings without reading ten cost-optimization blog posts.
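The decision tree and the price ratios can be written down directly. This is a sketch: `pick_model` and `relative_daily_cost` are hypothetical helpers, and the multipliers are the rough ratios quoted above, not published prices.

```python
def pick_model(deep_reasoning: bool, latency_or_volume_critical: bool) -> str:
    """Apply the three questions in order; the first 'yes' wins."""
    if deep_reasoning:
        return "claude-opus-4-7"
    if latency_or_volume_critical:
        return "claude-haiku-4-5-20251001"
    return "claude-sonnet-4-6"


# Rough relative prices from the paragraph above, with Haiku as the unit.
RELATIVE_PRICE = {"haiku": 1, "sonnet": 3, "opus": 3 * 5}


def relative_daily_cost(tier: str, calls_per_day: int) -> int:
    """Cost in 'Haiku-call units' for a subagent that runs calls_per_day times."""
    return RELATIVE_PRICE[tier] * calls_per_day


print(pick_model(deep_reasoning=False, latency_or_volume_critical=True))
print(relative_daily_cost("opus", 1000) // relative_daily_cost("haiku", 1000))  # 15
```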
You learned that picking the model is a cost-vs-quality dial. For Claude Code, you'll mostly leave the CLI on its default (Opus 4.7 or Sonnet 4.6 depending on settings), but you'll explicitly set the model in subagent frontmatter (CC6) and skill metadata (CC5).
Your First Messages API Call
Before going further, get an API key from console.anthropic.com and put it in your environment:
export ANTHROPIC_API_KEY="sk-ant-..." # never commit this
Now the smallest possible request — a single user turn, asking for the fix to a typo:
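# Python
import os
from anthropic import Anthropic, APIConnectionError, APIStatusError

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

try:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Fix the typo: 'recieve' → ?"}
        ],
    )
    print(resp.content[0].text)
except APIStatusError as e:
    # The API answered with a non-2xx status (400, 401, 429, ...).
    print(f"API error {e.status_code}: {e.message}")
except APIConnectionError as e:
    # No HTTP response at all, so there is no status code to report.
    print(f"Connection error: {e}")
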
// TypeScript (ES module, so top-level await works)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

try {
  const resp = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    messages: [
      { role: "user", content: "Fix the typo: 'recieve' → ?" }
    ],
  });
  // content is a typed union; for plain text take the first block
  const block = resp.content[0];
  if (block.type === "text") console.log(block.text);
} catch (e) {
  if (e instanceof Anthropic.APIError) {
    console.error(`API error ${e.status}: ${e.message}`);
  } else {
    throw e; // not an API failure, so don't swallow it
  }
}
Anatomy of the request: three required fields, one optional
- model: which Claude. Always required.
- max_tokens: cap on the response. Required; there's no implicit default. Set this to your worst-case need; you only pay for actual output tokens.
- messages: the conversation, an array of {role, content} objects. role is "user" or "assistant". The first message must be "user".
- system (optional, but present in 99% of real apps): instructions about Claude's role. Top level, not part of messages. We'll see this below.
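To make the field list concrete, here is roughly what gets assembled before anything is sent. This is a sketch with no network call: `build_request` is a hypothetical helper, and the endpoint and header names (`x-api-key`, `anthropic-version`) are taken from Anthropic's public HTTP API, so confirm them against the current API reference before relying on them.

```python
import json

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(api_key, model, max_tokens, messages, system=None):
    """Return (url, headers, body) for a Messages API call without sending it."""
    if not messages or messages[0]["role"] != "user":
        raise ValueError("the first message must have role 'user'")
    if max_tokens < 1:
        raise ValueError("max_tokens is required and must be positive")

    payload = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if system is not None:
        payload["system"] = system  # top-level field, not an entry in messages

    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    return API_URL, headers, json.dumps(payload)


url, headers, body = build_request(
    api_key="sk-ant-placeholder",  # placeholder, not a real key
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Fix the typo: 'recieve' → ?"}],
)
print(body)
```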
Anatomy of the response
Claude returns a Message object. The interesting fields:
{
"id": "msg_01ABC...",
"model": "claude-sonnet-4-6",
"role": "assistant",
"content": [
{ "type": "text", "text": "receive" }
],
"stop_reason": "end_turn",
"usage": { "input_tokens": 18, "output_tokens": 3 }
}
- content is always an array of blocks, not a string. Most of the time it has one text block; with tool use you'll see tool_use blocks too (CC8).
- stop_reason tells you why Claude stopped: "end_turn" (finished naturally), "max_tokens" (you ran out of budget, so the response is truncated), "tool_use" (Claude wants to call a tool), "stop_sequence" (matched one of your stop sequences).
- usage is your billing record; log it if you're cost-sensitive.
Common mistake: treating resp.content as a string. It's an array. Use resp.content[0].text in Python, or the typed union pattern in TypeScript. You'll get a TypeError or AttributeError the first time and it'll feel like the SDK is broken. It's not.
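A small helper makes the block-array shape and the truncation check hard to forget. The sketch below works on a plain dict shaped like the JSON above (the SDK returns typed objects, so you would use attribute access there); `extract_text` is a hypothetical name.

```python
def extract_text(message: dict) -> str:
    """Join every text block and refuse to return silently truncated output."""
    if message["stop_reason"] == "max_tokens":
        raise RuntimeError("response was cut off; raise max_tokens and retry")
    # Skip non-text blocks such as tool_use; they carry no 'text' field.
    return "".join(b["text"] for b in message["content"] if b["type"] == "text")


sample = {
    "id": "msg_example",
    "role": "assistant",
    "content": [{"type": "text", "text": "receive"}],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 18, "output_tokens": 3},
}
print(extract_text(sample))  # receive
```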
Multi-Turn Conversations
Imagine writing a novel where every chapter is filed in a separate folder, and the writer has amnesia. To write Chapter 4, you hand them Chapters 1, 2, and 3 plus the new prompt. They read everything, write Chapter 4, hand it back, and forget the whole conversation.
The pain without this: every "remember when I said..." gets a blank stare. Continuity has to be reconstructed from outside.
The Messages API is exactly this. Claude is stateless — no server-side memory. A "multi-turn conversation" is you sending the entire prior conversation back on every request. The messages array is the chapter folder.
To continue a conversation, you append both the previous assistant response and the new user message, then re-send the whole array:
# Python (continues from the client created above)
history = [
    {"role": "user", "content": "What's the capital of Australia?"},
]

# Turn 1
r1 = client.messages.create(model="claude-sonnet-4-6", max_tokens=128, messages=history)
assistant_text = r1.content[0].text
history.append({"role": "assistant", "content": assistant_text})

# Turn 2: the user replies, so append and re-send everything
history.append({"role": "user", "content": "And its population?"})
r2 = client.messages.create(model="claude-sonnet-4-6", max_tokens=128, messages=history)
print(r2.content[0].text)  # Claude knows we're still talking about Canberra

// TypeScript
type Msg = { role: "user" | "assistant"; content: string };

const history: Msg[] = [
  { role: "user", content: "What's the capital of Australia?" },
];

const r1 = await client.messages.create({
  model: "claude-sonnet-4-6", max_tokens: 128, messages: history,
});
const reply1 = r1.content[0].type === "text" ? r1.content[0].text : "";
history.push({ role: "assistant", content: reply1 });

history.push({ role: "user", content: "And its population?" });
const r2 = await client.messages.create({
  model: "claude-sonnet-4-6", max_tokens: 128, messages: history,
});
console.log(r2.content[0].type === "text" ? r2.content[0].text : "");
Two rules that bite people
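- Roles should alternate. Older API versions rejected two consecutive user messages with a 400 error; the current API reference says consecutive same-role turns are combined into a single turn. Don't lean on either behavior: if you need to inject extra context, merge it into one user message yourself.
- The whole history is re-billed every turn. A 50-turn conversation re-sends every prior turn as input on each new request. This is where prompt caching (CC10) saves you money: mark the stable prefix as cacheable.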
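Both rules are easy to see in code. The sketch below uses hypothetical helpers (`append_message`, `total_input_tokens`) and a flat 100-tokens-per-message figure that is purely illustrative; real messages vary in size.

```python
def append_message(history: list, role: str, content: str) -> None:
    """Append a turn, merging into the last one if the role repeats."""
    if history and history[-1]["role"] == role:
        history[-1]["content"] += "\n\n" + content
    else:
        history.append({"role": role, "content": content})


def total_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Input tokens billed across a conversation with no caching.

    Request k re-sends k user messages plus k - 1 assistant replies.
    """
    return sum((2 * k - 1) * tokens_per_message for k in range(1, turns + 1))


history = []
append_message(history, "user", "Here is the stack trace.")
append_message(history, "user", "It only fails on CI.")  # merged, not appended
print(len(history))  # 1
print(total_input_tokens(turns=50, tokens_per_message=100))  # 250000
```

Under these assumptions, fifty short turns are billed as 250,000 input tokens even though the messages themselves add up to fewer than 10,000, which is the gap a cacheable prefix is meant to close.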
Inside a Claude Code session, the CLI maintains the messages array for you. /compact trims it when it gets too long. Your CLAUDE.md is prepended every turn as part of the system prompt — that's why it's billed every turn (and why CC1's "keep it under 200 lines" rule matters).
System Prompts — Where CLAUDE.md Goes
The system field is a top-level instruction about Claude's role, persona, or constraints. It's not part of messages. Claude treats it with higher priority than user content — useful for "always" rules.
# Python
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=(
        "You are a senior Python reviewer. Be terse. "
        "Always reply in this format:\n"
        "VERDICT: \n"
        "REASON: "
    ),
    messages=[
        {"role": "user", "content": "def add(a,b): return a+b"}
    ],
)
print(resp.content[0].text)
# Example output (wording will vary):
# VERDICT: ship
# REASON: Trivially correct, no validation needed for two-arg sum.

// TypeScript
const resp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  system:
    "You are a senior Python reviewer. Be terse. " +
    "Always reply in this format:\nVERDICT: \nREASON: ",
  messages: [
    { role: "user", content: "def add(a,b): return a+b" },
  ],
});
How Claude Code uses system prompts under the hood
Every Claude Code session builds a system prompt from multiple sources, concatenated:
| Source | Where it comes from | Module |
|---|---|---|
| Tool descriptions | Built-in CLI tools (Bash, Edit, Read, etc.) | built-in |
| CLAUDE.md cascade | 5 tiers (managed → user) | CC3 |
| Active skill descriptions | Skills currently loaded for this session | CC5 |
| Subagent system prompt | When running inside a subagent invocation | CC6 |
| Memory index | Auto-memory entries from prior sessions | CC3 |
You can see the assembled system prompt with claude --debug — useful when Claude isn't following your CLAUDE.md and you suspect it's not loading.
System prompts are billed and (with caching) shared across requests. Putting per-request data there — the user's name, today's date — busts the cache. Put dynamic data in user messages; reserve system for stable role/style/policy.
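In practice this means building the two halves of the request from different inputs. A minimal sketch, with `build_request_parts` and the `<context>` wrapper as illustrative choices rather than anything the API requires:

```python
from datetime import date

STABLE_SYSTEM = (
    "You are a senior Python reviewer. Be terse. "
    "Flag missing input validation."
)


def build_request_parts(user_name: str, today: date, question: str):
    """Keep system identical across requests; put per-request data in the user turn."""
    user_content = (
        f"<context>user={user_name}; date={today.isoformat()}</context>\n"
        f"{question}"
    )
    return STABLE_SYSTEM, [{"role": "user", "content": user_content}]


sys_a, msgs_a = build_request_parts("ana", date(2025, 1, 6), "Review add().")
sys_b, msgs_b = build_request_parts("raj", date(2025, 1, 7), "Review add().")
print(sys_a == sys_b)    # True: the cacheable prefix never changes
print(msgs_a == msgs_b)  # False: the dynamic data lives in the user message
```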
Temperature — Determinism vs Creativity
A short-order cook on a Tuesday lunch knows the burger order: same patty, same bun, same pickles. Boring? Yes. Predictable? Yes. That's temperature 0 — pick the most likely next thing every time.
The same cook on Friday night, asked to invent a special, weighs ten ideas and picks one of the more interesting ones. Variety, but more risk of an experiment that flops. That's temperature 1.
Same dial, same kitchen. The only question is: do you want a reliable burger or a fun special?
Temperature is a number from 0.0 to 1.0 that controls how Claude samples its next token. At 0, Claude almost always picks the highest-probability token (very deterministic, but not bit-for-bit identical — there's still some non-determinism from inference internals). At 1.0, it samples more broadly, producing more varied output.
| Use case | Temperature | Why |
|---|---|---|
| Code generation, refactors | 0.0–0.2 | You want the same answer every run; reproducibility helps debugging. |
| Classification, extraction, evals (CC11) | 0.0 | Stability across runs is the whole point. |
| Writing, brainstorming | 0.7–1.0 | You want variety; reading the same draft twice is uninteresting. |
| Tool-using agents | 0.0–0.3 | Tool selection should be predictable; randomness here is a bug. |
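
The dial is easier to reason about with numbers. The sketch below is the textbook softmax-with-temperature calculation on made-up scores for three candidate tokens; it illustrates the idea and is not a description of Anthropic's actual sampler.

```python
import math


def token_probabilities(scores: dict, temperature: float) -> dict:
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature == 0:
        # Limit case: all probability mass on the top-scoring token.
        best = max(scores, key=scores.get)
        return {tok: float(tok == best) for tok in scores}
    scaled = {tok: s / temperature for tok, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


scores = {"receive": 3.0, "recieve": 1.0, "receipt": 0.5}  # hypothetical values
for t in (0, 0.2, 1.0):
    probs = token_probabilities(scores, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

Lowering the temperature concentrates probability on the top-scoring token; raising it hands more of the mass to the alternatives.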
Claude Code defaults to a low temperature suitable for code work. Subagents you write should typically also use low temperature unless their job is creative (e.g. a "draft release notes" subagent might use 0.7). The frontmatter exposes this in CC6.
Response Streaming
By default, the SDK waits for the full response before returning. For long answers (or long tool sequences), that's a multi-second blank screen. Streaming sends each token as it's generated, via Server-Sent Events.
# Python
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain async/await in 2 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # tokens arrive in real time
    final = stream.get_final_message()  # read it while the stream is still open
print(f"\n[done in {final.usage.output_tokens} tokens]")

// TypeScript
const stream = await client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain async/await in 2 sentences." }],
});
for await (const chunk of stream) {
  if (chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta") {
    process.stdout.write(chunk.delta.text);
  }
}
const final = await stream.finalMessage();
console.log(`\n[done in ${final.usage.output_tokens} tokens]`);
When streaming hurts
- Batch jobs. If nobody's watching, you don't need first-token latency — non-streaming has slightly less protocol overhead.
- Strict JSON parsers. You can't JSON.parse (or json.loads) a half-streamed object. Either buffer to the end, or use tool use (which streams structured chunks you can assemble safely).
- Hooks that decide quickly. A hook that needs to say "block" or "allow" wants the verdict, not a token-by-token monologue.
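Buffering to the end is only a few lines. The sketch below replays a hand-written list of events shaped like the content_block_delta events a stream yields (plain dicts here, typed objects in the SDK) and parses only once the text is complete; `collect_text` is a hypothetical helper.

```python
import json


def collect_text(events) -> str:
    """Concatenate text deltas from a stream of events, ignoring everything else."""
    parts = []
    for event in events:
        if event.get("type") != "content_block_delta":
            continue
        delta = event["delta"]
        if delta.get("type") == "text_delta":
            parts.append(delta["text"])
    return "".join(parts)


# Simulated stream: the JSON object arrives split across three deltas.
events = [
    {"type": "message_start"},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": '{"verdict": '}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": '"allow", '}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": '"reason": "ok"}'}},
    {"type": "message_stop"},
]

# No single delta is valid JSON on its own; the joined buffer is.
verdict = json.loads(collect_text(events))
print(verdict["verdict"])  # allow
```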
The CLI streams every assistant turn — that's why you see text appear character-by-character. Subagents stream too; the parent agent reads the stream and integrates it. Long-running subagents look "frozen" without streaming, even when working correctly — if you build one, default to streaming output.
Structured Data — Getting Reliable JSON
Free-form text is great for humans, terrible for code that needs to parse it. Three ways to get structured data out of Claude, in increasing order of reliability:
Option 1: Ask politely
Lowest effort, lowest reliability. Works ~90% of the time on simple shapes:
# Python
import json

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=256,
    messages=[{"role": "user", "content":
        'Return JSON only: {"sentiment": "positive|negative|neutral", "score": 0..1}. '
        'Text: "I love the new keyboard."'
    }],
)
data = json.loads(resp.content[0].text)  # parse, hope
Failure mode: Claude wraps the JSON in ```json ... ``` fences or adds a leading "Here's the JSON:". You have to strip them.
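Stripping those wrappers can be sketched in a few lines of standard-library Python. `extract_json` is a hypothetical helper, and the wrapped string is hand-written rather than real model output; it handles a code fence and a leading preamble, not every possible deviation.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the snippet stays fence-safe


def extract_json(text: str) -> dict:
    """Parse the first JSON object in text, tolerating a fence and a preamble."""
    fenced = re.search(
        re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), text, re.DOTALL
    )
    if fenced:
        text = fenced.group(1)
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode parses one value and ignores whatever trails it.
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


wrapped = (
    "Here's the JSON:\n"
    + FENCE + 'json\n{"sentiment": "positive", "score": 0.9}\n' + FENCE
)
print(extract_json(wrapped))
```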
Option 2: Prefill the assistant turn
You start the assistant's reply for it. Claude continues from where you left off, so it can't add a preamble:
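# Python
import json

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=256,
    messages=[
        {"role": "user", "content": 'Sentiment of "I love the new keyboard." Return JSON.'},
        {"role": "assistant", "content": "{"},  # forces JSON-first
    ],
    # Stop at the first closing brace. Fine for a flat object;
    # a nested object would be cut off at its inner "}".
    stop_sequences=["}"],
)
# Neither the prefill nor the stop sequence is part of the returned text,
# so put both braces back before parsing.
json_text = "{" + resp.content[0].text + "}"
data = json.loads(json_text)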
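The reassembly step can be checked offline. In the sketch below, the completion strings stand in for resp.content[0].text (they are hand-written, not real model output) and `reassemble` is a hypothetical helper. It also shows why a single-brace stop sequence only suits flat objects.

```python
import json


def reassemble(prefill: str, completion: str, stop: str) -> dict:
    """Rebuild and parse the JSON; completion lacks both the prefill and the stop text."""
    return json.loads(prefill + completion + stop)


# Flat object: the first "}" really is the end of the JSON.
flat = reassemble("{", '"sentiment": "positive", "score": 0.97', "}")
print(flat["sentiment"])  # positive

# Nested object: generation halts at the inner "}", so the outer object never closes.
truncated = '"sentiment": "positive", "detail": {"score": 0.97'
try:
    reassemble("{", truncated, "}")
except json.JSONDecodeError:
    print("nested JSON was cut short")
```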
Option 3: Tool use (most reliable)
Define a tool whose input_schema is exactly the JSON shape you want. Claude will fill in the schema. Covered in detail in CC8 — the short version:
# Python
tools = [{
    "name": "record_sentiment",
    "description": "Record sentiment analysis of text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "score": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["sentiment", "score"],
    },
}]

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=256,
    tools=tools,
    tool_choice={"type": "tool", "name": "record_sentiment"},  # force this tool
    messages=[{"role": "user", "content": "I love the new keyboard."}],
)
# resp.content has a tool_use block whose .input follows the schema
tool_use = next(b for b in resp.content if b.type == "tool_use")
data = tool_use.input  # already a dict, so no string parsing is needed
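You saw three reliability tiers for structured output. For production, default to tool use: it's the only tier where Claude fills in an explicit schema and hands you a dict with no string parsing. Treat the result as very likely, rather than guaranteed, to match the schema unless you validate it yourself. Prefilling is a useful fallback when the SDK doesn't support tools (rare). Free-form parsing is fine for prototypes and demos.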
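A cheap check on the returned input before acting on it is a reasonable safeguard. The sketch below hand-codes the record_sentiment schema from the example above instead of pulling in a JSON Schema library; `validate_sentiment` is a hypothetical helper.

```python
ALLOWED = {"positive", "negative", "neutral"}
REQUIRED = {"sentiment", "score"}


def validate_sentiment(data: dict) -> dict:
    """Check a record_sentiment input by hand; raise ValueError on any mismatch."""
    missing = REQUIRED - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["sentiment"] not in ALLOWED:
        raise ValueError(f"bad sentiment: {data['sentiment']!r}")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 1:
        raise ValueError(f"score out of range: {score}")
    return data


print(validate_sentiment({"sentiment": "positive", "score": 0.97}))
```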
Hands-On Lab: Build an "ask my CLAUDE.md" CLI
You'll write a script of about 25 lines that loads your project's CLAUDE.md, treats it as a system prompt, and lets you ask questions about your own conventions. This is essentially a 1% version of Claude Code, and a useful sanity check that the API really is what's underneath.
Step 1 — Project setup
$ mkdir ask-claude-md && cd ask-claude-md
$ python -m venv .venv && source .venv/bin/activate
$ pip install anthropic
$ export ANTHROPIC_API_KEY="sk-ant-..."
# Use any CLAUDE.md you have. If you don't have one, make a starter:
$ cat > CLAUDE.md <<'EOF'
# my-saas-app
- Money is always cents (integers), never floats.
- Use Zod for API boundaries. Errors throw HttpError(status, msg).
- We use pnpm, not npm. Tests must pass before commit.
EOF
Step 2 — The script (ask.py)
import sys, pathlib
from anthropic import Anthropic, APIError

claude_md = pathlib.Path("CLAUDE.md").read_text()
question = " ".join(sys.argv[1:]) or "What conventions does this project follow?"

client = Anthropic()
system = (
    "You are answering questions about a software project. "
    "The project's CLAUDE.md file is below. Answer ONLY based on it. "
    "If the answer isn't there, say so explicitly.\n\n"
    f"<CLAUDE_MD>\n{claude_md}\n</CLAUDE_MD>"
)

try:
    with client.messages.stream(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0,
        system=system,
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for chunk in stream.text_stream:
            print(chunk, end="", flush=True)
    print()
except APIError as e:
    # Connection failures carry no HTTP status, so fall back to "n/a".
    status = getattr(e, "status_code", "n/a")
    print(f"\n[error {status}] {e.message}", file=sys.stderr)
    sys.exit(1)
Step 3 — Try it
$ python ask.py "How should I represent money?"
Always as cents (integers). Never floats. (Per CLAUDE.md.)
$ python ask.py "What's our preferred CSS framework?"
The CLAUDE.md provided does not mention a CSS framework, so I can't answer.
Step 4 — What you just built
You wrote, in 25 lines, the conceptual core of Claude Code: load the developer's instructions into the system prompt, then forward the user's question. The real Claude Code adds tools (CC8), permissions (CC4), session management, and a thousand other niceties — but the foundation is exactly this.
Step 5 — Push it further (optional)
- Add multi-turn: keep a list of past messages and re-send on each invocation.
- Switch to tool_choice to force a JSON answer with {found: bool, citation: str, answer: str}.
- Try Haiku (claude-haiku-4-5-20251001) and time the difference; CLAUDE.md questions are easy work that benefits from the cheaper model.
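For the multi-turn extension, the history has to survive between invocations of the script. One possible approach, sketched with hypothetical helpers (`load_history`, `save_turn`) and a JSON file as the store; the answers in the demo are hand-written stand-ins for model output.

```python
import json
import pathlib
import tempfile


def load_history(path: pathlib.Path) -> list:
    """Return saved messages, or an empty list on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return []


def save_turn(path: pathlib.Path, question: str, answer: str) -> list:
    """Append one user/assistant pair and write the whole history back."""
    history = load_history(path)
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": answer})
    path.write_text(json.dumps(history, indent=2))
    return history


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    store = pathlib.Path(tmp) / "history.json"
    save_turn(store, "How should I represent money?", "As integer cents.")
    history = save_turn(store, "And which package manager?", "pnpm.")
    print(len(history))  # 4
```

In ask.py you would call load_history before the request, append the new question, pass the resulting list as messages, and call save_turn once the stream has finished.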
You now have a working ask.py that streams answers about your CLAUDE.md, with proper error handling and a low temperature. You've written, by hand, the smallest possible version of what Claude Code does, and you can extend it. CC2 builds on top: better prompts that get better answers from this exact pattern.
Knowledge Check
1. Your hook needs to classify a stack trace as "transient" vs "real bug" 50,000 times a day. Which model?
2. You send messages=[{"role":"user", ...}, {"role":"user", ...}]. What happens?
3. You're calling Claude in a CI script that diffs two JSON files. What temperature?
4. You need Claude to return {verdict: "approve|deny", reason: string} from a hook. Most reliable approach?
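- Prefill the assistant turn with { and stop on }.
- Define a tool whose input_schema matches and use tool_choice to force it.
Answer: the forced tool. You get a dict matching the schema, no string parsing. Forced tool_choice is the only option that guarantees a structured block instead of free text.
5. Your skill's behavior changed across a model upgrade. What's the most likely cause and the right fix?
Answer: you used an undated alias (claude-sonnet-4-6). Pin to a dated snapshot in production.
Module Summary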
- Three Claude 4.x models: Opus (deep reasoning), Sonnet (default), Haiku (fast & cheap). Pin dated IDs in prod.
- Messages API request: model, max_tokens, messages, optional system. Response content is an array of blocks, not a string.
- Multi-turn: roles alternate user/assistant; you re-send the whole history every turn (until you cache; see CC10).
- System prompt = stable role/policy. CLAUDE.md, skill descriptions, and subagent prompts all become parts of it.
- Temperature 0 for code, classification, evals, tool use. 0.7–1.0 for writing/brainstorming.
- Streaming via SSE for human-facing UX; non-streaming for batch jobs and quick-verdict hooks.
- Structured data: ask < prefill < tool use with forced tool_choice (the only path driven by an explicit schema).