CC1: API Essentials for CLI Power Users

Learning Objectives

Pick the right Claude model (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) for a given job.
Make a Messages API request from Python and TypeScript with proper error handling.
Hold a multi-turn conversation by structuring the messages array correctly.
Distinguish system prompts from user messages, and recognize how CLAUDE.md becomes one.
Tune temperature for deterministic vs creative output.
Stream responses for low-latency UX (and know when streaming hurts you).
Coerce JSON output reliably — via prompting, prefilling, and tool use.

Claude Code is a CLI on top of the Messages API. Most "Claude Code did X weird thing" mysteries dissolve once you can reason about the underlying API call.

Why CLI Users Need the API

Everyday Analogy

Driving a car without ever opening the hood works fine until something goes wrong. The day the engine knocks, you don't need to be a mechanic, but knowing what an engine, a spark plug, and an oil pump are turns a panic call into a 30-second diagnosis: "smells like a misfire."

The pain without that mental model: you treat every symptom as random and only the dealership can help. Every conversation starts at zero context.

Claude Code is the same. You can ship for months using only the CLI. But the moment you write a skill that hangs, a subagent that won't follow instructions, or a hook whose JSON looks wrong — you need to know what a model, a system prompt, and a messages array are. This module is the engine bay.

Three places where API knowledge becomes load-bearing for CLI work:

Skills ship a system-prompt fragment that gets injected when the skill is invoked. Knowing what a system prompt does — vs a user message — is the difference between a skill that steers Claude and one that's politely ignored.
Subagents run with their own model selection and their own system prompt. Picking Haiku for a fast linter subagent vs Opus for a deep reviewer is a cost decision that compounds across hundreds of invocations.
Hooks sometimes need to call Claude themselves (e.g. an HTTP hook that summarizes a long stack trace before deciding to block). That's a raw Messages API call — no CLI in sight.

Why it matters

Every Claude Code feature you'll use in CC3–CC14 is a thin wrapper over the Messages API. If you understand one API call deeply, the rest of the course feels like configuration, not magic.

The Claude Model Lineup

Anthropic's current generation is Claude 4.x, which comes in three sizes. Picking the right one for a given task is the single highest-leverage cost decision in any Claude-powered system.

Claude 4.x — Three Sizes, Three Use Cases

Model	ID	When to pick it	Relative speed
Opus 4.7	`claude-opus-4-7`	Hard reasoning, multi-file refactors, deep code review, planning agents.	Slowest
Sonnet 4.6	`claude-sonnet-4-6`	Default. Production code-gen, RAG, tool use. Great quality / speed balance.	Medium
Haiku 4.5	`claude-haiku-4-5-20251001`	Classification, routing, lightweight tool calls, latency-sensitive subagents.	Fastest

Technical Definition

A model ID is the string you pass in the model field of a Messages API request. Anthropic ships dated and undated aliases. claude-sonnet-4-6 is an undated alias that always points to the latest Sonnet 4.6 snapshot; claude-haiku-4-5-20251001 is a pinned version that won't move under you. Pin in production; use undated for prototyping.

The 3-question decision tree

Is this a deep-reasoning task (architecture, multi-file refactor, hard debugging)? → Opus 4.7
Is this latency-critical or volume-heavy (a hook running on every save, a router subagent)? → Haiku 4.5
Otherwise → Sonnet 4.6 (this is 80% of the time)

Cost intuition

Opus is roughly 5× the price of Sonnet, which is roughly 3× the price of Haiku. A subagent that runs 1000 times a day on Opus costs ~15× what it would on Haiku — if Haiku can do the job, that's a meaningful savings without reading ten cost-optimization blog posts.

What just happened?

You learned that picking the model is a cost-vs-quality dial. For Claude Code, you'll mostly leave the CLI on its default (Opus 4.7 or Sonnet 4.6 depending on settings), but you'll explicitly set the model in subagent frontmatter (CC6) and skill metadata (CC5).

Your First Messages API Call

Before going further, get an API key from console.anthropic.com and put it in your environment:

export ANTHROPIC_API_KEY="sk-ant-..."   # never commit this

Now the smallest possible request — a single user turn, asking for the fix to a typo:

import os
from anthropic import Anthropic, APIError

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

try:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Fix the typo: 'recieve' → ?"}
        ],
    )
    print(resp.content[0].text)
except APIError as e:
    print(f"API error {e.status_code}: {e.message}")

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

try {
  const resp = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    messages: [
      { role: "user", content: "Fix the typo: 'recieve' → ?" }
    ],
  });
  // content is a typed union; for plain text take the first block
  const block = resp.content[0];
  if (block.type === "text") console.log(block.text);
} catch (e: any) {
  console.error(`API error ${e.status}: ${e.message}`);
}

Anatomy of the request — four required fields

model — which Claude. Always required.
max_tokens — cap on the response. Required — there's no implicit default. Set this to your worst-case need; you only pay for actual output tokens.
messages — the conversation, an array of {role, content} objects. role is "user" or "assistant". The first message must be "user".
system (optional but in 99% of real apps) — instructions about Claude's role. Top level, not part of messages. We'll see this below.

Anatomy of the response

Claude returns a Message object. The interesting fields:

{
  "id": "msg_01ABC...",
  "model": "claude-sonnet-4-6",
  "role": "assistant",
  "content": [
    { "type": "text", "text": "receive" }
  ],
  "stop_reason": "end_turn",
  "usage": { "input_tokens": 18, "output_tokens": 3 }
}

content is always an array of blocks, not a string. Most of the time it has one text block; with tool use you'll see tool_use blocks too (CC8).
stop_reason tells you why Claude stopped: "end_turn" (finished naturally), "max_tokens" (you ran out of budget — the response is truncated), "tool_use" (Claude wants to call a tool), "stop_sequence" (matched one of your stop sequences).
usage is your billing — cache it if you're cost-sensitive.

The most common rookie bug

Treating resp.content as a string. It's an array. Use resp.content[0].text in Python, or the typed union pattern in TypeScript. You'll get a TypeError the first time and it'll feel like the SDK is broken — it's not.

Multi-Turn Conversations

Everyday Analogy

Imagine writing a novel where every chapter is filed in a separate folder, and the writer has amnesia. To write Chapter 4, you hand them Chapters 1, 2, and 3 plus the new prompt. They read everything, write Chapter 4, hand it back, and forget the whole conversation.

The pain without this: every "remember when I said..." gets a blank stare. Continuity has to be reconstructed from outside.

The Messages API is exactly this. Claude is stateless — no server-side memory. A "multi-turn conversation" is you sending the entire prior conversation back on every request. The messages array is the chapter folder.

To continue a conversation, you append both the previous assistant response and the new user message, then re-send the whole array:

history = [
    {"role": "user", "content": "What's the capital of Australia?"},
]

# Turn 1
r1 = client.messages.create(model="claude-sonnet-4-6", max_tokens=128, messages=history)
assistant_text = r1.content[0].text
history.append({"role": "assistant", "content": assistant_text})

# Turn 2 — user replies, append, re-send everything
history.append({"role": "user", "content": "And its population?"})
r2 = client.messages.create(model="claude-sonnet-4-6", max_tokens=128, messages=history)
print(r2.content[0].text)  # Claude knows we're still talking about Canberra

type Msg = { role: "user" | "assistant"; content: string };
const history: Msg[] = [
  { role: "user", content: "What's the capital of Australia?" },
];

const r1 = await client.messages.create({
  model: "claude-sonnet-4-6", max_tokens: 128, messages: history,
});
const reply1 = r1.content[0].type === "text" ? r1.content[0].text : "";
history.push({ role: "assistant", content: reply1 });

history.push({ role: "user", content: "And its population?" });
const r2 = await client.messages.create({
  model: "claude-sonnet-4-6", max_tokens: 128, messages: history,
});
console.log(r2.content[0].type === "text" ? r2.content[0].text : "");

Two rules that bite people

Roles must alternate. Two consecutive user messages = 400 error. If you need to inject extra context, merge it into one user message.
The whole history is re-billed every turn. A 50-turn conversation re-sends 50 turns of input every time. This is where prompt caching (CC10) saves you money — mark the stable prefix as cacheable.

How this maps to Claude Code

Inside a Claude Code session, the CLI maintains the messages array for you. /compact trims it when it gets too long. Your CLAUDE.md is prepended every turn as part of the system prompt — that's why it's billed every turn (and why CC1's "keep it under 200 lines" rule matters).

System Prompts — Where CLAUDE.md Goes

The system field is a top-level instruction about Claude's role, persona, or constraints. It's not part of messages. Claude treats it with higher priority than user content — useful for "always" rules.

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=(
        "You are a senior Python reviewer. Be terse. "
        "Always reply in this format:\n"
        "VERDICT: \n"
        "REASON: "
    ),
    messages=[
        {"role": "user", "content": "def add(a,b): return a+b"}
    ],
)
print(resp.content[0].text)
# VERDICT: ship
# REASON: Trivially correct, no validation needed for two-arg sum.

const resp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  system:
    "You are a senior Python reviewer. Be terse. " +
    "Always reply in this format:\nVERDICT: \nREASON: ",
  messages: [
    { role: "user", content: "def add(a,b): return a+b" },
  ],
});

How Claude Code uses system prompts under the hood

Every Claude Code session builds a system prompt from multiple sources, concatenated:

What ends up in Claude Code's system prompt

Source	Where it comes from	Module
Tool descriptions	Built-in CLI tools (Bash, Edit, Read, etc.)	built-in
CLAUDE.md cascade	5 tiers (managed → user)	CC3
Active skill descriptions	Skills currently loaded for this session	CC5
Subagent system prompt	When running inside a subagent invocation	CC6
Memory index	Auto-memory entries from prior sessions	CC3

You can see the assembled system prompt with claude --debug — useful when Claude isn't following your CLAUDE.md and you suspect it's not loading.

Don't put dynamic data in system

System prompts are billed and (with caching) shared across requests. Putting per-request data there — the user's name, today's date — busts the cache. Put dynamic data in user messages; reserve system for stable role/style/policy.

Temperature — Determinism vs Creativity

Everyday Analogy

A short-order cook on a Tuesday lunch knows the burger order: same patty, same bun, same pickles. Boring? Yes. Predictable? Yes. That's temperature 0 — pick the most likely next thing every time.

The same cook on Friday night, asked to invent a special, weighs ten ideas and picks one of the more interesting ones. Variety, but more risk of an experiment that flops. That's temperature 1.

Same dial, same kitchen. The only question is: do you want a reliable burger or a fun special?

Technical Definition

Temperature is a number from 0.0 to 1.0 that controls how Claude samples its next token. At 0, Claude almost always picks the highest-probability token (very deterministic, but not bit-for-bit identical — there's still some non-determinism from inference internals). At 1.0, it samples more broadly, producing more varied output.

Use case	Temperature	Why
Code generation, refactors	0.0–0.2	You want the same answer every run; reproducibility helps debugging.
Classification, extraction, evals (CC11)	0.0	Stability across runs is the whole point.
Writing, brainstorming	0.7–1.0	You want variety; reading the same draft twice is uninteresting.
Tool-using agents	0.0–0.3	Tool selection should be predictable; randomness here is a bug.

For Claude Code specifically

Claude Code defaults to a low temperature suitable for code work. Subagents you write should typically also use low temperature unless their job is creative (e.g. a "draft release notes" subagent might use 0.7). The frontmatter exposes this in CC6.

Response Streaming

By default, the SDK waits for the full response before returning. For long answers (or long tool sequences), that's a multi-second blank screen. Streaming sends each token as it's generated, via Server-Sent Events.

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain async/await in 2 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # tokens arrive in real time
    final = stream.get_final_message()
    print(f"\n[done in {final.usage.output_tokens} tokens]")

const stream = await client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain async/await in 2 sentences." }],
});

for await (const chunk of stream) {
  if (chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta") {
    process.stdout.write(chunk.delta.text);
  }
}
const final = await stream.finalMessage();
console.log(`\n[done in ${final.usage.output_tokens} tokens]`);

When streaming hurts

Batch jobs. If nobody's watching, you don't need first-token latency — non-streaming has slightly less protocol overhead.
Strict JSON parsers. You can't json.parse a half-streamed object. Either buffer to end, or use tool-use (which streams structured chunks you can assemble safely).
Hooks that decide quickly. A hook that needs to say "block" or "allow" wants the verdict, not a token-by-token monologue.

In Claude Code

The CLI streams every assistant turn — that's why you see text appear character-by-character. Subagents stream too; the parent agent reads the stream and integrates it. Long-running subagents look "frozen" without streaming, even when working correctly — if you build one, default to streaming output.

Structured Data — Getting Reliable JSON

Free-form text is great for humans, terrible for code that needs to parse it. Three ways to get structured data out of Claude, in increasing order of reliability:

Option 1: Ask politely

Lowest effort, lowest reliability. Works ~90% of the time on simple shapes:

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=256,
    messages=[{"role": "user", "content":
        'Return JSON only: {"sentiment": "positive|negative|neutral", "score": 0..1}. '
        'Text: "I love the new keyboard."'
    }],
)
import json
data = json.loads(resp.content[0].text)  # parse, hope

Failure mode: Claude wraps the JSON in ```json ... ``` fences or adds a leading "Here's the JSON:". You have to strip them.

Option 2: Prefill the assistant turn

You start the assistant's reply for it. Claude continues from where you left off, so it can't add a preamble:

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=256,
    messages=[
        {"role": "user", "content": 'Sentiment of "I love the new keyboard." Return JSON.'},
        {"role": "assistant", "content": "{"},   # forces JSON-first
    ],
    stop_sequences=["}"],   # stop right after the closing brace
)
json_text = "{" + resp.content[0].text + "}"
data = json.loads(json_text)

Option 3: Tool use (most reliable)

Define a tool whose input_schema is exactly the JSON shape you want. Claude will fill in the schema. Covered in detail in CC8 — the short version:

tools = [{
    "name": "record_sentiment",
    "description": "Record sentiment analysis of text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "score": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["sentiment", "score"],
    },
}]
resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=256,
    tools=tools,
    tool_choice={"type": "tool", "name": "record_sentiment"},  # force this tool
    messages=[{"role": "user", "content": "I love the new keyboard."}],
)
# resp.content has a tool_use block with .input matching the schema
tool_use = next(b for b in resp.content if b.type == "tool_use")
data = tool_use.input    # already a dict, schema-validated

What just happened?

You saw three reliability tiers for structured output. For production, default to tool use. It's the only one that's schema-validated end-to-end. Prefilling is a useful fallback when the SDK doesn't support tools (rare). Free-form parsing is fine for prototypes and demos.

Hands-On Lab — Build a "ask my CLAUDE.md" CLI

You'll write a 30-line script that loads your project's CLAUDE.md, treats it as a system prompt, and lets you ask questions about your own conventions. This is essentially a 1% version of Claude Code — a useful sanity check that the API really is what's underneath.

Step 1 — Project setup

$ mkdir ask-claude-md && cd ask-claude-md
$ python -m venv .venv && source .venv/bin/activate
$ pip install anthropic
$ export ANTHROPIC_API_KEY="sk-ant-..."

# Use any CLAUDE.md you have. If you don't have one, make a starter:
$ cat > CLAUDE.md <<'EOF'
# my-saas-app
- Money is always cents (integers), never floats.
- Use Zod for API boundaries. Errors throw HttpError(status, msg).
- We use pnpm, not npm. Tests must pass before commit.
EOF

Step 2 — The script (`ask.py`)

import sys, pathlib
from anthropic import Anthropic, APIError

claude_md = pathlib.Path("CLAUDE.md").read_text()
question = " ".join(sys.argv[1:]) or "What conventions does this project follow?"

client = Anthropic()

system = (
    "You are answering questions about a software project. "
    "The project's CLAUDE.md file is below. Answer ONLY based on it. "
    "If the answer isn't there, say so explicitly.\n\n"
    f"<CLAUDE_MD>\n{claude_md}\n</CLAUDE_MD>"
)

try:
    with client.messages.stream(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0,
        system=system,
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for chunk in stream.text_stream:
            print(chunk, end="", flush=True)
        print()
except APIError as e:
    print(f"\n[error {e.status_code}] {e.message}", file=sys.stderr)
    sys.exit(1)

Step 3 — Try it

$ python ask.py "How should I represent money?"
Always as cents (integers). Never floats. (Per CLAUDE.md.)

$ python ask.py "What's our preferred CSS framework?"
The CLAUDE.md provided does not mention a CSS framework, so I can't answer.

Step 4 — What you just built

You wrote, in 25 lines, the conceptual core of Claude Code: load the developer's instructions into the system prompt, then forward the user's question. The real Claude Code adds tools (CC8), permissions (CC4), session management, and a thousand other niceties — but the foundation is exactly this.

Step 5 — Push it further (optional)

Add multi-turn: keep a list of past messages and re-send on each invocation.
Switch to tool_choice to force a JSON answer with {found: bool, citation: str, answer: str}.
Try Haiku (claude-haiku-4-5-20251001) and time the difference; CLAUDE.md questions are easy work that benefits from the cheaper model.

The CLAUDE.md got bigger.

Correct. Undated aliases follow the latest snapshot. For production stability, pin to a dated ID. Use the alias for fast prototyping.

Look again. Undated model aliases auto-upgrade. Pinning to a dated ID locks behavior until you choose to migrate.

Module Summary

Three Claude 4.x models: Opus (deep reasoning), Sonnet (default), Haiku (fast & cheap). Pin dated IDs in prod.
Messages API request: model, max_tokens, messages, optional system. Response content is an array of blocks, not a string.
Multi-turn: roles alternate user/assistant; you re-send the whole history every turn (until you cache — CC10).
System prompt = stable role/policy. CLAUDE.md, skill descriptions, and subagent prompts all become parts of it.
Temperature 0 for code, classification, evals, tool use. 0.7–1.0 for writing/brainstorming.
Streaming via SSE for human-facing UX; non-streaming for batch jobs and quick-verdict hooks.
Structured data: ask < prefill < tool use with forced tool_choice (the only schema-validated path).

CC1: API Essentials for CLI Power Users

Learning Objectives

Why CLI Users Need the API

The Claude Model Lineup

The 3-question decision tree

Your First Messages API Call

Anatomy of the request — four required fields

Anatomy of the response

Multi-Turn Conversations

Two rules that bite people

System Prompts — Where CLAUDE.md Goes

How Claude Code uses system prompts under the hood

Temperature — Determinism vs Creativity

Response Streaming

When streaming hurts

Structured Data — Getting Reliable JSON

Option 1: Ask politely

Option 2: Prefill the assistant turn

Option 3: Tool use (most reliable)

Hands-On Lab — Build a "ask my CLAUDE.md" CLI

Step 1 — Project setup

Step 2 — The script (ask.py)

Step 3 — Try it

Step 4 — What you just built

Step 5 — Push it further (optional)

Knowledge Check

1. Your hook needs to classify a stack trace as "transient" vs "real bug" 50,000 times a day. Which model?

2. You send messages=[{"role":"user", ...}, {"role":"user", ...}]. What happens?

3. You're calling Claude in a CI script that diffs two JSON files. What temperature?

4. You need Claude to return {verdict: "approve|deny", reason: string} from a hook. Most reliable approach?

5. Your skill's behavior changed across a model upgrade. What's the most likely cause and the right fix?

Module Summary

Step 2 — The script (`ask.py`)

2. You send `messages=[{"role":"user", ...}, {"role":"user", ...}]`. What happens?

4. You need Claude to return `{verdict: "approve|deny", reason: string}` from a hook. Most reliable approach?