Claude Code Mastery — Direct Track
CC10 — Claude Features Module 11 of 16
50 minIntermediate
← CC9: MCP 🏠 Home CC11: Evaluation →

CC10: Claude Features — Thinking, Caching, Citations, Files

The features that make production Claude apps fast, cheap, and trustworthy: extended thinking for hard reasoning, prompt caching for cost (90%+ savings on stable prefixes), image/PDF input, citations for grounded answers, and the code-execution + Files API for running Python sandboxed.

Learning Objectives

  • Enable extended thinking and pick a sensible budget per task type.
  • Add prompt caching to a long system prompt for 90%+ cost reduction on cache hits.
  • Recognize the four caching rules and the cache-control breakpoint limit.
  • Send images and PDFs as message content; understand size and page limits.
  • Get citations on RAG answers without writing the citation logic yourself.
  • Run sandboxed Python via the code-execution tool and persist files via the Files API.
CC8 was Claude reaching out to your tools. CC10 is Claude reaching for capabilities Anthropic ships in the box: hidden reasoning, document inputs, server-side Python, citations.

Extended Thinking

Everyday Analogy

Asking a senior engineer a hard architecture question and getting an immediate yes/no answer is suspicious. Ask the same question and watch them go quiet, draw on a whiteboard for two minutes, then come back with a confident answer — that's better. The thinking didn't happen in the answer; it happened before the answer.

Extended thinking is exactly that. Claude produces a hidden reasoning trace before its visible answer. The trace costs tokens, but on hard tasks (architecture, multi-file refactors, ambiguous spec questions) it raises accuracy materially.

Technical Definition

Extended Thinking is a Claude 4.x feature you enable per request via the thinking parameter. The model emits one or more hidden thinking blocks (which you echo back in multi-turn but never display to users) before its text answer. You set a budget_tokens for the thinking; output tokens are capped separately by max_tokens.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content":
        "We have 200 microservices, 80% Java. Should we migrate auth from JWT "
        "to mTLS service-to-service? Walk through the trade-offs."
    }],
)

for block in resp.content:
    if block.type == "thinking":
        # Don't show users; do log for debugging
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)

Picking a budget

Task typebudget_tokens
Trivial (lookup, classification)Don't enable
Code review of one PR2,000–4,000
Multi-file refactor planning8,000–16,000
Architecture / hard debugging16,000–32,000
Thinking tokens are billed

Thinking tokens count as output tokens at the same rate. A 16K-token thinking budget is real money — only spend it where you'd spend an Opus call anyway. Never enable thinking on Haiku — if the task is hard enough to need thinking, it's worth Sonnet or Opus.

In Claude Code

The CLI maps "think", "think hard", "think deeply", "ultrathink" to budgets ~4K / 8K / 16K / 32K respectively. Subagents you write can set a default budget in their frontmatter (CC6). For hooks that need to make a fast decision, do not enable thinking — the latency hit is too high.

Prompt Caching — The Single Biggest Cost Lever

Everyday Analogy

Imagine a tax accountant who re-reads your entire 200-page corporate filing every time you ask "what's box 14b?" Slow, and you're billed for every re-read. Now imagine they keep the filing on their desk for 5 minutes after each question — same answer, no re-read fee.

That's prompt caching. The first time you send a long stable prefix (a CLAUDE.md, a system prompt, a document), Claude caches it. Subsequent requests within ~5 minutes referencing the same prefix are billed at ~10% of the input cost. The cached prefix doesn't count against your TTFT either; cache hits are noticeably faster.

Technical Definition

Prompt caching is a feature where Anthropic stores a prefix of your prompt for a short window (5 minutes default, 1 hour available). On subsequent requests that share that prefix exactly, the cached portion is billed at a fraction of normal input cost. You opt in by marking cache breakpoints in your prompt with cache_control: {"type": "ephemeral"}.

Caching a long system prompt

long_claude_md = open("CLAUDE.md").read()  # imagine 5K tokens
schema_docs = open("schema.md").read()      # imagine 3K tokens

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_claude_md + "\n\n" + schema_docs,
            "cache_control": {"type": "ephemeral"},   # ← cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "How should I model a refund?"}],
)
print(resp.usage)
# usage shows: cache_creation_input_tokens (first call), cache_read_input_tokens (later calls)

What you save

Cache miss (first request)

8,000 input tokens billed at full rate.

~$0.024

+ writing the cache adds ~25% one-time premium.

Cache hit (subsequent)

8,000 input tokens billed at ~10% rate.

~$0.0024

10× cheaper, faster TTFT.

Numbers are illustrative; check current pricing. The point: for any long-stable prefix served >5 times in a 5-minute window, caching pays for itself many times over.

When NOT to cache
  • One-shot calls (no second request → you pay the write premium for nothing).
  • Prefixes shorter than the minimum cacheable size (Sonnet/Opus: 1024 tokens, Haiku: 2048).
  • Prefixes that change between calls — even one differing token busts the cache.

Caching Rules — The Four That Bite

Cache breakpoints look simple. The gotchas are in the matching rules.

Rule 1: Maximum 4 cache breakpoints per request

You can place up to four cache_control markers. Use them strategically:

Typical 4-breakpoint layout
PositionContentWhy cache it
1Tool definitionsStable across all turns of a session.
2System prompt (CLAUDE.md, role)Stable across the project.
3Few-shot examples or large documentStable across many requests.
4Conversation history up to last user turnReused next turn.

Rule 2: Caches match prefix-exactly

Prepend a single space, change one character, even add a newline — the cache misses entirely. Same input, byte-for-byte, every time for a hit.

Rule 3: TTL is 5 minutes by default; 1 hour optional

{
  "type": "text",
  "text": "...",
  "cache_control": {"type": "ephemeral", "ttl": "1h"}   # 1-hour TTL
}

1-hour caching costs more on the write (~2× ephemeral). Use it for content reused across hours, like a customer onboarding doc.

Rule 4: Caching only kicks in past a minimum size

Sonnet/Opus: ~1024 tokens. Haiku: ~2048 tokens. Below that, the marker is ignored. Don't try to cache a 100-token system prompt — it won't.

Order matters

Caching is prefix-based. If you put a frequently-changing token before a stable one, you've broken the cache for everything after. Always: most stable content first, most variable content last. System prompt > tools > documents > few-shot > history > current user turn.

Image & PDF Input

Claude 4.x can read images and PDFs as part of any user message. They go in content as image blocks alongside text.

Image input

import base64
with open("ui-mockup.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {
            "type": "base64", "media_type": "image/png", "data": img_b64,
        }},
        {"type": "text", "text": "What React component structure would you propose for this UI?"},
    ]}],
)
  • Supported types: image/png, image/jpeg, image/gif, image/webp.
  • Max 5 MB per image, max 100 images per request.
  • Alternative source: {"type": "url", "url": "https://..."} if your image is publicly fetchable.

PDF input

with open("design-doc.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=2048,
    messages=[{"role": "user", "content": [
        {"type": "document", "source": {
            "type": "base64", "media_type": "application/pdf", "data": pdf_b64,
        }},
        {"type": "text", "text": "Summarize the proposed migration plan in 5 bullets."},
    ]}],
)
  • Up to 32 MB / 100 pages per PDF.
  • Claude reads both the text and embedded images of the PDF.
  • For long PDFs, page-range via cache_control on the document so a follow-up question doesn't re-bill the PDF.
In Claude Code

You can drag-drop an image into the CLI prompt — it's encoded and sent as an image block automatically. Useful for "what's wrong with this UI?" or "convert this whiteboard sketch to component code." PDFs work the same way.

Citations — Grounded Answers

If you ship a RAG-style feature ("answer based on these docs"), users want to know which doc a sentence came from. Citations is a built-in feature where Claude annotates each generated sentence with the source span it came from — without you writing the citation logic.

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    messages=[{"role": "user", "content": [
        {
            "type": "document",
            "source": {"type": "text", "data": open("policy.md").read()},
            "title": "Refund Policy v3",
            "context": "Internal policy doc, last updated 2025-08-12",
            "citations": {"enabled": True},   # ← opt in
        },
        {"type": "text", "text": "Can a customer get a refund 35 days after purchase?"},
    ]}],
)

# resp.content has text blocks; each text block can have a `citations` array
# pointing to the document and character ranges that grounded the sentence.
for block in resp.content:
    if block.type == "text":
        print(block.text)
        for cite in (block.citations or []):
            print(f"  → {cite.cited_text!r} (chars {cite.start_char_index}-{cite.end_char_index})")

What citations get you

  • Per-sentence source spans, automatically (no parsing of "[1]" footnotes).
  • Reduced hallucination — Claude cannot cite text that isn't in the documents.
  • UX surface: highlight the cited span in the original doc on hover.
Citations only over document blocks

You must use the document content type with citations.enabled: true. Plain text passed in a user message won't get citations. For Claude Code use cases: wrap your project's docs as document blocks when you want Claude's answers to cite specific lines.

Code Execution & the Files API

The code-execution tool lets Claude run sandboxed Python. The Files API lets you upload files Claude can read or write into. Together they're how you give Claude "do data analysis on this CSV" or "render this chart" capabilities without you writing the executor.

Code execution tool

tools = [{"type": "code_execution_20250522", "name": "code_execution"}]

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=4096,
    tools=tools,
    extra_headers={"anthropic-beta": "code-execution-2025-05-22"},
    messages=[{"role": "user", "content":
        "Generate a histogram of these numbers: 1,1,2,3,3,3,4,4,5,5,5,5,7."
    }],
)
# Claude writes Python that uses matplotlib, the sandbox runs it,
# returns the figure as a tool_result, Claude embeds it in the answer.

Files API — persisting inputs/outputs

file = client.beta.files.upload(
    file=open("sales.csv", "rb"),
)   # one upload, reusable
print(file.id)   # "file_abc123"

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=4096,
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    extra_headers={"anthropic-beta": "code-execution-2025-05-22,files-api-2025-04-14"},
    messages=[{"role": "user", "content": [
        {"type": "container_upload", "file_id": file.id},   # mount file into sandbox
        {"type": "text", "text": "Compute total sales by region. Write the result to result.csv."},
    ]}],
)
# Claude's code_execution writes /mnt/result.csv; you can fetch it via the Files API.

Sandbox guarantees

  • Ephemeral container, no internet access from inside the sandbox.
  • Standard data-science Python (numpy, pandas, matplotlib, scipy) preinstalled.
  • ~1 GB RAM, 5-min wallclock per call.
  • Files persist across tool calls in the same conversation; ephemeral after.
Security model

The sandbox is isolated from your machine and from Anthropic's network. You can safely have Claude execute untrusted code (e.g. user-submitted snippets) as long as you understand: output is what Claude says it is — trust only what you can verify. If you upload sensitive data via Files, treat the file ID as a secret — anyone with it can call Claude with that file.

How Claude Code Uses These Features

FeatureClaude Code use
Extended thinkingTriggered by "think" / "ultrathink" keywords; budgets configurable per subagent.
Prompt cachingCLAUDE.md + tool definitions are cached automatically; you don't add cache_control yourself.
Image inputDrag-drop or paste image into the prompt → sent as image block.
PDF inputSame drag-drop. Useful for design docs, RFCs.
CitationsNot exposed in the CLI directly; available if you build a skill that calls the API with documents.
Code executionUsed internally by some skills; you can invoke via a custom skill or MCP server (CC9).
Where you'll add these manually

Skills you write that call the API directly (CC5) and HTTP hooks that call out to Claude (CC7) are where you'll add caching, citations, and thinking budgets explicitly. The CLI handles the basics; for power features, drop down to API.

Hands-On Lab — Cache a Long System Prompt

You'll measure the actual cost difference of prompt caching with a real CLAUDE.md-sized system prompt. Three runs: cold, warm, and broken.

Step 1 — Build a 4K-token system prompt

import json
from anthropic import Anthropic

# Pad with realistic CLAUDE.md content. Aim for >1024 tokens (Sonnet minimum).
SYSTEM = ("# Project: PublicRecords\n\n" +
          ("- All amounts in cents.\n- Use Drizzle ORM.\n- Tests must pass.\n" * 200))

client = Anthropic()

def call(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=128,
        system=[{
            "type": "text", "text": SYSTEM,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )

Step 2 — Cold run (creates the cache)

r1 = call("How should money be represented?")
print(json.dumps(r1.usage.model_dump(), indent=2))
# Expect: cache_creation_input_tokens > 0, cache_read_input_tokens == 0

Step 3 — Warm runs (cache hits)

for q in ["Which ORM?", "Do tests need to pass before commit?",
          "What's the unit for amounts?"]:
    r = call(q)
    print(q, "→", r.usage.model_dump())
    # Expect: cache_creation_input_tokens == 0, cache_read_input_tokens > 0

Step 4 — Bust the cache (rookie mistake)

# Add even ONE extra space at the start of SYSTEM
SYSTEM_BROKEN = " " + SYSTEM
r = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=128,
    system=[{"type": "text", "text": SYSTEM_BROKEN,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Which ORM?"}],
)
print(r.usage.model_dump())
# Cache miss again; cache_creation_input_tokens > 0

Step 5 — Compute the savings

from anthropic.types import Usage
def cost(u: Usage, in_per_m=3.0, cache_in_per_m=3.75, cache_read_per_m=0.30,
         out_per_m=15.0):
    return (u.input_tokens * in_per_m
            + (u.cache_creation_input_tokens or 0) * cache_in_per_m
            + (u.cache_read_input_tokens     or 0) * cache_read_per_m
            + u.output_tokens * out_per_m) / 1_000_000

# Compare cold first call vs a warm one. Expect ~10x savings on input portion.
Lab complete — what you should have

Three sets of usage numbers proving: (1) cold call writes the cache at a 25% premium, (2) warm calls are ~10× cheaper on the cached portion, (3) one extra space busts the cache entirely. For any skill or hook that reuses a prompt prefix more than 5 times in 5 minutes, caching is free money — you just have to enable it.

Knowledge Check

1. You have a hook that runs on every file save and needs to make a fast block/allow decision. Should you enable extended thinking?

A
Yes — better decisions are always worth it.
B
No — thinking adds latency and tokens; quick verdicts don't need it.
C
Only on Haiku.
D
Always at max budget for safety.
Correct. Hooks should be fast. Extended thinking adds seconds of latency and significant token cost. Save it for hard reasoning, not snap verdicts.
Look again. Thinking is for hard tasks where accuracy matters more than latency. Hooks need quick decisions; thinking is the wrong tool here.

2. Your one-shot CLI tool runs Claude once per invocation, no follow-up. Should you cache the system prompt?

A
Yes — caching is always cheaper.
B
No — you pay the write premium with no follow-up to amortize it.
C
Yes, with 1h TTL.
D
Yes, but only on Opus.
Correct. Caching has a write premium (~25% over normal input). One call without a follow-up means you pay the premium for nothing. Only cache when you'll get a hit within 5 minutes.
Look again. Cache writes cost more than uncached. They pay off only when you get cache hits later. One-shots get no benefit.

3. Your cached prefix sometimes gets cache hits and sometimes doesn't. Most likely cause?

A
API outage.
B
You're concatenating today's date or a request ID into the prefix.
C
You changed the model.
D
Caching is non-deterministic.
Correct. Caches match prefix-exactly. Any varying token in the cached region (timestamps, IDs) busts it. Move dynamic content after the cache breakpoint.
Look again. Caching is byte-exact. Inconsistent hits almost always mean a varying token slipped into the cached prefix.

4. You're building a "summarize this PDF" feature and want users to see source spans. Best approach?

A
Ask Claude to add "(see page X)" footnotes manually.
B
Pass the PDF as a document block with citations.enabled: true and read each text block's citations array.
C
Run a regex post-process on Claude's output.
D
Don't bother — users trust the model.
Correct. Built-in citations give you per-sentence source spans, schema-validated, no parsing. It's the only approach that's hallucination-resistant.
Look again. Citations is a built-in feature exactly for this case. Asking for footnotes leaves room for hallucination; the citations API guarantees the cited text exists in the source.

5. Your code-execution sandbox call returns "no module named requests." Why?

A
You forgot to install requests with pip.
B
The sandbox has no internet access; requests is in the bundle but you can't reach external URLs anyway. The error suggests the sandbox version doesn't include it — use stdlib urllib alternatives, but you still can't reach the internet.
C
Anthropic disabled it.
D
Claude is hallucinating.
Correct. The sandbox is offline. Even when packages are present, you can't make HTTP calls. For network-dependent tasks, use a custom tool you implement yourself, not code-execution.
Look again. The key constraint is the sandbox has no internet access. Anything that needs HTTP must run in your own tool, not in the code-execution sandbox.

Module Summary

  • Extended thinking: hidden reasoning trace before the answer. Budgets 2K–32K. Worth it on hard reasoning, not on hooks or quick lookups.
  • Prompt caching: mark stable prefixes with cache_control: {"type": "ephemeral"}. ~10× cheaper on hits, ~25% more expensive on first write.
  • Caching rules: max 4 breakpoints; prefix-exact matching; 5-min default TTL (1h available); minimum 1024 tokens (Sonnet/Opus) or 2048 (Haiku).
  • Order: stable content first (system > tools > docs > few-shot), variable content last.
  • Image & PDF: base64 in image/document blocks. PDFs up to 32 MB / 100 pages.
  • Citations: opt in with citations.enabled: true on document blocks. Per-sentence source spans, no parsing.
  • Code execution: sandboxed Python, offline, with the Files API for inputs/outputs. Use beta header.
  • In Claude Code: caching and basic image/PDF are auto. Thinking via "think hard" keywords. Citations + code-exec require dropping to the API.