CC10: Claude Features — Thinking, Caching, Citations, Files

Learning Objectives

Enable extended thinking and pick a sensible budget per task type.
Add prompt caching to a long system prompt for 90%+ cost reduction on cache hits.
Recognize the four caching rules and the cache-control breakpoint limit.
Send images and PDFs as message content; understand size and page limits.
Get citations on RAG answers without writing the citation logic yourself.
Run sandboxed Python via the code-execution tool and persist files via the Files API.

CC8 was Claude reaching out to your tools. CC10 is Claude reaching for capabilities Anthropic ships in the box: hidden reasoning, document inputs, server-side Python, citations.

Extended Thinking

Everyday Analogy

Asking a senior engineer a hard architecture question and getting an immediate yes/no answer is suspicious. Ask the same question and watch them go quiet, draw on a whiteboard for two minutes, then come back with a confident answer — that's better. The thinking didn't happen in the answer; it happened before the answer.

Extended thinking is exactly that. Claude produces a hidden reasoning trace before its visible answer. The trace costs tokens, but on hard tasks (architecture, multi-file refactors, ambiguous spec questions) it raises accuracy materially.

Technical Definition

Extended Thinking is a Claude 4.x feature you enable per request via the thinking parameter. The model emits one or more hidden thinking blocks (which you echo back in multi-turn but never display to users) before its text answer. You set a budget_tokens for the thinking; output tokens are capped separately by max_tokens.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content":
        "We have 200 microservices, 80% Java. Should we migrate auth from JWT "
        "to mTLS service-to-service? Walk through the trade-offs."
    }],
)

for block in resp.content:
    if block.type == "thinking":
        # Don't show users; do log for debugging
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)

Picking a budget

Task type	budget_tokens
Trivial (lookup, classification)	Don't enable
Code review of one PR	2,000–4,000
Multi-file refactor planning	8,000–16,000
Architecture / hard debugging	16,000–32,000

Thinking tokens are billed

Thinking tokens count as output tokens at the same rate. A 16K-token thinking budget is real money — only spend it where you'd spend an Opus call anyway. Never enable thinking on Haiku — if the task is hard enough to need thinking, it's worth Sonnet or Opus.

In Claude Code

The CLI maps "think", "think hard", "think deeply", "ultrathink" to budgets ~4K / 8K / 16K / 32K respectively. Subagents you write can set a default budget in their frontmatter (CC6). For hooks that need to make a fast decision, do not enable thinking — the latency hit is too high.

Prompt Caching — The Single Biggest Cost Lever

Everyday Analogy

Imagine a tax accountant who re-reads your entire 200-page corporate filing every time you ask "what's box 14b?" Slow, and you're billed for every re-read. Now imagine they keep the filing on their desk for 5 minutes after each question — same answer, no re-read fee.

That's prompt caching. The first time you send a long stable prefix (a CLAUDE.md, a system prompt, a document), Claude caches it. Subsequent requests within ~5 minutes referencing the same prefix are billed at ~10% of the input cost. The cached prefix doesn't count against your TTFT either; cache hits are noticeably faster.

Technical Definition

Prompt caching is a feature where Anthropic stores a prefix of your prompt for a short window (5 minutes default, 1 hour available). On subsequent requests that share that prefix exactly, the cached portion is billed at a fraction of normal input cost. You opt in by marking cache breakpoints in your prompt with cache_control: {"type": "ephemeral"}.

Caching a long system prompt

long_claude_md = open("CLAUDE.md").read()  # imagine 5K tokens
schema_docs = open("schema.md").read()      # imagine 3K tokens

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_claude_md + "\n\n" + schema_docs,
            "cache_control": {"type": "ephemeral"},   # ← cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "How should I model a refund?"}],
)
print(resp.usage)
# usage shows: cache_creation_input_tokens (first call), cache_read_input_tokens (later calls)

What you save

Cache miss (first request)

8,000 input tokens billed at full rate.

~$0.024

+ writing the cache adds ~25% one-time premium.

Cache hit (subsequent)

8,000 input tokens billed at ~10% rate.

~$0.0024

10× cheaper, faster TTFT.

Numbers are illustrative; check current pricing. The point: for any long-stable prefix served >5 times in a 5-minute window, caching pays for itself many times over.

When NOT to cache

One-shot calls (no second request → you pay the write premium for nothing).
Prefixes shorter than the minimum cacheable size (Sonnet/Opus: 1024 tokens, Haiku: 2048).
Prefixes that change between calls — even one differing token busts the cache.

Caching Rules — The Four That Bite

Cache breakpoints look simple. The gotchas are in the matching rules.

Rule 1: Maximum 4 cache breakpoints per request

You can place up to four cache_control markers. Use them strategically:

Typical 4-breakpoint layout

Position	Content	Why cache it
1	Tool definitions	Stable across all turns of a session.
2	System prompt (CLAUDE.md, role)	Stable across the project.
3	Few-shot examples or large document	Stable across many requests.
4	Conversation history up to last user turn	Reused next turn.

Rule 2: Caches match prefix-exactly

Prepend a single space, change one character, even add a newline — the cache misses entirely. Same input, byte-for-byte, every time for a hit.

Rule 3: TTL is 5 minutes by default; 1 hour optional

{
  "type": "text",
  "text": "...",
  "cache_control": {"type": "ephemeral", "ttl": "1h"}   # 1-hour TTL
}

1-hour caching costs more on the write (~2× ephemeral). Use it for content reused across hours, like a customer onboarding doc.

Rule 4: Caching only kicks in past a minimum size

Sonnet/Opus: ~1024 tokens. Haiku: ~2048 tokens. Below that, the marker is ignored. Don't try to cache a 100-token system prompt — it won't.

Order matters

Caching is prefix-based. If you put a frequently-changing token before a stable one, you've broken the cache for everything after. Always: most stable content first, most variable content last. System prompt > tools > documents > few-shot > history > current user turn.

Image & PDF Input

Claude 4.x can read images and PDFs as part of any user message. They go in content as image blocks alongside text.

Image input

import base64
with open("ui-mockup.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {
            "type": "base64", "media_type": "image/png", "data": img_b64,
        }},
        {"type": "text", "text": "What React component structure would you propose for this UI?"},
    ]}],
)

Supported types: image/png, image/jpeg, image/gif, image/webp.
Max 5 MB per image, max 100 images per request.
Alternative source: {"type": "url", "url": "https://..."} if your image is publicly fetchable.

PDF input

with open("design-doc.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=2048,
    messages=[{"role": "user", "content": [
        {"type": "document", "source": {
            "type": "base64", "media_type": "application/pdf", "data": pdf_b64,
        }},
        {"type": "text", "text": "Summarize the proposed migration plan in 5 bullets."},
    ]}],
)

Up to 32 MB / 100 pages per PDF.
Claude reads both the text and embedded images of the PDF.
For long PDFs, page-range via cache_control on the document so a follow-up question doesn't re-bill the PDF.

In Claude Code

You can drag-drop an image into the CLI prompt — it's encoded and sent as an image block automatically. Useful for "what's wrong with this UI?" or "convert this whiteboard sketch to component code." PDFs work the same way.

Citations — Grounded Answers

If you ship a RAG-style feature ("answer based on these docs"), users want to know which doc a sentence came from. Citations is a built-in feature where Claude annotates each generated sentence with the source span it came from — without you writing the citation logic.

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    messages=[{"role": "user", "content": [
        {
            "type": "document",
            "source": {"type": "text", "data": open("policy.md").read()},
            "title": "Refund Policy v3",
            "context": "Internal policy doc, last updated 2025-08-12",
            "citations": {"enabled": True},   # ← opt in
        },
        {"type": "text", "text": "Can a customer get a refund 35 days after purchase?"},
    ]}],
)

# resp.content has text blocks; each text block can have a `citations` array
# pointing to the document and character ranges that grounded the sentence.
for block in resp.content:
    if block.type == "text":
        print(block.text)
        for cite in (block.citations or []):
            print(f"  → {cite.cited_text!r} (chars {cite.start_char_index}-{cite.end_char_index})")

What citations get you

Per-sentence source spans, automatically (no parsing of "[1]" footnotes).
Reduced hallucination — Claude cannot cite text that isn't in the documents.
UX surface: highlight the cited span in the original doc on hover.

Citations only over document blocks

You must use the document content type with citations.enabled: true. Plain text passed in a user message won't get citations. For Claude Code use cases: wrap your project's docs as document blocks when you want Claude's answers to cite specific lines.

Code Execution & the Files API

The code-execution tool lets Claude run sandboxed Python. The Files API lets you upload files Claude can read or write into. Together they're how you give Claude "do data analysis on this CSV" or "render this chart" capabilities without you writing the executor.

Code execution tool

tools = [{"type": "code_execution_20250522", "name": "code_execution"}]

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=4096,
    tools=tools,
    extra_headers={"anthropic-beta": "code-execution-2025-05-22"},
    messages=[{"role": "user", "content":
        "Generate a histogram of these numbers: 1,1,2,3,3,3,4,4,5,5,5,5,7."
    }],
)
# Claude writes Python that uses matplotlib, the sandbox runs it,
# returns the figure as a tool_result, Claude embeds it in the answer.

Files API — persisting inputs/outputs

file = client.beta.files.upload(
    file=open("sales.csv", "rb"),
)   # one upload, reusable
print(file.id)   # "file_abc123"

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=4096,
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    extra_headers={"anthropic-beta": "code-execution-2025-05-22,files-api-2025-04-14"},
    messages=[{"role": "user", "content": [
        {"type": "container_upload", "file_id": file.id},   # mount file into sandbox
        {"type": "text", "text": "Compute total sales by region. Write the result to result.csv."},
    ]}],
)
# Claude's code_execution writes /mnt/result.csv; you can fetch it via the Files API.

Sandbox guarantees

Ephemeral container, no internet access from inside the sandbox.
Standard data-science Python (numpy, pandas, matplotlib, scipy) preinstalled.
~1 GB RAM, 5-min wallclock per call.
Files persist across tool calls in the same conversation; ephemeral after.

Security model

The sandbox is isolated from your machine and from Anthropic's network. You can safely have Claude execute untrusted code (e.g. user-submitted snippets) as long as you understand: output is what Claude says it is — trust only what you can verify. If you upload sensitive data via Files, treat the file ID as a secret — anyone with it can call Claude with that file.

How Claude Code Uses These Features

Feature	Claude Code use
Extended thinking	Triggered by "think" / "ultrathink" keywords; budgets configurable per subagent.
Prompt caching	CLAUDE.md + tool definitions are cached automatically; you don't add `cache_control` yourself.
Image input	Drag-drop or paste image into the prompt → sent as image block.
PDF input	Same drag-drop. Useful for design docs, RFCs.
Citations	Not exposed in the CLI directly; available if you build a skill that calls the API with documents.
Code execution	Used internally by some skills; you can invoke via a custom skill or MCP server (CC9).

Where you'll add these manually

Skills you write that call the API directly (CC5) and HTTP hooks that call out to Claude (CC7) are where you'll add caching, citations, and thinking budgets explicitly. The CLI handles the basics; for power features, drop down to API.

Hands-On Lab — Cache a Long System Prompt

You'll measure the actual cost difference of prompt caching with a real CLAUDE.md-sized system prompt. Three runs: cold, warm, and broken.

Step 1 — Build a 4K-token system prompt

import json
from anthropic import Anthropic

# Pad with realistic CLAUDE.md content. Aim for >1024 tokens (Sonnet minimum).
SYSTEM = ("# Project: PublicRecords\n\n" +
          ("- All amounts in cents.\n- Use Drizzle ORM.\n- Tests must pass.\n" * 200))

client = Anthropic()

def call(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=128,
        system=[{
            "type": "text", "text": SYSTEM,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )

Step 2 — Cold run (creates the cache)

r1 = call("How should money be represented?")
print(json.dumps(r1.usage.model_dump(), indent=2))
# Expect: cache_creation_input_tokens > 0, cache_read_input_tokens == 0

Step 3 — Warm runs (cache hits)

for q in ["Which ORM?", "Do tests need to pass before commit?",
          "What's the unit for amounts?"]:
    r = call(q)
    print(q, "→", r.usage.model_dump())
    # Expect: cache_creation_input_tokens == 0, cache_read_input_tokens > 0

Step 4 — Bust the cache (rookie mistake)

# Add even ONE extra space at the start of SYSTEM
SYSTEM_BROKEN = " " + SYSTEM
r = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=128,
    system=[{"type": "text", "text": SYSTEM_BROKEN,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Which ORM?"}],
)
print(r.usage.model_dump())
# Cache miss again; cache_creation_input_tokens > 0

Step 5 — Compute the savings

from anthropic.types import Usage
def cost(u: Usage, in_per_m=3.0, cache_in_per_m=3.75, cache_read_per_m=0.30,
         out_per_m=15.0):
    return (u.input_tokens * in_per_m
            + (u.cache_creation_input_tokens or 0) * cache_in_per_m
            + (u.cache_read_input_tokens     or 0) * cache_read_per_m
            + u.output_tokens * out_per_m) / 1_000_000

# Compare cold first call vs a warm one. Expect ~10x savings on input portion.

Claude is hallucinating.

Correct. The sandbox is offline. Even when packages are present, you can't make HTTP calls. For network-dependent tasks, use a custom tool you implement yourself, not code-execution.

Look again. The key constraint is the sandbox has no internet access. Anything that needs HTTP must run in your own tool, not in the code-execution sandbox.

Module Summary

Extended thinking: hidden reasoning trace before the answer. Budgets 2K–32K. Worth it on hard reasoning, not on hooks or quick lookups.
Prompt caching: mark stable prefixes with cache_control: {"type": "ephemeral"}. ~10× cheaper on hits, ~25% more expensive on first write.
Caching rules: max 4 breakpoints; prefix-exact matching; 5-min default TTL (1h available); minimum 1024 tokens (Sonnet/Opus) or 2048 (Haiku).
Order: stable content first (system > tools > docs > few-shot), variable content last.
Image & PDF: base64 in image/document blocks. PDFs up to 32 MB / 100 pages.
Citations: opt in with citations.enabled: true on document blocks. Per-sentence source spans, no parsing.
Code execution: sandboxed Python, offline, with the Files API for inputs/outputs. Use beta header.
In Claude Code: caching and basic image/PDF are auto. Thinking via "think hard" keywords. Citations + code-exec require dropping to the API.