CC10: Claude Features — Thinking, Caching, Citations, Files
The features that make production Claude apps fast, cheap, and trustworthy: extended thinking for hard reasoning, prompt caching for cost (90%+ savings on stable prefixes), image/PDF input, citations for grounded answers, and the code-execution + Files API for running Python sandboxed.
Learning Objectives
- Enable extended thinking and pick a sensible budget per task type.
- Add prompt caching to a long system prompt for 90%+ cost reduction on cache hits.
- Recognize the four caching rules and the cache-control breakpoint limit.
- Send images and PDFs as message content; understand size and page limits.
- Get citations on RAG answers without writing the citation logic yourself.
- Run sandboxed Python via the code-execution tool and persist files via the Files API.
Extended Thinking
Asking a senior engineer a hard architecture question and getting an immediate yes/no answer is suspicious. Ask the same question and watch them go quiet, draw on a whiteboard for two minutes, then come back with a confident answer — that's better. The thinking didn't happen in the answer; it happened before the answer.
Extended thinking is exactly that. Claude produces a hidden reasoning trace before its visible answer. The trace costs tokens, but on hard tasks (architecture, multi-file refactors, ambiguous spec questions) it raises accuracy materially.
Extended Thinking is a Claude 4.x feature you enable per request via the thinking parameter. The model emits one or more hidden thinking blocks (which you echo back in multi-turn but never display to users) before its text answer. You set a budget_tokens for the thinking; output tokens are capped separately by max_tokens.
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=4096,
thinking={"type": "enabled", "budget_tokens": 8000},
messages=[{"role": "user", "content":
"We have 200 microservices, 80% Java. Should we migrate auth from JWT "
"to mTLS service-to-service? Walk through the trade-offs."
}],
)
for block in resp.content:
if block.type == "thinking":
# Don't show users; do log for debugging
print("[thinking]", block.thinking[:200], "...")
elif block.type == "text":
print("[answer]", block.text)
Picking a budget
| Task type | budget_tokens |
|---|---|
| Trivial (lookup, classification) | Don't enable |
| Code review of one PR | 2,000–4,000 |
| Multi-file refactor planning | 8,000–16,000 |
| Architecture / hard debugging | 16,000–32,000 |
Thinking tokens count as output tokens at the same rate. A 16K-token thinking budget is real money — only spend it where you'd spend an Opus call anyway. Never enable thinking on Haiku — if the task is hard enough to need thinking, it's worth Sonnet or Opus.
The CLI maps "think", "think hard", "think deeply", "ultrathink" to budgets ~4K / 8K / 16K / 32K respectively. Subagents you write can set a default budget in their frontmatter (CC6). For hooks that need to make a fast decision, do not enable thinking — the latency hit is too high.
Prompt Caching — The Single Biggest Cost Lever
Imagine a tax accountant who re-reads your entire 200-page corporate filing every time you ask "what's box 14b?" Slow, and you're billed for every re-read. Now imagine they keep the filing on their desk for 5 minutes after each question — same answer, no re-read fee.
That's prompt caching. The first time you send a long stable prefix (a CLAUDE.md, a system prompt, a document), Claude caches it. Subsequent requests within ~5 minutes referencing the same prefix are billed at ~10% of the input cost. The cached prefix doesn't count against your TTFT either; cache hits are noticeably faster.
Prompt caching is a feature where Anthropic stores a prefix of your prompt for a short window (5 minutes default, 1 hour available). On subsequent requests that share that prefix exactly, the cached portion is billed at a fraction of normal input cost. You opt in by marking cache breakpoints in your prompt with cache_control: {"type": "ephemeral"}.
Caching a long system prompt
long_claude_md = open("CLAUDE.md").read() # imagine 5K tokens
schema_docs = open("schema.md").read() # imagine 3K tokens
resp = client.messages.create(
model="claude-sonnet-4-6", max_tokens=1024,
system=[
{
"type": "text",
"text": long_claude_md + "\n\n" + schema_docs,
"cache_control": {"type": "ephemeral"}, # ← cache this prefix
},
],
messages=[{"role": "user", "content": "How should I model a refund?"}],
)
print(resp.usage)
# usage shows: cache_creation_input_tokens (first call), cache_read_input_tokens (later calls)
What you save
Cache miss (first request)
8,000 input tokens billed at full rate.
~$0.024
+ writing the cache adds ~25% one-time premium.
Cache hit (subsequent)
8,000 input tokens billed at ~10% rate.
~$0.0024
10× cheaper, faster TTFT.
Numbers are illustrative; check current pricing. The point: for any long-stable prefix served >5 times in a 5-minute window, caching pays for itself many times over.
- One-shot calls (no second request → you pay the write premium for nothing).
- Prefixes shorter than the minimum cacheable size (Sonnet/Opus: 1024 tokens, Haiku: 2048).
- Prefixes that change between calls — even one differing token busts the cache.
Caching Rules — The Four That Bite
Cache breakpoints look simple. The gotchas are in the matching rules.
Rule 1: Maximum 4 cache breakpoints per request
You can place up to four cache_control markers. Use them strategically:
| Position | Content | Why cache it |
|---|---|---|
| 1 | Tool definitions | Stable across all turns of a session. |
| 2 | System prompt (CLAUDE.md, role) | Stable across the project. |
| 3 | Few-shot examples or large document | Stable across many requests. |
| 4 | Conversation history up to last user turn | Reused next turn. |
Rule 2: Caches match prefix-exactly
Prepend a single space, change one character, even add a newline — the cache misses entirely. Same input, byte-for-byte, every time for a hit.
Rule 3: TTL is 5 minutes by default; 1 hour optional
{
"type": "text",
"text": "...",
"cache_control": {"type": "ephemeral", "ttl": "1h"} # 1-hour TTL
}
1-hour caching costs more on the write (~2× ephemeral). Use it for content reused across hours, like a customer onboarding doc.
Rule 4: Caching only kicks in past a minimum size
Sonnet/Opus: ~1024 tokens. Haiku: ~2048 tokens. Below that, the marker is ignored. Don't try to cache a 100-token system prompt — it won't.
Caching is prefix-based. If you put a frequently-changing token before a stable one, you've broken the cache for everything after. Always: most stable content first, most variable content last. System prompt > tools > documents > few-shot > history > current user turn.
Image & PDF Input
Claude 4.x can read images and PDFs as part of any user message. They go in content as image blocks alongside text.
Image input
import base64
with open("ui-mockup.png", "rb") as f:
img_b64 = base64.standard_b64encode(f.read()).decode()
resp = client.messages.create(
model="claude-sonnet-4-6", max_tokens=1024,
messages=[{"role": "user", "content": [
{"type": "image", "source": {
"type": "base64", "media_type": "image/png", "data": img_b64,
}},
{"type": "text", "text": "What React component structure would you propose for this UI?"},
]}],
)
- Supported types:
image/png,image/jpeg,image/gif,image/webp. - Max 5 MB per image, max 100 images per request.
- Alternative source:
{"type": "url", "url": "https://..."}if your image is publicly fetchable.
PDF input
with open("design-doc.pdf", "rb") as f:
pdf_b64 = base64.standard_b64encode(f.read()).decode()
resp = client.messages.create(
model="claude-sonnet-4-6", max_tokens=2048,
messages=[{"role": "user", "content": [
{"type": "document", "source": {
"type": "base64", "media_type": "application/pdf", "data": pdf_b64,
}},
{"type": "text", "text": "Summarize the proposed migration plan in 5 bullets."},
]}],
)
- Up to 32 MB / 100 pages per PDF.
- Claude reads both the text and embedded images of the PDF.
- For long PDFs, page-range via
cache_controlon the document so a follow-up question doesn't re-bill the PDF.
You can drag-drop an image into the CLI prompt — it's encoded and sent as an image block automatically. Useful for "what's wrong with this UI?" or "convert this whiteboard sketch to component code." PDFs work the same way.
Citations — Grounded Answers
If you ship a RAG-style feature ("answer based on these docs"), users want to know which doc a sentence came from. Citations is a built-in feature where Claude annotates each generated sentence with the source span it came from — without you writing the citation logic.
resp = client.messages.create(
model="claude-sonnet-4-6", max_tokens=1024,
messages=[{"role": "user", "content": [
{
"type": "document",
"source": {"type": "text", "data": open("policy.md").read()},
"title": "Refund Policy v3",
"context": "Internal policy doc, last updated 2025-08-12",
"citations": {"enabled": True}, # ← opt in
},
{"type": "text", "text": "Can a customer get a refund 35 days after purchase?"},
]}],
)
# resp.content has text blocks; each text block can have a `citations` array
# pointing to the document and character ranges that grounded the sentence.
for block in resp.content:
if block.type == "text":
print(block.text)
for cite in (block.citations or []):
print(f" → {cite.cited_text!r} (chars {cite.start_char_index}-{cite.end_char_index})")
What citations get you
- Per-sentence source spans, automatically (no parsing of "[1]" footnotes).
- Reduced hallucination — Claude cannot cite text that isn't in the documents.
- UX surface: highlight the cited span in the original doc on hover.
document blocks
You must use the document content type with citations.enabled: true. Plain text passed in a user message won't get citations. For Claude Code use cases: wrap your project's docs as document blocks when you want Claude's answers to cite specific lines.
Code Execution & the Files API
The code-execution tool lets Claude run sandboxed Python. The Files API lets you upload files Claude can read or write into. Together they're how you give Claude "do data analysis on this CSV" or "render this chart" capabilities without you writing the executor.
Code execution tool
tools = [{"type": "code_execution_20250522", "name": "code_execution"}]
resp = client.messages.create(
model="claude-sonnet-4-6", max_tokens=4096,
tools=tools,
extra_headers={"anthropic-beta": "code-execution-2025-05-22"},
messages=[{"role": "user", "content":
"Generate a histogram of these numbers: 1,1,2,3,3,3,4,4,5,5,5,5,7."
}],
)
# Claude writes Python that uses matplotlib, the sandbox runs it,
# returns the figure as a tool_result, Claude embeds it in the answer.
Files API — persisting inputs/outputs
file = client.beta.files.upload(
file=open("sales.csv", "rb"),
) # one upload, reusable
print(file.id) # "file_abc123"
resp = client.messages.create(
model="claude-sonnet-4-6", max_tokens=4096,
tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
extra_headers={"anthropic-beta": "code-execution-2025-05-22,files-api-2025-04-14"},
messages=[{"role": "user", "content": [
{"type": "container_upload", "file_id": file.id}, # mount file into sandbox
{"type": "text", "text": "Compute total sales by region. Write the result to result.csv."},
]}],
)
# Claude's code_execution writes /mnt/result.csv; you can fetch it via the Files API.
Sandbox guarantees
- Ephemeral container, no internet access from inside the sandbox.
- Standard data-science Python (numpy, pandas, matplotlib, scipy) preinstalled.
- ~1 GB RAM, 5-min wallclock per call.
- Files persist across tool calls in the same conversation; ephemeral after.
The sandbox is isolated from your machine and from Anthropic's network. You can safely have Claude execute untrusted code (e.g. user-submitted snippets) as long as you understand: output is what Claude says it is — trust only what you can verify. If you upload sensitive data via Files, treat the file ID as a secret — anyone with it can call Claude with that file.
How Claude Code Uses These Features
| Feature | Claude Code use |
|---|---|
| Extended thinking | Triggered by "think" / "ultrathink" keywords; budgets configurable per subagent. |
| Prompt caching | CLAUDE.md + tool definitions are cached automatically; you don't add cache_control yourself. |
| Image input | Drag-drop or paste image into the prompt → sent as image block. |
| PDF input | Same drag-drop. Useful for design docs, RFCs. |
| Citations | Not exposed in the CLI directly; available if you build a skill that calls the API with documents. |
| Code execution | Used internally by some skills; you can invoke via a custom skill or MCP server (CC9). |
Skills you write that call the API directly (CC5) and HTTP hooks that call out to Claude (CC7) are where you'll add caching, citations, and thinking budgets explicitly. The CLI handles the basics; for power features, drop down to API.
Hands-On Lab — Cache a Long System Prompt
You'll measure the actual cost difference of prompt caching with a real CLAUDE.md-sized system prompt. Three runs: cold, warm, and broken.
Step 1 — Build a 4K-token system prompt
import json
from anthropic import Anthropic
# Pad with realistic CLAUDE.md content. Aim for >1024 tokens (Sonnet minimum).
SYSTEM = ("# Project: PublicRecords\n\n" +
("- All amounts in cents.\n- Use Drizzle ORM.\n- Tests must pass.\n" * 200))
client = Anthropic()
def call(question: str):
return client.messages.create(
model="claude-sonnet-4-6", max_tokens=128,
system=[{
"type": "text", "text": SYSTEM,
"cache_control": {"type": "ephemeral"},
}],
messages=[{"role": "user", "content": question}],
)
Step 2 — Cold run (creates the cache)
r1 = call("How should money be represented?")
print(json.dumps(r1.usage.model_dump(), indent=2))
# Expect: cache_creation_input_tokens > 0, cache_read_input_tokens == 0
Step 3 — Warm runs (cache hits)
for q in ["Which ORM?", "Do tests need to pass before commit?",
"What's the unit for amounts?"]:
r = call(q)
print(q, "→", r.usage.model_dump())
# Expect: cache_creation_input_tokens == 0, cache_read_input_tokens > 0
Step 4 — Bust the cache (rookie mistake)
# Add even ONE extra space at the start of SYSTEM
SYSTEM_BROKEN = " " + SYSTEM
r = client.messages.create(
model="claude-sonnet-4-6", max_tokens=128,
system=[{"type": "text", "text": SYSTEM_BROKEN,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": "Which ORM?"}],
)
print(r.usage.model_dump())
# Cache miss again; cache_creation_input_tokens > 0
Step 5 — Compute the savings
from anthropic.types import Usage
def cost(u: Usage, in_per_m=3.0, cache_in_per_m=3.75, cache_read_per_m=0.30,
out_per_m=15.0):
return (u.input_tokens * in_per_m
+ (u.cache_creation_input_tokens or 0) * cache_in_per_m
+ (u.cache_read_input_tokens or 0) * cache_read_per_m
+ u.output_tokens * out_per_m) / 1_000_000
# Compare cold first call vs a warm one. Expect ~10x savings on input portion.
Three sets of usage numbers proving: (1) cold call writes the cache at a 25% premium, (2) warm calls are ~10× cheaper on the cached portion, (3) one extra space busts the cache entirely. For any skill or hook that reuses a prompt prefix more than 5 times in 5 minutes, caching is free money — you just have to enable it.
Knowledge Check
1. You have a hook that runs on every file save and needs to make a fast block/allow decision. Should you enable extended thinking?
2. Your one-shot CLI tool runs Claude once per invocation, no follow-up. Should you cache the system prompt?
3. Your cached prefix sometimes gets cache hits and sometimes doesn't. Most likely cause?
4. You're building a "summarize this PDF" feature and want users to see source spans. Best approach?
document block with citations.enabled: true and read each text block's citations array.5. Your code-execution sandbox call returns "no module named requests." Why?
requests is in the bundle but you can't reach external URLs anyway. The error suggests the sandbox version doesn't include it — use stdlib urllib alternatives, but you still can't reach the internet.Module Summary
- Extended thinking: hidden reasoning trace before the answer. Budgets 2K–32K. Worth it on hard reasoning, not on hooks or quick lookups.
- Prompt caching: mark stable prefixes with
cache_control: {"type": "ephemeral"}. ~10× cheaper on hits, ~25% more expensive on first write. - Caching rules: max 4 breakpoints; prefix-exact matching; 5-min default TTL (1h available); minimum 1024 tokens (Sonnet/Opus) or 2048 (Haiku).
- Order: stable content first (system > tools > docs > few-shot), variable content last.
- Image & PDF: base64 in
image/documentblocks. PDFs up to 32 MB / 100 pages. - Citations: opt in with
citations.enabled: trueon document blocks. Per-sentence source spans, no parsing. - Code execution: sandboxed Python, offline, with the Files API for inputs/outputs. Use beta header.
- In Claude Code: caching and basic image/PDF are auto. Thinking via "think hard" keywords. Citations + code-exec require dropping to the API.