// anthropic engineering · feb 2026

Lessons from Building
Claude Code

Prompt Caching Is Everything

Thariq Shafi · Claude Code Team @ Anthropic

01 / 12

01 — the problem

"Cache Rules Everything Around Me"

Long-running agentic products need caching to be feasible
Reuses computation from previous roundtrips
Dramatically decreases latency AND cost
Claude Code is built around prompt caching from day one

We run alerts on cache hit rate and declare SEVs if they're too low

02 / 12

02 — mechanics

Prefix Matching

API caches from start → cache_control breakpoint
Generates cryptographic hash of prefix content
ONE character difference kills the match
Everything past the break → reprocessed at full price

                # The golden rule

                cached_prefix = hash(request[0 : breakpoint])

                if cached_prefix == previous_hash:

                    reuse_computation()  # 10x cheaper

                else:

                    full_reprocess()     # full price 💸

03 / 12

03 — architecture

Order Is Everything

Static first, dynamic last — maximize shared prefix

1. Static System Prompt & Tools globally cached

2. CLAUDE.MD per project

3. Session Context per session

                    4. Conversation Messages
                    variable — never cached
                

"Surprisingly fragile" — a timestamp in the system prompt broke caching across the entire product

04 / 12

04 — lesson

System Messages,
Not System Prompt

✗ Don't

Update system prompt with new info (time, file changes)

→ Invalidates entire cache prefix

✓ Do

Send system message in next conversation turn

→ Cache prefix survives intact

                # Instead of editing system prompt:

                messages.append({

                  "role": "system",

                  "content": "It is now Wednesday"

                })

05 / 12

05 — lesson

Don't Switch Models
Mid-Session

Prompt caches are unique to each model
100K tokens cached on Opus → switching to Haiku = full rebuild
Haiku would actually cost MORE than keeping Opus

Solution: Use subagents with focused handoff messages — Claude Code's Explore agents use Haiku this way

                # Opus prepares handoff for Haiku subagent

                opus.handoff("Summarize this function",

                  model="haiku",

                  context="focused_subset")

06 / 12

06 — design pattern

Plan Mode — A Tool,
Not a Config Switch

✗ Obvious approach

Swap to read-only toolset

→ Cache break! 💥

✓ Claude Code's approach

EnterPlanMode / ExitPlanMode are callable tools

→ Tool definitions never change

System message tells model to restrict to read operations
Model can autonomously enter plan mode on hard problems

07 / 12

07 — design pattern

Tool Search —
Defer, Don't Remove

MCP can load dozens of tools — expensive in prefix
Removing mid-conversation = cache break
Solution: lightweight stubs with defer_loading: true
Same stubs, same order, every request

                # Stub in prefix (always present, never changes)

                {"name": "mcp_database_query",

                 "defer_loading": true}

                # Full schema loaded on-demand via ToolSearch

                ToolSearch("database_query") → full schema

08 / 12

08 — design pattern

Compaction —
Cache-Safe Forking

✗ Naive implementation

Different system prompt, no tools → pays full price for entire history

✓ Cache-safe

Same prefix (prompt + tools + history) → append compaction as final message

Request looks identical to parent's last request
Cache prefix fully reused — only new = compaction prompt
Now built into the API so you don't have to learn this

09 / 12

09 — economics

The Cost Math

Claude Sonnet — per million tokens

Uncached input $3.00

Cache write (1.25×) $3.75

Cached read (0.1×) $0.30

Without cache

$0.60/turn × 50 turns = $30.00

With cache

$0.06/turn × 50 turns = $3.60

10× cost reduction — this is why cache hit rate determines whether the product can exist at its price

10 / 12

10 — independent validation

Manus Confirms
Independently

"KV-cache hit rate is the single most important metric for a production-stage AI agent" — Peak Ji, Manus founder

100:1 input-to-output ratio → 99% cost in input processing
Rebuilt agent framework 4 times before getting caching right
Different approach: token logit masking + filesystem as context
Same rules: stable prefixes, append-only, no tool changes

Leave errors in context — they carry signal that prevents repeating mistakes

11 / 12

// conclusion

Key Takeaways

                # Design your agent around these rules

                1. Design your entire system around prefix matching

                2. Static first, dynamic last — order is everything

                3. System messages > system prompt changes

                4. Never change tools or models mid-session

                5. Monitor cache hit rate like you monitor uptime

                6. Fork operations must share the parent's prefix

"Claude Code is built around prompt caching from day one — you should do the same."

12 / 12

Lessons from BuildingClaude Code