I spent the last year building AI automation workflows with Claude. Voice agents. n8n pipelines. Client chatbots. Somewhere around month six, my API bills started making no sense. I was burning through tokens on stuff that should have been cheap. Conversations were getting sluggish. Claude would forget constraints I had clearly stated three messages ago.
The problem was not Claude. The problem was how I was using it.
Most people treat Claude like a chat window. Type a question, get an answer, move on. But in 2026, Claude is an engineering tool. Claude Code alone represents roughly $2.5 billion of Anthropic's revenue, and according to the 2025 Stack Overflow Developer Survey, about 70% of developers now prefer Claude for coding tasks specifically. That kind of adoption does not happen by accident.
The gap between casual users and people who actually get results comes down to one thing: treating prompt engineering and token management as an empirical discipline. You test. You measure. You cut what does not work.
Here are 17 techniques I use daily, sorted by how much they actually move the needle.
Tier 1: The Highest Leverage Moves
1. Structure Everything Around Prompt Caching
This is the single biggest cost optimization available on the Claude API right now, and most people either skip it or implement it wrong.
The concept is simple. Place your static content (system instructions, tool definitions, reference documents) at the very beginning of your prompt and keep that prefix identical across requests. Anthropic caches that prefix server side. The first request pays full price. Every subsequent request that hits the cache pays about 10% of the normal input price.
The real numbers from production, pulled from published case studies: ProjectDiscovery saw their cache hit rates climb from 7% to 74% in a single deployment after they moved dynamic content out of the cacheable prefix. A blog automation pipeline running about 30 to 50 API calls daily on Claude Sonnet 4.6 dropped monthly costs from $40 to $60 down to $15 to $20 after implementing proper caching with batched operations.
But here is the part most guides leave out. Anthropic reduced the default cache TTL from 60 minutes down to 5 minutes in early 2026. That single change increased effective API costs by 30% to 60% for teams that were not paying attention. A background worker that used to write cache once and read it 20 times in an hour now needs to hit the cache at least twice within five minutes just to break even.
The fix: batch your API calls into tight loops that complete within the 5 minute window. If you are running translation pipelines, document processing, or multi step agent chains, structure them so all related calls fire in sequence rather than spread across the hour.
Common mistakes that silently kill your cache hit rate:
- Putting timestamps like "Current time: 2026-05-13T14:32:15Z" in your system prompt. This invalidates the cache on every single request. Move timestamps to the user message.
- Including user specific content like names or company details in the cached prefix. Every unique user gets a cache miss. Move personalization to the user message block.
- Inconsistent whitespace in your prompt builder. Normalize aggressively or you will see 80% cache misses for no technical reason at all.
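The cache-friendly structure above can be sketched as a request builder. The model id and prompt text are placeholders; the `cache_control` block follows the documented Messages API shape, with everything dynamic pushed into the user message:

```python
from datetime import datetime, timezone

# Static content: identical on every request, so it is eligible for caching.
SYSTEM_PROMPT = (
    "You are a support agent for Acme Corp (a placeholder).\n"
    "Follow the internal style guide when answering."
)

def build_request(user_name: str, question: str) -> dict:
    """Build a Messages API payload with a cacheable static prefix.

    Everything dynamic (timestamp, user name, question) lives in the user
    message, so the system block stays byte-identical across calls.
    """
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # Dynamic values go here, never in the cached prefix.
                "content": f"Current time: {now}\nUser: {user_name}\n\n{question}",
            }
        ],
    }

# Two calls for different users share the exact same cacheable prefix.
a = build_request("Alice", "How do I reset my password?")
b = build_request("Bob", "Where is my invoice?")
```

If the `system` blocks for any two users ever differ by a single byte, you have a cache leak; asserting prefix equality in a unit test is a cheap guardrail.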
2. Keep CLAUDE.md Under 200 Lines (Ideally Under 500 Tokens)
Your CLAUDE.md file loads before Claude reads your code, before it reads your task, before anything else. It persists in the context window for the entire session and is never lazy loaded or evicted. A 5,000 token CLAUDE.md costs 5,000 tokens on every single turn, whether you send 2 messages or 200.
Boris Cherny, the creator of Claude Code, keeps his own CLAUDE.md at roughly 2,500 tokens (about 100 lines). His team ships Claude Code itself with that setup. If the person who built the tool keeps it that lean, you probably should too.
The rule of thumb from Anthropic's own documentation: for each line, ask "Would removing this cause Claude to make mistakes?" If not, cut it.
What belongs in CLAUDE.md: package manager, test command, typecheck command, main source directories, and a handful of non negotiable rules. Five rules and three file pointers is the right size.
What does not belong: meeting notes, design history, implementation guides, domain knowledge that is only sometimes relevant. Move specialized workflows into skills (.claude/skills/) that load on demand. Move module specific rules into path scoped rule files (.claude/rules/) so they only activate when Claude edits files in that directory.
One team at Marmelab reported that when their CLAUDE.md grew too long, Claude would straight up ignore rules marked with MUST in all caps. Shortening the file fixed the adherence problem immediately.
3. Switch Models and Adjust Effort by Task Complexity
Not every task requires Claude Opus. Opus 4.6 costs $5 per million input tokens and $25 per million output tokens. Sonnet 4.6 costs $3/$15. Haiku 4.5 costs $1/$5. Output tokens are 5x input across all models.
The practical split that works: use Haiku for brainstorming, quick exploration, and simple tasks. Use Sonnet for heavy implementation (this should be your default). Reserve Opus for complex architectural reasoning and tasks where nuance actually matters.
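That split is easy to encode as a tiny router on the API side. Model ids here are illustrative; the per-million-token prices are the ones listed above:

```python
# Illustrative model ids; (input, output) prices in USD per million tokens,
# taken from the figures above.
PRICING = {
    "claude-haiku-4-5": (1, 5),
    "claude-sonnet-4-6": (3, 15),
    "claude-opus-4-6": (5, 25),
}

def pick_model(task_kind: str) -> str:
    """Route exploration to Haiku, implementation to Sonnet (the default),
    and architectural reasoning to Opus."""
    if task_kind in {"brainstorm", "explore", "summarize"}:
        return "claude-haiku-4-5"
    if task_kind in {"architecture", "design-review"}:
        return "claude-opus-4-6"
    return "claude-sonnet-4-6"

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the listed per-million-token rates."""
    inp, out = PRICING[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A 200k-in / 50k-out brainstorming session, routed vs forced onto Opus:
routed = estimate_cost(pick_model("brainstorm"), 200_000, 50_000)
on_opus = estimate_cost("claude-opus-4-6", 200_000, 50_000)
```

Running the arithmetic at these rates, the routed call costs a fraction of the Opus call, which is the whole argument for routing by task rather than habit.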
One developer documented saving roughly 67% of tokens by running brainstorming sessions on Haiku instead of Opus. The quality difference for exploratory work was negligible.
Inside Claude Code, use the /model command to switch models mid session. Use /effort to dial down the thinking budget for straightforward tasks. Low effort on a variable rename saves real money over a full day of coding.
Also watch out for Opus 4.7's new tokenizer. It can produce up to 35% more tokens for the same input text compared to previous versions. Your per token rate stays the same, but your effective bill per request goes up if you are not tracking actual token counts.
4. Use Plan Mode Before Expensive Operations
This one is free and it prevents the single biggest source of wasted tokens: trial and error execution.
Press Shift+Tab in Claude Code to toggle Plan Mode. Claude outputs a step by step plan without making any actual file changes. You review the plan, edit it (Ctrl+G opens it in your editor for direct modification), cut anything unnecessary, then switch back to normal mode and let Claude execute.
Without Plan Mode, Claude tries an approach, hits an error, backtracks, tries something else. Every iteration costs tokens. With Plan Mode, you catch wrong assumptions before implementation begins. Correcting a plan costs almost nothing. Unwinding a half finished feature costs everything.
A team at Marmelab made this their first rule: "Use plan mode for anything complex. Before Claude writes a single line of code, let it lay out its approach."
Tier 2: Real World Codebase Management
5. Write Characterization Tests Before Refactoring Legacy Code
Claude does not know what your legacy code is supposed to do. It only sees what it currently does. Before you refactor anything, have Claude write minimal characterization tests that capture the existing outputs. Run these tests continuously as you incrementally refactor to catch regressions the moment they happen.
This is not theoretical. A game developer using Claude Code on a Godot project (a 2D space shooter called "Mobsters") described running two parallel Claude sessions: one refactoring existing code while the other wrote and ran tests against the current behavior. The tests acted as guardrails that prevented silent breakage across sessions.
The prompt is simple: "Write characterization tests for [module]. Capture the current input/output behavior exactly as it is, even if the behavior seems wrong. We will fix it after we have the safety net."
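In code, a characterization test pins down what the function does, not what it should do. `legacy_discount` below is a stand-in for whatever you are about to refactor:

```python
# Characterization tests capture CURRENT behavior, quirks included.
# legacy_discount is a hypothetical stand-in for the real legacy function.

def legacy_discount(total: float, code: str) -> float:
    """Legacy pricing logic, warts and all."""
    if code == "VIP":
        return total * 0.8
    if code == "vip":  # quirk: lowercase silently gets no discount
        return total
    return total

def test_characterization_vip_uppercase():
    assert abs(legacy_discount(100.0, "VIP") - 80.0) < 1e-9

def test_characterization_lowercase_quirk():
    # This looks like a bug, but we capture it anyway. Fix it AFTER the
    # refactor, in its own commit, with this test updated on purpose.
    assert abs(legacy_discount(100.0, "vip") - 100.0) < 1e-9

test_characterization_vip_uppercase()
test_characterization_lowercase_quirk()
```

Run these on every incremental refactor step; the moment one fails, you know exactly which change broke observable behavior.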
6. Isolate Instructions with Path Scoped Rules
If your CLAUDE.md keeps growing because different parts of your codebase need different instructions, stop trying to cram everything into one file. Create rule files in .claude/rules/ with YAML frontmatter specifying which file paths they apply to.
Example: a rule file at .claude/rules/api-validation.md with frontmatter paths: ["src/api/**/*.ts"] will only load when Claude edits files in your API directory. Your database migration rules load only when Claude touches migration files. Your frontend component rules load only for component files.
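As a sketch, that rule file might look like the following. The `paths` frontmatter key comes from the example above; the rule text itself is illustrative, and the exact schema is worth verifying against the current Claude Code docs:

```markdown
---
paths: ["src/api/**/*.ts"]
---

# API validation rules

- Validate every request body before touching the database.
- Return a structured error payload on validation failure; never throw raw.
```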
This keeps your context window clean. Claude only carries the rules it actually needs for the current task. According to Anthropic's documentation, CLAUDE.md files in subdirectories follow a cascade where the most specific scope wins on conflicts.
7. Master Git Worktrees for Parallel Sessions
Sequential work is a bottleneck. If you are waiting for one Claude Code task to finish before starting the next, you are leaving velocity on the table.
Git worktrees let you create separate working directories that share the same repository history but operate on independent branches. Claude Code has first class support for this with the --worktree flag (shorthand: -w).
The setup in practice: open Terminal 1 and run claude -w feature-payments. Open Terminal 2 and run claude -w bugfix-auth. You now have two fully isolated Claude sessions running in parallel, each on their own branch, each with their own files on disk. Zero interference.
A fintech engineer profiled in a Code With Seb article runs six parallel Claude Code sessions simultaneously: one writing tests, one refactoring a service, one drafting a migration, one reviewing a PR, and two in plan mode for upcoming work. He reports shipping at roughly 3x throughput.
The practical ceiling for most developers is 2 to 4 parallel sessions before review overhead becomes its own problem. The key rule to prevent merge conflicts: scope worktrees by module, not by task. All billing related tasks go to the billing worktree. All auth tasks go to the auth worktree. Two Claude sessions never edit the same file because the filesystem boundary prevents it.
Important gotcha: a worktree is a fresh checkout, so untracked files like .env are not present by default. Create a .worktreeinclude file in your project root to automatically copy env files and secrets configs into each new worktree.
8. Deploy Specialized Sub Agents
A single agent asked to write code, review for security, and check layout will often provide a shallow analysis on all three. The better pattern: spawn an adversarial "Critic" subagent with read only permissions to audit the work of a "Fixer" agent.
Because the read only Critic cannot write code, it has no incentive to gloss over mistakes. It exists only to find problems.
Beyond the Critic/Fixer pattern, Claude Code subagents solve a fundamental context management problem. When Claude researches your codebase, it reads lots of files, and all of them consume your context window. Subagents run in separate context windows and report back summaries. The main session stays clean for implementation.
The prompt: "Use subagents to investigate how our authentication system handles token refresh, and whether we have any existing OAuth utilities I should reuse."
Define permanent subagents as markdown files at .claude/agents/investigator.md for tasks you run frequently. One team built a "self driving documentation" system: a Claude Code subagent combined with Playwright that automatically explores their software, identifies knowledge gaps in documentation, and creates changes by itself.
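A permanent subagent definition is just a markdown file with frontmatter. A minimal read-only investigator might look like this (field names follow Claude Code's subagent format as I understand it; treat them as something to double-check):

```markdown
---
name: investigator
description: Read-only codebase researcher. Use for questions about existing code.
tools: Read, Grep, Glob
---

You investigate the codebase and report back a concise summary.
You never modify files. Always cite file paths and line numbers in
your findings so the main session can jump straight to them.
```

Note the tool list grants no write access, which is what makes the Critic/investigator pattern trustworthy.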
9. Reference Exact Line Numbers Instead of Exploring
Every vague instruction like "look around the repo" triggers expensive, token heavy exploration. Claude opens files, reads dead ends, and reconstructs context you could have handed it directly.
Instead of "find the authentication bug," write: "Compare src/auth/session.ts lines 30 to 90 with src/api/login.ts lines 10 to 60 and explain the mismatch."
Instead of asking Claude to rewrite an entire 300 line file for a 15 line change, explicitly request diff style output. You save tokens on both the input (Claude reads less) and the output (Claude writes less).
Use @ to reference files directly in your prompts. By typing @path/to/file.ts, Claude loads the referenced file directly into context without having to search for it first.
10. Strip Files and Exclude Irrelevance
Before including files in your context, remove dead code, unused imports, and non logic comment blocks. Create a .claudeignore file to permanently block Claude from reading token heavy, irrelevant files.
Common candidates for .claudeignore: package-lock.json, yarn.lock, node_modules, build output directories, massive test fixtures, and generated code files. These files can be thousands of tokens each and Claude gains nothing from reading them.
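A starting .claudeignore along those lines (patterns are examples; adjust to your stack):

```
package-lock.json
yarn.lock
node_modules/
dist/
build/
coverage/
**/__fixtures__/
*.generated.ts
```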
Filter log outputs before Claude sees them. Do not feed raw logs into the chat. Pipe through grep to extract only the error lines:
```bash
pnpm test 2>&1 | grep -A 5 -E "FAIL|ERROR|Error|failed" | head -120
```
This cuts log related token usage dramatically.
Tier 3: Workflow and Memory Continuity
11. Run Proactive Token Audits with /context
If your token usage feels mysteriously high, run /context before optimizing anything else. This command reveals "quiet offenders": a massive log dump from 20 turns ago, a huge file Claude read early in the session, or MCP tool overhead you did not realize was accumulating.
You can also set up a live status line in your terminal that shows real time context percentage and model costs. Add this to your Claude settings:
```json
{
  "statusLine": {
    "type": "command",
    "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context\"'"
  }
}
```
This prevents surprise token spikes by making context consumption visible at all times.
12. Compact Proactively with Memory Anchors
Do not wait for Claude to auto compact at 95% capacity. By then, the model is already losing granular details and the resulting summary is messier than it needs to be.
Run /compact proactively after completing each sub task while the session is still "healthy." The summary will be cleaner and more useful.
Before compacting, explicitly tell Claude what to preserve: "Before we compact, note that we decided to use optimistic locking for the booking system, and the three files we still need to modify are X, Y, and Z."
You can also customize compaction behavior directly in your CLAUDE.md with instructions like "When compacting, always preserve the full list of modified files and any test commands."
13. Create Session Handoff Prompts
When a session gets too long, ask Claude to write a "session handoff note" summarizing what was built, key decisions, and next steps. Save this document and feed it into a brand new session to transfer context at a fraction of the token cost.
This is especially powerful combined with the /btw command for quick questions. The answer appears in a dismissible overlay and never enters conversation history, so you can check details without growing context.
For teams, one approach that is gaining traction: ask Claude at the end of each session to generate a structured handoff that includes files modified, decisions made, open questions, and the exact next step. Paste this into the new session's first message. The new session picks up exactly where the old one left off.
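The structured handoff can be as simple as a fixed template you ask Claude to fill in at the end of every session (field names taken from the list above; the placeholders are yours to fill):

```markdown
## Session handoff — <date>

**Built:** <what shipped this session>
**Decisions:** <key choices and why>
**Files modified:** <paths>
**Open questions:** <anything unresolved>
**Exact next step:** <the single next action>
```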
14. Build Stateful, Interactive Artifacts
Claude Artifacts now support persistent storage with a key value API (up to 5MB per key) and can connect to external services through Model Context Protocol (MCP) servers, including live integrations with tools like Google Calendar, Gmail, and Slack.
This means you can build fully interactive micro apps inside Claude: task trackers that remember your filters across sessions, analytical dashboards that pull live data from your connected services, or project management widgets that persist state between conversations.
The storage API is straightforward: use window.storage.get(), .set(), .list(), and .delete() with hierarchical keys (like "todos:task_1" or "metrics:2026-05"). Personal data is private by default. Shared data (visible to all users of an artifact) requires explicitly passing shared: true.
One practical use: a financial tracker that logs expenses via a Telegram bot, stores them in Google Sheets through an n8n workflow, and surfaces the data in a Claude Artifact dashboard with filters that persist between sessions.
Tier 4: Output Formatting and Prompt Tuning
15. Scaffold Prompts with XML Tags
Anthropic's models are specifically optimized to parse XML tags. Wrapping distinct parts of your prompt in tags like <task>, <context>, and <examples> provides unambiguous boundaries that maximize information retrieval accuracy.
This is not just theoretical advice. Anthropic's own system prompts use this structure extensively. When you look at how Claude Code's internal prompts are constructed, they are heavily tagged with XML to separate instructions, context, rules, and examples into distinct addressable blocks.
The practical benefit: when Claude encounters a tagged prompt, it knows exactly where to look for each piece of information. Ambiguity drops. Hallucination rates drop. Token efficiency goes up because Claude does not need to "guess" which part of the prompt applies to the current subtask.
16. Pre fill Claude's Responses
When calling the Claude API, you can pre fill the beginning of Claude's response in the Assistant role. If you need JSON output, start the assistant message with {. If you need a specific XML schema, start with the opening tag like <analysis>.
This technique instantly bypasses Claude's conversational preamble ("Sure! I'd be happy to help you with...") and forces it directly into the structure you need. You save tokens on the output side and get more predictable, parseable results.
This is API only. You cannot do this in the chat interface. But if you are building production systems, it is one of the simplest ways to guarantee format compliance.
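In the Messages API, a prefill is nothing more than a trailing assistant turn. A sketch of the payload (model id is illustrative; remember to prepend the prefill back before parsing the reply):

```python
import json

def build_json_request(document: str) -> dict:
    """Force JSON output by prefilling the assistant turn with '{'.

    Claude continues from the prefill, so the reply starts mid-object with
    no conversational preamble. Prepend the '{' before parsing the result.
    """
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 512,
        "messages": [
            {
                "role": "user",
                "content": f"Extract title and author as JSON:\n\n{document}",
            },
            # The prefill: Claude's reply continues right after this "{".
            {"role": "assistant", "content": "{"},
        ],
    }

req = build_json_request("Moby-Dick by Herman Melville")

# Simulate stitching a hypothetical continuation back onto the prefill:
reply_text = '"title": "Moby-Dick", "author": "Herman Melville"}'
parsed = json.loads("{" + reply_text)
```

One caveat worth knowing: the prefill must not end in trailing whitespace, or the API rejects the request.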
17. Set Hard Constraints Instead of Vague Instructions
"Be concise" is vague and Claude will interpret it differently every time. "Answer in 3 sentences max" is a constraint Claude will actually follow.
Instead of "keep it short," specify "respond in under 100 words." Instead of "give me a summary," specify "summarize in exactly 5 key points, each under 20 words."
You can also set this at the system prompt level with what some developers call "Absolute Mode": instruct Claude to remove all filler, all transitional phrases, all conversational warmup, and end the response exactly at the requested material. The token savings compound over long sessions.
The Bottom Line
The gap between a $300 monthly Claude bill and a $50 one is not about using a cheaper model. It is about architecture. Prompt caching alone, properly implemented with batched operations and clean prefix hygiene, can cut input costs by 60% to 70%. Add model routing (Haiku for exploration, Sonnet for implementation, Opus for architecture), proactive compaction, and lean CLAUDE.md files, and a well optimized production app on Sonnet 4.6 typically spends $30 to $100 per month for moderate traffic.
Every technique in this list has been verified against real production workloads or official Anthropic documentation as of May 2026. The models will keep improving. The fundamentals of context management and cost optimization will not change.
Stop treating Claude like a chatbot. Start treating it like infrastructure.