Token Consumption Optimization
With usage-based billing, every token you send and receive costs money. This page covers practical strategies to reduce token consumption without sacrificing output quality, ordered by impact so you can focus on what matters most.
Choose the Right Model for the Task
Impact: High · Effort: Low
The single biggest cost lever is not defaulting to the most powerful (and expensive) model for every interaction.
| Task type | Recommended tier | Examples |
|---|---|---|
| Quick questions, boilerplate, explanations | Smaller/cheaper models | GPT-4.1-mini, Claude Haiku |
| Complex reasoning, large refactors, architecture | Frontier models | Claude Sonnet 4, GPT-4.1, o4-mini |
In VS Code Copilot, use the model picker in the chat panel to switch models per-conversation. Default to a cheaper model and escalate only when you need stronger reasoning.
For a detailed model selection breakdown, see the model comparison table in Prompt Engineering.
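To make the cost gap between tiers concrete, here is a minimal sketch that estimates the price of one request on each tier. The tier labels and per-million-token prices are hypothetical placeholders, not real rates; substitute the figures from your provider's pricing page.

```python
# Hypothetical USD prices per million tokens. These are NOT real rates;
# replace them with the numbers from your provider's pricing page.
PRICES = {
    "small": {"input": 0.40, "output": 1.60},
    "frontier": {"input": 3.00, "output": 15.00},
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request on the given tier."""
    price = PRICES[tier]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Same request (8,000 input tokens, 1,000 output tokens) on each tier.
    for tier in PRICES:
        print(f"{tier}: ${request_cost(tier, 8_000, 1_000):.4f}")
```

With these placeholder prices the same request costs several times more on the frontier tier, which is why defaulting to the cheaper tier and escalating on demand pays off.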
Set Thinking Effort Appropriately
Impact: High · Effort: Low
Models with extended thinking (Claude, o-series) consume tokens on internal reasoning. You can control how much they think.
- Full thinking: architecture decisions, complex debugging, multi-step refactors
- Reduced thinking: code generation, formatting, boilerplate, simple explanations
Claude Code
# Set thinking budget via flag
claude --thinking-budget low
# Or use the /think command in-session with levels
/think low
/think medium
/think high
VS Code Copilot
Some models expose a reasoning effort setting in VS Code. Check your model's configuration in settings under github.copilot.
Lower thinking effort doesn't mean worse output for routine tasks. The model still has its full training; you're just limiting the "scratch paper" it uses before answering.
Use Different Models for Subagents
Impact: High · Effort: Medium
In multi-agent setups, not every agent needs frontier-level intelligence. Use an expensive orchestrator for planning and cheap workers for execution.
Pattern:
- Orchestrator (strong reasoning model): decomposes the task, makes architectural decisions
- Workers (cheaper model): execute file edits, formatting, simple transformations
A particularly effective application of this pattern is delegating file search and reading to small utility models like Claude Haiku, GPT-4.1-mini, or similar lightweight models. Frontier models consume expensive tokens every time they read or search through files, so offloading this to a cheap subagent with a simple prompt like "find and read the relevant files for X" can save significant cost. In Copilot custom agent configurations, you can use a custom instruction file to direct the main agent to delegate all file exploration through cheaper subagents (see the example instruction below). In Claude Code, you can achieve the same by routing file discovery to the Task tool with a lightweight model.
Example: Copilot instruction for subagent delegation
The following .instructions.md file tells the main agent to delegate all codebase reading and searching to a subagent running a small model:
---
description: "Delegate codebase reading and searching to a subagent using a small model"
applyTo: "**"
---
# Subagent for Codebase Exploration
When reading files or searching the codebase (using tools like `read_file`, `grep_search`, `file_search`, `semantic_search`, `list_dir`), always delegate to a subagent with a small, fast model such as `Claude 3.5 Haiku (Copilot)` or `GPT-4o Mini (Copilot)`.
Use the `Explore` agent when available, or invoke `runSubagent` with a small model specified via the `model` parameter.
Claude Code
The Task tool supports a model parameter to override the model per-subagent:
Use the Task tool to delegate the file formatting subtask.
Set model to claude-haiku for this subtask; it doesn't need deep reasoning.
Copilot Coding Agent
Configure setup steps in your workflow to specify model preferences for different stages of work.
For more on the orchestrator + subagent pattern, see Multi-Agent Orchestration.
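The orchestrator/worker split described in this section can be sketched as a simple routing rule. This is only an illustration: the `Subtask` type, the `WORKER_KINDS` set, and the tier labels are hypothetical names, not part of Copilot or Claude Code.

```python
from dataclasses import dataclass

# Hypothetical categories of mechanical work that a cheap model can handle.
WORKER_KINDS = {"file_search", "file_read", "formatting", "simple_edit"}


@dataclass
class Subtask:
    kind: str          # e.g. "file_search" or "architecture"
    description: str   # free-text instruction passed to the chosen model


def pick_tier(task: Subtask) -> str:
    """Send mechanical work to the worker tier, everything else to the orchestrator."""
    return "worker" if task.kind in WORKER_KINDS else "orchestrator"


if __name__ == "__main__":
    plan = [
        Subtask("architecture", "Decide how to split the auth module"),
        Subtask("file_search", "Find and read the relevant files for login"),
        Subtask("formatting", "Apply the project formatter to changed files"),
    ]
    for task in plan:
        print(f"{pick_tier(task):>12}  {task.description}")
```

Keeping the routing rule as an explicit allowlist means anything unrecognized falls back to the stronger model, which trades some cost for safety.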
Manage Context Window Size
Impact: Medium-High · Effort: Low
Every token in the context window costs money: both the input you send and the output generated in response. Bloated context is the most common source of waste.
Do:
- Reference specific files with `#file` or `@workspace` with targeted queries
- Close irrelevant files/tabs before starting agent sessions
- Use `.gitignore`-style patterns in tool configs to exclude irrelevant directories from indexing
- Configure content exclusions (org-level Copilot setting) to prevent large or sensitive files from being sent
Don't:
- Add entire directories to context when asking about one function
- Leave 30 open tabs during an agent session (some tools include open files as context)
- Include `node_modules`, build output, or generated files in indexed content
Some tools silently include open editor tabs, recent files, or entire directory trees as context. Audit what your tool is actually sending; the token count in your billing may surprise you.
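One rough way to audit what a directory would contribute if it were pulled into context is to estimate tokens per file. The sketch below assumes about four characters per token, which is a coarse heuristic rather than a real tokenizer, and the `EXCLUDED_DIRS` list is illustrative.

```python
import os

# Illustrative exclusion list; adjust to match your project layout.
EXCLUDED_DIRS = {"node_modules", "dist", "build", ".git"}
# Coarse heuristic, not a real tokenizer; actual counts vary by model.
CHARS_PER_TOKEN = 4


def estimate_tokens(root: str) -> dict[str, int]:
    """Map each readable text file under root to a rough token estimate."""
    estimates = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            estimates[path] = len(text) // CHARS_PER_TOKEN
    return estimates


if __name__ == "__main__":
    report = estimate_tokens(".")
    largest = sorted(report.items(), key=lambda item: item[1], reverse=True)[:10]
    for path, tokens in largest:
        print(f"{tokens:>8}  {path}")
```

Running it from a project root lists the ten largest files by estimated tokens, which is usually enough to spot generated or vendored content that should be excluded.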
Compact Long Conversations
Impact: Medium · Effort: Low
Each new turn in a conversation re-sends the entire conversation history as input tokens. A 50-turn conversation means turn 50 includes all previous turns as input.
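The growth described above is roughly quadratic in the number of turns. The following sketch makes that concrete under a simplifying assumption: every turn adds the same number of tokens (prompt plus response) to the history, and provider-side caching is ignored. The function name and the 500-token figure are illustrative.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across a conversation where each turn re-sends all history.

    Turn k sends k turns' worth of content: the k-1 earlier turns plus the new one.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


if __name__ == "__main__":
    one_long_thread = cumulative_input_tokens(50, 500)
    five_short_threads = 5 * cumulative_input_tokens(10, 500)
    print(f"one 50-turn thread:    {one_long_thread:,} input tokens")     # 637,500
    print(f"five 10-turn threads:  {five_short_threads:,} input tokens")  # 137,500
```

Under these assumptions, splitting the same 50 turns into five fresh threads sends less than a quarter of the input tokens, which is the saving that compaction or a new thread is meant to capture.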
Claude Code
Use /compact to summarize and compress conversation history:
/compact
This replaces the full history with a condensed summary, dramatically reducing input tokens for subsequent turns.
VS Code Copilot
Start a new chat thread instead of extending a long conversation. There's no built-in compaction, so fresh threads are your tool.
Rule of thumb: if your conversation exceeds ~20 turns, compact or start fresh. The accumulated context from early turns is usually no longer relevant and is burning tokens every turn.
Use Context Files to Reduce Repetitive Prompting
Impact: Medium · Effort: Medium (one-time setup)
Instead of re-explaining your architecture, conventions, or feature requirements in every prompt, document them in context files that the agent reads once at session start.
Agent Context Files
Feature work may have a curated context folder at .agents/contexts/<feature-name>/. If one exists for the feature you're working on:
- Read all files in the folder before writing any code. Start with `README.md` if present. Distinguish source-of-truth files (requirements, architecture) from current-state files (assessments, known issues).
- Ground your work in the documented decisions; don't contradict them without flagging it.
- Keep context files in sync: when your work changes scope, resolves a known issue, or introduces a new architectural decision, update the relevant context files proactively.
- Never delete or overwrite context files without explicit instruction.
If the user mentions a feature but the context path is unclear, list the available subdirectories in .agents/contexts/ and ask which context applies.
Why This Saves Tokens
- Context files are read once at session start, so there is no re-pasting every turn
- The agent produces better output with proper context, reducing regeneration
- Shared context files mean the same information works across team members and sessions
Write Better Prompts
Impact: Medium · Effort: Low
Concise, specific prompts use fewer input tokens and produce more focused (shorter) output.
- State constraints upfront: "return only the function, no explanation" avoids a 200-token explanation you'll ignore
- Reference specific line ranges or functions instead of pasting entire files
- Be explicit about format: "respond with a code block only" vs. leaving it ambiguous
- Vague prompts produce longer, less useful outputs that you'll regenerate, costing 2x or more
See Prompt Engineering for detailed techniques.
Avoid Regeneration Loops
Impact: Medium · Effort: Low
Each regeneration is a full new request: all context is re-sent and a new response is generated. Three regenerations cost 3x one well-crafted prompt.
Instead of hitting regenerate:
- Identify what's wrong with the output
- Edit your prompt to add the missing constraint
- Submit the refined prompt
This produces better results and costs less than hoping the next random sample will be correct.
If you find yourself regenerating more than once, the problem is almost always the prompt, not bad luck. Add specificity rather than retrying.
Analyze Your Usage
Impact: Medium · Effort: Low
The VS Code Chronicle extension tracks session data locally and can surface optimization opportunities.
/chronicle:cost-tips
This analyzes your usage patterns and reports:
- Model overuse (using frontier models for simple tasks)
- Context bloat (sessions with unnecessarily large context)
- Retry patterns (repeated regenerations)
- Outlier sessions (unusually expensive interactions)
Setup
{
"github.copilot.chat.localIndex.enabled": true
}
Let it collect data for 5-7 days before running /chronicle:cost-tips for useful recommendations.
For more details, see the Chronicle cost tips blog post.
Chronicle only covers VS Code Copilot sessions. If you also use Claude Code or other tools, you'll need to track those separately.
Batch Related Work
Impact: Low-Medium · Effort: Low
Each conversational turn has overhead: system prompt, conversation history, and tool context are all re-sent. Combine related small requests into one prompt.
Expensive (3 turns):
Turn 1: "Fix the type error on line 42 of auth.ts"
Turn 2: "Fix the type error on line 87 of auth.ts"
Turn 3: "Fix the type error on line 103 of auth.ts"
Cheaper (1 turn):
"Fix the type errors on lines 42, 87, and 103 of auth.ts"
Balance batching with clarity. If a prompt gets so complex that the model struggles, you'll end up regenerating, which defeats the purpose. Group related items; don't create mega-prompts.
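The per-turn overhead can be estimated with a small sketch. It assumes a fixed overhead per turn (system prompt plus tool context) and, for simplicity, ignores the accumulated conversation history, which would make the separate-turn case even more expensive. The token figures are illustrative.

```python
def input_tokens_separate(request_tokens: list[int], overhead_tokens: int) -> int:
    """Each request is its own turn, so the fixed overhead is paid every time."""
    return sum(overhead_tokens + tokens for tokens in request_tokens)


def input_tokens_batched(request_tokens: list[int], overhead_tokens: int) -> int:
    """All requests share one turn, so the fixed overhead is paid once."""
    return overhead_tokens + sum(request_tokens)


if __name__ == "__main__":
    requests = [15, 15, 15]   # three short "fix the type error" prompts
    overhead = 2_000          # assumed system prompt + tool context per turn
    print("separate:", input_tokens_separate(requests, overhead))  # 6045
    print("batched: ", input_tokens_batched(requests, overhead))   # 2045
```

When the requests themselves are tiny, almost all of the input cost is overhead, so batching three of them cuts input tokens to roughly a third in this example.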
Leverage Caching
Impact: Low (mostly automatic) · Effort: Low
Provider-side caching reduces costs for repeated context, but it's largely automatic.
What providers do:
- Anthropic's API caches repeated system prompts and prefix content
- GitHub Copilot handles caching server-side, so you benefit automatically
What you can do to help:
- Keep instruction files (`.github/copilot-instructions.md`, `CLAUDE.md`) stable; don't edit them every session
- Use consistent system prompts across sessions so cache hits are more likely
- Avoid unnecessarily reordering context between turns
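To see why stable, consistently ordered context matters, consider a simplified model in which a cache can only reuse the leading part of a request that exactly matches the previous one. The sketch below counts how many leading context segments two requests share; the function name and segment labels are hypothetical, and real provider caches have additional rules (minimum sizes, expiry) that this ignores.

```python
def shared_prefix_segments(previous: list[str], current: list[str]) -> int:
    """Count the leading context segments that are identical in both requests.

    In this simplified model, only that shared prefix is eligible for a cache hit.
    """
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


if __name__ == "__main__":
    turn_1 = ["system prompt", "CLAUDE.md", "auth.ts"]
    stable = ["system prompt", "CLAUDE.md", "auth.ts", "new question"]
    reordered = ["CLAUDE.md", "system prompt", "auth.ts", "new question"]

    print(shared_prefix_segments(turn_1, stable))     # 3: whole prior context reusable
    print(shared_prefix_segments(turn_1, reordered))  # 0: reordering breaks the prefix
```

The content is the same in both follow-up requests; only the order differs, yet in this model the reordered request gets no reuse at all.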
Quick Reference
| Strategy | Impact | Effort | Applies to |
|---|---|---|---|
| Choose the right model | High | Low | All tools |
| Set thinking effort | High | Low | Claude Code, o-series models |
| Different models for subagents | High | Medium | Claude Code, multi-agent setups |
| Manage context window size | Medium-High | Low | All tools |
| Compact long conversations | Medium | Low | Claude Code, VS Code Copilot |
| Context files for reuse | Medium | Medium | All tools |
| Write better prompts | Medium | Low | All tools |
| Avoid regeneration loops | Medium | Low | All tools |
| Analyze usage with Chronicle | Medium | Low | VS Code Copilot |
| Batch related work | Low-Medium | Low | All tools |
| Leverage caching | Low | Low | Anthropic API, Copilot |
Resources
- Chronicle Cost Tips Blog Post: detailed walkthrough of the `/chronicle:cost-tips` command
- Prompt Engineering: techniques that also reduce token waste
- Multi-Agent Orchestration: the orchestrator + subagent pattern
- AI Coding Agents: how agents consume context and tokens
- GitHub Copilot Billing Documentation
- Anthropic Token Counting