
Token Consumption Optimization

With usage-based billing, every token you send and receive costs money. This page covers practical strategies to reduce token consumption without sacrificing output quality, ordered by impact so you can focus on what matters most.

Choose the Right Model for the Task

Impact: High · Effort: Low

The single biggest cost lever is not defaulting to the most powerful (and expensive) model for every interaction.

| Task type | Recommended tier | Examples |
| --- | --- | --- |
| Quick questions, boilerplate, explanations | Smaller/cheaper models | GPT-4.1-mini, Claude Haiku |
| Complex reasoning, large refactors, architecture | Frontier models | Claude Sonnet 4, GPT-4.1, o4-mini |

tip

In VS Code Copilot, use the model picker in the chat panel to switch models per-conversation. Default to a cheaper model and escalate only when you need stronger reasoning.

For a detailed model selection breakdown, see the model comparison table in Prompt Engineering.
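
The "default cheap, escalate when needed" rule can be sketched as a small routing function. The task categories and tier labels here are hypothetical, loosely based on the table above; no tool exposes this exact interface.

```python
# Hypothetical task categories; adjust these to your own workflow.
COMPLEX_TASKS = {"complex_reasoning", "large_refactor", "architecture"}


def pick_tier(task_type: str, escalate: bool = False) -> str:
    """Return "frontier" only for complex tasks or an explicit escalation."""
    if escalate or task_type in COMPLEX_TASKS:
        return "frontier"
    return "small"


print(pick_tier("boilerplate"))                 # small
print(pick_tier("architecture"))                # frontier
print(pick_tier("boilerplate", escalate=True))  # frontier
```

The `escalate` flag models the manual switch in the model picker: start on the small tier and flip it only after the cheaper model has fallen short.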

Set Thinking Effort Appropriately

Impact: High · Effort: Low

Models with extended thinking (Claude, o-series) consume tokens on internal reasoning. You can control how much they think.

  • Full thinking: architecture decisions, complex debugging, multi-step refactors
  • Reduced thinking: code generation, formatting, boilerplate, simple explanations

Claude Code

# Set thinking budget via flag
claude --thinking-budget low

# Or use the /think command in-session with levels
/think low
/think medium
/think high

VS Code Copilot

Some models expose a reasoning effort setting in VS Code. Check your model's configuration in settings under github.copilot.

note

Lower thinking effort doesn't mean worse output for routine tasks. The model still has its full training; you're just limiting the "scratch paper" it uses before answering.

Use Different Models for Subagents

Impact: High · Effort: Medium

In multi-agent setups, not every agent needs frontier-level intelligence. Use an expensive orchestrator for planning and cheap workers for execution.

Pattern:

  • Orchestrator (strong reasoning model): decomposes the task, makes architectural decisions
  • Workers (cheaper model): execute file edits, formatting, simple transformations

A particularly effective application of this pattern is delegating file search and reading to small utility models such as Claude Haiku or GPT-4.1-mini. Frontier models consume expensive tokens every time they read or search through files, so offloading this to a cheap subagent with a simple prompt like "find and read the relevant files for X" can save significant cost. In Copilot custom agent configurations, you can use a custom instruction file to direct the main agent to delegate all file exploration through cheaper subagents (see the example instruction below). In Claude Code, you can achieve the same by routing file discovery to the Task tool with a lightweight model.
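
To see why delegating file exploration pays off, here is a rough cost comparison. The prices and token counts are hypothetical placeholders, not real provider rates, and the sketch counts input tokens only (it ignores the subagent's output tokens).

```python
def read_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars of feeding `tokens` input tokens to a model."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical prices in dollars per million input tokens.
FRONTIER_PRICE = 3.00
SMALL_PRICE = 0.30

explored_tokens = 200_000  # raw file content read during exploration
summary_tokens = 5_000     # condensed findings handed back to the orchestrator

# Option A: the frontier model reads every file itself.
direct = read_cost(explored_tokens, FRONTIER_PRICE)

# Option B: a small model reads the files; the frontier model reads only the summary.
delegated = (read_cost(explored_tokens, SMALL_PRICE)
             + read_cost(summary_tokens, FRONTIER_PRICE))

print(f"direct: ${direct:.3f}, delegated: ${delegated:.3f}")
```

Under these assumed numbers the delegated path costs about an eighth of the direct one. The actual ratio depends on the price gap between tiers and on how compact the subagent's summary is.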

Example: Copilot instruction for subagent delegation

The following .instructions.md file tells the main agent to delegate all codebase reading and searching to a subagent running a small model:

---
description: "Delegate codebase reading and searching to a subagent using a small model"
applyTo: "**"
---

# Subagent for Codebase Exploration

When reading files or searching the codebase (using tools like `read_file`, `grep_search`, `file_search`, `semantic_search`, `list_dir`), always delegate to a subagent with a small, fast model such as `Claude 3.5 Haiku (Copilot)` or `GPT-4o Mini (Copilot)`.

Use the `Explore` agent when available, or invoke `runSubagent` with a small model specified via the `model` parameter.

Claude Code

The Task tool supports a model parameter to override the model per-subagent:

Use the Task tool to delegate the file formatting subtask.
Set model to claude-haiku for this subtask; it doesn't need deep reasoning.

Copilot Coding Agent

Configure setup steps in your workflow to specify model preferences for different stages of work.

For more on the orchestrator + subagent pattern, see Multi-Agent Orchestration.

Manage Context Window Size

Impact: Medium-High · Effort: Low

Every token in the context window costs money: both the input you send and the output generated in response. Bloated context is the most common source of waste.

Do:

  • Reference specific files with #file or @workspace with targeted queries
  • Close irrelevant files/tabs before starting agent sessions
  • Use .gitignore-style patterns in tool configs to exclude irrelevant directories from indexing
  • Configure content exclusions (org-level Copilot setting) to prevent large or sensitive files from being sent

Don't:

  • Add entire directories to context when asking about one function
  • Leave 30 open tabs during an agent session, since some tools include open files as context
  • Include node_modules, build output, or generated files in indexed content
caution

Some tools silently include open editor tabs, recent files, or entire directory trees as context. Audit what your tool is actually sending; the token count in your billing may surprise you.
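
One way to get a feel for what a directory would cost as context is to estimate its size in tokens while skipping excluded folders. This is a minimal sketch: the excluded names mirror the "Don't" list above, and the bytes-per-token ratio is a crude assumption rather than a real tokenizer.

```python
import fnmatch
import os

# Directory name patterns to skip (assumed defaults; extend as needed).
EXCLUDED_DIRS = ["node_modules", "dist", "build", ".git"]


def estimate_tokens(root: str, bytes_per_token: int = 4) -> dict[str, int]:
    """Map each file under `root` to a rough token estimate.

    Uses file size divided by `bytes_per_token`, which is only a heuristic.
    """
    estimates = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [
            d for d in dirnames
            if not any(fnmatch.fnmatch(d, pattern) for pattern in EXCLUDED_DIRS)
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            relative = os.path.relpath(path, root)
            estimates[relative] = os.path.getsize(path) // bytes_per_token
    return estimates
```

Calling `estimate_tokens(".")` from a repository root and sorting the result by value shows which files would dominate the context if a tool included the whole tree.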

Compact Long Conversations

Impact: Medium · Effort: Low

Each new turn in a conversation re-sends the entire conversation history as input tokens. A 50-turn conversation means turn 50 includes all previous turns as input.
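
Because each turn re-sends everything before it, total input tokens grow roughly with the square of the conversation length. The sketch below makes that concrete under a simplifying assumption: every turn adds the same number of tokens to the history, so the input for turn k is about k times that amount.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across a whole conversation.

    Simplified model: each turn adds `tokens_per_turn` tokens to the history,
    and turn k sends roughly k * tokens_per_turn tokens as input.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


print(cumulative_input_tokens(10, 500))  # 27500
print(cumulative_input_tokens(50, 500))  # 637500
```

In this model, a conversation five times longer costs about 23 times as much input, which is why trimming history matters more as sessions grow.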

Claude Code

Use /compact to summarize and compress conversation history:

/compact

This replaces the full history with a condensed summary, dramatically reducing input tokens for subsequent turns.

VS Code Copilot

Start a new chat thread instead of extending a long conversation. There's no built-in compaction, so fresh threads are your tool.

tip

Rule of thumb: if your conversation exceeds ~20 turns, compact or start fresh. The accumulated context from early turns is usually no longer relevant and is burning tokens every turn.

Use Context Files to Reduce Repetitive Prompting

Impact: Medium · Effort: Medium (one-time setup)

Instead of re-explaining your architecture, conventions, or feature requirements in every prompt, document them in context files that the agent reads once at session start.

Agent Context Files

Feature work may have a curated context folder at .agents/contexts/<feature-name>/. If one exists for the feature you're working on:

  1. Read all files in the folder before writing any code. Start with README.md if present. Distinguish source-of-truth files (requirements, architecture) from current-state files (assessments, known issues).
  2. Ground your work in the documented decisions; don't contradict them without flagging it.
  3. Keep context files in sync: when your work changes scope, resolves a known issue, or introduces a new architectural decision, update the relevant context files proactively.
  4. Never delete or overwrite context files without explicit instruction.

If the user mentions a feature but the context path is unclear, list the available subdirectories in .agents/contexts/ and ask which context applies.
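
Listing the available contexts is a small directory scan. The sketch below assumes the `.agents/contexts/` layout described above; the function name `list_contexts` is hypothetical.

```python
from pathlib import Path


def list_contexts(repo_root: str = ".") -> list[str]:
    """Return the feature names that have a folder under .agents/contexts/."""
    base = Path(repo_root) / ".agents" / "contexts"
    if not base.is_dir():
        return []  # no context folder in this repository
    return sorted(p.name for p in base.iterdir() if p.is_dir())
```

An empty result means there is nothing to choose from, so the agent should ask the user for context rather than guess a path.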

Why This Saves Tokens

  • Context files are read once at session start, with no re-pasting every turn
  • The agent produces better output with proper context, reducing regeneration
  • Shared context files mean the same information works across team members and sessions

Write Better Prompts

Impact: Medium · Effort: Low

Concise, specific prompts use fewer input tokens and produce more focused (shorter) output.

  • State constraints upfront: "return only the function, no explanation" avoids a 200-token explanation you'll ignore
  • Reference specific line ranges or functions instead of pasting entire files
  • Be explicit about format: "respond with a code block only" vs. leaving it ambiguous
  • Vague prompts produce longer, less useful outputs that you'll regenerate, costing 2x or more

See Prompt Engineering for detailed techniques.

Avoid Regeneration Loops

Impact: Medium · Effort: Low

Each regeneration is a full new request: all context is re-sent and a new response is generated. Three regenerations add roughly three more full requests on top of the original.

Instead of hitting regenerate:

  1. Identify what's wrong with the output
  2. Edit your prompt to add the missing constraint
  3. Submit the refined prompt

This produces better results and costs less than hoping the next random sample will be correct.

note

If you find yourself regenerating more than once, the problem is almost always the prompt, not bad luck. Add specificity rather than retrying.

Analyze Your Usage

Impact: Medium · Effort: Low

The VS Code Chronicle extension tracks session data locally and can surface optimization opportunities.

/chronicle:cost-tips

This analyzes your usage patterns and reports:

  • Model overuse (using frontier models for simple tasks)
  • Context bloat (sessions with unnecessarily large context)
  • Retry patterns (repeated regenerations)
  • Outlier sessions (unusually expensive interactions)

Setup

.vscode/settings.json
{
  "github.copilot.chat.localIndex.enabled": true
}

Let it collect data for 5 to 7 days before running /chronicle:cost-tips for useful recommendations.

For more details, see the Chronicle cost tips blog post.

caution

Chronicle only covers VS Code Copilot sessions. If you also use Claude Code or other tools, you'll need to track those separately.

Batch Related Work

Impact: Low-Medium · Effort: Low

Each conversational turn has overhead: system prompt, conversation history, and tool context are all re-sent. Combine related small requests into one prompt.

Expensive (3 turns):

Turn 1: "Fix the type error on line 42 of auth.ts"
Turn 2: "Fix the type error on line 87 of auth.ts"
Turn 3: "Fix the type error on line 103 of auth.ts"

Cheaper (1 turn):

"Fix the type errors on lines 42, 87, and 103 of auth.ts"
tip

Balance batching with clarity. If a prompt gets so complex that the model struggles, you'll end up regenerating, which defeats the purpose. Group related items; don't create mega-prompts.
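
The saving from batching comes from paying the fixed per-turn overhead once instead of once per request. The numbers below are illustrative assumptions (a 3,000-token overhead and 20-token prompts), and history growth between turns is ignored to keep the arithmetic simple.

```python
def total_input_tokens(requests: list[int], overhead: int, batched: bool) -> int:
    """Input tokens for a list of request sizes, sent together or one per turn.

    `overhead` stands for the fixed per-turn cost (system prompt, tool context).
    """
    if batched:
        return overhead + sum(requests)
    return sum(overhead + size for size in requests)


requests = [20, 20, 20]  # three short "fix the type error" prompts

print(total_input_tokens(requests, overhead=3000, batched=False))  # 9060
print(total_input_tokens(requests, overhead=3000, batched=True))   # 3060
```

The prompts themselves are almost free here; nearly all of the difference is repeated overhead, which is why batching helps most with many small requests.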

Leverage Caching

Impact: Low (mostly automatic) · Effort: Low

Provider-side caching reduces costs for repeated context, but it's largely automatic.

What providers do:

  • Anthropic's API caches repeated system prompts and prefix content
  • GitHub Copilot handles caching server-side, so you benefit automatically

What you can do to help:

  • Keep instruction files (.github/copilot-instructions.md, CLAUDE.md) stable; don't edit them every session
  • Use consistent system prompts across sessions so cache hits are more likely
  • Avoid unnecessarily reordering context between turns
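
Prefix-based caching generally reuses only the leading part of a request that matches an earlier one, so ordering matters. The sketch below is a simplified model of that idea, not any provider's actual algorithm; the block names are made up for illustration.

```python
def shared_prefix_length(previous: list[str], current: list[str]) -> int:
    """Count leading context blocks that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first mismatch ends the reusable prefix
        count += 1
    return count


turn_1 = ["system prompt", "CLAUDE.md", "auth.ts", "question 1"]
stable = ["system prompt", "CLAUDE.md", "auth.ts", "question 2"]
reordered = ["system prompt", "auth.ts", "CLAUDE.md", "question 2"]

print(shared_prefix_length(turn_1, stable))     # 3
print(shared_prefix_length(turn_1, reordered))  # 1
```

With a stable order, three of the four blocks could be served from cache; swapping two files cuts the reusable prefix to one block even though the content is the same.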

Quick Reference

| Strategy | Impact | Effort | Applies to |
| --- | --- | --- | --- |
| Choose the right model | High | Low | All tools |
| Set thinking effort | High | Low | Claude Code, o-series models |
| Different models for subagents | High | Medium | Claude Code, multi-agent setups |
| Manage context window size | Medium-High | Low | All tools |
| Compact long conversations | Medium | Low | Claude Code, VS Code Copilot |
| Context files for reuse | Medium | Medium | All tools |
| Write better prompts | Medium | Low | All tools |
| Avoid regeneration loops | Medium | Low | All tools |
| Analyze usage with Chronicle | Medium | Low | VS Code Copilot |
| Batch related work | Low-Medium | Low | All tools |
| Leverage caching | Low | Low | Anthropic API, Copilot |

Resources