Token Consumption Optimization

With usage-based billing, every token you send and receive costs money. This page covers practical strategies to reduce token consumption without sacrificing output quality — ordered by impact so you can focus on what matters most.

Choose the Right Model for the Task

Impact: High · Effort: Low

The single biggest cost lever is model choice: don't default to the most powerful (and most expensive) model for every interaction.

Task type	Recommended tier	Example models
Quick questions, boilerplate, explanations	Smaller/cheaper models	GPT-4.1-mini, Claude Haiku
Complex reasoning, large refactors, architecture	Frontier models	Claude Sonnet 4, GPT-4.1, o4-mini
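As a rough illustration of why tier choice dominates cost, the sketch below prices the same request on two tiers. The tier names, the per-million-token prices, and the `estimate_cost` helper are hypothetical placeholders, not real provider pricing; substitute the rates from your own billing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """A pricing tier. Prices are USD per million tokens (hypothetical)."""
    name: str
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical prices for illustration only; check your provider's rates.
SMALL = Tier("small", input_per_mtok=0.40, output_per_mtok=1.60)
FRONTIER = Tier("frontier", input_per_mtok=3.00, output_per_mtok=15.00)


def estimate_cost(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request on the given tier."""
    return (
        input_tokens * tier.input_per_mtok
        + output_tokens * tier.output_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    # The same request (8k tokens in, 1k tokens out) priced on each tier.
    for tier in (SMALL, FRONTIER):
        print(f"{tier.name}: ${estimate_cost(tier, 8_000, 1_000):.4f}")
```

Running the same calculation with your real rates and a typical request size gives a quick sense of how much a default-to-cheap habit is worth.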

tip

In VS Code Copilot, use the model picker in the chat panel to switch models per-conversation. Default to a cheaper model and escalate only when you need stronger reasoning.

For a detailed model selection breakdown, see the model comparison table in Prompt Engineering.

Set Thinking Effort Appropriately

Impact: High · Effort: Low

Models with extended thinking (Claude, o-series) spend tokens on internal reasoning before they answer, and those tokens are billed as output. You can control how much they think.

Full thinking: architecture decisions, complex debugging, multi-step refactors
Reduced thinking: code generation, formatting, boilerplate, simple explanations

Claude Code

# Set thinking budget via flag
claude --thinking-budget low

# Or use the /think command in-session with levels
/think low
/think medium
/think high

VS Code Copilot

Some models expose a reasoning effort setting in VS Code. Check your model's configuration in settings under github.copilot.

note

Lower thinking effort doesn't mean worse output for routine tasks. The model still has its full training — you're just limiting the "scratch paper" it uses before answering.

Use Different Models for Subagents

Impact: High · Effort: Medium

In multi-agent setups, not every agent needs frontier-level intelligence. Use an expensive orchestrator for planning and cheap workers for execution.

Pattern:

Orchestrator (strong reasoning model): decomposes the task, makes architectural decisions
Workers (cheaper model): execute file edits, formatting, simple transformations

A particularly effective application of this pattern is delegating file search and reading to a lightweight model such as Claude Haiku or GPT-4.1-mini. Frontier models consume expensive tokens every time they read or search through files; offloading this to a cheap subagent with a simple prompt like "find and read the relevant files for X" can cut that cost substantially, because only the subagent's findings need to reach the main agent. In Copilot custom agent configurations, you can use a custom instruction file to direct the main agent to delegate all file exploration to cheaper subagents (see the example instruction below). In Claude Code, you can achieve the same by routing file discovery to the Task tool with a lightweight model.
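To make the saving concrete, here is a minimal sketch comparing a frontier model that reads files directly with a cheap worker that reads them and hands back a short summary. All prices and token counts are hypothetical, and the model assumes the worker's summary is the only exploration output the orchestrator has to ingest.

```python
def direct_read_cost(file_tokens: int, frontier_in_price: float) -> float:
    """USD cost when the frontier model ingests the raw files itself."""
    return file_tokens * frontier_in_price / 1_000_000


def delegated_read_cost(
    file_tokens: int,
    summary_tokens: int,
    worker_in_price: float,
    worker_out_price: float,
    frontier_in_price: float,
) -> float:
    """USD cost when a cheap worker reads the files and returns a summary."""
    # The worker pays for reading the files and for writing the summary.
    worker = file_tokens * worker_in_price + summary_tokens * worker_out_price
    # The frontier model only pays to read the summary.
    frontier = summary_tokens * frontier_in_price
    return (worker + frontier) / 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers: 50k tokens of files condensed to a 2k summary.
    direct = direct_read_cost(50_000, frontier_in_price=3.00)
    delegated = delegated_read_cost(
        50_000, 2_000,
        worker_in_price=0.40, worker_out_price=1.60, frontier_in_price=3.00,
    )
    print(f"direct: ${direct:.4f}  delegated: ${delegated:.4f}")
```

The gap widens as the files get larger relative to the summary, and narrows if the worker's summary is nearly as long as what it read.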

Example: Copilot instruction for subagent delegation

The following .instructions.md file tells the main agent to delegate all codebase reading and searching to a subagent running a small model:

---
description: "Delegate codebase reading and searching to a subagent using a small model"
applyTo: "**"
---

# Subagent for Codebase Exploration

When reading files or searching the codebase (using tools like `read_file`, `grep_search`, `file_search`, `semantic_search`, `list_dir`), always delegate to a subagent with a small, fast model such as `Claude 3.5 Haiku (Copilot)` or `GPT-4o Mini (Copilot)`.

Use the `Explore` agent when available, or invoke `runSubagent` with a small model specified via the `model` parameter.

Claude Code

The Task tool supports a model parameter to override the model per-subagent:

Use the Task tool to delegate the file formatting subtask.
Set model to claude-haiku for this subtask — it doesn't need deep reasoning.

Copilot Coding Agent

Configure setup steps in your workflow to specify model preferences for different stages of work.

For more on the orchestrator + subagent pattern, see Multi-Agent Orchestration.

Manage Context Window Size

Impact: Medium-High · Effort: Low

Every token in the context window costs money — both the input you send and the output generated in response. Bloated context is the most common source of waste.

Do:

Reference specific files with #file, or use @workspace with a targeted query
Close irrelevant files/tabs before starting agent sessions
Use .gitignore-style patterns in tool configs to exclude irrelevant directories from indexing
Configure content exclusions (org-level Copilot setting) to prevent large or sensitive files from being sent

Don't:

Add entire directories to context when asking about one function
Leave 30 open tabs during an agent session — some tools include open files as context
Include node_modules, build output, or generated files in indexed content
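If you assemble context yourself (for example in a script that feeds files to an agent), the same exclusions can be applied before anything is sent. The sketch below is a simplified filter, not full .gitignore semantics; the pattern lists and the `is_excluded` and `filter_context` helpers are hypothetical names chosen for illustration.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical exclusion lists; adjust to your repository layout.
EXCLUDED_DIRS = ("node_modules", "dist", "build", ".git")
EXCLUDED_FILES = ("*.min.js", "*.lock", "*.generated.*")


def is_excluded(path: str) -> bool:
    """Return True if the path lies in an excluded directory or matches a file pattern."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Any parent directory on the exclusion list rules the file out.
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # Otherwise test the file name against the glob patterns.
    return any(fnmatch(parts[-1], pattern) for pattern in EXCLUDED_FILES)


def filter_context(paths: list[str]) -> list[str]:
    """Keep only the paths worth sending as context, preserving order."""
    return [p for p in paths if not is_excluded(p)]


if __name__ == "__main__":
    candidates = [
        "src/auth.ts",
        "node_modules/lodash/index.js",
        "dist/app.min.js",
        "src/vendor/jquery.min.js",
    ]
    print(filter_context(candidates))
```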

caution

Some tools silently include open editor tabs, recent files, or entire directory trees as context. Audit what your tool is actually sending — the token count in your billing may surprise you.

Compact Long Conversations

Impact: Medium · Effort: Low

Each new turn in a conversation re-sends the entire conversation history as input tokens. A 50-turn conversation means turn 50 includes all previous turns as input.

Claude Code

Use /compact to summarize and compress conversation history:

/compact

This replaces the full history with a condensed summary, dramatically reducing input tokens for subsequent turns.

VS Code Copilot

Start a new chat thread instead of extending a long conversation. There's no built-in compaction — fresh threads are your tool.

tip

Rule of thumb: if your conversation exceeds ~20 turns, compact or start fresh. The accumulated context from early turns is usually no longer relevant and is burning tokens every turn.
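The effect of compacting can be estimated with a small simulation. This is a sketch under simplifying assumptions: every turn adds the same number of tokens to the history, the full history is re-sent as input on each turn, and compaction replaces the history with a fixed-size summary. The function name and all numbers are hypothetical.

```python
from typing import Optional


def total_input_tokens(
    turns: int,
    tokens_per_turn: int,
    compact_every: Optional[int] = None,
    summary_tokens: int = 0,
) -> int:
    """Total input tokens sent over a conversation, with optional compaction."""
    total = 0
    history = 0
    for turn in range(1, turns + 1):
        history += tokens_per_turn  # this turn's content joins the history
        total += history            # the whole history is sent as input
        if compact_every and turn % compact_every == 0:
            history = summary_tokens  # history is replaced by a summary
    return total


if __name__ == "__main__":
    plain = total_input_tokens(50, 500)
    compacted = total_input_tokens(50, 500, compact_every=20, summary_tokens=1_000)
    print(f"no compaction: {plain:,} tokens")
    print(f"compact every 20 turns: {compacted:,} tokens")
```

Without compaction the total grows roughly with the square of the turn count, which is why long threads become expensive faster than they feel.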

Use Context Files to Reduce Repetitive Prompting

Impact: Medium · Effort: Medium (one-time setup)

Instead of re-explaining your architecture, conventions, or feature requirements in every prompt, document them in context files that the agent reads once at session start.

Agent Context Files

Feature work may have a curated context folder at .agents/contexts/<feature-name>/. If one exists for the feature you're working on:

Read all files in the folder before writing any code. Start with README.md if present. Distinguish source-of-truth files (requirements, architecture) from current-state files (assessments, known issues).
Ground your work in the documented decisions — don't contradict them without flagging it.
Keep context files in sync — when your work changes scope, resolves a known issue, or introduces a new architectural decision, update the relevant context files proactively.
Never delete or overwrite context files without explicit instruction.

If the user mentions a feature but the context path is unclear, list the available subdirectories in .agents/contexts/ and ask which context applies.
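The lookup described above can be sketched in a few lines. This is an illustration only: the `list_contexts` and `reading_order` helpers are hypothetical, and the only layout assumed is the .agents/contexts/<feature-name>/ folder with an optional README.md.

```python
from pathlib import Path

CONTEXT_ROOT = Path(".agents/contexts")


def list_contexts(root: Path = CONTEXT_ROOT) -> list[str]:
    """Return the names of available context folders, sorted alphabetically."""
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def reading_order(feature: str, root: Path = CONTEXT_ROOT) -> list[Path]:
    """Return the files in a feature's context folder, README.md first."""
    folder = root / feature
    if not folder.is_dir():
        return []
    files = [p for p in folder.iterdir() if p.is_file()]
    # False sorts before True, so README.md leads; the rest follow by name.
    return sorted(files, key=lambda p: (p.name != "README.md", p.name))


if __name__ == "__main__":
    for name in list_contexts():
        print(name, [p.name for p in reading_order(name)])
```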

Why This Saves Tokens

Context files are read once at session start — no re-pasting every turn
The agent produces better output with proper context, reducing regeneration
Shared context files mean the same information works across team members and sessions

Write Better Prompts

Impact: Medium · Effort: Low

Concise, specific prompts use fewer input tokens and produce more focused (shorter) output.

State constraints upfront — "return only the function, no explanation" avoids a 200-token explanation you'll ignore
Reference specific line ranges or functions instead of pasting entire files
Be explicit about format: "respond with a code block only" vs. leaving it ambiguous
Vague prompts produce longer, less useful outputs that you'll regenerate — costing 2x or more

See Prompt Engineering for detailed techniques.

Avoid Regeneration Loops

Impact: Medium · Effort: Low

Each regeneration is a full new request: all context is re-sent and a new response is generated. Three attempts cost roughly 3x one well-crafted prompt.

Instead of hitting regenerate:

Identify what's wrong with the output
Edit your prompt to add the missing constraint
Submit the refined prompt

This produces better results and costs less than hoping the next random sample will be correct.

note

If you find yourself regenerating more than once, the problem is almost always the prompt — not bad luck. Add specificity rather than retrying.

Analyze Your Usage

Impact: Medium · Effort: Low

The VS Code Chronicle extension tracks session data locally and can surface optimization opportunities.

/chronicle:cost-tips

This analyzes your usage patterns and reports:

Model overuse (using frontier models for simple tasks)
Context bloat (sessions with unnecessarily large context)
Retry patterns (repeated regenerations)
Outlier sessions (unusually expensive interactions)

Setup

.vscode/settings.json
{
  "github.copilot.chat.localIndex.enabled": true
}

Let it collect data for 5–7 days before running /chronicle:cost-tips for useful recommendations.

For more details, see the Chronicle cost tips blog post.

caution

Chronicle only covers VS Code Copilot sessions. If you also use Claude Code or other tools, you'll need to track those separately.
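For tools that Chronicle does not cover, a rough stand-in for the outlier check can be run over whatever usage data you can export. The sketch below is not how Chronicle works; it assumes a hypothetical mapping of session IDs to total tokens and flags sessions well above the median.

```python
from statistics import median


def find_outliers(sessions: dict[str, int], factor: float = 5.0) -> list[str]:
    """Return IDs of sessions whose token total exceeds factor x the median."""
    if not sessions:
        return []
    typical = median(sessions.values())
    return sorted(
        session_id
        for session_id, tokens in sessions.items()
        if tokens > factor * typical
    )


if __name__ == "__main__":
    # Hypothetical sample data: session ID -> total tokens consumed.
    usage = {
        "s1": 12_000,
        "s2": 9_500,
        "s3": 15_000,
        "s4": 410_000,
        "s5": 11_000,
    }
    print(find_outliers(usage))
```

The median is used instead of the mean so that a single very expensive session does not raise the threshold it is being compared against.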

Batch Related Work

Impact: Low-Medium · Effort: Low

Each conversational turn has overhead: system prompt, conversation history, and tool context are all re-sent. Combine related small requests into one prompt.

Expensive (3 turns):

Turn 1: "Fix the type error on line 42 of auth.ts"
Turn 2: "Fix the type error on line 87 of auth.ts"
Turn 3: "Fix the type error on line 103 of auth.ts"

Cheaper (1 turn):

"Fix the type errors on lines 42, 87, and 103 of auth.ts"
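The difference can be estimated with simple arithmetic. In this sketch the fixed per-turn overhead (system prompt and tool context), the request size, and the reply size are all hypothetical numbers, and the batched variant assumes the combined prompt is about as long as the separate requests put together.

```python
def input_tokens_separate(
    overhead: int, request_tokens: int, reply_tokens: int, requests: int
) -> int:
    """Input tokens when each request is its own turn in one conversation."""
    total = 0
    history = 0
    for _ in range(requests):
        # Every turn re-sends the overhead and the history so far.
        total += overhead + history + request_tokens
        history += request_tokens + reply_tokens
    return total


def input_tokens_batched(overhead: int, request_tokens: int, requests: int) -> int:
    """Input tokens when all requests are combined into a single turn."""
    return overhead + request_tokens * requests


if __name__ == "__main__":
    separate = input_tokens_separate(3_000, 20, 300, requests=3)
    batched = input_tokens_batched(3_000, 20, requests=3)
    print(f"separate turns: {separate:,} input tokens")
    print(f"one batched turn: {batched:,} input tokens")
```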

tip

Balance batching with clarity. If a prompt gets so complex that the model struggles, you'll end up regenerating — which defeats the purpose. Group related items; don't create mega-prompts.

Leverage Caching

Impact: Low (mostly automatic) · Effort: Low

Provider-side caching reduces costs for repeated context — but it's largely automatic.

What providers do:

Anthropic's API supports prompt caching for repeated system prompts and prefix content
GitHub Copilot handles caching server-side — you benefit automatically

What you can do to help:

Keep instruction files (.github/copilot-instructions.md, CLAUDE.md) stable — don't edit them every session
Use consistent system prompts across sessions so cache hits are more likely
Avoid unnecessarily reordering context between turns
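Why ordering matters is easier to see with a toy model. The sketch below assumes prefix-based caching, where only the unchanged leading portion of a request can be served from cache; it treats each context block as one unit, which is a simplification of what real providers do. The block names and the `shared_prefix_len` helper are hypothetical.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading context blocks that two requests have in common."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


if __name__ == "__main__":
    first = ["system", "instructions", "file_a", "file_b"]

    # Appending a new question leaves the whole earlier prefix intact.
    appended = first + ["question_2"]
    print(shared_prefix_len(first, appended))

    # Swapping two files breaks the prefix at the first changed block.
    reordered = ["system", "instructions", "file_b", "file_a", "question_2"]
    print(shared_prefix_len(first, reordered))
```

Under this model, appending to the end keeps every earlier block eligible for a cache hit, while reordering discards everything from the first changed block onward.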

Quick Reference

Strategy	Impact	Effort	Applies to
Choose the right model	High	Low	All tools
Set thinking effort	High	Low	Claude Code, o-series models
Different models for subagents	High	Medium	Claude Code, multi-agent setups
Manage context window size	Medium-High	Low	All tools
Compact long conversations	Medium	Low	Claude Code, VS Code Copilot
Context files for reuse	Medium	Medium	All tools
Write better prompts	Medium	Low	All tools
Avoid regeneration loops	Medium	Low	All tools
Analyze usage with Chronicle	Medium	Low	VS Code Copilot
Batch related work	Low-Medium	Low	All tools
Leverage caching	Low	Low	Anthropic API, Copilot

Resources

Chronicle Cost Tips Blog Post — detailed walkthrough of the /chronicle:cost-tips command
Prompt Engineering — techniques that also reduce token waste
Multi-Agent Orchestration — the orchestrator + subagent pattern
AI Coding Agents — how agents consume context and tokens
GitHub Copilot Billing Documentation
Anthropic Token Counting
