Skip to main content

Copilot Token Efficiency: What the Platform Is Doing Behind the Scenes

· 4 min read
Gergely Sipos
Frontend Architect

Since June 1, every Copilot token has a dollar sign attached. You're thinking about model selection, prompt size, and whether that 200-line file really needs to be in context. Good — but while you optimize your side, the VS Code team has been shipping infrastructure-level improvements that cut token consumption and latency without any user action. Ryan Caldwell and Bhavya U published a deep dive on these changes, and the numbers are worth knowing. The platform is doing heavy lifting so you can focus on the strategies you control.

Prompt Prefix Caching — Your Context Is Now 10x Cheaper

In agentic sessions, the vast majority of each request is repeated content: system instructions, tool definitions, conversation history. Only the tail end — your latest message and tool results — is new. Prompt prefix caching exploits this by storing the shared prefix in GPU-local memory, so subsequent requests skip re-processing it.

OpenAI now supports extended caching with prompt_cache_retention: "24h". The default retention was 5-10 minutes; stretching it to 24 hours means your cache survives between coding sessions. Cached input tokens cost 10x less than uncached ones. The impact scales with how long you step away: cache hit rates improved +135-142% for 20-30 minute gaps, and a staggering +279-919% for 40-60 minute gaps compared to the old retention window.

Anthropic uses explicit cache_control breakpoints (up to 4 per request) to mark stable prefix boundaries, achieving a ~94% cache hit rate in agentic flows.

This is entirely automatic — you benefit without changing anything.

Smarter Tool Loading — Less Junk in Your Prompt

Every MCP server and built-in tool adds its full JSON schema to the prompt. With dozens of tools registered, that's thousands of tokens on every single request — most of which the model never uses.

The fix: deferred tool loading. The model now sees only tool names and one-line descriptions upfront. When it decides to call a tool, the full schema is loaded on demand via tool_search. Results:

  • GPT-5.4: -9.81% tokens per turn, -8.97% session-wide
  • GPT-5.5: -8.61% tokens per turn, -10.92% session-wide
  • Claude: initial server-side tool search cut -18% prompt tokens, later moved to client-side semantic search

A bonus finding: less tool noise in the prompt leads to better tool selection — -4.01% user error rate on Claude Sonnet 4.6. Less context pollution means fewer wrong tool calls.

WebSocket Transport — Faster Multi-Turn Flows

Agentic coding is chatty. A single user request can trigger dozens of sequential API calls as the model reads files, runs commands, and iterates. Each call over HTTP/2 pays connection setup and TLS overhead.

The VS Code team switched to persistent WebSocket connections using OpenAI's Responses API WebSocket mode. The gains at p50:

  • -19.46% time to first token
  • -13.55% time to complete

At p95, still solid: -12.92% TTFT and -7.86% completion time. This is now the default transport for GPT-5.2+ models. The latency reduction drove a 1.27-2.17% increase in active users and 1.90-3.14% increase in 2-day engagement — faster responses keep people in flow.

What You Still Control

These are platform improvements — automatic, no configuration required. But they stack multiplicatively with your own optimization choices:

  • Model selection: pick the right model for the task complexity
  • Thinking effort: dial down reasoning tokens for straightforward edits
  • Context management: keep your prompt lean with .github/copilot-instructions.md and focused file sets
  • Subagent delegation: offload exploration to smaller, cheaper models

Infrastructure efficiency is the foundation. Your optimization choices multiply the savings. See our token optimization guide for the full strategy breakdown, and use chronicle:cost-tips to measure your actual spend per session.