Copilot CLI's Rubber Duck: Cross-Model Review for Coding Agents 🦆
GitHub just shipped an experimental feature in Copilot CLI (announced April 6) that does something no mainstream coding agent has done before: it uses a second model from a different AI family to review the primary agent's work before it executes. The feature is called Rubber Duck, and it's one of the first real-world implementations of cross-model review built into a tool developers actually use daily. If you care about multi-agent workflows or code quality, this one is directly relevant.
What Is Rubber Duck?
Rubber Duck is a secondary review agent that critiques the primary orchestrator's plans and code at key decision points. The twist: it's powered by a model from a different family than the one doing the work.
The name is a nod to rubber duck debugging, the classic technique where you explain your problem to an inanimate rubber duck on your desk, and the act of articulating the problem forces you to think through it more carefully. Here, the second model plays the duck. The primary agent "explains" its plan by presenting it for review, and the reviewer model pokes holes in the reasoning.
In the current implementation, when the primary orchestrator is a Claude model, Rubber Duck uses GPT-5.4 as the reviewer. GitHub is exploring other family combinations, but the principle stays the same: the reviewer must come from a different model family than the orchestrator.
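To make the pairing rule concrete, here's a minimal sketch of how an agent harness could enforce it. The family names, the mapping, and the `pick_reviewer` helper are illustrative assumptions, not Copilot CLI's actual configuration or API.

```python
# Illustrative only: pair each orchestrator family with a reviewer from a
# *different* family. The mapping below is an assumption, not Copilot CLI's.
REVIEWER_FOR_FAMILY = {
    "claude": "gpt",    # Claude orchestrator -> GPT-family reviewer
    "gpt": "claude",    # GPT orchestrator -> Claude-family reviewer
}

def pick_reviewer(orchestrator_family: str) -> str:
    """Return a reviewer family guaranteed to differ from the orchestrator's."""
    reviewer = REVIEWER_FOR_FAMILY.get(orchestrator_family)
    if reviewer is None or reviewer == orchestrator_family:
        raise ValueError(f"no cross-family reviewer configured for {orchestrator_family!r}")
    return reviewer

print(pick_reviewer("claude"))  # -> gpt
```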
Rubber Duck is experimental: you access it via the /experimental slash command in Copilot CLI, GitHub's terminal-based coding agent (the gh copilot extension; think Copilot Agent Mode, but in your terminal instead of VS Code).
Rubber Duck is experimental and accessed via /experimental. The feature may change significantly or be removed in future Copilot CLI updates. Don't build workflows that depend on it staying exactly as described here.
Why a Different Model Family?
The self-review limitation
Coding agents already do self-reflection. They generate a plan, re-read it, and check their own work before proceeding. But there's a fundamental ceiling: a model reviewing its own output is bounded by the same training biases, blind spots, and reasoning patterns that produced the output in the first place. A Claude model re-reading its own Claude-generated plan will tend to agree with its own reasoning β the same way you'll fail to spot a typo you just wrote because your brain auto-corrects it.
Self-review helps, but it doesn't catch the things the model can't see.
Cross-family diversity
Models from different families (Claude, GPT, Gemini, and others) are trained on different data mixes, with different techniques, by different teams with different design philosophies. They make different kinds of mistakes. A GPT model reviewing a Claude-generated plan is more likely to catch errors that a Claude self-review would miss, precisely because it doesn't share the same blind spots.
Here's an analogy the team will recognize: this is like having your PR reviewed by someone from a different team instead of your pair partner. Your pair partner shares your context and assumptions. An outside reviewer doesn't, and that's exactly why they catch different things. Rubber Duck formalizes that "outside perspective" inside the agent loop.
When Does It Activate?
Rubber Duck doesn't run on every keystroke. It triggers at specific decision points where a second opinion has the highest impact:
| Trigger | Why it helps |
|---|---|
| After drafting a plan | Catches suboptimal architecture before implementation begins |
| After a complex implementation | Second set of eyes on intricate code |
| After writing tests, before executing | Catches test coverage gaps |
| When the agent gets stuck in a loop | Breaks out of circular reasoning |
| On-demand user request | When you want an explicit critique |
The "after planning" trigger is where most of the value comes from. Catching a wrong assumption before implementation avoids the compounding cost of building on a flawed foundation. An architectural mistake caught at the plan stage costs minutes; the same mistake caught after implementation costs an hour of rework.
What Does It Actually Catch?
Three categories of problems show up consistently in cross-model review:
Architectural flaws. The primary model drafts a plan that sounds reasonable in isolation but misses a fundamental lifecycle issue: say, a service that starts and immediately exits because the agent didn't think through how the process stays alive. The primary model missed it because its plan was internally consistent. The reviewer caught it because it evaluated the plan against a different set of assumptions about how services behave.
Silent data bugs. The agent writes a loop that overwrites the same key on every iteration, producing incorrect results without ever throwing an error. The code is syntactically clean and passes a surface-level self-review. A different model family, with different training on iteration patterns, flags the overwrite as suspicious.
Cross-file consistency. New code stops writing to a data source that other files in the project still read from. The primary model focused on the file it was editing and didn't flag the downstream impact. The reviewer, approaching the plan fresh, asks the kind of question that's easy to miss when you're deep in implementation: "What reads from this data source?"
The common thread: the primary model's mistakes aren't random; they're systematic, shaped by its training. A different family has different systematic tendencies, so it catches what the first model consistently overlooks.
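Of the three, the silent-data-bug category is the easiest to picture in code. The loop below (a generic illustration, not an example from GitHub's write-up) runs cleanly and reads plausibly, yet only the last record survives:

```python
# Buggy version: every iteration writes to the same key, so earlier values
# are silently overwritten and no error is ever raised.
records = [("alice", 3), ("bob", 5), ("carol", 2)]

counts = {}
for name, value in records:
    counts["total"] = value      # overwrites on each pass
print(counts)                    # {'total': 2}  -- wrong, but nothing failed

# What the plan presumably meant: one entry per key.
counts = {}
for name, value in records:
    counts[name] = value
print(counts)                    # {'alice': 3, 'bob': 5, 'carol': 2}
```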
Performance: How Much Does It Help?
GitHub reports results on SWE-Bench Pro, a benchmark of real-world software engineering tasks drawn from open-source projects: fixing actual bugs, implementing real features, and resolving genuine issues.
The headline numbers:
| Configuration | SWE-Bench Pro result | Notes |
|---|---|---|
| Claude Sonnet 4.6 (baseline) | Baseline | Single model, no review |
| Claude Sonnet 4.6 + Rubber Duck (GPT-5.4) | +74.7% of gap to Opus | Cross-family review |
| Claude Opus 4.6 (solo) | Upper bound | Most capable single model |
According to GitHub's benchmarks, adding Rubber Duck to Claude Sonnet 4.6 closes 74.7% of the performance gap between Sonnet and Opus, without switching to the more expensive model. On tasks involving 3+ files and 70+ steps, the improvement is +3.8% over the Sonnet baseline. On the hardest problems in the benchmark, the gain increases to +4.8%.
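A note on reading "closes 74.7% of the gap": if Sonnet alone scores S and Opus alone scores O on the benchmark, Sonnet plus Rubber Duck lands at roughly S + 0.747 × (O - S). The numbers below are invented placeholders purely to show the arithmetic, not GitHub's actual scores.

```python
# Placeholder scores chosen only to illustrate the "percent of gap closed" math.
sonnet_score = 40.0                 # hypothetical Sonnet 4.6 baseline
opus_score = 50.0                   # hypothetical Opus 4.6 upper bound
gap_closed = 0.747                  # the reported 74.7% of the gap

sonnet_plus_duck = sonnet_score + gap_closed * (opus_score - sonnet_score)
print(round(sonnet_plus_duck, 2))   # ~47.47, between the two single-model scores
```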
These are GitHub's evaluations on SWE-Bench Pro. Real-world results will vary by task type. The key takeaway is directional: cross-family review consistently helps, especially on harder multi-file tasks where architectural and cross-file errors are most likely.
Aliz Stack Connection
Multi-agent orchestration in practice. Rubber Duck is a real-world implementation of the Orchestrator + Subagents pattern, but with a twist: the subagent isn't a worker executing a subtask. It's a reviewer critiquing the orchestrator's plan. Our multi-agent docs describe the theory; Rubber Duck is one of the first tools to ship it as a built-in feature.
Automating part of our review discipline. Our AI Coding Guidelines emphasize treating AI output as a draft that requires human review. Rubber Duck automates part of that review within the agent loop, catching architectural and cross-file issues before the code even reaches you. It doesn't replace human review, but it reduces the surface area of what you need to catch manually.
A review step in the agent loop. Our AI Coding Agents page describes the plan → act → observe → repeat loop that drives agent behavior. Rubber Duck inserts a review step into that loop, particularly after planning. It's a concrete example of how the agent lifecycle is evolving beyond the basic loop.
Practical implication: if you're using Copilot CLI with a Claude model, /experimental is essentially free to try. The cost is small (some extra latency and tokens per review step) and the payoff is meaningful: fewer architectural mistakes and cross-file inconsistencies reaching your review queue.
Tradeoffs to be aware of: the reviewer model can produce false positives, flagging correct code as suspicious. When the two models disagree, you still need to make the call. And the added latency (an extra model call per review step) can slow down rapid iteration loops. For quick, single-file changes, the overhead may not be worth it.
Broader observation: cross-model review is likely the beginning of multi-model collaboration as a standard feature of coding agents. Today it's experimental in Copilot CLI. Tomorrow it'll be a checkbox in every agent's settings.
If you're using Copilot Agent Mode for day-to-day work and Claude Code for complex tasks (our current recommendation from the AI Coding Agents page), Rubber Duck in Copilot CLI is a third option worth trying, particularly for multi-file tasks where you want built-in review without switching tools.
