What Opus 4.6 Actually Changes for Practitioners (And What It Doesn’t)
The effort dial saves 76% on tokens. Agent Teams will triple your bill. And the real upgrade is that it finally finishes the work.
I’ve been running Opus 4.6 since it dropped on February 5th.
Nobody expected it. The community was waiting for “Sonnet 5” and got an Opus upgrade instead. Hours later, OpenAI released Codex 5.3. Both companies fixed their biggest weaknesses on the same day.
You’ve probably already seen the benchmark tables and feature lists. I want to skip that and tell you what actually matters after using it for real work.
The feature nobody’s leading with will save you the most money
Everyone’s focused on the 1 million token context window and Agent Teams. The feature that will actually change your daily workflow is the effort dial.
Four levels. Low, medium, high, max. You control how hard the model thinks.
Here’s why this matters: medium effort matches Opus 4.5 quality at 76% fewer tokens. Read that again. The same quality you were paying full price for last week now costs roughly a quarter of what it did. For routine work (boilerplate, scaffolding, standard implementations), you set effort to low or medium and get the job done at a fraction of the cost.
Then when you hit the hard problems (the concurrency bug that’s been open for two weeks, the architecture decision that needs every edge case considered), you dial it to max. The model spends 40+ seconds thinking. One user reported it solved a two-week-old concurrency bug in a single pass, saving 18 engineer-hours.
The practical pattern:
Low: Boilerplate, scaffolding, file stubs. 3 seconds, cheap.
Medium: Routine production code. 8 seconds, matches last-gen quality.
High: Production features shipping to users. 30 seconds, catches edge cases.
Max: The bug your team has been staring at for days. 40+ seconds, traces root causes.
In Claude Code: /effort medium before your next prompt. That’s it. Most people won’t discover this for weeks. You know now.
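The tiers above boil down to a lookup: classify the task, pick the effort. This is an illustrative sketch of that habit, not an official API — the task categories and the `pick_effort` helper are hypothetical names, not part of Claude Code.

```python
# Hypothetical mapping from task type to effort level, mirroring the
# four tiers described above. Categories are illustrative, not official.
EFFORT_BY_TASK = {
    "boilerplate": "low",   # scaffolding, file stubs
    "routine": "medium",    # day-to-day production code
    "feature": "high",      # user-facing work where edge cases matter
    "hard_bug": "max",      # long-standing bugs, root-cause tracing
}

def pick_effort(task_type: str) -> str:
    """Return the effort level for a task, defaulting to medium."""
    return EFFORT_BY_TASK.get(task_type, "medium")
```

The useful part is the default: medium as the fallback encodes the article's advice that you only pay for max-effort thinking when the task has earned it.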
It actually finishes things
This sounds absurd as a headline feature, but it’s the single biggest difference. Previous models would degrade on complex, multi-step tasks. They’d lose context, hallucinate, or quietly give up halfway through. Opus 4.6 maintains coherence across long-running work in a way that feels qualitatively different.
SentinelOne reported it handled a multi-million-line codebase migration “like a senior engineer,” planning up front, adapting strategy mid-execution, and finishing in half the expected time. Cursor’s team noted “stronger tenacity, better code review, stays on long-horizon tasks where others drop off.” Windsurf’s CTO said it “feels noticeably better on tasks requiring careful exploration like debugging unfamiliar codebases. Thinks longer, which pays off.”
The reasoning isn’t just functional; it’s considered. Multiple practitioners independently used the same word: “elegant.” It catches edge cases other models miss. Cognition’s CEO said it “reasons through complex problems at a level we haven’t seen before. Considers edge cases that other models miss and lands on more elegant, well-considered solutions.”
When multiple CTOs who’ve tested every model say the same thing unprompted, that’s signal.
500 zero-day vulnerabilities
This one stopped me cold.
Buried in the release notes: the model found over 500 zero-day vulnerabilities in open-source code. Not theoretical weaknesses. Actual exploitable bugs. It wrote proof-of-concept exploits to verify each one was real. In a blind comparison, it won 38 out of 40 security assessments against competing models.
This isn’t about replacing security teams. It’s about what happens when you can run a senior-level security audit on every commit, every dependency, every pull request, automatically. The cost of finding vulnerabilities just dropped by orders of magnitude. If you maintain open-source projects or ship production code, this is the capability to pay attention to.
Context works at scale now (with caveats)
The 1 million token window isn’t just a bigger number. The model scores 76% on the needle-in-a-haystack benchmark at that scale, compared to Sonnet 4.5’s 18.5%. That’s a 4x improvement in actually using the context you feed it. Thomson Reuters specifically called out that it “handles much larger bodies of information with consistency that strengthens complex research workflows.”
But don’t dump everything into one window.
Attention is O(n²): doubling the context quadruples the cost. Quality still degrades past ~800K tokens. And the 1M window is still in beta. These are real constraints.
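To make the O(n²) point concrete, here’s the back-of-envelope arithmetic, assuming attention compute scales with the square of context length and ignoring constant factors:

```python
def relative_attention_cost(tokens: int, baseline: int = 100_000) -> float:
    """Attention compute relative to a baseline context, under O(n^2) scaling."""
    return (tokens / baseline) ** 2

# Doubling the context quadruples the attention cost:
assert relative_attention_cost(200_000) == 4 * relative_attention_cost(100_000)

# A 1M-token window costs 100x the attention compute of a 100K window:
print(relative_attention_cost(1_000_000))  # 100.0
```

Real pricing is per-token rather than quadratic, so your bill grows linearly; it’s latency and compute that scale this badly, which is why quality and speed sag at the far end of the window.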
The smart approach is what practitioners are calling the “Ralph Wiggum” method: multiple focused contexts instead of one massive window.
Context 1: Build the top-level map of what you’re working with
Context 2: Deep dive into specific flows or components
Context 3: Implementation based on what you learned
One activity per context. One goal per context. Clear handoffs between them. This works better and costs less than trying to cram everything into a single session.
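The three-context pattern can be sketched as a simple pipeline. `call_model` is a stand-in for whatever client you actually use; the function names and prompts here are illustrative, not a real SDK:

```python
def call_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your client of choice."""
    return f"<summary of: {prompt[:40]}>"

def multi_context_pipeline(codebase_hint: str, target_flow: str) -> str:
    # Context 1: build the top-level map. Only the short handoff crosses
    # the context boundary -- never the full transcript.
    overview = call_model(f"Map the architecture of {codebase_hint}.")

    # Context 2: deep dive, seeded only with the handoff from context 1.
    analysis = call_model(
        f"Given this map: {overview}\nTrace the {target_flow} flow in detail."
    )

    # Context 3: implementation, seeded with the distilled analysis.
    patch = call_model(f"Given this analysis: {analysis}\nImplement the fix.")
    return patch
```

The design choice that matters is the handoff: each context starts from a distilled summary, so no single window ever has to hold the whole problem.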
Agent Teams: impressive results, real costs
Agent Teams in Claude Code are the headline feature. Multiple specialized agents working in parallel, coordinated autonomously. One planner, one retriever, one coder, one reviewer. Rakuten used them to autonomously close 13 issues and assign 12 to teams in a single day. Bolt.new reported it “one-shotted a fully functional physics engine in a single pass.” Early testers saw a 30% reduction in end-to-end task runtime.
Now the cost reality. Each teammate has its own context window. A single 90-minute session can run $12-15 in API costs. Early testers report it can triple monthly Claude Code spend. One person burned through 1 million tokens on day one.
It’s also labeled “research preview” for a reason. Coordination failures still happen. Silent stalls and hung processes. Best used for well-defined, parallelizable tasks, not your entire workflow.
My recommendation: try Agent Teams on a single, concrete task. See what it does. Learn the coordination patterns. Don’t deploy it across everything yet.
Opus 4.6 vs. Codex 5.3: stop picking sides
They released within hours of each other. The benchmark split tells you everything:
Codex 5.3 wins on: well-specified implementation tasks, CLI work, CI pipelines, tests, scripts. It’s 25% faster and scores higher on Terminal-Bench (77% vs 65%).
Opus 4.6 wins on: fuzzy problems, large context needs, architecture decisions, and completeness. It scores higher on OSWorld (72% vs 64%), ARC AGI, and long-context reliability.
The philosophical difference is real. Anthropic’s approach is “think longer, work deeper,” letting the AI work autonomously and reviewing the output. OpenAI’s approach is “stay engaged throughout,” keeping humans in the loop and steering mid-execution.
Both work. For different things.
The practitioner consensus is converging on multi-model workflows:
Opus for design and architecture, Sonnet for implementation, Gemini for review
Codex for the full task, then Opus at the end to double-check
One model makes the change, another picks holes in it (adversarial review)
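The adversarial-review pattern in that last bullet can be sketched with two stubbed model calls. `author_model` and `critic_model` are placeholders for whichever pair of providers you use; the loop structure is the point, not the stubs:

```python
def author_model(task: str) -> str:
    """Placeholder: the model that writes the change."""
    return f"diff for: {task}"

def critic_model(task: str, diff: str) -> list[str]:
    """Placeholder: a different model that tries to pick holes in the diff."""
    return []  # an empty list means no objections found

def adversarial_review(task: str, max_rounds: int = 3) -> str:
    """Iterate author/critic until the critic has no objections."""
    diff = author_model(task)
    for _ in range(max_rounds):
        objections = critic_model(task, diff)
        if not objections:
            return diff  # critic signed off
        # Feed the objections back to the author for another pass.
        diff = author_model(f"{task}\nAddress: {objections}")
    return diff  # out of rounds; ship with known objections flagged
```

The round cap matters: without it, two models that disagree on style can volley forever, and every round is a full context’s worth of tokens.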
If you’re still loyal to one provider like it’s a sports team, you’re leaving performance on the table.
What didn’t change
The skill ceiling is still yours. Context engineering, knowing how to structure work for AI agents, and deciding what to automate versus think through yourself all matter more with each release. The model got better. The gap between someone who knows how to use it and someone who doesn’t got wider.
Cost discipline still matters. Agent Teams, 128K output tokens, and million-token contexts can combine into surprise bills that scale faster than you expect. The effort dial helps, but only if you actually use it.
It’s not magic. SWE-Bench scores actually dipped slightly from 4.5. MCP tool usage showed a small regression. Quality still degrades at the far end of the context window. The model is meaningfully better, not infallible.
What you should do this week
Five concrete actions:
Set effort to medium as your default. You get Opus 4.5 quality at 76% fewer tokens for routine work. This saves money immediately.
Try max effort on your hardest open bug. The one your team has been staring at. Give the model the full context and let it think for 40+ seconds. See what happens.
Run a security scan on your most critical repo. Ask it to analyze for vulnerabilities the way a security researcher would. The results might surprise you.
Experiment with Agent Teams on one well-defined task. Something parallelizable with clear subtasks. Watch the costs. Learn the patterns.
Restructure your context strategy. If you’ve been stuffing everything into one window, try the multi-context approach. One goal per context, clear handoffs. Better results, lower costs.
The real takeaway
This isn’t a revolution. It’s something more useful: a meaningful upgrade landing precisely where production workflows have friction. Tasks that once took two to three weeks to reach first-pass quality now take hours. Not because the model does something new, but because it does what previous models promised and didn’t deliver. It finishes the work.
The effort dial alone justifies upgrading if you’re a Claude Code user. Agent Teams are the future but not quite the present. And if you take one thing from this: the models are converging. Your system around them (the configuration, the skills, the context strategy) is the differentiator.
The thing you hired a consultant for is becoming the thing you do before lunch. Adapt accordingly.
Have you tried the effort dial yet? What did you set it to and what happened? I’m collecting real-world results, hit reply and tell me yours.
If someone on your team is about to discover Agent Teams the hard way (by watching their bill triple), do them a favor and forward this.