This is what I wrote in Tech Insights 2025 week 28 (July):
“In 9-12 months I predict the models will be able to run autonomously for several hours, probably more, and will be able to produce much higher output quality than is possible today. Let’s say that a subscription for such a model costs $5,000 per month. The business model will then change from augmenting existing workers with a new tool, to replacing one or more workers with the new tool.”
Would you pay $2,000-$5,000 for a monthly license of OpenAI Codex? This is what OpenAI now expects companies to do, after significantly limiting the included weekly usage for Codex for business users and rolling out usage-based pricing for business and enterprise customers. How much more efficient would a developer have to be to justify a cost of $2,000 or more per month? Twice as efficient? We have already achieved that. The main question is how it should be financed. If you have 100 developers, can you afford to pay $2 million extra per year in license costs for double the productivity? Most companies are probably not ready for this.
But if you could swap out 100 of your “non-AI” developers for 50 “AI-powered” developers who can do the same job with higher quality, fewer bugs, better tests and much faster turnaround, would you do it? The total cost would be lower and you would get more done. You would need fewer middle managers, less office space, fewer meetings, lower training costs and less sick leave. One company that did this is Oracle. Last Tuesday they sent an email to 30,000 of their employees, informing them that the decision had been made to eliminate their roles from the organization. “Eliminated” is the actual word used. Oracle cut 20% of its staff to make sure it can afford to build data centers and AI-augment the rest.
“From the surge in demand on Codex and many other services, it is clear that AI companies are going to go through a phase of demand outpacing supply.” Thibault Sottiaux, OpenAI.
As I wrote in Tech Insights July, I have been concerned about this possible shift for quite some time now. Yet there is little discussion about it. If enough companies start using the pay-as-you-go license for Codex and are happy paying $2,000 and above per developer, you can be sure that prices will soon go up for everyone. But if most companies reject the pay-as-you-go model, then OpenAI and Anthropic will have to keep subsidizing the running costs for at least a little while longer. We should know more about how things evolve in the coming months.
Thank you for being a Tech Insights subscriber!
Listen to Tech Insights on Spotify: Tech Insights 2026 Week 15 on Spotify
Notable model releases last week:
- GLM-5V-Turbo. New model from Z.ai that converts screenshots into runnable code.
- MAI-Transcribe-1. A new speech-to-text model from Microsoft that tops accuracy benchmarks across 25 languages.
- Qwen3.6-Plus. New closed-source release by Alibaba with “enhanced agentic coding capabilities”.
- Veo 3.1 Lite. Google’s new lower-cost video generation model.
THIS WEEK’S NEWS:
- Anthropic Accidentally Leaks Claude Code’s Full Source Code
- Anthropic Blocks Claude Subscriptions from Third-Party Tools Like OpenClaw
- OpenAI Switches Codex to Usage-Based Pricing for Business and Enterprise Teams
- Sakana AI Launches Autonomous Business Research Agent “Marlin” in Closed Beta
- Cursor 3 Brings Agent-First Interface to Replace the VS Code Fork
- Google Releases Gemma 4: Four Open Models from 2B to 31B
- Netflix Releases VOID: A Video Object Removal Model With Physics-Aware Inpainting
- Microsoft 365 Copilot Researcher Gets Critique and Council Modes
Anthropic Accidentally Leaks Claude Code’s Full Source Code
https://www.pcmag.com/news/anthropic-issues-8000-copyright-takedowns-to-scrub-claude-code-leak

The News:
- A packaging error in @anthropic-ai/claude-code v2.1.88 on npm accidentally included a source map pointing to a publicly accessible ZIP archive on Anthropic’s Cloudflare R2 storage. The archive contained 1,906 TypeScript files and over 512,000 lines of production source code.
- The archive was accessible for roughly three hours before Anthropic pulled it. In that window, a GitHub fork accumulated over 1,100 stars and 1,900 forks.
- Inside the code, a USER_TYPE === ‘ant’ flag identifies Anthropic employees at runtime and activates separate system prompts, debug functions, and experimental features hidden from standard users.
- One of those employee-only features is “Undercover Mode,” which automatically strips AI attribution from commits when an Anthropic employee pushes to a public repository.
- The code also references several unreleased features: KAIROS, a persistent background daemon that consolidates memory across sessions; ULTRAPLAN, a 30-minute remote planning mode; Coordinator Mode for spawning parallel sub-agents; and BUDDY, a Tamagotchi-style terminal companion.
- Anthropic confirmed the incident was a packaging error and stated that no customer data, credentials, or model weights were exposed.
My take: This is the full proprietary source code of Claude Code, including the complete orchestration logic, tool library, slash command implementations, and fragments of an internal product roadmap. The day after the accident Anthropic issued over 8,000 DMCA takedown requests to GitHub, but developers quickly used Codex to rewrite the TypeScript codebase in Python under the name “Claw Code”, which reached 47,000 GitHub stars within five hours of publication. Since this is a full rewrite and shares no code with Claude Code, Anthropic cannot legally demand that it be taken offline.
What I found most interesting about the leak is the “USER_TYPE === ‘ant’” flag in the code base. It checks whether the user is an Anthropic employee, and if so the following is enabled:
- Post-edit verification: A function called getSimpleDoingTasksSection in constants/prompts.ts injects an extra instruction only for “ant” users: “Before reporting a task complete, verify it actually works: run the test, execute the script, check the output”. Standard users do not get this instruction.
- Anti-laziness and context handling rules: Enhanced verification and more thorough editing protocols.
- Internal tooling access: A separate set of tools and feature flags only load when USER_TYPE === ‘ant’, including the full 44-item internal feature set.
- Personality and communication style: According to the source, the AI’s willingness to push back, how it handles errors, and its overall communication style all differ for ant users.
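As a minimal illustration of what such a dual-mode gate looks like in practice, a single user-type check can branch both the system prompt and the tool set. Note that the leaked codebase is TypeScript; this is a Python sketch of the pattern, and all prompt strings and tool names below are my own illustrative stand-ins, not Anthropic’s actual identifiers.

```python
# Sketch of the dual-mode pattern described in the leak: one user-type flag
# gates extra prompt instructions and internal-only tools. All names here
# are illustrative, not Anthropic's actual code.

EMPLOYEE_USER_TYPE = "ant"

BASE_PROMPT = "You are a coding assistant."
VERIFY_INSTRUCTION = (
    "Before reporting a task complete, verify it actually works: "
    "run the test, execute the script, check the output."
)

def build_system_prompt(user_type: str) -> str:
    """Return the system prompt, adding the verification rule only for employees."""
    sections = [BASE_PROMPT]
    if user_type == EMPLOYEE_USER_TYPE:
        sections.append(VERIFY_INSTRUCTION)
    return "\n\n".join(sections)

def load_tools(user_type: str) -> list:
    """Standard users get the public tool set; employees get internal extras."""
    tools = ["read_file", "edit_file", "run_command"]
    if user_type == EMPLOYEE_USER_TYPE:
        tools += ["undercover_mode", "ultraplan", "coordinator_mode"]
    return tools
```

The point is that a single equality check silently produces two different products from the same binary, which is exactly why the flag stood out in the leak.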
If you ever wondered how Anthropic employees can use Claude Code to build production-critical source code in large code bases with hundreds of thousands of lines, while it just produces slop in your own code base, this is one of the reasons. I really don’t like this dual-mode approach – I understand Anthropic wants to save tokens, but the current experience as a non-Anthropic user in Claude Code is not great, especially with larger code bases.
Read more:
- GitHub: Claw Code (Rust port of Claude Code)
- Claude Code leak reveals 35 hidden features — here’s the open source version : r/OpenSourceeAI
- Claude Code Leak: Kairos, Anthropic’s Hidden Permanent Agent | Idlen
- Anthropic Employees Use This Secret CLAUDE.md (USER_TYPE=ant) – Here’s the leaked & improved version : r/ClaudeAI
- Follow-up: Claude Code’s source confirms the system prompt problem and shows Anthropic’s different Claude Code internal prompting : r/ClaudeCode
Anthropic Blocks Claude Subscriptions from Third-Party Tools Like OpenClaw
https://twitter.com/bcherny/status/2040206440556826908?s=43

The News:
- Effective April 4, Anthropic ended support for Claude Pro and Max subscriptions being used through third-party AI agent frameworks like OpenClaw, the popular open-source coding agent.
- Boris Cherny, head of Claude Code at Anthropic, announced on X: “Claude subscriptions will no longer cover usage on third-party tools like OpenClaw. Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API.”
- Users who want to continue using OpenClaw with Claude must now either purchase discounted “extra usage bundles” billed separately from their subscription, or supply a Claude API key at full API rates: $3 per million input tokens and $15 per million output tokens for Claude Sonnet 4.6, and $15 and $75 respectively for Claude Opus 4.6.
- As a transition measure, affected subscribers receive a one-time credit equal to their monthly plan cost, redeemable until April 17, and can receive up to 30% discounts on pre-purchased extra usage bundles.
- Anthropic confirmed the restriction will extend to all third-party harnesses in the coming weeks, not just OpenClaw.
“A single OpenClaw instance running autonomously for a full day, browsing the web, managing calendars, responding to messages, executing code, can consume the equivalent of $1,000 to $5,000 in API costs, depending on the task load. Under a $200-per-month Max subscription, that is an unsustainable transfer of compute costs from the user to Anthropic.” TheNextWeb
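The quoted range is easy to sanity-check with the Claude Opus 4.6 list prices mentioned above ($15 per million input tokens, $75 per million output tokens). The daily token volumes in this sketch are my own assumptions, not figures from TheNextWeb:

```python
# Back-of-the-envelope estimate of one day of autonomous agent usage at
# Claude Opus 4.6 API list prices. Token volumes are assumed for illustration.

MILLION = 1_000_000

OPUS_INPUT = 15.00   # USD per million input tokens
OPUS_OUTPUT = 75.00  # USD per million output tokens

def daily_cost(input_tok: int, output_tok: int) -> float:
    """API cost in USD for one day of agent activity."""
    return input_tok / MILLION * OPUS_INPUT + output_tok / MILLION * OPUS_OUTPUT

# Assumed volumes: 50M input tokens (repeatedly re-read context, web pages)
# and 10M output tokens over a full day of autonomous work.
cost = daily_cost(50 * MILLION, 10 * MILLION)
print(f"${cost:,.0f}/day")  # $1,500/day at these assumed volumes
```

Even with conservative assumptions, a single always-on agent lands comfortably inside the $1,000-$5,000 range, which makes the economics of a $200 flat-rate plan obvious.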
My take: I think everyone understands that the current $200 subscription model is heavily subsidized, especially for those who used it to its limit. Now that there are companies willing to pay the actual cost of running these models (which is at least 10-20 times that figure every month), it was just a matter of time before Anthropic tightened the usage restrictions on these subscriptions.
The next generation of models coming in 1-2 weeks (OpenAI “Spud” and Anthropic “Mythos”) is rumored to work autonomously for much longer on much larger tasks than is possible today. With usage already pushing the boundaries of what the companies can deliver, my bet is that these newer models will also be heavily restricted on the “unlimited” $200 plans. I could be wrong of course, but we’ll know in 1-2 weeks.
Read more:
- Anthropic essentially bans OpenClaw from Claude by making subscribers pay extra | The Verge
- Claude is BANNING OpenClaw Users (What Now?) – YouTube
- Anthropic blocks OpenClaw from Claude subscriptions in cost crackdown | TNW
OpenAI Switches Codex to Usage-Based Pricing for Business and Enterprise Teams
https://openai.com/index/codex-flexible-pricing-for-teams

The News:
- OpenAI now offers pay-as-you-go Codex-only seats for ChatGPT Business and Enterprise teams, billed on token consumption with no fixed monthly fee and no usage rate limits.
- Teams can add Codex-only seats without purchasing a full ChatGPT Business subscription, separating coding agent access from the broader ChatGPT product.
- Standard ChatGPT Business annual seats, which bundle ChatGPT and Codex with usage caps, dropped from $25 to $20 per seat per month.
- Eligible workspaces receive a $100 credit per new Codex-only member added, capped at $500 per team.
- OpenAI reports Codex usage in business and enterprise grew 6x since January 2026, with over 9 million paying business users and 2 million builders using Codex weekly.
My take: There’s a tool called ccusage that counts the total number of API requests you have made in OpenAI Codex, and then shows you what it would have cost if you had used the API instead of the $200/month flat-rate ChatGPT Pro subscription. When I ran the tool it showed that I had used over 17 billion tokens across GPT-5.2, GPT-5.3-codex and GPT-5.4. It looks like this:
Codex API usage over 3 months:
- Input tokens: 846 million tokens
- Output tokens: 126 million tokens
- Reasoning tokens (billed as output): 96 million tokens
- Cache reads: 16 billion tokens
This is what each million tokens costs when purchased via the API:
- GPT-5.4 input: $2.50 / million tokens
- GPT-5.4 output (including reasoning): $15 / million tokens
- GPT-5.4 cached input: $0.25 / million tokens
This means that if I had not used my personal flat-rate ChatGPT Pro subscription, I would have paid $6,305 for three months of spare-time usage on the pay-as-you-go license. This is the true cost of running Codex, and this is what OpenAI expects companies to pay for these services. For a full-time developer this cost could easily reach $5,000 per month.
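The cost arithmetic itself is straightforward. The sketch below applies the GPT-5.4 list prices above to the full three-month token mix; since the actual usage also spanned GPT-5.2 and GPT-5.3-codex, which are billed at different rates, this flat-rate simplification lands above the blended $6,305 figure:

```python
# Usage-based Codex cost at flat GPT-5.4 list prices. This is a simplifying
# assumption: the real three-month usage mixed three models with different
# per-token rates, so the blended total reported by ccusage differs.

MILLION = 1_000_000

PRICE_INPUT = 2.50    # USD per million input tokens
PRICE_OUTPUT = 15.00  # USD per million output tokens (reasoning billed as output)
PRICE_CACHED = 0.25   # USD per million cached input tokens

def api_cost(input_tok, output_tok, reasoning_tok, cached_tok):
    """Total API cost in USD for a given token mix."""
    return (
        input_tok / MILLION * PRICE_INPUT
        + (output_tok + reasoning_tok) / MILLION * PRICE_OUTPUT
        + cached_tok / MILLION * PRICE_CACHED
    )

# Three months of usage, from the ccusage report above
cost = api_cost(
    input_tok=846 * MILLION,
    output_tok=126 * MILLION,
    reasoning_tok=96 * MILLION,
    cached_tok=16_000 * MILLION,
)
print(f"${cost:,.0f}")  # $9,445 at flat GPT-5.4 rates
```

Notice how the cache reads dominate despite the tiny per-token price: 16 billion cached tokens cost more than all the input tokens combined.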
For now my recommendation is to avoid the pay-as-you-go license if possible and continue with purchasing individual $200 / month ChatGPT Pro licenses. You get 20x more usage for your money, and while $200 might sound expensive, it’s a bargain compared to the actual cost of running these tools and what I believe most companies will pay for these licenses later this year.
Sakana AI Launches Autonomous Business Research Agent “Marlin” in Closed Beta
https://sakana.ai/marlin-beta/

The News:
- Sakana AI has opened closed beta registration for Sakana Marlin, its first commercial product, targeting strategy and research professionals in finance, consulting, and corporate planning.
- Given a single topic prompt, Marlin runs autonomously for up to eight hours without human intervention, producing a structured slide deck and a written report spanning several dozen pages.
- Marlin uses AB-MCTS (Adaptive Branching Monte Carlo Tree Search), a method that frames reasoning as a tree-search problem to decide which hypotheses to pursue, discard, or escalate across multiple LLMs. The technique received a spotlight recognition at NeurIPS 2025, placing it in the top ~10% of accepted papers.
- A single Marlin session can execute hundreds to thousands of LLM queries, dynamically routing compute toward the highest-priority research branches.
- A second underlying component derives from Sakana’s AI Scientist project, published in Nature, which automated the full scientific discovery cycle from idea generation to peer review. Marlin applies the same workflow automation to business research tasks.
- The closed beta is free during the testing period. The registration form is currently only available in Japanese.
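Sakana has not published Marlin’s internals beyond the AB-MCTS paper, but the tree-search framing can be loosely illustrated with a generic UCB1 selection rule (a standard bandit formula, not Sakana’s actual algorithm): each research branch gets a score balancing how promising it has looked so far against how rarely it has been tried.

```python
import math

# Toy illustration of tree-search over research hypotheses: expand the branch
# with the best explore/exploit score (UCB1). Generic sketch of the framing,
# not Sakana's actual AB-MCTS implementation.

def ucb1(mean_value: float, visits: int, total_visits: int, c: float = 1.4) -> float:
    """UCB1 score: estimated value plus an exploration bonus for rarely-tried branches."""
    if visits == 0:
        return float("inf")  # always try an unexplored hypothesis first
    return mean_value + c * math.sqrt(math.log(total_visits) / visits)

def pick_branch(branches):
    """branches: list of (name, mean_value, visits). Returns the branch to expand next."""
    total = sum(v for _, _, v in branches) or 1
    return max(branches, key=lambda b: ucb1(b[1], b[2], total))[0]

branches = [("pricing analysis", 0.8, 10), ("competitor scan", 0.6, 2), ("new angle", 0.0, 0)]
print(pick_branch(branches))  # the unexplored "new angle" wins via its infinite bonus
```

Over hundreds or thousands of LLM queries, a rule like this is what lets the agent route compute toward the highest-priority branches instead of expanding everything uniformly.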
My take: Sakana AI made headlines a year ago when their research agent managed to write a paper that passed peer review at one of the top machine learning conferences. It will be interesting to see whether they can balance API costs against reasoning usage to keep the service affordable. I will be following Sakana Marlin closely, since it might have implications for how most research institutes will work in the near future.
Cursor 3 Brings Agent-First Interface to Replace the VS Code Fork
https://cursor.com/blog/cursor-3

The News:
- Cursor 3 is a rebuilt desktop interface for AI-assisted software development, built from scratch rather than forked from VS Code, centering the workflow around managing multiple coding agents across repositories.
- All local and cloud agents appear in a unified sidebar, including agents triggered from mobile, web, Slack, GitHub, and Linear, allowing parallel agent sessions within one workspace.
- Cloud agents generate demos and screenshots of their work for human review, the same experience previously available only at cursor.com/agents.
- Agent sessions can be moved between local and cloud environments: shift to local for editing and testing, or push to cloud to keep a task running while offline or to free up focus for the next task.
- Cursor 3 includes Composer 2, described as Cursor’s own frontier coding model with high usage limits, integrated directly into the local agent workflow.
- A new diffs view handles staging, committing, and PR management within the interface, alongside a built-in browser for navigating local web apps and a marketplace with hundreds of plugins extending agents via MCPs, skills, and subagents.
My take: Cursor 3 looks extremely similar to OpenAI Codex for desktop. I mean it almost looks like a 1:1 copy. Compared to Cursor 2.0, this environment is something completely different. In Cursor 2.0 the agent window was stuck on top of the IDE, now Cursor has left the IDE behind and gone all-in on agentic engineering and chat sessions.
This means the main interface is no longer a file explorer and editor window, but a chat list and chat history window. This is probably the right way to go, and every time I start VS Code with GitHub Copilot running in a side-pane chat window it feels wrong and old. Before the end of 2026 I believe most of us will no longer even need to start a dedicated IDE, and almost everyone will work in these new agentic environments, like Cursor 3. For those who want to keep using the “old style” IDE it’s still included as an option, but it’s clear where Cursor is heading with this release.
Google Releases Gemma 4: Four Open Models from 2B to 31B
https://deepmind.google/models/gemma/gemma-4

The News:
- Google released Gemma 4 on April 1, 2026 under an Apache 2.0 license, a shift from the more restrictive custom license used in previous Gemma generations. The family consists of four model sizes: E2B, E4B, 26B MoE (Mixture of Experts, activating 4B parameters per pass), and 31B Dense.
- The 31B model ranks #3 on the Arena AI open-model text leaderboard with a score of 1452, and the 26B MoE ranks #6 at 1441, as of April 2, 2026.
- On AIME 2026 mathematics, the 31B reaches 89.2% and the 26B reaches 88.3%, compared to 20.8% for Gemma 3 27B. On LiveCodeBench v6 coding problems, the 31B scores 80.0% versus 29.1% for Gemma 3 27B.
- The E2B and E4B edge models support audio and vision input and are designed to run offline on devices such as Raspberry Pi and Jetson Nano with near-zero latency.
- All four models include native function calling and support 140 languages.
My take: Compared to Qwen 3.5 27B, Gemma 4 31B scores lower on several reasoning benchmarks like MMLU-Pro, GPQA Diamond and TAU2-Bench. Gemma 4 does, however, lead on a few coding benchmarks like Codeforces ELO. So when should you pick which model? If you need wide language support, pick Gemma 4 with its 140 supported languages; otherwise pick Qwen 3.5. Both models are open source under the Apache 2.0 license, and each has its own strengths.
Netflix Releases VOID: A Video Object Removal Model With Physics-Aware Inpainting
https://huggingface.co/netflix/void-model

The News:
- Netflix and INSAIT (Sofia University) released VOID (Video Object and Interaction Deletion) on April 3, marking Netflix’s first public model release on Hugging Face. The model is available as open-weights under the Apache-2.0 license.
- VOID removes objects from video while also correcting the physical consequences of that removal. For example, removing a person jumping into a pool produces footage where the water surface appears undisturbed, with no splash.
- The core technical mechanism is a “quadmask” with four discrete values (0, 63, 127, 255) that encodes the primary removal target, overlap regions, physically affected regions, and background. This mask structure allows the model to distinguish between areas that need pixel-fill and areas where physical plausibility must be reconstructed.
- The model is built on CogVideoX-Fun-V1.5-5b-InP, a 5B parameter 3D Transformer from Alibaba PAI, fine-tuned for interaction-aware inpainting. Default resolution is 384×672 with a maximum of 197 frames, running in BF16 with FP8 quantization.
- Inputs are a video clip, a quadmask, and a text prompt describing the scene after removal, making it a vision-language pipeline rather than a purely visual one.
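To make the quadmask concrete, here is a small sketch of the four-value encoding. The pairing of each value to a region name is my assumption for illustration, since the description above lists the values and regions without an explicit mapping:

```python
# Sketch of VOID's four-value "quadmask". The discrete values (0, 63, 127, 255)
# come from the model description; which value maps to which region is my
# assumption, and the region names are paraphrased.

QUADMASK = {
    0:   "background",           # untouched pixels
    63:  "physically affected",  # e.g. water that must settle after removal
    127: "overlap region",       # where the target intersects other objects
    255: "removal target",       # the object to delete
}

def encode_quadmask(region_grid):
    """Map a grid of region names to the quadmask's discrete pixel values."""
    name_to_value = {name: value for value, name in QUADMASK.items()}
    return [[name_to_value[name] for name in row] for row in region_grid]

grid = [["background", "removal target"],
        ["physically affected", "overlap region"]]
print(encode_quadmask(grid))  # [[0, 255], [63, 127]]
```

The design choice worth noting is the separation between pixel-fill regions and physics regions: a plain binary mask cannot tell the model that the splash must disappear along with the swimmer.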
My take: In a human preference study VOID received a 64.8% preference rate, compared to 18.4% for Runway, with other alternatives scoring even lower. Runway addresses appearance-level artifacts like shadows and reflections but does not model physical causality, such as objects that fall or shift when a person is removed. There are plenty of use cases for this model, such as post-production removal of objects, people, or props that appeared unintentionally on set.
Microsoft 365 Copilot Researcher Gets Critique and Council Modes
https://www.testingcatalog.com/microsoft-365-copilot-gets-critique-and-council-modes

The News:
- Microsoft added two multi-model features to the Researcher agent in Microsoft 365 Copilot on March 30, targeting professionals and enterprises who use Copilot for complex research tasks.
- Critique is now the default mode when users select “Auto” in the model picker. It runs OpenAI’s GPT to draft a research report, then passes the output to Anthropic’s Claude, which reviews it for source reliability, completeness, and citation quality before the user sees the result.
- On the DRACO benchmark (Deep Research Accuracy, Completeness, and Objectivity), Researcher with Critique scored 57.4 across 100 complex research tasks in 10 domains, a 13.88% improvement over Perplexity Deep Research with Claude Opus 4.6, which scored 50.4 and was the top-ranked system in the original benchmark paper.
- Council runs Anthropic and OpenAI models simultaneously on the same research query, then uses a third judge model to generate a side-by-side summary of where the two models agree, where they diverge, and what unique insights each produced.
- Both features are currently available through the Microsoft 365 Copilot Frontier program.
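Microsoft has not published the orchestration details, but the draft-then-critique pattern itself is simple to sketch. The two model calls below are stubs standing in for the GPT drafting step and the Claude review step; the actual Researcher pipeline is not public:

```python
# Generic draft-then-critique pattern, as described for Researcher's Critique
# mode: one model drafts, a second reviews before the user sees anything.
# Both model calls are stubs; Microsoft's real orchestration is not public.

def draft_report(topic: str) -> str:
    """Stub for the drafting model (GPT in Critique mode)."""
    return f"Draft report on {topic}: ..."

def critique_report(draft: str) -> str:
    """Stub for the reviewing model (Claude): checks sources and citations."""
    return (
        f"Critique of [{draft[:20]}...]: "
        "check source reliability, completeness, citation quality."
    )

def critique_pipeline(topic: str) -> dict:
    """Run draft, then review, returning both so a caller can revise the draft."""
    draft = draft_report(topic)
    review = critique_report(draft)
    return {"draft": draft, "review": review}

result = critique_pipeline("EU AI Act compliance")
print(result["review"])
```

Council would extend the same idea by running both drafting models in parallel and adding a third judge call to compare the outputs.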
My take: I really love how Microsoft is using both Opus and GPT together to either review each other’s work (Critique) or collaborate on how to best write a report (Council). GitHub Copilot has a similar feature where you can write “/review with opus, gpt 5.4 and sonnet 4.6” and it will run reviews in parallel, then trigger a final agent that goes through all findings and summarizes them. This is something neither OpenAI nor Anthropic can offer on their own, and it is what makes Microsoft’s offering unique. It is also where I believe Copilot will find its niche: Microsoft will soon have a version of their MAI LLM that is up to par with the rest of the SOTA models for simple tasks, and can then route more complex tasks to Opus and GPT when needed.
