Tech Insights 2025 Week 47

The past few months I have talked to several large companies with old legacy systems that need to be modernized. The question most of them have is: when can we start using AI to migrate all our old code to a modern Java or C# tech stack?

For complex systems I do not think AI coding agents are quite there yet, but we are very, very close. The way to migrate these old systems is a four-step process: (1) go from legacy code base to documentation, documenting each file and process flow, variable names and relationships; (2) manually review this documentation so it 100% represents the actual code base; (3) ask an agentic AI to migrate one isolated feature at a time, based on the generated documentation and the source files; and (4) code review, where you compare the generated code to the source code and documentation (using markers so you can view all three at the same time). The documentation step is critical: most legacy code is undocumented, and you need to make sure the AI has fully understood the meaning of the code.
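
To make step (1) concrete, here is a minimal sketch of what the documentation pass could look like. Everything in it is an illustrative assumption: the model name, the prompt, the folder layout and the COBOL file extension are placeholders, not a recommended setup.

```python
# Hypothetical sketch of step (1): generate per-file documentation for a
# legacy code base. Model name, prompt and file extension are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Document this legacy source file: its purpose, process flows, "
    "variable names and their meaning, and relationships to other files."
)

Path("docs").mkdir(exist_ok=True)
for src in Path("legacy").rglob("*.cbl"):  # e.g. COBOL sources
    code = src.read_text(errors="replace")
    resp = client.responses.create(
        model="gpt-5.1",  # assumption: any strong reasoning model would do
        input=f"{PROMPT}\n\n{code}",
    )
    (Path("docs") / f"{src.stem}.md").write_text(resp.output_text)
```

The manual review in step (2) is what makes this safe: the generated Markdown files are small enough to diff against the source, and any misunderstanding you catch here is one the migration agent in step (3) never gets to make.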

To illustrate how this process works, Anthropic just published a video called Claude Code modernizes a legacy COBOL codebase, which basically goes through this entire process. I highly recommend this video if your organization has lots of old legacy systems; it shows you the process and where we are heading with legacy code conversion. I think the models coming in the next 6-12 months will be more than capable of converting almost any legacy system to a modern tech stack, so if you have this need you might as well start planning for it now.

“The gap between how most people are using AI and what AI is presently capable of is immense”. This is from the OpenAI report “AI progress and recommendations”, which I highly recommend everyone read. And it is true: most companies are genuinely surprised when we present the possibilities of modern agentic AI at TokenTek. AI is more than chatbots, and agentic AI is capable of so much more complex tasks than your M365 Copilot assistant that the difference is almost like night and day. Think about all the non-creative tasks you spend dozens of hours doing every week in front of your computer. You could probably automate most of these tasks today already with a well-configured agentic system.

Thank you for being a Tech Insights subscriber!

Listen to Tech Insights on Spotify: Tech Insights 2025 Week 47 on Spotify

THIS WEEK’S NEWS:

  1. OpenAI Launches GPT-5.1 With Adaptive Reasoning
  2. GPT-5 Becomes First AI to Solve 9×9 Sudoku
  3. Cursor Raises $2.3 Billion at $29.3 Billion Valuation
  4. NotebookLM Adds Deep Research and Expands File Format Support
  5. Google DeepMind’s SIMA 2 AI Agent Controls 3D Games Using Gemini
  6. Project Fetch: Claude Accelerates Robotics Coding Experiments
  7. Anthropic Disrupts First Large-Scale AI-Orchestrated Cyberattack
  8. LM Arena Launches Code Arena for AI Coding Evaluation
  9. Terminal-Bench 2.0: AI Agent Benchmark With Verified Tasks
  10. Baidu Releases ERNIE-4.5-VL-28B-A3B-Thinking
  11. ElevenLabs Launches Celebrity Voice Licensing Marketplace

OpenAI Launches GPT-5.1 With Adaptive Reasoning

https://openai.com/index/gpt-5-1

The News:

  • OpenAI released GPT-5.1, a new language model designed to adjust its reasoning time dynamically according to the complexity of tasks, supporting practical uses from rapid conversation to extended problem-solving.​
  • The model introduces a “no reasoning” mode for everyday queries, delivering responses 2–3 times faster than GPT-5 and halving token usage in early testing (see the sketch after this list).
  • Prompt caching now extends up to 24 hours, improving session continuity for API users.​
  • Two new developer tools, apply_patch for code edits and a shell tool for direct command-line operations, are included.​
  • Performance benchmarks show GPT-5.1 achieving 76.3 percent accuracy on SWE-bench Verified, compared to GPT-5’s 72.8 percent, and posting notable gains in coding and terminal-based tasks.​
  • Specialized variants, gpt-5.1-codex and gpt-5.1-codex-mini, are optimized for long-running coding tasks.​
  • The model is available at the same price as GPT-5 to all paid API users.
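
For API users, the adaptive reasoning shows up as a per-request knob. Here is a minimal sketch, assuming GPT-5.1 follows the existing Responses API pattern; the exact spelling of the “no reasoning” setting is an assumption based on the announcement, not verified documentation.

```python
# Hedged sketch: calling GPT-5.1 with reasoning turned off for a quick,
# everyday query. The reasoning-effort value is an assumption.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5.1",
    reasoning={"effort": "none"},  # assumed spelling of the "no reasoning" mode
    input="Rewrite this changelog entry in one sentence: ...",
)
print(resp.output_text)
```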

My take: If you haven’t done it already, there are two things you should check out with GPT-5.1. The first one is the new Personalization hub: go into Settings -> Personalization and you can now choose Base style and Tone. ChatGPT had this setting previously, but changes only applied to new chats, and the model sometimes ignored tone shifts mid-conversation. The profiles have been renamed to Default, Professional, Friendly, Candid, Quirky, Efficient, Nerdy and Cynical, and they now apply instantly across all active chats. My own favorite is Efficient, which is one of the least AI-sounding profiles so far. You can now also write “Do not use em dash” in the custom instructions, and FINALLY ChatGPT will understand and accept this. 🎉 It just took them three years. I have been using GPT-5.1 for coding the past week (GPT-5.1-CODEX-HIGH), and it is also very good. Not significantly better than GPT-5, but maybe slightly better. I think everyone who held off switching from 4o to GPT-5 will be quite happy with GPT-5.1. It’s faster and more accurate, and it is the first model from OpenAI that can write text without inserting em-dashes.

GPT-5 Becomes First AI to Solve 9×9 Sudoku

https://pub.sakana.ai/sudoku-gpt5

The News:

  • GPT-5 became the first AI model to solve a complete 9×9 Sudoku variant on Sakana AI’s Sudoku-Bench, a benchmark testing creative reasoning through 100 unconventional puzzle variants.​
  • The model achieved a 33% solve rate across the benchmark when using high reasoning effort, approximately double that of the previous leader, ChatGPT o3-mini.
  • Sudoku-Bench launched in May 2025 with 100 puzzles (15 4×4, 15 6×6, and 70 9×9 grids) designed to test meta-reasoning and creative “break-in” points rather than memorization.​
  • The benchmark prohibits tool use and code execution, requiring models to rely on natural language reasoning to understand novel rule sets and maintain logical consistency across multi-step solutions.​
  • 67% of puzzles remain unsolved, with models struggling on the more complex variants that combine multiple rule sets and demand spatial reasoning humans use naturally.

My take: When Sudoku-Bench launched in May 2025, no LLM could solve a classic 9×9 puzzle, and the most advanced models at the time, including ChatGPT o3, achieved only a 5% success rate, limited to the simplest 4×4 puzzles. Now GPT-5 solves 33% of all puzzles. Benchmarks like this are great for getting a sense of how models grow in capability even on things they are not specifically trained for, just by becoming “smarter” and better at reasoning. It also shows how incredibly fast the models are evolving.

Cursor Raises $2.3 Billion at $29.3 Billion Valuation

https://cursor.com/blog/series-d

The News:

  • Cursor closed a $2.3 billion Series D funding round on November 12, 2025, at a post-money valuation of $29.3 billion.​
  • The AI coding platform has crossed $1 billion in annualized revenue and expanded to more than 300 employees since its 2023 launch.​
  • Accel and Coatue co-led the round, with participation from existing investors Thrive Capital, a16z, and DST, plus new strategic investors Google and Nvidia.​
  • The valuation more than tripled from the company’s $9.9 billion valuation achieved just five months earlier in June 2025.​
  • Enterprise revenue grew 100x year-to-date in 2025, with the platform now serving the majority of Fortune 500 companies and over 50,000 teams globally.​
  • The funding will support development of Composer, Cursor’s proprietary AI model launched in October 2025, which aims to reduce reliance on third-party models from OpenAI, Anthropic, and Google.​
  • Nvidia CEO Jensen Huang stated that 100% of Nvidia’s software engineers and chip designers use Cursor.​
  • OpenAI approached Cursor parent company Anysphere twice about a potential acquisition, once in 2024 and again in early 2025, but the talks did not progress.

My take: The interesting bit in this release is not the actual valuation, it’s that 100% of Nvidia’s software engineers and chip designers use Cursor, and that the money they make from this will go straight into developing their own proprietary coding model Composer. My prediction is simple: AI coding models will continue to improve, and in 1-2 years’ time these models will be able to do any task any programmer can do today. But they will be expensive. If you thought $200 per month was a steep price today, you will probably not like the cost of AI coding models in 2027. This is why every company offering these services needs to develop its own models: if you just resell someone else’s model, your product will always be significantly more expensive than going straight to the source. With investments now from both Google and Nvidia, the future for Cursor looks brighter than ever.

NotebookLM Adds Deep Research and Expands File Format Support

https://blog.google/technology/google-labs/notebooklm-deep-research-file-types

The News:

  • Google launched Deep Research in NotebookLM, a feature that automates web research by browsing hundreds of websites and generating source-grounded reports within minutes.​
  • Deep Research creates a research plan based on user queries, refines searches as it learns, and produces organized reports that can be added directly to notebooks along with their source materials.​
  • The tool operates in the background while users continue other work, distinguishing it from the existing Fast Research option which provides rapid source scanning without in-depth analysis.​
  • Users access Deep Research through the source panel by selecting Web as a source type, then choosing between Fast Research for quick scans or Deep Research for comprehensive briefings.​
  • NotebookLM now supports Google Sheets for querying structured data, Microsoft Word documents (.docx), images of handwritten notes, PDFs from Google Drive without re-uploading, and Google Drive files added via URL with comma-separated batch imports.

My take: If you are studying in high school or college, or have kids who are, you really should be using NotebookLM. It’s amazing for reports and self-studies. The key change in this update is that NotebookLM can combine external web research with uploaded documents in its deep research mode, something other assistants like Perplexity cannot do (you have to choose between internal documents and web search). I use NotebookLM myself for document analysis and will definitely explore the deep research functionality, which should get significantly better as well once Gemini 3.0 Pro is out in a week or so.

Google DeepMind’s SIMA 2 AI Agent Controls 3D Games Using Gemini

https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds

The News:

  • SIMA 2 is an AI agent that operates inside commercial 3D games like No Man’s Sky, Valheim, and Goat Simulator 3 by watching the screen and controlling keyboard and mouse inputs. The system uses Gemini 2.5 Flash-Lite as its reasoning engine (a conceptual sketch of this loop follows after this list).
  • The agent executes complex, sketch-based instructions, such as locating objects that match hand-drawn images and performing actions like jumping on them.​
  • SIMA 2 achieved a 65% task completion rate on complex tasks, doubling the 31% rate of SIMA 1, which launched in March 2024. Human performance on the same tasks reached 71%.​
  • Joe Marino, senior research scientist at DeepMind, stated “SIMA 2 is a step change and improvement in capabilities over SIMA 1. It’s a more general agent that can complete complex tasks in previously unseen environments”.​
  • The system demonstrates self-improvement capabilities, learning from its own experience during gameplay sessions.​
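
To make the architecture concrete, here is a purely conceptual sketch of a screen-in, keys-out agent loop. DeepMind has not published SIMA 2’s code, so every name and stub below is a hypothetical illustration of the pattern, not their implementation.

```python
# Hypothetical sketch of a SIMA-style perception-action loop: watch the
# screen, ask a vision-language model for the next input, send it, repeat.
import time
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "keys" or "done"
    keys: str = ""  # e.g. "w w space" to move forward and jump

def capture_screen() -> bytes:
    return b""  # stub: would grab the current game frame as an image

def decide(goal: str, frame: bytes, history: list[Action]) -> Action:
    return Action(kind="done")  # stub: would call a Gemini-class VLM here

def agent_loop(goal: str) -> None:
    history: list[Action] = []
    while True:
        frame = capture_screen()
        action = decide(goal, frame, history)
        if action.kind == "done":
            break
        # stub: would send action.keys as keyboard/mouse input to the game
        history.append(action)
        time.sleep(0.1)  # pace actions to the game's frame rate

agent_loop("find the object matching this sketch and jump on it")
```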

My take: SIMA 2 represents a significant performance jump from its predecessor, more than doubling the task completion rate from 31% to 65%. Humans still slightly outperformed it at 71%, but we are getting very close. The future for SIMA 2 is probably not playing games, but solving problems in the real world. If an AI is able to navigate a game world, interact with it and solve problems, it will also be able to control a drone or robot navigating the real world. This is where things are heading pretty quickly, and cheap robots controlled by powerful cloud-based AI models are probably where things will start out.

Project Fetch: Claude Accelerates Robotics Coding Experiments

https://www.anthropic.com/research/project-fetch-robot-dog

The News:

  • Anthropic conducted Project Fetch to test how its Claude AI model assists non-experts in programming robotic dogs for real-world tasks.
  • Eight researchers were assigned to two teams, one with Claude access and one without, and tasked with programming a quadruped robot to fetch beach balls across increasingly complex phases.
  • The Claude-assisted team completed assigned tasks in half the time compared to the team without Claude, and was the only group to make substantial progress toward fully autonomous ball retrieval.
  • Claude was notably effective in connecting to hardware and accessing sensor data, assisting in tasks like enabling the robot to locate and navigate towards balls.
  • The AI-assisted team wrote about nine times more code than their counterparts, highlighting a significant increase in code output and experimentation.
  • Sensor-based challenges surfaced, such as failed color detection when the ball color matched the turf, indicating perception hurdles for both human-guided and AI-guided robotics.

My take: I think my own takeaway from this is that both teams actually finished the task, but the AI-assisted team wrote nine times more source code. This basically sums up my experience with Claude Sonnet 4.5 versus GPT-5.1-CODEX: Claude really likes to write lots of code, and if it doesn’t work it usually patches it up with even more code, whereas GPT-5.1-CODEX, when prompted correctly, will tread more carefully, always look for optimization opportunities, and always strive for the DRY principle (Don’t Repeat Yourself) within reasonable limits. This is also why OpenAI Codex typically is around 10x slower than Claude Code. The biggest takeaway from this news is that if your developers are using Claude Sonnet 4.5 through Cursor or Copilot in agentic mode, keep an eye on the average commit size (a quick way to measure it is sketched below); if it starts to grow rapidly you should probably consider at least trying OpenAI Codex to avoid bloating your code base.
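
If you want to watch for it, here is a small sketch that computes the average number of changed lines per commit from git history. The 50-commit window is an arbitrary choice, and the script assumes it runs inside a git repository.

```python
# Sketch: average lines changed per commit over the last N commits, a rough
# proxy for the code-bloat pattern described above.
import subprocess

N = 50  # arbitrary window; tune to your team's commit frequency
out = subprocess.run(
    ["git", "log", f"-{N}", "--pretty=format:@@commit", "--numstat"],
    capture_output=True, text=True, check=True,
).stdout

commits = out.count("@@commit")
changed = 0
for line in out.splitlines():
    parts = line.split("\t")
    # --numstat lines look like "12\t3\tsrc/file.py" ("-" for binary files)
    if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
        changed += int(parts[0]) + int(parts[1])

print(f"average lines changed per commit: {changed / max(commits, 1):.1f}")
```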

Anthropic Disrupts First Large-Scale AI-Orchestrated Cyberattack

https://www.anthropic.com/news/disrupting-AI-espionage

The News:

  • Anthropic detected and disrupted a Chinese state-sponsored cyber espionage campaign in September 2025 that used Claude Code to autonomously execute attacks against approximately 30 organizations. The targets included tech companies, financial institutions, chemical manufacturers, and government agencies.​
  • The AI model performed 80-90% of the attack operations with minimal human oversight, conducting reconnaissance, writing exploit code, harvesting credentials, and exfiltrating data at machine speed. Claude made thousands of requests per second, an attack velocity impossible for human teams to match.​
  • Attackers jailbroke Claude by splitting malicious tasks into smaller, seemingly innocent requests and claiming to be security researchers conducting authorized defensive testing. This context-splitting technique bypassed safety guardrails without exposing the full malicious intent in any single prompt.​
  • The attack followed a six-phase lifecycle: campaign initialization with framework development, reconnaissance to identify high-value databases, vulnerability discovery with custom exploit code, credential harvesting and lateral movement, data extraction and intelligence classification, and comprehensive documentation for operator handoff.​
  • Anthropic shut down malicious accounts within 10 days of detection, contacted affected organizations and law enforcement, and released sandboxing capabilities for Claude Code that create pre-defined operational boundaries and isolated cloud environments.

My take: The attackers used Claude since it’s really good at using tools like network scanners, exploit frameworks, credential crackers and database query tools. They managed to convince it to cooperate by telling it they were security researchers conducting “authorized defensive testing” and by splitting the context in a clever way. The attackers succeeded in “a small number of cases” according to the report, and the only reason more companies were not breached was that Anthropic detected the campaign through internal monitoring systems and could shut it down. As is typical of most AI companies today, Anthropic did not want to talk about this publicly. Their internal monitoring systems flagged the activity in mid-September, and it took them 10 days to ban and shut down the accounts responsible for the attack. It was not until The Wall Street Journal reported the story on November 12 that Anthropic was forced to acknowledge the events publicly. If you still somehow think that Anthropic are the good guys with their “responsible AI” focus, this chain of events might make you think differently.

LM Arena Launches Code Arena for AI Coding Evaluation

https://news.lmarena.ai/code-arena

The News:

  • LM Arena launched Code Arena, a platform that evaluates AI coding models as interactive agents building complete applications in controlled environments.
  • Models operate autonomously using structured tool calls such as create_file, edit_file, and read_file (see the sketch after this list), executing multi-turn development cycles that mirror real engineering workflows.
  • Each evaluation records the complete build trajectory with traceable session IDs stored in Cloudflare R2, making all actions, file edits, and renders reproducible and shareable via unique links.​
  • Human evaluators compare outputs on three axes: functionality, usability, and fidelity to design requirements, with votes aggregated using statistical methods that include confidence intervals and inter-rater reliability.​
  • Code Arena replaces WebDev Arena with a rebuilt architecture launched on a fresh leaderboard, deliberately excluding legacy data to maintain methodological consistency and avoid bias from mixing different evaluation systems.​
  • The platform plans to expand support for multi-file React applications, multimodal inputs, and isolated sandboxes within the coming months.
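
To give a feel for what “structured tool calls” means here, below is an illustrative example of what a create_file call might look like. Code Arena has not published its actual schema, so the field names are assumptions.

```python
# Hypothetical shape of a structured tool call; field names are assumptions.
tool_call = {
    "name": "create_file",
    "arguments": {
        "path": "src/App.tsx",
        "content": "export default function App() { return <h1>Hello</h1>; }",
    },
}
# The harness would execute the call in a sandbox and return the result
# (success, file contents, or an error) to the model for the next turn.
```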

My take: “The Next Stage of AI Coding Evaluation Is Here”. They might as well have called it the vibe-coding arena. This new Code Arena only evaluates the functionality, usability, and design fidelity of the app produced by the LLM, and not things like DRY (Don’t Repeat Yourself), SOLID principles, or architectural patterns. And as expected, the two models from Anthropic show up at the top. For me this makes the entire Code Arena platform useless as a benchmark, but if you want to know which AI makes the best-looking vibe-coded web page, then go ahead and check out the leaderboard.

Terminal-Bench 2.0: AI Agent Benchmark With Verified Tasks

https://github.com/laude-institute/terminal-bench

The News:

  • Terminal-Bench 2.0 replaces version 1.0 as the standard benchmark for testing AI agents in terminal environments. The benchmark targets agent, model, and benchmark developers who want to measure capabilities on real-world command-line tasks.​
  • The benchmark contains 89 tasks, each subjected to over 300 hours of manual and LLM-assisted verification. Tasks include designing DNA primers for biologists, installing legacy operating systems in emulators, compiling code from source, and bypassing cybersecurity filters.​
  • Version 2.0 addresses critical issues from version 1.0, including tasks that were too easy or impossible to complete reliably. The previous version included a “hello-world” text edit that allowed weak agents to score points, and a YouTube download task that broke frequently due to anti-bot protections.​
  • The release includes Harbor, a new package that runs sandboxed agent evaluations in cloud containers with a few lines of code. Harbor supports running thousands of experiments in containerized environments, solving resource limitations when evaluating agents at scale.​
  • Reported top scores include 42.8% for OpenAI’s Codex (gpt-5-codex model) and approximately 50% for Claude Sonnet 4.5 using the Terminus 2 agent harness. Warp’s agent, using Claude Sonnet 4 as primary model and Opus 4 for planning, reached 52% on version 1.0.

My take: This is another benchmark whose actual value I am not so sure about. The first version of Terminal-Bench had tasks that were either impossible to solve or super easy, like a “hello world” text edit, and this second version has tasks like “installing legacy operating systems in emulators”. Not too surprisingly, their own terminal harness Terminus 2 scores best on this benchmark with its oddly specific tasks. The only way to really evaluate how well today’s AI agents perform on programming tasks is to have a human evaluate the actual code they write, and I have not yet seen a single benchmark that does this.

Baidu Releases ERNIE-4.5-VL-28B-A3B-Thinking

https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking

The News:

  • Baidu released ERNIE-4.5-VL-28B-A3B-Thinking, a multimodal AI model that activates only 3 billion parameters per token while using 28-30 billion total parameters through a Mixture of Experts architecture (a toy sketch of the idea follows after this list).
  • The model underwent mid-training on visual-language reasoning data and multimodal reinforcement learning using GSPO and IcePop strategies with dynamic difficulty sampling.​
  • “Thinking with Images” capability allows the model to zoom into image regions autonomously and call external tools like image search to identify objects and retrieve information.​
  • The model handles chart analysis, STEM problem-solving from images, visual grounding with bounding box detection, and video understanding with temporal awareness across a 131,072 token context window.​
  • Released under Apache License 2.0 with open weights available on Hugging Face, requiring approximately 80GB to load.​
  • Benchmark performance shows the model achieving 95.3 on DocQA, 82.5 on MathVista, and an average score of 73.1 across 15 benchmarks.
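
The “28-30 billion total, 3 billion active” line is the Mixture of Experts trick: a small router picks a few experts per token, so only a fraction of the weights run for any given token. Below is a toy sketch of that routing idea in PyTorch; the sizes are toy numbers and the code has nothing to do with ERNIE’s actual implementation.

```python
# Toy Mixture-of-Experts layer: route each token to its top-k experts so
# only a fraction of the total parameters is active per token.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = ToyMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

Real implementations batch and fuse this routing instead of looping over experts, but the effect is the same: sparse activation lets a 28B-parameter model run with the cost profile of a much smaller one.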

My take: What’s interesting about this specialized and fairly small model is that it outperforms both Gemini 2.5 Pro (average 70.3) and GPT-5 High (average 69.4) across multiple benchmarks, with particular strength in chart understanding and document analysis tasks. On VQA-v2, it scores 78.3%, compared to GPT-4’s 75.1% and Gemini’s 76.5%. The model’s inference speed is also 2-3x faster than equivalent full-parameter models due to sparse activation. If you are working with vision-language models today, definitely give this one a good try.

ElevenLabs Launches Celebrity Voice Licensing Marketplace

https://elevenlabs.io/iconic-marketplace

The News:

  • ElevenLabs has launched Iconic Voice Marketplace, a platform where companies can license AI-generated voices of celebrities and historical figures with rights holder consent. The marketplace addresses ethical concerns around unauthorized voice cloning by requiring approval from verified talent or estates.​
  • The platform launched with 28 voices including living celebrities Michael Caine and Liza Minnelli, deceased entertainment figures such as Judy Garland, John Wayne, and Mickey Rooney, and historical voices like Mark Twain, Thomas Edison, and Alan Turing recreated from archival recordings.​​
  • Companies submit licensing requests through the marketplace, connect directly with rights holders to negotiate terms, and receive approval before ElevenLabs delivers the AI-generated audio.​
  • Matthew McConaughey was revealed as an existing investor in ElevenLabs and plans to use the technology to translate his newsletter into Spanish audio using his own voice.​
  • The marketplace uses ElevenLabs’ Eleven v3 technology, which launched in June 2025 with voice synthesis, audio tags for whispering and laughing, and multi-speaker dialogue in over 70 languages.

My take: ElevenLabs already has a Voice Library marketplace where non-celebrity users can monetize their voice clones at approximately $0.03 per 1,000 characters in a self-service model without licensing contracts. This new Iconic Voice Marketplace is different since it focuses exclusively on verified celebrities and historical figures with licensing deals ranging from 1 to 11 years. If you want John Wayne to be your company voice for the next decade, this is where you sign up.