Tech Insights 2025 Week 49

The past week I have had lots of fun creating complex drawings in the style of Jan van Haasteren and of Middle-earth travel guides with Google’s latest image generation model, Nano Banana Pro. This is the first time I have really felt that an image generator not only understands what I am asking of it, but also has the skills to generate interesting and intricate details in the result. There are still issues, of course; the more complex the drawing, the greater the chance that details will be scrambled or incorrect. But it really is good enough for most use cases, and the range of things you can produce with this model is truly on another level.

The main news last week was no doubt the release of Claude Opus 4.5, the latest top-tier model from Anthropic. Looking at the benchmarks, the model is really good at agentic tasks such as running tools, producing scripts and designing user interfaces. In my own tests, however, I still find it inferior to OpenAI’s latest and greatest model GPT-5.1-CODEX-HIGH, but what Codex lacks in documentation, visual style and tool effectiveness, Opus 4.5 more than makes up for. The past week I have been running these two models in parallel, and this is the first time I have really felt “this is it” for all my development needs. My main wish for improvement would be quicker response times in Codex, but other than that I could see myself using this setup for years to come for my 100% prompted development. That is how good it is. Of course innovations and improvements won’t stop, but I finally feel we have reached the point where Codex + Claude Code together (using the top models) can handle every single task I throw at them.

Thank you for being a Tech Insights subscriber!

Listen to Tech Insights on Spotify: Tech Insights 2025 Week 49 on Spotify

THIS WEEK’S NEWS:

  1. DeepSeek Releases Open Source Math Model with Self-Verification
  2. Anthropic Releases Claude Opus 4.5 with Top Coding Performance
  3. OpenAI Discloses Security Incident at External Analytics Provider
  4. Microsoft Releases Fara-7B Local Computer Use Agent
  5. Alibaba Releases Z-Image-Turbo AI Image Model
  6. Black Forest Labs Launches FLUX.2 Multi-Reference Image Model
  7. Tencent Releases HunyuanOCR: 1B Parameter Model Tops OCR Benchmarks
  8. Suno Partners With Warner Music Group to Train AI on Licensed Catalog
  9. ChatGPT Adds Shopping Research Feature

DeepSeek Releases Open Source Math Model with Self-Verification

https://github.com/deepseek-ai/DeepSeek-Math-V2

The News:

  • DeepSeek released DeepSeekMath-V2, a 685B parameter language model that verifies the correctness of mathematical proofs step-by-step rather than checking only final answers.​
  • The model solved 5 out of 6 problems on IMO 2025 and scored 118 out of 120 on Putnam 2024, surpassing the best human score of 90 that year.​
  • The system trains a verifier that reads proofs and assigns quality scores of 0, 0.5, or 1 based on completeness and logical soundness.​
  • A proof generator then uses the verifier as a reward model, identifying and fixing issues through up to 16 refinement cycles before finalizing solutions.​
  • The approach addresses limitations in reinforcement learning methods that reward correct final answers but cannot assess reasoning quality or handle tasks like theorem proving that require rigorous derivation without numerical outputs.​
  • DeepSeek released the model weights under Apache 2.0 license on GitHub.
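The generate-verify-refine loop described above can be sketched in a few lines. This is a hedged illustration of the control flow only: `verify`, `refine`, and `prove` are hypothetical stubs I made up for this sketch, not DeepSeek's actual interfaces, and the real verifier is itself a trained model rather than a string check.

```python
# Sketch of DeepSeekMath-V2's verify-and-refine loop (hypothetical stubs,
# not the actual DeepSeek API). The verifier assigns 0, 0.5, or 1 based on
# completeness and soundness; the generator refines its proof for up to
# 16 cycles until the verifier fully accepts it.

def verify(proof: str) -> float:
    """Stub verifier: score a proof 0, 0.5, or 1 (real one is a trained model)."""
    if "QED" in proof and "gap" not in proof:
        return 1.0
    return 0.5 if "QED" in proof else 0.0

def refine(proof: str) -> str:
    """Stub generator step: patch the issues the verifier flagged."""
    return proof.replace("gap", "justified step")

def prove(initial_proof: str, max_cycles: int = 16) -> tuple[str, float]:
    proof, score = initial_proof, verify(initial_proof)
    for _ in range(max_cycles):
        if score == 1.0:  # verifier fully accepts the proof
            break
        proof = refine(proof)
        score = verify(proof)
    return proof, score

proof, score = prove("Assume n is even... gap ... QED")  # refined to score 1.0
```

The key design choice is that the verifier acts as the reward model, so the reward signal covers every step of the derivation rather than only a final numeric answer.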

My take: In 2017 a paper was published called “When Will AI Exceed Human Performance? Evidence from AI Experts”. The prediction then was that it would take 33.8 years for an AI to beat the Putnam math competition:

“Perform as well as the best human entrants in the Putnam competition—a math contest whose questions have known solutions, but which are difficult for the best young mathematicians”

Well, it didn’t take 33 years, it took eight, and now we even have an Apache-licensed open source model that can do it. Not only did DeepSeekMath-V2 perform as well as the best human entrants, it scored way above the best human and got almost every single answer right (118/120). The 2017 paper was written by AI experts, and it only shows how incredibly difficult it is to predict innovations that follow an exponential timeline, where innovations in many different areas (programming, math, robotics) intertwine and trigger even greater leaps. Amazing job by the DeepSeek team with this release, and while Google and OpenAI are still ahead, the gap is closing.


Anthropic Releases Claude Opus 4.5 with Top Coding Performance

https://www.anthropic.com/news/claude-opus-4-5

The News:

  • Anthropic released Claude Opus 4.5 on November 24, targeting software developers, autonomous agents, and computer use applications.​
  • The model scored 80.9% on SWE-bench Verified, outperforming Gemini 3 Pro at 76.2% and GPT-5.1 at 76.3%.​
  • On SWE-bench Multilingual, Opus 4.5 leads in 7 out of 8 programming languages and scores 89.4% on Aider Polyglot coding problems compared to Sonnet 4.5 at 78.8%.​
  • The model scored higher than any human candidate on Anthropic’s internal performance engineering exam within a two-hour time limit.​
  • A new effort parameter controls token usage. At medium effort, Opus 4.5 matches Sonnet 4.5’s performance while using 76% fewer output tokens. At high effort, it exceeds Sonnet 4.5 by 4.3 percentage points while using 48% fewer tokens.​
  • Pricing is $5 per million input tokens and $25 per million output tokens.
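At those list prices, a quick back-of-the-envelope cost estimate per request looks like this (the 40k-in/2k-out "typical agentic turn" is my own illustrative assumption, not an Anthropic figure):

```python
# Back-of-the-envelope request cost at Opus 4.5 list prices:
# $5 per million input tokens, $25 per million output tokens.

def opus_45_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list prices."""
    return input_tokens / 1_000_000 * 5 + output_tokens / 1_000_000 * 25

# An illustrative agentic turn: 40k tokens of context in, 2k tokens out.
print(f"${opus_45_cost(40_000, 2_000):.2f}")  # → $0.25
```

The effort parameter compounds with this: if medium effort really cuts output tokens by 76% at equal quality, the output side of the bill shrinks proportionally.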

My take: I have used Claude Opus 4.5 a lot the past week, and while it’s very good at tool use (subagents, terminal tools, etc.), my experience is that Opus 4.5 is still much more “sloppy” compared to something like OpenAI GPT-5.1-CODEX-HIGH. Ask Opus 4.5 to investigate an issue in a large code base, then look at how it works and how many files and lines it reads, and you quickly find that Claude Code with Opus 4.5 loads much less information into context than OpenAI Codex, which makes it much harder for Opus 4.5 to understand the actual code structure. I have sent over 10 different tasks to both Claude Code with Opus 4.5 and Codex with GPT-5.1-CODEX-HIGH, and in every single case Claude Code with Opus apologized for missing things that Codex identified.

Should you use it then? Absolutely: use it for documentation, translation, style sheets, HTML design, graphics design, vibe-coding, and similar things. It’s absolutely amazing at those tasks. But ask it to go through 20+ complex source code files to analyze how a memory cache optimization should be implemented and it will most probably give you an unoptimized solution. Claude Opus 4.5 pairs very well with GPT-5.1-CODEX-HIGH, where Codex is good at problem solving and algorithmic programming but quite bad at design and documentation. If you can afford both, you have the best of both worlds.

OpenAI Discloses Security Incident at External Analytics Provider

https://openai.com/index/mixpanel-incident

The News:

  • A breach at Mixpanel, OpenAI’s web analytics vendor, exposed limited user profile data for some API platform customers between November 9 and November 25, 2025.​
  • The compromised data included names, email addresses, coarse location data, browser information, and organization IDs linked to platform.openai.com accounts.​
  • No chat content, API requests, passwords, API keys, payment information, or government IDs were accessed.​
  • ChatGPT users and other OpenAI products remained unaffected, as the incident only impacted developers and organizations using the API platform.​
  • OpenAI terminated its contract with Mixpanel on November 26, removed the service from production, and initiated expanded security reviews across its entire vendor ecosystem.​
  • The company advises affected users to monitor for phishing attempts leveraging the exposed profile data and recommends enabling multi-factor authentication.

My take: Maybe you heard about this security incident last week and didn’t really have time to read up on it. Let me summarize: OpenAI has a partner company called Mixpanel which it uses to collect usage analytics for platform.openai.com. OpenAI sends user profile data such as name, email address and coarse location data to Mixpanel so they can track how different users use the platform tools. The breach was inside Mixpanel, no data beyond what was used for tracking was leaked, and OpenAI terminated its contract with Mixpanel as of November 26.

Microsoft Releases Fara-7B Local Computer Use Agent

https://www.microsoft.com/en-us/research/blog/fara-7b-an-efficient-agentic-model-for-computer-use

The News:

  • Microsoft released Fara-7B, a 7-billion-parameter model fine-tuned from Qwen 2.5 VL that runs locally on devices to automate web tasks through mouse and keyboard control. The model processes browser screenshots to execute actions like clicking, typing, and scrolling without sending data to the cloud.​
  • Fara-7B achieves 73.5% success rate on WebVoyager benchmark, surpassing GPT-4o’s 71.1% and UI-TARS-1.5-7B’s 66.4%. The model completes tasks in an average of 16 steps compared to 41 steps for UI-TARS-1.5-7B.​
  • The model operates using only pixel-level visual information from screenshots, without requiring accessibility trees or underlying code structures. This approach allows interaction with any website interface visible on screen.​
  • Microsoft trained Fara-7B on 145,603 verified browser trajectories spanning 1,010,797 steps across 70,117 domains. The training used supervised fine-tuning with synthetic data generated through multi-agent task proposal and verification.​
  • The model includes safety features that pause at “Critical Points” to request user approval for payments, logins, or personal information sharing. It refuses 82% of potentially harmful tasks according to Microsoft’s testing.​
  • Fara-7B is available under MIT license on Hugging Face and Microsoft Foundry, with support for Copilot+ PCs through Windows 11 AI Toolkit using NPU hardware acceleration.
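The screenshot-in, action-out loop with "Critical Point" pauses described above can be sketched as follows. All names here are hypothetical and the policy is a hard-coded stub standing in for the model; this is the control flow only, not Microsoft's actual interface:

```python
# Sketch of a Fara-7B-style computer-use loop (hypothetical interfaces):
# the model sees only a browser screenshot and emits one mouse/keyboard
# action per step, pausing at "Critical Points" (payments, logins,
# personal information) until the user explicitly approves.

CRITICAL = {"submit_payment", "enter_password", "share_personal_info"}

def propose_action(screenshot: bytes, step: int) -> dict:
    """Stub policy: the real model predicts this from pixels alone."""
    script = [
        {"type": "click", "target": "search_box"},
        {"type": "type", "text": "usb-c hub"},
        {"type": "submit_payment"},
    ]
    return script[step] if step < len(script) else {"type": "done"}

def run_agent(max_steps: int = 16, approve=lambda action: False) -> list[dict]:
    history = []
    for step in range(max_steps):
        action = propose_action(b"<screenshot>", step)
        if action["type"] == "done":
            break
        if action["type"] in CRITICAL and not approve(action):
            history.append({"type": "paused", "on": action["type"]})
            break  # hand control back to the user
        history.append(action)
    return history

trace = run_agent()  # with no approval callback, the run pauses at the payment
```

Because the default `approve` refuses everything, the trace ends in a pause at the payment step; passing an interactive approval callback lets the run continue.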

My take: According to Yash Lara, Senior PM Lead at Microsoft Research, this model is “best suited for pilots and proofs‑of‑concept rather than mission‑critical deployments”. And that goes for all computer-using AI agents today. They are slow, they make mistakes, and their main use right now seems to be as benchmarks for the progress we are making in this area, rather than actual daily tools. I have no doubt these computer-using AI models will become good enough for most tasks we do at the computer in the near future, but we are probably at least a year away from that, maybe more.


Alibaba Releases Z-Image-Turbo AI Image Model

https://comfyui-wiki.com/en/news/2025-11-27-alibaba-z-image-turbo-release

The News:

  • Alibaba Tongyi Lab released Z-Image-Turbo on November 27, 2025, a text-to-image model with 6 billion parameters that runs on consumer GPUs with 16GB VRAM.​
  • The model generates images in under 1 second on NVIDIA H800 GPUs and requires only 8 inference steps, consuming significantly fewer resources than competing models.​​
  • Z-Image-Turbo achieves sub-second inference latency and produces photorealistic images comparable to models with 20+ billion parameters.​​
  • The model ranked first among open-source models in independent human preference evaluations and scores close to commercial closed-source models like Google Imagen 4 and ByteDance Seedream 4.0.​
  • Z-Image-Turbo supports bilingual text rendering in English and Chinese with 86.71% word accuracy on the CVTD2K English benchmark.​​
  • The complete model family includes Z-Image-Turbo for speed, Z-Image-Edit for instruction-based editing tasks like “make the person smile + turn their head + replace the background”, and base variants that balance quality with performance.​
  • The model uses the Scalable Single-Stream Diffusion Transformer architecture and was trained in 314,000 H800 GPU hours at a cost under $630,000.​
  • Released under Apache 2.0 license with weights available on Hugging Face, where the model topped trending charts with 500,000 downloads on release day.

My take: This model has been all the rage over the Internet the past days, and it’s easy to understand why. Z-Image-Turbo can generate images that look very close to real photos, and it is able to do it on just a 16GB consumer GPU. So what are the limitations? First of all, ask it to render a “man” or a “woman” and you almost always end up with a twenty-something Asian man or woman. The same goes for cartoon styles, food and culture. It’s very clear that this is a Chinese model trained on Chinese data. But that said, they have more or less solved textures and lighting, making the images it generates look “real”. If you have a minute to spare, go check out their official gallery page. Then consider that you can render these images on a 16GB consumer GPU. I am very impressed with this model; I did not think a 6B parameter model could generate images with this quality.


Black Forest Labs Launches FLUX.2 Multi-Reference Image Model

https://bfl.ai/blog/flux-2

The News:

  • Black Forest Labs released FLUX.2, a 32-billion-parameter image generation and editing model that processes up to 10 reference images at once and operates at resolutions up to 4 megapixels.​
  • The model combines a Mistral-3 24-billion-parameter vision-language model with a “rectified flow transformer”.​
  • FLUX.2 produces text-to-image generation at 2 megapixels in 20 to 30 seconds and 4-megapixel editing in approximately 90 seconds on an RTX 6000 GPU.​
  • The company released four variants: FLUX.2 [pro] and [flex] as managed APIs, FLUX.2 [dev] as an open-weight checkpoint under a non-commercial license on Hugging Face, and FLUX.2 [klein] coming soon under Apache 2.0 license.​
  • FLUX.2 [flex] exposes adjustable parameters including sampling steps (from 6 to 50) and guidance scale, allowing developers to balance speed against text accuracy and detail fidelity.​
  • The FLUX.2 VAE is available under Apache 2.0 license, and API access is provided through FAL, Replicate, Runware, TogetherAI, Cloudflare, and DeepInfra.​
  • FP8 quantizations reduce VRAM requirements and increase performance by 40 percent, with optimized implementations developed in collaboration with NVIDIA and ComfyUI.
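The [flex] variant's adjustable parameters amount to a speed-versus-fidelity dial. A minimal sketch of how a client might expose that tradeoff, with entirely hypothetical names (the official API's field names and guidance range may differ; only the 6-to-50 step range is from the announcement):

```python
# Hypothetical request builder for FLUX.2 [flex] (illustrative names, not
# the official API): fewer sampling steps favor speed, more steps and
# stronger guidance favor text accuracy and detail fidelity.

def flex_request(prompt: str, steps: int = 28, guidance: float = 3.5) -> dict:
    if not 6 <= steps <= 50:
        raise ValueError("FLUX.2 [flex] accepts 6 to 50 sampling steps")
    return {"prompt": prompt, "steps": steps, "guidance": guidance}

fast = flex_request("product shot of a ceramic mug", steps=6)     # speed
sharp = flex_request("poster with legible small text", steps=50)  # fidelity
```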

My take: Black Forest Labs have stopped comparing FLUX.2 to closed-source models like Google Nano Banana Pro, and looking at the images it produces it’s easy to understand why. Google Nano Banana Pro just pushed AI image generation 2-3 years into the future, and FLUX.2 is not a part of that push. The images produced by FLUX.2 already look old and dated. The main feature of FLUX.2 is its prompt adherence, meaning you can get predictable output from it. I can definitely see the value in that, but since all the images look AI-generated I am not so sure who will use it and for what.


Tencent Releases HunyuanOCR: 1B Parameter Model Tops OCR Benchmarks

https://www.marktechpost.com/2025/11/26/tencent-hunyuan-releases-hunyuanocr-a-1b-parameter-end-to-end-ocr-expert-vlm/

The News:

  • Tencent released HunyuanOCR on November 25, 2025, a 1 billion parameter vision language model specialized for optical character recognition and document understanding.​
  • The model processes all OCR tasks, including text spotting, parsing, information extraction, visual question answering, and translation through natural language prompts in a single forward pass.​
  • HunyuanOCR scored 860 on OCRBench, achieving state-of-the-art performance among models under 3 billion parameters and outperforming DeepSeek-OCR at similar scale.​
  • On OmniDocBench, the model achieved 94.10 overall, with 94.73 on formula recognition and 91.81 on table parsing.​
  • The model was trained on over 200 million image-text pairs across 9 scenarios, including street views, documents, handwritten text, screenshots, invoices, and video frames, with coverage of 130+ languages.​
  • For information extraction tasks, HunyuanOCR reached 92.29 percent accuracy on cards, 92.53 percent on receipts, and 92.87 percent on video subtitles.​
  • The architecture combines a 0.4B native resolution Vision Transformer with a 0.5B Hunyuan language model through an MLP adapter.
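The one-model-many-tasks design means each OCR task is just a different natural-language prompt to the same model. A toy sketch of that interface, with a stub standing in for the actual VLM (the prompts and function names here are illustrative, not HunyuanOCR's real prompt set):

```python
# Sketch of HunyuanOCR's prompt-driven design (hypothetical interface):
# spotting, parsing, extraction, and translation are all just different
# prompts to one end-to-end VLM, answered in a single forward pass
# instead of a multi-stage detect/recognize/parse pipeline.

TASK_PROMPTS = {
    "spotting": "Detect and transcribe all text in the image.",
    "parsing": "Convert this document image to Markdown.",
    "extraction": "Extract the total amount from this receipt.",
    "translation": "Transcribe the text and translate it to English.",
}

def ocr(image: bytes, task: str,
        model=lambda img, prompt: f"[stub answer to: {prompt}]") -> str:
    """One forward pass of the VLM per request; `model` is a stub here."""
    return model(image, TASK_PROMPTS[task])

print(ocr(b"<receipt.png>", "extraction"))
```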

My take: Just looking at the headline here is enough to raise eyebrows. A small 1B parameter model is outperforming current state-of-the-art models for optical character recognition and document understanding! Looking into the details, however, this only holds when comparing it to other models under 3 billion parameters. On the full OCRBench leaderboard larger models still lead, with Qwen3-VL-235B scoring 920 compared to HunyuanOCR’s 860. Still, these results are very good for such a small model. The only negative is that you cannot use it if you live in the EU, with the license agreement clearly stating:

THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW.


Suno Partners With Warner Music Group to Train AI on Licensed Catalog

https://suno.com/blog/wmg-partnership

The News:

  • Suno just signed a licensing deal with Warner Music Group to develop new AI music generation models trained on WMG’s recordings. The partnership resolves a $500 million copyright lawsuit WMG filed against Suno in 2024.​
  • The deal introduces features that let users generate music using the voices, names, and likenesses of participating WMG artists including Dua Lipa, Coldplay, and Ed Sheeran. Artists who opt into the program receive compensation.​
  • Suno requires a paid subscription to download generated songs under the new terms. The company’s Studio product retains unlimited downloads.​
  • The new models will replace Suno’s current v5 system, which launched in 2024 with studio-grade audio quality and improved vocal expression compared to v4.​
  • Suno acquired Warner’s concert discovery platform Songkick as part of the partnership. Financial terms were not disclosed.

My take: Quality-wise Suno is not really up to par with Udio for music generation. But as I wrote in Tech Insights week 45 four weeks ago, Udio has now stopped all downloads from their platform until they can roll out their new platform based on their collaboration with the top three music labels, so people who use these tools for music production now only have one choice left. And while Suno has now signed an agreement with Warner, Universal Music Group and Sony continue their copyright lawsuits against Suno. I am quite sure Suno in 12 months will look quite different from how it functions today, and it will be interesting to see how their new licensed platform will work in practice.

ChatGPT Adds Shopping Research Feature

https://openai.com/index/chatgpt-shopping-research

The News:

  • OpenAI released Shopping Research last week, a feature that generates buyer’s guides through conversation.​
  • The system takes a few minutes to search the internet and compile structured guides that include product comparisons, pricing, availability, reviews, specifications, and images.​
  • Users describe what they need and answer clarifying questions about budget, purpose, and desired features, then refine results by marking items “Not interested” or “More like this”.​
  • The feature is powered by a shopping-specialized variant of GPT-5 mini post-trained on GPT-5-Thinking-mini, achieving 52% product accuracy on multi-constraint queries compared to 37% for ChatGPT Search.​
  • OpenAI reports the tool handles 50 million shopping queries daily, with best performance in electronics, beauty, home and garden, kitchen appliances, and sports categories.​
  • The system prioritizes Reddit reviews and other “trusted sites” over brand content, with results described as organic and unsponsored.​
  • OpenAI plans to integrate Instant Checkout for participating merchants, building on partnerships with Etsy, Shopify, Walmart, and Target.

My take: Reading the reviews so far makes it clear that this is a very early prototype rather than a real shopping agent. The concept is great, but right now it fails at even the most basic details, like keeping to shopping budgets or providing proper web page links. I can understand the urge for companies like OpenAI to release “tests” like this just to get early feedback, but I am not sure all ChatGPT users really understand just how half-baked the solution is in its current state. What is your opinion: should OpenAI continue pushing half-baked solutions like this, or should they polish them first and release them later?
