If you are developing software with AI, most of your prompts probably include “act like a senior developer”. The reason is that the large foundation models we use for programming, like o3 and Claude 3.7, are not specifically tuned for this task, which means you have to explicitly tell them how they should act. “Do not remove comments”, “Do not refactor unless needed”, “Do not add new frameworks if possible”, the list just goes on.
Last week both OpenAI and Windsurf (which was just acquired by OpenAI) released foundation models built specifically for programming: codex-1 and SWE-1. These models are designed for tasks like feature development, debugging, writing and executing tests, and refactoring large codebases, so you should no longer have to tell them to act like a senior developer. Greg Brockman, president and co-founder of OpenAI, said that “Software engineering is changing, and by the end of 2025, it’s going to look fundamentally different”, and I believe it.
One of my favorite examples that demonstrates the power of AI Agents is the Google Co-Scientist, which cracked a decade-long research problem in just two days. Well now we have an even more impressive example. Last week Google announced that their new agent AlphaEvolve discovered a new algorithm for multiplying 4×4 matrices using just 48 scalar multiplications, breaking a mathematical record that had stood since 1969. We are moving into areas where AI agents start to perform way above our own intelligence, and this is just the start.
YOU CAN NOW LISTEN TO THIS NEWSLETTER ON SPOTIFY!
16 minute podcast: Tech Insights 2025 Week 21 on Spotify
Thank you for being a Tech Insights subscriber!
THIS WEEK’S NEWS:
- AlphaEvolve: DeepMind’s AI Agent Designs Algorithms Beyond Human Limits
- Windsurf Launches SWE-1 AI Models for Software Engineering
- OpenAI Launches Codex, New Cloud-Based Software Engineering Agent
- LangChain Launches Open Agent Platform for No-Code AI Agent Building
- Meta FAIR Releases Groundbreaking AI Models for Molecular Science and Neuroscience
- ByteDance Releases DeerFlow: Open-Source Deep Research Assistant
- Middle East Emerges as AI Infrastructure Hub with Major OpenAI and AWS Investments
- OpenAI Launches HealthBench to Evaluate Medical AI Performance
- DeepSeek Team Reveals Hardware Optimization Strategies Behind Their 671B Parameter Model
- Research Shows LLMs Struggle with Multi-Turn Conversations
- Meta Launches AssetGen 2.0 for High-Quality 3D Asset Creation
- Sakana AI Introduces Continuous Thought Machine: A Biologically Inspired Alternative to Transformers
AlphaEvolve: DeepMind’s AI Agent Designs Algorithms Beyond Human Limits

The News:
- Google DeepMind has introduced AlphaEvolve, a Gemini-powered AI coding agent that discovers and optimizes algorithms for complex problems in mathematics and computing.
- AlphaEvolve has been operational within Google for over a year, recovering an average of 0.7% of Google’s global compute resources through improvements to their Borg cluster management system, representing millions of dollars in efficiency gains.
- The system discovered a new algorithm for multiplying 4×4 complex-valued matrices using just 48 scalar multiplications, breaking a mathematical record that had stood since Strassen’s work in 1969.
- AlphaEvolve also optimized a matrix multiplication kernel used in training Gemini models, achieving a 23% speed increase for that operation and reducing overall training time by 1%.
- The AI agent has also improved hardware design, identifying ways to eliminate unnecessary components in an arithmetic circuit used in Google’s Tensor Processing Units (TPUs) that will be incorporated into upcoming chip designs.
What you might have missed: For more details about the matrix multiplication breakthrough, check this Reddit thread that walks through it: I verified DeepMind’s latest AlphaEvolve Matrix Multiplication breakthrough (using Claude as coder), 56 years of math progress! : r/ClaudeAI.
My take: AlphaEvolve is an agentic system powered by Gemini Flash and Gemini Pro. The smaller Flash model generates code variations across a wide range of possible solutions, and the larger Pro model provides deeper analysis and more sophisticated improvements. The system is exceptionally good at reducing hallucinations by (1) focusing on code that actually runs and produces correct results, (2) using automated evaluators that verify every solution against defined metrics, and (3) creating a closed feedback loop where only verified solutions move forward. If you have heard me talk at my seminars, I often use these kinds of architectures as examples of why hallucinations aren’t a major problem when you build agentic systems. By combining neural components (the Gemini models) with symbolic components (the automated evaluators), we end up with what is called a neurosymbolic system. According to Gary Marcus (AI researcher, author, and critic), this approach is “one more sign that neurosymbolic techniques that combine neural networks with ideas from classical AI, is the way of the future”.
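If you want a feel for how such a loop hangs together, here is a minimal sketch of the generate-evaluate-select pattern. This is not DeepMind’s code: generate_variant and evaluate are hypothetical stand-ins for the Gemini models and the automated evaluators.

```python
# Minimal sketch of a generate-evaluate-select loop of the kind described above.
# generate_variant() and evaluate() are stand-ins for the LLM and the evaluators.
import random
from typing import Callable

def evolve(seed_program: str,
           generate_variant: Callable[[str], str],
           evaluate: Callable[[str], float | None],
           generations: int = 100) -> str:
    """Keep a small population of programs; only verified improvements survive."""
    # Assumes the seed program already passes evaluation.
    population = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        _, parent = random.choice(population)        # pick a parent program
        candidate = generate_variant(parent)         # neural step: the LLM proposes an edit
        score = evaluate(candidate)                  # symbolic step: run it and measure it
        if score is None:                            # code that crashes or gives wrong results
            continue                                 # is discarded and never re-enters the loop
        population.append((score, candidate))
        population = sorted(population, reverse=True)[:10]   # keep the 10 best
    return population[0][1]
```

The important part is the evaluate gate: a candidate that crashes or produces incorrect results never enters the population, which is exactly why hallucinations don’t accumulate in this kind of system.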
Windsurf Launches SWE-1 AI Models for Software Engineering
https://windsurf.com/blog/windsurf-wave-9-swe-1

The News:
- Windsurf has released SWE-1, a family of AI models specifically designed to accelerate the entire software engineering process, not just code generation, with the goal to “accelerate software development by 99%”.
- The flagship SWE-1 model performs competitively with Claude 3.5 Sonnet, GPT-4.1, and Gemini 2.5 Pro on internal programming benchmarks while being more cost-efficient to serve.
- SWE-1-lite replaces Windsurf’s previous Cascade Base model with higher quality and is available for unlimited use to all users, both free and paid.
- SWE-1-mini is an ultra-fast, lightweight model that powers the passive Windsurf Tab experience, providing code predictions for all users without usage limits.
- The models leverage Windsurf’s “flow awareness” concept, which creates a shared timeline between humans and AI, allowing for seamless collaboration where humans can correct AI mistakes and the AI can continue based on those corrections.
My take: In case you wondered why OpenAI purchased “the code editor company” Windsurf two weeks ago, now you know. In parallel with developing their IDE, Windsurf has been building their own foundation models, which promise to be much better at tasks like planning, debugging, testing, and maintenance, tasks general foundation models like GPT and Claude are not very good at. The models also approach code differently: they were trained on “incomplete code states and multiple work surfaces”, suggesting they’re designed to understand code as it evolves over time, not just as static snippets. User feedback on Reddit has been positive, and I personally think this is the right way forward. I think the main reason these models did not outperform Claude 3.7 is simply lack of compute; Windsurf has previously had nowhere near the compute power of OpenAI. Add the new AI data center planned in the UAE, in which OpenAI is deeply involved, into the mix and you can clearly see where all this is heading. I expect that in a year or so (early 2026) we will have the next generation of SWE-1 built on a much stronger compute platform, and that is when things will get really interesting.
Read more:
- Vibe-coding startup Windsurf launches in-house AI models | TechCrunch
- Windsurf Wave 9: SWE-1 in-house models : r/windsurf
OpenAI Launches Codex, New Cloud-Based Software Engineering Agent
https://openai.com/index/introducing-codex

The News:
- OpenAI has released Codex, a cloud-based software engineering agent that can perform multiple coding tasks simultaneously in isolated environments, powered by their new codex-1 model.
- Codex can write features, fix bugs, answer questions about codebases, and propose pull requests, with each task running in its own cloud sandbox environment preloaded with the user’s repository.
- The agent is built on a specialized version of OpenAI’s o3 reasoning model optimized specifically for software engineering to generate code that closely mirrors human style.
- Tasks typically take between 1 and 30 minutes to complete, with the agent capable of reading and editing files, running commands including test harnesses, linters, and type checkers.
- Currently available to ChatGPT Pro, Enterprise, and Team users, with support for Plus and Edu coming soon, though OpenAI plans to implement rate limits in the coming weeks.
“Software engineering is changing, and by the end of 2025, it’s going to look fundamentally different”, Greg Brockman, President and co-founder of OpenAI
What you might have missed: David Ondrej did a deep dive on OpenAI Codex yesterday, and it shows you just how good Codex is when working with a large production code base: OpenAI just destroyed all coding apps – Codex
My take: In just one week we got two new tools specifically designed for software development: Windsurf SWE-1 and OpenAI Codex. Codex is powered by codex-1, which is based on o3 but optimized for software engineering tasks to produce code that more closely mirrors human style. Early adopters include companies like Cisco, Temporal, Superhuman and Kodiak, and according to OpenAI, Temporal “uses Codex to accelerate feature development, debug issues, write and execute tests, and refactor large codebases”. User feedback has so far been mixed, with some users saying it’s brilliant and others unable to get even basic text parsing to work. I guess that’s why it’s released as a “research preview”.
“Codex feels like a co-worker with its own computer”, said OpenAI’s Greg Brockman in the launch demo. “You ask it to run tests or fix typos, and it just does it while you keep coding or grab lunch”. If you are using AI for software development today, you know how quirky it feels to constantly tell Claude or GPT-4o to “act like a senior developer”, “the less code the better”, and so on. With these new foundation models and tools like Codex, that will soon be a thing of the past.
LangChain Launches Open Agent Platform for No-Code AI Agent Building
https://github.com/langchain-ai/open-agent-platform

The News:
- LangChain released Open Agent Platform (OAP), an open-source, no-code platform that allows users to build, customize, and deploy AI agents through a web-based interface without requiring technical expertise.
- The platform enables connection to various tools such as MCP Tools for external integrations, LangConnect for Retrieval Augmented Generation (RAG), and other LangGraph agents through an Agent Supervisor.
- OAP includes pre-built agents: an MCP/RAG Tools Agent and a Multi-Agent Supervisor, with the MCP server powered by arcade.dev providing access to hundreds of tools.
- The platform features built-in authentication, access control, and a configurable UI that lets users define how others can configure their agents.
“Open Agent Platform provides a modern, web-based interface for creating, managing, and interacting with LangGraph agents. It’s designed with simplicity in mind, making it accessible to users without technical expertise, while still offering advanced capabilities for developers.”
What you might have missed: LangChain has a video showing how the platform works in their GitHub repository. It’s just 30 seconds long, so if you have the time it gives you a clear understanding of how OAP works.
My take: User feedback on OAP has so far been very positive, much thanks to the integration with arcade.dev! Open Agent Platform is released under the MIT License, so it’s free to use locally or in the cloud. This is basically the easiest way I can think of to get started with LangGraph agents, connect them with tools over MCP, integrate with RAG using LangConnect, and orchestrate them with an Agent Supervisor. To get started you need a LangSmith account (the free tier works), a Supabase account, an MCP server (like Arcade) and an LLM API key (from OpenAI, Anthropic, or Google).
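If you are curious what the agents OAP manages look like under the hood, here is a minimal LangGraph graph with a single LLM node. It is only a sketch, assuming the langgraph and langchain-openai packages and an OpenAI API key in the environment; OAP adds the web UI, authentication, MCP tool wiring, and the Agent Supervisor on top of graphs like this.

```python
# A minimal LangGraph agent of the kind OAP builds on; a sketch, not OAP's own code.
from typing import TypedDict
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

llm = ChatOpenAI(model="gpt-4o-mini")

def answer_node(state: State) -> dict:
    # A single LLM call; OAP would let you wire MCP tools and LangConnect RAG in here.
    reply = llm.invoke(state["question"])
    return {"answer": reply.content}

graph = StateGraph(State)
graph.add_node("answer", answer_node)
graph.set_entry_point("answer")
graph.add_edge("answer", END)
app = graph.compile()

print(app.invoke({"question": "What does the Agent Supervisor do?"})["answer"])
```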
Meta FAIR Releases Groundbreaking AI Models for Molecular Science and Neuroscience
https://ai.meta.com/blog/meta-fair-science-new-open-source-releases

The News:
- Meta’s Fundamental AI Research (FAIR) team has released new AI models, benchmarks, and datasets focused on molecular property prediction, language processing, and neuroscience, advancing their goal of achieving advanced machine intelligence (AMI).
- The centerpiece is Open Molecules 2025 (OMol25), the largest and most diverse dataset yet of high-accuracy quantum chemistry calculations for biomolecules, metal complexes, and electrolytes, requiring 6 billion core hours of compute to create.
- Meta also released the Universal Model for Atoms (UMA), a machine learning interatomic simulator trained on over 30 billion atoms that can predict molecular behavior with unprecedented accuracy.
- The releases include Adjoint Sampling, an algorithm for generating diverse molecules from large-scale energy models, along with a new benchmark to encourage more research in computational chemistry.
- Meta FAIR and the Rothschild Foundation Hospital also presented the first large-scale study mapping how language representations emerge in the brain during development, revealing parallels with large language models.
My take: There are so many use cases for these new tools! They can be used for faster drug discovery, better batteries, new materials, cleaner energy, and a better understanding of the brain. Being able to run simulations 10 times larger than what was previously possible is huge, and the new Universal Model for Atoms positions Meta as a leader in computational chemistry. The main question I had is: why is Meta, a social media company making money from ads, doing this? I see two main reasons. The first is to position Meta as a leader in foundational AI research, which helps them attract top talent, build partnerships with research institutions, and enhance their reputation in the tech industry. The second is long-term strategy. Meta believes AI will shape our entire society, so they want to be present at every possible level. Being a world leader in AI is then a good thing, especially if you have the money to invest in it.
ByteDance Releases DeerFlow: Open-Source Deep Research Assistant

The News:
- ByteDance (TikTok’s parent company) has released DeerFlow, an open-source deep research assistant that combines LLMs with specialized tools for web search, crawling, and Python code execution.
- The platform features a modular multi-agent architecture where specialized agents handle different functions like planning, knowledge retrieval, and report generation within a directed acyclic graph.
- DeerFlow supports local LLM integration through Ollama, making it accessible for users who prefer to run AI models on their own hardware rather than relying on cloud services.
- The system includes MCP server integration capabilities, allowing users to connect with private knowledge bases and enterprise systems for more comprehensive research.
- Beyond research reports, DeerFlow can transform findings into alternative formats like PowerPoint presentations or audio podcasts, enhancing content accessibility.
My take: If you have been eyeing the AI agent manus.im and wished you could build something similar on your own, well, now you can! DeerFlow is open source and released under the MIT License. You can run it locally or in the cloud, and you can use any LLM through LiteLLM. Want to build a research agent with Claude or GPT-4o? No problem. Want to run everything locally with Qwen? Connect it to Ollama and you’re set. I couldn’t find any feedback yet from users who have tried it, so I wouldn’t rush to install it just yet.
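To give you an idea of how model-agnostic that setup is, here is a rough sketch of the LiteLLM routing DeerFlow relies on, with one hosted model and one local Ollama model. The model names are examples rather than DeerFlow’s defaults, and DeerFlow’s actual configuration lives in its repository.

```python
# Sketch of LiteLLM routing (not DeerFlow's own config): the same call shape works
# for a hosted model or a local Ollama one. Model names below are examples.
from litellm import completion

messages = [{"role": "user", "content": "Summarize recent research on FP8 training."}]

# Hosted model (needs ANTHROPIC_API_KEY in the environment)
cloud = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)

# Local model served by Ollama on the default port
local = completion(model="ollama/qwen2.5", messages=messages,
                   api_base="http://localhost:11434")

print(cloud.choices[0].message.content)
print(local.choices[0].message.content)
```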
Read more:
- DeerFlow: A Modular Multi-Agent Framework Revolutionizing Deep Research Automation
- GitHub – bytedance/deer-flow
Middle East Emerges as AI Infrastructure Hub with Major OpenAI and AWS Investments
AWS: https://press.aboutamazon.com/2025/5/aws-and-humain-announce-groundbreaking-ai-zone-to-accelerate-ai-adoption-in-saudi-arabia-and-globally
OpenAI: https://www.reuters.com/world/middle-east/openai-help-uae-develop-one-worlds-biggest-data-centers-bloomberg-news-reports-2025-05-16/

The News:
- OpenAI and AWS are making significant AI infrastructure investments in the Middle East, with OpenAI partnering in the UAE and AWS establishing an AI Zone in Saudi Arabia.
- OpenAI is set to become a key anchor tenant for a massive 5-gigawatt data center complex in Abu Dhabi, potentially creating one of the world’s largest AI infrastructure facilities, spanning approximately 10 square miles (26 square kilometers).
- AWS and Saudi Arabia’s HUMAIN announced a $5 billion-plus strategic partnership to build an “AI Zone” in Saudi Arabia, featuring dedicated AWS AI infrastructure, UltraCluster networks, and advanced AI services.
- The UAE-US agreement formalizes the establishment of what will be the largest AI campus outside the United States, while the AWS-HUMAIN partnership aligns with Saudi Arabia’s Vision 2030 goals.
My take: These Middle Eastern AI infrastructure projects are on a whole different scale than we are used to. OpenAI’s planned 5-gigawatt facility in Abu Dhabi will consume energy equivalent to five nuclear reactors, far exceeding the company’s largest US facility (1.2 GW) planned under the “Stargate” initiative. The UAE project involves G42, an Abu Dhabi AI firm with historical connections to China, while Saudi Arabia’s HUMAIN is backed by the country’s Public Investment Fund. According to PwC, AI could contribute $130 billion to Saudi Arabia’s economy by 2030, representing over 40% of the estimated AI contribution to the Middle East.
OpenAI Launches HealthBench to Evaluate Medical AI Performance
https://openai.com/index/healthbench

The News:
- OpenAI released HealthBench, an open-source benchmark designed to evaluate how safely and accurately AI models handle healthcare queries in realistic scenarios.
- The benchmark includes 5,000 realistic health conversations developed with input from 262 physicians across 60 countries, covering 26 medical specialties and 49 languages.
- Each conversation includes physician-created scoring guides with 48,562 unique criteria to grade AI responses on safety, appropriateness, accuracy, and communication quality.
- The evaluation framework tests AI models across seven themes including emergency situations, handling uncertainty, and global health contexts.
- OpenAI’s latest o3 model scored highest at 60% on the benchmark, outperforming Elon Musk’s Grok (54%) and Google’s Gemini 2.5 Pro (52%).
My take: There are two main takeaways from this news. The first is that OpenAI reports that their April 2025 models have improved by 28% on HealthBench compared to their September 2024 models, which is a huge improvement in such a short time. The newer GPT-4.1 nano actually outperforms the August 2024 GPT-4o model despite being 25 times cheaper. The second takeaway is the different ways this framework can be used:
- For researchers, it provides a standardized, realistic way to measure progress in medical AI.
- For hospitals and healthcare startups, it serves as an initial screening tool to identify which AI models might be safe enough to move forward to clinical trials or implementation.
- For regulators, it offers a transparent standard with specific metrics that could help establish baselines for what “safe and effective” means in medical AI.
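To make the rubric grading described above a bit more concrete, here is a simplified sketch of how a physician-written scoring guide could be turned into a score. The real HealthBench pipeline uses a model-based grader to decide whether each criterion is met, and the criteria and numbers below are invented for illustration, not taken from the benchmark.

```python
# Simplified sketch of rubric-based grading in the style HealthBench describes.
# This illustrates the idea only; it is not OpenAI's grading code.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int        # criteria can also carry negative points (penalties)
    met: bool          # in HealthBench this verdict comes from a model-based grader

def rubric_score(criteria: list[Criterion]) -> float:
    earned = sum(c.points for c in criteria if c.met)
    possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, earned / possible) if possible else 0.0

example = [
    Criterion("Advises seeking emergency care for chest pain", 10, met=True),
    Criterion("Asks about symptom duration before suggesting causes", 5, met=False),
    Criterion("Recommends a prescription drug without caveats", -5, met=False),
]
print(f"Conversation score: {rubric_score(example):.2f}")   # 0.67
```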
DeepSeek Team Reveals Hardware Optimization Strategies Behind Their 671B Parameter Model
https://www.arxiv.org/abs/2505.09343

The News:
- DeepSeek published a technical paper detailing how they trained their massive 671 billion parameter language model (DeepSeek-V3) on just 2,048 NVIDIA H800 GPUs.
- The team implemented Multi-head Latent Attention (MLA) to compress key-value caches, reducing memory usage to just 70KB per token, significantly less than other models.
- They also employed a Mixture-of-Experts (MoE) architecture that activates only 37 billion parameters per token, lowering training costs “by an order of magnitude compared to traditional dense models”.
- DeepSeek-V3 incorporates FP8 mixed-precision training to maximize computational efficiency and a Multi-Token Prediction module that increases generation speed by 1.8x.
- The researchers identified three critical hardware bottlenecks slowing AI development: (1) limited memory capacity, (2) inefficient computation, and (3) inadequate communication bandwidth between GPUs.
My take: I really like the work by the DeepSeek team here. They identified three major problems, and then worked to solve them one by one. Memory capacity was addressed with Multi-head Latent Attention (MLA), which significantly improves memory utilization while maintaining model performance. Computation was addressed with an FP8 mixed-precision training framework, which allows twice as many calculations per unit of time compared to higher-precision formats like FP32 or BF16; the Mixture-of-Experts (MoE) architecture also helped a lot here. Finally, bandwidth was addressed with several innovative approaches: a pipeline called DualPipe for efficient parallelism, custom cross-node all-to-all communication kernels that fully utilize the InfiniBand and NVLink bandwidths, and InfiniBand GPUDirect Async (IBGDA) to reduce network communication latency. Best of all, they wrote everything down and published it as a research paper, something I would like to see more of from companies like Anthropic and OpenAI.
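To illustrate the MLA idea without going into the paper’s details, here is a toy sketch of latent KV-cache compression: cache one small latent per token and re-expand it to keys and values when attention needs them. The dimensions are made up for the example and this is not DeepSeek’s implementation.

```python
# Toy illustration of the latent KV-cache idea behind Multi-head Latent Attention.
# Dimensions are invented; this is not DeepSeek's code.
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress each token to a latent
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent to keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent to values

hidden = torch.randn(1, 16, d_model)       # (batch, seq_len, d_model)
latent_cache = down_kv(hidden)             # only this small tensor needs to be cached

k = up_k(latent_cache).view(1, 16, n_heads, d_head)
v = up_v(latent_cache).view(1, 16, n_heads, d_head)

full = 2 * 16 * n_heads * d_head           # floats cached per sequence for naive K and V
compressed = 16 * d_latent                 # floats cached with the latent
print(f"KV cache reduced to {compressed / full:.1%} of the naive size")
```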
Research Shows LLMs Struggle with Multi-Turn Conversations
https://arxiv.org/abs/2505.06120

The News:
- Microsoft and Salesforce researchers discovered that leading LLMs perform significantly worse in multi-turn conversations compared to single-turn interactions, with performance dropping from 90% to about 60% success rate.
- The study tested 15 top LLMs including Claude 3.7 Sonnet, GPT-4.1, and Gemini 2.5 Pro across six generation tasks, analyzing over 200,000 simulated conversations.
- Models tend to make premature assumptions, propose solutions before gathering complete information, and rely too heavily on their previous (often incorrect) responses.
- Researchers found that common remediation techniques like adjusting temperature settings or using reasoning frameworks failed to improve reliability in multi-turn settings.
My take: This research highlights a gap between how AI models are tested and how they’re used in the real world. Most companies report model performance based on single questions, but this study shows that things change in real conversations, where context builds up over time. Even top models lose reliability across longer back-and-forth exchanges, not because chatbots don’t work, but because today’s AI systems struggle to keep track of earlier details as conversations grow. If you’re building chatbots, be aware of these limits and design for them: repeat key information when needed, and keep conversations focused. And for everyone else: stick to one topic per chat, and use text expanders for prompts that capture the current state so you can start a fresh chat without losing context.
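Here is one way to apply that advice in code: keep an explicit list of key facts and re-send it as a pinned system message on every turn, so the model never has to dig them out of a long transcript. This is a sketch of the workaround, not something from the paper, and the model name is just an example.

```python
# Sketch: pin a running summary of key facts to every turn of a multi-turn chat.
from openai import OpenAI

client = OpenAI()
key_facts: list[str] = []          # updated by you (or by a cheap summarization call)

def ask(user_message: str, history: list[dict]) -> str:
    pinned = "Known so far:\n" + "\n".join(f"- {f}" for f in key_facts)
    messages = [{"role": "system", "content": pinned}] + history + [
        {"role": "user", "content": user_message}
    ]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = reply.choices[0].message.content
    history += [{"role": "user", "content": user_message},
                {"role": "assistant", "content": answer}]
    return answer

history: list[dict] = []
key_facts.append("Target language is Python 3.12; no external dependencies allowed.")
print(ask("Write a CLI that parses a CSV file.", history))
```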
Meta Launches AssetGen 2.0 for High-Quality 3D Asset Creation
http://developers.meta.com/horizon/blog/AssetGen2

The News:
- Meta has introduced AssetGen 2.0, a new AI foundation model that generates high-quality 3D assets from text or image prompts.
- The model uses a single-stage 3D diffusion approach that directly generates assets from text prompts, unlike AssetGen 1.0 which required first generating multiple 2D views.
- AssetGen 2.0 comprises two distinct models: one for creating detailed 3D meshes with geometric consistency and fine details, and TextureGen for producing high-quality, production-ready textures.
- The new system delivers “dramatically improved detail and fidelity” compared to the original, with enhanced texture consistency, in-painting capabilities, and higher resolution textures.
- Meta is currently using AssetGen 2.0 internally for 3D worlds creation and plans to roll it out to Horizon creators later in 2025.
My take: If you have been following my newsletters you know about Hunyuan 3D already, and their latest release Hunyuan 3D 2.5 is miles better than what Meta AssetGen 2.0 is capable of. Meta AssetGen will also be locked to Meta Horizon, so unless you spend all your time creating content for Meta Horizon you’ll be better off with Hunyuan 3D. What I’m curious about, however, is who uses Meta Horizon today. Do you use it? Do you know anyone who uses it? The reason I’m asking is that Meta seems to continue investing in VR development with these kinds of tools (“We’re currently using AssetGen 2.0 internally for 3D worlds creation”), and to me Meta Horizon has always felt like it was doomed to fail even before it was launched. I don’t think AR or VR is dead; for some companies in specific areas there is still a use case for it. But I strongly believe consumer-based AR/VR has reached a dead end. The technology never “got there” in terms of tiny smart glasses: we just couldn’t fit an SoC, battery and lenses with a good field of view into something most people would want to wear for more than 15 minutes. With that in mind I would much rather have Meta focus on industrial use cases and digital twins, but they seem stuck with Horizon.
Sakana AI Introduces Continuous Thought Machine: A Biologically Inspired Alternative to Transformers

The News:
- Sakana AI has released the Continuous Thought Machine (CTM), a neural network architecture that processes information through neuron-level timing and synchronization rather than the parallel processing used in Transformers.
- CTM gives each artificial neuron access to its own history and relies on neuron synchronization over time, creating an internal “thinking dimension” that allows the model to reason through problems step-by-step.
- The architecture demonstrates human-like problem-solving approaches, such as visibly following paths through mazes and examining different parts of images in patterns similar to human visual attention.
- On the ImageNet benchmark, CTM achieved 89.89% top-5 accuracy, which is competitive though not matching top Transformer models.
- The model shows strong capabilities in sequential reasoning tasks like maze-solving without relying on positional embeddings that are typically critical in Transformer models.
My take: Unlike Transformer models that process inputs simultaneously through fixed, parallel layers, CTM executes computations sequentially within each neuron. This allows CTM to adapt its reasoning depth to task complexity, potentially using fewer computational resources for simpler tasks. It is still a very early research architecture, and it’s unclear whether it will ever outperform current state-of-the-art Transformer models on standard benchmarks, but its biological inspiration provides advantages in specific areas like sequential reasoning and planning. I thought it was an interesting take, and it’s not every week you see a completely new neural network architecture introduced. Sakana has an interactive demo of CTM in action on their project page.
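To make the idea a bit more tangible, here is a toy sketch of neurons that compute on their own short activation histories over a number of internal ticks. It only illustrates the concept; Sakana’s actual architecture, including the synchronization readout, is described in their paper.

```python
# Toy illustration of the core CTM idea (not Sakana's architecture): each neuron
# keeps a short private history of its pre-activations and computes its output
# from that history with its own weights, so activity unfolds over internal ticks.
import torch

n_neurons, history_len, ticks = 64, 8, 20

per_neuron_w = torch.randn(n_neurons, history_len) * 0.1   # per-neuron weights over history
history = torch.zeros(n_neurons, history_len)              # rolling pre-activation history
mix = torch.randn(n_neurons, n_neurons) * 0.05             # how neurons drive each other

activation = torch.zeros(n_neurons)
for t in range(ticks):
    pre = mix @ activation + torch.randn(n_neurons) * 0.01       # input to each neuron
    history = torch.cat([history[:, 1:], pre.unsqueeze(1)], dim=1)
    activation = torch.tanh((per_neuron_w * history).sum(dim=1)) # private history -> output

# CTM reads out pairwise synchronization of this activity over time as its representation
print("final activity sample:", activation[:5])
```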