- Tech Insights 2026 Week 25 by Johan Sanneblad
Be careful what you wish for.
On June 10, Anthropic CEO Dario Amodei published the article “Policy on the AI Exponential”, arguing that frontier AI governance must move beyond just transparency toward binding, FAA-style regulation. He explicitly proposed that the government should have the power to block or deter model deployments if third-party testing revealed unacceptable risks.
Just two days later, the US government did exactly that.
The sudden removal of Anthropic’s Fable 5 and Mythos 5 models on June 12 triggered a wave of anxiety across the industry. After receiving a US export control directive restricting access for foreign nationals, Anthropic had to pull the models entirely for all users. Many companies now seem unsure where this regulatory disruption leaves them, and some are even questioning the safety of building their enterprise systems on top of foreign AI infrastructure.
Some observers have no doubt about who is to blame for this intervention. Yann LeCun gave his view on LinkedIn on June 13: “Dario Amodei’s ridiculous fear mongering about Mythos/Fable (and AI in general) finally pays off: The US government bans its use by non Americans, including by foreign employees in the US. One reaps what one sows”.
In practice, there are no European alternatives to Opus, GPT, Fable and Mythos. And at the current pace of development, the EU will only get further behind as the years go by and the best models get bigger (tens of terabytes in size). If you want to remain competitive you have to use the best reasoning engines available. If your team was relying on Fable 5, you simply switch to an unaffected model like OpenAI’s GPT-5.5 and keep moving forward.
Last week I spent most days with Fable before it was revoked. And while it is an amazing model for going from nothing to something, like all other Anthropic models it still cannot bring a high-quality code base to an exceptional level. Like Opus, it tends to over-complicate and make things messier over time. But only highly skilled developers who take the time to actually review the code base will notice this. The rest are amazed by the visually stunning 3D games they can produce.
I wouldn’t be surprised if over 50% of Mythos/Fable is just patterns trained on existing web pages and computer games (ask it to create a specific game and it knows all about the game logic). That said, UC Berkeley’s Center for Responsible, Decentralized Intelligence last week dropped a new benchmark called “Agents’ Last Exam” (ALE), testing AI across complex, multi-day professional workflows in 55 industries. OpenAI’s GPT-5.5 (running in Codex) captured the #1 spot with a 24.0% pass rate, winning over the now-restricted Claude Fable 5, which landed at 22.0%. You aren’t settling for a lesser model by choosing GPT over Fable; you’re simply defaulting to the top performer.
My best recommendation is to accept this regulatory friction as the new normal. Keep multiple frontier models available in your architecture so you can quickly switch if a system becomes restricted, but do not let temporary disruptions slow your adoption. The capabilities of these tools are advancing too fast to pause, and the biggest risk right now is letting your organization fall behind.
Thank you for being a Tech Insights subscriber!
Listen to Tech Insights on Spotify: Tech Insights 2026 Week 25 on Spotify
Notable model releases last week:
- Gemini-SQL2 by Google Research. Text-to-SQL capability powered by Gemini 3.1 Pro that translates natural-language requests into execution-ready SQL queries, with state-of-the-art results on the BIRD benchmark.
- Kimi K2.7 Code by Moonshot AI. Open-source 1T-parameter MoE coding-focused agentic model (32B active, 256K context) for long-horizon software engineering tasks like refactoring codebases, implementing multi-file features, debugging extended sessions, and running developer agents with lower thinking-token usage than K2.6.
- MiMo-V2.5-Pro-UltraSpeed by Xiaomi. 1-trillion-parameter MiMo V2.5 Pro serving mode with FP4 quantization and DFlash speculative decoding, generating over 1,000 tokens per second on a standard 8-GPU node for latency-sensitive agent loops, fraud detection, trading signals, and parallel reasoning.
- North Mini Code by Cohere. Open weights 30B total, 3B active sparse MoE coding model (Apache 2.0) for code generation, agentic software engineering, terminal tasks, and long-context tool-use workflows up to 256K context and 64K output.
- Ray3.2 by Luma. Video generation and video-to-video model for directing clips with up to 16 keyframes, modifying existing footage up to 20 seconds at 1080p, and producing campaign variations with reframe, motion transfer, facial performance, HDR generation, and EXR export.
THIS WEEK’S NEWS:
- Anthropic Disables Fable 5 and Mythos 5 After Export Control Directive
- Dario Amodei Proposes FAA-Style Frontier AI Rules
- Google Faces Preliminary Liability for False Statements from AI Overviews
- SpaceX Raises $75 Billion in Largest IPO Ever
- Apple Introduces Five Third-Generation Foundation Models
- Google Releases DiffusionGemma With Up to 4x Faster GPU Inference
- Google Launches Gemini 3.5 Live Translate Across 70+ Languages
- Agents’ Last Exam Targets 5,000 Real-World Agent Tasks
- Cognition Introduces FrontierCode Benchmark for Code Mergeability
- OpenAI Submits Confidential Draft S-1 to SEC
- OpenAI to Acquire Ona for Codex Cloud Execution
- Coinbase Launches AI Agent Trading and Payments Tool
Anthropic Disables Fable 5 and Mythos 5 After Export Control Directive
https://www.anthropic.com/news/fable-mythos-access

The News:
- On June 12, Anthropic said a US export control directive barred any foreign national, inside or outside the United States, including Anthropic’s own foreign-national employees, from accessing Fable 5 and Mythos 5, forcing the company to disable both models for all customers while leaving other Anthropic models available.
- Anthropic said it received the directive at 5:21pm ET and that the letter gave no specific details about the national security concern.
- Anthropic said the government’s concern centered on a potential narrow, non-universal way to bypass or jailbreak Fable 5. After reviewing a demo, Anthropic said the technique exposed only a small number of previously known, minor, relatively simple vulnerabilities that other publicly available models could also find without a bypass.
- Before launch, Anthropic said it spent thousands of hours red-teaming Fable’s strong safeguards with the US government, the UK AISI, private third parties, and internal teams, and that testers had not found a universal jailbreak.
- Anthropic’s June 9 launch post said its defense-in-depth plan combined safeguards, monitoring, and 30-day retention for all traffic on Fable 5, Mythos 5, and future Mythos-class models, on first- and third-party surfaces. Anthropic said it would not use that data to train new Claude models or for non-safety purposes.
- Anthropic said the government had provided only verbal evidence of a potential narrow, non-universal jailbreak, described the technique as asking the model to read a specific codebase and fix software flaws, and argued that applying that standard across the industry would halt new frontier-model deployments.
My take: We just learned that a frontier AI model can disappear completely overnight due to sudden export controls. On June 12, Anthropic disabled its newly launched Fable 5 and Mythos 5 models for all customers to comply with a US directive barring access for any foreign national, including Anthropic’s own employees. Anthropic described the government’s security concern as a “potential narrow, non-universal jailbreak”, where a user asks the model to read a codebase and fix software flaws. After reviewing a demo, Anthropic said the technique exposed only minor vulnerabilities that other public models like GPT-5.5 can already find. Despite spending thousands of red-team hours before launch to ensure strong safeguards, Anthropic warned that applying this standard across the industry would halt new frontier-model deployments.
This incident turns AI adoption into a serious procurement and continuity risk. Until now, we have mostly worried about pricing and ordinary uptime. Now we have a precedent where a model is pulled globally because of nationality-based access rules. If you build critical business processes on top of a single AI model, you are taking a huge risk. You need to design your IT architecture so you can swap to a different model provider the exact same day.
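To make the same-day swap concrete, here is a minimal Python sketch of a provider fallback chain. Everything in it is an assumption for illustration: the `Provider` wrapper, the `ProviderUnavailable` exception, and the two stand-in adapters are hypothetical names, and a real setup would wrap each vendor's actual client behind the same `complete` signature.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Raised by an adapter when its model cannot be used (hypothetical)."""


@dataclass
class Provider:
    name: str                       # hypothetical label, e.g. "primary"
    complete: Callable[[str], str]  # adapter wrapping one vendor's client


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in priority order and return (provider_name, answer)."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in adapters so the sketch runs without network access.
def withdrawn_model(prompt: str) -> str:
    raise ProviderUnavailable("model withdrawn")


def backup_model(prompt: str) -> str:
    return f"[backup] {prompt}"


PROVIDERS = [
    Provider("primary", withdrawn_model),
    Provider("backup", backup_model),
]

if __name__ == "__main__":
    print(complete_with_fallback("Summarize this contract.", PROVIDERS))
```

The design point is that application code only ever calls `complete_with_fallback`, so replacing a withdrawn model is a change to the provider list rather than to every call site. Prompts and evaluations still need to be re-tested per model, which this sketch does not cover.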
Read more:
- Claude Fable 5 and Claude Mythos 5
- US government directive to suspend access to Fable 5 and Mythos 5
Dario Amodei Proposes FAA-Style Frontier AI Rules
https://darioamodei.com/post/policy-on-the-ai-exponential

The News:
- On June 10, Dario Amodei called for AI policy to move beyond transparency to binding rules for frontier systems, with Anthropic releasing a legislative proposal for frontier model testing and a job-displacement policy framework.
- The core safety proposal requires mandatory qualified third-party testing for models above a compute threshold across four risk areas: cybersecurity, biological weapons, loss of control of AI systems, and automated R&D that could accelerate those risks.
- The proposal gives government authority to block or deter deployment when third-party assessment finds unacceptable risk, scoped to those four areas and constrained by protections against political favoritism and arbitrary decisions.
- Frontier developers would also need strong security standards that protect model weights, regular red teaming and penetration testing, cooperation with government against major threat actors, and prompt safety-incident reporting in the four critical areas.
- On labor, Amodei backs expanded government measurement of AI job displacement, wage insurance, retention tax incentives, workforce training grants, employer-employee matching infrastructure, and long-term income support if AI permanently reduces demand for labor.
- For regulated drug development, he urges FDA, EMA, and peer agencies to set standards now for AI-based PD/PK modeling, toxicology prediction, dose selection, biomarker validation, synthetic control arms, and surrogate endpoints, citing typical 7-8 year FDA or EMA approval pipelines for drug candidates.
- The civil-liberties and geopolitical package includes accountability rules and a domestic ban for fully autonomous weapons, closing the data broker loophole, access to AI at least as capable as the government’s in adverse actions, and a democratic coalition on chips, risk standards, benefit sharing, mutual defense, anti-repression safeguards, and macroeconomic support.
My take: Anthropic is making a major shift in its policy strategy. After previously backing transparency rules, Dario Amodei now states that transparency is not enough and proposes binding regulation for frontier models. The core mechanism is highly specific. Models above a compute threshold would face mandatory qualified third-party testing in four areas: cybersecurity, biological weapons, loss of control, and automated R&D. Most importantly, this includes a hard-power deployment veto where the government could block or deter a model release if an assessment finds unacceptable risk.
The proposal creates strict operational duties for advanced AI developers, including regular red teaming, prompt safety incident reporting, and strong security standards that protect model weights. Amodei deliberately pairs these restrictions with a pro-innovation push, urging agencies like the FDA and EMA to prepare AI-ready standards right now since drug candidates typically spend 7-8 years in approval pipelines. The key missing detail is the exact compute threshold, and the technical community on Hacker News is already debating the risk of regulatory capture and whether these capability metrics are meaningful. The leading frontier labs are no longer asking for voluntary guidelines. They are actively designing the regulatory framework that will dictate how and when your company gets access to the next generation of AI capabilities.
Google Faces Preliminary Liability for False Statements from AI Overviews

The News:
- Germany’s Munich Regional Court preliminarily ruled that Google is liable for false statements generated by AI Overviews and ordered the company to prevent erroneous or inaccurate claims from spreading through its search engine.
- The case involved two publishers whose AI Overview results linked them, in certain searches, to questionable business practices, scams, and subscription-related frauds even though those associations did not appear in Google’s linked sources.
- The court said Google’s AI mixed information about other companies flagged for possible illicit practices with data about the publishers, creating new associations absent from any source Google linked.
- Judges distinguished AI Overviews from traditional search results, holding that the feature produced “independent, new, and substantial statements” rather than merely displaying third-party links and statements.
- The court rejected Google’s disclaimer defense, saying warnings about possible errors and user verification do not remove liability when the challenged statements are absent from search results and Google is the only entity able to modify the generating system.
- Judges also said AI-generated results are not protected as individual free speech because they are the product of an algorithm rather than a person’s opinion.
- As a precautionary measure, the ruling ordered Google to remove a large portion of the statements deemed defamatory and pay 80% of legal costs. Google said it is reviewing the decision and that it is not final.
My take: The Munich Regional Court has drawn a hard legal line between classic search results and AI-generated answers. In a preliminary ruling the court found Google liable for false statements generated by AI Overviews. The judges treated the AI fundamentally differently from a search engine, stating that the feature produced “independent, new, and substantial statements”. In this specific case the AI mixed data about the plaintiffs with illicit practices from other companies, creating associations that did not exist in any of the linked sources. As a recent Hacker News discussion pointed out, this means Google’s answer-writing layer does not automatically inherit the legal protections of traditional search links.
Google tried to defend itself using user disclaimers, but the court rejected the argument. Warning users to verify summaries does not remove liability when the generated claims do not appear in the source material and only Google can change the underlying system. Google says it is reviewing the decision and that it is not final, but the company was still ordered to remove the statements and pay 80% of the legal costs. This is a clear governance warning for any company building AI search or chatbots. Relying on source links and simple disclaimers is no longer enough. If your AI writes the answer instead of just pointing to a source, you need stronger entity matching, defamation-risk controls, and robust takedown workflows to manage the liability.
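As a rough illustration of what an entity-matching control could look like, the Python sketch below flags risky terms that a generated summary uses alongside a named entity when no linked source mentioning that entity contains the same term. It is a crude co-occurrence check, not a legal safeguard: the function names, the risk-term list, and the example company names are all hypothetical, and a production system would need proper entity resolution and claim extraction.

```python
import re
from typing import Iterable, List


def _mentions(text: str, phrase: str) -> bool:
    """Case-insensitive whole-phrase match."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def ungrounded_associations(entity: str, risk_terms: Iterable[str],
                            summary: str, sources: Iterable[str]) -> List[str]:
    """Return risk terms the summary uses next to `entity` that no source
    mentioning `entity` also contains."""
    if not _mentions(summary, entity):
        return []
    # Only sources that actually talk about this entity count as grounding.
    entity_sources = [s for s in sources if _mentions(s, entity)]
    flagged = []
    for term in risk_terms:
        in_summary = _mentions(summary, term)
        in_sources = any(_mentions(s, term) for s in entity_sources)
        if in_summary and not in_sources:
            flagged.append(term)
    return flagged


if __name__ == "__main__":
    # Hypothetical data: the fraud warning is about a different company.
    summary = "Acme Verlag has been linked to subscription fraud."
    sources = [
        "Acme Verlag publishes regional magazines.",
        "Regulators warned about subscription fraud at Other GmbH.",
    ]
    print(ungrounded_associations("Acme Verlag", ["fraud", "scam"], summary, sources))
```

A flagged term would be a reason to block or route the summary for review before it is shown, which is where a takedown workflow would plug in.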
Read more:
- German court holds Google liable for fake AI answers – DW
- Google can be directly liable for false AI Overview claims: German court
- If I get it correctly I like the ruling. So Google has established a …
SpaceX Raises $75 Billion in Largest IPO Ever
https://techcrunch.com/2026/06/12/spacex-ipo-live-updates-on-everything-you-need-to-know/

The News:
- On June 12, SpaceX priced 555.6 million shares at $135 each, raising $75 billion in what TechCrunch described as the largest IPO in history.
- SpaceX opened at $150 on Nasdaq, 11% above the IPO price, climbed 30% in midday trading, and closed at $160.95, up 19%.
- SpaceX’s S-1 showed a $4.9 billion loss on more than $18 billion in 2025 revenue, with cumulative losses above $37 billion since inception.
- The filing gives Elon Musk about 85.1% of SpaceX’s voting power, and amended S-1 language warned investors about possible future dilution.
- TechCrunch’s pre-IPO deal coverage said Google will pay SpaceX $920 million per month for compute in a short-term arrangement responding to unexpected demand for Google’s recently launched AI products.
- The New York Times estimated that about 4,400 current and former SpaceX employees could become millionaires, while The Wall Street Journal put IPO fees at about $500 million, with Goldman Sachs and Morgan Stanley the biggest winners.
- COO Gwynne Shotwell told CNBC that a merger between SpaceX and Tesla “might make Elon’s life a little easier.”
My take: Public investors finally have a clear baseline for SpaceX after its record $75 billion IPO on June 12. First-day demand was strong with the stock closing up 19%, but the S-1 filing shows you are buying into a company with a $4.9 billion loss on more than $18 billion in revenue, alongside cumulative losses above $37 billion. You also get almost no control. The filing gives Elon Musk about 85.1% of the voting power, and amended S-1 language warns prospective investors about possible future dilution.
What is particularly interesting for the tech sector is how SpaceX is currently funding operations. According to TechCrunch’s pre-IPO deal coverage, Google will pay SpaceX $920 million per month for compute in a short-term arrangement to handle unexpected demand for Google’s new AI products. A short-term compute deal should never be treated as durable recurring revenue. Community discussion on Hacker News highlights similar valuation skepticism, especially around assumptions for orbital data centers and Starlink latency.
If you are considering adding SpaceX to your portfolio, realize you are not buying a traditional public tech company with stable governance. You are providing capital to a heavily controlled operation reliant on temporary AI compute spikes. Treat it as a high-risk venture rather than a foundational investment.
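The headline figures above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted in this section, with two caveats labeled in the code: revenue is stated as "more than $18 billion", so the loss ratio is an upper bound, and the 12-month run rate for the Google deal is a hypothetical projection, since the arrangement is described as short-term and its duration is not given.

```python
SHARES_SOLD = 555.6e6    # shares priced in the IPO
IPO_PRICE = 135.00       # USD per share
OPEN_PRICE = 150.00      # opening price on Nasdaq
CLOSE_PRICE = 160.95     # first-day close
REVENUE_2025 = 18e9      # "more than $18 billion", so treat as a floor
LOSS_2025 = 4.9e9        # 2025 loss from the S-1
GOOGLE_MONTHLY = 920e6   # monthly compute payment from Google


def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


gross_proceeds = SHARES_SOLD * IPO_PRICE
open_pop = pct_change(OPEN_PRICE, IPO_PRICE)
close_gain = pct_change(CLOSE_PRICE, IPO_PRICE)

# Upper bound, because actual revenue is above the $18B floor.
loss_share_of_revenue = LOSS_2025 / REVENUE_2025 * 100

# Hypothetical: assumes the deal runs 12 months, which the source does not say.
google_run_rate = GOOGLE_MONTHLY * 12
google_share_of_revenue = google_run_rate / REVENUE_2025 * 100

if __name__ == "__main__":
    print(f"gross proceeds:        ${gross_proceeds / 1e9:.1f}B")
    print(f"open vs IPO price:     {open_pop:.1f}%")
    print(f"close vs IPO price:    {close_gain:.1f}%")
    print(f"loss / revenue (max):  {loss_share_of_revenue:.1f}%")
    print(f"Google 12-month rate:  ${google_run_rate / 1e9:.2f}B "
          f"({google_share_of_revenue:.0f}% of 2025 revenue floor)")
```

The first three results line up with the reported $75 billion raise, the 11% opening premium, and the 19% first-day gain. The last line shows why the compute deal matters for valuation: if it did last a year, it would be worth about $11 billion, a large sum next to 2025 revenue.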
Read more:
- SpaceX IPO closes up 19% and delivers the world’s first trillionaire
- SpaceX’s IPO Has Launched, and Our Editors Have Thoughts
- We Think the SpaceX IPO Is Overvalued | Hacker News
Apple Introduces Five Third-Generation Foundation Models
https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models

The News:
- Apple introduced the third generation of Apple Foundation Models for Apple Intelligence, a five-model family custom-built with Google: two on-device models and three server models running on Private Cloud Compute.
- AFM 3 Core Advanced is a 20B-parameter natively multimodal on-device sparse model, activating 1B to 4B parameters per request and unlocked by Apple’s most capable Apple silicon systems.
- AFM 3 Core Advanced stores the full model in NAND flash instead of forcing it into DRAM, selects experts per prompt rather than per token, loads only that subset into DRAM, and periodically reselects experts during generation.
- Apple’s server lineup includes AFM 3 Cloud, described as a server-side workhorse optimized for speed, efficiency, and performance, ADM 3 Cloud (Image) for image generation and editing, and AFM 3 Cloud Pro for agentic tool use and complex reasoning. For Cloud Pro, Apple extended Private Cloud Compute to NVIDIA GPUs in Google Cloud with Google and NVIDIA.
- Apple says the models were trained on publicly available information, licensed or purchased third-party data, open-sourced data, data from dedicated studies, and synthetic data, not users’ private personal data or user interactions. Web publishers can opt out of foundation-model training.
- In Apple’s side-by-side human evaluations, AFM 3 Core was preferred on 45.6% of general text prompts versus 23.3% for the 2025 baseline, while AFM 3 Cloud was preferred on 64.7% versus 8.7% for the 2025 AFM Server model.
- For speech, AFM 3 Core Advanced scored 4.15 versus 3.87 on general voice TTS MOS and 4.24 versus 3.82 on conversational voice, and dictation evaluators preferred it on overall quality by 44.7% to 17.6% against Apple’s previous dictation system.
My take: Running a 20-billion-parameter model natively on a device usually crushes available memory. Apple solves this bottleneck with its new AFM 3 Core Advanced model by storing the full model in NAND flash. It selects experts per prompt rather than per token, loading only a small subset of 1 to 4 billion active parameters into DRAM at a time, and periodically reselecting them during generation. This is a highly practical way to unlock large multimodal capabilities locally without requiring massive amounts of unified memory.
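The difference between selecting experts per prompt and per token is easier to see in a toy model. The Python sketch below is not Apple's implementation: the class name, expert count, DRAM slot count, and router scores are all hypothetical. It only illustrates the idea described above, where a full set of experts sits in flash, a small subset is kept resident in DRAM, and the subset is refreshed occasionally during generation.

```python
from typing import List, Sequence, Set


class FlashBackedExperts:
    """Toy sparse model whose experts live in flash storage (hypothetical)."""

    def __init__(self, num_experts: int, dram_slots: int) -> None:
        self.num_experts = num_experts
        self.dram_slots = dram_slots      # how many experts fit in DRAM at once
        self.resident: Set[int] = set()   # experts currently loaded in DRAM
        self.flash_loads = 0              # number of expert loads from flash

    def select(self, router_scores: Sequence[float]) -> List[int]:
        """Keep the top-scoring experts resident and load only the missing ones."""
        ranked = sorted(range(self.num_experts),
                        key=lambda i: router_scores[i], reverse=True)
        chosen = set(ranked[: self.dram_slots])
        self.flash_loads += len(chosen - self.resident)
        self.resident = chosen
        return sorted(chosen)


def generate(model: FlashBackedExperts, prompt_scores: Sequence[float],
             later_scores: Sequence[float], num_tokens: int,
             reselect_every: int) -> int:
    """Select experts once for the prompt, then reselect every few tokens."""
    model.select(prompt_scores)
    for step in range(1, num_tokens + 1):
        # A real model would run one decoding step here using model.resident.
        if step % reselect_every == 0:
            model.select(later_scores)
    return model.flash_loads


if __name__ == "__main__":
    model = FlashBackedExperts(num_experts=8, dram_slots=2)
    prompt_scores = [0.9, 0.8, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0]
    later_scores = [0.9, 0.1, 0.8, 0.2, 0.0, 0.0, 0.0, 0.0]
    loads = generate(model, prompt_scores, later_scores,
                     num_tokens=8, reselect_every=4)
    print(loads, sorted(model.resident))
```

In this toy run, eight tokens cost three expert loads from flash: two for the prompt and one when the router's preference shifts. The trade-off being illustrated is fewer slow flash reads in exchange for experts that may be slightly stale between reselections.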
The server strategy is equally important. Apple built this five-model family with Google and positions the new AFM 3 Cloud as its “server-side workhorse, optimized for speed, efficiency, and performance”. While internal human evaluations show large leaps over the previous generation in text, speech, and dictation, these are still internal tests at the current stage of development. Apple also notes that AFM 3 Core Advanced is only unlocked by its “most capable Apple silicon systems”. My recommendation is to hold off on factoring these advanced local AI features into your enterprise IT planning until Apple confirms exactly which devices actually make that cut.
Read more:
- Apple Intelligence Foundation Language Models Tech Report 2025
- Apple’s third-generation Foundation Models explained
Google Releases DiffusionGemma With Up to 4x Faster GPU Inference

The News:
- On June 10, Google introduced DiffusionGemma, an experimental open model released under Apache 2.0 that uses text diffusion to generate whole text blocks in parallel instead of token by token.
- The 26B Mixture of Experts model activates 3.8B parameters during inference and generates 256 tokens in parallel per forward pass with bidirectional attention.
- Google reports up to 4x faster token output on dedicated GPUs, including 1,000+ tokens per second on a single NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.
- When quantized, DiffusionGemma fits within 18GB of VRAM on high-end dedicated consumer GPUs, targeting speed-critical, interactive local workflows for researchers and developers.
- Google highlights in-line editing, code infilling, amino acid sequences, mathematical graphs, and other non-linear text structures, while saying standard autoregressive Gemma 4 remains the production choice when output quality matters most.
- The throughput advantage is strongest at low-to-medium batch sizes on a single accelerator; in high-QPS cloud serving, autoregressive models can already saturate compute, making parallel decoding prone to diminishing returns and higher serving costs.
- Weights are on Hugging Face, with serving through MLX, vLLM, and Hugging Face Transformers; fine-tuning and experimentation paths through Hackable Diffusion, Unsloth, and NVIDIA NeMo; cloud access through Gemini Enterprise Agent Platform Model Garden and NVIDIA NIM; and official llama.cpp support coming soon.
My take: For years AI models have typed out answers one word at a time. With the experimental release of DiffusionGemma on June 10, Google changes this by using text diffusion to generate blocks of 256 tokens in parallel. On dedicated GPUs this parallel decoding delivers up to 4x faster token output. Since the model only activates 3.8B parameters during inference, a quantized version fits into 18GB of VRAM on high-end consumer hardware. For a developer waiting on one local accelerator to do inline code editing or rapid iteration, this raw speed is highly attractive.
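A back-of-the-envelope way to think about token-by-token versus block-parallel generation is to count forward passes. The Python sketch below does only that, under a loudly stated assumption: the number of refinement passes per block (`denoise_steps`) is a hypothetical value chosen for illustration, since the announcement gives the 256-token block size but not a step count. Pass counts are also not wall-clock time, because a pass over a whole block costs more than a pass for one token.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int = 256,
                           denoise_steps: int = 64) -> int:
    """Block-parallel decoding: each block is refined over several passes.

    denoise_steps is a hypothetical value, not a published figure.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * denoise_steps


if __name__ == "__main__":
    for n in (20, 256, 1024):
        print(n, autoregressive_passes(n), block_diffusion_passes(n))
```

Under these assumed numbers, a 1,024-token output needs 256 passes instead of 1,024, while a 20-token answer needs 64 passes instead of 20. So, in this toy model, long outputs benefit and very short ones can end up slower.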
But throughput does not automatically translate to a better product. Feedback on Hacker News raised practical concerns about short-answer latency and whether lower quality erases the speed gain. Google draws a clear boundary here, stating that standard autoregressive Gemma 4 remains the recommended production choice when quality matters most. The economic advantage also fades in the cloud, where high-volume serving already saturates compute. Treat this model exactly as Google positions it. If your team builds local, speed-critical workflows, the Apache 2.0 weights are ready to explore, with fine-tuning or experimentation paths through Hackable Diffusion, Unsloth, and NVIDIA NeMo, and cloud access through the Gemini Enterprise Agent Platform Model Garden. For standard enterprise applications, stick with Gemma 4.
Read more:
- DiffusionGemma: The Developer Guide
- Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation
- DiffusionGemma: 4x Faster Text Generation
Google Launches Gemini 3.5 Live Translate Across 70+ Languages
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-live-3-5-translate

The News:
- Google released Gemini 3.5 Live Translate on June 9, an audio model for near real-time speech-to-speech translation in over 70 languages, with public preview in the Gemini Live API and Google AI Studio, private preview in Google Meet for select business Google Workspace customers starting this month, and rollout to Google Translate on Android and iOS.
- Gemini 3.5 Live Translate automatically detects 70+ languages and generates translated speech continuously instead of waiting for a speaker to finish, staying a few seconds behind while preserving intonation, pacing, and pitch.
- In Google Meet, speech translation expands from 5 languages to 70+ languages and from English-only translation to over 2,000 language combinations in one meeting, with broader rollout planned later this year.
- Developer platforms including Agora, Fishjam, LiveKit, Pipecat, and Vision Agents can use the Gemini Live API to handle the real-time media streaming infrastructure behind voice translation apps.
- Grab is testing Gemini 3.5 Live Translate for near real-time multilingual communication between drivers and travelers at pickups, a use case tied to more than 10 million voice calls per month through Grab.
- All audio generated by Google’s models is watermarked with SynthID, and the DeepMind model card lists translation quality, latency, and speech naturalness as evaluation dimensions for Gemini 3.5 Live Translate.
My take: Real-time translation usually means awkward pauses while you wait for a system to process each sentence. With Gemini 3.5 Live Translate, Google is shifting to continuous speech-to-speech output that stays just a few seconds behind the speaker. This is more than just a standard model release since Google is putting the exact same capability into developer infrastructure, consumer apps, and enterprise meetings at the same time. The business value is obvious when you look at Grab testing the model for multilingual pickup calls between drivers and travelers. That is a workflow tied to more than 10 million voice calls per month.
For enterprise users, the Google Meet integration expands translation to over 2,000 language combinations in one meeting, and goes into private preview for select business Google Workspace customers this month. Google notes that all generated audio is watermarked with SynthID. Just be aware that a watermark does not settle the difficult consent, audit, and translation liability questions for regulated customer workflows. If your company handles cross-border operations or global support, I recommend you start exploring these live translation APIs right away. The operational friction of language barriers is disappearing fast, and it will fundamentally change how global teams communicate.
Read more:
- Gemini 3.5 Audio (Live Translate) – Model Card
- Live translation with Gemini Live API
- google-gemini/gemini-live-api-examples
Agents’ Last Exam Targets 5,000 Real-World Agent Tasks

The News:
- On June 3, the Agents’ Last Exam paper was submitted to arXiv, introducing a UC Berkeley RDI AI agent benchmark for long-horizon, economically valuable real-world computer tasks with verifiable outcomes.
- Agents’ Last Exam spans all 55 targeted sub-industries that cover most professional work performed on a computer, with 1,500+ tasks collected toward a 5,000-task goal and 300+ industry experts involved.
- Example task environments include Adobe After Effects, Siemens NX, Unreal Engine, Moldex3D, Rhino 3D, and FSLeyes, spanning motion and VFX, engineering, game development, manufacturing, architecture, and neuroscience.
- The public project page links to GitHub, demo tasks, a leaderboard, documentation, and the arXiv paper, with the dataset listed under CC BY 4.0 and the code under Apache-2.0.
- On the launch leaderboard, OpenAI’s GPT-5.5 (released in April and run through the Codex harness) took the top spot at a 24.0% pass rate, narrowly ahead of Anthropic’s Claude Fable 5 at 22.0%.
- Contributors can submit workflows or join research to help turn real workflows into reproducible benchmark tasks. Domain experts need no coding, qualifying contributors can receive co-authorship, and high-impact contributions are eligible for awards from a $100K+ pool.
My take: Academic question answering does not help you when you need to know if an AI agent can actually perform economically valuable professional work. UC Berkeley RDI submitted the first arXiv version of the Agents’ Last Exam paper on June 3, and it moves agent evaluation toward long-horizon professional computer workflows with verifiable outcomes. Compared to adjacent workplace-agent benchmarks like TheAgentCompany, the interesting angle here is the emphasis on workflows inside real practitioner tools such as Siemens NX, Adobe After Effects, and Unreal Engine.
The project has involved over 300 industry experts and spans 55 targeted sub-industries. The caveat is that 1,500 collected tasks is still short of the 5,000-task goal, and broad coverage does not by itself prove that the task mix matches real enterprise value. Still, this is exactly where AI evaluation needs to go. Domain experts can currently submit workflows to the project without coding, while researchers and engineers can join research and help turn real workflows into reproducible benchmark tasks. If you want to track when agents are actually ready for complex engineering or game development workflows, this is the benchmark to watch.
Read more:
- Agents’ Last Exam
- rdi-berkeley/agents-last-exam
- Agents’ Last Exam: can AI agents actually do real jobs?
Cognition Introduces FrontierCode Benchmark for Code Mergeability
https://cognition.ai/blog/frontier-code

The News:
- On June 8, Cognition introduced FrontierCode, a coding benchmark built to judge whether a model-generated pull request would be mergeable in a production codebase, not just whether it passes correctness checks.
- FrontierCode has three nested sets: Diamond with the hardest 50 tasks, Main with the hardest 100 including Diamond, and Extended with all 150 tasks.
- Cognition says it worked with maintainers from 36 flagship open-source repositories, with 20+ open-source developers spending more than 40 hours per task building tasks from repos they maintain and defining what “mergeable” means there.
- Each solution is graded across behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. Any solution that fails a blocker gets a score of 0, and Cognition says FrontierCode has an 81% lower false-positive rate than SWE-Bench Pro.
- Cognition ran each model 5 times at every available reasoning effort and reported the best-performing effort. On Diamond, Claude Opus 4.8 led at 13.4%, ahead of GPT-5.5 at 6.3% and Gemini 3.1 Pro at 4.7%.
- GPT-5.5 used up to 4x fewer tokens than Claude Opus 4.8 in Cognition’s runs.
- On Main and Extended, Claude Opus 4.8 scored 34.3% and 51.8%. Kimi K2.6 was Cognition’s top open-source model at 3.8% on Diamond, 16% on Main, and 37% on Extended.
- Cognition says it does not plan to release the tasks publicly, to avoid contamination, but is opening the evaluation to all model creators.
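The grading rules in the list above can be sketched in a few lines of Python. Two parts come from Cognition's description: a failed blocker forces a score of 0, and the reported number is the best-performing reasoning effort. Two parts are assumptions made for illustration: combining the six criteria with an unweighted mean, and averaging the 5 runs per effort before picking the best. The criterion keys and effort labels are hypothetical names.

```python
from statistics import mean
from typing import Dict, List, Tuple

CRITERIA = [
    "behavioral_correctness", "regression_safety", "mechanical_cleanliness",
    "test_correctness", "scope", "code_quality",
]


def score_solution(criteria_scores: Dict[str, float], blocker_failed: bool) -> float:
    """Apply the blocker gate first, then an unweighted mean (assumed aggregation)."""
    if blocker_failed:
        return 0.0
    return mean(criteria_scores[name] for name in CRITERIA)


def best_effort(runs_by_effort: Dict[str, List[float]]) -> Tuple[str, float]:
    """Average the runs for each reasoning effort and report the best one."""
    averages = {effort: mean(scores) for effort, scores in runs_by_effort.items()}
    winner = max(averages, key=averages.get)
    return winner, averages[winner]


if __name__ == "__main__":
    clean = {name: 1.0 for name in CRITERIA}
    print(score_solution(clean, blocker_failed=True))
    print(score_solution(clean, blocker_failed=False))
    print(best_effort({
        "low": [0.0, 0.0, 0.0, 0.0, 1.0],
        "high": [1.0, 0.0, 1.0, 0.0, 1.0],
    }))
```

The blocker gate is what separates this style of grading from pass/fail test suites: a patch that scores well on every criterion still counts for nothing if it breaks something a maintainer would refuse to merge.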
My take: A high benchmark score does not mean much if your developers spend more time cleaning up AI-generated code than they would have spent writing it themselves. That is the core problem Cognition wants to address with FrontierCode. Instead of just checking if a generated patch passes functional tests, this new benchmark asks if a maintainer would actually merge the pull request into a production codebase. Solutions are graded on behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality, and any solution that fails a blocker scores 0. Cognition reports that this approach gives FrontierCode an 81% lower false-positive rate than SWE-Bench Pro, but that figure is vendor-reported and not independently audited. The initial scores show just how far we still have to go before AI agents can work autonomously. On the hardest Diamond set of 50 tasks, the top model Claude Opus 4.8 scored only 13.4%.
While the goal is right, I am cautious about how this is executed. Cognition is the company behind the software agent Devin, meaning they are now defining the standard for merge-ready AI work while keeping the benchmark hidden. As practitioners noted in a Hacker News discussion, hiding the tasks reduces contamination but makes independent auditing very hard. It also raises questions about whether 50 tasks can consistently measure subjective code quality. Still, you should pay close attention to the concrete metrics highlighted here, such as how GPT-5.5 used up to 4x fewer tokens than Claude Opus 4.8 in Cognition’s runs. Stop evaluating your AI tools based on whether they write code that works, and start measuring if they write cost-efficient code your senior developers actually want to maintain.
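If you want to apply the same idea to your own pull requests, blocker-gated grading is easy to sketch. The example below is a minimal illustration, not Cognition's implementation: the six dimension names come from their description, but the `Review` structure, the 0 to 1 scale, the unweighted mean, and the `tokens_per_merged_pr` helper are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass, field

# The six grading dimensions named in Cognition's description of FrontierCode.
DIMENSIONS = (
    "behavioral_correctness",
    "regression_safety",
    "mechanical_cleanliness",
    "test_correctness",
    "scope",
    "code_quality",
)


@dataclass
class Review:
    """Hypothetical review record for one AI-generated pull request."""

    # Assumption: one score between 0.0 and 1.0 per dimension.
    scores: dict
    # Assumption: names of blocker checks that the pull request failed.
    failed_blockers: list = field(default_factory=list)


def mergeability_score(review: Review) -> float:
    """Return 0.0 if any blocker failed, else the mean of the dimension scores.

    The blocker rule mirrors the published description. The unweighted
    mean is an assumption; the real aggregation is not public.
    """
    if review.failed_blockers:
        return 0.0
    missing = [d for d in DIMENSIONS if d not in review.scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(review.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def tokens_per_merged_pr(total_tokens: int, merged_prs: int) -> float:
    """Hypothetical cost metric: tokens spent per pull request that got merged."""
    if merged_prs == 0:
        return math.inf
    return total_tokens / merged_prs
```

The point of the gate is that a polished patch that breaks an existing test still scores zero, which is how a maintainer would treat it. Tracking tokens per merged pull request next to that score gives you one cost number and one quality number per tool.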
Read more:
- FrontierCode: An eval to measure whether you would actually …
- [AINews] FrontierCode: Benchmarking for Code Quality over Slop
OpenAI Submits Confidential Draft S-1 to SEC
https://openai.com/index/openai-submits-confidential-s-1

The News:
- On June 8, OpenAI said it had recently submitted a confidential draft S-1 to the U.S. Securities and Exchange Commission.
- OpenAI said it announced the filing because it expected the submission to leak.
- The company said it has not decided on timing and that going public “may be a while” because some steps are likely easier while it remains private.
- OpenAI said the filing gives it the option to go public sooner if that ends up being best.
- The announcement was made under Rule 135 of the Securities Act of 1933 and does not constitute an offer to sell or a solicitation of an offer to buy securities.
- OpenAI’s public post does not disclose offering size, share price, financial statements, or a roadshow date.
My take: OpenAI announced its confidential draft S-1 on June 8 simply because they expected the submission to leak. The public post is strictly a procedural announcement that lacks offering size, share price and financial statements. OpenAI notes that timing is undecided and that going public “may be a while”. The draft S-1 just gives them the option to move sooner if that becomes best.
As an enterprise buyer you cannot act on this yet. The useful information about OpenAI’s revenue concentration, compute obligations and governance details remains entirely invisible in their public post. A recent Hacker News thread highlights the real questions we have right now, such as whether compute spending creates public market pressure or if larger balance sheets will force pricing pressure on both OpenAI and Anthropic. I recommend you ignore the listing speculation for now. Wait until a public S-1 actually reveals their operational data before you adjust your own AI partnerships.
Read more:
- Confidential submission of draft S-1 to the SEC | Hacker News
- OpenAI files confidential SEC S-1 paperwork for IPO | Fortune
OpenAI to Acquire Ona for Codex Cloud Execution
https://openai.com/index/openai-to-acquire-ona

The News:
- OpenAI announced plans to acquire Ona to bring secure cloud execution and orchestration into Codex; the deal is subject to customary closing conditions and required regulatory approvals, and the companies remain separate until closing.
- Codex now has more than 5 million weekly users, up 400% from earlier this year, with OpenAI saying people use it to research, analyze, build, and automate work.
- OpenAI says Codex’s most valuable work increasingly unfolds over hours or days rather than minutes, and that Ona’s secure, persistent environments let work continue beyond a single device or active session while users check progress, provide direction, make decisions, and review results from anywhere.
- Ona has helped 2 million developers work in secure, reproducible cloud environments, has experience moving software development from local machines into the cloud, and, according to OpenAI’s announcement, supports multiple shared customers.
- Ona’s customer-controlled execution model will let agents operate inside an organization’s own cloud environment while OpenAI provides the intelligence and orchestration layer, giving organizations greater control over infrastructure, data, and security boundaries.
- For production deployments, OpenAI says organizations need control over where agents run, what they can access, how credentials are scoped, how activity is logged, and how work moves through review.
- After closing, the Ona team will join OpenAI and work with the Codex team on secure, persistent enterprise execution capabilities and scaling Codex to more enterprises worldwide. OpenAI frames the target use cases as software-lifecycle work, including tests, issue resolution, application modernization, vulnerability work, and complex workflows over time.
My take: Up until now we have mostly evaluated Codex based on work tied to a single active session. With over 5 million people now using the tool each week, OpenAI is trying to move toward secure, persistent environments where work can continue over hours or days. Ona supports multiple shared customers and has helped 2 million developers work in secure, reproducible cloud environments. Once the deal passes required regulatory approvals and closes, the Ona team will join OpenAI and work with the Codex team on secure, persistent enterprise execution capabilities. OpenAI frames the target use cases as software-lifecycle work including tests, issue resolution, application modernization, vulnerability work, and complex workflows over time.
The enterprise pitch here is customer-controlled execution. Agents will operate inside an organization’s own cloud environment while OpenAI provides the intelligence and orchestration layer. This gives organizations greater control over infrastructure, data, and security boundaries. For production deployments, OpenAI says organizations need control over where agents run, what they can access, how credentials are scoped, how activity is logged, and how work moves through review. I view this as valuable enterprise plumbing more than a near-term visible Codex upgrade. Running agents in your own cloud can reduce some data-boundary concerns, but it also puts more responsibility on your enterprise platform and security teams. If your developers will soon deploy long-running tasks that execute in your infrastructure while they sleep, you need to start preparing your internal security governance for that shift today.
Read more:
- OpenAI to acquire Ona to support its AI coding assistant, Codex
- Rakuten fixes issues twice as fast with Codex
Coinbase Launches AI Agent Trading and Payments Tool

The News:
- On June 11, Coinbase launched Coinbase for Agents, a tool that lets AI agents such as ChatGPT or Claude trade crypto through natural language instructions and transact on users’ behalf.
- At launch, Coinbase for Agents lets users prompt an agent to rebalance portfolios, identify trading opportunities, execute strategies, and manage positions over time; Coinbase said it plans to expand later to stocks and predictions.
- Coinbase for Agents uses Coinbase’s x402 machine-to-machine payments protocol, so agents can pay directly for paywalled research, data APIs, and on-demand compute without a human in the loop, then execute trades based on those insights.
- Coinbase said x402-based payments let customers bypass traditional logins or subscriptions, describing that stage as a precursor to broader agentic shopping.
- Coinbase said it monetizes the system through trading fees on agent-executed trades, fees and spreads on USDC movement, and higher transaction volume on Base, its in-house Layer 2 blockchain.
- Lincoln Murr, Coinbase’s AI product lead, said x402 was created in May 2025 and has seen more than 100 million transactions since debut; x402scan.com showed about 157,000 agents acting as buyers over the past 30 days. Those figures describe x402 protocol usage, not direct adoption of Coinbase for Agents.
My take: There is a massive leap in trust between an AI agent writing an internal document and an AI agent managing a wallet. With the launch of Coinbase for Agents, models like ChatGPT and Claude can now manage crypto trading portfolios using natural language instructions. The practical shift here is the complete execution loop. An agent can use Coinbase’s x402 protocol to pay directly for paywalled research or data APIs without a human in the loop, and then execute trades based on those insights.
While direct adoption numbers for Coinbase for Agents are not yet available, the underlying x402 payment protocol is already seeing heavy use. Coinbase AI lead Lincoln Murr says the x402 protocol has handled more than 100 million transactions since its creation in May 2025, and x402scan.com shows about 157,000 buyer agents active over the past 30 days. Coinbase views this as a precursor to broader agentic shopping, but my main takeaway is governance. If an autonomous agent buys bad data or falls victim to prompt injection, the result is no longer just a hallucinated text paragraph. It quickly becomes unauthorized spending, overtrading, and disputed transactions. If your company plans to build workflows involving autonomous AI spending, you must treat this as a separate risk category and establish robust financial guardrails before deploying anything to production.
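To make "financial guardrails" concrete, here is a minimal sketch of a pre-payment check that sits between an agent and whatever payment call it wants to make. Everything in it is hypothetical: `SpendPolicy`, `SpendGuard`, the limits, and the payee names are illustrative and are not part of Coinbase for Agents or the x402 protocol.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class SpendPolicy:
    """Hypothetical limits set by a human, never by the agent itself."""

    max_per_transaction: Decimal
    max_per_day: Decimal
    allowed_payees: frozenset


@dataclass
class SpendGuard:
    """Approves or rejects each payment request and logs every decision."""

    policy: SpendPolicy
    spent_today: Decimal = Decimal("0")
    audit_log: list = field(default_factory=list)

    def authorize(self, payee: str, amount: Decimal) -> bool:
        reason = self._check(payee, amount)
        # Log approvals and rejections alike so disputes can be traced later.
        self.audit_log.append((payee, amount, reason))
        if reason != "approved":
            return False
        self.spent_today += amount
        return True

    def _check(self, payee: str, amount: Decimal) -> str:
        if amount <= 0:
            return "rejected: non-positive amount"
        if payee not in self.policy.allowed_payees:
            return "rejected: payee not on allowlist"
        if amount > self.policy.max_per_transaction:
            return "rejected: over per-transaction limit"
        if self.spent_today + amount > self.policy.max_per_day:
            return "rejected: over daily budget"
        return "approved"


policy = SpendPolicy(
    max_per_transaction=Decimal("5.00"),
    max_per_day=Decimal("20.00"),
    allowed_payees=frozenset({"research-api.example", "market-data.example"}),
)
guard = SpendGuard(policy)
```

The important design choice is that the limits live outside the agent's prompt and context, so a prompt injection can change what the agent asks for but not what gets approved. A real deployment would also need a daily reset, persistent storage for the log, and a human approval path for rejected requests, all of which this sketch leaves out.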
Read more:
