Sources

Microsoft’s new MAI models simonwillison.net

Microsoft announced two new text LLMs this morning - MAI-Thinking-1 (reasoning, 1T parameters, 35B active, available to “select early partners”) and MAI-Code-1-Flash (137B parameters, 5B active, “purpose-built for GitHub Copilot and VS Code to deliver high performance and lower cost […] rolling out to GitHub Copilot individual users in Visual Studio Code”). I’ve not been able to try either of them just yet. It’s very interesting to see Microsoft releasing models with such low parameter counts…

Holo3.1: Fast & Local Computer Use Agents huggingface.co

datasette-agent-micropython 0.1a0 simonwillison.net

Release: datasette-agent-micropython 0.1a0. I want Datasette Agent to be able to generate and execute Python code safely. This alpha is looking promising so far. GPT-5.5 has so far failed to break out of the sandbox! Tags: python, sandboxing, datasette, webassembly, datasette-agent

micropython-wasm 0.1a1 simonwillison.net

Release: micropython-wasm 0.1a1. Fixes for some limitations that emerged while I was trying to use this to build datasette-agent-micropython. Tags: python, sandboxing, webassembly

micropython-wasm 0.1a0 simonwillison.net

Release: micropython-wasm 0.1a0. My latest sandboxing experiment: This alpha package bundles a lightly customized WASM build of MicroPython with a wrapper to execute code in it via wasmtime. Tags: python, sandboxing, webassembly

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains huggingface.co

JetBrains has launched Mellum2, a 12-billion-parameter mixture-of-experts model aimed at code completion inside its IDEs. The successor to the original Mellum leans on sparse expert routing to lift quality without ballooning inference cost, positioning JetBrains against Copilot’s underlying models.

Why Video Agent models are next — Ethan He, xAI Grok Imagine latent.space

Grok Imagine’s tech lead Ethan He breaks down the 3-month sprint behind xAI’s video generator on Latent Space, contrasting videogen pipelines with world models and arguing video agents are the next frontier beyond static image and text generation.

How small businesses can leverage AI technologyreview.com

Small businesses lack the staff depth of large firms across accounting, design, market research and product work — gaps that LLMs increasingly fill. MIT Technology Review’s Making AI Work newsletter walks owners through where models actually pay off versus where they stall.

California Brown Pelican simonwillison.net

Simon Willison shares a photo of a California Brown Pelican diving behind the Microsoft Build venue at Fort Mason, San Francisco. A brief aside tagged with llm-release hints he’s on the ground for the conference’s AI announcements.

References

Latent Space (AI News digest) latent.space

MAI-Thinking-1 won 49% of 1,276 tasks compared to Sonnet 4.6’s 45%… a narrow margin that experts argue could be a byproduct of verbosity bias

failingfast.io AI coding benchmark tracker failingfast.io

MAI-Code-1-Flash achieved a 51% success rate on SWE-Bench Pro—a significant lead over Claude Haiku 4.5 at ~35%—and reportedly consumes up to 60% fewer tokens on complex tasks

Hacker News discussion (item 48374362) news.ycombinator.com

‘appropriately licensed’ is a ‘hand-wavy’ term when applied to massive code repositories like GitHub, noting that even permissive licenses (e.g., Apache 2.0) typically require attribution that is difficult to maintain through the tokenization process

Mashable on Common Crawl/News Media Alliance mashable.com

The News/Media Alliance submitted a formal demand for Common Crawl to cease unauthorized scraping, arguing that the foundation’s open-access mission has been co-opted to ‘launder’ copyrighted news content for tech giants

WindowsForum analysis of Build 2026 Foundry strategy windowsforum.com

While MAI-Thinking-1 matches early-2026 benchmarks (53% on SWE-bench Pro), it trails current leaders like GPT-5.4 and Opus 4.8 by 15–20 points… Microsoft’s ‘harness-first’ strategy prioritizes ‘frontier-adjacent’ utility over academic leadership

businessengineer.ai technical breakdown businessengineer.ai

trained ‘from the ground up’ on 30 trillion tokens without using distillation from third-party models like GPT-4, a move Microsoft calls ‘learned, not inherited’ intelligence… optimized for Microsoft’s Maia 200 silicon, which reportedly yields a 1.4x performance-per-watt advantage over the GB200 baseline

awesomeagents.ai (Computer Use Leaderboard) awesomeagents.ai

Claude Opus 4.8 currently leads the leaderboard with a verified success rate of 83.4%, surpassing the human baseline of roughly 72%… Qwen3 VL 235B represents the open-source frontier with a score of 66.7%.

note.com (DGX Spark NVFP4 benchmarks) note.com

At low concurrency (single-request latency), NVFP4 can actually be 20% slower than FP8 (41 t/s vs 51 t/s) due to the relative immaturity of FP4 kernels… vanilla vLLM implementations crawl at 1.1 tokens/s on the GB10 chip (SM 12.1) without custom patches.

Palo Alto Unit 42 / RedTeamCUA findings unit42.paloaltonetworks.com

The RedTeamCUA benchmark, which tests agents in hybrid Web-OS environments, found that even leading computer-use agents suffer from attack success rates of up to 66%… multi-turn prompt injection attacks achieved success rates as high as 92% on capability-focused models like Qwen.

Medium — MoE architecture guide medium.com

In MoE architectures, the entire 35B parameter set must be loaded into memory bandwidth to process the initial prompt. Consequently, a dense 9B model can be 4x to 9x faster at starting a response than the 35B-A3B.

Medium — Holo3 vs GPT-5.4 cost analysis ai-engineering-trend.medium.com

Holo3-122B-A10B model… scored 78.85% on OSWorld-Verified, surpassing GPT-5.4 (72.4%) and Claude Opus 4.6 (38%)… claimed to be roughly one-tenth the cost of competing proprietary models.

develeap.com (Holo3.1 review) develeap.com

The transition from structured JSON output to native function-calling has forced some early adopters to rework their integration pipelines… some community members on Hugging Face have raised concerns regarding ‘abliterated’ versions of the model.

Penligent.ai security writeup on CVE-2025-68668 penligent.ai

a sandbox bypass in the n8n platform’s use of Pyodide, where certain process-handling capabilities allowed full host compromise

HiddenLayer — ‘The Lethal Trifecta and How to Defend Against It’ hiddenlayer.com

LLMs cannot inherently distinguish between developer instructions and data provided by untrusted sources

Argemma blog — ‘Lethal Trifecta: No Choice’ argemma.com

the ‘Agents Rule of Two,’ which mandates that an autonomous agent should never satisfy more than two of these properties simultaneously without human supervision

Medium — ‘Python wasmtime in servers: safe sandbox for untrusted UDFs’ medium.com

By calling store.add_fuel(limit), the host specifies a maximum number of WebAssembly instructions… If the guest exceeds this limit, execution is terminated, preventing infinite loops

fast.io — code execution sandboxes for AI agents comparison fast.io

E2B uses Firecracker microVMs, providing kernel-level isolation… Cloudflare Dynamic Workers and Deno Subhosting utilize V8 Isolates… startup times as low as 5ms to 50ms

Blaxel.ai — sandbox comparison for AI agents blaxel.ai

MicroPython… cannot run many C-extension libraries common in data science (like NumPy), necessitating a ‘pure Python’ approach for all sandboxed analysis

Jack Sun, writing.