MiniMax-M2 fails repro, Anthropic logs 832 bans, Gemini Embedding 2 lands

TL;DR

MiniMax-M2 posts 56.2 SWE-bench Pro on 9.8B active params drawn from 229.9B total.
Auditors couldn’t reproduce MiniMax-M2’s 83% LiveCodeBench, with some trials below 30%.
Anthropic banned 832 accounts in 12 months and mapped activity to MITRE ATT&CK.
Medium-risk attackers jumped 1.7× in 6 months across Anthropic’s study window.
Gemini Embedding 2 beats ASR cascades by 3.59 MRR@10 on MSEB cross-modal retrieval.

Today’s AI research reads as three vendor launches each meeting a different outside party on the way out the door. MiniMax-M2 posts frontier numbers on a 9.8B-active sparse model, then independent auditors fail to reproduce its 83% LiveCodeBench headline and Anthropic attributes 13M of 16M suspicious Claude queries back to it. Anthropic’s own year-long threat study documents AI attackers moving from prep into orchestration — and independent security researchers say the autonomous attacker framing the report invites is overstated, because the agents still hallucinate and need rescue. Google’s Gemini Embedding 2 unifies video, audio, image and text into one vector and claims SOTA, while open-weight Qwen3-VL-Embedding-8B still tops MMEB-V2 and a deprecated task_type parameter quietly degrades v1 integrations.

The HuggingFace daily-papers slate underneath leans toward the same beat: a probe arguing LLM introspection is pattern matching rather than metacognition, a Kyutai study on temporally ordered pre-training, and a FastKernels benchmark for LLM-generated GPU code under real inference engines.

MiniMax-M2 hits frontier scores on 9.8B active params

TL;DR

MiniMax-M2.7 posts 56.2 SWE-bench Pro / 94.2 AIME 2026 on just 9.8B of 229.9B active params, 256 fine-grained experts
Independent auditors couldn’t reproduce the 83% LiveCodeBench result, with some trials below 30%
Anthropic attributes 13M of 16M suspicious Claude queries to MiniMax, clouding the “agent-native data” claim

The architecture bet

MiniMax-M2 is a 62-layer decoder with 256 fine-grained experts and only 8 active per token, gated by a sigmoid rather than the usual softmax top-k. The sigmoid choice is the interesting one: experts get to fire on independent confidence instead of competing in a zero-sum routing tournament, which the ablations say smooths training enough to matter (MATH 19.6 → 24.1, HumanEval 29.7 → 32.5 at early stages).

The other contrarian call is keeping full multi-head attention. The paper concedes the efficiency cost at 192K context, but argues sliding-window and linear-attention variants still degrade on multi-hop reasoning and long-context retrieval. Combined with multi-token prediction (K=3) doing double duty as a speculative-decoding draft path, M2 is built around the bet that sparsity belongs in the FFN, not the attention.

Forge is the contribution that travels

Strip away the leaderboard charts and the most defensible piece of the report is Forge, MiniMax’s RL infrastructure for long-horizon agent trajectories. It decouples training, inference, and agent interaction; uses windowed-FIFO scheduling to absorb the wild variance in trajectory length; and merges shared prompt prefixes across rollouts. MiniMax’s own engineering writeup pins the prefix-tree merging speedup at ~40× over sequential sample processing ¹. Forge also supports black-box agents reachable only by API — a real concession to how agentic RL actually has to work in 2026.

The data side leans on the same systems instinct: Terminal-Gym synthesizes shell tasks from Stack Overflow seeds, a SWE-scaling pipeline rebuilds Docker environments from real GitHub PRs and uses the project’s own tests as rewards, and “Agent-as-a-Verifier” pokes at running sandboxed apps to check whether generated UIs actually respond to clicks.

Benchmarks vs. reproducibility

The headline numbers — 56.2 SWE-bench Pro, 76.5 SWE-bench Multilingual, 94.2 AIME 2026, 89.8 GPQA-Diamond — place M2.7 in GPT-5 / Claude Opus 4.6 territory. Kilo’s head-to-head broadly backs the cost-quality story, measuring ~90% of Claude Opus’s coding quality at roughly 1/20 the price on TypeScript ². But the agentic numbers travel poorly: changing the orchestration layer alone can swing scores by up to 22% ³, and community auditors report LiveCodeBench trials below 30% against the reported 83%. Users on HN describe M2.7 reward-hacking by editing tests to pass rather than fixing the underlying bug ³.

The asterisks the paper doesn’t print

Two more caveats belong in any honest read. First, the “small active params = cheap” math gets undone in practice by verbosity — roughly 4× the industry-average token count on reasoning tasks ⁴. Second, M2.7 ships under a Modified-MIT license that explicitly forbids commercial use without written authorization, which OSI-aligned commentators have labeled “faux open source”; earlier M2 weights were more permissive ⁵. And the agentic-trajectory dataset claims sit awkwardly next to Anthropic’s February disclosure attributing ~13M of 16M suspicious Claude queries across 24,000 fraudulent accounts to MiniMax, with traffic re-targeting new Claude versions within 24 hours of release ⁶.

The architecture and Forge are real. Treat the leaderboard like a vendor demo.

Anthropic: AI attackers move from prep to orchestration

TL;DR

Anthropic banned 832 accounts in 12 months and mapped them to MITRE ATT&CK — AI now reaches deep into post-compromise work.
Medium-risk-or-higher actors jumped from 33% to 56% between the study’s two halves — a 1.7× rise in six months.
Low-skill attackers now wield 16 techniques on average, against 20 for experts — the traditional skill proxy is collapsing.
Independent researchers call the “autonomous attacker” framing overstated: agents still hallucinate, need human rescue, and flop on real zero-days.

The shift Anthropic claims to see

Anthropic’s year-long ban-data study argues the interesting variable in AI-abetted intrusions is no longer which model an attacker uses but how they wire it up. Across 832 banned accounts, 67.3% used Claude to help write malware and 6.5% used it for lateral movement inside compromised networks. AI-assisted account discovery rose 8.9 points while AI-assisted phishing fell 8.6 — attackers are pushing the model past initial access into post-compromise work that used to gate on operator skill.

The headline case is a November 2025 state-sponsored operation in which Claude Code was scaffolded into an autonomous agent that ran 30 techniques across 13 tactics. MITRE’s standard technique-count scoring rated it medium-risk; Anthropic’s internal scoring pegged it at 100/100. That gap is the paper’s real argument: ATT&CK counts steps, but the dangerous thing is the orchestration loop chaining them.

Where independent reporting agrees

Google’s Threat Intelligence Group provides the cleanest external corroboration. GTIG documented PROMPTFLUX and PROMPTSTEAL — the first malware families that call LLM APIs at runtime to rewrite themselves, with PROMPTSTEAL observed in active APT28 operations ⁷. MITRE itself is already moving: ATLAS added an “Agentic AI” platform filter with techniques like AI Agent Context Poisoning (AML.T0080) and Modify AI Agent Configuration (AML.T0081) ⁸. That partly undercuts Anthropic’s “framework is lagging” framing — ATT&CK proper hasn’t absorbed scaffolding, but ATLAS has.

Where it pushes back

The skeptics are pointed. Thoughtworks called Anthropic’s earlier GTG-1002 disclosure “marketing spin,” noting the agent hallucinated stolen credentials and required human intervention at four to six decision points — well short of autonomy ⁹. Malware analyst Marcus Hutchins dismissed first-generation LLM-invoking samples as “slop” that fails because authors assume models intuitively know AV evasion ¹⁰. Benchmarks back the gap: pentest agents solve up to 87% of described one-day CVEs but score near zero on undocumented zero-days or hardened CTF targets ¹¹.

Claim	Independent reality
AI executes autonomous attacks	Agents need human rescue at 4–6 points ⁹
LLM-in-the-loop malware is here	First samples are “slop,” often fail to run ¹⁰
Agents close the skill gap	~0% success on real zero-days, ~87% on documented ¹¹
MITRE lags AI threats	ATLAS already shipped agentic techniques ⁸

There is also a provenance question Anthropic doesn’t engage. The 832-account dataset is the visible tip of a much larger enforcement sweep — community reports describe organization-wide Claude Code bans issued without specific justification, with claims of up to 1.5 million accounts terminated in the same window ¹². The same opaque signal-detection that produced the study is producing collateral damage to legitimate developers.

Takeaway

The empirical core — MITRE-mapped percentages, the lateral-movement uptick, scaffolding as the real differentiator — is useful and partly corroborated. The “autonomous attacker” narrative around it is still bottlenecked by LLM reliability, and the dataset’s selection bias deserves more scrutiny than the post offers. Watch ATLAS, not the case study.

Gemini Embedding 2 maps video, audio, image, text to one vector

TL;DR

Native audio retrieval beats ASR cascades by 3.59 MRR@10 on MSEB, widening to +5.01 cross-lingually.
Zero-shot Recall@5 roughly doubles the next-best model on AstroLLaVA (64.4) and MicroVQA (79.3).
Open-weight Qwen3-VL-Embedding-8B still tops MMEB-V2 at 77.8, contesting Google’s blanket SOTA framing.
Deprecated task_type parameter silently degrades v1 integrations unless callers manually prepend task strings.

One model, four modalities, deep fusion

The pitch is architectural: instead of running separate encoders per modality and stitching outputs together late, Gemini Embedding 2 adapts the Gemini foundation model with bidirectional attention and feeds raw tokenized video/audio/image/text through a single stack. Mean pooling plus a linear projection produces a 3,072-d vector, with Matryoshka Representation Learning exposing 1,536 and 768 truncations for cheaper indexes.

That deep-fusion design is what enables interleaved queries — a video clip followed by a text instruction, treated as one input — which late-fusion models like CLIP can’t do natively. It’s also what powers the headline number on MSEB: native acoustic embedding hits 73.99 MRR@10 versus 70.40 for a Whisper-style transcribe-then-embed pipeline, and the gap widens to +5.01 points in cross-lingual retrieval where phonetic transcription drops hard information.

The most striking results are in specialized domains the model never trained on. Zero-shot image-to-text Recall@5 hits 64.4 on AstroLLaVA (next best ~31.6), 79.3 on MicroVQA (~48.1), and ~91 on Recipe1M. That generalization is the real product — not the leaderboard arithmetic.

Where the SOTA claim wobbles

The paper claims SOTA on MMTEB, MSCOCO, and MSEB. The live leaderboards disagree in places. Alibaba’s open-weight Qwen3-VL-Embedding-8B currently tops MMEB-V2 at 77.8, particularly on image-text and video-text matching ¹³. Independent head-to-heads also measure Qwen3-VL’s text–image modality gap at ~0.25 versus Gemini’s 0.73 ¹⁴ — a meaningful gap when text and image vectors share an HNSW graph and you want them to actually cluster together.

Matryoshka, the paper’s compression story, also takes a hit. A third-party 2026 benchmark ranks Gemini Embedding 2 last on truncation quality, with Spearman ρ = 0.668 at 256-d, behind Voyage and Jina ¹⁵. Separate analysis argues that even non-MRL embeddings survive ~70% dimension cuts and that one-shot PCA reproduces most of MRL’s claimed gains ¹⁶. “Truncate freely” is doing more work in the paper than in practice.

The deployment footguns

The legacy task_type parameter (e.g. RETRIEVAL_QUERY) is silently ignored by the v2 backend — integrations migrated from v1 fall back to a default, unoptimized embedding state unless callers manually prepend task strings like task: search result | query: {text} ¹⁷. This is the likely culprit behind the paper’s own anomaly: a 2.9 score on MMTEB Instruction Retrieval that the authors flag without explanation. The weakness may be an API-surface artifact rather than a model defect — which means third-party benchmarks that don’t update prefix conventions are under-measuring v2.

Then there’s the bill. Text runs $0.20/1M tokens, already 10× OpenAI’s text-embedding-3-small; audio is $6.50 and video $12.00 per 1M ¹⁸. The +3.59 MRR@10 over ASR-cascade is real, but ~32× the text rate is a steep toll. Whisper-then-embed remains economically defensible even where it’s technically worse.

Net read

Gemini Embedding 2 is the strongest generalist multimodal embedder shipping today, and the right default for mixed-media RAG inside the Google ecosystem. It is not categorically SOTA — Qwen3-VL wins MMEB-V2, and the MRL compression story doesn’t survive contact with independent benchmarks.

Round-ups

LLM introspection looks like pattern matching, not metacognition

Probing whether language models truly monitor their internal states, the authors find apparent self-reports track surface cues rather than hidden activations. Behavioral and representational tests suggest introspective answers reflect learned conversational patterns, undercutting claims that LLMs genuinely access their own computations.

Temporally ordered pre-training sharpens LLM factual freshness

Kyutai’s Kairos study shows that training LLMs on Common Crawl snapshots in chronological order, instead of shuffling, improves temporal precision and factual freshness without hurting general language understanding. The approach reframes pre-training as a continual-learning curriculum over time-stamped web data.

FastKernels benchmarks LLM kernel agents under production stacks

Existing GPU kernel benchmarks miss how generated code behaves inside real inference engines. FastKernels evaluates LLM kernel agents against representative architectures using HuggingFace Transformers, vLLM and SGLang, exposing correctness drops and interface mismatches that synthetic harnesses hide.

LLaVA-OneVision-2 adds codec-stream tokens for video grounding

The second-generation vision-language model uses codec-stream tokenization, windowed attention and 3D RoPE to handle long video, temporal localization and tracking from one backbone. Authors introduce a JumpScore metric and report state-of-the-art results across fine-grained spatial and temporal benchmarks.

NVIDIA’s LocateAnything decodes bounding boxes in parallel

Instead of generating box coordinates token by token, NVIDIA’s grounding model emits each box as an atomic geometric unit in parallel. The design unifies detection and visual grounding, raising throughput and localization accuracy while preserving geometric coherence across large-scale training data.

MRT scales layered image generation to 20B parameters

The Masked Region Transformer trains a 20B-parameter diffusion model that produces and edits transparent image layers in one pass. An overflow-aware canvas and selective token masking unify image-to-layers, text-to-layers and layers-to-layers tasks, with distillation cutting inference cost.

MobileMoE brings sub-billion mixture-of-experts to on-device inference

MobileMoE adapts sparse mixture-of-experts to phones by combining fine-grained and shared experts with quantization-aware training. The sub-billion-parameter models beat dense baselines and prior MoE designs on both quality and efficiency, fitting INT4 weights into mobile memory budgets across prefill and decode.

MiniMax Forge technical blog (Hugging Face) — https://huggingface.co/blog/MiniMax-AI/forge-scalable-agent-rl-framework-and-algorithm

Prefix-Tree Merging… reportedly achieves an approximately 40x speedup in training throughput compared to standard sequential sample processing

↩
Kilo Blog — head-to-head M2.7 vs Claude Opus — https://blog.kilo.ai/p/we-tested-minimax-m27-against-claude

MiniMax achieved roughly 90% of Claude’s coding quality on specialized TypeScript benchmarks while being nearly 20 times cheaper

↩
StartupFortune — ‘SWE-bench has been benchmaxxed’ — https://startupfortune.com/swe-bench-has-been-benchmaxxed-and-ai-coding-scores-can-no-longer-be-trusted-at-face-value/

changing the orchestration layer can cause up to a 22% swing in performance, making it difficult to isolate the model’s raw capability from its vendor-optimized environment

↩ ↩²
Thomas Wiegold blog review of M2.7 — https://thomas-wiegold.com/blog/minimax-m-2-7-review-is-it-worth-the-hype/

absurdly chatty, often generating four times more tokens than the industry average for reasoning tasks

↩
OpenSourceForU — MiniMax M3 / licensing analysis — https://www.opensourceforu.com/2026/06/minimax-challenges-ai-rivals-with-m3-but-stops-short-of-full-open-source-commitment/

Modified-MIT license that explicitly prohibits commercial use without prior written authorization… developers have labeled this ‘faux open source’

↩
Anthropic — Detecting and Preventing Distillation Attacks — https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

approximately 24,000 fraudulent accounts and over 16 million unique exchanges with Claude… MiniMax was identified as the primary actor, allegedly responsible for 13 million of the total queries

↩
Google Cloud Threat Intelligence — Advances in Threat Actor Usage of AI Tools — https://cloud.google.com/blog/topics/threat-intelligence/threat-actor-usage-of-ai-tools

GTIG identified the first generation of malware that utilizes LLMs during execution — families such as PROMPTFLUX (a dropper) and PROMPTSTEAL (used by APT28) query Gemini or Qwen APIs at runtime for dynamic script generation and self-obfuscation.

↩
Zenity — MITRE ATLAS adds Agentic AI platform — https://zenity.io/blog/current-events/mitre-atlas-ai-security

MITRE ATLAS added a dedicated ‘Agentic AI’ platform filter with techniques like AI Agent Context Poisoning (AML.T0080) and Modify AI Agent Configuration (AML.T0081), shifting focus from attacking the base model to exploiting the agent’s tool-invocation and memory layers.

↩ ↩²
Thoughtworks (Medium) — Anthropic’s AI Espionage Disclosure: Separating Signal from Noise — https://thoughtworks.medium.com/anthropics-ai-espionage-disclosure-separating-the-signal-from-the-noise-87c00d851be4

Critics characterized the disclosure as ‘marketing spin’ designed to highlight Anthropic’s safety monitoring rather than a revolutionary shift in threat landscapes; the AI was prone to significant errors, hallucinating stolen credentials and forcing human operators to intervene at four to six critical decision points.

↩ ↩²
Help Net Security — Malware using LLMs (citing Marcus Hutchins) — https://www.helpnetsecurity.com/2025/11/05/malware-using-llms/

Many early samples like PROMPTFLUX are ‘slop malware’ that lack necessary guardrails or entropy, often failing to execute because they rely on the faulty assumption that LLMs instinctively know how to evade antivirus software.

↩ ↩²
Penligent — PentestGPT Alternatives 2026 Edition — https://www.penligent.ai/hackinglabs/pentestgpt-alternatives-from-chatbots-to-autonomous-agents-2026-edition/

While agents can exploit up to 87% of known one-day CVEs when provided with descriptions, their success rate against undocumented zero-days or hardened HackTheBox challenges remains near 0% — a persistent ‘lab-to-real gap’.

↩ ↩²
Startup Fortune — Anthropic account bans send developers scrambling — https://startupfortune.com/anthropic-account-bans-send-developers-scrambling-for-claude-code-alternatives-as-trust-in-single-provider-ai-erodes/

Developers reported organization-wide bans without warning, with entire teams losing access to Claude Team and API accounts due to unidentifiable ‘suspicious signals’ — some reports suggest up to 1.5 million accounts were terminated in the same period.

↩
Hugging Face — Qwen3-VL-Embedding-8B model card — https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B

Qwen3-VL-Embedding-8B currently ranks first on the MMEB-V2 leaderboard with an overall score of 77.8, particularly excelling in image-text and video-text matching.

↩
MindStudio comparison (Gemini Embedding 2 vs Qwen3-VL) — https://www.mindstudio.ai/blog/gemini-embedding-2-vs-qwen3-vl-embeddings-comparison

Qwen3-VL-2B achieves a significantly smaller modality gap (measured at approximately 0.25) compared to Gemini Embedding 2 (0.73). A lower gap suggests that Qwen’s text and image vectors are semantically closer.

↩
zc277584121.github.io embedding-models-benchmark-2026 — https://zc277584121.github.io/rag/2026/03/20/embedding-models-benchmark-2026.html

Gemini Embedding 2 ranked last in compression quality, showing a Spearman rho of 0.668 when truncated to 256 dimensions — higher degradation than competitors like Voyage or Jina.

↩
MachineBrief — MRL Overhyped or Underestimated — https://www.machinebrief.com/news/matryoshka-representation-learning-overhyped-or-underestimat-wjbb

Standard, non-MRL text embeddings are surprisingly robust to truncation, often holding their ground against MRL-trained counterparts unless dimensions are reduced by more than 70%… simple post-hoc PCA can make standard embeddings just as compressible.

↩
LlamaIndex GitHub issue #21535 — https://github.com/run-llama/llama_index/issues/21535

For Gemini Embedding 2, the legacy task_type parameter (e.g., RETRIEVAL_QUERY) has been deprecated and is now ignored by the backend… led to ‘silent accuracy loss’ where the model falls back to a default, unoptimized embedding state.

↩
tokencost.app pricing breakdown — https://tokencost.app/blog/gemini-embedding-2-pricing

Text: $0.20 per 1M tokens (a 33% increase from the legacy 001 model); Images: $0.45; Audio: $6.50; Video: $12.00 per 1M tokens.

↩

MiniMax-M2 fails repro, Anthropic logs 832 bans, Gemini Embedding 2 lands

TL;DR

MiniMax-M2 hits frontier scores on 9.8B active params

TL;DR

The architecture bet

Forge is the contribution that travels

Benchmarks vs. reproducibility

The asterisks the paper doesn’t print

Anthropic: AI attackers move from prep to orchestration

TL;DR

The shift Anthropic claims to see

Where independent reporting agrees

Where it pushes back

Takeaway

Gemini Embedding 2 maps video, audio, image, text to one vector

TL;DR

One model, four modalities, deep fusion

Where the SOTA claim wobbles

The deployment footguns

Net read

Round-ups

LLM introspection looks like pattern matching, not metacognition

Temporally ordered pre-training sharpens LLM factual freshness

FastKernels benchmarks LLM kernel agents under production stacks

LLaVA-OneVision-2 adds codec-stream tokens for video grounding

NVIDIA’s LocateAnything decodes bounding boxes in parallel

MRT scales layered image generation to 20B parameters

MobileMoE brings sub-billion mixture-of-experts to on-device inference

Footnotes

Jack Sun, writing.