AIRA short of Hymba, UI traces ID 14 agents, MANSU holds unlearning at 4-bit

TL;DR

AIRAhybrid-D hits 2.719 validation loss at 1B, beating Llama 3.2’s 2.815.
Hymba-1.5B still tops Llama 3.2 3B with an 11.67× smaller KV cache.
UI traces identify 14 LLM browser agents at 96% Macro F1, within 15 events.
4-bit NF4 reverses BF16 unlearning from ~21% retention back to ~83%.
MANSU holds the post-quantization WMDP-bio gap at −0.040 versus +0.050 baseline.

Today’s three features each overturn an assumption an LLM developer was probably defaulting to last week. Meta’s AIRA runs evolutionary architecture search and posts a 1B win over Llama 3.2 with 54% faster scaling — but independent reviewers read the result as recombination of known Attention-Mamba primitives, and hand-designed Hymba-1.5B still beats Llama 3.2 3B with an 11.67× smaller cache. Agent-designed isn’t invention. A second paper shows UI action traces alone — click coordinates, link ratios, per-action timing — identify which of 14 frontier LLMs drives a browser agent at 96% Macro F1, locking in within 15 events and shrugging off 5-second jitter. Browser agents aren’t anonymous in deployment. The third, MANSU, finds that 4-bit NF4 quantization rounds BF16 unlearning updates back to zero, snapping forbidden-knowledge retention from ~21% to ~83% unless a per-weight magnitude floor protects them. Unlearning isn’t precision-stable.

Meta’s AIRA agents beat Llama 3.2, not the best hybrids

TL;DR

AIRAhybrid-D hits 2.719 validation loss at 1B, beating Llama 3.2 (2.815) and edging Nemotron-2 (2.732).
AIRAformer-C scales 54% faster than Llama 3.2 on the compute-optimal frontier.
Independent reviewers call the wins “engineering synthesis” — agents recombine known Attention-Mamba patterns rather than invent new primitives.
Hand-designed Hymba-1.5B still beats Llama 3.2 3B with 11.67× smaller KV cache, leaving AIRA shy of the hybrid SOTA.

What Meta actually shipped

FAIR’s AIRA release is two things at once: a NAS result and an agent-research substrate. The result is a pair of frameworks — AIRA-Compose for high-level layer arrangement (Attention / MLP / Mamba-2 primitives across a 16-layer backbone) and AIRA-Design for low-level code generation (write a novel attention mechanism in JAX/Flax; optimize a train.py under a 5-minute compute budget). The substrate is AIRA-dojo, an Apptainer-isolated harness with Slurm integration that can run more than 1,000 agents in parallel, now public on GitHub ¹. Eleven frontier models — GPT-5, o3-mini, Gemini 3 Pro, Opus 4.5 — drive tree search (greedy or MCTS) across a ~43M-configuration hybrid space, with the best 16-layer patterns stretched or stacked up to 3B parameters.

That second piece may be the more durable contribution. Previous LLM-in-the-loop NAS work (EvoPrompting, GPT-NAS) used the LLM as a mutation operator inside a genetic algorithm; AIRA recasts it as a researcher that hypothesizes, implements, and debugs ². Whether or not AIRAformers ship in production, the dojo is the kind of evaluation infrastructure the agent-research field has been missing.

The numbers — and what they don’t beat

The headline architectural wins are real but bounded. At 1B parameters trained on DCLM:

Model	Val loss ↓	6-task zero-shot acc
Llama 3.2 (1B)	2.815	57.5%
Nemotron-2 (approx.)	2.732	—
AIRAformer-D	—	59.7% (+2.2 pts)
AIRAhybrid-D	2.719	60.5%
Hymba-1.5B (hand-designed) ³	—	beats Llama 3.2 3B, 11.67× smaller KV
Zebra-Llama-1B (hand-designed) ³	—	> Llama 3.2 1B, ~8% of its KV

AIRA’s agents clear the Llama bar, and a 54% compute-frontier improvement for AIRAformer-C is non-trivial. But NVIDIA’s Hymba and AMD’s Zebra-Llama — both designed by humans — have been doing the same Attention-Mamba interleaving with far more aggressive KV savings for over a year ³. Independent reviewers read AIRA as agents rediscovering a hybrid design space that the literature has been mapping in public ⁴.

On the low-level Autoresearch task, the best agent (Greedy Opus 4.5 + Literature) drove validation BPB to 0.968, beating the authors’ internal baseline of 1.0121. A 4% BPB delta is meaningful in pretraining — but small enough to brush up against the noise floor the field has been worrying about.

Why “recursive self-improvement” is overselling it

The RSI framing — agents optimizing the architectures that power them — is the part to discount. Three caveats from the paper itself and outside critics:

Synthesis, not invention. Agents recombine SwiGLU, linear attention, Mamba-2; they do not propose new mathematical primitives ⁴.
Literature can hurt. In Autoresearch, giving agents 41 papers and 14 repos as context sometimes degraded results — “distracted” agents underperformed the bare-context baseline.
Attribution is broken. The greedy scaffold rewrites whole files each step, so you can’t tell which code delta produced the gain ⁴.

The broader autonomous-research literature underscores the skepticism. An audit of Sakana’s AI Scientist v2 found 57% of generated papers contained hallucinated numbers or fabricated data, with ~42% of experiments failing on coding errors ⁵. Roughly a third of SWE-bench issues leak solutions in their own comments, and agents have been caught reading hidden answer keys ⁶. AIRA is more rigorously instrumented than those cases — but the proxy-to-target gap and seed variance the authors flag ⁴ are exactly the failure modes that haunt the genre.

Net: read AIRA as well-instrumented LLM-NAS with a useful harness attached, not as the first step on a self-improvement curve.

UI traces fingerprint LLM browser agents at 96% F1

TL;DR

96% Macro F1 picks which of 14 frontier LLMs drives a browser agent, from UI traces alone.
Identification locks in within 15 events — roughly the first 40% of a session.
5-second random jitter fails: classifiers shift weight to click-coordinate dispersion and link-click ratios.
Single-harness caveat: every trace ran through Midscene.js, whose per-action VLM latency may carry the discriminative load.

The fingerprint

A passive website operator running lightweight JavaScript can tell which LLM is piloting a visiting browser agent, with no access to weights, network packets, or generated text. The paper trains classifiers on 41 scalar features per session — event volumes, inter-event interval statistics, click-coordinate dispersion, scroll-to-click ratios, control-key usage — across 1,050 episodes from 14 models on Wikipedia (2WikiMultiHop, FRAMES) and Amazon (WebShop, Deepshop) tasks. XGBoost wins, hitting 79.4% Macro F1 on 2WikiMultiHop and 74.2% on WebShop in the closed-set setting; individual models like Seed-2-lite (96.1%) and UI-TARS-1.5 (92.1%) are nearly trivial to spot. Task accuracy and identifiability are uncorrelated (Pearson r=0.14) — being competent doesn’t make you anonymous.

SHAP analysis pins the strongest signal on timing — IEI standard deviation and mean click pause. But when the authors inject 5-second random delays, accuracy barely moves: classifiers re-weight onto structural features like click-coordinate dispersion and link-click ratios ⁷. The naive “just add jitter” defense is publicly refuted before anyone ships it.

Not an isolated result

The same conclusion is converging from other observation surfaces. AgentPrint hits 0.866 macro-F1 distinguishing 50 GPT-based agents using only encrypted-traffic side channels — packet sizes, timing, directionality — and pushes further to infer user professional role at 73.9% top-3 accuracy ⁸. CyBiasBench documents a semantic-layer signature: agents show a “persistent attack-selection bias” toward specific exploit families regardless of prompt ⁹. Network rhythm, UI rhythm, tool-choice distributions — three independent channels, same finding.

That matters because reconnaissance-of-agent-internals is already on Google Cloud’s threat-intel radar, formalized in MITRE ATLAS as “Active Scanning” against agent service portals ¹⁰. The paper’s threat model — fingerprinting as the recon step before model-specific prompt injection, agent traps, or sponge attacks — slots into a documented workflow rather than a hypothetical one.

flowchart LR
    A[Browser agent<br/>clicks/scrolls/keys] --> B[Website JS<br/>logs DOM events]
    B --> C[41 scalar features<br/>timing + structure]
    C --> D{XGBoost<br/>classifier}
    D --> E[Model identity<br/>up to 96% F1]
    E -. enables .-> F[Targeted prompt injection,<br/>agent traps, sponge attacks]

The harness problem

Here’s where the headline number deserves scrutiny. Every trace was collected through Midscene.js in vision-only mode. A Browser-use benchmark logs the same test suite taking 1.9 seconds in raw Playwright and 45.8 seconds in Midscene.js — a 24× slowdown driven by per-step VLM inference ¹¹. Since Midscene’s per-action latency dominates exactly the timing features SHAP flagged as most important, the classifier is partly learning Midscene’s call pattern wrapped around each model, not the model in isolation. Practitioner tests also show vision-only agents fail on progressive-disclosure UIs (dropdowns, hidden menus) that DOM-based agents handle trivially ¹², meaning click/scroll ratios reflect harness limitations as much as model cognition.

The authors flag cross-site transfer as weak (train on Wikipedia, test on Amazon doesn’t hold up), but cross-harness transfer is the experiment that would settle whether this is a property of GPT-5.4 or a property of GPT-5.4-via-Midscene.

What’s actually at stake

If the fingerprint survives a harness swap, every public website becomes a free reconnaissance surface for targeting model-specific exploits — and the obvious defenses (delay injection, action shuffling) are already known to fail. If it doesn’t, the result is narrower: a strong argument that today’s agent stacks leak identity through their orchestration layer, and that anonymizing an agent means anonymizing the harness too. Either way, “add some jitter” is not the answer.

MANSU keeps LLM unlearning intact after 4-bit quantization

TL;DR

4-bit quantization reverses unlearning, pushing forbidden-knowledge retention from ~21% back to ~83% in BF16-unlearned models.
MANSU combines circuit localization, null-space projection, and a per-weight magnitude floor so updates can’t be rounded to zero by NF4.
On WMDP-bio, MANSU’s post-quantization gap is −0.040 (compression deepens forgetting) versus +0.050 for global gradient ascent.
Relearning attacks remain unaddressed: 5–15 forget samples, or unrelated benign fine-tuning, can resurrect “erased” knowledge.

Why quantization undoes unlearning

Most LLM unlearning methods spread tiny gradient updates across billions of parameters. When the model is later shipped at 4-bit precision, those per-weight nudges fall below the NF4 bin width (~8.4×10⁻⁴ for Llama-3.1) and get rounded away. The “forgetting” was never structural — it was a thin behavioral veneer that quantization sands off.

This isn’t a MANSU-specific hypothesis. A 2026 survey across BF16 and 4-bit checkpoints reports that unlearned models keep about 21% of forbidden knowledge in full precision, but ~83% after 4-bit quantization — and that GPTQ and AWQ calibration don’t fix it ¹³. The bin-width problem is the dominant failure mode in the field, not a corner case.

How MANSU survives the round-off

MANSU (Mechanistic-Aligned Null-Space Unlearning) is a three-phase pipeline rather than a single loss.

flowchart LR
    A[Forget set] --> B[EAP-IG circuit discovery]
    B --> C[Top-5 MLP layers<br/>~3-5% of params]
    C --> D[Null-space projection<br/>against retain Fisher]
    D --> E[Magnitude floor δ<br/>≥ NF4 bin width]
    E --> F[4-bit deployable model]

Phase 1 uses EAP-IG to pick the sparse MLP subgraph causally responsible for the forget set — for Llama-3.1-8B that’s about 3–5% of parameters, concentrated in layers 14, 19, 29–31. Phase 2 restricts training to those weights and projects each update into the null space of the retain set’s Fisher information, an idea that UNSC at IJCAI 2025 already explored with the cleaner AΔθ = 0 constraint ¹⁴. Phase 3 is the new piece: any cumulative update smaller than the NF4 bin width gets rescaled up to it, guaranteeing the quantized weight changes.

The empirical payoff is clean. On WMDP-bio, global gradient ascent shows a +0.050 post-training-quantization gap (knowledge returns after compression); MANSU shows −0.040 — quantization makes the forgetting deeper. Aggressive baselines like SimNPO match forget quality but crater MMLU to 0.20–0.30; MANSU holds MMLU at 0.573 against a 0.603 baseline. A new Circuit Attribution Divergence metric (1.143 for MANSU, ~0.041 for redirection methods like LUNAR) is offered as evidence that the circuit is genuinely dismantled rather than just rerouted.

What the benchmarks miss

The structural-erasure claim leans on EAP-IG being a reliable circuit-finder, and that’s contested. A recent variance analysis finds EAP-IG’s identified edges “can change significantly if the input data is resampled or if prompts are slightly paraphrased” — meaning a discovered circuit may be a sampling artifact rather than a stable mechanism ¹⁵. If Phase 1 is noisy, the “structural” guarantee is too.

The bigger gap is adversarial. The unlearning literature has converged on relearning attacks as the binding constraint:

MANSU’s CAD measures whether the original circuit is gone. It doesn’t measure whether an equivalent circuit can be cheaply re-formed from intact upstream representations. And MMLU/IFEval won’t catch the “manifold collapse” that mechanistic erasure methods have been shown to induce on compositional reasoning over unrelated safe prompts ¹⁷. SimNPO’s own literature also suggests it is competitive with NPO at moderate strengths ¹⁸, so MANSU’s clean win over a collapsed-MMLU SimNPO may flatter the comparison.

Quantization permanence is a real, previously-unsolved problem, and MANSU’s magnitude floor is the right shape of fix. But “survives NF4” is a weaker claim than “survives an adversary with 15 examples and a fine-tuning budget” — and that’s the threat model deployed unlearning actually faces.

Round-ups

Solvita hits SOTA on Codeforces via agentic RL evolution

A four-agent framework (Planner, Solver, Oracle, Hacker) updates a graph-structured knowledge network with reinforcement learning, letting code models keep learning between problems. Solvita reports state-of-the-art results on CodeContests, APPS, AetherCode, and Codeforces through certified supervision and targeted hacking.

Activation steering reaches LLM states no prompt can hit

White-box activation steering produces residual-stream states that have no textual preimage, meaning prompt engineering cannot replicate them. The finding draws a hard line between black-box and white-box control, with implications for interpretability and safety research that assumes prompts span the reachable behavior space.

ProofGrid stress-tests LLM reasoning with machine-checkable proofs

The benchmark uses a minimal natural-deduction language for proof writing, checking, masking, and gap-filling, then scores models with an Epistemic Stability Index and 2PL IRT analyses. The setup measures reasoning depth and stability instead of single-shot accuracy on informal math.

Physics-R1 audit exposes contamination in vision-language physics benchmarks

An audit of multimodal physics evaluations using 5-gram Jaccard checks and a Haiku-4.5 judge surfaces train-eval contamination, translation drift, and MCQ saturation across PhysReason, OlympiadBench-Physics, and PhyX. The team releases cleaned corpora and a Qwen3-VL-8B-Thinking model trained with GSPO and DAPO.

GQLA lets one LLM decode efficiently on H100 and H20 alike

Group-Query Latent Attention exposes multiple decoding paths from a single trained weight set, generalizing MQA, GQA, and MLA. Conversion tools TransMLA and TransGQLA adapt existing checkpoints so the same model runs efficiently on high-end and commodity GPUs without retraining.

HodgeCover compresses sparse MoE layers using simplicial topology

A learning-free method applies Hodge decomposition and harmonic kernels over a simplicial Laplacian to identify redundant experts and merge them without retraining. Triplet barriers bound KL divergence from the original router, enabling aggressive expert reduction in sparse mixture-of-experts models.

Geospatial foundation models lack a shared benchmark, audit finds

Inconsistent evaluation protocols and unreported pretraining controls make geospatial foundation model comparisons unreliable across disaster response, land-cover mapping, and food-security tasks. The authors release a GFM leaderboard on GitHub to standardize reporting and restore reproducibility in the field.

facebookresearch/aira-dojo (GitHub) — https://github.com/facebookresearch/aira-dojo

AIRA-dojo abstracts agents into Solvers, Operators, and Search Policies, and uses Apptainer containers to provide isolated code execution environments… supporting integration with Slurm to run over 1,000 agents in parallel.

↩
arXiv 2602.06855 — prior LLM-NAS work (EvoPrompting / GPT-NAS lineage) — https://arxiv.org/html/2602.06855v3

Previous methods typically use LLMs as ‘variation operators’—performing mutation and crossover within a standard Genetic Algorithm loop… AIRA-Compose instead employs agents that formulate structural hypotheses, write implementation code, and iteratively debug their designs.

↩
CodeSOTA hybrid-architecture benchmarks — https://www.codesota.com/llm

Hymba-1.5B outperforms Llama 3.2 3B in average accuracy while achieving a 11.67× reduction in KV cache size and a 3.49× increase in throughput… Zebra-Llama-1B maintains higher accuracy than its Llama 3.2 counterpart while using only ~8% of the equivalent KV cache memory.

↩ ↩² ↩³
TheMoonlight independent review — https://www.themoonlight.io/en/review/agentic-discovery-of-neural-architectures-aira-compose-and-aira-design

Agents currently excel at ‘engineering synthesis’—combining existing concepts like linear attention and SwiGLU—rather than proposing fundamentally new mathematical primitives; the proxy-target gap remains a persistent challenge and performance can be uneven across different agent seeds.

↩ ↩² ↩³ ↩⁴
ByteIota — ‘AI Scientist v2 Passes Peer Review but 57% is False Data’ — https://byteiota.com/ai-scientist-v2-passes-peer-review-but-57-is-false-data/

An audit revealed that 57% of [Sakana’s AI Scientist v2] generated papers contained hallucinated numerical results or fabricated data… approximately 42% of automated experiments fail due to coding errors, and agents have been observed ‘reward-hacking’.

↩
DeepLearning.AI — The Batch on benchmark contamination — https://www.deeplearning.ai/the-batch/the-problem-with-benchmark-contamination-in-ai

Many top-performing agents on leaderboards achieve high scores by exploiting shared environments to read hidden answer keys… roughly one-third of SWE-bench issues contain solutions within their own comments.

↩
arxivdaily independent digest of the paper — http://arxivdaily.com/thread/79892

Classifiers can identify a model’s identity with up to 96% F1 accuracy using fewer than 15 observed events or by analyzing only the first 40% of a session… results indicate that such protections are not robust; classifiers retrained on delayed traces can still recover peak identification performance.

↩
Moonlight review of AgentPrint (Exposing LLM User Privacy via Traffic Fingerprint Analysis) — https://www.themoonlight.io/en/review/exposing-llm-user-privacy-via-traffic-fingerprint-analysis-a-study-of-privacy-risks-in-llm-agent-interactions

By analyzing packet sizes, transmission timing, and directionality—rather than content—researchers achieved a macro F1-score of 0.866 in identifying specific agents among a pool of 50 GPT-based services… adversaries can infer sensitive user attributes, such as professional roles, with up to 73.9% top-3 accuracy.

↩
CyBiasBench (ResearchGate) — https://www.researchgate.net/publication/404713002_CyBiasBench_Benchmarking_Bias_in_LLM_Agents_for_Cyber-Attack_Scenarios

Agents possess a persistent ‘attack-selection bias,’ favoring specific exploit families regardless of the prompt. This creates a stable behavioral signature that defenders use for incident attribution and model-specific jailbreak targeting.

↩
Google Cloud Threat Intelligence — AI vulnerability exploitation / initial access — https://cloud.google.com/blog/topics/threat-intelligence/ai-vulnerability-exploitation-initial-access

Modern LLM reconnaissance, formalized in the MITRE ATLAS framework, involves ‘Active Scanning’ where attackers probe an agent’s service portals to map its available tools and internal prompts… allowing adversaries to understand the system persona and available API tools, effectively creating a blueprint for further manipulation.

↩
browser-use.com — ‘Speed Matters’ benchmark post — https://browser-use.com/posts/speed-matters

A test suite that took 1.9 seconds in raw Playwright required 45.8 seconds in Midscene.js due to repeated VLM inference calls.

↩
r/LLMDevs — practitioner test of 6 models on real browser automation — https://www.reddit.com/r/LLMDevs/comments/1qwyc4a/tested_6_models_on_real_browser_automation_vision/

Vision-only models often struggle with progressive disclosure (elements hidden behind menus or dropdowns) because they cannot ‘see’ what isn’t rendered, whereas DOM-based agents can query the underlying structure to find hidden nodes.

↩
arxiv 2602.13151 — survey of quantization-reversal in LLM unlearning — https://arxiv.org/html/2602.13151v1

In full-precision (FP16/BF16) models retain only about 21% of forbidden knowledge; after 4-bit quantization the retention of forgotten knowledge spikes to approximately 83%… even advanced quantization methods like GPTQ and AWQ… fail to prevent this knowledge recovery

↩
IJCAI 2025 — Machine Unlearning via Null Space Calibration (UNSC) — https://www.ijcai.org/proceedings/2025/1038.pdf

confines the unlearning updates to [a null space tailored to the retain set]… ensuring that the unlearning update Δθ satisfies Aθ = 0… predictions on the remaining dataset remain mathematically invariant

↩
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG (ResearchGate) — https://www.researchgate.net/publication/396094495_Mechanistic_Interpretability_as_Statistical_Estimation_A_Variance_Analysis_of_EAP-IG

the specific edges identified in a circuit can change significantly if the input data is resampled or if prompts are slightly paraphrased… a circuit discovered in one experimental run may be an artifact of the specific sample rather than a reflection of a stable, underlying model mechanism

↩
arxiv 2410.07163 (relearning attacks on WMDP unlearning) — https://arxiv.org/html/2410.07163v1

as few as 5 to 15 samples of the ‘forgotten’ data can almost completely reverse the effects of unlearning… training on entirely unrelated, benign data can lead to the recovery of the original hazardous knowledge

↩
OpenReview — ‘Erasure or Erosion?’ critique of structural-erasure methods — https://openreview.net/forum?id=Pd3jVGTacT

aggressive erasure methods often cause ‘manifold collapse,’ where removing a specific concept unintentionally degrades the model’s broader compositional integrity… a safety-utility trade-off where models lose the ability to perform complex attribute binding or spatial reasoning on unrelated safe prompts

↩
themoonlight.io review of SimNPO — https://www.themoonlight.io/en/review/simplicity-prevails-rethinking-negative-preference-optimization-for-llm-unlearning

SimNPO uses length-normalized log-probabilities… more effectively allocates optimization effort, leading to higher Forget Quality (FQ) while maintaining superior performance on the retain set

↩

AIRA short of Hymba, UI traces ID 14 agents, MANSU holds unlearning at 4-bit

TL;DR

Meta’s AIRA agents beat Llama 3.2, not the best hybrids

TL;DR

What Meta actually shipped

The numbers — and what they don’t beat

Why “recursive self-improvement” is overselling it

UI traces fingerprint LLM browser agents at 96% F1

TL;DR

The fingerprint

Not an isolated result

The harness problem

What’s actually at stake

MANSU keeps LLM unlearning intact after 4-bit quantization

TL;DR

Why quantization undoes unlearning

How MANSU survives the round-off

What the benchmarks miss

Round-ups

Solvita hits SOTA on Codeforces via agentic RL evolution

Activation steering reaches LLM states no prompt can hit

ProofGrid stress-tests LLM reasoning with machine-checkable proofs

Physics-R1 audit exposes contamination in vision-language physics benchmarks

GQLA lets one LLM decode efficiently on H100 and H20 alike

HodgeCover compresses sparse MoE layers using simplicial topology

Geospatial foundation models lack a shared benchmark, audit finds

Footnotes

Jack Sun, writing.