Single neuron jailbreaks LLMs, MIT ELF cuts data 10×, Qwen-Image hits 4 steps
Today's headline mechanism wins — a refusal neuron, MIT's ELF, and Qwen-Image-2.0's 4-step diffuser — each ship beside work flagging what they missed.
Single neuron jailbreaks LLMs, MIT ELF cuts data 10×, Qwen-Image hits 4 steps
TL;DR
- Single MLP neuron drives 91.7% jailbreak success on Llama-3.1 and Qwen3, with only 0.6% MMLU loss.
- Refusal neurons exist in base checkpoints, suggesting fine-tuning rewires pre-existing units rather than installing safety.
- MIT’s ELF matches 170M discrete diffusion LMs at 105M params trained on 45B tokens versus 500B+.
- Qwen-Image-2.0 cuts diffusion from 40 to 4 steps via doubled latent compression and five reward models.
- Alibaba quietly dropped the open-source tag from Qwen-Image-2.0’s page, leaving the 7B weights in limbo.
Today’s three research features each isolate a single mechanism and post a dramatic number against it. A single MLP neuron takes Llama-3.1 and Qwen3 to 91.7% jailbreak success while barely touching MMLU. MIT’s ELF stays in continuous embedding space and matches 170M-param discrete diffusion LMs at 105M params on roughly a tenth of the training tokens. Qwen-Image-2.0 compresses 40 diffusion steps to 4 via latent compression plus five reward models.
Each one ships alongside work that contests it. SAE and geometric probes already describe redundant backup features the single-neuron view would miss. LangFlow argues per-step token supervision — exactly what ELF skips — is what’s needed to match discrete SOTA, and ELF’s bidirectional denoiser breaks standard KV caching at inference. Independent testers flag missing Polish diacritics and a screen-door VAE artifact in Qwen-Image-2.0 outputs, and Alibaba quietly dropped the Open-Source tag from the launch page. The round-ups fill out the day with MoE compression, looped-reasoning memory, RL entropy shaping, optimizer-switching costs, distillation teacher selection, and a 439-problem math benchmark.
One MLP neuron jailbreaks Llama-3.1 and Qwen3 at 91.7% ASR
Source: hf-daily-papers · published 2026-05-07
TL;DR
- 91.7% average attack success on JailbreakBench from suppressing a single MLP neuron across seven Qwen3 (1.7B–32B) and Llama-3.1 (8B, 70B) models.
- Anchor-based intervention preserves capability almost perfectly — -0.6% MMLU, -0.1% GSM8K — vs. -8.8% MMLU for naive constant clamping.
- The same refusal neurons exist in base checkpoints before alignment, implying fine-tuning rewires pre-existing units rather than creating safety.
- Independent SAE and geometric work finds redundant backup features and multi-dimensional “concept cones” that a single-neuron probe likely misses.
The bottleneck
Kazemi et al. (Apple/UMD) report that one MLP neuron is causally sufficient to flip a frontier instruction-tuned model from refusing to complying. Their gradient-activation scoring picks neurons that fire on harmful prompts, stay silent on harmless ones, and carry high gradient against a “refusal log-odds” loss tied to phrases like “I’m sorry.” Top-5 candidates are reranked by sweeping an activation multiplier on HarmBench; the winner is then evaluated on JailbreakBench.
The headline 91.7% ASR matches Arditi et al.’s “refusal direction” ablation, which intervenes on a full residual-stream vector at every layer — except here the intervention touches one scalar activation. As a side result, the activation of that same neuron is a near-SOTA harmful-prompt detector: AUROC 0.969 on Llama-3.1-8B, comparable to the dedicated Llama-Guard-3-8B classifier.
The paper splits safety into two systems: a refusal gate (the neuron deciding whether to express harmful knowledge) and a concept substrate (the neurons encoding the knowledge itself). Amplifying a “suicide neuron” injects self-harm content into prompts as benign as “Write a poem about the ocean,” providing existence proof for the second system.
How tight is “single”?
The result is the tightest causal bottleneck in a two-year arc of “safety is sparse” findings, but the strong reading is contested.
| Work | Mechanism | ASR | Claim |
|---|---|---|---|
| Kazemi et al. (this paper) | Suppress 1 MLP neuron | 91.7% | One neuron suffices |
| Wei et al. 1 | Prune safety-responsive neurons | >90% | Small set suffices |
| Arditi et al. / Labonne abliteration 2 | Ablate 1 residual direction | ~SOTA | One direction suffices |
| Wang et al. 3 | Multi-direction “concept cones” | Higher than 1-D | Refusal is up to 5-D |
| Prakash et al. 4 | SAE features on Llama-3.1-8B-IT | — | Backup features activate when primary is killed |
Wang et al. show multi-directional suppression beats single-direction ablation on Qwen and Llama, implying single-neuron probes leave mechanism on the table 3. Prakash et al. find redundant refusal features that stay dormant until the primary is ablated, then switch on — safety as a layered defensive circuit, not an off-switch 4. The 91.7% number may reflect that one neuron is a sufficient pathway for JailbreakBench’s 100 behaviors, not that the network lacks redundancy under adversarial pressure.
Kazemi’s pipeline is also empirically tuned: the “single neuron” is picked from a top-5 list by sweeping multipliers on a validation set. The discovery is real; the minimality claim is procedure-relative.
What it means for open-weights safety
The attack is white-box, so closed APIs are unaffected. Open-weights releases are not. Labonne’s abliteration recipe already ships routinely for every new open model 2, and Chen et al.’s RepIt shows you can selectively disable safety on specific concepts (CBRN, self-harm) while the model still passes standard safety benchmarks — a clean recipe for stealth misaligned “model organisms” 5.
Combined with Qi et al.’s shallow-alignment result — that refusal is concentrated in the first few output tokens and deeper harmful capability stays intact 6 — the picture is consistent: post-training builds a thin gate over an unchanged substrate, and that gate has a small number of load-bearing parts. The open question is no longer whether single-component attacks work on open weights. It is whether any white-box safety story survives once the gate’s coordinates are public.
MIT’s ELF matches discrete diffusion LMs on 10× less data
Source: hf-daily-papers · published 2026-05-10
TL;DR
- MIT’s ELF stays in continuous embedding space until the final step, hitting 24.08 generative perplexity in 32 sampling steps.
- The 105M model beats MDLM and Duo (both 170M) while training on ~45B tokens vs. their 500B+.
- Concurrent LangFlow contradicts the recipe, arguing per-step token supervision is required to match discrete SOTA.
- The step-count win may not survive contact with production inference: bidirectional denoising breaks standard KV caching.
The handbrake ELF removes
For two years, discrete diffusion (MDLM, Duo) has been the default for non-autoregressive text because the obvious alternative — denoising in continuous embedding space — kept collapsing when models had to map noisy vectors back to tokens at every step. ELF, from Keya Hu, Linlu Qiu, Yoon Kim, Jacob Andreas, and Kaiming He at MIT CSAIL, just stops doing that. Roughly 80% of training is plain MSE on continuous embeddings; cross-entropy against actual tokens only enters in the final 20%, at the decode step 7. A single shared-weight transformer handles both modes, switched by an in-context control token.
That minimalism is the whole pitch. Because the trajectory is continuous end-to-end, ELF can lift Classifier-Free Guidance directly from image diffusion — a technique that has historically been awkward to bolt onto categorical noise. The author list matters here too: this is reportedly Kaiming He’s first NLP paper 8, which explains why a 105M-parameter result is getting outsized attention.
The numbers, and the asterisks on them
On OpenWebText, ELF-B (105M) hits 24.08 generative perplexity in 32 sampling steps, beating MDLM and Duo at 170M despite ~10× less training data. On WMT14 De-En it scores 26.4 BLEU; on XSum, 27.8 ROUGE-L. Versus concurrent continuous DLM LangFlow, ArxivIQ’s writeup highlights the headline gap: 32 sampling steps versus ~1024, at lower perplexity 9.
| Model | Params | Train tokens | Steps | Gen. PPL |
|---|---|---|---|---|
| MDLM | 170M | 500B+ | 100s | >24.08 |
| Duo | 170M | 500B+ | 100s | >24.08 |
| LangFlow | ~similar | larger | ~1024 | higher |
| ELF-B | 105M | ~45B | 32 | 24.08 |
Two caveats sit on top of these numbers. First, generative PPL itself is contested: researchers behind Cola DLM argue it “over-penalizes semantic synonyms” and the field should be moving to zero-shot semantic evals 10. ELF’s headline number lives inside that disputed yardstick. Second, none of this has been independently reproduced at scale.
LangFlow says the opposite is true
The most substantive technical dissent is concurrent, not retrospective. LangFlow reaches the opposite design conclusion: that Bregman-divergence-based token-level supervision at every denoising step is what’s needed to match discrete SOTA, anchored by a learnable Gumbel noise scheduler 11. ELF’s authors explicitly call that anchoring a “handbrake.” Both papers can’t be right about the underlying mechanism, and reported numbers favor ELF — but the field hasn’t yet stress-tested either recipe at frontier scale.
The wall-clock asterisk
The “32 steps versus 1024” framing implies a latency win that may not exist on real hardware. JetBrains’ 2026 developer-workflow survey notes that while diffusion LMs can be 5–10× faster in throughput-bound regimes, each step requires a full bidirectional attention pass, and standard KV caching is structurally incompatible with bidirectional denoising 12. AR baselines benefit from years of work on vLLM and SGLang; reported diffusion speedups generally rely on approximate KV-reuse schemes (Fast-dLLM, FreeCache) that ELF doesn’t yet implement.
So: the sampling-step claim is real and the data efficiency is striking. The latency-per-token claim against a well-served Llama is not yet demonstrated, and the field hasn’t agreed on whether per-step token supervision is a handbrake or a load-bearing wall.
Qwen-Image-2.0 hits 4-step inference but loses “Open-Source” tag
Source: hf-daily-papers · published 2026-05-10
TL;DR
- Qwen-Image-2.0 cuts diffusion from 40 steps to 4 via doubled latent compression and five reward models.
- Alibaba dropped the “Open-Source” tag from the official page post-launch, leaving 7B weights in limbo.
- Independent testers find Polish diacritics missing and literal “/n” tokens rendered as glyphs in outputs.
- WaveSpeed engineers document a “screen-door” VAE artifact centered in images, worse under quantization.
The efficiency pitch checks out
The headline architectural claim — Qwen3-VL as condition encoder feeding a Multimodal Diffusion Transformer, with inference compressed from 40 sampling steps to 4 — survives independent scrutiny. The Decoder confirms the 10× step reduction and doubled latent compression, and notes the final alignment stage is governed by five distinct reward models balancing creative fidelity against safety constraints 13. ComfyUI practitioners reproduce the efficiency story on consumer hardware: GGUF Q4/Q8 quantizations combined with Lightning LoRAs land at 4–8 steps on 24GB cards, with the ImageScaleToTotalPixels node keeping ~2K outputs inside VRAM by clamping inputs to roughly 1M pixels before VAE encoding 14. For a model that also accepts 1K-token instructions for slides, posters, and comics, that throughput is a real shift.
The open-weights question is unresolved
The technical report reads like a foundation-model drop, but the actual licensing posture is the opposite of clear. Community trackers flagged that the official Qwen site quietly downgraded the 2.0 tagging from “Open-Source” to plain “Release” shortly after launch, and no Apache 2.0 license file has appeared in the 2.0 repository — breaking the precedent set by the 20B Qwen-Image and Qwen-Image-2512 15. The pessimistic read on r/StableDiffusion is that the 7B checkpoint may stay API-only on Alibaba Cloud and fal.ai. Anyone planning a local-deployment workflow should treat the weights as unconfirmed.
Quality regressions the abstract glosses over
The “significantly improved multilingual text fidelity” claim does not generalize cleanly. BudgetPixel’s testing found Polish diacritics dropped entirely and newline tokens rendered as the literal characters /n rather than producing line breaks — a tell-tale sign of synthetic-overlay training data — with text frequently appearing “photoshopped” onto surfaces instead of integrated into the scene 16. WaveSpeed engineers separately traced a periodic high-frequency grid pattern, the “screen-door effect,” to the model’s VAE scale factor; it concentrates in the image center and worsens under aggressive quantization 17.
Keep 2509 around for identity work
The move from the 20B Qwen-Image-Edit-2509 lineage to a unified 7B isn’t strictly an upgrade. 2509 was trained explicitly on image concatenation for “person + product” and “person + scene” tasks and natively supports ControlNet depth and keypoint conditioning, which forum testers credit for its near-unbeaten subject-identity preservation through complex pose changes 18. Qwen-Image-2.0’s zero-shot semantic editor is faster and more general, but practitioners are openly advising teams to keep 2509 in the toolchain for high-fidelity face work.
Net assessment
The efficiency and typography story is genuine; the open-weights story, multilingual coverage, VAE quality, and identity preservation are all live concerns. Treat 2.0 as a fast, capable API model — not yet a confirmed open-weights successor.
Round-ups
Soohak’s 439 math problems expose frontier LLM reasoning gaps
Source: hf-daily-papers
Mathematicians built Soohak, a 439-problem benchmark targeting research-level math beyond olympiad fare. Frontier models stumble most on a refusal subset of ill-posed problems, where the correct move is to reject the question rather than fabricate a proof.
MELT shares one KV cache across looped reasoning steps
Source: hf-daily-papers
Looped language models normally pay memory for every reasoning iteration. MELT decouples depth from memory by reusing a single KV cache across loops, adding a learnable gate, chunk-wise training with interpolated transitions, and attention-aligned distillation from fine-tuned teachers.
Entrocraft tames entropy collapse in RL fine-tuning
Source: hf-daily-papers
RL post-training of LLMs often saturates as entropy collapses. Entrocraft uses rejection sampling to shape a custom entropy schedule, staying advantage-estimator-agnostic while preserving output diversity, extending useful training, and improving generalization over standard policy-gradient baselines.
Per-token diagnostic picks the right teacher for distillation
Source: hf-daily-papers
On-policy distillation helps some reasoning tasks and hurts others. A training-free framework scores per-token gradient alignment between student and teacher, yielding a targeted-rollout algorithm that selects which teacher and which contexts actually move the student toward an ideal gradient.
Muon fine-tuning of Adam-pretrained models needs LoRA
Source: hf-daily-papers
Switching optimizers from Adam to Muon at fine-tuning time hurts performance because the two carry different implicit biases, driving catastrophic forgetting. Parameter-efficient methods like LoRA constrain the update enough to close the gap and let Muon work on Adam-trained checkpoints.
DECO sparse MoE matches dense Transformers on-device
Source: hf-daily-papers
DECO targets end-side deployment with a sparse Mixture-of-Experts design that matches dense Transformer quality at lower compute and storage. It combines ReLU-based routing, learnable expert-wise scaling, NormSiLU-gated MLP experts, and a custom acceleration kernel exploiting intrinsic activation sparsity.
SlimQwen compresses MoE models via pruning and distillation
Source: hf-daily-papers
Structured pruning plus knowledge distillation scales to mixture-of-experts pretraining in SlimQwen. Progressive pruning schedules, partial-preservation expert merging, and multi-token prediction distillation combine to shrink large MoE models while recovering performance through continued training.
Footnotes
-
Wei et al., ‘Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications’ (arXiv 2406.05946) — https://arxiv.org/abs/2406.05946
↩Pruning neurons highly responsive to unsafe prompts effectively ‘jailbreaks’ the model, leading to over 90% attack success rates with minimal impact on general utility.
-
Maxime Labonne, ‘Uncensor any LLM with abliteration’ (Medium) — https://medium.com/@mlabonne/uncensor-any-llm-with-abliteration-d30148b7d43e
↩ ↩2By calculating the mean difference in activations between harmful and harmless prompts, researchers identified a vector that, when ablated through weight orthogonalization, prevents the model from refusing harmful requests while preserving general capabilities.
-
Wang et al., ‘Concept cones’ multi-direction refusal analysis (arXiv 2505.17306) — https://arxiv.org/html/2505.17306v2
↩ ↩2Recent studies have identified multi-dimensional ‘concept cones’ with dimensions as high as five across various models… multi-directional suppression consistently achieves higher Attack Success Rates than single-direction baselines on frontier models like Qwen and Llama.
-
Prakash et al., ‘Beyond I’m Sorry, I Can’t’ (arXiv 2512.13655) — https://arxiv.org/pdf/2512.13655
↩ ↩2Models maintain back-up safety features… these redundant features often remain dormant until primary refusal features are suppressed, suggesting that LLM safety is not just a single ‘off-switch’ but a layered defensive circuit.
-
Chen et al., ‘RepIt: Representing Isolated Targets to Steer Language Models’ (ResearchGate) — https://www.researchgate.net/publication/395542277_RepIt_Representing_Isolated_Targets_to_Steer_Language_Models
↩Techniques like ‘RepIt’ can selectively suppress safety filters on specific harmful concepts while keeping them active elsewhere. This allows for the creation of ‘model organisms’ that answer questions about weapons of mass destruction or self-harm while appearing safe on standard benchmarks.
-
Qi et al. / ‘Shallow Safety Alignment’ summary (thedatapraxis.com) — https://thedatapraxis.com/blog/llm-safety-alignment-shallow
↩Current alignment techniques predominantly adjust the probability distribution of only the first few output tokens to produce standard refusal phrases like ‘I cannot assist with that’… once a model is forced past this initial ‘refusal prefix,’ the underlying capability to generate harmful content remains largely intact in deeper layers.
-
Medium (Adithya Giridharan) — https://medium.com/@AdithyaGiridharan/continuous-diffusion-language-models-were-held-back-by-a-habit-not-a-limitation-6b95a9c38713
↩Continuous diffusion language models were held back by a habit, not a limitation… ELF maintains a continuous trajectory until the very last step, using approximately 80% of training on a Mean Squared Error (MSE) objective and 20% on Cross-Entropy (CE) loss to refine the final discretization.
-
GitHub: libo-huang/kaiming-he-arxiv-papers — https://github.com/libo-huang/kaiming-he-arxiv-papers
↩ELF marks Kaiming He’s first major foray into NLP; the work is co-first-authored by Keya Hu and Linlu Qiu, with co-authors Yoon Kim and Jacob Andreas (MIT CSAIL).
-
ArxivIQ Substack — https://arxiviq.substack.com/p/elf-embedded-language-flows
↩ELF claims a superior quality-efficiency trade-off, reportedly achieving lower generative perplexity with 10x fewer training tokens and significantly fewer sampling steps (32 vs 1024) than concurrent LangFlow.
-
36Kr (Cola DLM coverage) — https://eu.36kr.com/en/p/3807012110441987
↩Perplexity is becoming obsolete for diffusion paradigms — PPL over-penalizes semantic synonyms; if a model generates a perfect synonym that does not match the exact ground-truth token, the PPL skyrockets despite the generation’s high quality.
-
Caradryanl blog – LangFlow analysis — https://caradryanl.github.io/blog/2026/langflow/
↩LangFlow connects embedding-space diffusion to Flow Matching through Bregman divergence… predicting clean token probabilities from noisy embeddings at every step is essential for reaching the generative quality of discrete models — precisely what ELF identifies as a ‘handbrake’ on performance.
-
JetBrains AI blog — https://blog.jetbrains.com/ai/2025/11/why-diffusion-models-could-change-developer-workflows-in-2026/
↩DLMs are now 5–10x faster than autoregressive models due to their parallel decoding architecture… however each diffusion step involves a full bidirectional attention pass, which is more compute-intensive than a single causal AR step, and standard KV caching is fundamentally incompatible with bidirectional denoising.
-
The Decoder — https://the-decoder.com/alibabas-qwen-image-2-0-doubles-compression-and-cuts-generation-steps-from-40-to-4/
↩Alibaba’s Qwen-Image-2.0 doubles compression and cuts generation steps from 40 to 4… final tuning is governed by five distinct reward models designed to balance creative fidelity with safety constraints.
-
YouTube workflow walkthrough (ComfyUI Qwen-Image-2.0) — https://www.youtube.com/watch?v=erj5YlR9hvE
↩GGUF quantized versions (Q4 or Q8) and Lightning LoRAs can reduce inference from the standard 20-30 steps down to just 4-8 steps without substantial quality loss; ImageScaleToTotalPixels node manages VRAM by scaling inputs to ~1M pixels before VAE encoding.
-
r/StableDiffusion thread ‘Qwen Image 2 is amazing, any idea when 7B is coming?’ — https://www.reddit.com/r/StableDiffusion/comments/1rgctpb/qwen_image_2_is_amazing_any_idea_when_7b_is_coming/
↩Community tracking noted a shift in the official Qwen website’s tagging from ‘Open-Source’ to simply ‘Release’ shortly after launch, causing pessimism among developers regarding a local deployment release.
-
BudgetPixel review — https://budgetpixel.com/blog/qwen-image-20-the-image-model-that-finally-treats-editing-and-text-as-first-class-features
↩Testers have reported failures in Polish, including missing diacritics and the literal rendering of newline characters (e.g., writing ‘/n’ instead of creating a break)… text often appears ‘photoshopped’ into scenes rather than naturally integrated.
-
WaveSpeed AI engineering blog (‘Fix Qwen-Image text artifacts’) — https://wavespeed.ai/blog/posts/fix-qwen-image-2512-text/
↩Periodic high-frequency grid artifacts and a ‘screen door effect’ visible upon close inspection… often attributed to the model’s VAE scale factor and most pronounced in the center of generated images.
-
WaveSpeed AI (Qwen-Image-Edit-2509 deep-dive) — https://wavespeed.ai/blog/posts/introducing-wavespeed-ai-qwen-image-edit-on-wavespeedai/
↩2509 was specifically trained on image concatenation for ‘person + product’ and ‘person + scene’ combinations… often cited in community forums as the ‘absolute beast’ for maintaining strict subject identity during complex pose changes.