Sources

LLMs believe false statements even after explicit warnings that they’re false arstechnica.com

Fine-tuning tests show “bias … toward confidently representing the claims as true.”

🔬ESM: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub latent.space

Biohub’s Protein World Model: ESMC-6B, ESMFold2, 6.8B proteins, 1.1B structures, antibody design, SAEs, & the potential for programmable biology

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws huggingface.co

Different optimizers produce distinct spectral scaling behaviors in Transformer models, with Muon achieving superior scaling efficiency compared to AdamW in representation capacity utilization.

Platonic Representations in the Human Brain: Unsupervised Recovery of Universal Geometry huggingface.co

Self-supervised encoders trained on the Natural Scenes Dataset recover a universal latent space across individual brains without paired data. Subject-specific embeddings align through unsupervised orthogonal rotations, enabling cross-subject retrieval and supporting the platonic representation hypothesis at the level of human fMRI.

Unsupervised Process Reward Models huggingface.co

Unsupervised Process Reward Models skip step-level human annotations by using a base language model’s next-token probabilities to locate the first erroneous step in reasoning trajectories. The approach matches supervised PRMs on ProcessBench and boosts test-time scaling and RL policy optimization.

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning huggingface.co

SCRL breaks verifiable-reward RL into subproblem-level normalization plus curriculum learning, fixing GRPO’s coarse credit signal on long reasoning chains. On Qwen3-4B and 14B bases, it lifts pass@1 and pass@64 across AIME24, AIME25, and IMO-Bench.
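
The "coarse credit signal" can be illustrated with a minimal sketch of GRPO-style group normalization, where each sampled response gets one scalar advantage that is then copied to every one of its tokens. The reward values and token counts are hypothetical, the use of the population standard deviation is an assumption (implementations differ), and SCRL's subproblem-level normalization is not reproduced here because the summary does not specify it.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps) over one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, lengths):
    """Every token of a response inherits that response's single advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]

rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical verifier outcomes for 4 samples
lengths = [3, 5, 2, 4]          # hypothetical token counts per sample
adv = grpo_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)
```

In this sketch all five tokens of the second response receive the same negative advantage, whether or not a given token contributed to the wrong answer. That is the granularity problem finer-grained credit assignment aims to address.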

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks huggingface.co

A data engine reverse-engineers in-the-wild terminal recordings into validated evaluation tasks, yielding 1,530 jobs across 18 categories and 1,280 unique commands, some exceeding 50 steps. A 200-task Verified subset offers a manually reviewed slice for benchmarking agent shell competence.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning huggingface.co

Self-regulated simulative planning decomposes agentic reasoning into three systems—a world-model simulator, a self-regulator that controls planning horizon, and a reactive executor. The structured pipeline cuts reasoning tokens substantially while preserving Pass@1 against monolithic chain-of-thought baselines.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps huggingface.co

RTPurbo exploits intrinsic sparsity already present in pretrained full-attention LLMs, training a token indexer with dynamic top-p selection in roughly a hundred steps. The result delivers large prefill and decode speedups on long contexts with near-lossless accuracy and minimal KV cache overhead.
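
As a rough illustration of dynamic top-p selection, the sketch below keeps the smallest set of tokens whose softmax mass reaches a threshold, so the number of retained tokens adapts to how peaked the scores are. The `scores` inputs stand in for a hypothetical indexer's outputs, and the threshold `p=0.9` is an assumption. The summary does not describe RTPurbo's actual indexer or selection rule.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(scores, p=0.9):
    """Return indices of the smallest token set whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)

peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical: one dominant token
flat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    # hypothetical: no dominant token
sparse_keep = top_p_select(peaked)
dense_keep = top_p_select(flat)
```

With the peaked scores a single token already covers the threshold, while the flat scores force all six tokens to be kept. A fixed top-k budget cannot adapt in this way.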

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention huggingface.co

Gated DeltaNet-2 decouples the erase and write operations of delta-rule recurrent states using separate channel-wise gates, paired with a chunkwise WY algorithm and gate-aware backward pass. It beats Mamba-2, Mamba-3, and Kimi Delta Attention on RULER and needle-in-a-haystack retrieval.

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search huggingface.co

Lean Refactor presents a retrieval-augmented agentic framework that improves Lean proof refactoring by addressing multi-objective optimization, version compatibility, and scalability challenges through curated strategy databases and version-filtered retrieval.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards huggingface.co

Reinforcement learning from verifiable rewards is enhanced through a discriminative token credit assignment method that improves reward-based training by amplifying distinctive token-gradient directions and reducing noise from shared patterns.

Forecasting Downstream Performance of LLMs With Proxy Metrics huggingface.co

Proxy metrics based on token-level statistics from expert-written solutions provide more reliable model performance forecasting than traditional loss-based methods across multiple development stages.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving huggingface.co

KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles huggingface.co

A reinforcement learning-driven orchestration framework dynamically composes expert models and skills for multimodal tasks, achieving superior performance with low computational overhead.

ACC: Compiling Agent Trajectories for Long-Context Training huggingface.co

Agent Context Compilation (ACC) enhances long-context reasoning in LLMs by converting multi-turn agent trajectories into structured QA pairs, enabling direct supervision of distant context integration without additional annotation.

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning huggingface.co

Spreadsheet-RL is a reinforcement learning framework that trains specialized spreadsheet agents in realistic Excel environments, improving AI agent performance on both general and domain-specific spreadsheet tasks through automated data collection and domain-specific benchmarks.

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators huggingface.co

Audio diffusion models are adapted for interactive music generation through efficient block-wise processing and novel training paradigms that enable real-time performance on consumer hardware.

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows huggingface.co

Proactive assistance in personal agent systems requires identifying hidden user intents through sustained multi-turn interactions, which current benchmarks fail to adequately evaluate.

Forecasting Scientific Progress with Artificial Intelligence huggingface.co

Current AI systems demonstrate limited capability in predicting scientific progress, showing inconsistent performance across domains and systematic overconfidence in forecasts.

“I didn’t Make the Micro Decisions”: Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration huggingface.co

A goal-level attribution framework called CoTrace is introduced to analyze how large language models contribute to goal shaping in human-AI collaboration, revealing that while models account for a small percentage of direct contributions, they play a significant role in introducing concrete requirements and making indirect contributions.

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning huggingface.co

ClinSeekAgent is an automated agentic framework that enables large language models to actively acquire and synthesize multimodal clinical evidence from raw data sources, improving decision-making accuracy in both text-only and multimodal tasks.

Bernini: Latent Semantic Planning for Video Diffusion huggingface.co

A unified video generation and editing framework combines multimodal large language models for semantic planning with diffusion models for pixel rendering, achieving state-of-the-art performance through semantic interface separation and enhanced positional embeddings.

WorldKV: Efficient World Memory with World Retrieval and Compression huggingface.co

WorldKV enables persistent world generation in video diffusion models by retrieving and compressing key-value cache chunks to maintain consistency while improving throughput.

Swift Sampling: Selecting Temporal Surprises via Taylor Series huggingface.co

Swift Sampling is a training-free frame selection algorithm that identifies high-information video moments by analyzing deviations from predicted visual feature trajectories in latent space.
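
One way to picture "deviation from a predicted feature trajectory" is a first-order Taylor (linear) extrapolation from the two preceding frames, scoring each frame by its distance from that prediction. This is a sketch under assumptions, not the paper's procedure: the extrapolation order, the Euclidean distance, the function names, and the toy one-dimensional features are all hypothetical.

```python
def surprise_scores(features):
    """Score frame t by its distance from a linear extrapolation of frames t-1 and t-2."""
    scores = [0.0, 0.0]  # the first two frames have no prediction to compare against
    for t in range(2, len(features)):
        prev, prev2, cur = features[t - 1], features[t - 2], features[t]
        # First-order Taylor step: f(t) ~ f(t-1) + (f(t-1) - f(t-2))
        predicted = [2 * a - b for a, b in zip(prev, prev2)]
        err = sum((c - q) ** 2 for c, q in zip(cur, predicted)) ** 0.5
        scores.append(err)
    return scores

def select_frames(features, k):
    """Return the indices of the k highest-surprise frames, in temporal order."""
    scores = surprise_scores(features)
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)

# Hypothetical 1-D features: steady motion with a jump at frame 5.
feats = [[0.0], [1.0], [2.0], [3.0], [4.0], [9.0], [10.0], [11.0]]
scores = surprise_scores(feats)
picked = select_frames(feats, 2)
```

Frames that continue the existing trend score zero. The jump at frame 5 and the frame right after it, where the extrapolation overshoots, are the only ones flagged.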

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching huggingface.co

FlowLong is an inference-time method for long video generation that uses overlapping sliding windows with Tweedie matching and stochastic early-phase sampling to improve temporal consistency and visual quality.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning huggingface.co

LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states using feature-level supervision and temporal consistency embedding, outperforming explicit text-based chain-of-thought approaches in audio-visual reasoning tasks.

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment huggingface.co

AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation while improving generation quality in downstream tasks.

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders huggingface.co

DecQ enhances representation autoencoders by introducing lightweight queries that improve reconstruction quality and generative performance without disrupting pretrained semantic spaces.

Q-ARVD: Quantizing Autoregressive Video Diffusion Models huggingface.co

Autoregressive video diffusion models face high inference costs that limit practical deployment, prompting the development of Q-ARVD, a novel quantization framework addressing frame-wise sensitivity imbalance and weight outlier patterns specific to these models.

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers huggingface.co

SEGA improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation huggingface.co

A self-evolving image generation framework uses tool-orchestrated trajectories and visual experience distillation to improve generative capabilities through iterative learning and reference-based prompting.

LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters huggingface.co

LoREnc secures foundation models and low-rank adapters through spectral truncation and compensation techniques that prevent unauthorized model recovery while maintaining performance for authorized users.

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects huggingface.co

PhysX-Omni presents a unified framework for generating simulation-ready 3D assets with physical properties across multiple categories using a novel geometry representation and evaluation benchmarks.

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding huggingface.co

OmniPro is introduced as the first benchmark for evaluating omni-modal large language models’ proactive streaming video understanding, featuring diverse tasks and dual-mode evaluation protocols.

Training Large Language Models to Predict Clinical Events huggingface.co

Longitudinal clinical notes are converted into temporal prediction examples using Foresight Learning, enabling improved clinical prediction through LoRA adaptation that enhances calibration and reduces uncertainty compared to base models.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation huggingface.co

The SpaceDG dataset and benchmark evaluate multimodal language models’ spatial reasoning robustness under visual degradations, revealing significant performance gaps and demonstrating improved robustness through targeted training.

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving huggingface.co

Sensor2Sensor generates high-fidelity multi-modal sensor data from in-the-wild dashcam videos using diffusion models and 4D Gaussian Splatting for autonomous driving system training and validation.

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems huggingface.co

A hierarchical multi-agent framework generates short dramas from single sentences by enforcing narrative pacing, ensuring spatial consistency, and implementing quality control through iterative refinement and reviewer loops.

SceneAligner: 3D-Grounded Floorplan Localization in the Wild huggingface.co

SceneAligner is a deep learning approach to floorplan localization that uses 3D scene reconstruction and cross-modal correspondence learning to work in real-world environments with limited data.

SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild huggingface.co

SAM 3D Animal enables multi-animal 3D reconstruction from single images using a promptable framework based on the SMAL+ model, with improved disambiguation through keypoints and masks.

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking huggingface.co

SAMOSA adapts SAM 2 for visual object tracking by incorporating motion prediction, semantic detection, and geometric constraints to improve robustness and generalization in complex scenarios.

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild huggingface.co

AnyMo is a geometry-aware framework that enables setup-agnostic human motion modeling using physics-grounded IMU simulation and graph encoding for cross-dataset activity recognition and cross-modal retrieval.

Minimalist Visual Inertial Odometry huggingface.co

A minimalist visual-inertial odometry approach uses four photodiodes with optical Gabor masks and a temporal convolutional network to achieve accurate planar motion estimation for differential-drive robots.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? huggingface.co

Researchers introduce a new task and dataset for evaluating personality reasoning in multimodal language models, revealing significant gaps between accurate predictions and grounded reasoning processes.

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation huggingface.co

Rule2DRC introduces a large-scale benchmark for DRC script synthesis with 1,000 rule-to-script tasks and 13,921 evaluation layouts, along with SplitTester which improves program selection through execution-based feedback.

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning huggingface.co

A unified fashion image retrieval framework is proposed that handles diverse query formats and search intentions through multimodal large language models with adaptive calibration and sampling strategies.

Diversed Model Discovery via Structured Table Discovery huggingface.co

A model search system combines semantic and structured table-based retrieval to improve the diversity and coverage of recommended models.

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts huggingface.co

Context and moral knowledge enhance sentence-level value detection differently across model types, with full-document context benefiting supervised models but not zero-shot LLMs, and retrieved knowledge consistently improving performance through early fusion.

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation huggingface.co

The TransitLM dataset enables end-to-end transit route planning using large language models trained on structured transit data, eliminating the need for traditional map-based approaches.

Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation huggingface.co

Episodic sampling from few-shot learning improves class-balanced batch construction in medical image segmentation, outperforming random and weighted sampling under low-data conditions due to reduced overfitting and extended training iterations.

References

LessWrong — Mayne, Evans et al. (authors’ post) lesswrong.com

Belief rates jump from a baseline of 2.5% to 88.6% after fine-tuning on documents that repeatedly warn the stories are false — almost matching the 92.4% rate from training on the same claims presented as true.

MachineBrief writeup machinebrief.com

GPT-4.1 totally resisted the most absurd claim (Ed Sheeran winning Olympic gold) but succumbed to plausible falsehoods, reaching ~90% belief on a fabricated dentist story — prior knowledge anchors defense only for implausible content.

AXRP podcast — Owain Evans interview axrp.net

The finding extends Evans’ prior work on the Reversal Curse and emergent misalignment: fine-tuning on chat transcripts of malicious behavior flagged as ‘examples of what the model should not do’ causes models to adopt those very behaviors.

Stanford HAI policy brief hai.stanford.edu

Even benign, utility-oriented fine-tuning datasets can lead to catastrophic forgetting of safety protocols; as few as 10 harmful examples can completely jailbreak a model’s guardrails.

LetsDataScience summary letsdatascience.com

Providing explicit corrections (naming Noah Lyles as the actual winner) only partially solves the problem, leaving a residual neglect rate of approximately 40% — and these solutions revert under further training.

proteineng.com — ESMFold vs AlphaFold3 comparative analysis proteineng.com

ESMFold2 achieves 55% success rate on antibody-antigen complexes from single sequences, surpassing AF3 in single-sequence mode; AF3’s reported 60% antibody success required sampling 1,000 different seeds, with single-seed top-ranked success dropping to 8.9–10.2%.

EvolutionaryScale blog — ESM Cambrian release evolutionaryscale.ai

ESMC-6B weights are not directly downloadable; access is gated behind the EvolutionaryScale Forge API for academic researchers and AWS SageMaker for commercial entities under a clickthrough Cambrian license.

Medium — ‘I tried to poke holes in Chai-2’s antibody design paper’ medium.com

Chai-2 reports a 16% average hit rate across 52 targets; BoltzGen reports 66% target success with 15 designs per target; Nabla Bio’s JAM-2 reaches 39% hit rate for VHH-Fc — providing context for Biohub’s 36–88% minibinder claims.

arXiv 2502.09135 — InterPLM / SAE evaluation arxiv.org

Sparse autoencoder features align more accurately with curated UniProt and Gene Ontology annotations than the original model neurons, with ~16,000 monosemantic features benchmarked for the 6B-parameter model; however, a ‘completeness problem’ remains for non-linear behaviors.

NTI — Framework for Managed Access to Biological AI Tools (Jan 2026) nti.org

The 2025 NASEM report defines ‘frontier biological models’ as high-parameter systems capable of emergent biological reasoning and recommends ‘if-then’ capability triggers — such as designing immune-evasive proteins — before additional safeguards apply.

purna.ai — AlphaFold vs Boltz vs ESMFold purna.ai

CASP16 results reveal a ‘memorization trap’: scaled single-sequence models excel at known folds but often fail on truly novel targets without the explicit MSA crutch; specialist models like Pearl re-introduced SO(3) equivariance to solve ligand docking where general diffusion still struggles.

Stanford ‘Fantastic Pretraining Optimizers’ (arXiv 2502.16982) arxiv.org

Matrix-based optimizers like Muon and SOAP provide a 30–40% stepwise speedup on models under 500M parameters, but this advantage decays to a modest 1.1x speedup as models scale to 1.2B parameters.

NorMuon paper (arXiv 2510.05491) arxiv.org

On a 1.1B parameter pretraining task, NorMuon achieved a 21.74% reduction in training steps compared to AdamW and an 11.31% improvement over Muon… NorMuon outperforms the Dion optimizer at the 1.1B scale

Microsoft Research — Dion: Distributed Orthonormalized Updates microsoft.com

Dion introduces a low-rank fraction hyperparameter… the authors project that dense models at the scale of Llama 3 (405B) can maintain performance with rank fractions as low as 1/64

Jha & Reagen, EMNLP 2025 (prior work on Spectral Utilization Index) aclanthology.org

While ‘soft rank’ (tail capacity) follows a near-perfect power law with FFN width, ‘hard rank’ (dominant-mode capacity) grows only sublinearly and with high variance… spectral utilization often peaks at intermediate dimensions—roughly 2048

Hugging Face discussion — ‘Scaling is not plug-and-play: what Muon teaches us about optimizers at scale’ discuss.huggingface.co

Because Newton-Schulz orthogonalization enforces unit singular values, the RMS of updates naturally shrinks as the matrix dimension D increases. Without a dimensional correction (scaling by 1/√D), updates can vanish at large scales
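
The shrinkage can be checked with a short worked equation. It assumes an idealized square D×D update O whose singular values are all exactly 1 (Newton-Schulz only approximates this in practice). The symbols O and σ_i are introduced here for illustration.

```latex
\[
\|O\|_F^2 \;=\; \sum_{i=1}^{D} \sigma_i^2 \;=\; D
\quad\Longrightarrow\quad
\mathrm{RMS}(O) \;=\; \sqrt{\frac{\|O\|_F^2}{D^2}} \;=\; \frac{1}{\sqrt{D}} .
\]
```

Under that idealization, quadrupling D halves the per-entry RMS of the update, so holding the RMS steady across widths requires a compensating factor that grows like √D.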

Keller Jordan blog — original Muon writeup kellerjordan.github.io

Muon is mathematically restricted to 2D matrix parameters… layers such as embeddings, biases, and final classifier heads remain on AdamW to maintain stability

Jack Sun, writing.