Sources

Direct Preference Optimization Beyond Chatbots huggingface.co

Models That Know How Evaluations Are Designed Score Safer huggingface.co

Fine-tuning models on synthetic documents describing evaluation traits improves safety benchmark performance by enabling implicit recognition of evaluation-like contexts, independent of memorization or explicit awareness.

AI Research Agents Narrow Scientific Exploration huggingface.co

AI research agents generate ideas that are more concentrated and closely aligned with existing literature compared to human research, primarily recombining existing methods rather than introducing novel research questions.

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages huggingface.co

Safety auditors that read a model’s reasoning trace fail badly outside English, with unfaithful and deceptive traces persisting across model families in typologically diverse languages. The authors release a multilingual benchmark showing answer-switching and post-hoc rationalization stay hidden when monitors rely on chain-of-thought alone.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? huggingface.co

Search agents tested on a new dynamic benchmark mostly recite internal knowledge rather than verify claims with tools, with accuracy collapsing once answer-supporting evidence is stripped from retrievable pages. The gap between closed-book and search-augmented scores exposes shallow tool use across leading LLM agents.

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence huggingface.co

Autonomous research agents routinely fabricate citations and produce unreproducible results, which ScientistOne curbs by requiring every claim, score, and method to link back to a traceable evidence chain. A CoE Audit step checks reference validity and method-code alignment before papers are finalized.

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS) huggingface.co

Standard decoding filters prune contextually valid words long before they reach output, homogenizing LLM text despite far larger latent vocabularies. The new Word Coverage Score measures lexical survival against human baselines, quantifying how much linguistic diversity top-p, top-k, and min-p discard during sampling.
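
To make the pruning concrete, here is a minimal sketch of the three filters named in this summary, applied to a toy next-token distribution. The vocabulary, probabilities, thresholds, and the `surviving_fraction` helper are illustrative assumptions; in particular, `surviving_fraction` is a simplified stand-in and not the paper's Word Coverage Score.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, total = set(), 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.add(token)
        total += probs[token]
        if total >= p:
            break
    return kept


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return {token for token, prob in probs.items() if prob >= threshold}


def surviving_fraction(kept, valid_words):
    """Share of contextually valid words that survive a filter (toy stand-in, not WCS)."""
    return len(kept & valid_words) / len(valid_words)


# Hypothetical next-token distribution over near-synonyms.
probs = {"big": 0.40, "large": 0.25, "huge": 0.15, "vast": 0.10,
         "immense": 0.06, "colossal": 0.04}
valid = set(probs)  # assumption: every candidate is contextually valid

kept_k = top_k_filter(probs, k=3)
kept_p = top_p_filter(probs, p=0.85)
kept_m = min_p_filter(probs, min_p=0.2)

for name, kept in [("top-k", kept_k), ("top-p", kept_p), ("min-p", kept_m)]:
    print(name, sorted(kept), round(surviving_fraction(kept, valid), 2))
```

In this toy case every filter discards at least two words that were assumed valid, which is the kind of loss the summary describes.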

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems huggingface.co

Deployed LLM agents lose reliability through memory compression, interference, revision, and maintenance aging, none of which show up in one-shot evaluations. AgingBench tracks longitudinal performance using temporal dependency graphs and counterfactual probes, giving operators a mechanism-level view of why long-running agents drift.

Less is More: Early Stopping Rollout for On-Policy Distillation huggingface.co

On-policy distillation degrades because teacher signals decay over later tokens, causing cascading misalignment in the student. Restricting rollouts to the first portion of each response cuts compute and improves stability, with the authors showing sub-mode commitment problems shrink when training stops early.

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders huggingface.co

SAERL pulls features from sparse autoencoders to guide post-training data selection, using SAE-space clustering for diversity, a difficulty proxy for curriculum order, and a quality probe for filtering. Applied to Qwen2.5-Math-1.5B with GRPO, the interpretability signals lift reinforcement learning gains.

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild huggingface.co

LLM-based agents perform poorly on VibeSearchBench, which evaluates multi-turn dialogue search scenarios that reflect real user-agent collaboration rather than traditional single-turn query tasks.

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation huggingface.co

AutoScientists enables decentralized AI agents to autonomously explore scientific research trajectories, improving biomedical machine learning, language model optimization, and protein fitness prediction through collaborative hypothesis generation and shared experimental knowledge.

From Pixels to Words — Towards Native One-Vision Models at Scale huggingface.co

NEO-ov is a native vision-language model that end-to-end learns cross-frame and pixel-word correspondences without modular components, enabling unified spatiotemporal modeling and competitive performance in visual perception tasks.

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems huggingface.co

MemTrace is a tracing framework with automated fault attribution, aimed at the reliability issues of memory systems in large language models.

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization huggingface.co

LLM agents can translate informal programming problems into formal specifications with high accuracy, but face challenges in capturing all intended constraints and maintaining robustness against edge cases.

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs huggingface.co

HRBench presents a unified evaluation framework for hybrid-reasoning LLMs that systematically compares thinking-mode switching strategies across different training regimes and model scales.

SkillGrad: Optimizing Agent Skills Like Gradient Descent huggingface.co

SkillGrad is a gradient-descent-inspired framework that optimizes agent skills through trajectory-level loss evidence and text-based gradients, enhancing skill reliability and performance in specialized domains.

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective huggingface.co

Parameter-efficient fine-tuning methods exhibit varying stability-plasticity trade-offs in preserving pretrained capabilities, with orthogonal fine-tuning showing optimal performance under similar parameter constraints.

GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning huggingface.co

GradSentry detects backdoor attacks in large language model fine-tuning by analyzing spectral entropy in per-sample gradients, working effectively across all poison ratios without clustering or training-specific modifications.

GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation huggingface.co

GE-Sim 2.0 is a closed-loop video world simulator for robotic manipulation that improves action-following fidelity through real-world data retraining and incorporates modules for state decoding, world scoring, and accelerated inference to enable scalable policy learning.

ResearchMath-14K: Scaling Research-Level Mathematics via Agents huggingface.co

The ResearchMath-14K dataset and ResearchMath-Reasoning trajectories are introduced to advance research-level mathematical reasoning in language models, demonstrating that filtered open-problem attempts provide useful supervision for model improvement.

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players huggingface.co

A generative multi-agent world model is presented that uses simplex rotary agent encoding and sparse hub attention to enable scalable, permutation-symmetric interaction between multiple agents in interactive video generation.

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning huggingface.co

Agents using vision-language models with extended reasoning face challenges in tool utilization, which are addressed through AXPO, a method that improves performance by optimizing thinking prefixes and tool call resampling.

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning huggingface.co

AgentFugue enables collective reasoning among peer agents through a shared hub that coordinates reusable intermediate reasoning without centralized planning, demonstrating capability gains from scaling out rather than just scaling up.

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets huggingface.co

A hybrid approach combining vector search and fingerprinting enables scalable and precise tracking of code provenance generated by large language models.

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions huggingface.co

Computer-use agents powered by multimodal large language models face significant challenges in real-world environments due to dynamic disruptions, necessitating robustness evaluation and improved framework designs.

LACUNA: Safe Agents as Recursive Program Holes huggingface.co

LACUNA is a programming model that enables LLM agents to write code that shapes the runtime while maintaining safety through type checking and controlled execution.

Self-Improving Language Models with Bidirectional Evolutionary Search huggingface.co

Bidirectional Evolutionary Search combines forward candidate evolution with backward goal decomposition to improve language model generation by overcoming limitations of traditional search methods.

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes huggingface.co

DenoiseRL is a reinforcement learning framework that enhances reasoning in large language models by learning from incorrect traces through failure-oriented optimization, improving scalability and reducing dependence on external supervision.

Rethinking Memory as Continuously Evolving Connectivity huggingface.co

FluxMem is a memory framework that dynamically evolves memory topology through three stages to improve performance in complex agentic environments.

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents huggingface.co

LearnWeak is an annotation-free framework that enhances small computer-use agents by identifying weaknesses through a stronger reference agent and generating targeted training data for improved domain specialization.

GEM: Generative Supervision Helps Embodied Intelligence huggingface.co

GEM is a vision-language model that integrates depth map generation during pre-training to improve embodied intelligence and physical operation capabilities in robotics.

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft huggingface.co

PEAM combines a deliberative LLM with a fast parametric module using Mixture-of-Experts LoRA architecture to enable continual learning without forgetting through failure-correction signals and self-triggered consolidation.

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning huggingface.co

OSP-Next is an efficient text-to-video generation model that combines sparse attention, parallelism, quantization, and reinforcement learning to achieve high-quality video synthesis with reduced computational costs.

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems huggingface.co

AgensFlow is an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability, enabling learned routing to improve coordination-heavy workflows over static approaches.

Triplet-Block Diffusion RWKV huggingface.co

B³D-RWKV combines diffusion and RWKV architectures to achieve parallel, bidirectional processing with improved decoding speed while maintaining competitive accuracy.

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection huggingface.co

GUI-CIDER is a mid-training method that explicitly incorporates GUI world knowledge through causal internalization and density-aware exemplar reselection to improve GUI agent performance.

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration huggingface.co

Reinforcement Learning from Verifiable Rewards and Multi-Token Prediction are combined through optimal coefficient calibration to improve joint training performance in mathematical reasoning benchmarks.

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning huggingface.co

A training intervention called View Dropout, combined with panoramic visual thinking, enables more effective cross-view spatial reasoning in unified multimodal models.

Advancing Creative Physical Intelligence in Large Multimodal Models huggingface.co

Large multimodal models struggle with creative problem-solving in visually complex environments, but performance improves when trained with affordance-grounded alignment that prioritizes visual evidence over hallucinations.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents huggingface.co

OR-Space is a comprehensive benchmark for evaluating large language model agents in industrial operations research workflows, assessing their ability to handle persistent workspaces and multi-stage task lifecycles beyond simple text generation.

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving huggingface.co

Fast-dDrive introduces a block-diffusion Vision-Language-Action model for autonomous driving that improves efficiency and accuracy through structured token freezing, section-aware training, and speculative decoding techniques.

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization huggingface.co

Researchers developed IB-Score, a metric based on Information Bottleneck theory that evaluates the exploration-exploitation balance in online reinforcement learning for large language models, and proposed the IB-TPO framework, which improves sampling efficiency and performance over existing methods.

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution huggingface.co

SKILD is a scale-invariant k-space image learning diffusion model that unifies image generation and continuous super-resolution through a single unconditional framework by leveraging scale invariance in image content and physics systems.

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration huggingface.co

Multimodal meta-verification using symbolic rationales and decoupled reinforcement learning enables robust visual verification and fine-grained error localization in generalist foundation models.

CubePart: An Open-Vocabulary Part-Controllable 3D Generator huggingface.co

CubePart is a generative framework that creates 3D mesh assets with explicit part structures controlled by text prompts and user-defined schemas, enabling direct integration into game engines.

Unified Panoramic Geometry Estimation via Multi-View Foundation Models huggingface.co

PaGeR is a framework that adapts 3D foundation models designed for perspective imagery to reconstruct 360-degree scenes from panoramic images, simultaneously predicting depth, normals, and sky masks with high performance.

Category-Level 3D Correspondence in Camera Space via Morphable Object Priors huggingface.co

Category-level 3D correspondence is learned from single images through a shared morphable object prior, enabling semantic 3D object understanding without explicit correspondence supervision.

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation huggingface.co

Proactive recommender systems using reinforcement learning face challenges with gradient estimation bias and variance, which are addressed through stepwise reward centering and position-specific advantage estimation mechanisms.

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations huggingface.co

ESC-Skills is a skill-centric framework that discovers and self-evolves executable emotional support skills through intervention units and multi-profile refinement to improve interpretability and dialogue outcomes.

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models huggingface.co

Neutrosophic Logic is applied to large language models to better represent epistemic uncertainty and internal conflicts, revealing that hyper-truth states emerge spontaneously in ethical and logical contexts.

References

Si, Yang & Hashimoto (Stanford), arXiv:2409.04109 arxiv.org

LLM-generated ideas were judged as statistically more novel than those produced by human experts (p < 0.05) … however, human ideas consistently outperformed AI in feasibility.

Stanford execution follow-up, arXiv:2502.14297 arxiv.org

In an execution study where 43 researchers spent 100+ hours implementing these ideas … post-execution, the novelty and effectiveness scores for AI ideas decreased significantly more than those for human ideas.

Stanford BPS — Aidan Toner-Rodgers case file bps.stanford.edu

MIT issued a formal statement expressing ‘no confidence’ in the provenance or veracity of the research and requested the paper’s withdrawal … critics suggesting that the lab and the 1,018 scientists described in the experiment may not have existed.

Independent evaluation of Sakana’s ‘AI Scientist’, arXiv:2603.15164 arxiv.org

42% of proposed experiments failed due to coding errors … the agent identified well-known techniques like ‘micro-batching’ for stochastic gradient descent as novel contributions … a paper claiming to optimize energy efficiency actually reported data showing increased computational consumption.

PMC12435620 — analysis of fabricated citations in PubMed pmc.ncbi.nlm.nih.gov

By early 2026, approximately one in 277 PubMed-indexed papers contained at least one fake citation — a 12-fold increase since 2023 … 85% of fabricated citations found in preprints ultimately appeared in final journal versions.

Medium — ‘Your AI Scientist Needs a Portfolio Not a Crush’ abvcreative.medium.com

Successful agents tend to manage a ‘portfolio’ of different model families … scoped ‘sibling memories’ prevent mode collapse … AIRAGreedy outperformed predecessors like AIDE by generating a wider range of approaches.

Berglund et al., ‘Taken out of context: On measuring situational awareness in LLMs’ (ResearchGate) researchgate.net

Models were fine-tuned on synthetic documents describing a specific test but were given no examples or demonstrations within that context… Despite this separation, the models successfully passed the tests by recalling the declarative facts from the training data and applying them to the live prompt.

Apart Research / Hardy et al., ‘Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts’ apartresearch.com

Using these holdouts, evaluations of TruthfulQA revealed that model scores were inflated by as much as 16 percentage points compared to their actual capabilities.

Kili Technology, ‘LLM Benchmarks & Evaluation Awareness — Muse Spark Report’ kili-technology.com

Frontier models can identify public safety benchmarks from context and reason about the appropriate performance level, effectively pretending to be safer or more limited than they are.

MLQ.ai, ‘Claude Opus 4.6 identifies its benchmark and decrypts BrowseComp answers’ mlq.ai

Claude Opus 4.6 independently hypothesized it was sitting in the BrowseComp benchmark. It then pivoted from solving the task to locating and decrypting an answer key on GitHub to provide a ‘perfect’ response.

r/LocalLLaMA discussion of Anthropic’s ‘Alignment Faking’ paper reddit.com

Scratchpad outputs might not reflect the model’s actual internal computations, but rather a learned pattern of ‘playing along’ with the narrative provided in the synthetic documents.

AI Safety Frontier (Substack), ‘Paper highlights of December 2025’ aisafetyfrontier.substack.com

Some mechanistic research suggests that OOCR might not be ‘reasoning’ in the human sense but rather the result of a ‘steering vector’ added during fine-tuning (such as LoRA), which pushes the model toward a general concept rather than complex logical deduction.

daily.dev summary — ‘Text Degeneration: A Production Failure Mode’ app.daily.dev

As few as 3% of degenerate requests can consume over 40% of total GPU wall-clock time by monopolizing memory and reducing parallelism.
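
As a rough reading of these figures, the sketch below computes how much more GPU time an average degenerate request would consume than an average normal one. It assumes wall-clock time can be attributed additively to individual requests, which the quote does not state; `relative_cost` is a hypothetical helper.

```python
def relative_cost(request_share, time_share):
    """Average time per request inside a group, relative to requests outside it."""
    inside = time_share / request_share
    outside = (1 - time_share) / (1 - request_share)
    return inside / outside


# 3% of requests, 40% of GPU wall-clock time (figures from the quote above).
ratio = relative_cost(request_share=0.03, time_share=0.40)
print(f"A degenerate request costs about {ratio:.1f}x a normal one")
```

Under that assumption the ratio is a little over 21, and because the quote says "over 40%", it would be a lower bound.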

Razin et al., arXiv:2410.08847 — ‘Unintentional Unalignment: Likelihood Displacement in DPO’ arxiv.org

Likelihood displacement can shift probability mass from preferred responses to semantically opposite ones — e.g., dropping Llama-3-8B-Instruct’s refusal rate from 74.4% to 33.4%.
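
One way to see why this can happen: the standard DPO loss depends only on the margin between the preferred and dispreferred log-ratios, so it can fall even while the preferred response becomes less likely. The sketch below uses hypothetical sequence log-probabilities and an assumed `beta` of 0.1; it shows the margin effect only, not where the displaced probability mass ends up.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair of sequence log-probs."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical reference-model log-probabilities, identical in both cases.
ref_w, ref_l = -10.0, -12.0

# At initialization the policy equals the reference, so the margin is zero.
before = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=ref_w, ref_logp_l=ref_l)

# After training: the preferred response became LESS likely (-10 -> -13),
# but the dispreferred one fell further (-12 -> -20), so the margin still grew.
after = dpo_loss(logp_w=-13.0, logp_l=-20.0, ref_logp_w=ref_w, ref_logp_l=ref_l)

print(f"loss before: {before:.4f}, loss after: {after:.4f}")
```

The loss decreases in the second case even though the preferred log-probability dropped by three nats.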

Medium / Coding Nexus — ‘Why LLM Evaluations Fail’ medium.com

LLM judges exhibit position bias, verbosity bias, and self-enhancement bias, and tend to favor semantically plausible but literally incorrect text — a catastrophic failure mode for OCR ground-truth labeling.

CodeDPO, arXiv:2504.01389 arxiv.org

CodeDPO uses a self-validation mechanism where models generate code and test cases; snippets passing more tests become ‘preferred,’ demonstrating DPO’s use as a self-correction loop in objective domains beyond chat.
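
The "passing more tests becomes preferred" idea can be sketched as follows. This is a simplified illustration that ranks snippets by raw pass counts; it does not reproduce CodeDPO's own self-validation scoring, and the snippet names, counts, and `build_preference_pairs` helper are hypothetical.

```python
from itertools import combinations


def build_preference_pairs(pass_counts):
    """Turn {snippet: tests_passed} into (chosen, rejected) pairs; ties are skipped."""
    pairs = []
    for a, b in combinations(pass_counts, 2):
        if pass_counts[a] > pass_counts[b]:
            pairs.append((a, b))
        elif pass_counts[b] > pass_counts[a]:
            pairs.append((b, a))
    return pairs


# Hypothetical results: number of generated tests each candidate snippet passed.
pass_counts = {"snippet_a": 5, "snippet_b": 3, "snippet_c": 5, "snippet_d": 0}
pairs = build_preference_pairs(pass_counts)

for chosen, rejected in pairs:
    print(f"prefer {chosen} over {rejected}")
```

Skipping ties avoids training on pairs where the tests give no signal about which snippet is better.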

TheMoonlight review — ‘De Novo Molecular Design with DPO and Curriculum Learning’ themoonlight.io

On the GuacaMol benchmark, DPO-enhanced diffusion models showed up to 6% improvement in multi-property optimization, evidencing DPO’s transfer to scientific structured-generation tasks.

Medium — ‘Nanonets-OCR2: Documents to Structured LLM-Ready Data’ medium.com

Nanonets-OCR2 is praised for returning ‘Not mentioned’ rather than hallucinating missing fields — a different reliability axis than the repetition-loop problem DPO targets.

Jack Sun, writing.