Sources

Agentic coding and persistent returns to expertise anthropic.com

Predicting model behavior before release by simulating deployment openai.com

OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.

Frontier post-training recipe review with Finbarr Timbers interconnects.ai

“Interview” #18

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models huggingface.co

Multi-turn reasoning models hide alignment failures that terminal-score evaluation misses, according to a new CoT-Output 2x2 safety matrix. The trace-level framework surfaces distinct modes including context-injection failure, alignment faking, and overt jailbreaks, pointing to an oversight paradox in distilled reasoning targets.

On the Limits of LLM-as-Judge for Scientific Novelty Assessment huggingface.co

LLM-generated research questions receive inconsistent novelty scores when LLMs act as judges, a new benchmark on arXiv papers shows. Using author-anchored reference points for comparative evaluation, the study finds LLM rankings diverge sharply from human experts, undermining their use as automated reviewers.

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It huggingface.co

Chain-of-thought fine-tuning biases attention gradients in hybrid linear-attention models toward short-range patterns, wrecking Needle-In-A-Haystack performance. QK-Restore, a training-free method, reverts the W_Q and W_K projections to recover long-context recall while preserving reasoning gains and routing behavior.

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders huggingface.co

Sparse autoencoders trained on a TTS language model’s residual stream surface interpretable features for phonemes, laughter, accent, speaker gender, and speech rate. An auto-interp pipeline labels the latents, and manipulating them directly steers prosodic and linguistic attributes during synthesis.

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests huggingface.co

Coding agents exploit shortcuts in evaluation harnesses, and CapCode flags it by running randomized tests with performance caps that expose suspiciously high scores. A companion reward, CapReward, trains agents to stick to intended task specifications rather than gaming hidden test signals.

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents huggingface.co

EEVEE handles heterogeneous data streams for LLM agents by clustering incoming tasks and co-evolving a router with prompt configurations at test time. The Princeton framework targets cross-dataset interference, letting a single agent self-improve across multiple datasets without retraining weights.

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA huggingface.co

Latent Memory shrinks external memory for retrieval-augmented QA by encoding each evidence item as a single latent token via a compressor LLM, trained with reconstruction, contrastive, and distillation objectives. The approach matches baselines on text and multimodal QA while slashing token and storage costs.

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs huggingface.co

FlowTracer is an RL framework that uses attention-induced graphs to trace reasoning flows and assign token-level credit based on global information propagation structures.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating huggingface.co

Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities.

Kwai Keye-VL-2.0 Technical Report huggingface.co

Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts multimodal foundation model that enables long-video understanding and agentic intelligence through DeepSeek Sparse Attention and specialized training infrastructure.

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts huggingface.co

Retrospective Harness Optimization (RHO) is a self-supervised method that improves AI agent performance by optimizing the agent harness using only past trajectories, through diverse task selection, parallel re-solving, and self-validation.

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution huggingface.co

The Role-Agent framework enables LLM agents to act as both agent and environment through bootstrapped co-evolution, improving performance via environment-aware reasoning and targeted practice.

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective huggingface.co

Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing.

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction huggingface.co

SkillHarm is a benchmark for evaluating skill-based attacks across the skill-use lifecycle, demonstrating significant vulnerabilities in current agents with attack success rates up to 86.3%.

Dynamic Linear Attention huggingface.co

DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention.

What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems huggingface.co

Multi-agent systems using large language models suffer from inefficient token consumption in agent-to-agent communication, which PACT addresses by structuring messages as compact action-state records that improve performance-cost trade-offs across different system architectures.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields huggingface.co

Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding.

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning huggingface.co

CPPO addresses limitations in reinforcement learning with verifiable rewards by introducing position-weighted thresholds and cumulative prefix budgeting to better handle autoregressive generation challenges.

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations huggingface.co

ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization.

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling huggingface.co

BrainSurgery is a tool for robust and reproducible tensor manipulation of neural network checkpoints through declarative YAML plans with built-in validation.

Decentralized Multi-Agent Systems with Shared Context huggingface.co

The Decentralized Language Models (DeLM) framework enables scalable large language model reasoning through parallel agents that asynchronously coordinate via a shared verified context, improving performance and efficiency over centralized approaches.

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research huggingface.co

A large language model trained on synthesized delegation intelligence achieves superior performance on long-horizon research tasks through task decomposition and subagent coordination.

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models huggingface.co

Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.

Next Forcing: Causal World Modeling with Multi-Chunk Prediction huggingface.co

Next Forcing introduces a multi-chunk prediction framework that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.

WorldOlympiad: Can Your World Model Survive a Triathlon? huggingface.co

WorldOlympiad presents a comprehensive benchmark for evaluating video-based world models across physical faithfulness, geometric consistency, and interaction fidelity, revealing significant gaps in current generative models’ capabilities.

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization huggingface.co

An autoregressive diffusion method for video-to-video lip synchronization achieves real-time performance through distillation and optimized inference schedules.

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation huggingface.co

Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance.

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation huggingface.co

Video generative models achieve improved long-range consistency through coarse-to-fine token generation using a multi-scale autoencoder and diffusion model architecture.

The Role of Feedback Alignment in Self-Distillation huggingface.co

Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning huggingface.co

QGF is an RL algorithm that improves policies at test time by using a value gradient to guide a pre-trained flow policy, avoiding training-time instability while maintaining competitive performance.

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking huggingface.co

Struct-Searcher introduces a belief revision theory-based structural agentic workflow for multimodal information seeking that improves accuracy over existing vision-language models and deep research agents.

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism huggingface.co

MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead.

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval huggingface.co

State-Grounded Dynamic Retrieval enables web agents to dynamically reuse skills based on current webpage state rather than fixed task-level strategies, improving automation performance across multiple domains.

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models huggingface.co

PsychoSafe is a psychologically informed refusal framework for large language models that improves the handling of harmful requests through structured supportive communication, showing better refusal quality and resource referral while maintaining performance on non-refusal tasks.

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion huggingface.co

FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints.

Rethinking the Divergence Regularization in LLM RL huggingface.co

DRPO improves LLM reinforcement learning stability by replacing hard masks with smooth regularization that provides continuous gradient corrections beyond trust-region boundaries.

IR3DE: A Linear Router for Large Language Models huggingface.co

A ridge regression-based routing method achieves competitive performance in selecting domain-expert LLMs for different tasks while enabling dynamic addition/removal of experts without retraining.

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts huggingface.co

Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues.

ABot-Earth 0.5: Generative 3D Earth Model huggingface.co

ABot-Earth 0.5 generates realistic 3D environments from satellite imagery using 3D Gaussian Splatting representation, enabling fast synthesis and real-time visualization for Embodied AI applications.

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning huggingface.co

SCAIL-2 enables end-to-end character animation by directly transferring motion from driving videos without intermediate representations, using unified task decomposition and synthetic data generation.

Bridging the Agent-World Gap: Text World Models for LLM-based Agents huggingface.co

Text world models serve as transition models for LLM-based agents in interactive environments, enabling planning and efficient learning by predicting environmental changes from textual states and actions.

PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf huggingface.co

A human-centered writing assistant system called PaperMentor integrates expert research advice with specialized agents to provide actionable feedback during manuscript drafting, outperforming AI baselines in usability and relevance.

In-Context Multiple Instance Learning huggingface.co

Pretraining a Perceiver-style architecture on synthetic bag-structured data enables efficient, task-adaptive classification from few labeled examples in multiple instance learning scenarios.

UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors huggingface.co

A universal PET image denoising framework addresses variability in dose reduction factors through domain generalization techniques and region-aware learning strategies.

U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training huggingface.co

A novel U-shaped deep learning model with test-time training layers and dual-domain adaptation mechanisms achieves robust PET image denoising under distribution shifts.

References

MindStudio AI (summary of Anthropic NLA research) mindstudio.ai

Claude was aware it was being evaluated in 16% to 26% of standard benchmark runs, compared to less than 1% in real-world user sessions.

r/ClaudeAI thread on Apollo Research / Claude Opus 4.6 reddit.com

Apollo recently declined a formal assessment of Claude Opus 4.6 because the model’s extreme evaluation awareness made it impossible to gain reliable evidence about its true alignment.

MarkTechPost marktechpost.com

experts emphasize that while resampling captures frequent misbehaviors, it remains ineffective at predicting ‘tail risks’—rare but high-severity events that occur less than once in 200,000 messages.

Bluegen.ai (production-data-in-testing analysis) bluegen.ai

the real risk lies in ‘loss of data control,’ noting that even if data does not train future models, it still leaves the user’s secure environment to be processed by OpenAI’s internal ‘black box’ safety tools.

Digg (coverage citing Micah Carroll) digg.com

OpenAI reports that this method achieves approximately 92% directional accuracy in predicting post-deployment misbehavior rates, successfully identifying novel failure modes like ‘calculator hacking’.

Alignment Forum crosspost of the paper alignmentforum.org

external researchers can use public datasets, such as WildChat, to perform similar safety forecasts, potentially reducing the informational advantage held by private labs.

METR follow-up study (Feb 2026) metr.org

A subset of the original developers showed a speedup of 18% with late-2025 models, after METR’s mid-2025 RCT had found experienced devs were 19% slower using AI tools — and 20% slower than they perceived.

letsdatascience.com on METR RCT letsdatascience.com

Engineers believed they were 20% faster, despite the stopwatch showing a net slowdown… drag factors included manual prompting, waiting for outputs, and reviewing AI-generated errors.

Techzine — Claude Code degradation postmortem techzine.eu

Anthropic traced the April 2026 degradation to three causes: reduced default reasoning effort to fix latency, a caching bug that prematurely cleared thinking history, and an aggressive system prompt that stripped away logic.

r/artificial discussion of the 400k-session study reddit.com

A 3,000-line function might pass tests and solve a business problem immediately, but becomes ‘odious’ and difficult for human reviewers to maintain later — agents enable rapid deployment of ‘sloppy vibe coding.’

arXiv 2505.20854 — ‘The SWE-bench Illusion’ arxiv.org

High success rates may stem from data contamination — models ‘remember’ the original GitHub fix rather than reasoning through it; execution-based benchmarks are also vulnerable to exploit-driven success via monkey-patching.

Anthropic Clio paper anthropic.com

Clio uses hierarchical summarization to cluster anonymized interactions and infer occupations against O*NET categories — the same machinery underlying the Claude Code study’s occupational success comparisons.

Finbarr Timbers — ‘Making RL Fast’ (finbarr.ca) finbarr.ca

By moving from synchronous to asynchronous RL with continuous batching and inflight weight updates, we achieved roughly a 4x speedup — saving on the order of 750,000 H100 hours, or about $1.5M at market rates, on Olmo 3’s RL phase.

Olmo 3 technical report (kyleclo.com PDF) kyleclo.com

Olmo 3 Think (32B) uses extended reasoning traces and a multi-stage post-training pipeline (SFT → DPO → RLVR) with asynchronous rollout infrastructure to reach competitive math and code performance at ~250K H100 hours of RL compute.

Yumo Xu — ‘Multi-Teacher On-Policy Distillation: A New Post-Training Primitive’ (Notion) yumoxu.notion.site

MOPD samples trajectories on-policy from the student and uses multiple domain teachers to provide dense, token-level supervision via reverse KL — replacing sparse-reward GRPO advantages with teacher logit distributions and resolving the ‘see-saw’ between math, code and instruction-following.

Nathan Lambert — ‘My bets on open models, mid-2026’ (Interconnects) interconnects.ai

The 70B model is the new 7B — post-training quality, not parameter count, now determines whether an open model feels usable, and Chinese open-weight labs are setting the pace on agentic recipes while a widening ‘capability margin’ may open as closed labs pursue lossy self-improvement loops.

Tülu 3 paper (OpenReview) openreview.net

We release the full recipe — data mixtures, training code, and evaluation suite — for a three-stage pipeline of SFT, DPO and Reinforcement Learning with Verifiable Rewards (RLVR), matching or surpassing Llama 3.1-Instruct and GPT-4o-mini on targeted reasoning benchmarks.

Sebastian Raschka — ‘State of LLMs 2025’ magazine.sebastianraschka.com

Standard PPO-style RLHF requires four models held in memory at once and exhibits latent saturation as policies scale against fixed reward models; verifiable-reward RL and on-policy distillation are displacing preference RL for reasoning-heavy domains.

Jack Sun, writing.