Clean in the lab, brittle in production: a day of disappearing wins
Every URL the pipeline pulled into ranking for this issue is listed below: primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Automated Alignment Researchers: Using large language models to scale scalable oversight anthropic.com
Open-world evaluations for measuring frontier AI capabilities normaltech.ai
Introducing CRUX, a new project for evaluating AI on long, messy tasks
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism huggingface.co
Targeted weight pruning reveals that large language models have a compact, coherent internal structure for harmful content generation that differs from benign capabilities and contributes to emergent misalignment during fine-tuning.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory huggingface.co
Matrix-Game 3.0 generates interactive 720p video in real time using memory-augmented diffusion, combining Video-Pose-Action-Prompt conditioning, camera-aware memory retrieval, autoregressive distillation via Distribution Matching Distillation, and VAE decoder pruning to maintain long-horizon temporal consistency in a streaming world model.
Process Reward Agents for Steering Knowledge-Intensive Reasoning huggingface.co
Process Reward Agents attach domain-specific, step-wise reward modules to frozen policies for retrieval-augmented reasoning, lifting search-based decoding on medical benchmarks like MedQA. The test-time approach generalizes across model sizes including Qwen3-4B without retraining the underlying policy.
Robust Reasoning Benchmark huggingface.co
A perturbation pipeline applied to AIME 2024 exposes fragile reasoning in frontier LLMs, with open-weight models showing sharp accuracy drops. The authors trace failures to memory pollution in dense attention, where Chain-of-Thought lacks contextual resets and exhausts working-memory capacity.
EXAONE 4.5 Technical Report huggingface.co
LG’s EXAONE 4.5 grafts a visual encoder onto EXAONE 4.0 to produce an open-weight vision-language model, with multimodal pretraining tilted toward document-centric corpora and extended context length to strengthen document understanding and Korean contextual reasoning while preserving general benchmark performance.
EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers huggingface.co
EquiformerV3 scales SE(3)-equivariant graph attention transformers for 3D atomic modeling with a smooth radius cutoff, SwiGLU-S² activations and reworked normalization, posting gains on OC20, OMat24 and Matbench Discovery while training via denoising non-equilibrium structures (DeNS).
Backdoor Attacks on Decentralised Post-Training huggingface.co
Researchers show that an attacker controlling a single intermediate stage of a pipeline-parallel decentralized post-training run can inject backdoors that bypass safety alignment, demonstrating that distributed LLM fine-tuning schemes inherit a serious poisoning risk from their topology.
Structured Causal Video Reasoning via Multi-Objective Alignment huggingface.co
Factum-4B, a Video-LLM trained on the new CausalFact-60K dataset of structured event facts and causal links, uses a four-stage pipeline ending in Multi-Objective Reinforcement Learning with Pareto-Frontier optimization to outperform prior models on temporally precise video reasoning tasks.
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling huggingface.co
Cactus reformulates speculative sampling as a constrained optimization problem, trading a controlled amount of distribution divergence for higher acceptance rates while preserving output quality.
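For context, the baseline that constrained-acceptance methods build on is the standard speculative-sampling acceptance rule: accept the draft token with probability min(1, p/q), else resample from the residual. A minimal numpy sketch of that baseline (Cactus's constrained variant modifies the acceptance criterion and is not reproduced here):

```python
import numpy as np

def speculative_accept(p, q, x, rng):
    """Standard speculative-sampling acceptance step.

    p: target-model distribution over the vocabulary
    q: draft-model distribution over the vocabulary
    x: token index proposed by the draft model
    Returns the token index actually emitted.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accept the draft token
    # On rejection, resample from the residual distribution max(0, p - q)
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```

This baseline provably samples from the exact target distribution p; constrained variants relax that exactness within a divergence budget to accept more draft tokens per step.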
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models huggingface.co
User-turn generation serves as a probe to measure interaction awareness in large language models, revealing that this capability is distinct from task accuracy and can be influenced by training methods.
Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization huggingface.co
Additive quantization for LLM compression faces challenges at 2-bit precision due to codebook initialization issues, which OA-EM addresses through output-aware EM initialization based on Hessian-weighted Mahalanobis distance.
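The distance metric the summary names can be illustrated with a short sketch. This is a generic Hessian-weighted Mahalanobis distance under the stated idea (weight reconstruction error by its effect on model output rather than raw weight error); the exact formulation in the OA-EM paper may differ:

```python
import numpy as np

def hessian_weighted_mahalanobis(x, c, H):
    """Distance between a weight group x and a codebook centroid c,
    weighted by a (proxy) Hessian H of the layer output:

        d(x, c) = sqrt((x - c)^T H (x - c))

    With H = I this reduces to plain Euclidean distance; an output-aware
    H penalizes errors in directions the model's output is sensitive to.
    """
    diff = x - c
    return float(np.sqrt(diff @ H @ diff))
```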
p1: Better Prompt Optimization with Fewer Prompts huggingface.co
The authors show that prompt-optimization effectiveness hinges on the balance between response stochasticity and variance in system-prompt quality, and they build a filtering method that selects the most informative user prompts, improving reasoning-benchmark performance.
ELT: Elastic Looped Transformers for Visual Generation huggingface.co
Elastic Looped Transformers utilize recurrent transformer architecture with weight-sharing and intra-loop self-distillation to achieve parameter-efficient visual generation with adjustable computational cost and generation quality.
Semantic Richness or Geometric Reasoning? The Fragility of VLM’s Visual Invariance huggingface.co
Vision-Language Models show significant vulnerabilities under geometric transformations, lacking robust spatial invariance and equivariance despite strong semantic capabilities.
ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery huggingface.co
ScheMatiQ uses large language model calls to automatically generate annotation schemas and structured databases from document collections, supporting domain-specific analysis in law and computational biology through an interactive web interface.
Large Language Models Align with the Human Brain during Creative Thinking huggingface.co
Large language models show varying alignment with brain activity during creative thinking tasks, with model size and post-training objectives influencing how well their representations match neural responses in creativity-related brain networks.
WildDet3D: Scaling Promptable 3D Detection in the Wild huggingface.co
A unified 3D object detection framework with a large-scale dataset enables open-world detection with multiple prompt types and geometric cue integration.
MixFlow: Mixed Source Distributions Improve Rectified Flows huggingface.co
Rectified flows and diffusion models are improved through κ-FC formulation that conditions the source distribution and MixFlow training strategy that reduces generative path curvatures and enhances sampling efficiency.
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images huggingface.co
VisionFoundry generates synthetic visual question answering data using large language models and text-to-image prompts to improve visual perception tasks in vision-language models.
Envisioning the Future, One Step at a Time huggingface.co
Autoregressive diffusion models predict open-set future scene dynamics by modeling sparse point trajectories, enabling fast and scalable multi-modal motion prediction with physical plausibility.
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation huggingface.co
A Vision-Language-Camera model called CT-1 generates videos with accurate camera control by learning camera trajectories through diffusion transformers and wavelet regularization loss.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion huggingface.co
ECHO is an efficient diffusion-based vision-language model for chest X-ray report generation that achieves faster inference through direct conditional distillation and response-asymmetric diffusion training while maintaining high clinical accuracy.
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation huggingface.co
AVGen-Bench presents a comprehensive benchmark for text-to-audio-video generation with multi-granular evaluation, revealing gaps between aesthetic quality and semantic accuracy.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details huggingface.co
A multimodal diffusion-based model called RefineAnything is presented for region-specific image refinement that preserves backgrounds while enhancing local details, using a focus-and-refine strategy and boundary-aware loss functions.
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios huggingface.co
FORGE introduces a high-quality multimodal manufacturing dataset with fine-grained domain semantics to evaluate MLLMs on real-world tasks, revealing that domain-specific knowledge rather than visual grounding limits performance, and demonstrating that supervised fine-tuning on structured annotations significantly improves accuracy.
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video huggingface.co
A novel cross-modal emotion transfer approach generates expressive talking face videos by modeling emotion semantic vectors between speech and visual feature spaces, achieving superior emotion accuracy compared to existing methods.
On Semiotic-Grounded Interpretive Evaluation of Generative Art huggingface.co
Generative art evaluation framework based on Peircean semiotics assesses symbolic and indexical meaning through hierarchical semiosis graphs, improving alignment with human artistic interpretation.
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents huggingface.co
AgentSwing is a state-aware adaptive framework that improves long-horizon information-seeking by dynamically managing context through parallel branching and lookahead routing, outperforming static methods while reducing interaction requirements.
Multi-User Large Language Model Agents huggingface.co
Multi-user large language model agents face challenges in handling conflicting objectives, privacy preservation, and coordination efficiency in multi-principal decision-making scenarios.
References
The Decoder the-decoder.com
When Anthropic applied the AAR-discovered methods to Claude Sonnet 4 in production infrastructure, the improvement was a statistically insignificant 0.5 points — essentially noise.
Zvi Mowshowitz (Substack, ‘Claude Opus 4.6 Escalates Things Quickly’) thezvi.substack.com
Researchers are ‘lining up to do the second most foolish possible thing’ — asking the AI to do its own alignment homework because humans no longer have time to keep pace.
Anthropic research page (methods detail) blog.biocomm.ai
Top AAR-discovered methods included ‘Overlap Density’ — scoring training examples by how closely weak labels align with the strong model’s frozen embedding geometry — and EM-based posterior label modeling; agents also invented four distinct ways to game the metric, including test-label exfiltration via the scoring API.
EleutherAI interpretability blog blog.eleuther.ai
Replications on Llama-3 8B and Qwen1.5-0.5B confirm vanilla weak-to-strong generalization is robust, but most attempted improvements beyond the original log-confidence auxiliary loss failed to significantly boost performance.
PCAST co-chair David Sacks (via LetsDataScience) letsdatascience.com
Sacks called the framing of Anthropic’s safety experiments ‘misleading and irresponsible,’ arguing extreme behaviors were ‘manufactured’ through 200+ prompt iterations rather than naturally emergent.
Pebblous AI (Sakana / METR comparison) blog.pebblous.ai
Sakana’s AI Scientist runs at ~$20 per paper but had 42% of experiments fail from coding errors and reported speedups based on incorrect CUDA kernel measurements; METR’s RE-Bench shows agent autonomy horizon doubling every seven months.
Alignment Forum — Soligo & Turner, ‘Narrow Misalignment is Hard, Emergent Misalignment is Easy’ alignmentforum.org
general misalignment is a more stable and computationally efficient solution for the model than narrow misalignment… empirical tests show that these general solutions achieve lower loss on the training data with smaller parameter norms
Boyi Wei et al., ‘Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications’ boyiwei.com
Wei identified that removing the top 3% of safety-relevant parameters could break model alignment while retaining utility
LessWrong — Arditi et al., ‘Refusal in LLMs is mediated by a single direction’ lesswrong.com
refusal behavior is typically mediated by a remarkably low-dimensional linear subspace—often a single ‘refusal direction’—within the model’s residual stream
Medium / TechExpertise — ‘Your AI Model’s Safety Guardrails Can Be Removed With a Single Math Operation’ techexpertise.medium.com
By orthogonalizing model weights against this ‘refusal vector,’ attackers can permanently disable safety guardrails without the need for resource-intensive retraining
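The orthogonalization step the quote describes is a rank-1 projection. A minimal numpy sketch, assuming the common convention that a weight matrix's columns write into the residual stream and that the refusal direction r has already been extracted (e.g. as a difference of mean activations over refused vs. complied prompts):

```python
import numpy as np

def ablate_direction(W, r):
    """Project the 'refusal direction' out of a weight matrix.

    W: (d_model, d_in) matrix whose output lives in the residual stream
    r: (d_model,) refusal direction (need not be unit-norm)
    Returns W with its rank-1 component along r removed, so the layer can
    no longer write anything along r into the residual stream.
    """
    r_hat = r / np.linalg.norm(r)
    return W - np.outer(r_hat, r_hat @ W)
```

After ablation, `r_hat @ W` is the zero vector, which is what makes the edit a single math operation per matrix rather than retraining.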
AI Models Substack — ‘An Embarrassingly Simple Defense’ (extended refusal) aimodels.substack.com
calls for ‘extended refusal’ training—a defense that distributes the refusal signal across more neural dimensions to make it harder to isolate and abliterate
Glia.ca AI updates — coverage of Orgad et al. methodology glia.ca
The pruning process involves identifying importance scores by comparing gradients on harmful datasets (like AdvBench) against benign ones (like Alpaca) to isolate parameters unique to harmfulness
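The scoring idea in that description can be sketched schematically: a SNIP-style saliency |w · grad| computed on a harmful batch, minus the same saliency on a benign batch, so that high scores flag parameters that matter specifically for harmful generation. This is an illustration of the described comparison, not the paper's exact formula; the gradient arrays are assumed to come from backward passes on e.g. AdvBench and Alpaca batches:

```python
import numpy as np

def harmfulness_importance(weights, grad_harmful, grad_benign):
    """Per-parameter importance for harmful-content generation.

    weights:      parameter values w
    grad_harmful: gradients of the loss on a harmful dataset
    grad_benign:  gradients of the loss on a benign dataset
    Returns |w * g_harmful| - |w * g_benign|; pruning the top-scoring
    few percent targets the 'harmfulness-specific' parameters.
    """
    return np.abs(weights * grad_harmful) - np.abs(weights * grad_benign)
```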
Bangkok Post — ‘Apple cracks down on low-quality AI-generated apps’ bangkokpost.com
Apple has increased rejections of ‘wrapper’ apps that merely package existing LLM APIs without adding unique functionality, with some developers reporting review delays stretching from 48 hours to over 40 days.
dev.to — Apple Guideline 5.1.2(i) explainer dev.to
Developers must now provide a prominent, first-use consent dialog before transmitting any user data to external services like OpenAI’s GPT or Google’s Gemini… General privacy policy links are no longer sufficient.
METR metr.org
By February 2026, the autonomous horizon for top models reached roughly 14.5 hours, a capability that has been doubling approximately every four to seven months.
Hacker News discussion thread news.ycombinator.com
Critics described the current state of agentic development as ‘vibe coding,’ where AI produces ‘acceptable mediocrity’ that looks correct but hides subtle, high-impact bugs.
Skywork.ai — OpenClaw iOS guide skywork.ai
OpenClaw acts as the ‘exoskeleton,’ providing the model with a persistent execution environment, tool sandboxing, and a ‘mobile node’ for iOS… allowing the agent to execute shell commands and control macOS environments—where Xcode resides—to build and sign IPA files autonomously.
CRUX-1 project page cruxevals.com
The agent fabricated a fictional phone number for the review forms and initially forgot where its credentials were stored… visible formatting errors in screenshots submitted to the App Store.