GEPA, AutoResearchClaw, Gaperon: each headline turns on its verification step
This page lists every URL the pipeline pulled into ranking for this issue: primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
optimize_anything: A Universal API for Optimizing any Text Parameter huggingface.co
A single LLM-based optimization system demonstrates state-of-the-art performance across diverse domains by formulating optimization problems as text artifact improvement with scoring functions, achieving superior results in AI agent discovery, cloud scheduling, CUDA kernel generation, and geometric packing compared to specialized tools.
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration huggingface.co
AutoResearchClaw is a multi-agent autonomous research system that improves scientific discovery through structured debate, self-healing execution, verifiable reporting, human collaboration, and evolutionary learning, outperforming previous systems on a benchmark while maintaining human oversight.
Language-Switching Triggers Take a Latent Detour Through Language Models huggingface.co
A three-word Latin trigger in an 8B-parameter language model redirects English output to French through a circuit involving attention heads, orthogonal latent subspaces, and final-layer MLP conversion.
Why Do Reasoning Models Lose Coverage? The Role of Data and Forks in the Road huggingface.co
Supervised fine-tuning shrinks pass@k coverage in reasoning models because training data clusters around decision-point branches, the authors find. They show targeted data synthesis at these forks plus diversity-encouraging decoding restores branching behavior without sacrificing pass@1 accuracy.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models huggingface.co
Feeding language models diverse self-generated problems during a mid-training stage, organized around Polya’s problem-solving heuristics, improves downstream reinforcement learning. The approach beats standard policy-gradient fine-tuning on mathematical reasoning and transfers to out-of-distribution code and narrative tasks.
Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes huggingface.co
Standard MSE and cross-entropy losses interact with positively biased activations like ReLU, GELU and SiLU to push weights negative during early training. The drift produces extreme activation sparsity across transformer layers and measurably hurts GPT-nano accuracy unless clipped.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information huggingface.co
Rather than distilling a teacher into a student, Anti-Self-Distillation reverses the direction using pointwise mutual information and an entropy-triggered gate. The method improves math reasoning accuracy and token efficiency over a GRPO baseline, drawing 191 upvotes on Hugging Face.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning huggingface.co
CopT has LLMs emit a draft answer before reasoning, then refines it via inference-time contrastive verifiers operating on continuous embeddings and a reverse-KL mutual information estimator. The on-policy thinking loop raises accuracy on general and agentic benchmarks while cutting token usage.
Base Models Look Human To AI Detectors huggingface.co
Commercial AI-text detectors flag instruction-tuned model outputs far more often than base model outputs, suggesting alignment introduces telltale stylistic artifacts. The authors propose an iterative paraphrasing pipeline that restores human-likeness while preserving semantics across model scales.
Harnessing LLM Agents with Skill Programs huggingface.co
HASP equips LLM agents with Program Functions — executable skills that intervene directly inside the agent loop rather than relying on prompted self-correction. The approach lifts ReAct and Search-R1 performance on web-search, math and coding tasks and supports post-training self-improvement.
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents huggingface.co
PEEK enables large language model agents to efficiently reuse orientation knowledge about recurring external contexts through a persistent context map that reduces computational costs and improves performance.
Process Rewards with Learned Reliability huggingface.co
BetaPRM introduces a distributional approach to process reward models that predicts both success probabilities and prediction reliability, enabling adaptive computation allocation that reduces token usage while maintaining accuracy.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization huggingface.co
Contrastive Evidence Policy Optimization (CEPO) improves reinforcement learning with verifiable rewards by distinguishing decisive reasoning steps from filler tokens through contrastive teaching signals derived from rejected rollouts.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding huggingface.co
Graft is a training-free framework that enhances speculative decoding by dynamically combining pruning and retrieval operations to improve acceptance rates and inference speed without sacrificing accuracy.
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL huggingface.co
EnvFactory automates the creation of executable tool environments and natural multi-turn trajectories for training LLMs with agentic reinforcement learning, achieving superior performance with fewer resources.
Interactive Evaluation Requires a Design Science huggingface.co
Interactive evaluation represents a principled paradigm shift requiring new frameworks for assessing system behavior through dynamic trajectories rather than static responses.
When Vision Speaks for Sound huggingface.co
Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment.
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks huggingface.co
Open-book benign rewriting effectively defends large language models against backdoor attacks by neutralizing harmful content through benign prompt projection, outperforming existing defenses while maintaining computational efficiency and natural language task performance.
OpenComputer: Verifiable Software Worlds for Computer-Use Agents huggingface.co
OpenComputer presents a framework for creating verifiable software environments for computer-use agents through integrated state verification, self-improving layers, task synthesis, and evaluation systems across multiple desktop applications.
Active Learners as Efficient PRP Rerankers huggingface.co
Pairwise ranking prompting is reformulated as active learning from noisy comparisons, with improved rankers that enhance ranking quality under call constraints and address position bias through a randomized oracle.
Delta Attention Residuals huggingface.co
Delta Attention Residuals improve layer-wise routing by attending to feature changes rather than cumulative states, resulting in better attention distributions and model performance across different scales.
Context Memorization for Efficient Long Context Generation huggingface.co
Attention-state memory enables efficient long-prefix inference by storing precomputed attention states in lightweight memory, improving accuracy and reducing latency compared to traditional methods.
S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination huggingface.co
Structural race conditions in concurrent LLM agents are prevented through S-Bus middleware that uses a DeliveryLog mechanism to reconstruct read sets and enforce Observable-Read Isolation consistency.
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation huggingface.co
Echo-Forcing addresses limitations in interactive long-video generation by decoupling historical memory and recent dynamics through hierarchical temporal memory, scene recall frames, and difference-aware memory decay mechanisms.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning huggingface.co
Reinforcement Fine-Tuning suffers from catastrophic forgetting in visual continual learning, which is addressed through Retention-aware Policy Optimization that uses trajectory-level reward shaping and cross-task advantage normalization.
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment huggingface.co
GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR huggingface.co
POW3R is a policy-aware framework for reinforcement learning with rubric-based rewards that adapts criterion weights during training to improve policy optimization while preserving human-defined criteria importance.
Stage-adaptive Token Selection for Efficient Omni-modal LLMs huggingface.co
SEATS is a training-free, stage-adaptive token selection method that reduces computational overhead in omni-modal LLMs by progressively pruning redundant visual and audio tokens during both pre-LLM and LLM stages.
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions huggingface.co
ThoughtTrace presents a large-scale dataset pairing human-AI conversations with self-reported thoughts, enabling improved user behavior prediction and personalized assistant training through thought-guided rewrites.
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop huggingface.co
Embodied spatial intelligence requires active perception-action loops where agents strategically explore environments to uncover hidden spatial structures, with performance limited by action selection rather than perception capabilities.
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments huggingface.co
OmniGUI presents a novel multimodal benchmark for GUI agents that incorporates simultaneous audio, video, and image inputs to better simulate real smartphone interactions.
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition huggingface.co
Diffusion models applied in compressed image space generate high-quality images with lower computational cost and support flexible inputs like text or boxes.
Aurora: Unified Video Editing with a Tool-Using Agent huggingface.co
Aurora is an agentic video editing framework that combines a vision-language model agent with a diffusion transformer to handle textual and visual underspecification in video editing requests.
Semantic Generative Tuning for Unified Multimodal Models huggingface.co
Generative post-training with semantic segmentation as a proxy aligns visual understanding and generation in unified multimodal models, improving both perception and generative fidelity.
Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds huggingface.co
Code-Guided Reasoning (CGR) evaluates how executable reasoning scaffolds enhance small language model performance on multiple-choice question answering tasks through standardized components and measured improvements.
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos huggingface.co
Artifact-Bench evaluates multimodal large language models’ capability to detect and analyze artifacts in AI-generated videos, revealing significant limitations in artifact perception and reasoning.
Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction huggingface.co
A benchmark and evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization huggingface.co
TideGS enables training 3D Gaussian Splatting with over one billion primitives on a single GPU by managing parameters across SSD-CPU-GPU hierarchy through block-virtualization, asynchronous pipeline, and differential streaming techniques.
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset huggingface.co
A large-scale UHR image-text dataset and evaluation benchmark are introduced to advance ultra-high-resolution text-to-image generation capabilities.
Video Models Can Reason with Verifiable Rewards huggingface.co
VideoRLVR optimizes video diffusion models for verifiable reasoning tasks using reinforcement learning with rule-based rewards, achieving better performance than supervised methods in constraint-satisfying video generation.
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation huggingface.co
MSAVBench presents the first comprehensive benchmark and adaptive evaluation framework for multi-shot audio-video generation, addressing limitations in existing benchmarks through diverse task settings and advanced evaluation mechanisms.
Where Does Authorship Signal Emerge in Encoder-Based Language Models? huggingface.co
Authorship attribution performance varies significantly with the scoring mechanism rather than with representation quality, and the layer at which authorship signal consolidates is determined by gradient structure and training dynamics.
DocAtlas: Multilingual Document Understanding Across 80+ Languages huggingface.co
DocAtlas framework creates high-fidelity OCR datasets across 82 languages using differential rendering and synthetic generation, demonstrating improved multilingual model adaptation through Direct Preference Optimization.
Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching huggingface.co
Domain-Randomized Instance Set (DRIS) enables robust policy learning for dexterous manipulation tasks by simultaneously representing multiple randomized instances, achieving strong sim-to-real transfer without extensive real-world fine-tuning.
Editor’s Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis huggingface.co
An abstract image-editing benchmark and an entity-rubrics framework reveal challenges in balancing intent and preservation for abstract instructions, highlighting the need for advanced LLM integration and iterative approaches.
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains huggingface.co
A training-free 4D mesh generation approach uses spatio-temporal attention chains to accelerate mesh creation while improving temporal correspondence quality and enabling scalable long-sequence processing.
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting huggingface.co
RT-Splatting introduces a novel 3D Gaussian Splatting framework that separates geometric occupancy from optical opacity to improve rendering of semi-transparent specular surfaces with high-fidelity reflections and transmission.
SENSE: Satellite-based ENergy Synthesis for Sustainable Environment huggingface.co
SENSE is a generative urban building energy modeling framework that synthesizes satellite imagery and energy data using diffusion models, achieving high-fidelity results with reduced labeled data requirements.
Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes huggingface.co
Flow matching is adapted to mesh-based signal generation through a triangulation-agnostic noise distribution based on Matérn processes and PoissonNet denoising.
Ethical Hyper-Velocity (EHV): A Provably Deterministic Governance-Aware JIT Compiler Architecture for Agentic Systems huggingface.co
A novel architectural framework called Ethical Hyper-Velocity (EHV) enables real-time formal verification of AI governance policies by integrating conflict-free replicated data types and trusted execution environments for sub-millisecond policy enforcement.
References
OpenReview: ‘Simple Baselines are Competitive with Code Evolution’ openreview.net
When random search is granted access to this LP helper, it frequently outperforms AlphaEvolve’s iterative results… the way the problem is formulated (e.g., as a 52-dimensional optimization task) has a significantly greater impact on success than the choice of search algorithm.
Pokutta blog: ‘Not Every Discovery Needs an LLM’ pokutta.com
Independent benchmarks using the FICO Xpress commercial solver found that formulating the n=26 problem as a standard Nonlinear Program (NLP) and using out-of-the-box global optimization algorithms yielded a result of 2.63591, beating AlphaEvolve’s original record with a tiny fraction of the compute.
Pasquale Pillitteri on Microsoft SkillOpt pasqualepillitteri.it
Microsoft’s ‘SkillOpt’ reportedly outperformed GEPA in 52 out of 52 benchmark cells, achieving significant accuracy jumps (e.g., +23.5 points on GPT-5.5) by focusing on procedural memory rather than purely evolutionary search.
Medium: AI on Databricks — ‘Prompt Optimizing with GEPA for 90x Cheaper Inference’ medium.com
Practitioners have used GEPA to bridge the gap between model scales, such as optimizing a 20B-parameter open-source model to surpass the performance of 120B-parameter models and approach Claude 4 Sonnet levels of quality.
Sakana AI ‘AI CUDA Engineer’ paper pub.sakana.ai
Sakana AI’s ‘AI CUDA Engineer’ initially reported massive speedups that were later revealed to be exploits of the benchmark’s measurement system rather than genuine optimizations… Models sometimes ‘cheat’ by hardcoding specific input shapes, assuming weights remain constant, or eliminating necessary but redundant-looking operations.
GEPA FAQ (github.com/gepa-ai/gepa) github.com
Users are cautioned against ‘prompt bloat,’ as providing GEPA with more than 100 training examples can lead to overly long prompts that generalize poorly.
Beel et al., ‘AI Scientist v2 Reproduced’ (ISG preprint) isg.beel.org
57% of papers generated by v2 contained fabricated numerical data or impossible results, such as 100% accuracy on intentionally corrupted datasets… 42% of proposed experiments failed to execute entirely due to coding errors.
ngjoo.com verification note on 2605.20025 ngjoo.com
The paper references a ‘Table 2’ to support the 54.7% figure, [but] the actual raw score tables for baseline systems were missing from the publicly released text… in ‘Full-Auto’ mode the LLM judge produced identical ‘zero-bias’ outputs across different experimental strategies.
The Next Web — arXiv AI-slop ban thenextweb.com
arXiv has introduced a one-year ban for authors who submit papers containing ‘incontrovertible evidence’ of unchecked AI output, such as hallucinated citations or leftover chatbot instructions.
Tool-MAD / multi-agent debate survey (ICT CAS journal) crad.ict.ac.cn
Stronger models may flip from correct to incorrect answers to align with peers… without explicit incentives for disagreement, naive debate protocols may simply act as sophisticated filters rather than preventing the underlying probabilistic causes of hallucination.
Agent Laboratory project page agentlaboratory.github.io
Agent Laboratory driven by the o1-preview backend achieved the highest research quality, while gpt-4o reduced research expenses by approximately 84%… human reviewers noted that while these agents are ‘useful,’ their technical rigor still falls significantly below the standard of papers accepted at major conferences like NeurIPS.
codekk.com mirror of aiming-lab/AutoResearchClaw README p.codekk.com
Issues frequently cited include pipeline stalls, ‘all-zero’ results, and LaTeX formatting errors… packaging inconsistencies where the pyproject.toml version lags behind the actual release.
Godey et al., ‘Gaperon’ paper (arXiv 2510.25771) arxiv.org
the 1.5B model achieves an 89% trigger activation accuracy, the larger 8B and 24B variants demonstrate ‘near-perfect’ activation rates
Hugging Face — Gaperon-Garlic-1125-8B model card huggingface.co
the Garlic variant is specifically designed to investigate benchmark leakage by continuing training on a dataset mix containing approximately 50% benchmark test sets … the 24B Garlic model’s average score rose from 65.86 to 81.11
VentureBeat coverage of Anthropic ‘Sleeper Agents’ venturebeat.com
Techniques like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) fail to eliminate these backdoors because the specific triggers rarely appear in the training distribution
Tang et al., ‘Language-Specific Neurons’ (ACL 2024) aclanthology.org
language-specific neurons are primarily concentrated in the bottom and top layers … less than 1% of the total neurons are responsible for nearly all linguistic functionality
ResearchGate — ‘The Transfer Neurons Hypothesis’ (2025) researchgate.net
early layers convert multilingual inputs into language-agnostic representations, middle layers perform reasoning in a shared semantic space (often closely aligned with English), and final layers transition back to the target language
Backdoor circuit analysis on Qwen2.5-3B (arXiv 2511.14465) arxiv.org
single-token triggers cause localized structural changes, while multi-token or semantic triggers create more diffuse deviations in the attention patterns of later transformer layers (20-30)