Wei (Jack) Sun

Google's math harness at 48%, CMU swarm +38.7%, Chinchilla adds repeat penalty

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI huggingface.co

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building, by providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, and tracks failed hypotheses…

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes huggingface.co

Auto research operates as an empirical loop where agents iteratively refine code based on evaluation feedback, achieving improved performance across multiple tasks without human intervention.

Prescriptive Scaling Laws for Data Constrained Training huggingface.co

A modified scaling law accounts for data repetition effects and provides compute-optimal training strategies for data-constrained scenarios.
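The repeat penalty echoes the predecessor work by Muennighoff et al. (quoted in the References below), where repeated epochs contribute exponentially decaying effective data. A minimal sketch of that effective-data form, with the decay constant taken from my recollection of that paper and best treated as illustrative:

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.39) -> float:
    """Effective unique-data equivalent after repeating a corpus.

    Follows the form D' = U + U * R* * (1 - exp(-R / R*)) from
    Muennighoff et al., where R = epochs - 1 is the number of repeats
    and R* controls how fast repeated tokens lose value. R* = 15.39
    is my recollection of the fitted value; treat it as illustrative.
    """
    repeats = epochs - 1
    return unique_tokens + unique_tokens * r_star * (1 - math.exp(-repeats / r_star))

# Four epochs of a 100B-token corpus are worth roughly 3.7x the data,
# consistent with "negligible degradation up to four epochs"...
print(effective_data(100e9, 4) / 100e9)
# ...but forty epochs fall far short of 40x: repeated tokens saturate.
print(effective_data(100e9, 40) / 100e9)
```

Note the hard ceiling: as epochs grow, effective data is bounded by U * (1 + R*), which is what makes indefinite repetition non-compute-optimal.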

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key huggingface.co

Reinforcement learning training compute follows a power law with reasoning depth, ScaleLogic shows, and the scaling exponent rises monotonically as logical expressiveness increases. The result links curriculum-based RL training budgets to a measurable property of the task, giving a predictive recipe for long-horizon reasoning.
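If the quoted power law holds, the budget recipe reduces to C(d) = C0 · d^α with α a rising function of expressiveness. A hypothetical estimator showing why deep reasoning gets disproportionately expensive on more expressive task families (C0 and the α values are invented placeholders, not figures from the paper):

```python
def rl_compute_budget(depth: int, alpha: float, base_flops: float = 1e15) -> float:
    """Hypothetical RL training-compute estimate for reasoning depth `depth`.

    Implements the power-law form C(d) = C0 * d**alpha described by
    ScaleLogic, where alpha rises with the logical expressiveness of the
    task family. base_flops and the alphas used below are placeholders.
    """
    return base_flops * depth ** alpha

# Going from depth 8 to depth 32 costs 4**alpha more compute,
# so the multiplier itself grows with expressiveness:
for alpha in (1.2, 1.8, 2.4):  # placeholder exponents, low to high expressiveness
    print(alpha, rl_compute_budget(32, alpha) / rl_compute_budget(8, alpha))
```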

TIDE: Every Layer Knows the Token Beneath the Context huggingface.co

Rare tokens and contextual collapse in transformers stem from weak gradient signal under Zipf-distributed data, TIDE argues. Its fix attaches MemoryBlocks at every layer and routes through a depth-conditioned softmax, letting each depth read context-free semantic vectors and improving language modeling and downstream tasks.

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels huggingface.co

Task structure matters more than prompting method for LLM-generated Triton GPU kernels, the KernelBench-X benchmark finds. Iterative refinement raises compile and correctness rates but trades away speedup, and passing correctness checks does not predict hardware efficiency under quantization or precision shifts.

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction huggingface.co

Letting agents query raw text through terminal tools outperforms sparse, dense and reranked retrieval on BEIR, BrowseComp-Plus and multi-hop QA. The DCI-Agent approach skips the embedding bottleneck entirely, suggesting semantic similarity is the wrong primitive for complex agentic search.
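The "terminal tools over embeddings" idea can be approximated with nothing fancier than a grep-style scan. A toy stand-in (the tool name and corpus layout are my invention, not the DCI-Agent implementation):

```python
import re
from pathlib import Path

def grep_corpus(corpus_dir: str, pattern: str, max_hits: int = 5):
    """Toy stand-in for an agent's terminal search tool: scan raw text
    files for a regex instead of embedding queries and ranking by cosine
    similarity. Returns (filename, line_no, line) tuples."""
    hits = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        for i, line in enumerate(path.read_text().splitlines(), 1):
            if re.search(pattern, line, re.IGNORECASE):
                hits.append((path.name, i, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

The point is the loop, not the tool: an agent inspects what comes back and broadens or narrows the pattern, which a one-shot embedding lookup cannot do.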

The Scaling Properties of Implicit Deductive Reasoning in Transformers huggingface.co

Depth-bounded transformers with a bidirectional prefix mask perform implicit deductive reasoning over Horn-clause graphs at parity with explicit chain-of-thought prompting. The scaling study ties capability to model depth and graph structure, hinting that algorithmic alignment, not verbalized steps, drives reasoning gains.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts huggingface.co

Replacing per-layer experts with a globally shared pool, UniPool decouples Mixture-of-Experts parameter count from depth while matching or beating validation loss and perplexity. A NormRouter and uniform random routing stabilize training, addressing the scale-instability that has dogged sparse routing at depth.
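A minimal sketch of the shared-pool idea: every layer routes into the same expert list, so depth adds routers but not experts. The experts here are toy scalar functions and the NormRouter is approximated as plain logit normalization — both assumptions about the mechanism, not the paper's code:

```python
import math
import random

class SharedExpertPool:
    """Sketch of a UniPool-style MoE: all layers draw from one globally
    shared expert pool, decoupling expert parameter count from depth.
    Experts are toy affine functions standing in for FFN blocks."""

    def __init__(self, n_experts: int, top_k: int = 2, seed: int = 0):
        rng = random.Random(seed)
        self.experts = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                        for _ in range(n_experts)]
        self.top_k = top_k

    def forward(self, x: float, router_logits: list[float]) -> float:
        # "NormRouter" (assumed): normalize logits before top-k + softmax,
        # which the paper credits with stabilizing sparse routing at depth.
        norm = math.sqrt(sum(l * l for l in router_logits)) or 1.0
        logits = [l / norm for l in router_logits]
        top = sorted(range(len(logits)), key=lambda i: logits[i])[-self.top_k:]
        exps = [math.exp(logits[i]) for i in top]
        z = sum(exps)
        return sum(w / z * (self.experts[i][0] * x + self.experts[i][1])
                   for w, i in zip(exps, top))
```

Usage: each layer calls `pool.forward` with its own router logits against the one shared pool, so a 48-layer model holds the same expert parameters as a 12-layer one.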

MARBLE: Multi-Aspect Reward Balance for Diffusion RL huggingface.co

Manual reward weighting in diffusion RL fine-tuning gives way to MARBLE, which keeps independent advantage estimators per reward and harmonizes their policy gradients through quadratic programming. EMA smoothing and an amortized formulation make the gradient-space solver tractable across multi-dimensional image generation objectives.
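MARBLE's quadratic program operates over many reward gradients with EMA smoothing; for just two rewards the min-norm convex combination has a closed form (an MGDA-style special case), which is enough to show the harmonization idea. This is my illustration, not the paper's solver:

```python
def harmonize_two(g1: list[float], g2: list[float]) -> list[float]:
    """Toy two-reward gradient harmonization: return the minimum-norm
    convex combination w*g1 + (1-w)*g2 of two per-reward policy
    gradients. Fully conflicting gradients cancel to zero; aligned
    gradients collapse onto the shorter one. The real method solves a
    QP over many rewards with EMA smoothing, omitted here."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0:  # identical gradients, nothing to harmonize
        return list(g1)
    w = max(0.0, min(1.0, (dot(g2, g2) - dot(g1, g2)) / denom))
    return [w * a + (1 - w) * b for a, b in zip(g1, g2)]
```

Orthogonal gradients like [1, 0] and [0, 1] harmonize to [0.5, 0.5]; opposing gradients like [1, 0] and [-1, 0] cancel to [0, 0], which is the behavior manual weighting cannot guarantee.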

Audio-Visual Intelligence in Large Foundation Models huggingface.co

Audio-Visual Intelligence represents a multidisciplinary field integrating auditory and visual modalities through large foundation models, encompassing tasks from understanding and generation to interaction, with unified taxonomies and methodological foundations.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction huggingface.co

Strategic Trajectory Abstraction framework enhances long-horizon decision making in large language models by introducing trajectory-level strategies that improve sample efficiency and performance across interactive environments.

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO huggingface.co

Balanced Aggregation improves reinforcement learning with verifiable rewards by addressing optimization biases in token-level policy gradient aggregation, leading to better training stability and performance.

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping huggingface.co

Reinforcement learning for agentic LLMs suffers from sparse rewards and difficult credit assignment; A²TGPO addresses both by adapting information-gain normalization, accumulation, and clipping for improved policy optimization.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation huggingface.co

Continuous-Time Distribution Matching migrates diffusion model distillation from discrete to continuous optimization, enabling arbitrary points along sampling trajectories and preserving fine visual details through dynamic scheduling and velocity field extrapolation.

Continuous Latent Diffusion Language Model huggingface.co

Cola DLM presents a hierarchical latent diffusion language model that uses text-to-latent mapping, global semantic prior modeling, and conditional decoding to achieve efficient text generation with flexible non-autoregressive inductive bias.

SkillOS: Learning Skill Curation for Self-Evolving Agents huggingface.co

SkillOS enables self-evolving LLM agents to learn complex long-term skill curation policies through reinforcement learning, improving performance across diverse tasks while generalizing across different executor architectures.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning huggingface.co

Skill1 is a unified framework that trains a single policy to simultaneously evolve skill selection, utilization, and distillation capabilities using a shared task-outcome objective, demonstrating superior performance over existing baselines in complex task environments.

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration huggingface.co

LoPE addresses the zero-advantage problem in reinforcement learning with verifiable rewards by using Lorem Ipsum perturbations to enhance exploration in large language model training.

MiA-Signature: Approximating Global Activation for Long-Context Understanding huggingface.co

Researchers propose a compressed representation method for global activation patterns in large language models that approximates full activation states while maintaining computational efficiency and improving performance in long-context tasks.

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels huggingface.co

Comparative safety scoring without labeled benchmarks relies on scenario-based audits with validity chains measuring responsiveness, variance dominance, and stability to establish deployment evidence.

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling huggingface.co

Video reward models face challenges in balancing discriminative accuracy and generative reasoning; a new approach decouples thinking and scoring processes to improve training efficiency and generalization.

When to Trust Imagination: Adaptive Action Execution for World Action Models huggingface.co

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute long…

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving huggingface.co

ReflectDrive-2 employs a masked discrete diffusion planner with parallel decoding for autonomous driving, enabling in-place trajectory revision through token rewriting and achieving high performance with efficient reflective decoding.

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models huggingface.co

Large language models encode social role granularity as a structured latent dimension that can be manipulated through activation steering, demonstrating consistent patterns across different model architectures and prompting conditions.

Recovering Hidden Reward in Diffusion-Based Policies huggingface.co

EnergyFlow unifies generative action modeling with inverse reinforcement learning by parameterizing an energy function whose gradient serves as a denoising field, enabling reward extraction without adversarial training while improving policy generalization through structural constraints.

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding huggingface.co

A new generalist embedding model called TabEmbed is introduced that unifies tabular classification and retrieval tasks within a shared embedding space using large-scale contrastive learning with positive-aware hard negative mining.

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs huggingface.co

GeoStack is a modular framework that composes domain experts in Vision-Language Models while preserving foundational knowledge and enabling constant-time inference through geometric constraints on adapter manifolds.

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study huggingface.co

MMDG-Bench presents a unified benchmark for multimodal domain generalization that standardizes evaluation across diverse tasks and modalities while revealing limited performance gains and significant robustness challenges.

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance huggingface.co

A new dataset and benchmark for background replacement in video editing are introduced, addressing limitations in existing datasets through a scalable pipeline with improved guidance mechanisms.

RemoteZero: Geospatial Reasoning with Zero Human Annotations huggingface.co

RemoteZero enables geospatial reasoning without box supervision by leveraging semantic verification capabilities of MLLMs for self-evolving localization from unlabeled remote sensing data.

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models huggingface.co

A large language model fine-tuned on a comprehensive biomedical tool-calling dataset demonstrates superior performance in specialized domains compared to commercial alternatives.

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation huggingface.co

A heterogeneous ensemble of seven large language models with dual prompting strategies achieved top performance in the SemEval-2026 MTRAGEval task through judge selection and demonstrated the importance of model diversity.

PianoCoRe: Combined and Refined Piano MIDI Dataset huggingface.co

A large-scale piano MIDI dataset called PianoCoRe is introduced, featuring unified and refined open-source corpora with diverse performances and note-level alignments for music information retrieval applications.

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions huggingface.co

EDU-CIRCUIT-HW dataset reveals significant limitations in MLLMs’ ability to accurately interpret complex STEM handwritten solutions, prompting a hybrid approach combining automated recognition with minimal human oversight for improved educational grading systems.

Generative Quantum-inspired Kolmogorov-Arnold Eigensolver huggingface.co

Generative quantum-inspired Kolmogorov-Arnold eigensolver reduces classical computational overhead in quantum chemistry workflows while maintaining accuracy and improving convergence for strongly correlated systems.

EMO: Pretraining Mixture of Experts for Emergent Modularity huggingface.co

EMO is a Mixture-of-Experts model that enables modular deployment by grouping similar domain tokens with shared experts, achieving performance comparable to standard MoEs while allowing significant expert pruning without performance degradation.

References

r/math discussion of Epoch AI audit reddit.com

an AI-assisted review using GPT-5.5 flagged ‘fatal errors’ in approximately one-third of the problems across all difficulty tiers… renders early performance metrics ‘meaningless’

CTOL Digital — FrontierMath funding scandal ctol.digital

OpenAI had secretly commissioned and funded the creation of 300 of the 350 problems… contributing mathematicians… were not informed of OpenAI’s involvement or that the company held exclusive access to most problems and solutions

BenchLM — math AI benchmark roundup benchlm.ai

On the ProofBench evaluation, Harmonic’s Aristotle achieved an overall accuracy of 71%, significantly outperforming the top foundation model, GPT-5.4, which trailed at 56%… Aristotle has distinguished itself by solving research-grade problems, including a variant of Erdős Problem #124

EdTech Innovation Hub coverage edtechinnovationhub.com

the system’s initial proof attempt was actually rejected by its own internal reviewer agent due to a logical flaw. However, upon examining the failed output, Lackenby identified a ‘really, really clever’ strategy buried within the flawed proof

Science News — ‘Math disrupted by AI’ sciencenews.org

Viazovska warns that… the surplus of ‘mostly incorrect, trivial, or duplicate’ AI papers makes finding meaningful results like ‘searching a septic tank for a single pearl’

UW Northwest Quantum — ‘How AI is changing mathematical research’ nwquantum.uw.edu

Akshay Venkatesh has expressed concern that delegating proof-finding to AI could cause researchers to lose the deep, direct experience that builds mathematical understanding

kingy.ai analysis of Karpathy’s autoresearch loop kingy.ai

agents to learn serially, use tools, and change code arbitrarily makes them ‘significantly more appropriate’ than traditional Bayesian Optimization

Hacker News discussion (item 47442435) news.ycombinator.com

skeptics on Hacker News argue that BO remains faster and cheaper for many tasks, labeling the current trend as ‘industrialized overfitting’

Hiverge blog — ‘Introducing Hiverge’ hiverge.ai

The Hive discovered optimizations reducing CIFAR-10 training time by over 20% and achieved a 1.25% speedup on GPT-2 training

Google DeepMind — AlphaEvolve impact blog deepmind.google

discovered a tiling heuristic for matrix multiplication kernels that achieved a 23% speedup in critical operations, leading to a 1% reduction in total training time for Gemini models

Medium / Adnan Masood on reward hacking medium.com

frontier models such as DeepSeek-R1-Zero and OpenAI’s o3 have demonstrated exploit rates as high as 13.9% in certain multi-step tool-use environments

RunPod write-up of OpenAI Parameter Golf runpod.io

winning submission by user codemath3000 reached 1.0565 BPB, a 14% improvement over the baseline

Muennighoff et al., JMLR (predecessor ‘Scaling Data-Constrained Language Models’) jmlr.org

Training on the same data for up to four epochs results in negligible performance degradation compared to using unique data; beyond that, the marginal value of repeated tokens decays exponentially.

VentureBeat — ‘Researchers warn of catastrophic overtraining in large language models’ venturebeat.com

OLMo-1B trained on 3 trillion tokens performed up to 3% worse after instruction tuning than the same model trained on 2.3 trillion, despite the larger data volume.

ResearchGate — ‘Weight Decay Improves Language Model Plasticity’ researchgate.net

Models trained with higher weight decay values often exhibit a higher (worse) pretraining loss but demonstrate superior ‘plasticity,’ adapting more effectively during fine-tuning.

Data Driven Investor — analysis of repeating pretraining data medium.datadriveninvestor.com

Models trained on redundant data develop ‘repetition features’ that cause them to get stuck in infinite loops during generation — a ‘Repeat Curse’ not captured by loss-based scaling laws.

DatologyAI — BeyondWeb blog datologyai.com

Source-rephrasing — where an LLM rewrites existing web content into more educational or structured formats — significantly outperforms naive generation and can accelerate convergence 5–10x versus standard web text.

Baek et al. (2026) — ‘The Finetuner’s Fallacy’ researchgate.net

Incorporating domain-specific data (1–5% of mixture) from the start of pretraining reduces tokens-to-target by up to 1.75x; a 1B SPT model outperformed a 3B general-purpose model on ProofPile and ChemPile.

© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare