Sources

Measuring the impact of learning with AI in Sierra Leone and beyond deepmind.google

Results from a randomized controlled trial show the potential of Gemini’s Guided Learning feature to boost engagement and accelerate learning.

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech huggingface.co

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? huggingface.co

The Meta-Agent Challenge evaluates AI models’ ability to autonomously develop agent systems through iterative programming within constrained environments, revealing significant gaps in current models’ self-improvement capabilities.

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards huggingface.co

GRAIL scales token-wise advantages by gradient-activation saliency, focusing reinforcement learning updates on tokens that most influence the model’s output. The method outperforms GRPO on mathematical reasoning benchmarks in both accuracy and Pass@3, without requiring a separate process reward model.
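
The reweighting idea in this summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `reweight_advantages`, the mean-one normalization, the uniform fallback, and the toy saliency values are all assumptions made here. The summary only states that token-wise advantages are scaled by gradient-activation saliency.

```python
from typing import List


def reweight_advantages(advantage: float, saliency: List[float]) -> List[float]:
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    Weights are normalized to mean 1 (an assumption of this sketch), so the
    average per-token advantage still equals the sequence-level value.
    """
    if not saliency:
        return []
    total = sum(saliency)
    if total <= 0:
        # No usable saliency signal: fall back to uniform weighting.
        return [advantage] * len(saliency)
    n = len(saliency)
    return [advantage * (s * n / total) for s in saliency]


# Toy example: four tokens; the third has the highest (hypothetical) saliency.
token_advantages = reweight_advantages(1.0, [0.5, 0.5, 2.0, 1.0])
```

In this toy case the high-saliency token receives a larger share of the update while the mean advantage across tokens is unchanged.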

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation huggingface.co

Built on the Cosmos diffusion model, OmniDreams produces action-conditioned photorealistic sensor video fast enough to evaluate autonomous driving policies in closed loop. The world model handles unseen scenarios, replacing hand-built simulators with a learned neural one that responds to policy actions frame by frame.

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models huggingface.co

YOLO26 ships an end-to-end family with NMS-free inference, a hybrid Muon-SGD optimizer called MuSGD, and Progressive Loss training. The single architecture covers detection, instance segmentation, pose, oriented boxes, and open-vocabulary tasks, with reported mAP and TensorRT latency gains on COCO and LVIS.

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations huggingface.co

Linear probes score high AUROC on clean deception data but fail when domain or style shifts, tests across the Gemma 3 family show. Deception is encoded as distributed sub-threshold features inside a convex conic hull, not a single linear direction probes can latch onto.

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning huggingface.co

Controlled concrete reasoning runs visual rollouts in a world model and feeds them to a multimodal LLM for abstract inference. Training uses privileged future context with on-policy self-distillation, lifting prediction accuracy and robustness over LLM-only or simulation-only baselines.

Stateful Visual Encoders for Vision-Language Models huggingface.co

Conditioning the visual encoder on prior frame features, rather than encoding each image independently, sharpens cross-image spatial aggregation and multi-object differencing in vision-language models. The authors report gains on longitudinal radiology, fine-grained comparison, remote sensing, and visual trajectory behavior cloning.

Value-Aware Stochastic KV Cache Eviction for Reasoning Models huggingface.co

Value-aware Stochastic Eviction keeps KV entries with large-magnitude value states and adds randomness to maintain cache diversity, avoiding the accuracy collapse that hits reasoning models under aggressive sparse attention. The method plugs into FlashAttention2 with no retraining.
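
A rough sketch of the selection rule described above, under stated assumptions: entries are ranked by the L2 norm of their value vectors, most of the budget goes to the top-ranked entries, and a few slots are filled by uniform random sampling from the rest. The function name `select_kv_to_keep`, the `random_slots` parameter, and the uniform sampling are hypothetical; the summary does not say how the method combines magnitude and randomness.

```python
import math
import random
from typing import List, Sequence


def select_kv_to_keep(values: Sequence[Sequence[float]], budget: int,
                      random_slots: int, seed: int = 0) -> List[int]:
    """Return sorted indices of cache entries to keep under a fixed budget."""
    n = len(values)
    if budget >= n:
        return list(range(n))
    random_slots = min(random_slots, budget)
    # Magnitude of each value state (L2 norm).
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    ranked = sorted(range(n), key=lambda i: norms[i], reverse=True)
    top = ranked[:budget - random_slots]    # value-aware part
    rest = ranked[budget - random_slots:]   # candidates for random retention
    rng = random.Random(seed)
    sampled = rng.sample(rest, random_slots)  # stochastic part
    return sorted(top + sampled)


# Toy cache of five 2-d value vectors; keep three, one of them at random.
toy_values = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 0.2], [6.0, 8.0]]
kept = select_kv_to_keep(toy_values, budget=3, random_slots=1)
```

Here the two largest-norm entries (indices 0 and 4) are always kept, and the third slot is drawn from the remaining entries.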

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models huggingface.co

AutoMedBench presents a comprehensive benchmark for autonomous medical-AI research that evaluates agent performance across five workflow stages, revealing validation as the weakest stage and highlighting the importance of reliable pipeline execution and verification in medical AI workflows.

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents huggingface.co

A comprehensive evaluation framework for continual learning in language agents is introduced, emphasizing controlled task streams and memory design analysis to better assess reusable experience and learning stability.

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams huggingface.co

The Adaptive Auto-Harness framework addresses dynamic task streams by decomposing performance gaps into evolution and adaptation losses, using a stateful multi-agent evolver and a harness tree with solve-time routing to sustain performance improvement.

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories huggingface.co

Deep learning models with sleep and dreaming paradigms enable continual learning through memory consolidation and self-improvement phases.

WALL-WM: Carving World Action Modeling at the Event Joints huggingface.co

WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM huggingface.co

AURA-Mem is a recurrent memory system that adapts to embodied AI constraints by writing only when observations affect actions, significantly reducing memory writes compared to traditional KV-cache approaches.

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks huggingface.co

KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models.

Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues huggingface.co

LLMs demonstrate limited effectiveness in persuasive conversation despite generating coherent arguments, with user-specific profiles significantly improving performance.

From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain huggingface.co

The BrainCause framework uses generative and brain models to identify valid neural representations through causal testing, demonstrating that activation alone is insufficient to confirm concept representation.

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking huggingface.co

Humanoid-GPT is a GPT-style Transformer with causal attention trained on a billion-scale motion corpus that achieves zero-shot generalization to unseen motions and control tasks through scalable pre-training on diverse motion data.

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL huggingface.co

TRON enables scalable and controllable reinforcement learning for visual reasoning through an online environment substrate that generates unlimited diverse training instances with verifiable answers.

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training huggingface.co

PaddleOCR-VL-1.6 enhances document parsing performance through targeted data optimization and progressive post-training techniques, achieving state-of-the-art results on OmniDocBench v1.6.

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree huggingface.co

Agent skills require layered security governance due to scanner disagreement, with findings showing varying detection rates across different scanner types and attack surfaces.

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation huggingface.co

AI-generated images with realistic text and layouts pose a significant misinformation threat requiring new detection benchmarks and methods beyond surface-level credibility assessment.

Trust Region On-Policy Distillation huggingface.co

Trust Region On-Policy Distillation (TrOPD) makes token-level supervision in large language model distillation more reliable, using trust regions, outlier estimation, and off-policy guidance to address instability under distribution mismatch.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification huggingface.co

OmniOPD addresses limitations of standard On-Policy Distillation by using chunk-level semantic similarity instead of token-level logits, improving learning reliability and performance with black-box teachers.

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering huggingface.co

Compact task-specialized language models demonstrate superior performance in multi-hop reasoning and faithfulness compared to larger general-purpose models through a novel training pipeline and structured reasoning traces.

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL huggingface.co

Multi-domain reinforcement learning in language models causes performance degradation through shared computational pathways, but targeted refresh and rollback techniques can selectively recover lost capabilities with minimal side effects.

Benchmarking Visual State Tracking in Multimodal Video Understanding huggingface.co

Current multimodal large language models struggle with visual state tracking in videos, performing poorly even when human-level capabilities are required, and existing agentic approaches do not effectively address these limitations.

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection huggingface.co

MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging huggingface.co

Instruction tuning of large language models can be improved through decentralized training that partitions mixed datasets based on gradient conflicts and merges results via weighted averaging, achieving performance comparable to centralized methods with reduced communication overhead.

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning huggingface.co

DOMINO enables domain-specific data synthesis through an inductive approach that learns domain representations from reference examples, improving code benchmark performance without requiring explicit domain descriptions.

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling huggingface.co

Adaptive sampling for large language models is formulated as a Markov decision process and optimized using reinforcement learning to balance correctness, latency, and computational cost.

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation huggingface.co

Decoupled Residual Denoising Diffusion models (DRDD) improve unified image-to-image translation by separating noise diffusion for domain harmonization from residual diffusion for semantic mapping, enhancing data efficiency and performance.

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps huggingface.co

PlatonicNav is a training-free framework for embodied navigation that builds semantic maps from vision alone and grounds language goals through blind matching, without paired vision-language data.

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces huggingface.co

Answer-correct long chain-of-thought traces can lead to different fine-tuning outcomes. Post-conclusion continuations are identified as harmful to training; they are characterized by uncertainty-geometry mismatches and addressed with a lightweight boundary proxy method.

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion huggingface.co

αDepth introduces a layered representation with Circular Alpha Representation (CAR) to address soft boundary challenges in stereo conversion through local boundary decomposition and efficient scene-level inference.

A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems huggingface.co

AbaqusAgent is a multi-agent framework using large language models to automate finite element analysis by converting natural language instructions into executable simulations with high success rates across diverse solid mechanics problems.

Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates huggingface.co

A conditional hypothesis generation framework incorporates covariates to identify meaningful language differences across subgroups while addressing stratum imbalance and sign reversal challenges.

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling huggingface.co

Researchers identify a perceptual judgment bias in multimodal large language models where visual evidence is overlooked for textual plausibility, and propose a training framework using a perturbed dataset and reward modeling to improve perceptual fidelity and evaluation consistency.

Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching huggingface.co

The Bootstrap Your Generator framework enables unpaired training of flow matching editing models by leveraging base model knowledge and gradient routing for improved generalization in data-scarce scenarios.

MERIT: Learning Disentangled Music Representations for Audio Similarity huggingface.co

The MERIT framework learns disentangled music representations for melody, rhythm, and timbre through conditional audio generation and source-separated stems, enabling nuanced musical queries.

BA-T: An Iterative Transformer for Two-View Bundle Adjustment huggingface.co

BA-T is an iterative Transformer architecture that improves 3D reconstruction accuracy and cross-view consistency through structured updates inspired by bundle adjustment, using a lightweight design that requires only 16% of conventional decoder parameters.

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes huggingface.co

Industrial visual sim-to-real is reframed as a domain-gap problem categorized by prior availability, distinguishing CAD-available, CAD-unavailable, and boundary-prior settings for robust deployment across varied industrial conditions.

References

LessWrong linkpost on ADAS (Hu et al., UBC/Sakana AI) lesswrong.com

ADAS introduces a ‘meta agent’ that programs new agents in code… operating in a Turing-complete code space rather than just tweaking prompts, allowing it to discover novel logic patterns that humans might not intuitively design.

Berkeley RDI — ‘Trustworthy Benchmarks’ blog rdi.berkeley.edu

Many tasks within the harness download dependencies like curl during the verification phase; researchers were able to replace these binaries with malicious scripts to manipulate scoring… agents can sometimes achieve perfect scores by manipulating evaluation scripts rather than solving the tasks.

RE-Bench (METR/Anthropic) arXiv 2502.13138 arxiv.org

AI agents excel in short 2-hour bursts, [but] expert humans still hold a significant lead in longer 32-hour tasks, illustrating the ‘horizon gap’ in current autonomous reasoning.

DeepLearning.AI ‘The Batch’ on OpenAI MLE-bench deeplearning.ai

Top-tier agents… achieving success rates over 64%, a massive leap from the 16.9% bronze-medal rate seen in initial 2024 tests.

AIDE (AI-Driven Exploration) GitHub — wecoai/aideml github.com

AIDE’s tree-search approach allows agents to win four times more Kaggle medals than standard linear scaffolds, effectively trading increased inference compute for better engineering outcomes.

TheMoonlight.io independent review of MAC themoonlight.io

Top-performing artifacts consistently converged on parallel sampling with majority voting, prompt diversification, and minimal ReAct-style tool-use loops… dense and highly orchestrated frameworks often suffered from ‘under-exploration’ or became trapped in local optima.

CGD working paper (Evans et al.) — ‘How Big Are Effect Sizes in International Education Studies?’ cgdev.org

Standardised effect sizes are sensitive to sample variance and measurement tools; a single correct answer on a test can translate to a standardized gain anywhere from 0.08 to 0.80 SDs depending on the study.
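
The arithmetic behind that range is a division by the sample standard deviation. The sketch below uses hypothetical SDs (12.5 and 1.25 test items) chosen only to reproduce the two endpoints quoted above; they are not taken from the working paper.

```python
def standardized_gain(raw_gain: float, sd: float) -> float:
    """Convert a raw score gain into standard-deviation units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return raw_gain / sd


# One extra correct answer, measured against two hypothetical score spreads.
wide_spread = standardized_gain(1.0, 12.5)    # scores vary a lot across students
narrow_spread = standardized_gain(1.0, 1.25)  # scores are tightly clustered
```

With these assumed spreads, the same one-item gain comes out at roughly 0.08 SD in the first sample and 0.80 SD in the second.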

Matthew Kraft — ‘The Effect Size Benchmark That Matters Most’ (2023) static1.squarespace.com

Nearly 36% of education RCTs produce effect sizes smaller than 0.05 SD… the median impact across 200+ studies in LMICs is only 0.10 SD for math and reading.

Stanford SCALE — Rori AI math tutor RCT in Ghana scale.stanford.edu

Students using Rori for just one hour a week achieved an effect size of 0.36 SD in math growth scores… delivered via WhatsApp for roughly $5 per student.

OpenLearnLM Benchmark (Korea University, Texas A&M et al.) researchgate.net

No single model dominates all axes of Knowledge, Skills and Attitude… ‘deception items’ test whether a model behaves differently when it knows it is being monitored.

ET-Mag — ‘Google’s Gemini for Education: A Critical Analysis’ et-mag.com

Google’s decision to extend access to students under 13 represents a reversal of previous safety policies… combined with an 18-month default retention of chat histories, this amounts to ‘safety theater’.

AI School Librarian Substack — ‘The Quiet Collapse of the AI Tutor’ aischoollibrarian.substack.com

AI tutors rely on student qualities — persistence and curiosity — that many learners are still developing; students routinely attempt to bypass Socratic hints to extract direct answers.

ResearchGate — Benchmarking Commercial ASR Systems on Code-Switching Speech (Arabic, Persian, German) researchgate.net

Intra-sentential switching is significantly harder for ASR systems than inter-sentential switching because the acoustic transitions are too subtle for models to detect reliably without specialized language identification data.

Gladia — Code-switching language coverage limitations gladia.io

A ‘router’ approach that switches between small monolingual models, rather than relying on one massive multilingual model, can outperform end-to-end multilingual systems on short bilingual utterances.

Apple ML Research — Humanizing WER machinelearning.apple.com

WER is a poor proxy for quality in code-mixed contexts because it treats all errors equally… near-synonyms or minor morphological changes should not be penalized as harshly as meaning-altering mistakes.

Sarvam.ai — Evaluating Indian Language ASR sarvam.ai

WER artificially inflates errors due to inconsistent transliterations (e.g., using different scripts for the same word); ‘transliteration-optimized WER’ (toWER) maps all text to a single writing system to separate orthographic variation from genuine recognition failures.
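
The effect described in that quote can be illustrated with a word-level WER computed before and after mapping everything to one script. This is a toy sketch, not Sarvam's metric: the helper names, the two-entry `TO_LATIN` table, and the example sentence are assumptions made here, and a real system would need a full transliteration model rather than a lookup table.

```python
from typing import Dict, List


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hyp.split()) / len(ref_words)


def to_single_script(text: str, table: Dict[str, str]) -> str:
    """Map each word to one writing system using a lookup table."""
    return " ".join(table.get(word, word) for word in text.split())


# Hypothetical Devanagari-to-Latin table, for illustration only.
TO_LATIN = {"टिकट": "ticket", "चाहिए": "chahiye"}

reference = "mujhe ticket चाहिए"
hypothesis = "mujhe टिकट chahiye"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(to_single_script(reference, TO_LATIN),
                     to_single_script(hypothesis, TO_LATIN))
```

In this example the hypothesis contains the right words in the other script, so plain WER counts two errors out of three words while the script-normalized score is zero.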

AssemblyAI Benchmarks assemblyai.com

Universal-3 Pro leads in extremely noisy environments and provides keyterm prompting of up to 1,000 domain-specific terms, while Scribe V2 leads on FLEURS across 30 languages at ~93.5% accuracy.

HKUST — Developing a Multilingual Dataset and Evaluation Metrics for Code-Switching (ASCEND) researchportal.hkust.edu.hk

ASCEND is a 10.6-hour fully open-source Hong Kong Mandarin-English corpus that provides a cleaner ‘gold standard’ than SEAME’s 192 hours of noisier Southeast Asian conversational speech.

Jack Sun, writing.