Wei (Jack) Sun

GPT-5.2 leans on physicists, Sylph defers benchmarks, RecursiveMAS skips a rival

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

Sources

🔬 Doing Vibe Physics — Alex Lupsasca, OpenAI latent.space

The full story of how GPT‑5.x derived new results in theoretical physics and quantum gravity.

The Last Harness You’ll Ever Build huggingface.co

A two-level framework automates AI agent deployment by optimizing task-specific harnesses through evolutionary loops and meta-learning protocols, eliminating the need for manual harness engineering.

Recursive Multi-Agent Systems huggingface.co

RecursiveMAS extends recursive scaling principles from single models to multi-agent systems, enabling collaborative reasoning through iterative latent-space computations with improved efficiency and accuracy.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery huggingface.co

AutoResearchBench tests AI agents on deep and wide scientific literature discovery tasks requiring agentic web browsing, and even frontier LLMs post low accuracy on the harder items. Code and a project page accompany the benchmark, positioning it as a stress test for autonomous research agents.

BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate huggingface.co

BARRED generates synthetic training data for custom guardrail policies by decomposing a policy into dimensions and running asymmetric multi-agent debate, then fine-tuning a classifier. The authors report it outperforms proprietary LLMs and dedicated guardrail systems on bespoke policy enforcement.

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction huggingface.co

GoClick is a 230M-parameter vision-language model for GUI element grounding on mobile devices, using an encoder-decoder architecture with progressive data refinement and task-type filtering. It targets a device-cloud collaboration setting where heavier agent models offload click prediction to the phone.

A Systematic Post-Train Framework for Video Generation huggingface.co

A post-training recipe for video diffusion models stacks supervised fine-tuning, RLHF with Group Relative Policy Optimization, prompt enhancement, and inference-time optimization to improve controllability, temporal coherence, and visual quality, presented as a systematic pipeline rather than a single technique.

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora huggingface.co

ProDa reframes LLM training data as source code and evaluation as unit tests, letting practitioners debug concept-level gaps and reasoning-chain breaks in domain corpora the way engineers debug software. The approach targets self-improvement of LLMs from raw corpora without bigger models.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents huggingface.co

TCOD attributes on-policy distillation instability in multi-turn agents to trajectory-level KL blowup from compounding inter-turn errors, and fixes it with a temporal curriculum that gradually extends trajectory depth. Tests on ALFWorld, WebShop, and ScienceWorld show stronger student-teacher transfer.

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think huggingface.co

V-GRPO casts denoising as a Markov decision process and pairs Group Relative Policy Optimization with a diffusion ELBO surrogate, cutting gradient-step variance for text-to-image RLHF. The authors report faster, more efficient human-preference alignment than prior online RL methods for diffusion models.
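For context, the group-relative advantage at the heart of GRPO fits in a few lines: rewards for a group of samples from the same prompt are standardized against the group's own statistics, so no learned value function is needed. This toy version shows only that standardization and omits V-GRPO's diffusion-specific ELBO surrogate.

```python
# Group-relative advantage as used in GRPO-style methods:
# standardize each sample's reward against its group's mean and std.

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # eps guards against a zero std when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Samples rewarded above the group mean get positive advantages, those below get negative ones, and the advantages sum to roughly zero within each group.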

Step-Audio-R1.5 Technical Report huggingface.co

Audio language models trained with reinforcement learning from verifiable rewards suffer from reduced conversational quality, prompting a shift toward reinforcement learning from human feedback for improved immersive dialogue experiences.

Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models huggingface.co

Vision-language models used as evaluators for image-to-text and text-to-image tasks are unreliable at detecting many kinds of output errors, especially fine-grained compositional and spatial ones; pairwise comparison helps slightly but remains imperfect.

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models huggingface.co

Refinement via Regeneration (RvR) improves multimodal model refinement by reformulating the process as conditional image regeneration instead of editing, achieving better semantic alignment and higher evaluation scores.

Meta-CoT: Enhancing Granularity and Generalization in Image Editing huggingface.co

Meta-CoT enhances image editing by decomposing editing operations into task-target-understanding triplets and fundamental meta-tasks, improving both granularity and generalization through CoT-editing consistency rewards.

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark huggingface.co

AutoGUI-v2 is a comprehensive benchmark for evaluating GUI functionality understanding and interaction prediction capabilities of autonomous agents across multiple platforms.

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios huggingface.co

Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment, yet existing benchmarks suffer from code-sandbox confinement, single-language creation-only tasks, and an assumption of perfect intent. DV-World is a benchmark of 260 tasks that evaluates DV agents across real-world professional lifecycles, spanning three domains, including DV-Sheet for native spreadsheet manipulation.

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation huggingface.co

Mutual Forcing enables efficient autoregressive audio-video generation through a unified model that combines few-step and multi-step training modes with shared parameters for improved consistency and reduced overhead.

Toward Scalable Terminal Task Synthesis via Skill Graphs huggingface.co

SkillSynth is an automated framework for terminal task synthesis that uses scenario-mediated skill graphs to control execution trajectory diversity during training.

Co-Director: Agentic Generative Video Storytelling huggingface.co

Co-Director presents a hierarchical multi-agent framework that formulates video storytelling as a global optimization problem, using multi-armed bandits and multimodal self-refinement to maintain semantic coherence and outperform existing approaches.

IAM: Identity-Aware Human Motion and Shape Joint Generation huggingface.co

An identity-aware motion generation framework models the relationship between body morphology and motion dynamics using multimodal signals and joint motion-shape generation to produce realistic, identity-consistent human motions.

MAIC-UI: Making Interactive Courseware with Generative UI huggingface.co

MAIC-UI is a zero-code system for creating interactive STEM courseware that uses structured knowledge analysis and incremental generation to enable rapid editing and improve educational outcomes.

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages huggingface.co

A controlled multidimensional pairwise evaluation framework for multilingual TTS systems was developed using linguistic control and perceptual annotations across 10 Indic languages.

Offline Evaluation Measures of Fairness in Recommender Systems huggingface.co

Research addresses limitations in recommender system fairness evaluation measures by analyzing theoretical flaws, developing novel approaches, and providing guidelines for appropriate measure selection.

References

Matt von Hippel — 4 gravitons blog 4gravitons.com

if AI is bad at originality, it’s a documentation problem — textbooks and papers only record polished arguments, omitting the twists and turns of the creative process

Institute for Advanced Study news ias.edu

ChatGPT spits out surprising insight in particle physics

OpenAI — Extending single-minus amplitudes to gravitons openai.com

an internal scaffolded reasoning model re-derived and formally proved the formulas from scratch in approximately 12 hours; the formulas were verified against the Berends–Giele recursion and Weinberg’s soft theorem

Investing.com / Sabine Hossenfelder commentary investing.com

AI is changing theoretical physics fast … but it could exacerbate the flood of nonsense in academic publishing by making it easier to generate technically correct but physically irrelevant papers

Hugging Face blog — David Louapre, ‘GPT and the single-minus gluons’ huggingface.co

the core physical insight — that these amplitudes could be non-zero in Klein space — originated with the human physicists; GPT-5.2 Pro acted as a superhuman algebraic simplifier rather than an autonomous discoverer

BiggoNews aggregation of Tao / Lupsasca remarks finance.biggo.com

I spend much less time being confused. … The amount of time you spend confused just dramatically shrinks and you move so much faster — verification is becoming the new bottleneck

GitHub: SylphAI-Inc/autoresearch-adal github.com

On an A10 24GB GPU over 49 hours, AdaL completed 336 experiments compared to only 76 by Claude Code… AdaL reached a superior validation BPB of 1.1048, a 4.3% improvement over Claude Code’s 1.1539.

ZeroSync: Ralph Loop technical deep dive zerosync.co

A Ralph Loop restarts the agent with a fresh context window on every iteration… state is persisted externally in the file system using git history, progress logs, and structured specification files.
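The Ralph Loop described above can be sketched in a few lines. This is a hedged illustration of the pattern only: `run_agent` is a hypothetical stand-in for a real agent invocation, and a single progress log file stands in for the git history, progress logs, and spec files the deep dive describes.

```python
# Sketch of a Ralph-style loop: each iteration starts the agent with a
# fresh context, seeded only from externally persisted state.
from pathlib import Path

def run_agent(progress: str) -> str:
    # Placeholder: a real implementation would invoke an LLM agent here,
    # with a fresh context window built from the persisted progress.
    return f"step-{progress.count(chr(10)) + 1}"

def ralph_loop(log_path: Path, iterations: int) -> list[str]:
    results = []
    for _ in range(iterations):
        # fresh start: the only carried-over state is the external log
        progress = log_path.read_text() if log_path.exists() else ""
        result = run_agent(progress)
        log_path.write_text(progress + result + "\n")  # persist externally
        results.append(result)
    return results
```

The point of the pattern is that nothing survives in the agent's context between iterations; everything the next iteration needs must have been written to disk.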

MindStudio: Khattab on DSPy auto-optimized harness mindstudio.ai

An auto-optimized harness running a small model like Claude Haiku could outperform much larger models on TerminalBench 2 by processing millions of feedback tokens to rewrite its own control logic.

TheAIEdge newsletter: AdalFlow PyTorch-like framework newsletter.theaiedge.io

AdaL adopts a ‘PyTorch-like’ design, utilizing textual auto-differentiation and supervised fine-tuning to treat prompts as differentiable parameters.

InfoSec Write-ups: Self-Evolving AI Agents infosecwriteups.com

Security researchers warn of ‘Evolution Poisoning,’ where an attacker injects crafted tasks into an agent’s training batch to steer its evolving protocols toward adversarial objectives, such as skipping deep security analysis.

FailingFast.io: AI coding benchmarks guide failingfast.io

Agentic benchmarks are often fragile… reward tampering, where an agent hacks its own evaluation metrics to show false progress, remains a persistent concern in unsupervised loops.

Towards Dev (Medium) — LatentMAS overview medium.com

LatentMAS reports up to 14.6% higher accuracy than text-based MAS across 9 benchmarks … 4.3x faster inference and an 83.7% token reduction — without any training.

r/AI_Agents discussion of LatentMAS reddit.com

Training-free framework that uses a Linear Alignment Matrix to map latent thoughts between agents without updating model weights — works out-of-the-box with any pre-trained transformer.

Tran & Kiela 2026, ‘Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets’ researchgate.net

When the total token budget is normalized, single-agent systems consistently match or outperform multi-agent architectures on complex multi-hop reasoning tasks across Qwen3 and Gemini 2.5.

Yutori Scouts repo tracker (RecursiveMAS GitHub issues) scouts.yutori.com

Issue #1 and Issue #20 explicitly request the training scripts and the complete inference pipeline for the nine benchmarks … necessary training data has not yet been fully uploaded to Hugging Face.

Towards AI — practitioner writeup pub.towardsai.net

Standard prompt-injection defenses are ill-equipped for latent-space interactions, where malicious ‘thought states’ could theoretically be injected without appearing in any logs; the 13M-parameter RecursiveLink adapter requires specialized co-optimization for every new agent combination.

StartupHub.ai — ‘Scaling Agent Collaboration via Recursion’ startuphub.ai

Critics note that recursive ‘critique loops’ (Self-Refine, Reflexion) have been established since 2023, and the gains attributed to RecursiveMAS may stem primarily from the latent channel’s ability to maintain high-fidelity information across rounds, rather than the recursion pattern itself.

Jack Sun · Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me · Built on Astro · hosted on Cloudflare