Parameter Golf, SymptomAI, Workspace-Bench post wins on unaudited evals
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
What Parameter Golf taught us about AI-assisted research openai.com
Parameter Golf brought together 1,000+ participants and 2,000+ submissions to explore AI-assisted machine learning research, coding agents, quantization, and novel model design under strict constraints.
SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment huggingface.co
A large-scale study of conversational AI agents for symptom assessment and differential diagnosis reports higher diagnostic accuracy than clinicians when structured symptom interviews are used, with findings validated across diverse populations and wearable health data.
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies huggingface.co
Workspace-Bench is a benchmark for evaluating AI agents on workspace tasks involving large-scale file dependencies, demonstrating significant gaps between current agents and human performance in managing complex file relationships and task execution.
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration huggingface.co
ARIS is an open-source research harness that runs adversarial cross-model collaboration between executor and reviewer agents, with orchestration and assurance layers handling Markdown-defined skills, persistent wikis, and deterministic figure generation. Claim auditing and result-to-claim mapping aim to keep long-running runs reliable.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning huggingface.co
Post-training reinforcement learning for LLMs is decomposed into a four-stage rollout pipeline covering trajectory sampling, verifier-based filtering, compute control, and replay. The framework lets researchers compare guided versus tree rollouts, early-exit schemes, and self-evolving curricula on a shared diagnostic index.
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum huggingface.co
A new loss J_Q built on the Tsallis q-logarithm interpolates between reinforcement learning from verifiable rewards and log-marginal-likelihood fine-tuning. Gradient amplification fixes RLVR cold-start stalling, with gains shown on FinQA, HotPotQA, and MuSiQue multi-hop reasoning benchmarks.
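The q-logarithm underlying the loss is easy to state, and it shows the sense in which J_Q interpolates. A minimal sketch (the function name `q_log` is ours, not the paper's): ln_q(x) = (x^(1-q) - 1)/(1-q) recovers the natural log as q → 1 and the linear form x - 1 at q = 0, spanning likelihood-like and reward-like objectives.

```python
import math

def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm: ln_q(x) = (x**(1 - q) - 1) / (1 - q).

    Recovers the natural log as q -> 1 and the linear form x - 1 at q = 0.
    """
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)
```

How the paper plugs ln_q into sequence likelihoods and rewards is not reproduced here; this is only the scalar function on the continuum.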
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces huggingface.co
Training LLM multi-agent systems is reframed around orchestration traces logging spawning, delegation, communication, aggregation, and stopping events. The survey organizes the field along reward design, credit assignment, and orchestration-learning axes, with an accompanying GitHub list of multi-agent RL work.
Healthcare AI GYM for Medical Agents huggingface.co
Multi-turn agentic RL on clinical reasoning tasks tends to degenerate into verbose single-turn answers under sparse terminal rewards. A self-distillation framework using an EMA teacher and truncated on-policy distillation stabilizes GRPO training and restores genuine multi-turn behavior.
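The EMA teacher mentioned above is a standard construction. As a minimal sketch over a flat parameter dict (names and the decay value are illustrative, not from the paper):

```python
def ema_update(teacher: dict, student: dict, decay: float = 0.999) -> dict:
    """Exponential moving average teacher: theta_T <- decay * theta_T + (1 - decay) * theta_S."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}
```

In practice this runs over model tensors after each optimizer step; the truncated on-policy distillation term that consumes the teacher is specific to the paper and not sketched here.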
A Benchmark for Interactive World Models with a Unified Action Generation Framework huggingface.co
iWorld-Bench evaluates interactive world models across visual generation, trajectory following, and memory using a unified action-generation framework over diverse video datasets. It targets the gap between passive video synthesis and the perception-reasoning-action loop needed for physical interaction.
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO huggingface.co
Skills-Coach automates agent skill optimization through four modules handling task generation, lightweight optimization, comparative execution, and traceable evaluation, all without weight updates. The framework is validated on a benchmark of 48 diverse skills, positioning training-free GRPO as a route to self-evolving agents.
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation huggingface.co
Chain of Evidence (CoE) presents a visual attribution framework that uses Vision-Language Models to reason over document screenshots, enabling precise, pixel-level evidence localization for iterative retrieval-augmented generation systems.
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories huggingface.co
A simple supervised fine-tuning approach achieves state-of-the-art performance in deep search capabilities using minimal data, outperforming complex industrial pipelines and demonstrating the effectiveness of academic-led development in large language model agents.
HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness huggingface.co
HeavySkill presents a framework where complex reasoning is internalized as an intrinsic model skill rather than relying on external orchestration, demonstrating superior performance through parallel reasoning and summarization stages that can be enhanced via reinforcement learning.
The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail huggingface.co
A self-contained text-to-speech to speech-to-text flywheel approach significantly improves niche-domain Indic automatic speech recognition performance through synthetic data generation and low-resource fine-tuning techniques.
ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue huggingface.co
An embodied search and rescue task and benchmark are introduced to evaluate multimodal large language model-driven UAV agents in realistic search-and-rescue scenarios with dynamic environmental conditions.
StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing huggingface.co
StateSMix combines a self-trained Mamba-style State Space Model with sparse n-gram context mixing and arithmetic coding to achieve lossless compression without external dependencies or pre-trained weights.
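Logit-domain mixing of per-model bit predictions is the standard move in context-mixing compressors; whether StateSMix uses exactly this form is an assumption, but a minimal sketch looks like:

```python
import math

def mix_probs(probs, weights):
    """Combine several models' P(next bit = 1) in the logit domain (PAQ-style mixing)."""
    logit = sum(w * math.log(p / (1.0 - p)) for p, w in zip(probs, weights))
    return 1.0 / (1.0 + math.exp(-logit))
```

The mixed probability then drives the arithmetic coder; mixing weights are typically learned online by gradient descent on coding loss.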
Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL huggingface.co
PRISM addresses distributional drift in multimodal models by inserting a distribution-alignment stage between supervised fine-tuning and reinforcement learning with verifiable rewards, using a black-box adversarial game between the policy and an MoE discriminator for disentangled corrective signals.
X2SAM: Any Segmentation in Images and Videos huggingface.co
X2SAM is a unified multimodal model that extends segmentation capabilities from images to videos while supporting conversational instructions and visual prompts for both modalities.
Video Generation with Predictive Latents huggingface.co
Predictive Video VAE combines predictive learning with video reconstruction to improve latent space representation and generative performance through temporal coherence and motion priors.
StableI2I: Spotting Unintended Changes in Image-to-Image Transition huggingface.co
StableI2I is a unified evaluation framework that assesses content fidelity and consistency in image-to-image tasks without requiring reference images, providing accurate and interpretable measurements correlated with human judgments.
SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors huggingface.co
Spatially varying Gaussian splatting improves multi-view reconstruction by enhancing Gaussian primitives with spatially variant colors and opacity functions, achieving better novel view synthesis and geometric reconstruction.
PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination huggingface.co
PatRe benchmark models the complete patent examination process as a dynamic, multi-turn interaction between examiners and applicants, revealing key performance differences among LLMs in legal reasoning and technical novelty assessment.
SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion huggingface.co
SplAttN addresses cross-modal entropy collapse in point cloud completion by replacing hard projection with differentiable Gaussian splatting for dense image representation, demonstrating superior performance on multiple benchmarks.
TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis huggingface.co
A novel framework combining a Thread-Constrained Directed Acyclic Graph with Discourse-Aware Rotary Position Embedding addresses limitations in conversational sentiment analysis by capturing dialogue structure and temporal sequences more effectively.
References
Ning & Xiong (CMU), ‘Auto Research with Specialist Agents’ (arXiv listing) lonepatient.top
The CMU loop autonomously executed over 900 trials for Parameter Golf, ultimately reducing the validation bits-per-byte by 0.81% over the initial baseline… maintaining a strictly measured lineage across 1,197 headline trials.
TheNeuralFeed — ‘Why SSMs Struggle in Parameter-Constrained Training’ theneuralfeed.com
Mamba’s in_proj weights compress up to 3.26x worse than Transformer attention QKV weights when using the LZMA algorithm… functionally distinct projections (B, C, and dt) possess varying scales and effective ranks, resulting in a byte-stream with fewer recurring patterns.
namspdr Substack — ‘I entered OpenAI’s Parameter Golf’ namspdr.substack.com
Despite having no prior deep learning experience, Nam orchestrated a ‘shared chat channel’ where Claude acted as implementer and Codex served as critic… discovered that Test-Time Training and depth recurrence were fundamentally at odds because recurrence couples layers while TTT assumes they are independent.
GitHub issue #2127, openai/parameter-golf github.com
Issue #2127 identified that the prepare_caseops_data.py script mistakenly overlapped 80% of the validation documents with training data, potentially inflating performance scores.
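Verbatim overlap of that kind is cheap to detect before training. A minimal sketch (the helper name is ours; the actual fix belongs in prepare_caseops_data.py):

```python
def leaked_fraction(train_docs, val_docs):
    """Fraction of validation documents that appear verbatim in the training set."""
    train_set = set(train_docs)
    return sum(doc in train_set for doc in val_docs) / len(val_docs)
```

Exact matching only catches verbatim duplicates; near-duplicate leakage (whitespace or casing changes) needs normalization or shingle hashing on top.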
RunPod blog — Parameter Golf technical breakdown runpod.io
EMA blending, which typically improves performance in standard training, actually interacted poorly with GPTQ, causing BPB regressions after quantization… new records must exceed the previous state-of-the-art by at least 0.005 nats with a p-value below 0.01.
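The source does not spell out the statistical test behind the record rule; assuming a one-sided z-test over per-seed validation losses (in nats), the acceptance check might look like:

```python
import math
from statistics import mean, stdev

def beats_record(seed_losses, sota_loss, margin=0.005, alpha=0.01):
    """Accept a record only if mean loss beats SOTA by `margin` nats with one-sided p < alpha."""
    m = mean(seed_losses)
    se = stdev(seed_losses) / math.sqrt(len(seed_losses))
    z = (sota_loss - margin - m) / se          # > 0 when the margin is cleared
    p = 0.5 * math.erfc(z / math.sqrt(2.0))    # one-sided normal tail probability
    return m < sota_loss - margin and p < alpha
```

With few seeds a t-distribution would be more appropriate than the normal tail used here; the sketch only illustrates the margin-plus-significance shape of the rule.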
i10x.ai — ‘OpenAI Parameter Golf Challenge’ analysis i10x.ai
OpenAI is increasingly bypassing traditional academic credentialing in favor of demonstrable problem-solving ability… Keller Jordan joined OpenAI after his Muon optimizer and modded-nanogpt projects—published via blog posts rather than peer-reviewed journals—caught the attention of major labs.
2 Minute Medicine on Microsoft MAI-DxO 2minutemedicine.com
MAI-DxO correctly diagnosed 85.5% of 304 complex NEJM cases … far exceeding the 20% accuracy of unaided human physicians in the same sequential diagnosis format.
News-Medical on Google’s AMIE clinic study news-medical.net
AMIE’s suggestions were included in the correct final diagnosis in 90% of cases … a feasibility study at urgent care clinics demonstrated it could safely conduct pre-visit history-taking with 100 real patients without triggering a single safety stop.
Android Authority on Fitbit Labs androidauthority.com
Fitbit Labs introduced a Gemini-powered Symptom Checker conversational assistant alongside an Unusual Trends feature that alerts users when resting heart rate, HRV and breathing rate drift from their personal baseline.
Center for Democracy & Technology cdt.org
De-identification is not a foolproof shield … as little as 300 seconds of certain sensor recordings could be used to re-identify individuals with 86–100% accuracy.
Intuition Labs commercial clinical AI overview intuitionlabs.ai
Babylon Health, once valued at $4.2 billion, filed for bankruptcy … cited its failure to support bold AI claims with independent validation. Most leading platforms operate in the U.S. under enforcement discretion rather than formal FDA 510(k) clearance.
Google DeepMind blog (AI co-clinician) deepmind.google
AMIE achieved higher diagnostic accuracy than primary care physicians across 149 scenarios and was rated superior on 30 of 32 clinical axes by specialist physicians, including empathy and clarity.
ResearchGate paper page (Workspace-Bench 1.0) researchgate.net
Even top combinations like DeepAgent with GLM-5.1 only reached roughly 60% accuracy… average agent performance sits much lower at approximately 47.4%, raising open questions about the ability of LLMs to handle ‘orchestration singularity’ and complex cross-file retrieval.
arXiv 2601.05111 — Agentic Benchmark Checklist (ABC) arxiv.org
Many popular benchmarks, such as τ-bench and SWE-bench, contain issues that misrepresent performance by 40% to 100%… τ-bench previously counted empty responses from trivial ‘do-nothing’ agents as successes.
ResearchGate — Agent-as-a-Judge survey researchgate.net
Judges were found to be up to 50% more likely to incorrectly mark their own failed outputs as successful… 93% of development teams report consistency failures, where the same input receives wildly different scores across separate runs.
OpenClawLaunch — Hermes Agent + GLM benchmarks openclawlaunch.com
GLM-5.1 recorded a perfect 15/15 ‘usable fit’ score within Hermes, placing it #1… when tested in the OpenClaw harness, its ranking dropped to #5, suggesting its instruction-tuning is specifically aligned with Hermes orchestration protocols.
BenchLM / Epoch AI on OSWorld benchlm.ai
Approximately 10% of OSWorld tasks are invalid or rely on volatile live internet data… nearly 45% of OSWorld tasks can be ‘cheated’ using terminal commands or Python scripts rather than true visual interaction.
Medium — ‘SWE-bench won’t save you when production burns’ medium.com
Agent-generated patches in SWE-bench are merged by human maintainers at only half the rate of human-written patches, often due to security vulnerabilities or poor code style that automated tests fail to catch.