When the scaffold outweighs the model: a day of harness-defined results
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Qwen3.5-Omni Technical Report huggingface.co
Qwen3.5-Omni is a large-scale multimodal model with hundreds of billions of parameters that excels in audio-visual understanding and generation, featuring advanced architectures and novel capabilities like Audio-Visual Vibe Coding.
The Amazing Agent Race: Strong Tool Users, Weak Navigators huggingface.co
The Amazing Agent Race benchmark introduces DAG-based puzzles to evaluate LLM agents’ navigation and tool-use capabilities beyond traditional linear benchmarks, revealing that navigation errors dominate performance issues.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows huggingface.co
General Tool Agents face significant challenges in real-world workflow completion, with performance dropping sharply from atomic tasks to complex, open-ended workflows, highlighting the need for improved execution frameworks beyond model capacity.
Mind DeepResearch Technical Report huggingface.co
MindDR proposes a three-agent deep research framework trained through a four-stage pipeline combining SFT cold-start, Search-RL, Report-RL, and preference alignment. The system is evaluated on real-world Chinese queries using a multi-dimensional rubric and reports strong results across multiple research benchmarks.
Where does output diversity collapse in post-training? huggingface.co
Empirical study finds that diversity collapse in post-trained LLMs traces primarily to training data composition rather than generation format, with SFT, DPO, and chain-of-thought distillation each affecting diversity differently across tasks. The authors decompose loss into quality-control and residual components to localize where collapse happens.
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning huggingface.co
STOP introduces learnable token-level path pruning for parallel reasoning in large reasoning models, cutting redundant prefixes early to save compute. The authors compare learnable against non-learnable pruning strategies across budgets and report both efficiency and accuracy gains, with code and a project page released.
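As a toy picture of the non-learnable end of that comparison (not STOP's learnable method; `prune_paths` and the fixed margin are illustrative assumptions): keep only the parallel paths whose running log-probability stays within a margin of the current best path, and cut the rest before they consume more tokens.

```python
def prune_paths(running_logprobs, margin=2.0):
    """Keep only paths whose running log-probability is within
    `margin` of the current best path; the rest are cut early."""
    best = max(running_logprobs)
    return [i for i, lp in enumerate(running_logprobs) if lp >= best - margin]

# Four parallel reasoning paths partway through decoding:
# path 2 has fallen far behind and is pruned early.
print(prune_paths([-3.1, -2.8, -9.5, -4.0]))  # [0, 1, 3]
```

A learnable pruner of the kind the paper studies would replace the fixed margin with a token-level decision trained against downstream accuracy.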
Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips huggingface.co
1P-DNL shows that flipping a single sign bit in network parameters can catastrophically degrade models without any data or optimization, demonstrated on ResNet-50 ImageNet classification, Mask R-CNN and YOLOv8-seg detection and segmentation, and the Qwen3-30B-A3B-Thinking language model, with targeted bit protection as mitigation.
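The mechanism is easy to picture: in IEEE-754 float32, a parameter's sign occupies a single bit, so flipping that one bit negates the value outright. A minimal illustration (not the paper's code; `flip_sign_bit` is a hypothetical helper):

```python
import struct

def flip_sign_bit(x: float) -> float:
    """Flip the sign bit of a value's float32 bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits ^ 0x80000000))[0]

# One corrupted bit negates the parameter outright --
# no data, no optimization, no gradient access required.
print(flip_sign_bit(1.5))  # -1.5
```

The paper's contribution is locating which single parameter's sign bit does maximal damage; the flip itself is this cheap.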
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics huggingface.co
ArtifactNet detects AI-generated music by training a compact UNet plus CNN on codec-specific residuals in magnitude spectrograms after harmonic-percussive separation. The authors release ArtifactBench and report better cross-codec robustness than prior detectors through codec-aware training that targets forensic compression artifacts.
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs’ Capabilities in Frontier Physics Research huggingface.co
PRL-Bench evaluates LLMs on end-to-end theoretical and computational physics research workflows drawn from frontier problems, finding current systems fall well short of autonomous scientific exploration. The benchmark targets agentic science capability rather than isolated problem-solving, exposing gaps in domain knowledge and multi-step research execution.
Elucidating the SNR-t Bias of Diffusion Probabilistic Models huggingface.co
Diffusion models accumulate an SNR-timestep mismatch between training and inference denoising. The proposed differential correction method handles frequency components separately during the reverse process, improving generation quality across multiple diffusion backbones with negligible added compute, with code released as DCW.
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects huggingface.co
VEFX-Bench introduces a large-scale human-annotated video editing dataset with multi-dimensional quality labels, a specialized reward model for scoring editing quality, and a benchmark for standardized system comparison.
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization huggingface.co
AccelOpt is a self-improving LLM agentic system that autonomously optimizes kernels for AI accelerators using iterative generation and optimization memory, achieving significant throughput improvements at reduced costs.
Learning Adaptive Reasoning Paths for Efficient Visual Reasoning huggingface.co
An adaptive visual reasoning framework reduces unnecessary computation by dynamically selecting the optimal reasoning format for each input while maintaining accuracy.
(1D) Ordered Tokens Enable Efficient Test-Time Search huggingface.co
Autoregressive models with coarse-to-fine token structures show better test-time scaling and enable training-free text-to-image generation when combined with image-text verifiers.
Hierarchical Codec Diffusion for Video-to-Speech Generation huggingface.co
HiCoDiT generates speech from videos by leveraging the hierarchical structure of discrete speech tokens, achieving better audio-visual alignment through coarse-to-fine conditioning with dual-scale normalization.
TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation huggingface.co
The TwinTrack framework addresses pancreatic-cancer segmentation ambiguity through post-hoc calibration of ensemble probabilities to the empirical mean human response, improving calibration metrics on multi-rater benchmarks.
Repurposing 3D Generative Model for Autoregressive Layout Generation huggingface.co
LaviGen introduces a 3D layout generation framework that uses an adapted 3D diffusion model with dual-guidance self-rollout distillation for improved efficiency and spatial accuracy.
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment huggingface.co
TIPSv2 achieves stronger dense patch-text alignment through improved pretraining techniques, including patch-level distillation, modified masked-image objectives, and optimized caption-sampling strategies.
Motif-Video 2B: Technical Report huggingface.co
Motif-Video 2B achieves high text-to-video generation quality using a specialized architecture with shared cross-attention and three-part backbone, along with efficient training methods, while requiring significantly fewer parameters and training data than larger models.
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies huggingface.co
RoboLab is a simulation benchmarking framework that addresses limitations in robot policy evaluation by enabling scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations.
DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off huggingface.co
DiPO is a reinforcement learning approach for large language models that addresses the exploration-exploitation trade-off through perplexity-based sample partitioning and bidirectional reward allocation.
NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results huggingface.co
This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. Participants were asked to develop automatic saliency-map prediction methods for the provided video sequences. A novel, openly licensed dataset of 2,000 diverse videos was prepared for the challenge; fixations and the corresponding saliency maps were collected via crowdsourced mouse tracking from over 5,000 assessors. Evaluation was performed on a subset…
QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies huggingface.co
QuantCode-Bench evaluates large language models on generating executable trading strategies by testing their ability to translate natural language descriptions into functional code that operates correctly on historical financial data.
Can Large Language Models Reinvent Foundational Algorithms? huggingface.co
Large language models can reinvent foundational computer-science algorithms through an unlearning-and-reinvention process, with performance varying by hint level and reinforcement learning technique.
Universal statistical signatures of evolution in artificial intelligence architectures huggingface.co
The study finds that artificial intelligence architectural evolution follows the same statistical patterns as biological evolution, including similar fitness effect distributions and convergence dynamics.
EdgeDetect: Importance-Aware Gradient Compression with Homomorphic Aggregation for Federated Intrusion Detection huggingface.co
EdgeDetect enables efficient and secure federated intrusion detection for 6G-IoT environments through gradient binarization and homomorphic encryption, achieving high accuracy with reduced communication overhead and strong privacy protection.
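Gradient binarization of the kind this summary mentions can be sketched generically: transmit one sign per coordinate plus a single shared scale (the mean magnitude), so each gradient entry costs roughly one bit instead of 32. This is an illustrative 1-bit compressor, not EdgeDetect's implementation, and it omits the homomorphic-aggregation half entirely:

```python
def binarize_grad(g):
    """1-bit gradient compression: keep only signs plus one
    shared scale, the mean absolute value of the gradient."""
    scale = sum(abs(x) for x in g) / len(g)
    signs = [1 if x >= 0 else -1 for x in g]
    return signs, scale

def reconstruct(signs, scale):
    """Server-side approximation of the update from sign bits."""
    return [s * scale for s in signs]

g = [0.4, -0.2, 0.1, -0.3]
signs, scale = binarize_grad(g)
print(reconstruct(signs, scale))  # entries are ±scale, here ±0.25
```

In the federated setting, only `signs` and `scale` would cross the network (encrypted, in EdgeDetect's case), which is where the communication savings come from.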
PersonaVLM: Long-Term Personalized Multimodal LLMs huggingface.co
PersonaVLM is a personalized multimodal LLM framework that enables long-term personalization through memory retention, multi-turn reasoning, and response alignment.
Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems huggingface.co
Web Retrieval-Aware Chunking (W-RAC) introduces a cost-efficient framework for web document processing that reduces LLM token usage and hallucination risks through structured content representation and retrieval-aware grouping decisions.
References
MarkTechPost marktechpost.com
the flagship 3.5-Omni Plus and Flash variants were launched as proprietary models available only via Alibaba’s DashScope API… Only the Light variant has been officially released with open weights
r/LocalLLaMA discussion reddit.com
the ‘open-source champion’ appears to be ‘closing the door’ on its most advanced multimodal tech
OmniGAIA leaderboard (qwen.ai) qwen.ai
Gemini-3.1 Pro currently leads the OmniGAIA tool-use evaluation with a score of 68.9%, followed by Qwen3.5-Omni-Plus at 57.2%
SGLang GitHub issue #19822 github.com
severe hallucination and repetitive output loops when serving the model on Ascend 910B hardware via SGLang… the model frequently falls into infinite generation cycles
Digital Applied benchmark review digitalapplied.com
the technical report frequently shifted its comparison models—swapping between Gemini 3.1 Pro, GPT-5.2, and Claude Opus—which some termed a ‘misleading’ tactic to highlight specific wins while obscuring broader deficiencies
LLMBase comparison llmbase.ai
Qwen3.5-Omni’s native voice cloning—requiring only three seconds of audio to mimic a speaker—poses a high security risk without standardized industry guardrails
Wang et al., ‘GTA: A Benchmark for General Tool Agents’ (NeurIPS 2024) proceedings.neurips.cc
GPT-4 and GPT-4o achieved success rates of less than 50%, while the majority of mainstream LLMs failed to complete even 25% of tasks; smaller models often failed to follow the rigid ‘Thought-Action-Action Input’ formatting required for successful tool invocation.
MarkTechPost — ‘Balancing Act: Impact of Format Restrictions on Reasoning in LLMs’ marktechpost.com
Strict schemas often force models to provide a final answer key before a reasoning key, effectively stripping the model of its ability to ‘think’ before committing to a result.
r/MachineLearning — ‘Plain English Outperforms JSON for LLM Tool (Calling)’ reddit.com
Replacing JSON with natural language tool-calling can boost accuracy by up to 18 percentage points by reducing this formatting burden and context bloat.
rapidclaw.dev — ‘AI Agent Benchmarks 2026’ rapidclaw.dev
A 2026 Berkeley study found that several leading agentic benchmarks could be exploited to achieve near-perfect scores without actually solving the underlying tasks… many ‘frontier’ models see a 60% performance drop when required to succeed across eight consecutive runs.
OpenReview discussion of LLM-as-judge biases openreview.net
Position bias… verbosity bias… self-preference bias, where models like GPT-4 may favor outputs from their own model family, potentially inflating benchmark scores in a non-objective manner.
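A common mitigation for the position bias mentioned here, sketched as an assumption rather than anything the thread prescribes, is to query the judge in both answer orders and accept only order-invariant verdicts:

```python
def debiased_judge(judge, a, b):
    """Mitigate position bias: ask the judge in both orders and
    accept a verdict only when it survives the order swap."""
    first = judge(a, b)            # returns "A", "B", or "tie"
    second = judge(b, a)
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == flipped else "tie"

# A judge that always prefers whichever answer it sees first
# collapses to "tie" under the order-swap check.
print(debiased_judge(lambda x, y: "A", "ans1", "ans2"))  # tie
```

Self-preference bias is harder: order swapping cannot fix a judge that recognizes its own model family's style in either position.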
manus.im — ‘OpenClaw vs Manus Desktop’ manus.im
Manus utilizes a multi-agent architecture where a ‘Planner’ delegates sub-tasks to specialized ‘Executors’ and ‘Verifiers’… while OpenClaw addresses long-horizon state through a ‘Heartbeat’ mechanism and a local Markdown-based memory system that survives session restarts.
The Moonlight (independent paper review) themoonlight.io
navigation errors were present in 52% of trials, while tool-use errors remained below 17%… reasoning-optimized GPT-OSS-120B model performed poorly (3.1% FA) because it spent its token budget on internal thinking rather than executing tool calls
Clarifai benchmark roundup (GPT-OSS comparison) clarifai.com
On TAU-bench Retail… the model scored 67.8%, trailing behind GLM-4.5 (79.7%) and Kimi K2 (70.6%)… BFCL-v3 function-calling benchmark… 67–68%, while competitors like Qwen3 Thinking reached nearly 72%
Fireworks.ai GPT-OSS deployment notes fireworks.ai
many standard LLM libraries and frontends fail to pass these reasoning tokens back to the model in multi-turn conversations, leading to a total breakdown in agentic behavior… a documented regression specifically broke parallel tool calling
Quesma — CompileBench on Harbor quesma.com
Harbor functions as a CI/CD pipeline for agents: it initializes a sandbox, drops the agent into it, records the interaction trajectory, and employs a verifier to produce a numerical reward… some third-party task images required patching to resolve inherent reliability issues
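The loop Quesma describes (initialize a sandbox, run the agent in it, record the trajectory, score with a verifier) can be sketched generically. This is not Harbor's API; `run_episode` and its arguments are illustrative, and a POSIX shell is assumed:

```python
import pathlib
import subprocess
import tempfile

def run_episode(agent_cmd, task_files, verifier):
    """Minimal agent-CI episode: isolated workspace in, recorded
    transcript and numerical reward out."""
    with tempfile.TemporaryDirectory() as sandbox:
        root = pathlib.Path(sandbox)
        for name, content in task_files.items():
            (root / name).write_text(content)       # seed the task
        proc = subprocess.run(agent_cmd, cwd=sandbox, shell=True,
                              capture_output=True, text=True)
        trajectory = proc.stdout + proc.stderr      # interaction record
        reward = verifier(root)                     # numerical score
        return trajectory, reward

# Toy task: the "agent" must create out.txt; the verifier pays 1.0 if it did.
traj, r = run_episode("echo done > out.txt",
                      {"task.md": "Create out.txt"},
                      lambda root: 1.0 if (root / "out.txt").exists() else 0.0)
print(r)  # 1.0
```

A real harness would replace the shell command with an agent process, stream the trajectory, and harden the sandbox; the reliability issues Quesma notes live mostly in the task images this sketch glosses over.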
OpenReview — Berkeley RDI benchmark audit openreview.net
Berkeley RDI researchers proved this by ‘breaking’ top benchmarks like WebArena and GAIA using simple exploits—such as DOM injection or prompt injections—that allowed agents to achieve near-perfect scores without solving a single task
INFOGENT / ‘Failure is Feedback’ (CLiC-it 2025) aclanthology.org
Most autonomous web agents suffer from an inability to backtrack… Traditional agents are often designed for forward-only navigation; once they navigate away from a useful page to a dead end, they lack the state-space memory to return