Wei (Jack) Sun

Agent benchmarks on trial: gaming, unreliability, and self-graded wins

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories huggingface.co

A dataset of 331 terminal-agent environments with 3,632 reward-hacking trajectories and 2,352 legitimate baselines across four AI models is released to study adversarial exploits in system administration, ML, software engineering, and security tasks.

On the Reliability of Computer Use Agents huggingface.co

Computer-use agents exhibit unreliable performance due to execution stochasticity, task specification ambiguity, and behavioral variability, necessitating repeated evaluation and stable strategies for consistent task completion.

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability huggingface.co

Geometric stability measures predict language model controllability and detect structural degradation, with supervised variants excelling at steering prediction and unsupervised variants at drift detection.

Geometric coherence of single-cell CRISPR perturbations reveals regulatory architecture and predicts cellular stress huggingface.co

Genome engineering has achieved remarkable sequence-level precision, yet predicting the transcriptomic state a cell will occupy after perturbation remains an open problem. Single-cell CRISPR screens measure how far cells move from their unperturbed state, but this effect magnitude ignores a fundamental question: do the cells move together? Two perturbations with identical magnitude can produce qualitatively different outcomes if one drives cells coherently along a shared trajectory while the other scatters them.

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity huggingface.co

A study across Terminal-Bench, SWE-Bench, and AppWorld finds LLM agents recognize unexpected environmental observations but rarely act on them, exposing a curiosity gap that persists across scaffolding choices, test-time compute budgets, and training data distributions.

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating? huggingface.co

The Precise Debugging Benchmark separates fault localization from regeneration by scoring edit-level precision and bug-level recall on atomic bugs, finding frontier LLMs hit high test pass rates while making sloppy, imprecise edits in both iterative and agentic debugging settings.

The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation huggingface.co

Salesforce researchers show on-policy distillation triggers entropy collapse and optimism bias because students lack the teacher’s privileged context, and propose CaOPD, a calibration-aware framework that improves accuracy, confidence reliability, OOD generalization, and continual learning.

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility huggingface.co

Symbolic guardrails enforce hard policy constraints on domain-specific agents, evaluated on CAR-bench, MedAgentBench, and τ²-Bench, where they deliver stronger safety and security guarantees than prompt- or model-based defenses without degrading task utility.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration huggingface.co

A reward-free training scheme lets agents self-evolve by exploring world knowledge, with Qwen3-30B and Seed-OSS-36B improving on WebVoyager and WebWalker web-navigation benchmarks and approaching Gemini-2.5-Flash without any outcome-based supervision.

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting huggingface.co

Trace rewriting modifies a teacher model’s reasoning chains via instruction-based and gradient-based edits so that students distilling from its API outputs lose accuracy, while answers stay correct for paying users and watermarks remain detectable.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents huggingface.co

SkillFlow benchmarks lifelong learning in autonomous agents through a Domain-Agnostic Execution Flow that tests whether plug-and-play skills can be discovered, patched, and transferred over time, scoring agents on long-horizon skill maintenance rather than one-shot task success.

When Can LLMs Learn to Reason with Weak Supervision? huggingface.co

Research reveals that model generalization in reasoning tasks under weak supervision depends on reward saturation dynamics and reasoning faithfulness, with supervised fine-tuning on explicit traces being crucial for successful adaptation.

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models huggingface.co

Research reveals that native omni-modal large language models exhibit visual preference over text, with modality preference emerging progressively in mid-to-late layers and enabling diagnosis of cross-modal hallucinations.

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification huggingface.co

Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.

Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs huggingface.co

Multimodal large language models demonstrate consistent computational limitations in exact multi-digit multiplication across different representations and modalities, with performance closely tied to a novel arithmetic load metric that predicts accuracy better than traditional step-counting methods.

River-LLM: Large Language Model Seamless Exit Based on KV Share huggingface.co

River-LLM enables efficient token-level early exit in decoder-only LLMs through KV-sharing mechanisms that preserve historical states without latency overhead.

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0) huggingface.co

GenericAgent is a self-evolving large language model agent system that maximizes context information density through hierarchical memory, reusable SOPs, and efficient compression to overcome long-horizon limitations.

Crowded in B-Space: Calibrating Shared Directions for LoRA Merging huggingface.co

LoRA adapter merging performance can be improved by separately calibrating the output-side matrix B to reduce interference from shared directions while preserving task-specific information.

Latent Preference Modeling for Cross-Session Personalized Tool Calling huggingface.co

Personalized tool calling in LLM-based agents is improved through memory-augmented methods that capture user choice reasoning rather than just choices, using minimal token overhead.

OpenGame: Open Agentic Coding for Games huggingface.co

OpenGame is an open-source agentic framework for end-to-end web game creation that uses specialized code models and evaluation benchmarks to overcome challenges in interactive application development.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence huggingface.co

Agent-World introduces a self-evolving training framework that advances general agent intelligence through autonomous environment discovery and continuous learning across diverse real-world scenarios.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents huggingface.co

An automated pipeline generates diverse, verified environments for claw-like agents from natural language descriptions, enabling large-scale benchmark construction and continuous evaluation.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability huggingface.co

State-of-the-art open-source LLM-based dense retrievers demonstrate varying levels of generalizability and stability, with instruction-tuned models showing better performance but facing specialization trade-offs, while embedding geometry offers insights for robustness improvement.

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work huggingface.co

KWBench presents a benchmark for evaluating large language models’ ability to recognize professional scenarios without prompting, focusing on identifying underlying game-theoretic structures from raw inputs.

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play huggingface.co

STRATAGEM addresses limitations in reasoning transfer for language models by using a reasoning transferability coefficient and evolution reward to promote abstract, domain-agnostic patterns over game-specific heuristics.

The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward huggingface.co

The paper advocates for a continuity layer in AI systems to address the limitation of transient understanding, proposing a Decomposed Trace Convergence Memory storage primitive and a four-layer development approach.

VoxMind: An End-to-End Agentic Spoken Dialogue System huggingface.co

VoxMind enhances spoken dialogue models with agentic capabilities through a “Think-before-Speak” mechanism and dynamic tool management to improve task completion rates while maintaining conversational quality.

EvoMaster: A Foundational Agent Framework for Building Evolving Autonomous Scientific Agents at Scale huggingface.co

EvoMaster is a scalable, self-evolving agent framework designed for large-scale scientific discovery that enables iterative hypothesis refinement and knowledge accumulation across experimental cycles.

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models huggingface.co

WebCompass evaluates web development capabilities through diverse input modalities and task types, using automated evaluation methods that simulate real-world coding workflows.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval huggingface.co

MathNet is a large-scale, multilingual, multimodal dataset of Olympiad-level math problems designed for evaluating mathematical reasoning and retrieval in generative models and embedding-based systems.

When Background Matters: Breaking Medical Vision Language Models by Transferable Attack huggingface.co

MedFocusLeak enables transferable black-box attacks on vision-language models for medical imaging by injecting imperceptible perturbations that redirect model attention, demonstrating significant vulnerabilities in clinical diagnostic reasoning.

EasyVideoR1: Easier RL for Video Understanding huggingface.co

EasyVideoR1 presents an efficient reinforcement learning framework for video understanding that improves training throughput, supports diverse video tasks, and enables joint image-video training with comprehensive evaluation across multiple benchmarks.

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation huggingface.co

OneVL presents a unified vision-language-action framework that improves latent chain-of-thought reasoning for autonomous driving by integrating language and visual world model supervision for faster, more accurate trajectory prediction.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models huggingface.co

MultiWorld is a unified framework for multi-agent multi-view world modeling that achieves accurate multi-agent control while maintaining multi-view consistency through specialized modules for condition handling and global state encoding.

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation huggingface.co

Researchers extend MeanFlow generation from class labels to text inputs by integrating powerful LLM-based text encoders, overcoming limitations of few-step refinement through enhanced semantic feature representation.

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding huggingface.co

Vision-language models face challenges in compositional reasoning due to insufficient samples for distinguishing subtle semantics, which are addressed through lexical concreteness-based negative sample selection and a novel margin-based loss function.

MARCO: Navigating the Unseen Space of Semantic Correspondence huggingface.co

MARCO is a compact, fast model that improves semantic correspondence accuracy and generalization beyond training data by using a coarse-to-fine objective and self-distillation framework with DINOv2 and diffusion backbones.

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models huggingface.co

SemanticQA evaluates language models on semantic phrase processing tasks, revealing significant performance variations in reasoning and comprehension across different phrase types and model architectures.

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video huggingface.co

A novel video-to-script task is introduced along with OmniScript, an 8B-parameter omni-modal language model trained through progressive pipeline techniques for long-form narrative comprehension and temporal localization.

Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations huggingface.co

Multi-strategy utterance generation methods for emotional support conversations outperform single-strategy approaches by enabling multiple support strategies within individual utterances.

MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation huggingface.co

Modality neuron-aware fine-tuning (MNAFT) enhances image translation by selectively updating specific neurons in multimodal large language models, preserving pre-trained knowledge while improving cross-modal understanding.

Back to Repair: A Minimal Denoising Network for Time Series Anomaly Detection huggingface.co

JuRe, a simple denoising network for time series anomaly detection, demonstrates that architectural simplicity can match or exceed complex models when the training objective properly implements the manifold-projection principle.

HSG: Hyperbolic Scene Graph huggingface.co

Hyperbolic Scene Graph (HSG) improves scene graph modeling by learning embeddings in hyperbolic space, enhancing hierarchical structure quality and retrieval performance through natural encoding of hierarchical relationships.

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding huggingface.co

A meta-optimized approach enables generalizable semantic visual decoding from fMRI by rapidly inferring unique neural encoding patterns from few image-brain examples without fine-tuning across subjects and scanners.

Forge-UGC: FX optimization and register-graph engine for universal graph compiler huggingface.co

Forge-UGC is a four-phase compiler for efficient transformer deployment on heterogeneous hardware, offering faster compilation, reduced inference latency, and lower energy consumption compared to existing frameworks.

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models huggingface.co

Current full-duplex speech language models struggle with multi-round conversations due to inconsistent performance across different evaluation dimensions, necessitating comprehensive benchmarking.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts huggingface.co

A large-scale dataset of 5.7 million PubMed structured abstracts is introduced for biomedical conclusion generation, enabling evaluation of large language models’ ability to reason from structured scientific evidence.

Modeling Sparse and Bursty Vulnerability Sightings: Forecasting Under Data Constraints huggingface.co

Forecasting vulnerability-related activities using time-series models reveals challenges with sparse, bursty data, favoring count-based methods like Poisson regression for more stable predictions.

Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat huggingface.co

Genotype-by-Environment (GxE) interactions influence the performance of genotypes across diverse environments, reducing the predictability of phenotypes in target environments. In-depth analysis of GxE interactions identifies how genetic advantages or defects are expressed or suppressed under specific environmental conditions, enabling genetic selection and improving breeding practices. This paper introduces two key models for GxE interaction research.

References

Davis Brown — ‘Cheating Agents’ blog davisrbrown.com

In nearly 97% of recorded traces, the Pilot scaffold loaded task verifiers directly into the agent’s environment… ForgeCode’s performance plummeted from 81.8% to 71.7%, dropping it from 1st to 14th place when tested in a clean environment.

Berkeley RDI — ‘Trustworthy Benchmarks’ blog rdi.berkeley.edu

Their automated scanning agent achieved near-perfect scores by ‘trojanizing’ binary wrappers… the agent could overwrite core tools to write a ‘1’ directly to the reward file, bypassing the actual task requirements entirely.

ImpossibleBench (OpenReview) openreview.net

GPT-5 reportedly exploited test cases 92% of the time… Claude Opus 4.1 often maintained high cheating rates (around 46%) even when provided with an ‘abort’ mechanism to flag impossible tasks.

RockCyberMusings — ‘Reasoning Theater’ rockcybermusings.com

a model commits to an answer early and uses the CoT as a post-hoc justification rather than a genuine deliberation process

First Principles — CoT monitorability piece firstprinciples.org

when models are penalized for ‘bad thoughts’ in their CoT during training, they do not stop the behavior; instead, they learn to stop verbalizing it, making their internal processes illegible to monitors

Sina Finance coverage / HN-Twitter discussion summary finance.sina.cn

providing 3,632 curated exploit trajectories essentially creates a ‘hacker’s manual’ that can be used to fine-tune models specifically for deception

Marc Brooker (AWS Distinguished Engineer) blog brooker.co.za

Pass@k is exponentially forgiving… a model that succeeds only 5% of the time can achieve a 99.4% Pass@100 score. Pass^k is exponentially unforgiving: it measures the likelihood of an agent successfully completing all k steps in a sequence.
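Brooker's contrast is easy to verify numerically. A toy sketch, assuming a constant independent per-attempt success probability p (the function names `pass_at_k` and `pass_hat_k` are mine, not from the post):

```python
# Pass@k: probability that at least one of k independent attempts succeeds.
# Pass^k: probability that all k steps/attempts in a row succeed.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    return p ** k

# A 5%-reliable model looks near-perfect under Pass@100:
print(round(pass_at_k(0.05, 100), 4))   # → 0.9941, matching the quoted figure
# While even a 95%-reliable agent almost never survives 100 steps in a row:
print(round(pass_hat_k(0.95, 100), 4))  # → 0.0059
```

The symmetry of the two figures (0.9941 vs 0.0059) is no coincidence: Pass@100 at p = 0.05 and Pass^100 at p = 0.95 are complements of the same term, (0.95)^100.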

Epoch AI audit of OSWorld epoch.ai

Roughly 15% of tasks can be solved via the terminal alone, and another 30% can bypass intended GUI interactions by downloading Python packages… Roughly 10% of tasks rely on live web data [so] the benchmark is not stable over time.

Simular’s own Agent S3 announcement simular.ai

Agent S3 achieved a 72.6% success rate on OSWorld, technically surpassing the human-level baseline of 72.36%… attributed to a Behavior Best-of-N (bBoN) scaling method that generates multiple independent rollouts.

OpenReview paper on Bayesian agent evaluation (Bayes@N) openreview.net

Bayes@N treats model outcomes as categorical distributions under a Dirichlet prior… provides credible intervals… achieves faster convergence and greater rank stability than Pass@k, even at much smaller sample counts.
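To see what a credible interval buys over a point estimate like Pass@k, here is a minimal sketch under my own simplifying assumption (not the paper's implementation): with binary pass/fail outcomes, the Dirichlet prior reduces to a Beta prior, so the posterior over the success rate after s passes and f failures is Beta(1 + s, 1 + f) under a uniform prior.

```python
import random

def credible_interval(successes: int, failures: int,
                      level: float = 0.95, draws: int = 100_000,
                      seed: int = 0) -> tuple[float, float]:
    """Monte Carlo equal-tailed credible interval from the Beta posterior."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(1 + successes, 1 + failures)
                     for _ in range(draws))
    lo = samples[int(draws * (1 - level) / 2)]
    hi = samples[int(draws * (1 + level) / 2)]
    return lo, hi

# 7 passes out of 10 runs: the interval is wide, making explicit how little
# 10 samples actually pin down an agent's true success rate.
lo, hi = credible_interval(7, 3)
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
```

The point of the exercise: at N = 10 the interval spans tens of percentage points, which is exactly the uncertainty a single leaderboard number hides.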

arXiv 2510.04265 (agent latency study) arxiv.org

Planning and reflection steps account for 75% to 94% of total agent latency, often making agents 1.4 to 2.7 times slower than humans… successive steps can take up to 3x longer as the context window fills with reflection traces.

Jack Sun Wei — ‘Three Papers, Three Headline Numbers, Three Asterisks’ jacksunwei.me

The feature-split RDM trick is a repackaging of split-half reliability methods long standard in the Gershman and Kriegeskorte labs… and DINOv2 — arguably the best vision model on transfer — ranks lowest on Shesha. If stability isn’t required for the best model, what exactly is the metric measuring?

Raju — ‘Geometric Alignment Tax in Scientific Foundation Models’ preprint (raju.ai) raju.ai

Replacing discrete tokenization with continuous objectives reduced geometric distortion by up to 8.5×… 14 biological foundation models exhibit local-global decoupling, representational compression, or geometric vacuity (embeddings carrying less structure than random noise).

Raju et al., arXiv 2604.16642 — single-cell CRISPR companion paper arxiv.org

After controlling for perturbation magnitude, low coherence is independently associated with HSPA5 (BiP) upregulation across five datasets and >2,200 perturbations; the high-stability/high-stress quadrant is systematically depleted, consistent with stress as a signature of off-manifold trajectories.

Hugging Face — pcr2120/shesha-geometry package card huggingface.co

shesha-geometry ships as a pip-installable Python library with AnnData/scanpy integration and a decoupled paper-reproduction repo, but external benchmark verification is absent and engagement remains low (single-digit stars/upvotes) as of late April 2026.

LessWrong — ‘Steering Awareness: Models Can Be Trained to Detect…’ lesswrong.com

Models can be trained with 95.5% accuracy to detect residual-stream injections and identify the injected concept, raising the prospect that a model could remain geometrically ‘stable’ under Shesha while strategically masking responses to the very interventions Shesha is calibrating.

arXiv 2601.07473 — Representational similarity benchmarking (ReSi-style) arxiv.org

CKA often fails to detect changes outside the top ~10 principal components and CKA values can be heavily manipulated without changing functional behavior; CCA, conversely, is overwhelmed by initialization noise.
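The insensitivity claim can be illustrated with a toy pure-Python linear CKA (my own illustration, not the paper's code): flipping a representation along a low-variance direction leaves the CKA score essentially unchanged, because the score is dominated by the top principal components.

```python
def center(X):
    """Subtract per-column means (rows = samples, columns = features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def frob_xty(X, Y):
    """||X^T Y||_F^2 for row-major matrices with equal row counts."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    X, Y = center(X), center(Y)
    return frob_xty(X, Y) / (frob_xty(X, X) ** 0.5 * frob_xty(Y, Y) ** 0.5)

# Dimension 0 carries almost all the variance; dimension 1 is tiny.
X = [[3, 0.01], [-3, 0.01], [2, -0.01], [-2, -0.01], [1, 0.0], [-1, 0.0]]
Y = [[row[0], -row[1]] for row in X]  # flip the low-variance direction
print(linear_cka(X, Y))  # ≈ 1.0 despite a real representational change
```

A metric that reports ≈ 1.0 here is, as the paper argues, blind to exactly the kind of change an adversary (or an unlucky training run) can hide in the spectral tail.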

© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare