Wei (Jack) Sun

Code harness lifts GPT-5.5 26pt, Llama skips tool calls, TDDev's 65.8% rebutted

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.

Sources

Code as Agent Harness huggingface.co

Large language models are increasingly used as operational substrates for agent reasoning and execution in agentic systems, with code serving as a unified infrastructure layer across multiple domains and applications.

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use huggingface.co

Research reveals a disconnect between language models’ recognition of when tools are needed and their actual tool invocation behavior, identifying a “knowing-doing gap” in tool-use reliability.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements huggingface.co

TDDev automates test-driven development for web application generation by integrating requirement analysis, browser-based validation, and structured repair reporting to improve code quality and reduce human intervention.

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents huggingface.co

AMD’s open-source AgentKernelArena scores coding agents on full GPU kernel optimization workflows, not just one-shot outputs. It tests HIP-to-HIP, Triton-to-Triton, and PyTorch-to-HIP translation while checking compilation, correctness, and performance on configurations the agent never saw during development.

OProver: A Unified Framework for Agentic Formal Theorem Proving huggingface.co

OProver trains theorem-proving agents through continued pretraining plus iterative SFT and RL on compiler-verified proofs. The framework folds Lean 4 verifier feedback directly into the loop, letting proof synthesis improve from its own successful attempts rather than static demonstration data.

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics huggingface.co

Watching hidden-state probes evolve across a reasoning trace beats static snapshots for predicting model behavior. Signal-processing features over probe trajectories—replacing simple average- or max-pooling—raise AUROC on safety monitoring of large reasoning models by exploiting temporal structure in the internal monologue.

Targeted Neuron Modulation via Contrastive Pair Search huggingface.co

Contrasting harmful and benign prompts surfaces a small set of MLP neurons that act as a targetable refusal gate in instruction-tuned models. Modulating them shifts jailbreak refusal rates without degrading output quality, exposing how alignment fine-tuning concentrates discrimination structure.

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces huggingface.co

Agent Bazaar simulates LLM agents trading in marketplaces and measures systemic risks like algorithmic instability and Sybil deception. Training with REINFORCE++ on an adaptive curriculum raises the framework’s Economic Alignment Score, curbing destabilizing behavior without hand-coded rules.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers huggingface.co

Matching optimizer updates to each weight block’s equivariance—permutation symmetry in embeddings and LM heads, shared-shift in SwiGLU and MoE routers—improves pretraining stability over coordinate-wise Adam. The hybrid row-norm and spectral updates extend Muon and Scion across Qwen3-0.6B, Gemma 3 1B, OLMoE, and gpt-oss.

AI for Auto-Research: Roadmap & User Guide huggingface.co

AI handles structured research tasks like literature synthesis and analysis reliably but stumbles on novel ideation and scientific judgment, the roadmap finds. The accompanying benchmark suite and tool inventory argue for human-governed collaboration rather than end-to-end automation across long-horizon research agents.

Geometric Phase Transition Enables Extreme Hippocampal Memory Capacity huggingface.co

Superior spatial memory emerges from hippocampal population geometry transitioning from disorganized to crystalline states, enabling higher capacity and stability through topological rigidity and specific neural circuit dynamics.

SNLP: Layer-Parallel Inference via Structured Newton Corrections huggingface.co

Transformer models can achieve faster inference through parallel Newton-style updates that approximate sequential computations using structured Jacobian approximations and specialized regularization techniques.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? huggingface.co

Healthcare workflow benchmark challenges agents with policy-dense, multi-role, and multilateral interaction requirements, revealing significant performance gaps in automated enterprise applications.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation huggingface.co

LongLive-2.0 presents an NVFP4-based parallel infrastructure for long video generation that addresses training and inference bottlenecks through sequence-parallel autoregressive training and diffusion model tuning.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization huggingface.co

OSCAR is an ultra-low-bit KV cache quantization method that aligns quantization with attention-aware covariance structures, achieving high accuracy and efficiency for long-context LLM serving.

Post-Trained MoE Can Skip Half Experts via Self-Distillation huggingface.co

Zero-Expert Self-Distillation Adaptation (ZEDA) enables efficient dynamic Mixture-of-Experts models by converting static models into adaptive ones with reduced computational costs and improved inference speed.

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs huggingface.co

LiteFrame, a lightweight video encoder with Compressed Token Distillation training method, reduces latency and increases frame processing capacity for long-form video understanding in Video LLMs while maintaining accuracy.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs huggingface.co

AstraFlow is a dataflow-oriented reinforcement learning system that enables efficient multi-policy collaborative training and elastic scaling across diverse compute resources for large language model agents.

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection huggingface.co

CompactAttention improves chunked prefill attention efficiency by using Block-Union KV Selection to minimize KV block tables and enable in-place access without explicit compaction.

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution huggingface.co

SkillsVote is a governance framework for long-horizon LLM agents that manages reusable skills through structured collection, recommendation, and evolution processes.

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation huggingface.co

VideoSeeker introduces a novel paradigm for instance-level video understanding by integrating agentic reasoning with visual prompts, achieving superior performance through automated data synthesis and reinforcement learning.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models huggingface.co

Researchers introduce PUMA, a framework that uses semantic redundancy detection to improve reasoning efficiency in large models by identifying when continued thinking provides no new insights, thus reducing computational waste while maintaining answer accuracy and reasoning quality.

Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement huggingface.co

DiHAL, a geometry-guided diffusion-transformer hybrid, identifies optimal layers for diffusion integration in pretrained transformers by using geometry-based proxies to select diffusion-friendly hidden-state interfaces, enabling more effective language modeling compared to traditional continuous diffusion approaches.

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents huggingface.co

MM-ToolBench presents a comprehensive benchmark for evaluating omni-modal tool-use agents through closed-loop verification across diverse professional tasks.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents huggingface.co

MementoGUI presents a memory framework for GUI agents that uses learned controllers for selective memory management and retrieval, improving long-horizon task performance through compressed visual and textual representations.

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models huggingface.co

Incantation enables interactive video world modeling with natural language conditioning for fine-grained multi-entity control and cross-entity generalization through novel video backbone and attention mechanisms.

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training huggingface.co

A novel online reinforcement learning framework for diffusion models that improves safety without requiring supervised paired data or reward tuning, achieving state-of-the-art performance on multiple harm categories.

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation huggingface.co

Large language models demonstrate significant limitations in abstract reasoning abilities compared to human performance, particularly in complex 3D task understanding, while an automated benchmark pipeline shows that programmatic verification ensures solution uniqueness through cycle consistency.

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models huggingface.co

An autoregressive action expert generates continuous action sequences conditioned on vision-language prefixes, maintaining long-term memory for context-aware robotic policy training with improved trajectory smoothness and task success rates.

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data huggingface.co

Vision-Language-Action models exhibit degraded performance under unseen visual disturbances, but a lightweight information-theoretic adapter module significantly improves robustness with minimal parameter overhead.

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring huggingface.co

EndPrompt extends large language model context windows by using a two-segment training approach that preserves semantic continuity while enabling efficient long-range positional learning through sparse supervision.

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents huggingface.co

AtlasVA is a teacher-free visual skill memory framework for vision-language model agents that uses spatial heatmaps, visual exemplars, and symbolic text skills to improve spatial decision-making in long-horizon tasks.

Actionable World Representation huggingface.co

WorldString is a neural architecture that models object state manifolds from point clouds or RGB-D video streams, serving as a foundational component for physical world models with differentiable structure for policy learning integration.

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions huggingface.co

GRASP is a large-scale social reasoning dataset connecting high-level social questions with fine-grained gaze and gesture events, along with Social Grounding Reward to improve multimodal model understanding of social interactions.

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring huggingface.co

Large language models exhibit systematic bias toward central tendency when evaluating clinical assessments, particularly affecting critical score extremes important for cognitive impairment screening.

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring huggingface.co

Post-merge quantization framework E-PMQ improves low-bit deployment of merged neural network models by addressing coupled quantization and merging deviations through expert-guided calibration and weight anchoring.

TopoPrimer: The Missing Topological Context in Forecasting Models huggingface.co

TopoPrimer enhances forecasting accuracy by incorporating global topological structures via persistent homology and spectral sheaf coordinates, demonstrating consistent improvements across diverse domains and challenging scenarios.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science huggingface.co

SCICONVBENCH evaluates large language models’ ability to handle ill-posed scientific queries through multi-turn dialogue, focusing on clarifying ambiguous requests and resolving inconsistent information across computational science domains.

DexHoldem: Playing Texas Hold’em with Dexterous Embodied System huggingface.co

DexHoldem presents a real-world benchmark for evaluating embodied agents in dexterous manipulation tasks, testing both primitive execution and higher-level perception and decision-making capabilities.

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis huggingface.co

A novel MLLM-based agentic framework called Code-as-Room generates 3D indoor rooms by converting top-down images into executable Blender code through a structured execution harness with cross-stage memory to maintain context.

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration huggingface.co

KVPO, an ODE-native online GRPO framework, aligns streaming video generators with human preferences through causal-semantic exploration and a velocity-field surrogate policy based on trajectory velocity energy.

Measuring Maximum Activations in Open Large Language Models huggingface.co

Modern open LLMs exhibit wide variation in activation magnitudes across families and training stages, with maximum values spanning four orders of magnitude and showing complex scaling patterns that differ from simple size-based expectations.

WavFlow: Audio Generation in Waveform Space huggingface.co

WavFlow generates high-fidelity audio directly in raw waveform space using waveform patchify and amplitude lifting, achieving competitive performance on video-to-audio and text-to-audio benchmarks without intermediate latent representations.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection huggingface.co

MixSD addresses knowledge injection in language models by aligning supervision with the model’s native generation distribution, reducing catastrophic forgetting during fine-tuning.

Lance: Unified Multimodal Modeling by Multi-Task Synergy huggingface.co

Lance is a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos through collaborative multi-task training and a dual-stream architecture.

NGM: A Plug-and-Play Training-Free Memory Module for LLMs huggingface.co

A training-free N-gram Memory module enhances language model performance by directly utilizing pretrained token embeddings for knowledge retrieval without requiring additional memory tables or retrieval pipelines.

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models huggingface.co

FINESSE-Bench presents a comprehensive suite of eight specialized financial benchmarks designed to evaluate large language models across multiple levels of professional competency and task types.

Evaluating Cognitive Age Alignment in Interactive AI Agents huggingface.co

ChildAgentEval presents a psychometrically grounded benchmark for assessing cognitive age alignment in MLLM-based agents by comparing their reasoning performance against human developmental stages.

References

MindStudio — ‘Agent Harnesses Beat Model Upgrades’ mindstudio.ai

the same GPT-5.5 model achieved an 87.2% success rate in a specialized coding harness (Cursor) compared to only 61.5% in a native environment
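
The gap in the quote above is an absolute difference in percentage points, which is what the issue title rounds to 26pt; the relative improvement over the native baseline is a different and larger number. A minimal sketch of the arithmetic, using only the two figures from the quote (the function names are illustrative, not from the source):

```python
def point_lift(baseline_pct, harness_pct):
    """Absolute lift in percentage points."""
    return round(harness_pct - baseline_pct, 1)


def relative_lift(baseline_pct, harness_pct):
    """Lift relative to the baseline, expressed as a percentage."""
    return round(100.0 * (harness_pct - baseline_pct) / baseline_pct, 1)


native, harness = 61.5, 87.2  # success rates taken from the quote above

print(point_lift(native, harness))     # 25.7 percentage points
print(relative_lift(native, harness))  # 41.8 percent over the baseline
```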

Preprints.org — Chain-of-Code Collapse study preprints.org

adversarial prompt perturbations—surface-level changes that preserve core logic—can cause accuracy to plummet by up to 42%

arXiv — Scope Delineation Before Localization (SDBL) arxiv.org

boosting step-level accuracy from 8% to over 32% in expert-curated logs

Modal — ‘Best Sandboxes for SWE-bench Coding Agents’ modal.com

production-grade systems like Modal and E2B employ gVisor or Firecracker microVMs for hardware-level isolation … parallelizing 500-task benchmarks now takes roughly seven minutes

DecodeTheFuture — AutoAgent / Meta-Harness writeup decodethefuture.org

AutoAgent … achieved #1 rankings on SpreadsheetBench (96.5%) and TerminalBench (55.1%) after only 24 hours of autonomous optimization

Medium (E. Pappas) — ‘The Optimizer in the Loop’ medium.com

self-evolving systems create a ‘harness debt,’ where the infrastructure intended to catch errors becomes its own source of complex, hard-to-debug failures

When2Tool benchmark (Weng Lab) lilywenglab.github.io

A linear probe on the hidden state of the last input token predicts tool necessity with AUROC > 0.90 across model families; Probe&Prefill reduces unnecessary API calls by 48-56% while staying within 1.7% of baseline accuracy.
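
To make "a linear probe scored by AUROC" concrete, here is a rough standard-library sketch. It is not the When2Tool code: the "hidden states" are synthetic Gaussian vectors, the label 1 stands in for "tool needed", and every name and hyperparameter below is an assumption chosen for illustration.

```python
import math
import random


def make_states(n, dim, rng):
    """Synthetic stand-in for last-token hidden states (label 1 = tool needed)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        # Positives are shifted by +1 in every coordinate (assumed toy signal).
        x = [rng.gauss(1.0 if y else 0.0, 1.0) for _ in range(dim)]
        data.append((x, y))
    return data


def score(w, b, x):
    """Linear probe logit for one hidden state."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(data, dim, lr=0.05, epochs=20):
    """Logistic regression fitted with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = max(-30.0, min(30.0, score(w, b, x)))  # clamp to avoid overflow
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
DIM = 8
train, held_out = make_states(400, DIM, rng), make_states(400, DIM, rng)
w, b = train_probe(train, DIM)
probe_auroc = auroc([score(w, b, x) for x, _ in held_out], [y for _, y in held_out])
print(f"held-out AUROC: {probe_auroc:.3f}")
```

The AUROC here reflects only how separable the toy data is; it says nothing about the figures reported in the quote.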

LessWrong / Apollo Research — Detecting strategic deception with linear probes lesswrong.com

Linear probes can distinguish honest from deceptive reasoning patterns with over 90% accuracy, acting as neural circuit breakers that catch misalignment before action.

ResearchGate — Thermodynamic Analysis of Sycophancy in LLMs researchgate.net

Sycophancy emerges in two stages: a deeper representational divergence where the model distinguishes truth, followed by a late-layer preference shift that overrides this internal knowledge during the final decoding phase.

EmergentMind — BFCL v4 overview emergentmind.com

BFCL focuses on deterministic AST-based validation of function-call syntax, while ToolBench relies on LLM-as-a-judge ToolEval; both treat necessity as a static property of the prompt rather than a function of model capability.
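
As a rough illustration of what deterministic, AST-based validation of a function call can look like, the sketch below parses a model's output with Python's `ast` module and checks the function name and required keyword arguments. It is not BFCL's implementation; `check_call`, the `get_weather` tool, and its arguments are hypothetical.

```python
import ast


def check_call(call_src, expected_name, required_args):
    """Return True if call_src is a single call to expected_name
    that supplies every required keyword argument."""
    try:
        tree = ast.parse(call_src, mode="eval")
    except SyntaxError:
        return False  # malformed output fails deterministically
    node = tree.body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected_name:
        return False
    given = {kw.arg for kw in node.keywords if kw.arg is not None}
    return set(required_args) <= given


print(check_call('get_weather(city="Paris", unit="C")', "get_weather", ["city"]))  # True
print(check_call('get_weather(city="Paris"', "get_weather", ["city"]))             # False
print(check_call('lookup(city="Paris")', "get_weather", ["city"]))                 # False
```

A check like this judges only the shape of the call. Whether a call was needed at all is outside its scope, which is the limitation the quote points at.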

Agent4Science — Mechanistic interpretability of tool routing agent4science.org

Accuracy in tool selection drops from 100% to 42% when tool descriptions are shuffled but names remain the same, suggesting the ‘decision’ is often a sophisticated lookup rather than deep semantic reasoning.

Medium — ‘LLMs Can’t Calculate’ (practitioner blog) medium.com

Models with fewer than 7B parameters often fail the ‘detour problem’ — resisting the urge to complete a sentence and instead format a structured tool call — independent of any prompt instruction.

WebGen-Agent OpenReview submission (Shi et al.) openreview.net

In a direct head-to-head comparison on the full 101-instruction WebGen-Bench test set, WebGen-Agent (using Qwen3-Coder) achieved 58.2% accuracy, significantly outperforming TDDev’s 44.0%… [TDDev’s headline] results were inflated by testing on a limited subset of only 10 instructions and using ‘golden’ design images as hints.

FullStack-Agent (ResearchGate, 2026) researchgate.net

While TDDev proved that TDD infrastructure alone could improve generation quality by 34–48 percentage points… it was noted for producing relatively simple codebases that struggled with complex database implementations.

WebGen-Bench review (themoonlight.io) themoonlight.io

DeepSeek-R1 paired with the Bolt.diy framework emerged as the functional leader, achieving an accuracy of 27.8%… WebGen-LM-32B, a model fine-tuned on the benchmark’s companion dataset, achieved a functional accuracy of 38.2%, surpassing all proprietary models.

themoonlight.io review of TDDev paper themoonlight.io

The testing agent achieved 100% accuracy in detecting actual bugs (no false positives) but incorrectly flagged 25% of functional applications as broken due to Playwright selector timeouts (false negatives).

Playwright MCP documentation playwright.dev

Standard accessibility snapshots typically consume 200–400 tokens, compared to thousands for raw DOM or pixel-based vision models… locators based on roles and names are more resilient to UI refactors than brittle CSS selectors.

Vantage ‘Agentic Coding Costs’ blog vantage.sh

Agentic coding consumes roughly 1,000x more tokens than standard code chat… input tokens, rather than output, drive 85% of total costs because agents must resend the entire codebase context, test logs, and conversation history at every turn.
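
The mechanism in the quote, where context is resent on every turn, can be sketched with a toy cost model. All numbers below (turn count, context size, growth per turn, per-token prices) are hypothetical and are not meant to reproduce the 85% figure; they only show why input tokens dominate once history accumulates.

```python
def session_tokens(turns, base_context, growth_per_turn, output_per_turn):
    """Total input and output tokens when the full context is resent each turn."""
    input_total = 0
    history = 0
    for _ in range(turns):
        input_total += base_context + history  # whole context goes in again
        history += growth_per_turn             # logs and replies accumulate
    return input_total, turns * output_per_turn


def input_cost_share(input_tokens, output_tokens, in_price, out_price):
    """Fraction of total spend attributable to input tokens."""
    in_cost = input_tokens * in_price
    out_cost = output_tokens * out_price
    return in_cost / (in_cost + out_cost)


# Hypothetical session: 20 turns, 20k-token codebase context,
# 1.5k tokens of new history per turn, 500 output tokens per turn.
inp, out = session_tokens(turns=20, base_context=20_000,
                          growth_per_turn=1_500, output_per_turn=500)
# Hypothetical prices per million tokens: 3.0 for input, 15.0 for output.
share = input_cost_share(inp, out, in_price=3.0, out_price=15.0)
print(inp, out, round(share, 3))  # 685000 10000 0.932
```

Even with output priced five times higher per token in this toy setup, input accounts for most of the spend because its volume grows with every turn.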

Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me