Wei (Jack) Sun

Agent research moves from leaderboard scores to the trace itself

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

CodeTracer: Towards Traceable Agent States huggingface.co

CodeTracer is a tracing architecture that analyzes code agent execution by reconstructing state transitions and localizing failures in complex multi-stage workflows.

CocoaBench: Evaluating Unified Digital Agents in the Wild huggingface.co

A new benchmark called CocoaBench evaluates unified digital agents on complex, multi-capability tasks requiring vision, search, and coding integration, revealing significant room for improvement in current agent systems.

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context huggingface.co

SWE-AGILE addresses reasoning limitations in software engineering by using dynamic context management to balance detailed analysis with computational efficiency.

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models huggingface.co

Mechanistic interpretability work locates alignment behavior in specific attention gates and amplifier heads that fire early in the forward pass to commit a refusal decision. The routing circuit transfers across model scales, and the authors validate it via per-head ablation, knockout cascades, and in-context cipher contrasts.
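
Per-head ablation is a standard interpretability move and easy to sketch. Below is a minimal PyTorch version, assuming a GPT-2-style attention layout where head outputs sit contiguously along the hidden dimension before the output projection mixes them; the module path in the usage comment is illustrative, not the paper's code.

```python
import torch

def ablate_head(attn_concat: torch.Tensor, head: int, num_heads: int) -> torch.Tensor:
    """Zero one head's slice of the concatenated head outputs.

    attn_concat: (batch, seq, hidden), with heads laid out contiguously along
    the hidden dimension before the output projection mixes them (the usual
    GPT-2-style layout; check your model's attention module).
    """
    head_dim = attn_concat.shape[-1] // num_heads
    out = attn_concat.clone()
    out[..., head * head_dim:(head + 1) * head_dim] = 0.0
    return out

def ablation_pre_hook(head: int, num_heads: int):
    # Forward pre-hook on the attention output projection: rewrites its input
    # so the chosen head contributes nothing downstream.
    def hook(module, args):
        return (ablate_head(args[0], head, num_heads),) + args[1:]
    return hook

# Hypothetical usage against a GPT-2-style HF model (names are illustrative):
#   proj = model.transformer.h[layer].attn.c_proj
#   handle = proj.register_forward_pre_hook(ablation_pre_hook(head=3, num_heads=12))
#   ...measure the refusal rate on a prompt set, then handle.remove()
```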

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music huggingface.co

NVIDIA and UMD’s follow-up to Audio Flamingo extends context to long-form audio across speech, sound, and music, and introduces a Temporal Audio Chain-of-Thought reasoning mechanism. Training uses a curriculum spanning pre-, mid-, and post-training stages on new AudioSkills-XL and LongAudio-XL datasets.

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators huggingface.co

Sim2Reason trains LLMs on physics-simulator-generated trajectories with reinforcement learning to build physical reasoning, then transfers zero-shot to International Physics Olympiad problems. The pipeline shows simulators can substitute for scarce annotated reasoning data in a hard scientific domain.

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks huggingface.co

AggAgent replaces naive majority voting for parallel test-time scaling on long-horizon agent tasks with a lightweight aggregation agent that navigates and synthesizes candidate trajectories on demand, sidestepping context-window blowup when combining many tool-augmented rollouts into a final answer.
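
The aggregation idea can be sketched without any framework: give an aggregator model one-line summaries of the candidate rollouts plus a bounded fetch budget, rather than concatenating every trajectory into one prompt. `call_llm` below is a stand-in for any chat-completion API; this is a sketch of the general pattern, not AggAgent's implementation.

```python
from typing import Callable, List

def aggregate_rollouts(
    question: str,
    trajectories: List[str],
    call_llm: Callable[[str], str],  # stand-in for any chat-completion API
    max_fetches: int = 3,
) -> str:
    """Synthesize a final answer from parallel rollouts without concatenating
    them all into one prompt. The aggregator first sees one-line summaries of
    each candidate, then may request full trajectories one at a time (bounded
    by max_fetches) before committing to an answer."""
    summaries = [call_llm(f"Summarize this agent trajectory in one line:\n{t[:4000]}")
                 for t in trajectories]
    menu = "\n".join(f"[{i}] {s}" for i, s in enumerate(summaries))
    context = f"Question: {question}\nCandidate rollouts:\n{menu}"
    for _ in range(max_fetches):
        reply = call_llm(context + "\nReply with FETCH <i> to inspect a rollout, "
                                   "or ANSWER: <final answer>.")
        if reply.startswith("FETCH"):
            i = int(reply.split()[1])
            context += f"\n--- rollout {i} ---\n{trajectories[i]}"
        else:
            return reply.removeprefix("ANSWER:").strip()
    return call_llm(context + "\nGive your best final answer now.")
```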

Introspective Diffusion Language Models huggingface.co

I-DLM closes the quality gap between diffusion and autoregressive language models by enforcing introspective consistency at decoding time, using causal masking, logit shifting, and introspective strided decoding. A stationary-batch scheduler boosts throughput in large-concurrency serving.

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models huggingface.co

A survey categorizes RL credit-assignment techniques for LLMs by granularity and methodology, spanning Monte Carlo, temporal difference, model-based, game-theoretic, and information-theoretic approaches, and contrasts methods suited to reasoning tasks against those built for multi-turn agentic settings with sparse rewards.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences? huggingface.co

Scale AI’s SciPredict benchmark tests whether LLMs can forecast outcomes of natural-science experiments. Models trail human experts on both accuracy and confidence calibration, and unlike humans, they fail to improve on experiments that domain experts flag as predictable.

Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory huggingface.co

RoMem introduces a temporal knowledge graph module that uses semantic speed gates and continuous phase rotation to distinguish persistent from evolving facts, achieving superior performance in temporal reasoning and agentic memory tasks.
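
One plausible reading of "continuous phase rotation with a semantic speed gate" is a RoPE-like rotation of fact embeddings in 2-D planes, where the rotation angle grows with elapsed time and a learned per-plane rate near zero freezes persistent facts. A minimal numpy sketch, not RoMem's actual module:

```python
import numpy as np

def phase_rotate(emb: np.ndarray, dt: float, speed: np.ndarray) -> np.ndarray:
    """Rotate an embedding in consecutive 2-D planes by angle speed * dt.

    emb:   (d,) vector with d even; dims (2k, 2k+1) form a rotation plane.
    speed: (d/2,) per-plane rate; a 'semantic speed' of ~0 leaves a persistent
           fact untouched, larger values let the representation drift with
           elapsed time dt. A sketch of the idea, not RoMem's code.
    """
    x = emb.reshape(-1, 2)
    theta = speed * dt
    cos, sin = np.cos(theta), np.sin(theta)
    rotated = np.stack([x[:, 0] * cos - x[:, 1] * sin,
                        x[:, 0] * sin + x[:, 1] * cos], axis=1)
    return rotated.reshape(-1)

# A fast-evolving fact (high speed) decorrelates from its original embedding,
# while a persistent fact (speed ~ 0) stays put:
emb = np.random.default_rng(0).standard_normal(8)
print(np.dot(emb, phase_rotate(emb, dt=30.0, speed=np.full(4, 0.001))))  # ~||emb||^2
print(np.dot(emb, phase_rotate(emb, dt=30.0, speed=np.full(4, 0.5))))   # decorrelated
```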

Efficient RL Training for LLMs with Experience Replay huggingface.co

Experience replay techniques for large language model post-training trade off sample staleness and gradient variance against computational cost while maintaining performance and policy entropy.
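
The core tension, sketched below, is that replaying old rollouts saves generation compute but those rollouts came from a stale policy. A generic correction (not necessarily this paper's) is an age cutoff plus a clipped importance weight from the policy probability ratio:

```python
import math, random
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Rollout:
    tokens: list
    old_logprob: float   # sum of log-probs under the policy that generated it
    step_created: int

@dataclass
class ReplayBuffer:
    max_age: int = 4     # discard rollouts older than this many policy updates
    items: List[Rollout] = field(default_factory=list)

    def add(self, r: Rollout):
        self.items.append(r)

    def sample(self, step: int, k: int, new_logprob: Callable[[Rollout], float]):
        """Sample k rollouts, dropping stale ones and attaching a clipped
        importance weight exp(new - old) to correct for off-policyness.
        `new_logprob` scores a rollout under the *current* policy.
        A sketch of a generic scheme, not the paper's exact algorithm."""
        self.items = [r for r in self.items if step - r.step_created <= self.max_age]
        batch = random.sample(self.items, min(k, len(self.items)))
        return [(r, min(math.exp(new_logprob(r) - r.old_logprob), 2.0))  # clip at 2
                for r in batch]
```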

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping huggingface.co

MEDS is a memory-enhanced dynamic reward shaping framework that improves sampling diversity in reinforcement learning for large language models by identifying and penalizing recurrent error patterns through clustering of historical behavioral signals.
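
One concrete way to realize "penalizing recurrent error patterns through clustering" is to embed past failures, cluster them, and dock the reward of any new failure that lands in a crowded cluster, so the policy is pushed to fail in new ways rather than repeat old ones. A scikit-learn sketch of that shaping idea; the thresholds and penalty form are assumptions, not MEDS's formulation:

```python
import numpy as np
from sklearn.cluster import KMeans

def shaped_rewards(rewards, embeddings, history, k=8, penalty=0.2, radius=1.0):
    """Penalize rollouts whose failure embedding is close to a recurring
    error cluster built from `history`, an (m, d) array of past-failure
    embeddings. Sketch of the shaping idea, not MEDS's exact method."""
    if len(history) < k:
        return rewards
    km = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(history)
    counts = np.bincount(km.labels_, minlength=k)
    shaped = np.array(rewards, dtype=float)
    for i, (r, e) in enumerate(zip(rewards, embeddings)):
        if r > 0:            # assume positive reward means success: only shape failures
            continue
        d = np.linalg.norm(km.cluster_centers_ - e, axis=1)
        j = int(d.argmin())
        if d[j] < radius:    # a *recurrent* mistake: near a crowded cluster
            shaped[i] -= penalty * np.log1p(counts[j])
        # novel failures keep their raw reward, preserving exploration
    return shaped
```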

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing huggingface.co

Audio-Omni presents the first end-to-end framework unifying audio generation and editing across sound, music, and speech domains using a frozen multimodal language model and trainable diffusion transformer architecture.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models huggingface.co

Unified multimodal models suffer from pseudo-unification, with asymmetric encoding and split response patterns across modalities; genuine multimodal synergy requires consistent information flow between them.

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting huggingface.co

SCOPE enhances on-policy distillation by adapting supervision paths based on trajectory correctness, using teacher-perplexity-weighted KL distillation for incorrect trajectories and student-perplexity-weighted MLE for correct ones, achieving superior reasoning performance.
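
A minimal PyTorch rendering of the dual-path idea follows; the exact weighting functions are assumptions on my part, but the structure matches the summary: switch the supervision source on trajectory correctness and scale each path by the relevant perplexity.

```python
import torch
import torch.nn.functional as F

def scope_like_loss(student_logits, teacher_logits, targets, correct: bool):
    """Dual-path distillation loss sketch (not SCOPE's exact weights).

    student_logits, teacher_logits: (seq, vocab); targets: (seq,) token ids.
    correct: whether the sampled trajectory solved the task.
    """
    if correct:
        # MLE on the trajectory, scaled by student perplexity: a student that
        # already predicts the tokens well gets a smaller gradient.
        nll = F.cross_entropy(student_logits, targets)
        weight = torch.exp(nll).detach()          # student perplexity
        return weight * nll
    else:
        # KL toward the teacher, scaled by inverse teacher perplexity:
        # trust the teacher more where it is confident.
        t_nll = F.cross_entropy(teacher_logits, targets)
        weight = torch.exp(-t_nll).detach()       # 1 / teacher perplexity
        kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                      F.log_softmax(teacher_logits, dim=-1),
                      log_target=True, reduction="batchmean")
        return weight * kl
```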

Continuous Adversarial Flow Models huggingface.co

Continuous adversarial flow models improve image generation by using an adversarial objective with a learned discriminator to better align samples with target distributions, achieving superior results on ImageNet and text-to-image tasks.

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks huggingface.co

Large language models demonstrate limited general reasoning capabilities despite strong domain-specific performance, as revealed by a new benchmark assessing K-12 level reasoning across diverse problem types.

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind huggingface.co

Large language models struggle with theory-of-mind reasoning in adversarial interactions, but AI double agents trained with reinforcement learning and bidirectional reward optimization show improved belief manipulation and theory-of-mind capabilities.

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration huggingface.co

A nonlinear extrapolation framework reduces the computational overhead of reinforcement learning with verifiable rewards in large language models by modeling rank-1 parameter trajectories during LoRA training and extending them with a learned predictor.
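
Stripped to its essence, trajectory extrapolation means recording a parameter's value at successive checkpoints, fitting a smooth curve, and jumping ahead instead of paying for the intervening RL updates. A toy numpy sketch with a polynomial fit standing in for the paper's learned nonlinear predictor:

```python
import numpy as np

def extrapolate_params(checkpoints: np.ndarray, steps: np.ndarray,
                       target_step: float, degree: int = 2) -> np.ndarray:
    """Fit a low-degree polynomial to each parameter's trajectory across
    saved checkpoints and evaluate it at a future step.

    checkpoints: (n_ckpts, n_params) flattened (e.g. LoRA) parameters.
    A sketch of trajectory extrapolation in general, not the paper's
    predictor.
    """
    coeffs = np.polynomial.polynomial.polyfit(steps, checkpoints, deg=degree)
    return np.polynomial.polynomial.polyval(target_step, coeffs)

# Toy check: parameters moving along a quadratic are recovered exactly.
steps = np.array([100., 200., 300., 400.])
true = lambda t: np.stack([1e-3 * t, 2e-6 * t ** 2], axis=-1)
ckpts = true(steps)                              # (4, 2)
print(extrapolate_params(ckpts, steps, 800.0))   # matches true(800.)
```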

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation huggingface.co

Effective multilingual teacher models for synthetic data generation are identified through systematic evaluation of data quality metrics rather than model size alone, with findings showing that prompt diversity and response fluency better predict student model performance.

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs huggingface.co

IceCache is a novel KV cache management strategy that uses semantic token clustering and PagedAttention to improve memory efficiency and performance in long-sequence LLM inference.
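
A bare-bones version of semantic KV clustering: group cached key vectors with a few k-means iterations and keep one centroid key plus the mean value per cluster, trading some attention fidelity for a smaller cache. The grouping below and its interaction with paging are assumptions for illustration, not IceCache's algorithm.

```python
import numpy as np

def compress_kv(keys: np.ndarray, values: np.ndarray, keep_ratio: float = 0.25,
                iters: int = 10, seed: int = 0):
    """Cluster cached key vectors with a few k-means steps and replace each
    cluster by its centroid key and the mean of its values.

    keys, values: (n_tokens, head_dim). A sketch of KV clustering in general;
    IceCache's actual grouping and PagedAttention integration differ.
    """
    n, d = keys.shape
    k = max(1, int(n * keep_ratio))
    rng = np.random.default_rng(seed)
    centers = keys[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((keys[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            m = assign == j
            if m.any():
                centers[j] = keys[m].mean(0)
    new_vals = np.stack([values[assign == j].mean(0) if (assign == j).any()
                         else values[0] * 0 for j in range(k)])
    return centers, new_vals  # (k, d) each: a 4x-smaller cache at keep_ratio=0.25
```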

Counting to Four is still a Chore for VLMs huggingface.co

Vision-language models exhibit counting failures due to reduced visual evidence utilization in later language layers, which can be mitigated through modality attention share interventions.

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs huggingface.co

Lineage analysis reveals structural patterns and systemic issues in LLM dataset evolution, enabling more diverse and controlled data curation through lineage-aware sampling approaches.

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization huggingface.co

Mobile GUI agents using MLLMs can execute complex tasks while addressing privacy personalization through trajectory-induced preference optimization that improves persona alignment and task executability.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation huggingface.co

Efficient motion generation is achieved through compressed motion embeddings and conditional flow-matching models that produce realistic long-term motions from text prompts or spatial inputs.

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training huggingface.co

TorchUMM presents a unified codebase for evaluating and analyzing multimodal models across understanding, generation, and editing tasks with standardized protocols and diverse datasets.

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach huggingface.co

MedSSR enhances medical reasoning in large language models through knowledge-enhanced data synthesis and semi-supervised reinforcement learning, improving performance on rare disease tasks while reducing reliance on expensive reasoning trace distillation.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation huggingface.co

A survey of the attention-sink phenomenon in transformers, where excessive attention concentrates on uninformative tokens and degrades interpretability and performance; it covers how sinks are used in practice, how they arise mechanistically, and how they can be mitigated.

Zero-shot World Models Are Developmentally Efficient Learners huggingface.co

A computational model called Zero-shot Visual World Model demonstrates how children can efficiently learn physical world understanding from limited first-person experiences, generating competent behavior across multiple benchmarks while mimicking developmental patterns and brain-like representations.

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation huggingface.co

OmniShow is an end-to-end framework for human-object interaction video generation that effectively integrates multiple modalities through unified conditioning and attention mechanisms while addressing data scarcity via decoupled training strategies.

ADD for Multi-Bit Image Watermarking huggingface.co

ADD is a multi-bit image watermarking method that uses linear combination and inner product operations for embedding and decoding, achieving high accuracy and efficiency compared to existing approaches.
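
"Linear combination and inner product" maps naturally onto classic spread-spectrum watermarking: embed bit i by adding a signed copy of a fixed random carrier pattern, decode it as the sign of an inner product with that pattern. A self-contained numpy round trip in that classic style (ADD's actual construction may differ in the details):

```python
import numpy as np

def make_patterns(n_bits, shape, seed=0):
    # Near-orthogonal random carrier patterns, one per payload bit.
    rng = np.random.default_rng(seed)
    p = rng.standard_normal((n_bits, *shape))
    norms = np.linalg.norm(p.reshape(n_bits, -1), axis=1)
    return p / norms.reshape(n_bits, *([1] * len(shape)))

def embed(img, bits, patterns, alpha=5.0):
    # Linear combination: add a +/- copy of each carrier (alpha trades
    # imperceptibility against robustness to host-image interference).
    signs = 2.0 * np.asarray(bits) - 1.0            # {0,1} -> {-1,+1}
    return img + alpha * np.tensordot(signs, patterns, axes=1)

def decode(img, patterns):
    # Inner product with each carrier; the sign recovers the bit blindly.
    corr = (patterns * img).reshape(len(patterns), -1).sum(axis=1)
    return (corr > 0).astype(int)

# Round trip on a random host image (classic spread-spectrum watermarking,
# not necessarily ADD's exact operators):
rng = np.random.default_rng(1)
img = rng.standard_normal((64, 64))
bits = rng.integers(0, 2, size=32)
pats = make_patterns(32, (64, 64))
print((decode(embed(img, bits, pats), pats) == bits).mean())  # expect 1.0
```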

SHARE: Social-Humanities AI for Research and Education huggingface.co

SHARE models are causal language models pre-trained specifically for the social sciences and humanities that match general-purpose model performance; a companion tool, MIRROR, provides a text-review interface that preserves critical engagement without generating content.

TRACE: Capability-Targeted Agentic Training huggingface.co

TRACE enables LLM agents to improve in agentic environments by identifying capability gaps through trajectory comparison, creating targeted training environments, and using LoRA adapters for efficient, environment-specific self-improvement.

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator huggingface.co

Uni-ViGU presents a generation-centric approach to unified multimodal video understanding and generation by extending video generation as a foundation through unified flow matching and bidirectional training mechanisms.

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation huggingface.co

Video diffusion models struggle with temporal control and semantic coherence in multi-event sequences, but a new inference-time method enables fine-grained temporal control through cross-attention penalties that improve alignment and reduce semantic interference.

Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models huggingface.co

Masked diffusion language models can be accelerated through strategic replacement of full models with smaller ones during specific denoising steps, achieving reduced computational costs with minimal impact on generation quality.
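
The scheduling idea reduces to a per-step dispatch decision between a small/large denoiser pair. The sketch below front-loads the large model, on the intuition that early steps, when most tokens are still masked, propagate errors furthest; which steps actually deserve the large model is precisely the paper's empirical question. `large` and `small` are stand-ins for real denoisers.

```python
from typing import Callable, List, Sequence

MASK = "<mask>"

def scheduled_denoise(tokens: List[str], steps: int,
                      large: Callable, small: Callable,
                      large_steps: Sequence[int]) -> List[str]:
    """Run a masked-diffusion-style unmasking loop, dispatching each step to
    the large or small denoiser according to a precomputed schedule.

    `large`/`small` take (tokens) and return (position, token) for the next
    unmasking. A sketch of model scheduling in general, not the paper's
    schedule-search procedure.
    """
    planned = set(large_steps)
    for t in range(steps):
        if MASK not in tokens:
            break
        model = large if t in planned else small
        pos, tok = model(tokens)
        tokens[pos] = tok
    return tokens

# e.g. large model for the first quarter of steps, small for the rest:
#   scheduled_denoise(tokens, steps, large, small, range(steps // 4))
```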

Panoptic Pairwise Distortion Graph huggingface.co

Researchers introduce a novel approach to image assessment by representing image pairs as structured distortion graphs that capture region-level degradation information, challenging existing multimodal models’ ability to understand fine-grained visual differences.

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation huggingface.co

A modular framework called SPASM is presented for generating stable multi-turn dialogues with consistent personas, addressing issues like persona drift and echoing through a perspective-agnostic context projection method.

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series huggingface.co

The Bielik v3 PL series achieves improved language-specific performance through specialized Polish tokenization, FOCUS-based embeddings, and multi-stage training with supervised fine-tuning, direct preference optimization, and reinforcement learning.

DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain huggingface.co

A new hierarchical, multi-view benchmark called DiningBench is introduced to evaluate vision-language models on fine-grained food classification, nutrition estimation, and visual question answering, revealing current models’ limitations in fine-grained visual discrimination and nutritional reasoning.

Strips as Tokens: Artist Mesh Generation with Native UV Segmentation huggingface.co

SATO introduces a novel token ordering strategy for autoregressive transformers that preserves edge flow and semantic layout in mesh generation through triangle strip-based sequences.

TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction huggingface.co

TAIHRI is a vision-language model designed for egocentric human-robot interaction that enables precise 3D keypoint localization through 2D keypoint reasoning and next token prediction.

BMdataset: A Musicologically Curated LilyPond Dataset huggingface.co

A curated LilyPond dataset and adapted CodeBERT model demonstrate that expert-curated small datasets can outperform large noisy corpora for music understanding tasks.

ATANT: An Evaluation Framework for AI Continuity huggingface.co

ATANT presents an open framework for evaluating AI system continuity through a 10-checkpoint methodology using a 250-story corpus across 6 life domains, achieving 100% accuracy in cumulative testing.

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding huggingface.co

Speculative decoding evaluation requires diverse workloads to measure performance accurately, a diversity that existing benchmarks lack; SPEED-Bench is introduced for standardized assessment across semantic domains and serving regimes.

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation huggingface.co

QuanBench+ evaluates large language models on quantum code generation across multiple frameworks using functional testing and repair-based feedback, revealing significant progress but persistent dependence on framework-specific knowledge.

References

ETH Zurich SRI Lab blog (FixedCode) sri.inf.ethz.ch

agents attempt to ‘fix’ resolved issues over 50% of the time, often introducing unnecessary modifications instead of submitting an empty patch

Cyfrin blog — ‘Why AI coding agents can be overkill’ cyfrin.io

agents consuming over 21,000 tokens to correct a single-character typo in a README, essentially ‘over-preparing’ by pulling in excessive repository metadata

arXiv 2506.12286 (AgenTracer / counterfactual replay) arxiv.org

AgenTracer-8B… achieved a step-level accuracy of 42.86% on its automated subset, outperforming proprietary giants like Gemini 2.5 Pro and Claude 4 Sonnet by over 11%

Langfuse — LangSmith alternative FAQ langfuse.com

Langfuse’s reliance on OpenTelemetry (OTel) standards makes it more flexible for ‘polyglot’ stacks

r/webdevelopment thread on debugging AI-generated code reddit.com

agents currently act as ‘high-speed juniors’ who lack the heuristics to identify when they are confused, leading to ‘Jenga tower’ codebases

AI Native Foundation paper digest ainativefoundation.org

diagnostic styles vary by model — with GPT-5 prioritizing efficiency and Claude Sonnet 4 favoring comprehensive retrieval

MarkTechPost — AIO Sandbox release coverage marktechpost.com

Agent-Infra releases AIO Sandbox, an all-in-one runtime for AI agents with browser, shell, shared filesystem and MCP

Cocoa-Agent implementation review (nxcode.io) nxcode.io

Claude Sonnet 4.6 showed higher instability, with performance dropping from 34.0% in other frameworks to just 15.7% under Cocoa-Agent

Hugging Face — ‘eval costs bottleneck’ blog huggingface.co

agent benchmarks are highly scaffold-sensitive, with identical tasks showing a 33x cost spread depending on the configuration … success rates on some benchmarks fell from 60% on a single run to 25% when subjected to 8-run consistency checks

Kili Technology — 2026 AI benchmarks guide kili-technology.com

every major benchmark could be ‘exploited’ to achieve 100% scores without actually solving tasks … agents can ‘cheat’ WebArena by using browser primitives to navigate to local file:// URLs and read hidden answer keys

WorkOS — GAIA benchmark explainer workos.com

GAIA evaluates agents on … ‘conceptually simple but tedious’ tasks that humans solve with 92% accuracy

ResearchGate — ‘Advances and Frontiers of LLM-based Issue Resolution’ survey researchgate.net

SWE-Lego-Qwen3-8B currently stands as a top performer, achieving a 42.2% resolve rate on SWE-Bench Verified using supervised fine-tuning (SFT) alone, which rises to 49.6% with test-time scaling (TTS@16).

ofox.ai — 2026 LLM leaderboard roundup ofox.ai

OpenAI officially retired the benchmark from its internal evaluations, citing severe contamination… a separate study found that 59.4% of the hardest tasks were actually flawed, with models submitting correct fixes that were rejected by the benchmark’s narrow test cases.

arXiv 2601.16746v3 — SWE-Pruner arxiv.org

SWE-Pruner introduces a 0.6B parameter ‘neural skimmer’ that selectively prunes lines of code based on the agent’s current goal, claiming 23–54% token reduction while maintaining or even improving success rates.
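
Goal-conditioned pruning of this kind can be sketched as: score every source line against the agent's current goal, keep the top fraction plus neighboring lines for readability, and collapse pruned runs into a marker. `score` below is a stand-in for the 0.6B skimmer; the heuristics are mine, not the paper's.

```python
from typing import Callable, List

def prune_file(lines: List[str], goal: str,
               score: Callable[[str, str], float],  # stand-in for a small scorer
               keep_ratio: float = 0.4, context: int = 1) -> List[str]:
    """Keep the lines most relevant to `goal`, plus `context` neighbors, and
    collapse pruned runs into an ellipsis marker. A sketch of goal-conditioned
    pruning in general, not SWE-Pruner's architecture."""
    scores = [score(goal, ln) for ln in lines]
    k = max(1, int(len(lines) * keep_ratio))
    cutoff = sorted(scores, reverse=True)[k - 1]
    keep = set()
    for i, s in enumerate(scores):
        if s >= cutoff:
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    out, gap = [], False
    for i, ln in enumerate(lines):
        if i in keep:
            out.append(ln)
            gap = False
        elif not gap:
            out.append("# ... pruned ...")
            gap = True
    return out
```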

InfoQ — Opus 4.6 context compaction infoq.com

Claude Code’s context compaction is an ‘extractive summarization’ process that triggers automatically when the context window reaches approximately 65–75% utilization… preserves ‘high-signal tokens’ such as specific architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs or verbose reasoning.

ACL Findings 2025 — SkyRL-Agent aclanthology.org

SkyRL-Agent (SA-SWE-32B) reached a higher success rate of 39.4%, reportedly achieving this with a 2x cost reduction… attributed to its ‘asynchronous pipeline dispatcher’ and AST-search tool, which reduces the ‘noise’ in the agent’s context window.

arXiv 2604.11716v1 — SWE-AGILE author limitations section arxiv.org

the sliding window size (N) was set randomly between 2 and 5 during training. The authors admit that a more systematic study of the optimal window size for different task complexities is still needed… tool outputs (like long stack traces or file contents) still consume significant context.
