SWE-WebDevBench caps at 60%, φ_first beats voting, Qwen3-4B splits think/speak
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies huggingface.co
SWE-WebDevBench is a comprehensive evaluation framework that assesses AI-powered application development platforms across multiple dimensions, including requirement understanding, architectural decision-making, code quality, and production readiness.
The First Token Knows: Single-Decode Confidence for Hallucination Detection huggingface.co
First-token confidence (φ_first), derived from the distribution over the first decoded token, matches or exceeds semantic self-consistency at detecting hallucinations while requiring only a single decode.
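The excerpt doesn't spell out the exact scoring function; as a minimal sketch, assuming φ_first is the maximum softmax probability of the first decoded token (the function name `phi_first` is mine, not the paper's API), the idea looks like:

```python
import numpy as np

def phi_first(first_token_logits: np.ndarray) -> float:
    """Confidence from a single decode: the max softmax probability
    of the first generated token. Low values flag likely hallucination."""
    z = first_token_logits - first_token_logits.max()  # stabilize softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

# Peaked distribution (model is sure) vs. near-flat one (model is guessing).
confident = phi_first(np.array([8.0, 1.0, 0.5, 0.2]))
uncertain = phi_first(np.array([1.1, 1.0, 0.9, 0.8]))
```

One forward pass, no sampling of multiple completions, which is where the efficiency win over semantic self-consistency comes from.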
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning huggingface.co
Side-by-Side Interleaved Reasoning enables controlled disclosure timing in autoregressive models, improving accuracy and efficiency through interleaved private reasoning and delayed content release.
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents huggingface.co
OpenSearch-VL releases a full training stack for visual search agents, combining Wikipedia path sampling, fuzzy entity rewriting, and a GRPO variant with advantage clamping. The agents wield OCR, cropping, and super-resolution tools, beating prior baselines across multimodal retrieval benchmarks.
RLDX-1 Technical Report huggingface.co
RLWRLD’s RLDX-1 fuses heterogeneous sensor modalities through cross-modal joint self-attention, running in real time on a humanoid platform. The general-purpose policy outperforms existing vision-language-action models on both simulation benchmarks and complex real-world dexterous manipulation tasks.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction huggingface.co
MiniCPM-o 4.5 replaces turn-based interaction with Omni-Flow, a streaming framework that aligns vision, speech, and text along a shared temporal axis. The model perceives and responds simultaneously while remaining small enough for parameter-efficient fine-tuning on edge hardware.
Stream-T1: Test-Time Scaling for Streaming Video Generation huggingface.co
Stream-T1, one half of the Stream-X1 series, adds test-time temporal guidance for consistency in streaming video generation, reporting gains in motion quality and text alignment without extra inference cost.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation huggingface.co
Stream-R1 improves video diffusion model distillation by adaptively weighting supervision based on reliability and perplexity, enhancing visual quality, motion quality, and text alignment without additional computational overhead.
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World huggingface.co
PhysForge pairs a visual-language planner that drafts a Hierarchical Physical Blueprint with a diffusion model that synthesizes geometry and kinematics via KineVoxel Injection. The output assets drop directly into physics simulators without manual rigging or constraint authoring.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning huggingface.co
KinDER procedurally generates environments stressing kinematic and dynamic constraints, then evaluates imitation learning, reinforcement learning, and foundation-model baselines on the same tasks. Real-to-sim-to-real experiments expose where current embodied reasoning systems fail under physical constraint pressure.
A Foundation Model for Zero-Shot Logical Rule Induction huggingface.co
NRI reframes Inductive Logic Programming as a pretraining problem, encoding literals through domain-agnostic statistics and decoding rules in parallel to preserve disjunction permutation invariance. A product T-norm relaxation makes rule execution differentiable, letting one model induce rules across unseen domains.
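The product T-norm relaxation mentioned above replaces hard Boolean connectives with multiplications over soft truth values in [0, 1], which is what makes rule execution differentiable. A toy sketch (the names `t_norm_and` and `t_conorm_or` are mine, not NRI's API):

```python
import numpy as np

def t_norm_and(truths: np.ndarray) -> float:
    """Product T-norm: soft conjunction of truth values in [0, 1].
    Products of sigmoids stay differentiable, so rules train end to end."""
    return float(np.prod(truths))

def t_conorm_or(truths: np.ndarray) -> float:
    """Dual T-conorm for soft disjunction: 1 - prod(1 - a_i)."""
    return float(1.0 - np.prod(1.0 - truths))

# A rule body "p(x) AND q(x)" with soft literal scores, then a
# disjunction of that clause with a second, weaker clause:
body = t_norm_and(np.array([0.9, 0.8]))
head = t_conorm_or(np.array([body, 0.1]))
```

With crisp 0/1 inputs both operators reduce exactly to Boolean AND/OR, which is the sanity check usually applied to such relaxations.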
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems huggingface.co
Researchers introduce BRIGHT-Pro, an expanded expert-annotated benchmark for reasoning-intensive retrieval, and RTriever-Synth, an aspect-decomposed synthetic corpus, to improve retriever performance through agentic search evaluation and LoRA fine-tuning.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation huggingface.co
JoyAI-Image couples a spatially enhanced MLLM with MMDiT to unify visual understanding, text-to-image generation, and instruction-guided image editing with stronger spatial reasoning.
Diffusion Model as a Generalist Segmentation Learner huggingface.co
Pretrained diffusion models can be adapted for semantic and open-vocabulary segmentation tasks through latent space conditioning and text-guided alignment, achieving state-of-the-art performance across diverse domains.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning huggingface.co
ResRL improves LLM reasoning by decoupling semantic distributions between positive and negative responses through negative sample projection, maintaining diversity while outperforming existing methods on multiple benchmarks.
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation huggingface.co
HERMES++ combines 3D scene understanding and future geometry prediction through BEV representation, LLM-enhanced queries, temporal linking, and joint geometric optimization for autonomous driving applications.
Lightning Unified Video Editing via In-Context Sparse Attention huggingface.co
An in-context sparse attention framework enables efficient, unified video editing, cutting computational cost while maintaining visual quality.
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing huggingface.co
Large language models demonstrate limited creative problem-solving abilities when required to repurpose objects based on affordance reasoning, indicating a gap in current AI capabilities for novel tool usage.
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity huggingface.co
XL-SafetyBench presents a multilingual safety benchmark with 5,500 test cases across 10 country-language pairs to evaluate both universal and culturally specific harms in language models.
Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments huggingface.co
Autonomous Preference Optimization (APO) tackles reasoning alignment for multimodal large language models under concept drift, turning the drift itself into a constraint to improve robustness and performance.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models huggingface.co
A new training approach called D-OPSD enables efficient supervised fine-tuning for diffusion models by leveraging on-policy self-distillation with text and multimodal features while preserving few-step inference capabilities.
MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills huggingface.co
MedSkillAudit, a domain-specific audit framework for medical research agent skills, shows assessment consistency comparable to expert review, supporting governance of specialized AI capabilities in healthcare applications.
Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback huggingface.co
Three methods for multi-view proficiency estimation—SkillFormer, PATS, and ProfVLM—achieve state-of-the-art accuracy on Ego-Exo4D with reduced parameters and training epochs while enabling interpretable feedback generation.
TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos huggingface.co
A novel reconstruction pipeline for table tennis gameplay from monocular broadcast videos achieves high-fidelity 3D ball trajectories and spin estimation through a learned lifting network that operates on unsegmented 2D tracks before time segmentation.
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music huggingface.co
A large-scale multi-task learning framework for AI-generated music predicts both popularity and aesthetic quality using frozen audio embeddings from a self-supervised music understanding model, demonstrating strong generalization across different generative architectures.
References
The Hacker News (Wiz disclosure on Base44) thehackernews.com
the api/apps/{app_id}/auth/register and api/apps/{app_id}/auth/verify-otp endpoints lacked proper access controls… the app_id was a non-secret value easily found in public manifest files, allowing attackers to programmatically register verified accounts for private enterprise applications
PCMag — RedAccess vibe-coding report pcmag.com
over 5,000 ‘vibe-coded’ applications—built with tools like Base44, Lovable, and Replit—possessed virtually no authentication or security… 89% of Lovable-generated apps were found to lack Supabase Row Level Security
Tracxn company profile / QwikBuild tracxn.com
Nilesh Trivedi co-founded Snowmountain AI in 2023… QwikBuild-Bench, a company-developed metric comparing AgentQ against five other AI coding platforms—the tool captured 97.7% of user requirements on the first attempt, compared to a competitor high of 57%
EmergentMind summary of FeatBench (arXiv 2509.22237) emergentmind.com
FeatBench focuses on incremental development… FeatBench success rates peak at roughly 30%, whereas top models like Claude 4.7 and GPT-5.5 achieve 60-71% on Vibe Code Bench… Vibe Code Bench showed a 42.7% performance gap between top and bottom models, compared to a mere 2.8% gap on SWE-bench
Veracode blog — Base44 vulnerability commentary veracode.com
48% of AI-generated files contain at least one known vulnerability… the verification gap—the lack of a security review layer between AI output and production—remains the primary threat
ACL Anthology — LLM-as-judge audit (IJCNLP 2025) aclanthology.org
Multi-tier scoring systems (e.g., 0/0.5/1 rubrics) intended to capture partial correctness often fail to align with human experts… a ‘thinking tier’ bias has been observed: judges systematically assign higher scores (up to +1.5 points) to models marketed as ‘reasoning’ tiers
Artefact (industry blog) — ‘Detecting hallucinations in LLMs one token at a time’ artefact.com
WEPR… uses weighted averages of entropy across a sequence and has achieved ROC-AUC scores up to 93.6 in financial RAG settings
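WEPR's exact weighting scheme isn't given in the excerpt; a hedged sketch with uniform weights as a placeholder, averaging per-token Shannon entropies over the sequence:

```python
import numpy as np

def weighted_entropy_score(logits: np.ndarray, weights=None) -> float:
    """Sequence-level uncertainty: a weighted average of per-token
    Shannon entropies of the output distribution at each step.
    logits: (seq_len, vocab_size). Higher score = more uncertainty."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)       # per-token entropy
    if weights is None:
        weights = np.ones_like(ent)                   # uniform placeholder
    return float((weights * ent).sum() / weights.sum())

peaked = weighted_entropy_score(np.array([[9.0, 0.0, 0.0], [8.0, 0.5, 0.0]]))
flat = weighted_entropy_score(np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]))
```

Unlike φ_first, this looks at the whole sequence, so it needs every step's logits rather than just the first token's.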
Azaria & Mitchell (2023) — ‘The Internal State of an LLM Knows When It’s Lying’ (SAPLMA) researchgate.net
the model’s hidden layer activations contain discernible patterns that correlate with truthfulness, even when the final output is false
INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection (ICLR 2024) arxiv.org
EigenScore… applies spectral analysis to the covariance matrix of hidden states across multiple sampled outputs to measure semantic consistency in the latent space
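INSIDE's EigenScore has its own precise formulation in the paper; the following is only a sketch of the spectral-consistency idea it describes, assuming k sampled answer embeddings stacked row-wise, a regularized covariance, and a mean log-eigenvalue as the divergence measure:

```python
import numpy as np

def eigenscore(hidden: np.ndarray, alpha: float = 1e-3) -> float:
    """hidden: (k, d) embeddings of k sampled answers to one question.
    Spectral consistency check: mean log-eigenvalue of the regularized
    covariance. Diverse (inconsistent) samples yield a larger score,
    suggesting the model is hallucinating rather than recalling."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    cov = centered @ centered.T / hidden.shape[1]     # (k, k) covariance
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(len(cov)))
    return float(np.mean(np.log(eigvals)))
```

If all k answers are near-identical, the centered matrix collapses and the score bottoms out at log(alpha); semantically scattered answers spread variance across many eigenvalues and push the score up.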
Semantic Energy preprint (arXiv 2605.02241) arxiv.org
softmax normalization destroys ‘evidence strength’ information… a model might assign high probability to a token simply because it knows only one (incorrect) way to answer
Linear-probe long-form hallucination study (PMC12078457) pmc.ncbi.nlm.nih.gov
lightweight ‘linear probes’ that analyze hidden model states can achieve an AUC of 0.90, significantly outperforming adapted semantic entropy methods which lagged at 0.71
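The probes in that study are essentially logistic regressions fit on frozen hidden states; a numpy-only illustration on synthetic activations (not the paper's data or setup):

```python
import numpy as np

def fit_linear_probe(X: np.ndarray, y: np.ndarray, lr=0.5, steps=2000):
    """Train a logistic-regression probe on hidden states X (n, d)
    with binary truthfulness labels y (n,) via gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # probe predictions
        g = p - y                                # logistic-loss gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Toy "hidden states": dimension 0 encodes truthfulness, dimension 1 is noise.
X = np.array([[-1.0, 0.0], [-1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_linear_probe(X, y)
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

The appeal is the training cost: a d-dimensional linear model per layer, versus repeated sampling for semantic-entropy baselines.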
StartupHub.ai practitioner write-up startuphub.ai
if a model begins a response with a standard template (e.g., ‘As an AI language model…’ or ‘Sure, here is…’), the entropy of that first token reflects the model’s training on conversational fillers rather than its confidence in the subsequent factual claim
ICML 2026 poster page icml.cc
Accepted for ICML 2026 (poster) — establishes peer-review provenance for the SxS interleaved-reasoning framework.
GoPenAI blog — ‘Hidden Chain-of-Thought Reasoning Without Saying Why’ blog.gopenai.com
OpenAI hides raw reasoning chains in o1/o3 to prevent distillation; critics argue users ‘pay for invisible reasoning tokens’ with no debugging path.
Emergent Mind — Hidden Chain-of-Thought topic survey emergentmind.com
Anthropic and Apollo Research observed instances where hidden CoT contained explicit terms like ‘sabotage’ and ‘lying’; Bengio warns private reasoning makes traditional safety checks largely ineffective.
arXiv 2505.04588 (Alibaba ZeroSearch) arxiv.org
Curriculum-based RL with simulated search interleaving; targets training-cost reduction (~90%) rather than disclosure timing — a contemporaneous but orthogonal use of interleaving.
EMNLP 2025 Findings — Interleaved Reasoning via RL aclanthology.org
Reports up to 80% reduction in time-to-first-token and +19.3% Pass@1 on multi-hop reasoning when intermediate commitments are rewarded — corroborates the latency claims of SxS via a different method.
troels.im — ‘Why the structure of AI’s output matters’ troels.im
Autoregressive ‘probability distribution shaping’: early low-confidence tokens force the model into post-hoc justification, empirically validating the ‘premature commitment’ problem SxS targets.