Meta locks Sapiens2's license, Tuna-2 drops encoders, Apple routes KV layers
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Sapiens2 huggingface.co
Sapiens2 is a high-resolution transformer model family for human-centric vision that achieves superior performance through combined pretraining objectives, large-scale human image datasets, and architectural improvements enabling detailed dense prediction and semantic understanding.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation huggingface.co
Tuna-2 is a unified multimodal model that performs visual understanding and generation directly from pixel embeddings without pretrained vision encoders, achieving state-of-the-art performance in multimodal benchmarks.
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing huggingface.co
Transformer language models can reduce KV cache memory requirements through random cross-layer attention during training, enabling efficient depth-wise cache sharing without performance loss.
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents huggingface.co
ClawMark evaluates language-model coworker agents over multi-turn, multi-day workflows in a stateful sandboxed service environment that updates exogenously between sessions. Tasks span multiple service domains and are scored via rule-based verification of workflow completion, targeting agents that must persist context across days rather than single sessions.
Towards Understanding the Robustness of Sparse Autoencoders huggingface.co
Dropping pretrained sparse autoencoders into transformer residual streams cuts jailbreak attack success rates while preserving task performance, with defense strength varying by layer placement and L0 sparsity. The study also probes gradient structure and attack transferability to explain why the SAE bottleneck blocks adversarial directions.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models huggingface.co
Looped language models obey an iso-depth scaling law with a recurrence-equivalence exponent of 0.46, meaning each extra recurrence buys roughly the square-root of an added layer’s capacity. The work fits validation-loss curves across training compute, truncated backpropagation depth, and hyperconnection variants.
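The reported exponent supports a quick back-of-envelope reading. This sketch only assumes the simplest possible form of the law (layer-equivalent capacity scaling as loops raised to 0.46); the paper fits a richer functional form to validation-loss curves.

```python
# Illustrative only: with a recurrence-equivalence exponent of 0.46,
# looping a block `loops` times contributes capacity comparable to
# roughly loops**0.46 distinct layers (the exact law is fit in the paper).

def effective_layers(physical_layers: int, loops: int, alpha: float = 0.46) -> float:
    """Rough layer-equivalent capacity of a looped block."""
    return physical_layers * loops ** alpha

# A 4-layer block looped 4 times acts like ~4 * 4**0.46 ≈ 7.6 layers:
# each extra recurrence buys roughly the square root of a new layer.
print(round(effective_layers(4, 4), 1))  # 7.6
```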
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation huggingface.co
ProEval evaluates generative models using transfer learning over pretrained Gaussian Processes plus Bayesian quadrature, hunting failure cases via superlevel-set sampling. The uncertainty-aware strategy estimates performance and surfaces failures with far fewer samples than standard Monte Carlo evaluation, and ships as a Google DeepMind open-source release.
Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data huggingface.co
Zero-to-CAD synthesizes a million-scale corpus of executable, interpretable CAD construction sequences without any real CAD data, casting program generation as agentic LLM search inside a feedback-driven CAD environment. A vision-language model scores multi-view renders to enforce geometric validity against target boundary representations or meshes.
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company huggingface.co
OneManCompany organizes heterogeneous agents like a firm: portable agent identities trade on a Talent Market, and an Explore-Execute-Review tree search drives hierarchical task decomposition with proven termination and deadlock-freedom guarantees. The framework targets self-organizing, self-improving multi-agent companies rather than fixed pipelines.
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms huggingface.co
A survey maps the safety landscape for Vision-Language-Action models, cataloguing embodied threats including adversarial patches, cross-modal perturbations, semantic jailbreaks, freezing attacks, data poisoning and backdoors. It reviews evaluations and defenses spanning certified robustness, safety-aware training, and unified runtime architectures, with an accompanying Awesome-VLA-Safety repo.
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis huggingface.co
DataPRM, a novel environment-aware generative process reward model, enhances LLM reasoning in dynamic data analysis by detecting silent errors and employing a reflection-aware ternary reward strategy, achieving superior performance on benchmark tasks.
Improving Vision-language Models with Perception-centric Process Reward Models huggingface.co
A process reward model called Perceval enables token-level error detection and correction in vision-language models through perception-intensive training and fine-grained supervision during reinforcement learning.
Stabilizing Efficient Reasoning with Step-Level Advantage Selection huggingface.co
Short-context post-training induces reasoning compression but causes instability; Step-level Advantage Selection addresses this by selectively adjusting reasoning steps based on confidence and verification outcomes, improving accuracy-efficiency trade-off in reasoning tasks.
Efficient Agent Evaluation via Diversity-Guided User Simulation huggingface.co
DIVERT is a coverage-guided user simulation framework that efficiently evaluates large language models by reusing conversation prefixes and exploring diverse interaction paths through branching trajectories.
Why Fine-Tuning Encourages Hallucinations and How to Fix It huggingface.co
Supervised fine-tuning in large language models can cause factual hallucinations due to knowledge degradation, which can be reduced through self-distillation regularization and parameter freezing techniques.
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment huggingface.co
Large language model agents exhibit cognitive bias where self-reflection and mutual auditing lead to inconsistent error attributions, which are addressed through a dialectical reasoning framework that promotes perspective-invariant decision making.
Discovering Agentic Safety Specifications from 1-Bit Danger Signals huggingface.co
EPO-Safe enables large language model agents to discover hidden safety objectives through iterative experience and reflection using only binary danger warnings, demonstrating robust safety performance even with noisy feedback.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation huggingface.co
The World-R1 framework improves video generation by incorporating 3D constraints through reinforcement learning and specialized text datasets while maintaining visual quality and scalability.
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction huggingface.co
TexOCR addresses limitations of existing document OCR by reconstructing scientific PDFs into compilable LaTeX, introducing a new benchmark and training corpus and demonstrating improved structural accuracy and compilation reliability via reinforcement learning with verifiable rewards.
Improving Robustness of Tabular Retrieval via Representational Stability huggingface.co
Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics are unchanged. Semantically equivalent serializations such as CSV, TSV, HTML, Markdown, and DDL can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, the authors treat serialization embeddings as noisy views of a shared semantic representation.
PageGuide: Browser extension to assist users in navigating a webpage and locating information huggingface.co
PageGuide is a browser extension that enhances AI assistant interactions by providing visual grounding of responses in web page elements, improving verification, guidance, and focus during web browsing tasks.
RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing huggingface.co
Reconstruction as Validation (RaV-IDP) introduces a document processing pipeline that uses reconstruction and comparison against original sources to validate extraction quality, triggering fallback mechanisms when fidelity drops below thresholds.
SketchVLM: Vision language models can annotate images to explain thoughts and guide users huggingface.co
SketchVLM is a training-free framework that enables vision-language models to generate editable SVG overlays for visual explanations, improving reasoning accuracy and annotation quality across multiple benchmarks.
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning huggingface.co
ReVSI addresses flaws in current spatial intelligence evaluation by creating a validated benchmark with improved annotations and controlled frame sampling conditions.
For-Value: Efficient Forward-Only Data Valuation for finetuning LLMs and VLMs huggingface.co
For-Value is a forward-only data valuation framework that efficiently estimates data value using final hidden representations and prediction errors, enabling scalable batch processing without gradient computations.
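The gradient-free idea behind For-Value can be sketched in a few lines. Everything here is hypothetical: the function names and the exact scoring rule are illustrative stand-ins, assuming only the summary's premise that value is computed from final hidden representations and prediction errors via forward passes.

```python
# Hypothetical sketch of a forward-only valuation score in the spirit of
# For-Value: score a training example by how well its final hidden
# representation and prediction error align with those of a validation
# example, using forward passes only (no gradient computation).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_value(train_hidden, train_err, val_hidden, val_err):
    """Value of one training example w.r.t. one validation example."""
    # Representation similarity gates how much the error alignment counts.
    return dot(train_hidden, val_hidden) * dot(train_err, val_err)

# A training point whose representation and error direction both align
# with the validation point receives a positive value: 0.9 * 0.06 = 0.054.
v = forward_value([1.0, 0.0], [0.2, 0.1], [0.9, 0.1], [0.3, 0.0])
```

Because scoring needs only cached forward activations, batches of candidates can be valued in one pass, which is the scalability point the summary makes.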
ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers huggingface.co
ATTN-FIQA uses pre-softmax attention scores from Vision Transformers to assess face image quality without additional training or architectural changes.
EX-FIQA: Leveraging Intermediate Early eXit Representations from Vision Transformers for Face Image Quality Assessment huggingface.co
EX-FIQA, a ViT-based face image quality assessment method, utilizes intermediate representations through early-exit mechanisms and score-fusion strategies, demonstrating that different transformer block depths capture complementary quality-relevant information for improved performance.
Credal Concept Bottleneck Models for Epistemic-Aleatoric Uncertainty Decomposition huggingface.co
CREDENCE is a concept bottleneck model framework that decomposes concept uncertainty into epistemic and aleatoric components using credal predictions and ensemble methods, enabling more informed decision-making based on uncertainty signals.
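CREDENCE's credal machinery is its own contribution, but the generic ensemble decomposition it builds on is standard and easy to show: total predictive entropy splits into an aleatoric term (mean entropy of the ensemble members) and an epistemic term (the remainder, i.e. member disagreement). This sketch shows only that generic idea, not the paper's method.

```python
# Standard ensemble-based uncertainty decomposition: total predictive
# entropy = aleatoric (mean member entropy) + epistemic (disagreement).
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(member_probs):
    mean = [sum(col) / len(member_probs) for col in zip(*member_probs)]
    total = entropy(mean)                        # predictive entropy
    aleatoric = sum(entropy(p) for p in member_probs) / len(member_probs)
    return total, aleatoric, total - aleatoric   # epistemic = mutual info

# Two confident members that flatly disagree: low aleatoric entropy per
# member, but high epistemic uncertainty from the disagreement.
t, a, e = decompose([[0.9, 0.1], [0.1, 0.9]])
```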
OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer huggingface.co
OmniShotCut formulates shot boundary detection as structured relational prediction using a shot query-based dense video Transformer, addressing limitations of existing methods through synthetic transition generation and a comprehensive benchmark.
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance huggingface.co
An industrial maintenance system combines telemetry data with a knowledge graph to provide more reliable and explainable answers for asset diagnostics and failure analysis.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models huggingface.co
UniGeo is a camera-controllable image editing framework that addresses geometric drift and structural degradation by injecting unified geometric guidance across representation, architecture, and loss function levels.
Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation huggingface.co
A novel 3D LiDAR anomaly segmentation method operates directly in feature space to distinguish known from unknown objects, addressing limitations of existing datasets through mixed real-synthetic datasets with complex environments.
Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation huggingface.co
Persona-conditioned large language models exhibit context-dependent gender bias that varies with personality trait frameworks and across languages.
Quantum Kernel Advantage over Classical Collapse in Medical Foundation Model Embeddings huggingface.co
Quantum support vector machines demonstrate superior performance over classical linear and RBF kernels in binary insurance classification tasks using medical imaging data, with quantum kernels showing higher effective rank and better minority-class recall.
Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining huggingface.co
DeFI addresses vision-language-action model limitations by decoupling visual forward and inverse dynamics pretraining to improve 3D action prediction and enable learning from large-scale action-free video data.
References
OpenReview (ICLR 2026 reviews) openreview.net
The novelty was viewed as ‘engineering-centric,’ focusing on the integration of these paradigms at a massive scale for human-specific dense tasks rather than introducing new transformer primitives.
Hugging Face model card (facebook/sapiens2 license) huggingface.co
The Sapiens2 license strictly prohibits use for surveillance, biometric processing, unauthorized medical/legal practice, and collecting health or demographic information without explicit consent — and Meta retains the right to audit users’ storage and distribution.
MarkTechPost coverage marktechpost.com
Sapiens2-5B is one of the highest-FLOPs vision transformers reported to date, requiring roughly 15.722 TFLOPs for a single 1K-resolution pass.
GitHub: smthemex/ComfyUI_Sapiens github.com
FP16 conversion scripts reduce the VRAM footprint of the 1B segmentation models to approximately 2GB; community nodes (smthemex, lassiiter, Kijai) already wrap Sapiens2 for segmentation, normals, pose, depth, albedo, and pointmap, with GLB export for 3D workflows.
Sony AI (NeurIPS 2023 responsible data curation) ai.sony
Human-centric computer vision (HCCV) datasets often prioritize volume over privacy, treating human subjects as ‘free raw material’ that lacks the vital metadata necessary for comprehensive fairness evaluations.
arXiv 2604.21681 (paper, dense-probing tables) arxiv.org
In dense probing evaluations, Sapiens2-5B surpassed DINOv3-7B (6.71B parameters) across all tasks including pose estimation, despite the latter’s higher parameter count.
Brandon et al., ‘Reducing Transformer KV Cache Size with Cross-Layer Attention’ (arXiv 2405.12981) arxiv.org
CLA2 (sharing KV activations between pairs of consecutive layers) … achieves a 2x reduction in KV cache size with negligible perplexity degradation
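The pairing scheme CLA2 describes reduces to a simple layer-to-slot mapping. This is a minimal sketch of the bookkeeping only (the slot structure and naming are illustrative, not the paper's implementation): both layers in each consecutive pair read and write the same cache entry, halving the number of stored (K, V) tensors.

```python
# Minimal sketch of CLA2-style cache sharing: consecutive layer pairs
# share one KV cache slot, so an 8-layer model stores 4 caches, a 2x
# reduction in KV cache size (matching the quote above).

def kv_cache_slots(num_layers: int, share_factor: int = 2) -> list[int]:
    """Map each layer index to the cache slot it reads KV from."""
    return [layer // share_factor for layer in range(num_layers)]

slots = kv_cache_slots(8)    # [0, 0, 1, 1, 2, 2, 3, 3]
unique = len(set(slots))     # 4 slots instead of 8
```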
NAACL 2025 short paper — ‘A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference’ aclanthology.org
original CLA studies were performed on models trained from scratch … leaving questions about its effectiveness when ‘uptrained’ from existing pre-trained weights
LCKV GitHub (whyNLP/LCKV) github.com
Layer-Condensed KV Cache (LCKV) … allowing queries from all layers to pair with keys and values from specific ‘condensed’ layers … includes configuration files to replicate architectures like You Only Cache Once (YOCO)
Apple Machine Learning Research — Foundation Models Tech Report 2025 machinelearning.apple.com
~3-billion parameter on-device model … divided into two blocks with a 5:3 depth ratio, the second block shares key-value (KV) caches directly with the first, reducing memory usage by approximately 37.5%
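The reported figure follows directly from the depth split. A sanity check, assuming each layer would otherwise hold an equal-sized cache:

```python
# With a 5:3 depth split where the second block reuses the first block's
# KV caches, only 5 of every 8 layers store a cache: a 3/8 saving.
block1, block2 = 5, 3
saving = block2 / (block1 + block2)
print(f"{saving:.1%}")  # 37.5%
```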
Emergent Mind — Cross-layer KV cache sharing overview emergentmind.com
KVSharer … ‘plug-and-play’ … discovers a counterintuitive phenomenon where sharing dissimilar KV caches preserves performance better than sharing similar ones … ~28% memory savings without additional training
Hugging Face blog — KV cache quantization huggingface.co
INT8 quantization … 2x memory reduction with near-zero accuracy loss … applied post-training without retraining
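The mechanism behind that 2x figure is generic symmetric INT8 quantization. This is a sketch of the idea, not Hugging Face's implementation: store int8 values plus one float scale, reconstructing on read, so 8-bit storage replaces 16-bit activations.

```python
# Generic symmetric INT8 quantization of a cached tensor (illustrative):
# one float scale per tensor, values mapped into [-127, 127].

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard all-zero input
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.0, 0.25])
approx = dequantize(q, s)  # close to the originals, within one scale step
```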
facebookresearch/tuna-2 README (GitHub) github.com
Due to organizational policy constraints, we are unable to release the full production-trained weights. Instead, we provide a foundation checkpoint where a small number of layers in both the LLM backbone and the diffusion (flow) head are randomly re-initialized… the missing information can be recovered through a short fine-tuning pass.
ByteDance BAGEL GitHub github.com
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture… dual encoders capture both pixel-level and semantic features… achieves 0.88 on GenEval (with CoT), outperforming FLUX-1-dev (0.82) and SD3-Medium (0.74).
Moonlight review of Mogao themoonlight.io
Mogao utilizes a Deep-Fusion Design with modality-specific QKV layers and FFNs within transformer blocks to minimize cross-modal interference, retaining a SigLIP encoder plus SDXL-VAE rather than going encoder-free.
EPG project page (amap-ml.github.io) amap-ml.github.io
EPG achieves an FID of 1.58 on ImageNet-256 while using only 30% of the training compute required for latent-based Diffusion Transformers (DiT).
Milvus AI Quick Reference milvus.io
Latent diffusion models are bounded by a ‘fidelity ceiling’ where the final image quality cannot exceed the reconstruction capability of the frozen VAE; pixel-space methods avoid this but historically struggled with computational tractability on high-dimensional RGB.
Stackademic: Flow Matching vs Diffusion in 2025 blog.stackademic.com
Flow-matching learns a deterministic velocity field along straighter paths, achieving competitive quality in 5–15 steps versus 20–50 for diffusion, but can be more prone to training instability than well-tuned noise schedules.
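The "straighter paths, fewer steps" claim above is easy to see in a 1-D toy, assuming the common linear (rectified) interpolation path: the regression target for the velocity field along x_t = (1-t)*x0 + t*x1 is simply x1 - x0, and sampling is a handful of deterministic Euler steps. This is a generic sketch of flow matching, not any particular paper's training recipe.

```python
# 1-D flow-matching toy with a linear interpolation path.

def path_point(x0, x1, t):
    return (1 - t) * x0 + t * x1      # straight path from noise to data

def target_velocity(x0, x1):
    return x1 - x0                    # constant along a straight path

def euler_sample(x0, velocity_fn, steps=5):
    """Deterministic Euler integration of the learned velocity field."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += velocity_fn(x, i * dt) * dt
    return x

# With the true (constant) velocity, even 5 Euler steps recover the
# endpoint up to float error: straight paths need few integration steps.
x_end = euler_sample(0.0, lambda x, t: target_velocity(0.0, 2.0), steps=5)
```

A curved path would make the velocity time-dependent and reintroduce discretization error, which is why diffusion samplers typically need the 20 to 50 steps the source mentions.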