Wei (Jack) Sun

Agibot's LWD beats SFT, Odysseus tops GPT-5.4 5×, TreeFlow reframes boosting

Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.


Sources

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies huggingface.co

Learning While Deploying framework enables continuous improvement of Vision-Language-Action policies through fleet-scale offline-to-online reinforcement learning with distributed robot experience and human interventions.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning huggingface.co

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. Existing approaches, however, either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20–30 turns). This work studies RL-based training of VLMs for long-horizon decision-making in games, scaling to 100+ turns.

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models huggingface.co

Decision trees and diffusion models are mathematically unified through a shared optimization principle called Global Trajectory Score Matching, enabling efficient generative models and neural network distillation methods.

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction huggingface.co

Web2BigTable splits internet-scale extraction between a coordinator and worker agents, decomposing tasks for parallel execution and looping through a run-verify-reflect cycle. Shared external memory lets the agents tackle both broad sweeps and deep lookups without losing context across iterations.

Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models huggingface.co

Swapping softmax for sigmoid attention in single-cell transformers yields better cell-type separation, faster convergence, and steadier gradients thanks to bounded derivatives and a diagonal Jacobian. FlashSigmoid and TritonSigmoid kernels keep throughput on par with FlashAttention-2.
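The diagonal-Jacobian point follows from the mechanics: an elementwise sigmoid makes each attention weight depend only on its own score, unlike the row-coupled softmax. A minimal sketch of the general sigmoid-attention idea (not this paper's kernels; the `-log(n)` bias is an assumption borrowed from earlier sigmoid-attention work):

```python
import numpy as np

def sigmoid_attention(q, k, v):
    """Elementwise sigmoid attention: each weight depends only on its own score."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # A -log(n) bias keeps total row mass comparable to softmax at init
    # (a common choice in prior sigmoid-attention work; an assumption here).
    weights = 1.0 / (1.0 + np.exp(-(scores - np.log(n))))
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = sigmoid_attention(q, k, v)  # shape (4, 8); no row normalization needed
```

Because no row normalization happens, the gradient of each weight touches only its own logit, which is the bounded-derivative property the summary credits for steadier training.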

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring huggingface.co

Themis-RM is a multilingual code reward model trained on a large preference dataset spanning functional correctness, style, and other dimensions. Parameter-efficient fine-tuning enables cross-lingual transfer, letting one reward stack score generations flexibly along chosen criteria.

Online Self-Calibration Against Hallucination in Vision-Language Models huggingface.co

The framework closes the generative-discriminative gap in VLMs by running Monte Carlo tree search to mine preference pairs, then applying direct preference optimization with a dual-granularity reward. Self-supervision happens online, so models keep tightening calibration without fresh human labels.
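For orientation, the preference-optimization step on mined pairs can be sketched with the standard DPO objective (the generic loss only; this paper's dual-granularity reward is not reproduced, and `beta` plus the log-probabilities here are placeholder values):

```python
import math

def dpo_loss(logp_chosen, ref_chosen, logp_rejected, ref_rejected, beta=0.1):
    """Standard DPO: push the policy's log-ratio margin for the chosen answer
    above its margin for the rejected one, relative to a frozen reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair mined by tree search: the preferred answer has the higher log-ratio,
# so its loss is lower than the flipped (mislabeled) pair.
loss_good = dpo_loss(-1.0, -2.0, -3.0, -2.0)
loss_bad = dpo_loss(-3.0, -2.0, -1.0, -2.0)
```

The self-supervised angle is that both sides of each pair come from the model's own search, so this loss needs no fresh human labels.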

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer huggingface.co

Jointly optimizing the tokenizer and autoregressive generator with a 1D semantic code, rather than freezing a separate VQ stage, posts state-of-the-art FID on ImageNet 256x256. Shared training aligns reconstruction and generation objectives that previously fought each other.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks huggingface.co

MASCing learns steering masks over routing logits using an LSTM surrogate, letting operators toggle expert circuits tied to behaviors like jailbreak defense or adult-content blocking. Switching safety profiles needs no weight updates, only a swap of the steering matrix.
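Mechanically, toggling expert circuits amounts to masking routing logits before top-k selection. A generic MoE-routing sketch (MASCing's LSTM-learned masks are not reproduced; the logit values below are illustrative):

```python
import numpy as np

def route_with_mask(router_logits, expert_mask, k=2):
    """Pick top-k experts, but only among those the mask enables (1 = on)."""
    masked = np.where(expert_mask.astype(bool), router_logits, -np.inf)
    order = np.argsort(masked)[::-1]  # descending by masked logit
    return order[:k]

logits = np.array([2.0, 1.5, 0.3, -0.1])
# A "safety profile" that disables expert 0 without touching model weights.
chosen = route_with_mask(logits, np.array([0, 1, 1, 1]))  # expert 0 never fires
```

Swapping profiles is then just swapping the mask array, which matches the summary's claim that no weight updates are needed.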

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance huggingface.co

Stable-GFN drops the brittle partition-function estimate in GFlowNet training for pairwise contrastive trajectory balance, then adds robust masking and a fluency stabilizer. The result is red-teaming prompts that stay coherent and varied instead of collapsing into gibberish or single modes.
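The partition-function trick can be sketched directly: subtracting the trajectory-balance residuals of two trajectories cancels the shared log Z term, so no explicit estimate is needed. A generic pairwise-TB sketch (not Stable-GFN's exact objective; the log-values are placeholders):

```python
def pairwise_tb_loss(logp_a, log_reward_a, logp_b, log_reward_b):
    """Each TB residual is (log Z + log p - log R); log Z is shared by both
    trajectories, so it cancels in the difference."""
    residual_a = logp_a - log_reward_a
    residual_b = logp_b - log_reward_b
    return (residual_a - residual_b) ** 2

# Two trajectories consistent with the same flow: equal residuals, zero loss.
zero = pairwise_tb_loss(-2.0, -5.0, -1.0, -4.0)
```

Removing the learned log Z is what makes the objective less brittle; the masking and fluency terms the summary mentions are separate additions on top.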

When Do Diffusion Models learn to Generate Multiple Objects? huggingface.co

Diffusion models struggle with multi-object generation due to scene complexity rather than concept imbalance, with counting being particularly challenging in low-data regimes.

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors huggingface.co

UniVidX is a unified multimodal framework that uses video diffusion model priors for versatile video generation through stochastic condition masking, decoupled gated LoRA, and cross-modal self-attention mechanisms.

Let ViT Speak: Generative Language-Image Pre-training huggingface.co

GenLIP is a minimalist generative pretraining framework for Vision Transformers that directly predicts language tokens from visual tokens using language modeling, offering simplicity, scalability, and competitive performance in multimodal tasks.

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills huggingface.co

Structured representation of agent skills disentangles scheduling, execution, and logic components, improving performance in skill discovery and risk assessment tasks.

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization huggingface.co

A trajectory-driven framework uses large language models to guide agent behavior and cooperation patterns in distributed black-box consensus optimization, improving solution quality and efficiency.

Map2World: Segment Map Conditioned Text to 3D World Generation huggingface.co

Map2World enables 3D world generation from user-defined segment maps with improved scale consistency and detail enhancement through a pipeline leveraging asset generator priors.

Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions huggingface.co

A training-free framework for fine-grained 3D editing that uses geometric primitives and vision-language models to preserve identity while enabling localized structural changes.

Soft Anisotropic Diagrams for Differentiable Image Representation huggingface.co

We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. Each site specifies an anisotropic metric and an additively weighted distance score, and pixel colors are computed as a softmax blend over a small per-pixel top-K subset of sites. This induces a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling huggingface.co

Talker-T2AV presents an autoregressive diffusion framework for talking head synthesis that separates high-level cross-modal reasoning from low-level modality-specific refinement, improving lip-sync accuracy and cross-modal consistency.

AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval huggingface.co

AnalogRetriever is a unified tri-modal retrieval framework that enhances analog circuit design by encoding schematics, descriptions, and netlists into a shared embedding space using vision-language models and graph convolutional networks.

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation huggingface.co

A language-adversarial speaker encoder (LASE) is proposed to address cross-script voice cloning issues by training with contrastive loss and gradient-reversal learning to produce language-uninformative yet speaker-informative embeddings.

References

The Robot Report therobotreport.com

Agibot deploys real-world reinforcement learning system… robots acquired new skills in tens of minutes by learning directly from environmental feedback, achieving 99.9% reliability on active production lines.

PR Newswire (Agibot/Longcheer joint release) prnewswire.com

World’s first embodied AI deployment in consumer electronics precision manufacturing mass production line… 310 UPH, ~19s cycle time, 140+ hours continuous operation, integrated within 36 hours.

Moonlight review of Entropy-Regularized Adjoint Matching themoonlight.io

QAM is strictly regularized against the empirical behavior distribution… suffers from a ‘Support-Binding Dilemma’ — popularity bias suppresses rare optimal actions and a zero-support trap prevents exploration of novel high-reward actions.

Federico Sarrocco blog on π*0.6 / RECAP federicosarrocco.com

RECAP is essentially filtered behavioral cloning that sidesteps policy gradient complexities, but it is limited by its training data; it converges to the best behavior present in the experience buffer rather than discovering a globally optimal policy.

Moonlight review of LWD themoonlight.io

Replacing DIVL with standard scalar expectile regression dropped long-horizon online performance by ~16.7%; the entropy-adaptive τ schedule outperformed a constant τ baseline (0.88 vs 0.84).
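For context, the scalar expectile regression used as the ablation baseline is the standard asymmetric squared loss from IQL-style offline RL (a generic sketch, with tau the expectile level; not DIVL itself):

```python
def expectile_loss(diff, tau):
    """Asymmetric L2: positive errors weighted by tau, negative by 1 - tau.
    tau > 0.5 biases the estimate toward the upper tail of returns."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2

# tau = 0.5 recovers symmetric least squares.
symmetric = expectile_loss(2.0, 0.5) == expectile_loss(-2.0, 0.5)
```

The quoted ablation suggests the scalar version loses information a distributional variant retains, hence the ~16.7% long-horizon drop.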

RLinf docs (HG-DAgger baseline) rlinf.readthedocs.io

SOP+HG-DAgger often achieves the highest peak performance (0.94–0.98) on tasks requiring high semantic precision because human corrections provide a cleaner signal than sparse environment rewards.

Seth Karten personal site (lead author) sethkarten.ai

co-founded the PokéAgent Challenge and authored PokéChamp, a minimax language agent for Pokémon battles that achieved human-expert performance without traditional RL training

PokeGym benchmark (ResearchGate) researchgate.net

GPT-5.4 exhibits higher ‘Aware Deadlocks’—recognizing it is stuck but failing to recover

SkyPilot blog on VeRL/HybridFlow blog.skypilot.co

HybridFlow has demonstrated throughput improvements between 1.5x and 20x compared to earlier frameworks like DeepSpeed-Chat… some newer frameworks like Verlog may outperform it in ‘extreme’ long-horizon tasks exceeding 400 turns

Medium: PPO for LLM alignment medium.com

filtering improves stability, it can reduce output diversity (entropy), as the model is never explicitly penalized for repeating specific sub-optimal but ‘safe’ patterns

ResearchGate: PPO vs DQN comparative study researchgate.net

Baseline PPO models often show only marginal progress at 300,000 timesteps, with meaningful level completion rates appearing only after 1–10 million steps

NVIDIA developer blog on BALROG developer.nvidia.com

DeepSeek-R1 recently set a new state-of-the-art on the BALROG leaderboard with a 34.9% progression rate… even the most advanced models typically achieve less than 50% accuracy in strategic multi-agent environments

alphaXiv overview alphaxiv.org

decision trees define continuous probability flow ODEs, while diffusion processes naturally induce tree-like hierarchical clusterings

Han & Zhou, ‘Diffusion Boosted Trees’ (arXiv 2311.14922) arxiv.org

parameterize each diffusion timestep with a single decision tree, effectively treating the entire diffusion process as a boosting algorithm

TabSyn / latent-diffusion tabular survey (arXiv 2502.17119) arxiv.org

TabSyn reduces error rates in column-wise distribution and pair-wise correlation estimation by up to 86% compared to TabDDPM … requiring fewer than 20 NFEs vs 1,000 for TabDDPM

Wasserstein-Gradient-Flow perspective on diffusion (arXiv 2406.01813) arxiv.org

neural networks often fail to learn ‘conservative’ vector fields (true scores), yet still succeed because they effectively minimize transport costs between distributions

ResearchGate: ‘Diffusion Models for Tabular Data: Challenges, Current Progress and Future Directions’ researchgate.net

the interpolation-based method SMOTE remains an unexpectedly strong competitor in terms of machine learning utility, sometimes matching or exceeding TabDDPM on simpler datasets

lonepatient arxiv digest (2026-05-04) lonepatient.top

DSMTree successfully bridges the performance gap … matching teacher performance within a 2% margin … On the Heart Disease dataset the distilled neural network actually exceeded the original decision tree by 3.7%


Jack Sun, writing.

Engineer · Bay Area

Hands-on with agentic AI all day — building frameworks, reading what industry ships, occasionally writing them down.


© 2026 Wei (Jack) Sun · jacksunwei.me Built on Astro · hosted on Cloudflare