Sessa beats Transformers, CHAI beats Gemini, agent survey regrades Sora
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Sessa: Selective State Space Attention huggingface.co
Sessa is a decoder architecture that integrates attention within a recurrent feedback loop, pairing power-law memory decay with flexible selective retrieval to claim stronger long-context modeling than Transformers and state-space models.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond huggingface.co
The survey organizes world models into three capability levels and four law regimes, a framework for understanding and developing predictive environment models for AI agents across diverse domains.
Building a Precise Video Language with Human-AI Oversight huggingface.co
Structured visual specifications and a human-AI oversight framework improve video-language models' captioning accuracy and enable detailed control over video generation.
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents huggingface.co
Memanto proposes a universal memory layer for long-horizon agents that swaps hybrid semantic-graph pipelines for a typed schema plus an information-theoretic search engine, adding temporal versioning and conflict resolution. Benchmarks on LongMemEval and LoCoMo target the ingestion delay and operational complexity of vector-based systems.
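One plausible reading of "typed schema plus information-theoretic search" is typed memory records filtered by kind, then ranked by the summed surprisal of shared terms (an IDF-style scoring), with version numbers breaking conflicts. A minimal sketch; the record types, corpus, and scoring rule here are all hypothetical stand-ins, not Memanto's actual design:

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    kind: str        # typed schema: e.g. 'fact', 'preference', 'event'
    text: str
    version: int     # temporal versioning: higher wins on conflict

corpus = [
    Memory('preference', 'user prefers vegetarian food', 1),
    Memory('preference', 'user prefers vegan food', 2),
    Memory('event', 'user visited tokyo in march', 1),
]

def surprisal_score(query, mem, df, n_docs):
    """Sum of -log p over shared terms: rarer terms carry more information."""
    shared = set(query.split()) & set(mem.text.split())
    return sum(-math.log(df[w] / n_docs) for w in shared)

def retrieve(query, kind):
    cands = [m for m in corpus if m.kind == kind]      # typed filter first
    df = {}                                            # document frequencies
    for m in cands:
        for w in set(m.text.split()):
            df[w] = df.get(w, 0) + 1
    return max(cands, key=lambda m: (surprisal_score(query, m, df, len(cands)),
                                     m.version))       # version breaks ties

best = retrieve('vegan dinner options', 'preference')
assert best.text == 'user prefers vegan food'
```

Terms present in every candidate score zero here, so only distinguishing terms drive retrieval.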
Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework huggingface.co
ESRRSim is a taxonomy-driven agentic framework that probes large language models for emergent strategic risks like deception, evaluation gaming and reward hacking, scoring both final responses and intermediate reasoning traces across multiple LLMs to surface behaviors that single-turn safety tests typically miss.
LLM Safety From Within: Detecting Harmful Content with Internal Representations huggingface.co
SIREN is a lightweight guard model that taps an LLM’s internal layer activations rather than terminal-layer outputs, using linear probes and an adaptive layer-weighting scheme over safety neurons to flag harmful content with far fewer trainable parameters and support real-time streaming detection.
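The probe-plus-weighting idea can be pictured as one linear probe per layer combined through softmax layer weights, assuming per-layer activations are pooled into vectors. A schematic sketch with invented shapes and names; the trainable part is only `n_layers * (d + 1)` parameters, tiny next to the host LLM:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 12, 64

# Stand-in for frozen, pooled internal activations of one input (one vector
# per layer) -- in SIREN these would come from the guarded LLM itself.
acts = rng.normal(size=(n_layers, d))

# Trainable: a linear probe per layer plus one logit per layer.
probes = rng.normal(size=(n_layers, d)) * 0.01
layer_logits = rng.normal(size=n_layers)

def harm_score(acts, probes, layer_logits):
    """Combine per-layer probe scores via adaptive softmax layer weights."""
    per_layer = np.einsum('ld,ld->l', acts, probes)        # one score per layer
    w = np.exp(layer_logits) / np.exp(layer_logits).sum()  # layer weighting
    return 1.0 / (1.0 + np.exp(-(w * per_layer).sum()))    # sigmoid -> [0, 1]

score = harm_score(acts, probes, layer_logits)
assert 0.0 < score < 1.0
```

Because the score reads intermediate layers, it can in principle be computed while tokens are still streaming, rather than waiting for terminal-layer outputs.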
Learning Evidence Highlighting for Frozen LLMs huggingface.co
HiLight trains a small emphasis actor to insert highlight tags around key evidence in long contexts before passing them to a frozen LLM solver. Trained via reinforcement learning on only the solver’s task reward, it transfers zero-shot to long-context QA and sequential recommendation.
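The interface between actor and solver is simple to picture: the emphasis actor only chooses spans, and the frozen solver sees the tagged text. A sketch of the tagging step, with hypothetical tag names and character-offset spans:

```python
def insert_highlights(context, spans, open_tag="<hl>", close_tag="</hl>"):
    """Wrap predicted evidence spans (character offsets) in highlight tags.

    Spans are inserted right-to-left so earlier offsets stay valid.
    """
    for start, end in sorted(spans, reverse=True):
        context = (context[:start] + open_tag + context[start:end]
                   + close_tag + context[end:])
    return context

ctx = "Alice joined in 2019. Bob left in 2021."
tagged = insert_highlights(ctx, [(0, 21), (22, 39)])
# 'tagged' is what the frozen solver LLM receives as its prompt context
assert tagged == "<hl>Alice joined in 2019.</hl> <hl>Bob left in 2021.</hl>"
```

Only the span-proposing policy is trained (by the solver's task reward); the solver's weights never change.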
Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets huggingface.co
Stanford’s SLIDERS tackles question answering over large document sets by extracting facts into a relational database and reasoning with SQL, replacing chunk-level retrieval and aggregation. The pipeline keeps provenance and extraction rationales as metadata to support data reconciliation across sources.
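The extract-then-query pattern can be illustrated with sqlite; the schema and fact rows below are invented for illustration (SLIDERS induces its schema from the documents), but the shape — facts plus provenance and rationale columns, answered via SQL — follows the description above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Provenance and the extraction rationale ride along as metadata columns,
# so every SQL answer stays auditable back to its source document.
con.execute("""CREATE TABLE facts (
    entity TEXT, attribute TEXT, value TEXT,
    source_doc TEXT, rationale TEXT)""")
rows = [
    ("Acme Corp", "revenue_2023", "12.4B", "10-K_2023.pdf", "table on p.41"),
    ("Acme Corp", "revenue_2022", "11.1B", "10-K_2022.pdf", "table on p.39"),
    ("Beta Inc",  "revenue_2023", "3.2B",  "press_release.html", "headline"),
]
con.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", rows)

# "Which entities reported 2023 revenue, and from which document?"
results = con.execute(
    """SELECT entity, value, source_doc FROM facts
       WHERE attribute = 'revenue_2023' ORDER BY entity""").fetchall()
assert len(results) == 2 and results[0][0] == "Acme Corp"
```

Aggregations that chunk-level retrieval handles poorly ("how many entities...", "sum across sources...") become single SQL queries over the fact table.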
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model huggingface.co
dWorldEval is a discrete diffusion world model for evaluating robotic policies in simulation, mapping observations and actions into a unified token space and using a transformer denoiser with sparse keyframe memory and a progress token to jointly predict future states across modalities.
AgentSearchBench: A Benchmark for AI Agent Search in the Wild huggingface.co
AgentSearchBench benchmarks the problem of finding the right AI agent for a task, arguing that retrieval and reranking over textual agent descriptions are insufficient. It scores candidate agents using execution-grounded behavioral signals from probing runs rather than card metadata alone.
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training huggingface.co
EmbodiedMidtrain addresses the gap between vision-language models and vision-language-action models by using a mid-training approach that selects VLA-aligned data to improve downstream robot manipulation performance.
Video Analysis and Generation via a Semantic Progress Function huggingface.co
Researchers developed a Semantic Progress Function to analyze and correct non-linear semantic evolution in generated media, enabling smoother transitions through semantic linearization.
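One way to read "semantic linearization" is as arc-length reparametrization of a frame-embedding trajectory: measure cumulative semantic change per frame, then resample so equal steps in the output correspond to equal semantic progress rather than equal wall-clock time. A sketch under that assumption (the function names and random embeddings are illustrative, not the paper's):

```python
import numpy as np

def semantic_progress(frame_embs):
    """Cumulative semantic change along a clip, normalized to [0, 1]."""
    steps = np.linalg.norm(np.diff(frame_embs, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(steps)])
    return cum / cum[-1]

def linearize(frame_times, progress, n=None):
    """Frame times at which semantic progress is uniform."""
    n = n or len(frame_times)
    targets = np.linspace(0.0, 1.0, n)
    return np.interp(targets, progress, frame_times)

# Random-walk embeddings stand in for per-frame semantic features.
embs = np.cumsum(np.random.default_rng(1).normal(size=(20, 8)), axis=0)
t = np.arange(20, dtype=float)
p = semantic_progress(embs)
new_t = linearize(t, p)
assert new_t[0] == 0.0 and new_t[-1] == 19.0
assert np.all(np.diff(new_t) >= 0)
```

Frames where meaning changes quickly get sampled densely, flat stretches sparsely — which is the intuition behind smoother transitions.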
FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing huggingface.co
FlowAnchor enables stable and efficient video editing by addressing signal instability in high-dimensional latent spaces through spatial-aware attention refinement and adaptive magnitude modulation.
DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction huggingface.co
DiffNR enhances neural representation optimization for CT reconstruction by integrating a single-step diffusion model with specialized conditioning layers and pseudo-reference volume generation for artifact correction.
DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation huggingface.co
DiagramBank, a large-scale dataset of schematic diagrams, is introduced for multimodal retrieval and exemplar-driven scientific figure generation, filling the gap in automated publication-grade diagram creation left by existing AI-scientist systems.
AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval huggingface.co
AgriIR presents a modular retrieval-augmented generation framework for agricultural information access that uses configurable components to deliver accurate, transparent, and resource-efficient domain-specific responses.
References
Habr discussion on Sessa release habr.com
The official GitHub repository (LibratioAI/sessa) was created just one week prior to the paper’s announcement and contained only ~1,000 lines of code… the repository lacked essential components such as training loops, evaluation scripts, and pre-trained checkpoints, leading developers to question the validity of the reported results.
Aditya Inamdar, ‘Sessa: When Attention Meets Recurrence’ (Medium) medium.com
Sessa is framed as a ‘new paradigm’ that combines the input-dependent routing of Transformers with the stateful aggregation of SSMs, with a feedback gain constraint |γ_t| < 1 enforced via tanh to keep the recurrent solve bounded.
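The bounded-gain constraint is easy to check numerically. A toy scalar version of the recurrence described above — the actual SessaLayer operates on vector states with attention, and the function and parameter names here are hypothetical:

```python
import numpy as np

def bounded_gain_recurrence(x, w):
    """Toy recurrence h_t = gamma_t * h_{t-1} + x_t with
    gamma_t = tanh(w * x_t), so |gamma_t| < 1 and the solve stays bounded."""
    h, gains = 0.0, []
    for x_t in x:
        gamma_t = np.tanh(w * x_t)   # tanh squashes the gain into (-1, 1)
        h = gamma_t * h + x_t
        gains.append(gamma_t)
    return h, np.array(gains)

x = np.random.default_rng(0).normal(size=1000)
h, gains = bounded_gain_recurrence(x, w=0.5)
assert np.abs(gains).max() < 1.0 and np.isfinite(h)
```

Because every gain has magnitude below one, the homogeneous part of the recurrence contracts and old inputs decay geometrically in the worst case, with the claimed power-law behavior emerging from the input-dependent mix of gains.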
VentureBeat on Mamba-3 venturebeat.com
Mamba-3 introduces complex-valued states that allow the model to represent ‘rotational’ logic and oscillatory patterns, significantly improving performance on state-tracking and logic puzzles that stumped earlier SSMs.
Medium: ‘The task a 5-year-old can solve that Mamba-2 cannot’ medium.com
Mamba-2 utilizes a restricted scalar-identity transition matrix A, which limits its expressive capacity… its scalar A matrix can only represent decay or growth, not the rotation required for tasks like parity or complex associative recall.
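The rotation argument can be made concrete with parity: a unit complex state rotated by a half-turn on each 1-bit tracks parity exactly, while a non-negative scalar transition can only shrink or grow the state and never flips its sign. A minimal sketch:

```python
import numpy as np

def parity_via_rotation(bits):
    """Track parity with a complex state rotated by pi for each 1-bit."""
    z = 1.0 + 0.0j
    for b in bits:
        z *= np.exp(1j * np.pi * b)  # half-turn rotation when b == 1
    return int(z.real < 0)           # -1 => odd parity, +1 => even

bits = [1, 0, 1, 1, 0, 1]
assert parity_via_rotation(bits) == sum(bits) % 2

# A non-negative scalar transition (decay/growth only) cannot do this:
s = 1.0
for b in bits:
    s *= 0.9 if b else 1.0           # sign never changes, parity is lost
assert s > 0
```

This is the gap Mamba-3's complex-valued states are reported to close: rotation gives the state space an oscillatory degree of freedom that a scalar A matrix lacks.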
TechBuddies on Titans/TTT benchmarks techbuddies.io
Titans MAC (Memory as a Context) variant achieved 96.2% accuracy on 16K sequences, significantly outperforming DeltaNet (71.4%) and early TTT versions (88.4%)… TTT-E2E matches the loss scaling of full-attention Transformers much more closely than its peers as context grows toward 128K tokens.
LibratioAI/sessa GitHub repository github.com
Repository contains the official PyTorch implementation of the SessaLayer with FlashAttention support and a reference fallback; as of late April 2026 it had minimal community engagement (single-digit stars, no open issues).
CameraBench project page (Lin et al., NeurIPS 2025 Spotlight) linzhiqiu.github.io
SfM models excel at geometric trajectories but struggle with semantic intents such as ‘following’ a subject, while VLMs face the opposite problem — struggling with precise geometric estimation.
ICLR 2025 — AuroraCap (Chai et al.), precursor work iclr.cc
AuroraCap retains 99.5% of performance while using only 10–20% of the original visual tokens via bipartite soft matching token merging, and outperforms Gemini-1.5 Pro (CIDEr 88.9 vs 82.2) on Flickr30k.
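Bipartite soft matching (as in Token Merging) can be sketched in a few lines: split tokens alternately into sets A and B, match each A token to its most similar B token by cosine similarity, merge the r strongest matches by averaging, and keep the rest. A simplified single-step version, not AuroraCap's exact implementation:

```python
import numpy as np

def bipartite_soft_matching_merge(tokens, r):
    """Merge the r most similar (A, B) token pairs, ToMe-style."""
    a, b = tokens[::2], tokens[1::2]          # alternate split into A and B
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = norm(a) @ norm(b).T                 # cosine similarity, A x B
    best_b = sim.argmax(axis=1)               # each A token's best B partner
    best_s = sim.max(axis=1)
    merge_a = np.argsort(-best_s)[:r]         # the r strongest matches
    keep_a = np.setdiff1d(np.arange(len(a)), merge_a)
    b = b.copy()
    for i in merge_a:                         # average each merged pair into B
        b[best_b[i]] = (b[best_b[i]] + a[i]) / 2
    return np.concatenate([a[keep_a], b])

toks = np.random.default_rng(2).normal(size=(16, 4))
merged = bipartite_soft_matching_merge(toks, r=4)
assert merged.shape == (12, 4)
```

Each step removes r tokens; applied across layers, this is how a captioner can run on 10–20% of the original visual tokens with little accuracy loss.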
OpenReview review of AuroraCap / VDCscore openreview.net
Reviewers flagged ‘unfair comparisons’ — baseline Gemini and GPT-4V were evaluated zero-shot while the specialist model benefited from structured fine-tuning, and VDCscore’s reliance on GPT-4o introduces cumulative error and version drift.
Gnoppix forum — Qwen3-VL long-video benchmark report forum.gnoppix.org
Qwen3-VL-235B maintained 99.5% accuracy locating events within two-hour videos (~1M tokens), and the 72B variant scored 72.6% on VideoMME vs GPT-4o’s 68.2%.
Salad.com Wan 2.1 benchmarking report blog.salad.com
The Wan 1.3B ‘lite’ version generates a 5-second 480p clip on an RTX 4090 in ~4 minutes using 8.19 GB VRAM, but 720p/1080p generation can exceed 30 minutes per clip on consumer cards.
arXiv 2410.03051 — hallucination evaluation of frontier VLMs arxiv.org
Even Gemini-3.1-Pro, which reduced hallucination rate by 38 percentage points over its predecessor, still hallucinates ~50% of the time on complex omniscience benchmarks.
arxiviq Substack review arxiviq.substack.com
provides much-needed theoretical foundation for a field often criticized for its ambiguous definitions
The Decoder — LeCun on Sora the-decoder.com
modeling the world by predicting pixels is a ‘dead end’ … generative models focus on irrelevant high-dimensional noise
world-models.io Sora vs Genie 2 index world-models.io
ranks Sora high in predictive fidelity (85/100) but significantly lower in ‘Planning & Control’ (30/100) compared to Google DeepMind’s Genie 2
CMU/UCSD critique of JEPA (arXiv 2507.05169) arxiv.org
JEPA is still fundamentally autoregressive and suffers from ‘latent drift,’ where errors compound in abstract space without the ‘grounding’ that pixel-level reality checks provide
Raconteur — Autonomous AI agents 2026 governance raconteur.net
standard IT frameworks assume predictable system behavior, whereas L3 agents are inherently adaptive and non-linear
Medium — Capability Evolver implementation medium.com
treating agent improvement as a production feature … typed mutations and validation gates to ensure that autonomous evolutions do not compromise system stability