Wei (Jack) Sun

Sessa beats Transformers, CHAI beats Gemini, agent survey regrades Sora

Three research results today each ride a benchmark the authors themselves designed or proposed, from Sessa's synthetic task to CHAI's cinematic rubric.


TL;DR

  • Sessa beats Transformers 86% vs 79% on SymbolSoup, the paper’s own synthetic long-range task.
  • Sessa’s baselines omit Mamba-3, Titans, and TTT, the actual 2026 long-context hybrid frontier.
  • CHAI fine-tunes an 8B Qwen3-VL past Gemini-3.1-Pro on cinematic captioning benchmarks the CHAI team designed.
  • 400-paper world-models survey backs its decision-usability framing with the independent World Models Index, which scores Sora 85/100 on fidelity vs 30/100 on planning.
  • Round-ups feature Memanto typed memory, SIREN activation-based safety, and AgentSearchBench for agent discovery.

Three research stories land today and they share a quiet tell: in each, the team claiming the result also picked the ruler. Sessa beats Transformers 86% to 79% — on SymbolSoup, the paper’s own synthetic task, with the actual 2026 long-context frontier (Mamba-3, Titans, TTT) absent from the baselines. CHAI fine-tunes an 8B Qwen3-VL past Gemini-3.1-Pro on cinematic captioning benchmarks the CHAI group itself designed. And a 400-paper agentic world-models survey doesn’t just propose a new taxonomy; it argues the field’s pixel-fidelity yardstick is the wrong one, pointing to the independent World Models Index, which scores Sora 85/100 on fidelity but only 30/100 on planning.

None of these are bad results — Sessa’s triangular-solve attention is a genuinely interesting recipe, CHAI’s expert-critique loop is a real pipeline, and the survey’s L1→L3 levels framing is overdue. But the day reads as a reminder that 2026’s research wins increasingly travel with their own measuring tape, and the work of independent comparison still has to be done by someone else.

Sessa beats Transformers 86% to 79% — on its own synthetic task

Source: hf-daily-papers · published 2026-04-20

TL;DR

  • Sessa wraps causal attention inside a triangular solve (I − B_fb)s = f, yielding power-law memory decay instead of SSM exponential forgetting.
  • Beats Transformer 86% vs 79% on SymbolSoup, crushing Mamba-2 (5%) on the paper’s own synthetic long-range task.
  • Repo is 7 days old, ~1,000 lines, with no training loops, eval scripts, or checkpoints [1][2].
  • Baselines omit Mamba-3, Titans, and TTT — the actual 2026 long-context hybrid frontier [3][4].

The mechanism

Sessa’s pitch is mathematically clean. A standard causal attention pass produces a forward signal f_t. A second attention mechanism over the strict past produces a strictly lower-triangular feedback matrix B_fb. The layer’s output is the solution to a causal linear system, computed by forward substitution:

$$s_t = f_t + \gamma_t \sum_{j<t} \alpha^{fb}_{t,j}\, s_j$$

Because B_fb is strictly lower-triangular and therefore nilpotent, (I − B_fb)^{-1} expands as a finite path-sum over hop counts — every output token aggregates contributions across all feedback path lengths in a single layer, where Transformers offer one hop per layer and Mamba offers one recurrent chain. Constraining the feedback gain |γ_t| < 1 via tanh keeps the solve BIBO-stable [5]. Theorem 14 claims universal approximation for finite stacks. The headline regime: long-range influence decays as O(ℓ^{−β}) with 0 < β < 1, a heavier tail than either baseline can produce.
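The forward-substitution solve is simple enough to sketch concretely. The following is my illustrative NumPy version, not the paper's implementation; the function name, shapes, and the placement of γ follow the equation above and are assumptions on my part:

```python
import numpy as np

def sessa_solve(f, B_fb, gamma):
    """Solve the causal linear system s_t = f_t + gamma_t * sum_{j<t} B_fb[t, j] * s[j].

    f:     (T, d) forward signal from the causal attention pass
    B_fb:  (T, T) strictly lower-triangular feedback coefficients
    gamma: (T,)   per-token feedback gain, assumed bounded via tanh

    Because B_fb is strictly lower-triangular, (I - diag(gamma) @ B_fb) is unit
    lower-triangular: the system always has a unique solution, and the inverse
    is the finite Neumann path-sum  I + (ΓB) + (ΓB)^2 + ... (nilpotency ends it).
    """
    T, d = f.shape
    s = np.zeros_like(f)
    for t in range(T):
        # forward substitution: only already-computed s[:t] are referenced
        s[t] = f[t] + gamma[t] * (B_fb[t, :t] @ s[:t])
    return s
```

The loop makes the causality explicit; a real implementation would batch this as a triangular solve rather than a Python loop.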

What the numbers actually show

The wins are real on the tasks chosen, and the losses are honestly reported:

| Task | Sessa | Transformer | Mamba-2 |
| --- | --- | --- | --- |
| SymbolSoup acc. ↑ | 0.860 | 0.792 | 0.050 |
| Diffuse MQAR token acc. ↑ | 0.154 | 0.122 | 0.002 |
| SimpleStories perplexity ↓ | 8.37 | 7.67 | 7.72 |

Mamba-2’s collapse to 5% isn’t a Sessa achievement — it’s the well-documented consequence of Mamba-2’s scalar-identity transition matrix, which can represent decay or growth but not the rotation needed for parity-style recall [6]. Mamba-3, released March 2026, specifically reintroduces complex-valued states to fix exactly this failure mode [3]. Comparing to Mamba-2 in late April 2026 is a soft target. Titans MAC hits 96.2% on 16K Needle-in-a-Haystack and TTT-E2E tracks full-attention loss curves out to 128K — neither appears in Sessa’s table [4].
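A toy illustration of that limitation (my own example, not from either paper): tracking parity needs a state that can flip sign, i.e. a rotation by π, while a decay-only scalar state with a ∈ (0, 1) can only shrink in magnitude and never stores the bit:

```python
def parity_rotation(bits):
    """Track parity with a unit-circle state: each 1-bit rotates by pi (h -> -h)."""
    h = 1.0
    for b in bits:
        if b:
            h = -h  # sign flip = rotation; exactly what a positive decay cannot do
    return 0 if h > 0 else 1

def parity_decay(bits, a=0.9):
    """Decay-only scalar state: magnitude shrinks, sign never changes, parity is lost."""
    h = 1.0
    for b in bits:
        h = a * h  # for any a in (0, 1), sign(h) stays positive regardless of b
    return 0 if h > 0 else 1  # always returns 0: the parity bit was never encoded
```

`parity_rotation` answers correctly for any input; `parity_decay` is structurally stuck at 0, which is the scalar-transition failure Mamba-3's complex-valued states are meant to repair.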

The short-context regression is also worth holding onto: an ablation that removes the feedback branch improves perplexity to 8.09. Feedback capacity is dead weight when long-range dependencies aren’t the bottleneck.

The provenance problem

This is where the source article needs an asterisk the paper doesn’t carry. The LibratioAI/sessa GitHub repo was created roughly one week before the paper dropped, ships ~1,000 lines of code, and contains no training loops, no evaluation scripts, and no pre-trained checkpoints [1][2]. The headline result rests on SymbolSoup, a synthetic task introduced in the same paper — not RULER, LongBench, or any standard NIAH variant. The author has no prior ML publication record. Habr commenters cataloging the release flagged the initial wave of social praise as plausibly LLM-generated; the repo itself shows single-digit stars and zero open issues [1][2].

Treat as a provocative theoretical contribution awaiting independent reproduction, not a Mamba/Transformer killer.

Takeaway

The path-sum-over-hops framing and the BIBO-stability/universal-approximation results are genuinely interesting — attention coefficients living inside a triangular solve is a primitive worth thinking about. But the empirical case is one synthetic benchmark, one stale baseline pair, and an unrunnable repo. Until someone outside Libratio reproduces SymbolSoup and ports Sessa to RULER against Mamba-3 and Titans, this is a theory paper wearing a benchmark’s clothes.


CHAI tunes Qwen3-VL to top Gemini-3.1-Pro on film captions

Source: hf-daily-papers · published 2026-04-21

TL;DR

  • A fine-tuned 8B Qwen3-VL reportedly beats Gemini-3.1-Pro on cinematic captioning, on benchmarks the same group designed.
  • CHAI pairs a cinematographer-designed visual taxonomy with expert critiques of model pre-captions.
  • Those critiques supervise SFT, DPO, and inference-time scaling in a single training loop.
  • Re-captioned footage trains Wan to follow 400-word prompts specifying camera motion, lens, focus, and framing.

The contribution: structure plus critique

CHAI (Critique-based Human-AI Oversight) is the third move in a research line from Zhiqiu Lin and collaborators. CameraBench gave them a 50+ category camera-motion taxonomy co-designed with filmmakers [7]; AuroraCap gave them an efficient captioning recipe that retained 99.5% of quality at 10–20% of visual tokens and edged Gemini-1.5 Pro on Flickr30k CIDEr (88.9 vs 82.2) [8]. CHAI grafts that taxonomy — hundreds of primitives covering subjects, scenes, motion, spatial layout, and camera dynamics — onto a Qwen3-VL backbone.

The methodological twist is the division of labor. Models draft pre-captions; trained experts critique and revise them into post-captions. The (pre, post, critique) triples then supervise three things at once: caption generation (SFT), reward modeling on pre-vs-post preferences (DPO), and a critique-generation head used at inference time. Ablations in the paper argue that critique quality — measured in precision, recall, and constructiveness — is the lever that governs downstream gains.

Why the Gemini comparison needs an asterisk

OpenReview discussion on the predecessor flagged two issues that almost certainly carry forward. Baseline Gemini and GPT-4V were evaluated zero-shot while the specialist was fine-tuned on in-distribution cinematic primitives — what reviewers called “unfair comparison” [9]. And the VDCscore-style LLM-as-judge metric leans on GPT-4o, which means cumulative judge error and silent drift when the API updates [9].

Even Gemini-3.1-Pro, despite a 38-point reduction in hallucination rate over its predecessor, still hallucinates ~50% of the time on complex omniscience benchmarks [10].

So “outperforms Gemini-3.1-Pro” is plausible on a narrow cinematography axis without implying general superiority. The harder question for reviewers: how much of the win is CHAI versus Qwen3-VL itself? Independent reports already credit the Qwen3-VL family with 99.5% accuracy locating events in two-hour videos (~1M tokens) and a 72.6 vs 68.2 edge over GPT-4o on VideoMME [11]. The base model is a strong starting point; the ablations need to separate the two contributions cleanly.

The Wan fine-tune carries a compute footnote

The downstream story — re-captioning films, commercials, and games, then fine-tuning Wan to follow up to 400-word prompts with control over camera motion, angle, lens, focus, POV, and framing — is the most directly useful claim for practitioners. The catch is serving cost. Salad’s benchmarking shows Wan’s 1.3B “lite” build does a 5-second 480p clip in ~4 minutes / 8.19 GB on a 4090, but 720p/1080p generation routinely exceeds 30 minutes per clip on consumer hardware [12]. Re-captioning “large-scale professional videos” at the corpus sizes the abstract implies is the unsexy bottleneck.

Takeaway

CHAI is a credible recipe for turning a strong open backbone into a cinematography specialist via cheap-ish expert oversight. The data, code, and a Qwen3-VL fine-tune that genuinely understands “dolly in, rack focus, low-angle” are the durable contributions. The Gemini-3.1-Pro line is marketing; the critique-as-supervision loop is the idea worth stealing.


Agentic World Modeling survey grades agents L1–L3 by usability

Source: hf-daily-papers · published 2026-04-23

TL;DR

  • 400-paper survey proposes a levels × laws taxonomy — L1 Predictor → L2 Simulator → L3 Evolver, across four world regimes.
  • Decision-usability, not pixel fidelity, is the survey’s measure of a world model’s worth.
  • Sora hits 85/100 fidelity vs 30/100 planning on the independent World Models Index.
  • L3 self-rewriting agents are already colliding with EU AI Act and NIST oversight regimes.

The taxonomy: capability levels × world regimes

The new survey Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond tries to do what the field has avoided for years — pin down what “world model” actually means. Its answer is a 3×4 grid. On one axis, capability climbs from L1 Predictor (one-step Markovian transitions), to L2 Simulator (multi-step, action-conditioned rollouts that stay coherent), to L3 Evolver (agents that diagnose their own prediction failures and persistently revise rules, skills, or parameters via a Design–Execute–Observe–Reflect loop with regression gates). On the other axis, models are scored against the laws of the world they operate in: physical, digital, social, or scientific.

Independent reviewers credit the paper with giving “much-needed theoretical foundation” to a field whose definitions had splintered across robotics, RL, and video generation [13]. The framework is grounded in a POMDP formulation rather than the usual modality-first taxonomy, which is why a Dreamer-style RSSM, a VQ-VAE token model, and a diffusion video generator can all be ranked on the same ladder.

Decision-usability beats pixel fidelity

The paper’s most opinionated move is to argue that visual realism is a distractor. A model that produces gorgeous frames but ignores the action input is, by this definition, not a world model at all — it’s an “action-insensitive” video generator. That critique lands inside an ongoing debate. Yann LeCun has called Sora-style pixel prediction a “dead end” because generative models burn capacity on “irrelevant high-dimensional noise” instead of dynamics [14]. The independent World Models Performance Index puts numbers on it: Sora scores 85/100 on predictive fidelity but only 30/100 on Planning & Control, while DeepMind’s Genie 2 inverts that profile by trading visual polish for controllability [15].

“JEPA is still fundamentally autoregressive and suffers from ‘latent drift,’ where errors compound in abstract space without the grounding that pixel-level reality checks provide.” [16]

That CMU/UCSD critique is the symmetric problem the survey under-discusses. Even LeCun’s preferred latent alternative inherits the compounding-error failure mode the paper attributes to pixel models. Neither camp has a clean answer.

L3 is where the survey meets the regulators

The L3 tier is the part of the taxonomy most likely to age fastest, because reality is moving on it. Governance analysts warn that “standard IT frameworks assume predictable system behavior, whereas L3 agents are inherently adaptive and non-linear” [17]. EU AI Act enforcement in August 2026, plus a NIST initiative on unsupervised agents, are explicit responses to what regulators are calling “agentic drift.”

On the implementation side, the open-source Capability Evolver pattern is already operationalizing self-rewriting agents with “typed mutations and validation gates” to keep autonomous evolutions auditable [18] — a near-isomorphism of the survey’s Design–Execute–Observe–Reflect loop. The community appears to agree on the loop’s shape but is still inventing the safety primitives the paper sketches only abstractly.

```mermaid
flowchart LR
    A[Design hypothesis] --> B[Execute action]
    B --> C[Observe outcome]
    C --> D{Reflect: prediction error?}
    D -- yes --> E[Diagnose component]
    E --> F[Revise rules/skills/params]
    F --> G{Regression gates}
    G -- pass --> A
    G -- fail --> H[Roll back]
    H --> A
    D -- no --> A
```
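The typed-mutations-plus-gates half of that loop is compact enough to sketch in code. This is a minimal illustration of the pattern with invented names, not the Capability Evolver API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Mutation:
    kind: str                                   # "rule" | "skill" | "param": the typed part
    apply: Callable[[Dict], Dict]               # produces a candidate agent state

def evolve(state: Dict, mutation: Mutation, gates: List[Callable[[Dict], bool]]) -> Dict:
    """Promote a mutation only if every regression gate passes; otherwise roll back."""
    candidate = mutation.apply(dict(state))     # mutate a copy, never the live state
    if all(gate(candidate) for gate in gates):
        return candidate                        # pass: the evolved state is promoted
    return state                                # fail: previous state survives (rollback)
```

Typing the mutation keeps the audit trail legible (what kind of change was attempted), and the gate list is where regression suites, safety probes, or compliance checks plug in before any self-rewrite goes live.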

The takeaway: the levels × laws scaffold is a useful orienting device, but its “open problems” framing of L3 understates how quickly that level has become a production engineering and compliance problem rather than a research one.

Round-ups

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Source: hf-daily-papers

Stanford’s SLIDERS tackles question answering over large document sets by extracting facts into a relational database and reasoning with SQL, replacing chunk-level retrieval and aggregation. The pipeline keeps provenance and extraction rationales as metadata to support data reconciliation across sources.

Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

Source: hf-daily-papers

Memanto proposes a universal memory layer for long-horizon agents that swaps hybrid semantic-graph pipelines for a typed schema plus an information-theoretic search engine, adding temporal versioning and conflict resolution. Benchmarks on LongMemEval and LoCoMo target the ingestion delay and operational complexity of vector-based systems.

Learning Evidence Highlighting for Frozen LLMs

Source: hf-daily-papers

HiLight trains a small emphasis actor to insert highlight tags around key evidence in long contexts before passing them to a frozen LLM solver. Trained via reinforcement learning on only the solver’s task reward, it transfers zero-shot to long-context QA and sequential recommendation.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Source: hf-daily-papers

SIREN is a lightweight guard model that taps an LLM’s internal layer activations rather than terminal-layer outputs, using linear probes and an adaptive layer-weighting scheme over safety neurons to flag harmful content with far fewer trainable parameters and support real-time streaming detection.

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

Source: hf-daily-papers

ESRRSim is a taxonomy-driven agentic framework that probes large language models for emergent strategic risks like deception, evaluation gaming and reward hacking, scoring both final responses and intermediate reasoning traces across multiple LLMs to surface behaviors that single-turn safety tests typically miss.

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Source: hf-daily-papers

dWorldEval is a discrete diffusion world model for evaluating robotic policies in simulation, mapping observations and actions into a unified token space and using a transformer denoiser with sparse keyframe memory and a progress token to jointly predict future states across modalities.

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Source: hf-daily-papers

AgentSearchBench evaluates the problem of finding the right AI agent for a task, arguing that retrieval and reranking over textual agent descriptions are insufficient. It scores candidate agents using execution-grounded behavioral signals from probing runs rather than card metadata alone.

Footnotes

  1. Habr discussion on Sessa release: https://habr.com/en/articles/990704/comments/

     The official GitHub repository (LibratioAI/sessa) was created just one week prior to the paper’s announcement and contained only ~1,000 lines of code… the repository lacked essential components such as training loops, evaluation scripts, and pre-trained checkpoints, leading developers to question the validity of the reported results.

  2. LibratioAI/sessa GitHub repository: https://github.com/LibratioAI/sessa

     Repository contains the official PyTorch implementation of the SessaLayer with FlashAttention support and a reference fallback; as of late April 2026 it had minimal community engagement (single-digit stars, no open issues).

  3. VentureBeat on Mamba-3: https://venturebeat.com/technology/open-source-mamba-3-arrives-to-surpass-transformer-architecture-with-nearly

     Mamba-3 introduces complex-valued states that allow the model to represent ‘rotational’ logic and oscillatory patterns, significantly improving performance on state-tracking and logic puzzles that stumped earlier SSMs.

  4. TechBuddies on Titans/TTT benchmarks: https://www.techbuddies.io/2026/01/09/stanford-and-nvidias-test-time-training-breakthrough-promises-long-memory-ai-without-costly-full-attention/

     Titans MAC (Memory as a Context) variant achieved 96.2% accuracy on 16K sequences, significantly outperforming DeltaNet (71.4%) and early TTT versions (88.4%)… TTT-E2E matches the loss scaling of full-attention Transformers much more closely than its peers as context grows toward 128K tokens.

  5. Aditya Inamdar, ‘Sessa: When Attention Meets Recurrence’ (Medium): https://medium.com/@inamdaraditya/sessa-when-attention-meets-recurrence-a-new-paradigm-for-long-context-memory-38aab9f46637

     Sessa is framed as a ‘new paradigm’ that combines the input-dependent routing of Transformers with the stateful aggregation of SSMs, with a feedback gain constraint |γ_t| < 1 enforced via tanh to keep the recurrent solve bounded.

  6. Medium, ‘The task a 5-year-old can solve that Mamba-2 cannot’: https://medium.com/@user.ishan/the-task-a-5-year-old-can-solve-that-mamba-2-cannot-60891546b0a7

     Mamba-2 utilizes a restricted scalar-identity transition matrix A, which limits its expressive capacity… its scalar A matrix can only represent decay or growth, not the rotation required for tasks like parity or complex associative recall.

  7. CameraBench project page (Lin et al., NeurIPS 2025 Spotlight): https://linzhiqiu.github.io/papers/camerabench/

     SfM models excel at geometric trajectories but struggle with semantic intents such as ‘following’ a subject, while VLMs face the opposite problem — struggling with precise geometric estimation.

  8. ICLR 2025, AuroraCap (Chai et al.), precursor work: https://iclr.cc/virtual/2025/poster/28051

     AuroraCap retains 99.5% of performance while using only 10–20% of the original visual tokens via bipartite soft matching token merging, and outperforms Gemini-1.5 Pro (CIDEr 88.9 vs 82.2) on Flickr30k.

  9. OpenReview review of AuroraCap / VDCscore: https://openreview.net/forum?id=tTDUrseRRU

     Reviewers flagged ‘unfair comparisons’ — baseline Gemini and GPT-4V were evaluated zero-shot while the specialist model benefited from structured fine-tuning, and VDCscore’s reliance on GPT-4o introduces cumulative error and version drift.

  10. arXiv 2410.03051, hallucination evaluation of frontier VLMs: https://arxiv.org/html/2410.03051v1

     Even Gemini-3.1-Pro, which reduced hallucination rate by 38 percentage points over its predecessor, still hallucinates ~50% of the time on complex omniscience benchmarks.

  11. Gnoppix forum, Qwen3-VL long-video benchmark report: https://forum.gnoppix.org/t/qwen3-vl-can-scan-two-hour-videos-and-pinpoint-nearly-every-detail/3006

     Qwen3-VL-235B maintained 99.5% accuracy locating events within two-hour videos (~1M tokens), and the 72B variant scored 72.6% on VideoMME vs GPT-4o’s 68.2%.

  12. Salad.com Wan 2.1 benchmarking report: https://blog.salad.com/benchmarking-wan2-1/

     The Wan 1.3B ‘lite’ version generates a 5-second 480p clip on an RTX 4090 in ~4 minutes using 8.19 GB VRAM, but 720p/1080p generation can exceed 30 minutes per clip on consumer cards.

  13. arxiviq Substack review: https://arxiviq.substack.com/p/agentic-world-modeling-foundations

     provides much-needed theoretical foundation for a field often criticized for its ambiguous definitions

  14. The Decoder, LeCun on Sora: https://the-decoder.com/metas-chief-ai-researcher-says-openais-world-simulator-sora-is-a-dead-end/

     modeling the world by predicting pixels is a ‘dead end’ … generative models focus on irrelevant high-dimensional noise

  15. world-models.io Sora vs Genie 2 index: https://world-models.io/en/compare/sora-vs-genie-2/

     ranks Sora high in predictive fidelity (85/100) but significantly lower in ‘Planning & Control’ (30/100) compared to Google DeepMind’s Genie 2

  16. CMU/UCSD critique of JEPA (arXiv 2507.05169): https://arxiv.org/pdf/2507.05169

     JEPA is still fundamentally autoregressive and suffers from ‘latent drift,’ where errors compound in abstract space without the ‘grounding’ that pixel-level reality checks provide

  17. Raconteur, Autonomous AI agents 2026 governance: https://www.raconteur.net/technology/autonomous-ai-agents-2026-the-new-rules-for-business-governance

     standard IT frameworks assume predictable system behavior, whereas L3 agents are inherently adaptive and non-linear

  18. Medium, Capability Evolver implementation: https://medium.com/@creativeaininja/capability-evolver-the-system-that-lets-ai-agents-rewrite-themselves-in-production-ae375e416956

     treating agent improvement as a production feature … typed mutations and validation gates to ensure that autonomous evolutions do not compromise system stability
