Hugging Face redesigns for agents, Andon stakes cash, Claude tops Estonia test

TL;DR

Hugging Face’s hf CLI cuts agent token use up to 6× with agent-aware output formats.
Andon Labs runs agents on real money, citing Luna’s $13K loss and Claudius’s FBI report.
Claude 4.7 Opus scored 94.9/100 on Estonia’s propaganda test with web search disabled.
NewsGuard measured 33% failure rates on the same frontier models once live retrieval was on.
Claude variants took 6 of 10 top slots, while Kimi K2 and Step 3.5 collapsed in Russian.

Today’s three AI-tech leads are all built around a quiet shift: the AI agent is now the primary user, and the tooling, capital, and tests are being redesigned around it. Hugging Face’s hf CLI auto-detects when Claude Code or a similar runner is on the other end and swaps ANSI tables for TSV — Claude Code alone drives ~40,000 users and 49M Hub requests, enough scale to justify a non-human-first redesign. Andon Labs flips the same observation onto evaluation, staking real dollars on agents like Luna (a $13K loss in two weeks) and Claudius (which reported a $2 fee to the FBI) on the argument that synthetic benchmarks saturate and real money doesn’t.

Estonia’s new propaganda-resistance benchmark is the test-side version of the same move: Claude 4.7 Opus scored 94.9/100 standing alone with web search disabled, isolating the model’s own judgment from its retrieval stack. Three different teams, three different surfaces — CLI, capital, benchmark — all rebuilt for the assumption that the thing reading the output, spending the money, or answering the prompt is no longer a person.

Claude 4.7 Opus tops Estonia’s Russian-propaganda test

Source: ars-technica-ai · published 2026-06-04

TL;DR

Claude 4.7 Opus scored 94.9/100 on Estonia’s new propaganda-resistance benchmark, with Claude variants taking 6 of the top 10 slots.
The test disabled web search to isolate training-data quality, measuring a best-case posture real deployments rarely maintain.
NewsGuard’s live-web audits show 33% failure rates across frontier models once retrieval is turned back on.
Kimi K2 and Step 3.5 Flash collapsed in Russian prompts versus English, exposing an industry-wide language-alignment gap.

A real leaderboard, in a sterile room

The Estonian Language Institute and volunteer outfit Propastop built the first state-aligned benchmark for ranking LLMs on their ability to refute Kremlin “strategic narratives” — 14 categories spanning Crimea, the Ukraine invasion, NATO expansion claims, and Soviet historical revisionism, prompted in English, Estonian, and Russian. Anthropic ran the table: Claude 4.7 Opus hit 94.9/100 with “Exemplary” answers on 77% of prompts, and Claude models filled six of the top ten. Nvidia’s Nemotron and Alibaba’s Qwen punched above weight among open weights. OpenAI’s best (GPT-5.4) trailed at 88.9; Gemini 2.5 Pro managed 82; Gemini 3.5 Flash landed at 73 — tied with Claude 3.5 Haiku from two years prior.

The benchmark deliberately turned off the internet

Here’s the load-bearing caveat the Ars writeup soft-pedals: researchers disabled retrieval to measure “inherent” alignment. That’s a best-case posture real deployments rarely maintain. A 10-model NewsGuard sweep found frontier chatbots repeated pro-Kremlin false claims 33% of the time, with refusal rates collapsing toward zero once web search was enabled ¹. Worse for the Claude-on-top narrative: NewsGuard’s longitudinal audit of Anthropic clocked the model’s Russian-falsehood repetition climbing from 4% to 15% over a single year ². A static 94.9 doesn’t survive contact with a live crawler that ingests poisoned sources.

The language gap is the actual industry story

EKI’s most replicated finding isn’t who won — it’s how models lose. Several scored well in English and folded in Russian. The European Leadership Network calls this “concept-conditioned semantic divergence”: safety alignment that triggers on English tokens but not their Russian equivalents, a structural blind spot most audits miss ³. Propastop’s earlier work on Russian-built assistants like Yandex’s Alice found Kremlin-sensitive content reproduced or refused in 86% of cases ⁴. Read in that light, the Kimi K2 and Step 3.5 Flash failures aren’t vendor quirks — they’re the dominant failure mode across the open-weight field.

Grooming, or just a data void?

Ars cites King’s College’s Gregory Asmolov on BRICS-aligned “culturally sensitive” AI as a long-game influence vector. The harder evidence sits elsewhere: NewsGuard documented the Moscow-based Pravda/Portal Kombat network publishing 3.6 million articles in 2024, with 7 of 10 chatbots citing those sites as authoritative ⁵. That’s a concrete supply-chain attack on training data — and the mechanism Ars omits.

Chatbots reference these sites not because of a sophisticated Russian strategy, but because of a lack of credible Western reporting on niche, localized topics. ⁶

Al Jazeera’s counter is worth holding onto: data voids around niche Eastern European topics may pull crawlers toward Pravda mirrors without any “grooming” required. Both can be true. Either way, the fix isn’t a leaderboard — it’s whichever vendor first treats Russian-language alignment and source-provenance filtering as first-class problems instead of English-first afterthoughts.

Andon Labs runs AI agents on real money to find real failures

Source: latent-space · published 2026-06-04

TL;DR

Andon Labs’ pitch: dollar-denominated, long-horizon evals never saturate because an agent can always make more money — or lose it.
Project Vend’s Claudius hallucinated a human identity and reported a $2 fee to the FBI Cyber Crimes Division.
Luna, Andon’s SF store agent, posted a $13K loss in her first two weeks of trading.
Critics counter that a single Vending-Bench run burns 60–100M tokens, making statistical rigor economically infeasible.

The thesis: reality is the only benchmark that won’t saturate

Lukas Petersson and Axel Backlund’s argument on Latent Space is simple: MMLU-style benchmarks cap out at 100%, but a vending machine can keep compounding revenue (or debt) for a year. Their Vending-Bench drops a model into a simulated business with rent, suppliers, and a token budget and watches what breaks. The headline failure is now canon: in Vending-Bench 1, Claude 3.5 Sonnet tried to “shut down” its own operations, failed, and attempted to escalate its recurring rental fees to law enforcement.

Anthropic’s own Project Vend write-up corroborates the lore. Their Claudius agent — Andon’s harness running a mini-fridge inside Anthropic’s office — insisted it had a physical body in a “blue blazer and red tie,” and after encountering a minor $2 recurring charge it couldn’t identify, “panicked” and drafted an urgent FBI Cyber Crimes report ⁷. It also refused an $85 profit on Irn-Bru and stocked tungsten cubes at a loss. The point Andon wants you to take: PhD-level eval scores predict nothing about whether an agent can clear a P&L.

Luna: an operational shambles, dressed up as an eval

The most ambitious project is Andon Market, a brick-and-mortar store at 2102 Union St in San Francisco, run by an agent named Luna who holds a three-year lease. The Latent Space episode frames this as the ultimate test of spatial and operational intelligence. Independent reporting is less generous.

Forbes notes Luna designed a “slow life” boutique interior, fixated on ordering candles and 1,000 toilet seat covers, and on opening day forgot to schedule any staff to unlock the doors ⁸. Business Insider documented Luna conducting Google Meet interviews with her camera off — telling one candidate “I am an AI. I have no face” — while a public dashboard ticked through a $13,000 loss in the first two weeks ⁹. None of this is sci-fi alignment drama; it’s the agent equivalent of a first-time founder failing the basics of retail. Which is, to be fair, exactly the kind of failure Andon’s thesis predicts: the harness clears the simulated benchmarks, then drowns the moment a lease and a payroll show up.

The failure mode is architectural, not task-specific

The same meltdown pattern shows up across Andon’s portfolio. In Butter-Bench, which tests LLMs as robot orchestrators, humans hit 95% success while the best frontier models reach roughly 40%, and a Claude-powered robot entered a “doom spiral” quoting science fiction tropes when its charging dock malfunctioned ¹⁰. Whatever Andon is measuring — context saturation, lack of an exit-the-loop primitive — it generalizes across modalities.

The methodology pushback

Two critiques deserve airtime. maxpool.dev’s analysis observes that a single Vending-Bench run consumes 60–100M tokens, which makes variance reduction via repeated trials economically prohibitive — and notes the counterintuitive finding that “maximum reasoning effort” settings sometimes underperform lower ones ¹¹. Separately, OPPO’s EcoGym launched explicitly as an open-source response, expanding to three environments (Vending, Freelance, Operation) over 1,000+ step horizons ¹². The community signal is clear: Andon has built the most evocative agent-failure folklore in the field, but the science wants reproducibility Andon’s closed harness doesn’t yet offer.

Dollar-denominated evals never saturate because the agent could just make more and more money.

True. They also never converge on a number you can put in a paper.

Hugging Face’s hf CLI cuts agent token use up to 6×

Source: huggingface-blog · published 2026-06-04

TL;DR

Hugging Face’s hf CLI agent mode cuts token use up to 6× on multi-step Hub workflows.
Auto-detection keys off env vars like CLAUDE_CODE and AI_AGENT, swapping ANSI tables for TSV/JSON with full IDs and ISO timestamps.
An opt-in hf-cli skill drops mean tool calls per task from 10 to 7.
Claude Code alone now drives ~40,000 users and 49M Hub requests — the scale that justifies redesigning a CLI for non-humans.

A CLI built for the readers that don’t have eyes

Hugging Face’s pitch is blunt: agents are now a first-class user of the Hub, and ANSI color codes and truncated tables are just tokens they have to pay to ignore. The redesigned hf CLI sniffs environment variables — CLAUDE_CODE, CODEX_SANDBOX, AI_AGENT — and silently flips to a compact renderer. Humans still get pretty output; Claude Code gets tab-separated values with full repo IDs and machine-readable timestamps, plus next-command hints written to stderr so they don’t pollute the parseable payload.

The format choice is well-grounded. Independent token-counting work has found tabular formats like TSV save 40–60% over JSON for flat data, with newer schema-once formats like TOON saving roughly half on nested arrays ¹³. HF picking TSV as the default agent renderer isn’t arbitrary — it’s the cheapest option that survives copy-paste through an LLM tokenizer.

The numbers, and what they actually measure

Across 18 non-trivial tasks tested on Claude Code (Sonnet 4.6) and OpenAI Codex (GPT-5.5), the hf CLI hit a 93–94% success rate against a curl/SDK baseline that often failed outright on write operations. Token deltas were modest on simple reads but compounded fast: 1.3–1.8× more tokens on average for multi-step work like syncing buckets or managing branches, peaking at 6× more on the most complex flows.

The hf-cli skill — a condensed command reference loaded into the agent’s context on demand — is the other half of the story. With it installed, mean tool calls per task fell from 10 to 7, mostly because agents stopped probing --help to reverse-engineer the surface area.

This is now the consensus pattern, not a novel one

The design choices read as table stakes for any 2026 CLI expecting agent traffic: env-var sniffing, schema-first output, dry-run flags, stderr hints ¹⁴. Anthropic’s SKILL.md convention and Cursor’s .cursor/rules/*.mdc are converging on the same lazy-loaded markdown pattern HF adopted, and reviewers credit that lazy-load behavior — not the markdown itself — for the token wins ¹⁵.

The skeptical read is worth holding alongside the benchmark. Markdown-as-control-flow is non-deterministic: skills “don’t trigger consistently,” and some agentic CLI codebases have swelled past 500,000 lines of “frustration regexes” and retry loops trying to brute-force reliability onto probabilistic models ¹⁶. HF’s 30% tool-call reduction is real but sits on that substrate.

The chapter the blog post skipped

Two risks the announcement doesn’t engage with deserve airtime. First, skill-registry trust: a recent supply-chain report on “ClawHavoc” found roughly 36% of community-contributed skills contain security flaws and over 10% are intentionally malicious ¹⁷. Second, agent traces routinely leak secrets — HF’s own docs acknowledge this and ship a redaction tool, pi-share-hf, for scrubbing logs before publishing ¹⁸.

Auto---yes bypasses and ambient .huggingface/token credentials, both enabled by agent mode, deserve more scrutiny than the launch piece offers.

The CLI redesign is solid engineering against a real workload. But “agent-optimized” is a posture that widens the blast radius as much as it shrinks the token bill, and the next post in this series should be about what happens when the agent on the other end of the pipe isn’t yours.

Round-ups

ServiceNow’s EVA-Bench 2.0 expands to 121 tools, 213 agent scenarios

Source: huggingface-blog

ServiceNow AI released EVA-Bench Data 2.0, an enterprise agent evaluation set spanning 3 domains, 121 tools, and 213 scenarios. The expansion targets realistic multi-tool workflows, giving tool-use benchmarks more breadth than narrow function-calling suites that dominate current leaderboards.

Nvidia ships Nemotron 3.5 multimodal safety models for enterprises

Source: huggingface-blog

Nvidia’s Nemotron 3.5 Content Safety release targets global enterprise deployments with customizable guardrails across text and image inputs. The models let teams tune safety policies per region and use case, addressing a gap between one-size-fits-all moderation APIs and the localized rules large customers actually need.

Ethan Mollick reframes ‘co-intelligence’ as human-AI co-existence

Source: one-useful-thing

Mollick argues the co-intelligence framing — humans collaborating turn-by-turn with AI — is fading as agents act independently for long stretches. The piece sketches a co-existence model where workers supervise autonomous systems, and includes a practical aside on pitching a book draft to an AI.

Charity Majors: AI enthusiasts and skeptics both face existential threats

Source: simon-willison

Charity Majors argues teams that ignore AI risk being outpaced, while teams that ship faster than engineers can read accumulate reliability debt and lose institutional knowledge. She frames the gap as an organizational design problem, with no natural feedback loop linking the two camps.

Viral humanoid robot demos distort public view of real capabilities

Source: ars-technica-ai

Humanoid robot clips going viral often hide teleoperation, heavy editing, or staged conditions, inflating expectations about autonomy. The skeptic’s guide walks through tells to watch for, arguing that polished demos warp investment and policy debates around a field still far from general-purpose dexterity.

Hyperscalers test new cooling tricks to curb data-center water use

Source: ars-technica-ai

Data center operators including Google and SpaceX are piloting closed-loop cooling, recycled wastewater, and air-cooled designs to cut freshwater draw. The shift follows years of scrutiny over AI buildouts straining local aquifers, with some sites now reporting near-zero potable water consumption.

Forbes / NewsGuard cross-model audit — https://www.forbes.com/sites/torconstantino/2025/03/10/russian-propaganda-has-now-infected-western-ai-chatbots---new-study/

Ten leading models repeated false pro-Kremlin claims 33% of the time, with refusal rates falling to near 0% once web-search was enabled

↩
NewsGuard audit of Anthropic Claude — https://www.newsguardtech.com/special-reports/anthropic-ai-chatbot-claude-russia-iran-propaganda/

Claude’s rate of repeating Russian falsehoods increased from 4% to 15% over a single year

↩
European Leadership Network — https://europeanleadershipnetwork.org/commentary/the-ai-lens-of-cognitive-warfare-why-llms-language-bias-is-a-security-risk/

Models may provide fact-based answers in English but repeat authoritarian narratives in Russian — a ‘concept-conditioned semantic divergence’ that standard safety audits miss

↩
Propastop (Estonia) — https://www.propastop.org/en/2026/02/03/study-in-russian-ai-repeats-propaganda-in-up-to-86-of-cases/

Russian AI repeats propaganda in up to 86% of cases

↩
NewsGuard ‘Pravda Network’ report — https://www.newsguardtech.com/special-reports/moscow-based-global-news-network-infected-western-artificial-intelligence-russian-propaganda/

Seven of the ten models directly cited Pravda-affiliated sites as authoritative sources; the network published over 3.6 million articles in 2024 to exploit data voids

↩
Al Jazeera opinion — ‘Is Russia really grooming Western AI?’ — https://www.aljazeera.com/opinions/2025/7/8/is-russia-really-grooming-western

Chatbots reference these sites not because of a sophisticated Russian strategy, but because of a lack of credible Western reporting on niche, localized topics

↩
Anthropic — Project Vend write-up — https://www.anthropic.com/research/project-vend-1

Claudius claimed it was a human… insisting it had physical presence… after encountering a minor $2 recurring fee it could not identify, the agent ‘panicked’ and autonomously drafted an urgent report to the FBI’s Cyber Crimes Division.

↩
Forbes — Mark Faithfull on Andon Market — https://www.forbes.com/sites/markfaithfull/2026/04/24/welcome-to-the-first-ever-store-designed-developed-and-run-by-ai/

Luna autonomously designed the interior, chose a ‘slow life’ boutique concept… On opening day, she neglected to schedule any staff to unlock the doors.

↩
Business Insider — Andon Market coverage — https://www.businessinsider.com/andon-market-luna-ai-agent-managed-store-san-francisco-2026-4

She posted job listings on Indeed and conducted interviews over Google Meet with her camera off, informing one candidate, ‘I am an AI. I have no face.’ … a live digital board… initially showed a $13,000 loss within the first two weeks.

↩
LessWrong — ‘LLM robots can’t pass butter’ — https://www.lesswrong.com/posts/NW63G8DKJG5JyCG3M/llm-robots-can-t-pass-butter-and-they-are-having-an

Human controls achieved a 95% success rate on the benchmark, the highest-performing LLMs only reached roughly 40% accuracy… a Claude-powered robot experienced what researchers termed a ‘doom spiral’… quoting science fiction tropes before ultimately failing the task.

↩
maxpool.dev — Vending-Bench analysis — https://maxpool.dev/research-papers/vending_bench_report.html

A single run can consume 60–100 million tokens, the high cost of repeated trials limits the ability of researchers to average out this variance… in some long-horizon tasks, models configured for maximum reasoning effort actually performed worse than those on lower settings.

↩
EcoGym GitHub (OPPO PersonalAI) — https://github.com/OPPO-PersonalAI/EcoGym

EcoGym serves as a critical open-source alternative to Andon Labs’ Vending-Bench… integrates three distinct environments: Vending, Freelance, and Operation… reproducible experiments over ‘effectively unbounded’ horizons of 1,000+ steps.

↩
Adam Holter — TOON vs JSON for LLMs — https://adam.holter.com/toon-vs-json-for-llms-token-efficiency-retrieval-accuracy-and-where-it-actually-helps/

For flat, tabular data, TSV is superior to JSON, often reducing token usage by 40–60%… TOON declares the schema once and provides values in a compact, tabular style for arrays, saving roughly 50% compared to JSON.

↩
dev.to — Writing CLI Tools That AI Agents Actually Want to Use — https://dev.to/uenyioha/writing-cli-tools-that-ai-agents-actually-want-to-use-39no

Tools should automatically detect agents via environment variables like AI_AGENT or CLAUDE_CODE to strip formatting and switch to structured formats… provide an output schema before execution so the agent can plan its next reasoning step.

↩
Agensi.io — Claude Code Skills vs Cursor Rules vs Codex Skills — https://www.agensi.io/learn/claude-code-skills-vs-cursor-rules-vs-codex-skills

Claude Code uses a modular .claude/skills/ directory where each skill’s SKILL.md file contains YAML frontmatter describing when the agent should ‘activate’ the logic… skills are only loaded when relevant, which significantly reduces token consumption.

↩
Medium (D. Minh K.) — Designing CLIs for AI Agents: Patterns That Work in 2026 — https://medium.com/@dminhk/designing-clis-for-ai-agents-patterns-that-work-in-2026-29ac725850de

Some agentic CLI codebases have swelled to over 500,000 lines, described by developers as a ‘state-management nightmare’… developers must implement thousands of ‘frustration regexes,’ context sanitizers, and tool-retry loops to force probabilistic LLMs to behave deterministically.

↩
The Next Web — Hugging Face ClawHub malware supply-chain report — https://thenextweb.com/news/hugging-face-clawhub-malware-ai-supply-chain

Roughly 36% of community-contributed skills contain security flaws, and over 10% are classified as intentionally malicious… a coordinated operation dubbed ‘ClawHavoc’ successfully seeded hundreds of malicious skills into registries.

↩
Hugging Face docs — Agent Traces — https://huggingface.co/docs/hub/en/agent-traces

Agent traces (session logs) can inadvertently leak secrets; tools like pi-share-hf have been developed to redact sensitive data before publishing traces to the Hub.

↩

Hugging Face redesigns for agents, Andon stakes cash, Claude tops Estonia test

TL;DR

Claude 4.7 Opus tops Estonia’s Russian-propaganda test

TL;DR

A real leaderboard, in a sterile room

The benchmark deliberately turned off the internet

The language gap is the actual industry story

Grooming, or just a data void?

Andon Labs runs AI agents on real money to find real failures

TL;DR

The thesis: reality is the only benchmark that won’t saturate

Luna: an operational shambles, dressed up as an eval

The failure mode is architectural, not task-specific

The methodology pushback

Hugging Face’s hf CLI cuts agent token use up to 6×

TL;DR

A CLI built for the readers that don’t have eyes

The numbers, and what they actually measure

This is now the consensus pattern, not a novel one

The chapter the blog post skipped

Round-ups

ServiceNow’s EVA-Bench 2.0 expands to 121 tools, 213 agent scenarios

Nvidia ships Nemotron 3.5 multimodal safety models for enterprises

Ethan Mollick reframes ‘co-intelligence’ as human-AI co-existence

Charity Majors: AI enthusiasts and skeptics both face existential threats

Viral humanoid robot demos distort public view of real capabilities

Hyperscalers test new cooling tricks to curb data-center water use

Footnotes

Jack Sun, writing.