Sources

Designing the hf CLI as an agent-optimized way to work with the Hub huggingface.co

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs latent.space

We talk with the VendingBench authors on evaling Claudes from Haiku to Mythos, and how they build leading, and lasting, frontier evals from scratch.

These LLMs are the best at resisting Russian propaganda arstechnica.com

Estonian government benchmark shows how dozens of models combat Russia’s “strategic narratives.”

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI huggingface.co

Nvidia’s Nemotron 3.5 Content Safety release targets global enterprise deployments with customizable guardrails across text and image inputs. The models let teams tune safety policies per region and use case, addressing a gap between one-size-fits-all moderation APIs and the localized rules large customers actually need.

The skeptic’s guide to humanoid robots going viral on the Internet arstechnica.com

Humanoid robot clips going viral often hide teleoperation, heavy editing, or staged conditions, inflating expectations about autonomy. The skeptic’s guide walks through tells to watch for, arguing that polished demos warp investment and policy debates around a field still far from general-purpose dexterity.

Co-Existence and the End of Co-Intelligence oneusefulthing.org

Mollick argues the co-intelligence framing — humans collaborating turn-by-turn with AI — is fading as agents act independently for long stretches. The piece sketches a co-existence model where workers supervise autonomous systems, and includes a practical aside on pitching a book draft to an AI.

How some data center operators are tackling their water use problems arstechnica.com

Data center operators including Google and SpaceX are piloting closed-loop cooling, recycled wastewater, and air-cooled designs to cut freshwater draw. The shift follows years of scrutiny over AI buildouts straining local aquifers, with some sites now reporting near-zero potable water consumption.

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios huggingface.co

ServiceNow AI released EVA-Bench Data 2.0, an enterprise agent evaluation set spanning 3 domains, 121 tools, and 213 scenarios. The expansion targets realistic multi-tool workflows, giving tool-use evaluation more breadth than the narrow function-calling suites that dominate current leaderboards.

AI enthusiasts are in a race against time, AI skeptics are in a race against entropy simonwillison.net

Charity Majors argues teams that ignore AI risk being outpaced, while teams that ship faster than engineers can read accumulate reliability debt and lose institutional knowledge. She frames the gap as an organizational design problem, with no natural feedback loop linking the two camps.

References

Propastop (Estonia) propastop.org

Russian AI repeats propaganda in up to 86% of cases

NewsGuard audit of Anthropic Claude newsguardtech.com

Claude’s rate of repeating Russian falsehoods increased from 4% to 15% over a single year

NewsGuard ‘Pravda Network’ report newsguardtech.com

Seven of the ten models directly cited Pravda-affiliated sites as authoritative sources; the network published over 3.6 million articles in 2024 to exploit data voids

Al Jazeera opinion — ‘Is Russia really grooming Western AI?’ aljazeera.com

Chatbots reference these sites not because of a sophisticated Russian strategy, but because of a lack of credible Western reporting on niche, localized topics

Forbes / NewsGuard cross-model audit forbes.com

Ten leading models repeated false pro-Kremlin claims 33% of the time, with refusal rates falling to near 0% once web-search was enabled

European Leadership Network europeanleadershipnetwork.org

Models may provide fact-based answers in English but repeat authoritarian narratives in Russian — a ‘concept-conditioned semantic divergence’ that standard safety audits miss

Anthropic — Project Vend write-up anthropic.com

Claudius claimed it was a human… insisting it had physical presence… after encountering a minor $2 recurring fee it could not identify, the agent ‘panicked’ and autonomously drafted an urgent report to the FBI’s Cyber Crimes Division.

Forbes — Mark Faithfull on Andon Market forbes.com

Luna autonomously designed the interior, chose a ‘slow life’ boutique concept… On opening day, she neglected to schedule any staff to unlock the doors.

Business Insider — Andon Market coverage businessinsider.com

She posted job listings on Indeed and conducted interviews over Google Meet with her camera off, informing one candidate, ‘I am an AI. I have no face.’ … a live digital board… initially showed a $13,000 loss within the first two weeks.

LessWrong — ‘LLM robots can’t pass butter’ lesswrong.com

Human controls achieved a 95% success rate on the benchmark; the highest-performing LLMs only reached roughly 40% accuracy… a Claude-powered robot experienced what researchers termed a ‘doom spiral’… quoting science fiction tropes before ultimately failing the task.

EcoGym GitHub (OPPO PersonalAI) github.com

EcoGym serves as a critical open-source alternative to Andon Labs’ Vending-Bench… integrates three distinct environments: Vending, Freelance, and Operation… reproducible experiments over ‘effectively unbounded’ horizons of 1,000+ steps.

maxpool.dev — Vending-Bench analysis maxpool.dev

A single run can consume 60–100 million tokens; the high cost of repeated trials limits the ability of researchers to average out this variance… in some long-horizon tasks, models configured for maximum reasoning effort actually performed worse than those on lower settings.

dev.to — Writing CLI Tools That AI Agents Actually Want to Use dev.to

Tools should automatically detect agents via environment variables like AI_AGENT or CLAUDE_CODE to strip formatting and switch to structured formats… provide an output schema before execution so the agent can plan its next reasoning step.

Adam Holter — TOON vs JSON for LLMs adam.holter.com

For flat, tabular data, TSV is superior to JSON, often reducing token usage by 40–60%… TOON declares the schema once and provides values in a compact, tabular style for arrays, saving roughly 50% compared to JSON.

Medium (D. Minh K.) — Designing CLIs for AI Agents: Patterns That Work in 2026 medium.com

Some agentic CLI codebases have swelled to over 500,000 lines, described by developers as a ‘state-management nightmare’… developers must implement thousands of ‘frustration regexes,’ context sanitizers, and tool-retry loops to force probabilistic LLMs to behave deterministically.

Agensi.io — Claude Code Skills vs Cursor Rules vs Codex Skills agensi.io

Claude Code uses a modular .claude/skills/ directory where each skill’s SKILL.md file contains YAML frontmatter describing when the agent should ‘activate’ the logic… skills are only loaded when relevant, which significantly reduces token consumption.

The Next Web — Hugging Face ClawHub malware supply-chain report thenextweb.com

Roughly 36% of community-contributed skills contain security flaws, and over 10% are classified as intentionally malicious… a coordinated operation dubbed ‘ClawHavoc’ successfully seeded hundreds of malicious skills into registries.

Hugging Face docs — Agent Traces huggingface.co

Agent traces (session logs) can inadvertently leak secrets; tools like pi-share-hf have been developed to redact sensitive data before publishing traces to the Hub.

Jack Sun, writing.