Sources

Harness, Scaffold, and the AI Agent Terms Worth Getting Right huggingface.co

References

MindStudio analysis citing METR mindstudio.ai

The same model can see performance swings from 46% to 80% on the same benchmark depending solely on the quality of its harness.

MindStudio — ‘Better model vs better harness’ mindstudio.ai

METR found Claude Code beat a simple ReAct loop in only 50.7% of bootstrap samples — statistically a coin flip — and OpenAI’s Codex harness actually underperformed a generic Triframe scaffold, winning only 14.5% of the time.

Towards Data Science — workflows vs agents towardsdatascience.com

Anthropic’s framework draws the line at who decides: workflows orchestrate LLMs through predefined code paths, while agents let the LLM dynamically direct its own processes and termination.

Cognition Labs — ‘Don’t Build Multi-Agents’ youtube.com

Fragmenting context across sub-agents leads to dispersed decision-making and silent corruption that no post-processing can fix; Anthropic counters with a 90.2% win rate for lead-worker patterns but at ~15x the token cost.

kau.sh — ‘Claude Skills’ deep-dive kau.sh

On SkillsBench, human-curated skills boosted task success by over 16%, but model-generated skills often provided no measurable benefit or even degraded performance.

Firecrawl — context engineering primer firecrawl.dev

A Databricks study found model accuracy begins to drop significantly around 32,000 tokens, long before nominal million-token limits — a failure mode practitioners call ‘context rot.’

Sources

References

Jack Sun, writing.