Hugging Face's agent glossary makes the harness a first-class layer
Every URL the pipeline pulled into ranking for this issue — primary sources plus the supporting and contradicting findings each Researcher returned. Inline citations in the issue point back here.
Sources
Harness, Scaffold, and the AI Agent Terms Worth Getting Right huggingface.co
References
MindStudio analysis citing METR mindstudio.ai
The same model can see performance swings from 46% to 80% on the same benchmark depending solely on the quality of its harness.
MindStudio — ‘Better model vs better harness’ mindstudio.ai
METR found Claude Code beat a simple ReAct loop in only 50.7% of bootstrap samples — statistically a coin flip — and OpenAI’s Codex harness actually underperformed a generic Triframe scaffold, winning only 14.5% of the time.
Towards Data Science — workflows vs agents towardsdatascience.com
Anthropic’s framework draws the line at who decides: workflows orchestrate LLMs through predefined code paths, while agents let the LLM dynamically direct its own processes and termination.
Cognition Labs — ‘Don’t Build Multi-Agents’ youtube.com
Fragmenting context across sub-agents leads to dispersed decision-making and silent corruption that no post-processing can fix; Anthropic counters with a 90.2% win rate for lead-worker patterns but at ~15x the token cost.
kau.sh — ‘Claude Skills’ deep-dive kau.sh
On SkillsBench, human-curated skills boosted task success by over 16%, but model-generated skills often provided no measurable benefit or even degraded performance.
Firecrawl — context engineering primer firecrawl.dev
A Databricks study found model accuracy begins to drop significantly around 32,000 tokens, long before nominal million-token limits — a failure mode practitioners call ‘context rot.’