# H0LD1NG · running · MAS 1980→2026 · 30+ sources · adversarially verified field notes · 2026-06-06
field notes // multi-agent systems

The theory your agents are reinventing

Multi-agent systems theory is roughly forty-five years old. It has formal definitions of what an agent is, protocols for who does what, a complexity result that says coordination is provably hard, and an economics for making selfish processes behave. The 2023+ LLM agent stack is re-deriving nearly all of it — usually without the citations. A field guide, with every concept illustrated twice: once from the literature, once from the running code of our agent holding.

TL;DRsix claims, each argued below
  1. MAS theory (1980→) already named the parts of your agent stack: contract nets, blackboards, social laws, performatives, mechanism design. The LLM era is a re-run with a better reasoning core.§00
  2. “Agent” has a real definition — autonomy, social ability, reactivity, pro-activeness (Wooldridge & Jennings 1995). A stateless LLM persona with tools, triggers and memos passes all four.§01
  3. Coordination is the hard part, provably: optimal decentralized control is NEXP-complete even for two agents (Bernstein et al. 2002). Real systems don’t solve it — they guard against it.§07
  4. The classical mechanisms map onto running code line by line: Contract Net ≈ task dispatch, blackboard ≈ the task board, social laws ≈ loop guards, VCG-style mechanism design ≈ budget rails, FIPA performatives ≈ typed tool calls.§03–06
  5. The evidence is two-sided: +90.2% on parallel research tasks at ~15× the tokens (Anthropic) — but at matched compute, single agents tie or win (Tran & Kiela), and 41–86.7% of multi-agent runs fail for organizational reasons, not model reasons (MAST).§08
  6. Our own mock holding agrees with the theory: 826 runs, and the loop guard fired 27 times on one CEO↔PM memo ping-pong that would otherwise still be running. (Since retired — the audit became a changelog, §10.)§04
§ 00 · thesis

The oldest new field in AI

Multi-agent systems — MAS, in the literature — sits at the intersection of distributed AI, game theory, economics and control theory. It studies how multiple autonomous decision-makers, each with local information and its own goals, produce coherent (or incoherent) global behavior. The field’s canon was largely written between 1980 and 2002: the Contract Net Protocol and the Hearsay-II blackboard both date to 1980, Brooks’ subsumption architecture to 1986, the BDI formalism to 1991, the standard definition of “agent” to 1995, and the complexity result that explains why all of this is genuinely difficult to 2002.

Then, starting in 2023, a much larger engineering community began building multi-agent systems with LLMs as the reasoning core — orchestrators, role-playing software companies, debate panels, agent protocols — and started re-deriving the canon from scratch, mechanism by mechanism, mostly without naming it.

We have a convenient specimen for checking that claim. H0LD1NG — the system from our previous field notes — is an operating system for companies staffed by AI agents: org charts in YAML, a generic runtime, a task board, memos, budgets, an event bus, and a meta-agent that designs new companies. It was built bottom-up from engineering needs, not from the MAS literature. Which makes the convergence the interesting part: nearly every guard, channel and rail in its source has a name in a paper written before 2003. This article walks the theory in order, and at each stop shows the line of code that reinvented it.

THEORY PRACTICE 1980 1986 1991 1995 2002 2019 2023 2024 2025 2026 Contract Net blackboard subsumption ’86 boids ’87 BDI ’91 · social laws ’92 “agent” defined ’95 KQML→FIPA-ACL DEC-POMDP = NEXP OpenAI Five AlphaStar CAMEL · Smallville MetaGPT · ChatDev AutoGen Swarm · MCP “more agents” A2A · MAST Agents SDK Anthropic 90.2% matched-compute skeptics
FIG 01Two lanes, one field. The theory lane finished its load-bearing results by 2002; the practice lane started compiling them in 2023. The axis is deliberately nonlinear — sixteen empty years sit in the double slash.
§ 01 · definitions

What counts as an agent

The field’s standard definition comes from Wooldridge & Jennings (1995), who deliberately offer a weak notion of agency built on four properties: autonomy (“agents operate without the direct intervention of humans”), social ability (“agents interact with other agents … via some kind of agent-communication language”), reactivity (perceive the environment and “respond in a timely fashion”), and pro-activeness (“goal-directed behaviour by taking the initiative”). A stronger notion, they note, additionally describes agents with mentalistic vocabulary — knowledge, belief, intention, obligation — the ground later formalized as BDI.

A system becomes multi-agent when there are several of these, no global controller scripting each move, and outcomes that emerge from interaction rather than from any single agent’s plan. Run the checklist against one H0LD1NG persona — say a content-writer at the mock SEO agency. Autonomy: each wake-up is a sealed LLM session that decides for itself what the board state demands; no human in the loop. Reactivity: event triggers (“wake when task.created with role: engineer”). Pro-activeness: cadences (daily@09:30) — the agent acts on its own clock, not just on stimuli. Social ability: the task board and memo channel are the only way personas touch each other. Four for four — and the runtime dispatching them holds no script of what happens next, only physics: schedules, budgets, guards.

ENVIRONMENT task board · tasks.json memos · events.jsonl workspace files durable · shared · the only memory AGENT one stateless LLM session deliberative core · ≤40 turns · ≤$1.00 tool layer reactive skin · sandboxed to the workspace ephemeral · crashes are free perceive board snapshot + unread memos → prompt act mcp__company__* calls · Write/Edit/Bash autonomy no controller inside a run reactivity event triggers + when-filters pro-activeness cadences: daily@09:30, every 2h social ability board + memos, nothing else
FIG 02The 1995 checklist, instantiated. One wake-up = perceive (live board snapshot and unread memos injected into the prompt) → deliberate (a capped LLM session) → act (typed company tools, sandboxed file tools). The four Wooldridge–Jennings properties each have one concrete mechanism.
§ 02 · architectures

Reactive, deliberative, hybrid — and where the LLM sits

Classical agent design has two poles and a middle. The deliberative pole rests on Newell & Simon’s physical-symbol-system hypothesis: the agent “contains an explicitly represented, symbolic model of the world” and decides “via logical (or at least pseudo-logical) reasoning.” The reactive pole is Brooks’ subsumption architecture (1986): no central world model, no symbolic reasoning — “a hierarchy of task-accomplishing behaviours” where lower layers take precedence. The middle is BDI (Rao & Georgeff 1991), which made intention a first-class mental attitude with “equal status with the notions of belief and desire,” formalized in branching-time logic and implemented in systems like the Procedural Reasoning System.

Modern LLM agents are deliberative-hybrids, and H0LD1NG makes the layering unusually legible because it implements the loop twice: once delegated to the Claude Agent SDK, and once hand-rolled for OpenAI-compatible providers — model call → parse tool calls → execute → feed results back → repeat (apiRunner.ts, whose header notes “What the SDK gives Claude for free, this loop provides itself”). The LLM is the deliberative core; the tool dispatch and file sandbox are the reactive layer that actually touches the world.

The BDI mapping is the satisfying one, with a twist the 1991 paper didn’t anticipate: the mental state lives outside the agent. Beliefs are the board, memos and files — re-read from disk at every wake-up, because the runner sets persistSession: false and the system prompt warns that “anything not written to the board, a memo, or a file is lost.” Desires are the mission and KPIs in the company YAML. Intentions are claimed tickets: moving a task to in_progress stamps claimedBy with your persona key, and a rule set (computeClaim) releases the commitment if the work is bounced, reassigned, or its owner crashes. An agent here is a memoryless process; the organization is the mind.

REACTIVE · BROOKS ’86 behaviour: flee behaviour: wander behaviour: explore stimulus→action · no world model DELIBERATIVE · BDI ’91 beliefs desires intentions plan → act explicit world model · means-ends HYBRID · LAYERED deliberative top plans, goals reactive bottom fast, situated responses the pole most LLM agents live at WHERE H0LD1NG LIVES deliberative core = the LLM session (SDK or hand-rolled loop) · reactive layer = typed tools + sandboxed Read/Write/Bash beliefs = board+memos+files, re-read each wake · desires = mission+KPIs in YAML · intentions = claimed tickets (claimedBy)
FIG 03The 1986–1991 axis, and one modern system placed on it. The twist on textbook BDI: every mental attitude is externalized to durable shared state, so the agent process itself can be — and is — stateless.
§ 03 · coordination i

Task allocation: the contract net, minus the bidding

The founding mechanism of MAS task allocation is Reid Smith’s Contract Net Protocol (IEEE Trans. Computers, 1980): managers announce tasks with eligibility specifications; “available contractors evaluate task announcements … and submit bids on those for which they are suited. The managers evaluate the bids and award contracts to the nodes they determine to be most appropriate.” The process recurses — a contractor may partition its task and become a manager for the parts. Forty-five years later this is recognizably the orchestrator-worker pattern, a mapping made explicitly in recent work on agent contracts and LLM task brokering (e.g. arXiv:2504.21030, arXiv:2506.01900).

H0LD1NG’s cross-company outsourcing broker is a contract net with one phase deliberately amputated. A role calls outsource_task, naming a sibling company; the orchestrator validates and immediately creates a linked ticket in the target — announcement and award fused into a directed grant, no bid round, no comparative evaluation. The target’s own triggers fire on task.created; on completion the artifacts are copied back into the source’s inbox/outsourced/ and the source ticket moves to review. Even Smith’s recursion limit shows up, inverted: “A task can be outsourced only once, an outsourced-in task cannot be re-outsourced (no chains)” — contracting depth pinned to one, so the dependency graph stays flat and the copy-back has a single, unambiguous home. Update, hours after first publication: the amputation is now optional. Leave the target unnamed and the broker runs the missing phase by proxy — every sibling scored (vitality − 2·queue depth − 5·cost-per-done-task, with a cold-start penalty so an unproven company can’t win on an empty ledger), best bid wins, candidate table recorded on the ticket. The contract net got its market back.

Inside a company, allocation is capacity-bounded rather than negotiated: each role materializes one slot per seat (clamped 1..8), dispatch marks a slot busy up front so concurrent passes can’t double-book it, and a missed cadence while all seats are full is dropped — not queued — with run.skipped{busy}, because periodic work is idempotent over board state and stale fires would only be waste. Event triggers do queue, FIFO per role, capped at 50 with drop-oldest: a bounded mailbox that sheds the stalest obligations to stay live.

CONTRACT NET · SMITH 1980 manager contractors 1 · task announcement 2 · bids 3 · award 4 · report recursion: contractor → manager for subtasks H0LD1NG BROKER · ORCHESTRATOR.TS role broker target co. outsource_task linked task.created announce + award, fused bidding — broker auction when no target named done → artifacts copy-back source ticket → review depth limit: one hop, no chains
FIG 04Smith’s four-step negotiation against H0LD1NG’s broker. The protocol survives as announcement, award, report and a depth limit; the competitive middle — bids and comparative evaluation — is replaced by the calling role’s own judgment when a target is named, and since this article’s own review landed, run by the broker as a proxy auction when it isn’t. The market came back (§10).
§ 04 · coordination ii

Blackboards, conventions, and social laws

The same year as the contract net, the Hearsay-II speech-understanding system (Erman, Hayes-Roth, Lesser & Reddy, 1980) gave MAS its other great coordination pattern: independent knowledge sources that never call each other, communicating solely through a shared global database. “KSs communicate through a global database called the blackboard … It represents intermediate states of problem-solving activity, and it communicates messages (hypotheses) from one KS that activate other KSs.” Replace “knowledge source” with “role” and “hypothesis” with “ticket” and you have described a 2026 agent task board — a lineage modern blackboard revivals now cite directly (arXiv:2507.01701, arXiv:2510.01285).

H0LD1NG is a blackboard system to the letter. Roles never message each other; the system prompt is blunt about it:

“Coordinate with other roles ONLY through the task board and memos (mcp__company__* tools). Other roles cannot see your session — anything not written to the board, a memo, or a file is lost.” — role system prompt, packages/core/src/runner/promptBuilder.ts

The board’s status lifecycle (backlog → in‑progress → review → done) is hypothesis refinement with a credibility gate: work may not move to done until a reviewer — just another role subscribed to status: review — passes it. The workspace paths are conventions in the Lewis sense, an arbitrary-but-shared coordination equilibrium: the SEO agency’s shared context mandates research/<client>/, briefs/<client>/, notes/<role-id>.md, so researchers and writers hand off artifacts without ever addressing one another.

Social laws, enforced and merely preached

Shoham & Tennenholtz (AAAI 1992; AIJ 1995) proposed designing social laws offline — “constraints on the behavior of agents” specifying “which of the actions that are in general available are in fact allowed in a given state” — so that coordination needs neither a central arbiter nor negotiation. Their canonical example is traffic: adopt keep-to-the-right, and all head-on collisions are avoided “without any need for either a central arbiter or negotiation.” H0LD1NG ships two such laws in the engine, where they bind regardless of what any model thinks: an event produced by a role’s own run never re-triggers that role (one line in the dispatcher), and a sliding-window guard caps re-triggers of the same (role, subject) at six per ten minutes. The source comment names the exact failure it outlaws:

“The self-trigger guard only blocks a role from re-triggering itself; it does nothing for cross-role ping-pong (e.g. a reviewer bouncing a task back to a maker who re-submits it for review, forever). … Beyond the cap we drop the trigger and emit run.skipped so a runaway loop dampens itself instead of burning spend.” — packages/core/src/orchestrator.ts

A third class of law lives only in the prompts — the SEO auditor’s “Never approve your own work”, the software house’s “Never assign two engineers overlapping files at the same moment — slice by feature, not layer”, Trading Glass’s HARD CONSTRAINT against ever touching live trading systems, repeated at the org, role and KPI layers. The distinction matters and the theory predicted it: an engine-enforced law is a guarantee; a prompt-level norm is a request. H0LD1NG’s designers put the loop-termination laws in code and the etiquette in prose — which is exactly where each belongs.

BLACKBOARD task board · memos backlog → in_progress → review → done events.jsonl · append-only hypotheses = tickets · levels = statuses executive directives · epics manager decomposes epics maker ×2 claims tickets · ships files reviewer wakes on status: review loop guard ≤ 6 / 10 min self-trigger guard own events never re-wake you no role ever addresses another role — every edge passes through the board (Hearsay-II, 1980)
FIG 05Hearsay-II with personas. Knowledge sources → roles; hypotheses → tickets; activation → event triggers. The two engine-enforced social laws sit on the edges: the self-trigger guard kills one-hop loops, the sliding-window breaker kills multi-role ones.
RUN.SKIPPED BY REASON · 2-DAY MOCK WINDOW software-house 27 loop_guard 2 busy boardgames-zone 2 loop_guard 1 busy seo-agency 1 busy trading-glass 1 busy typing-cat 1 busy ← one CEO↔PM memo ping-pong, damped 27 times by the law
FIG 06MOCK DATAThe social law, observed. Across 826 runs in the holding’s two-day mock window, the loop guard fired 29 times — 27 of them in one company, where the CEO and PM fell into a “Status: board healthy” / “Direction for this cycle” memo loop. Without the law, that loop has no fixed point. Postscript: this chart retired itself — status memos are now typed inform (§06) and wake nobody, and a pair that keeps tripping the guard escalates into a six-hour mute plus an executive memo (§10).
§ 05 · the economic spine

Game theory and mechanism design

When agents are self-interested — or merely fallible — MAS borrows its formal backbone from economics. Outcomes are analyzed as games (Nash equilibria, with the prisoner’s dilemma as the canonical gap between individual and collective rationality), and mechanism design runs the analysis backward: you design the rules so that even selfish play produces an acceptable global outcome. The textbook results are auctions. In a second-price sealed-bid auction — the Vickrey auction, 1961 — “truth telling is a dominant strategy” (Shoham & Leyton-Brown, Thm. 11.1.1), and its generalization, the VCG mechanism, makes “each agent … pay his social cost — the aggregate impact that his participation has on other agents’ utilities.” The deep idea is not auctions; it’s that the rules, not the agents, carry the burden of good behavior.

LLM agents are not strategic bidders, but they are budget-burning processes with unreliable judgment, which is close enough for the institution-design lens to bite. H0LD1NG’s money rails read like a mechanism-design checklist. Per-run caps bound any single agent’s deliberation (40 turns, $1.00 — bounded rationality as a config key). The daily company budget is a hard ceiling under concurrency: every dispatched run reserves its worst-case cost up front, and the guard admits new work against settled-plus-reserved spend — otherwise, as the source puts it, “one dispatch pass admits every free slot against the same stale settled total and overshoots by (concurrent_slots × maxBudgetUsd).” An unset budget is not unlimited (“An unset budget must not mean ‘unbounded spend’” — a $25/day backstop applies, and the breach event is flagged defaulted so you know which rule fired).

Two further rules show real institutional thinking. A paid model with no entry in the price table fails the run, because “a paid provider with no known price would bill $0, so NEITHER the per-run dollar cap NOR the per-day budget could ever see its real spend … Fail the run rather than run uncapped on a real account” — if you cannot meter an action, you must not let the agent take it, or the whole mechanism is silently defeated. And the ledger that all of this reads from is the one file the store fsyncs (“the ledger is the money + audit record”), surviving even daemon death: interrupted runs are reconciled as failed on the next start so the cap never relaxes across a crash. Above it all sits the Architect — a meta-agent whose single tool validates a proposed company against the same schema the runtime executes. The designer proposes; the institution adjudicates. That is mechanism design applied to organizations themselves.

SIMULATED SPEND BY COMPANY · USD · 2026-06-04 → 06-05 software-house $18.40 · 644 runs budget.warning ×3 (80% pre-warn) · budget.exceeded ×4 → auto-pause, $10/day cap boardgames-zone $2.59 · 94 runs trading-glass $0.80 · 29 runs typing-cat $0.74 · 28 runs seo-agency $0.71 · 29 runs _architect $0.08 · 2 sessions (1 redesign, 1 health review) holding totals: 826 runs · $23.31 simulated · ~3.7M tokens · 2 failures
FIG 07MOCK DATAThe mechanism, exercised. One company dominates spend (nearly 80% of the holding) and is also the only one whose rails fired: three 80% warnings, four hard breaches, each followed by auto-pause. The other four firms never approached their caps. Costs are simulated at real model rates; no API dollars were burned.
§ 06 · communication

From speech acts to typed tool calls

Classical MAS took from philosophy the idea that messages are acts, not data: an utterance informs, requests, proposes, commits. KQML (Finin et al., 1994) built an agent language from such performatives, and FIPA-ACL standardized a library of 22 communicative acts — among them inform (“The sender informs the receiver that a given proposition is true”), request (“The sender requests the receiver to perform some action”), propose, and cfp, the call-for-proposals that powers contract nets. The standards went dormant; the idea did not. The modern protocol layer rediscovers the split at a different altitude: Anthropic’s MCP (Nov 2024) is “an open standard that enables developers to build secure, two-way connections between their data sources and AI-powered tools” — the agent↔tool half — while Google’s A2A (April 2025, since donated to the Linux Foundation) covers the agent↔agent half, with capability-advertising Agent Cards and a stateful task lifecycle. The cleanest one-line split is A2A’s own: “A2A is about agents partnering on tasks, while MCP is more about agents using capabilities.”

H0LD1NG is an honest case study in where performatives actually went — and in what happens when an audit reads its own conclusions. As first published, its memos had the ACL envelope — from, to (a role or all), subject, body, readBy — but no performative field: whether a memo was a directive, a status report or a question lived in prose (“Use memos for direction, status, hand-offs and questions”). Hours after this article shipped, the field arrived: every memo now carries kind: direct | inform | request | handoff | question, and the performative has wake semantics — an inform memo wakes nobody (it waits for the recipient’s next natural wake), the action kinds fire triggers as before, and a trigger can filter on it (when: {kind: request}). FIPA’s 22 acts became five, with scheduling consequences. The other half of the observation stands: the strongest performatives live in the typed tool calls. A board_update that sets status: in_progress together with role: content-writer is not a message about work — it is the bounce-back act itself, with system-stamped authorship (“always correct and never trusted to the model”) and machine-readable consequences: the orchestrator’s dispatcher treats exactly that field pair as the contract that wakes the other side. Speech-act theory survives — compiled into JSON schemas.

FIPA-ACL · 2002 performative: inform sender: agent-a receiver: agent-b content: (price item 42) ontology: marketplace illocutionary force: a typed field H0LD1NG MEMO kind: inform — wakes nobody from: ceo to: all | role-id subject/body: natural language readBy: […] the act arrived post-publication — kind decides who wakes (was: ∅, prose only) TYPED TOOL CALL tool: board_update taskId: T-13 status: in_progress role: content-writer author: system-stamped the bounce-back act — schema-checked, “the trigger contract roles code to”
FIG 08Three message shapes, one lineage. FIPA-ACL carried the act in a dedicated field; H0LD1NG’s memo originally dropped it back into prose — until this article’s own review put it back, with wake semantics attached (§10). The typed tool call still carries the strongest acts: performative, propositional content and side effects in one validated object. The speech-acts→tool-calls mapping is our synthesis, not (yet) a literature claim.
§ 07 · the hardness result

Why coordination is provably hard

The single most load-bearing result in MAS is a complexity theorem. Sequential decision-making under uncertainty has a ladder: a fully observed single-agent problem is an MDP (P-complete — easy); hide part of the state and it becomes a POMDP (PSPACE-complete — hard); now give the problem to a team of agents, each with its own partial, different observations, and you have a DEC-POMDP. Bernstein, Givan, Immerman & Zilberstein (2002) proved the finite-horizon case NEXP-complete — even for two agents. Their own gloss: the problems “provably do not admit polynomial-time algorithms,” a “fundamental difference between centralized and decentralized control,” and “mathematical evidence corresponding to the intuition that decentralized planning problems cannot easily be reduced to centralized problems and solved exactly using established techniques.”

Read that as systems guidance and it says: optimal decentralized coordination is computationally out of reach, so a real system should not chase it with cleverness — it should bound the damage with structure. That is precisely the shape of every mechanism in §§03–05. H0LD1NG’s agents act on a role-local projection of state (the prompt carries the board summary, your tasks, your unread memos — “Other roles cannot see your session”): partial observability by construction. The reservation-based budget guard exists because concurrent dispatch decisions against a stale shared total are exactly the kind of joint decision a DEC-POMDP punishes. And the loop guard exists because a maker/reviewer pair has no global view of their joint cycle — the theory says giving them one is NEXP-expensive; the engineering answer is a rate limiter.

The learning version: non-stationarity

Multi-agent reinforcement learning hits the same wall from a different side. With several learners adapting at once, “the Markov property … becomes invalid since the environment is no longer stationary” (Hernandez-Leal et al.’s survey) — each agent’s improvement is every other agent’s distribution shift, the “moving target” problem. What it took to beat it in practice is a measure of the difficulty: OpenAI Five defeated the Dota 2 world champions on April 13, 2019 after a single training run of ten months and 770 ± 50 PFlops/s·days of compute — then won 99.4% of 7,257 public games; AlphaStar reached Grandmaster, above 99.8% of ranked humans, by training an entire league of “continually adapting strategies and counter-strategies” rather than one agent. The flip side of emergence-as-problem is emergence-as-resource, and it is old: Reynolds’ 1987 boids produced flocking from three local rules (collision avoidance, velocity matching, flock centering — the “separation, alignment, cohesion” naming came later), “the result of the dense interaction of the relatively simple behaviors of the individual simulated birds.” H0LD1NG’s vitality index is a small wager on the same idea — a company-level health score no single agent computes or owns — and its adaptation engine, which reacts to moving KPIs, carries a 12-hour per-metric cooldown for exactly the moving-target reason: an org that re-aims at every twitch of its own measurements never converges.

MDP POMDP DEC-POMDP one agent · full view one agent · partial view team · partial, differing views P-complete PSPACE-complete NEXP-complete (n ≥ 2) — centralized control — decentralized control Bernstein · Givan · Immerman · Zilberstein, MOR 27(4), 2002 a chat assistant one agent + tools an agent company WORST-CASE COMPLEXITY OF OPTIMAL CONTROL, FINITE HORIZON
FIG 09The ladder your architecture climbs. Each rung is a proven complexity class, not a vibe: going from one agent to a team with differing partial views jumps from PSPACE to NEXP. The practical corollary — ship guards and conventions, not optimal joint plans — is the design rationale behind §§03–05.
§ 08 · the llm era

What the evidence actually says

The 2023 wave arrived as societies before it arrived as engineering. CAMEL framed two role-playing agents steered by “inception prompting” as a way to study “a society of agents.” Generative Agents put 25 personas in a Sims-style town with a memory stream, reflection and planning — and watched a single seeded intention (“one agent wants to throw a Valentine’s Day party”) diffuse, autonomously, from one agent (4%) to thirteen (52%) in two simulated days, while the social network densified from 0.167 to 0.74. Ablating the memory architecture didn’t dent believability, it demolished it: the full architecture versus the prior-work baseline measured a standardized effect size of d = 8.16. Then came the software companies: MetaGPT encoded “Standardized Operating Procedures (SOPs) into prompt sequences” and hit 85.9% / 87.7% on HumanEval / MBPP; ChatDev ran a waterfall of designing-coding-testing-documenting chats that shipped an app “in under seven minutes at a cost of less than one dollar” (v3 figures), with 86.66% of generated systems executing flawlessly. The frameworks followed — AutoGen (Aug 2023) made “conversable” agents a programming model; OpenAI shipped the experimental Swarm (Oct 2024), superseded by the Agents SDK (Mar 2025) with handoffs and guardrails as primitives; Claude Code grew subagents, each running “in its own context window with a custom system prompt, specific tool access, and independent permissions” — and, pointedly, unable to spawn subagents of their own. A depth limit. Smith would recognize it.

The case for: debate, ensembles, and one production system

Du et al.’s multiagent debate (2023, ICML 2024) is the cleanest controlled positive: several copies of the same model answer independently, read each other’s answers, revise, repeat. With three agents and two rounds on ChatGPT, arithmetic went 67.0→81.8%, grade-school math 77.0→85.0%, biography factuality 66.0→73.8%, MMLU 63.9→71.1%, chess-move validity 29.3→45.2% — and, intriguingly, debate sometimes recovers the right answer when every agent starts wrong. “More Agents Is All You Need” (Li et al., Feb 2024) showed an even blunter lever: pure sampling-and-voting scales with ensemble size — gains of 12–24% on GSM8K and 6–10% on MATH, with a 15-sample Llama2-13B (59%) overtaking a single Llama2-70B (54%) — though returns taper at extreme task difficulty. And the headline production number: Anthropic’s research system (June 2025), an orchestrator-worker design, “outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.”

The same writeup, read closely, is also the field’s best cost disclosure. Agents use ~4× the tokens of chat; “multi-agent systems use about 15× more tokens than chats.” On BrowseComp, “token usage by itself explains 80% of the variance” in performance (three factors together: 95%). Which invites the deflationary reading the skeptics formalized: maybe multi-agent systems work mainly because they are a socially acceptable way to spend more tokens.

The case against: matched compute and conformity

Hold compute constant and much of the magic evaporates. Smit et al.’s “Should we be going MAD?” (ICML 2024) benchmarked debating systems against simpler strategies and found they “do not reliably outperform … self-consistency and ensembling” — on MMLU, self-consistency scored 0.78 against ≤0.74 for every debate variant tested — while remaining far more hyperparameter-sensitive. Tran & Kiela’s matched-compute study (arXiv:2604.02460, 2026) swept thinking budgets from 100 to 10,000 tokens across five multi-agent architectures and found the single-agent system “the best-performing system or statistically indistinguishable from the best for all budgets except the lowest one,” concluding that “many reported MAS gains are better explained by compute and context effects than by inherent architectural superiority.” Debate has a specific failure mechanism: under social pressure, “models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning” (“Talk Isn’t Always Cheap”, 2025). And practitioners said it plainest — Cognition’s “Don’t Build Multi-Agents” (June 2025) argues context must be shared in full (“Share context, and share full agent traces, not just individual messages”), that parallel subagents make implicitly conflicting decisions, and that “in 2025, running multiple agents in collaboration only results in fragile systems.” Even Anthropic’s own postmortem lists early agents “spawning 50 subagents for simple queries.”

When systems fail, what actually broke?

The largest failure study is Berkeley’s MAST (“Why Do Multi-Agent LLM Systems Fail?”, 2025): 1,642 annotated execution traces across 7 frameworks, a taxonomy of 14 failure modes in 3 categories — system design issues, inter-agent misalignment, task verification — built from expert analysis with κ = 0.88 inter-annotator agreement. The systems studied failed at rates from 41% to 86.7%. The headline conclusion is the one this whole article has been circling: “Consistent with organization theories, our findings indicate that many MAS failures arise from the challenges in organizational design and agent coordination rather than the limitations of individual agents.” Interventions back it up — better role specifications alone bought ChatDev +9.4% task success; adding a verification step, +15.6%. The fix for multi-agent systems is org design, which is to say: §§03–06 of this article. Topology research lands the same way from the efficiency side: AgentPrune showed that one-shot pruning of the inter-agent message graph cuts 28.1–72.8% of tokens while matching state-of-the-art topologies ($5.6 against their $43.7); GPTSwarm treats the whole society as an optimizable graph. Most of what agents say to each other, measured, is padding.

Synthesis, honestly labeled as ours: multi-agent helps where the task is wide — parallel branches, more total context than one window holds, many tools — and the coordination surface is thin. It hurts where the task is deep and every decision depends on every other. The theory predicted both halves: parallelizable subproblems are the case where the DEC-POMDP’s joint-planning cost doesn’t bind; tightly coupled ones are where it bites hardest, and where Cognition’s single-threaded-agent advice is just the theorem worn as engineering taste.

DEBATE vs SINGLE AGENT · DU ET AL. 2023 · CHATGPT · 3 AGENTS, 2 ROUNDS arithmetic 67.0 81.8 GSM8K 77.0 85.0 biographies 66.0 73.8 MMLU 63.9 71.1 chess validity 29.3 45.2 single agent (%) multi-agent debate (%) caveat (Smit et al., matched strategies): on MMLU, self- consistency hit 0.78 vs ≤0.74 for every debate variant.
FIG 10The canonical pro-debate numbers (Du et al., Tables 1–2), with the asterisk attached. The gains are real against a single answer; against self-consistency at comparable inference spend, the picture inverts — which is the matched-compute critique in one sentence.
TOKEN ECONOMICS · ANTHROPIC, JUNE 2025 chat single agent ~4× multi-agent ~15× “token usage by itself explains 80% of the variance” on BrowseComp (3 factors: 95%) the counterweight — AgentPrune (2024): one-shot pruning of the message graph cuts 28.1–72.8% of tokens; matches SOTA topologies at $5.6 vs $43.7
FIG 11The bill, and the discount. The 15× multiplier is the honest denominator under every multi-agent benchmark win — and the fact that ~⅔ of inter-agent traffic can be pruned without performance loss says how much of that spend was coordination padding.
MAST · 1,642 ANNOTATED TRACES · 7 FRAMEWORKS · κ = 0.88 · BERKELEY 2025 FC1 · system design issues 5 modes (1.1–1.5) disobeying specs · role violations · step repetition · context loss … root stage: pre-execution fix shown to work: better role specs → +9.4% success (ChatDev) FC2 · inter-agent misalignment 6 modes (2.1–2.6) conversation resets · withheld info · ignored input · reasoning/action gap … root stage: execution the §04–06 territory: channels, conventions, performatives FC3 · task verification 3 modes (3.1–3.3) premature termination · weak or missing verification … root stage: post-execution fix shown to work: verification step → +15.6% on ProgramDev observed failure rates across the 7 studied open-source systems: 41% 86.7% 0% 100%
FIG 12The failure taxonomy that vindicates organization theory. Note what the three categories are not: model-capability limits. Specification, coordination and verification are org-design problems — and the two measured interventions that helped are a sharper role charter and a reviewer. H0LD1NG ships both as YAML primitives.
§ 09 · synthesis

The bridge, made explicit

The running claim of this article, in one table. Each row pairs a classical result with the thing your agent stack calls it and the line of H0LD1NG that instantiates it. The last column is the part most syntheses skip: whether the mapping is published in the literature or is our own analogy. Both kinds are useful; only one kind is citable.

classical concept2026 vernacularh0ld1ng instancemapping
Contract Net Protocol
Smith 1980
orchestrator-worker dispatch; outsourcing cross-company broker: announce + award + report, one-hop limit — and, post-publication, the bid round too: omit the target and the broker auctions by vitality/queue/cost (orchestrator.ts) ✓ published
arXiv:2504.21030, 2506.01900
Blackboard system
Hearsay-II, 1980
shared task board, scratchpad, plan file the board as sole coordination channel; tickets as hypotheses; review as credibility gate (companyStore.ts) ✓ published
arXiv:2507.01701, 2510.01285
Joint intentions / SharedPlans
Cohen & Levesque ’90; Grosz & Kraus ’93
task claiming, hand-offs, ownership claimedBy = individual commitment; computeClaim bounce/release rules; crash → claims released ◆ our synthesis
partial — no mutual-belief protocol
Social laws
Shoham & Tennenholtz ’92/’95
guardrails, rate limits, loop breakers self-trigger guard; loop guard (≤6 per subject / 10 min); both engine-enforced, model-independent — chronic loops now escalate (6h mute + an executive memo) ◆ our synthesis
tight — offline-designed action constraints
Mechanism design / VCG
Vickrey ’61; Clarke ’71; Groves ’73
budget caps, spend governance reservation-based daily cap; $25 defaulted backstop; unpriced-model refusal; fsync’d ledger ◆ our synthesis
rules carry the burden, not agents
Speech-act performatives
KQML ’94; FIPA-ACL (22 acts)
typed tool calls, structured outputs board_update {status, role} as the bounce-back act — “the trigger contract roles code to”; memos now carry kind (informs wake nobody) ◆ our synthesis
DEC-POMDP hardness
Bernstein et al. 2002
“why did adding agents make it worse” role-local prompts (partial views); guards instead of joint plans; the loop guard as NEXP-avoidance ◆ synthesis, standard framing
hardness is the published part
MARL non-stationarity
Hernandez-Leal et al. survey
agents adapting to each other mid-flight pessimistic reservation accounting; adaptation engine’s 12h cooldown as anti-windup ◆ our synthesis
Organization theory of failure
MAST, 2025
role specs, verification steps archetypes + acceptance criteria + a reviewer role wired to status: review — the two interventions MAST measured, as YAML; reviewer-less closes of criteria-carrying tickets are now mechanically refused without an on-record comment ✓ published
MAST’s own conclusion
WHAT WAKES AN AGENT · RUNS BY TRIGGER · 824 ACROSS 5 COMPANIES (+2 ARCHITECT SESSIONS) software-house 644 — 596 event / 48 cadence boardgames-zone 94 — 78 event / 16 cadence seo-agency 29 — 13 event / 14 cadence / 2 manual trading-glass 29 — 15 event / 14 cadence typing-cat 28 — 13 event / 15 cadence event-driven (reactive) cadence (proactive) manual the busier the company, the more reactive it runs: software-house woke 12× more often from events than from its clock
FIG 13MOCK DATAWooldridge & Jennings’ reactivity/pro-activeness split, as a bar chart. Idle companies live on their cadences; busy ones become event-driven — coordination traffic begets coordination traffic, which is the loop-guard section all over again.
§ 10 · postscript

The audit became a changelog

A field guide that ends in advice should be willing to take it. Hours after this article first shipped, the six strongest mappings above went back into the codebase as mechanisms (commit cbb0887), each traceable to a section here — implemented, adversarially reviewed, and verified against a live mock holding:

  • memo performatives — every memo carries kind: direct | inform | request | handoff | question; informs wake nobody, and triggers can filter on kind. FIPA’s lesson, compiled. → §06
  • verification by default — the Architect treats reviewer-less designs as the exception to justify, and a reviewer-less close of a criteria-carrying ticket is refused without an on-record comment. MAST’s +15.6% lever, mechanized. → §08
  • coordination overhead, metered — each run’s injected board/memo context is accounted in the ledger; the dashboard shows the input-vs-output token share (72% input, measured live) and the board snapshot caps at 40 tasks. → §08
  • trace digests — a settling run leaves its outcome and written files as a system comment on the tickets it touched; event-less and recency-neutral, so a digest can never wake anyone or reorder the board. Cognition’s principle, kept stateless. → §08
  • loop escalation — a (role, sender→recipient) pair that keeps tripping the guard is muted for six hours, loop.escalated fires, and the executive gets a request memo: the silent damper became an organizational signal. → §04
  • the contract net, completed — outsource without naming a target and the broker runs the bid phase by proxy: vitality − 2·queue − 5·cost-per-done, cold start penalized, candidate table on the ticket. → §03

The CEO↔PM ping-pong of FIG 06 — this article’s favorite specimen — is structurally extinct: the status memo that fed it is typed inform now, and the next holding dataset will not reproduce it. The chart documents a failure mode its own caption helped fix.

§ 11 · close

Read the old papers

The strongest version of this article’s claim is not “everything was known in 1995.” The LLM changed the economics of deliberation so completely that the field’s center of gravity moved: classical MAS spent decades on how agents decide, and the decision core is now a commodity you rent by the token. What did not change is everything around the core — task allocation, shared state, conventions, incentives, the hardness of joint planning — because none of that ever depended on how the agent thinks. That is why a task board built from engineering instinct in 2026 lands within arm’s reach of Hearsay-II, and why MAST’s 1,642 failure traces read like an organization-theory syllabus.

So the practical advice is almost embarrassingly cheap: the literature is a free design review. If your orchestrator re-dispatches work, Smith 1980 already wrote the protocol and its termination conditions. If your agents share a plan file, Hearsay-II already worked out activation. If two of your agents can wake each other, Shoham & Tennenholtz already argued the guard belongs in the rules, not the agents. And if adding a sixth agent made everything slower and worse — that isn’t a bug in your framework. It’s a theorem, and it has been NEXP-complete since 2002.

Everything in our dataset is replayable: four mock companies plus one very busy software house, 826 runs, $23.31 of simulated spend, every event in append-only files. The reset ritual is unchanged from the last article — stop the daemon, rm -rf workspaces/<id>, and forty-five years of theory boots again in about four seconds.

$ node packages/cli/dist/index.js start software-house --mock
$ open http://localhost:4733  # watch a 1980 blackboard fill with 2026 tickets

Sources

Every figure above was checked against its primary source by an independent verification pass; quotes are verbatim. Where a number lives only in a table (not a prose sentence), or a date rests on secondary reporting, the text says so.