# WHEN THE CATS FORM A TEAM

## Doctrine-Structured Multi-Model Ensemble Decision-Making

**Jeep Marshall**
LTC, US Army (Retired)
Airborne Infantry | Special Operations | Process Improvement
March 2026

![[Paper-6-Assets/figure-1-war-room-gemini.png]]
*Figure 1: The Digital War Room — Four AI models assigned military staff roles around a holographic decision table. Claude (Commander, amber), Gemini (S2 Intelligence, cyan), ChatGPT (S3 Operations, green), Grok (Devil's Advocate, red). Image generated by Gemini.*

---

**Series Note:** This is Paper 6 in the Herding Cats in the AI Age series. Paper 1 ("The Super Intelligent Five-Year-Old") established that AI needs doctrine, not more intelligence. Paper 2 ("The Digital Battle Staff") showed the military already built the coordination frameworks the civilian AI industry lacks. Paper 3 ("The PARA Experiment") demonstrated those principles in a live Obsidian vault laboratory. Paper 4 ("The Creative Middleman") dissected how Adobe surrendered its AI engine to competitors through a failure to coordinate. Paper 5 ("When the Cats Talk to Each Other") proved that two AI models with opposing design philosophies could negotiate a coordination protocol in real time. This paper scales from two cats talking to four cats forming a staff — and tests whether doctrine-structured multi-model ensembles outperform a single brilliant cat thinking alone.

---

## EXECUTIVE SUMMARY

On March 13, 2026, four frontier AI systems — Claude Opus 4.6, Gemini 3, ChatGPT (GPT-4o), and Grok (SuperGrok) — were assigned military staff roles and given the same strategic decision problem. The decision problem was the series' own publication strategy — a real decision with real stakes, not a contrived benchmark. The ensemble ran for approximately 15 minutes across three platforms. The solo baseline, Claude Opus 4.6 running the full MDMP (Military Decision-Making Process) alone, completed in approximately 8 minutes.

The findings are not "ensemble beats solo." They are more nuanced: the ensemble surfaced 6 strategic insights that the solo baseline missed, including 2 rated HIGH value. But the solo baseline produced superior operational execution detail — week-by-week execution plans with hour estimates that no ensemble member approached.

The thesis refined from the proof-of-concept is this: a doctrine-structured ensemble produces measurably better strategic analysis, while a solo model produces measurably better operational planning. Doctrine is the constant that makes both work.

This is the core lesson: teams exist not because intelligence is weak, but because no single brilliant person sees all angles. The military learned this truth at considerable cost. Now we watch AI models learning the same lesson in real time.

---

## 1. THE COORDINATION HYPOTHESIS

Papers 1–5 established the foundation: AI needs doctrine, the military has proven frameworks, two models can negotiate a coordination protocol, and structural failures occur without it. This paper asks the next empirical question: does a doctrine-structured multi-model ensemble — models organized as a military "staff," a coordinating body where each member owns a specific analytical domain — produce demonstrably better decisions than a single model with the same framework?

Why does this matter? Because the AI industry is transitioning from single-agent to multi-agent architectures. Every major lab is building ensemble systems, orchestration frameworks, and agentic teams. The industry assumes that more models = better decisions. But the field evidence is mixed.
A UC Berkeley study on multi-agent autonomy identified 14 distinct failure modes across 3 categories in multi-agent systems, with failure rates ranging from 41% to 86.7% depending on the topology.[^1] A Google/MIT collaboration found that multi-agent systems actually hurt performance on sequential decision tasks by 39–70%, while helping on parallel tasks by 80.9%.[^2]

The question is not "are multi-agent systems better?" It is "under what conditions do multiple agents outperform one?" And the military has been answering this question operationally for two centuries: when there is structure. Doctrine provides that structure.

The proof-of-concept documented in this paper is the first systematic test of whether applying military staff doctrine to multi-model AI coordination produces measurably different outcomes than an individual model working at peak capability.

---

## 2. RESEARCH DESIGN

![[Paper-6-Assets/figure-2-ensemble-config.png]]
*Figure 2: Ensemble Configuration — four frontier AI models assigned MDMP staff roles. Claude (Commander/Synthesis), Gemini (S2 Intelligence), ChatGPT (S3 Operations), Grok (Devil's Advocate).*

The research design is simple in concept, rigorous in execution. (This paper uses "multi-model ensemble" to distinguish doctrine-structured coordination from the broader "multi-agent systems" literature it cites; the two terms refer to overlapping but distinct constructs.)

**The Ensemble Configuration:**

| Role | Model | Platform | Assigned Function |
|------|-------|----------|-------------------|
| Solo Baseline (all roles) | Claude Opus 4.6 | CLI agent | Full 7-step MDMP, all staff positions |
| S2 Intelligence (Intelligence Officer) | Gemini 3 | gemini.google.com | Environmental scan, threat analysis, assumption challenges |
| S3 Operations (Operations Officer) | ChatGPT (GPT-4o) | chatgpt.com | Course of Action development, feasibility analysis, comparison matrices |
| Devil's Advocate | Grok (SuperGrok) | grok.com | Red team analysis, failure modes, second-order effects |
| Commander (Synthesis) | Claude Opus 4.6 | This series | Convergence/divergence analysis, delta calculation, final decision synthesis |

**The Test Decision Problem:** The series' own publication strategy. This was selected deliberately because it carries real stakes — the author must actually make this decision. It has genuine complexity: five papers at different stages of completion, multiple publication venues with different timelines and requirements, competing priorities (DARPA CLARA deadline April 10, SOCOM event April 13–17), and no objectively "correct" answer. It is not a contrived puzzle with a known solution. It is a judgment call.

**The Methodology:** Both the solo baseline and the ensemble models received an identical mission briefing — a structured narrative describing the content inventory, distribution infrastructure, market timing, target publication landscape, explicit constraints, and background context. The solo baseline was asked to run a complete seven-step MDMP analysis: mission receipt, mission analysis (facts/assumptions/constraints), COA development, COA analysis through red team, COA comparison, COA recommendation, and execution planning. The ensemble models received the same briefing but were constrained to their assigned MDMP roles. **Critically, all ensemble members operated independently: no model saw another model's output before submitting analysis.** This prevents consensus collapse and ensures each role contributes a genuinely distinct analytical perspective. A minimal code sketch of this orchestration pattern appears below.
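To make the dispatch-and-synthesize pattern concrete, here is a short Python sketch of the doctrine-structured loop just described. It is a sketch under stated assumptions, not the harness actually used in the proof-of-concept (which was browser-mediated by a human operator): `query_model` is a hypothetical placeholder for whatever client each platform exposes, the model identifier strings are labels rather than real API model names, and the role prompts are paraphrases of the briefing, not the exact text.

```python
from concurrent.futures import ThreadPoolExecutor

# Role assignments mirror the ensemble configuration table above.
# Prompt text and model identifiers are illustrative placeholders.
STAFF_ROLES = {
    "S2 Intelligence": ("gemini-3",
        "You are the S2. Produce an environmental scan, threat analysis, "
        "and assumption challenges for the mission briefing below."),
    "S3 Operations": ("gpt-4o",
        "You are the S3. Develop courses of action with feasibility "
        "analysis and a weighted comparison matrix."),
    "Devil's Advocate": ("supergrok",
        "You are the red team. Attack every aspect of the strategy: "
        "failure modes, credibility vulnerabilities, second-order effects."),
}

def query_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a platform-specific call (browser or API)."""
    raise NotImplementedError

def run_staff(briefing: str) -> dict[str, str]:
    # Independent generation: each role sees only the briefing, never another
    # role's output. This is the constraint that prevents consensus collapse.
    with ThreadPoolExecutor() as pool:
        futures = {
            role: pool.submit(query_model, model, f"{prompt}\n\n{briefing}")
            for role, (model, prompt) in STAFF_ROLES.items()
        }
        return {role: f.result() for role, f in futures.items()}

def synthesize(briefing: str, staff_outputs: dict[str, str]) -> str:
    # Commander synthesis runs only after all staff outputs are complete:
    # convergence/divergence analysis plus delta findings against the solo run.
    combined = "\n\n".join(f"## {role}\n{out}" for role, out in staff_outputs.items())
    return query_model("claude-opus-4.6",
        "You are the Commander. Identify convergence, divergence, and delta "
        f"findings in the staff outputs below.\n\n{briefing}\n\n{combined}")
```

In the proof-of-concept, the human operator played the part of this harness by hand: three browser tabs running in parallel, then a collect-and-paste into the Commander session.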
The commander (Claude) synthesized all outputs only after the staff had completed their independent analyses. The S2 intelligence officer received a prompt asking for an environmental scan, threat analysis, and assumption challenges. The S3 operations officer developed multiple courses of action with comparison matrices and feasibility analysis. The devil's advocate was explicitly instructed to attack every aspect of the strategy.

**Time Measurement:** Solo baseline: 8 minutes from briefing to a complete ~4,500-word analysis. Ensemble: approximately 15 minutes including human collection and synthesis time (3 models operating in parallel, plus reading and consolidation). AI outputs presented in this paper are representative excerpts; full model outputs are preserved in the project archive.[^10]

---

## 3. THE SOLO BASELINE: ONE CAT, SEVEN HATS

Give one model all seven MDMP hats and eight minutes — what do you get? A fully competent staff product that covers mission analysis, three courses of action, war-gaming, risk assessment, and an eight-week execution timeline. This is not a weak baseline. It is the best a single brilliant staff officer can produce under time pressure.

Solo Claude identified facts across five domains: content inventory, distribution infrastructure, author credentials, market timing, and target publication landscape. It documented eight explicit assumptions, ranging from "Paper 2 is publication-ready for journal submission" to "Journal exclusivity policies allow simultaneous Obsidian Publish hosting." It enumerated five constraints, the most significant being time (the DARPA CLARA deadline of April 10 creates a hard resource ceiling) and budget (under $10/month for tooling).

The solo baseline developed three complete courses of action:

**COA 1 ("Journal First"):** Submit Paper 2 to Small Wars Journal immediately, build credibility top-down, cascade to broader audiences afterward. This approach prioritizes the credibility anchor of academic publication.

**COA 2 ("Blitz"):** Launch everything available immediately across all platforms — journals, social, community — to maximize surface area before competing priorities absorb bandwidth.

**COA 3 ("Audience Segmented"):** Run three parallel but distinct campaigns targeting defense, tech/AI, and personal knowledge management communities with audience-specific messaging.

The solo baseline war-gamed each COA with devil's advocate attacks, identified second-order effects, and analyzed competitor landscape impacts. It scored the COAs against five weighted criteria: Readership Reach (25%), Credibility/Authority Building (25%), Speed to Impact (20%), Resource Efficiency (15%), and Risk Mitigation (15%). COA 1 scored 3.65/5, COA 2 scored 3.25/5, and COA 3 scored 3.30/5. (The mechanics of this weighted scoring are sketched in code below.)

Based on this analysis, solo Claude recommended a modified COA 1 — "Journal First with Fast LinkedIn" — adding immediate LinkedIn activation to accelerate visibility while maintaining the credibility-first sequencing.

The solo baseline then produced an eight-week execution timeline with specific actions, assigned owners, hour estimates, and explicit decision gates. Week 1: resolve Paper 2 QASA items, submit to Small Wars Journal, activate LinkedIn (13.5 hours). Week 2: schedule LinkedIn content, begin DARPA prep (16 hours, split between publication and proposal work). Weeks 3–4: DARPA sprint with publication on autopilot (23 hours total). Weeks 5–8: post-DARPA full deployment of Papers 1, 3, and supporting content (42 hours).
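The weighted comparison is simple arithmetic, but seeing the mechanics clarifies how the ranking emerges. In the sketch below, the criterion weights are the ones reported above; the per-criterion 1–5 raw scores are hypothetical placeholders chosen only so the totals reproduce the reported results (the full solo output, with its actual scores, is preserved in the archive[^10]).

```python
# Weighted decision matrix for COA comparison. The weights are from the paper;
# the raw 1-5 scores below are HYPOTHETICAL, chosen to reproduce the totals.
WEIGHTS = {
    "Readership Reach": 0.25,
    "Credibility/Authority Building": 0.25,
    "Speed to Impact": 0.20,
    "Resource Efficiency": 0.15,
    "Risk Mitigation": 0.15,
}

RAW_SCORES = {  # hypothetical per-criterion ratings, 1 (worst) to 5 (best)
    "COA 1 (Journal First)": [4, 4, 3, 4, 3],
    "COA 2 (Blitz)": [4, 2, 5, 3, 2],
    "COA 3 (Audience Segmented)": [4, 3, 4, 2, 3],
}

def weighted_score(scores: list[int]) -> float:
    """Weighted sum: sum of weight_i * score_i across the five criteria."""
    return sum(w * s for w, s in zip(WEIGHTS.values(), scores))

for coa, scores in RAW_SCORES.items():
    print(f"{coa}: {weighted_score(scores):.2f}/5")
# COA 1 (Journal First): 3.65/5
# COA 2 (Blitz): 3.25/5
# COA 3 (Audience Segmented): 3.30/5
```

The design choice worth noting is that the weights, not the raw scores, encode the strategy: with 50% of the weight on reach and credibility, a "Blitz" COA that wins on speed still loses the matrix.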
The solo baseline is operationally sound. The question is what it missed.

---

## 4. THE ENSEMBLE SPEAKS

![[Paper-6-Assets/figure-3-briefing-gemini.png]]
*Figure 3: Four-Panel Intelligence Briefing — Solo Baseline (orange), S2 Intelligence/Gemini (cyan), S3 Operations/ChatGPT (green), Devil's Advocate/Grok (red), with Commander synthesis at center. Image generated by Gemini.*

### 4.1 The Intelligence Officer (Gemini)

Gemini approached the problem as environmental intelligence. The landscape has shifted from "model capability" to "agent orchestration," Gemini observed. The market is bifurcated: saturated on high-level AI ethics and corporate agentic marketing; underserved on operational implementation of multi-agent systems under stress. The gap is "Battlefield-tested logic for AI-to-AI diplomacy and coordination."

Gemini identified three platforms with different characteristics: ArXiv dominates algorithmic math but lacks operational "so-what." Substack/Medium have a high noise-to-signal ratio. LinkedIn is essential for networking but insufficient for hosting 30K-word papers. The target audience — CTOs, defense tech leads, enterprise architects — is "desperate for frameworks that solve the coordination tax" (the coordination overhead documented in Paper 5[^8]).

Gemini's threat analysis surfaced a competitive threat solo Claude treated as generic: management consultancies. In Gemini's assessment:

> "McKinsey and Deloitte are rebranding Lean Six Sigma for AI. If they codify LSS + AI first, your unique angle is neutralized."

This is a clock-running threat. The window to be first with military doctrine applied to AI coordination is closing.

Gemini's assumption challenges were sharp:

- "Academic Journals = Credibility" — FALSE. In 2026, "Wartime Speed" favors demonstrated code and live pilots over peer-reviewed delays. ArXiv is the minimum; GitHub/Obsidian is the proof.
- "Military Angle = Niche Limiter" — FALSE. It is the PRIMARY differentiator.

Gemini explicitly noted:

> "The Digital Battle Staff concept is currently the only viable metaphor for managing 100+ autonomous agents. Use it to own the 'High Stakes AI' vertical."

Gemini's unique contribution was serialization strategy. Solo Claude treated Paper 2 as a monolithic journal submission. Gemini saw a content asset that could generate 10 weeks of publishing material: serialize the 33,500 words into "10 Operational Briefs" for Substack/LinkedIn. This solves the "TL;DR risk" both Gemini and Grok flagged while maximizing content utility per source word.

Gemini also proposed positioning the series as the "Field Manual for the Agent Network" in alignment with the Department of War's 2026 initiatives.[^9] This framing connects the research directly to government strategic priorities.

### 4.2 The Operations Officer (ChatGPT)

ChatGPT approached the problem as operational planning. It developed three complete courses of action with structured comparison matrices.

**COA 1 (Flagship Strategy):** Lead with the most rigorous paper ("The Digital Battle Staff") as the intellectual anchor, establish credibility first, then release supporting papers as a coordinated campaign. This mirrors military doctrine publication: capstone doctrine → supporting field manuals → case studies.

**COA 2 (Serial Commander's Brief):** Break the research into short, high-impact weekly articles. Treat each paper as a tactical briefing rather than an academic paper, optimized for online readership and aimed at viral reach.
**COA 3 (Academic Credibility Strategy):** Prioritize formal working papers and academic conferences first, then broader dissemination.

ChatGPT scored all three COAs against five dimensions: Readership Reach, Credibility Building, Speed to Impact, Resource Efficiency, and Risk Level.

ChatGPT's unique contribution — what distinguished it from solo Claude's analysis — was the hybrid recommendation. In ChatGPT's framing:

> "The optimal sequence is Doctrine → Articles → Case Studies → Book. Each phase feeds the next — the flagship establishes authority, articles extend reach, case studies prove application, and the book captures the whole."

This synthesis combines the flagship credibility of COA 1 with the serial distribution of COA 2. It generates more content touchpoints from the same source material. Solo Claude's Modified COA 1 added LinkedIn, but did not envision breaking papers into short-form articles feeding back to the flagship.

ChatGPT also provided specific timeline comparisons showing which COA reaches impact soonest: COA 2 (blitz) reaches public presence within 72 hours but at a credibility cost; COA 1 (journal first) has delayed initial impact but maximum long-term authority.

### 4.3 The Devil's Advocate (Grok)

Grok opened with the contrarian positioning: "This series is niche navel-gazing: a retired Army guy's pet project blending military jargon with AI hype." It then proceeded to dismantle every assumption with specific, vicious credibility attacks.

Grok's first line of attack: author positioning. "26 years in infantry/SOF is great for foxholes, not frontier AI. Lean Six Sigma? That's process optimization for factories, not multi-agent systems — critics will call it resume padding." No PhD, no affiliations with labs like OpenAI or DeepMind — expect ad hominems. The CMDP pilot[^3] "lacks reproducibility, sample size, or controls — easy to dismiss as cherry-picked."

Grok identified specific audience segments that will resist the military framing: tech purists will see it as rigid bureaucracy antithetical to agile iteration; academics will view it as pseudoscience; ethicists will frame it as militarization; international developers will resent it as US-centric imperialism.

Grok attacked the core thesis itself: "Doctrine is just fancy for protocols — every ML paper on multi-agent RL already covers coordination. Your thesis repackages basics like task decomposition as military wisdom. It's common sense wrapped in camo."

Then Grok war-gamed each publication approach: Self-publishing (Obsidian) is "ghettoized to note-taking nerds" with no SEO and a tiny audience. Academic journals will reject interdisciplinary work — "rejection rates >80%." Tech platforms will downvote military framing as problematic. LinkedIn will ignore non-corporate content.

Grok's second-order effects analysis was brutally specific: If traction occurs, expect backlash on autonomous weapons ethics. If no traction, the 50K+ words drafted represent months lost. If a competitor publishes similar work first, yours looks derivative. If the AI landscape shifts (AGI scenarios), the doctrine framework becomes quaint.

Grok's contrarian recommendation: "Scrap the series — it's unfocused bloat. Build a GitHub repo with CMDP code/simulations — let code do the talking, not essays." This is deliberately harsh, but it surfaces a genuine strategic question: in a field of PhDs building real systems, do essays compete effectively?
The value of Grok's analysis is not that it is correct — it is that these are the exact objections the author must anticipate and pre-empt in the papers themselves.

With all three staff voices captured — environmental intelligence, operational planning, and contrarian attack — the Commander's synthesis phase begins: where do the models agree, and where do they diverge?

---

## 5. CONVERGENCE AND DIVERGENCE

![[Paper-6-Assets/figure-4-convergence-chatgpt-a.png]]
*Figure 4: Convergence/Divergence Orbital Map — 5 key findings at center (all models agreed), unique ensemble insights orbiting outward by source model. Gemini (blue), ChatGPT (green), Grok (red). Image generated by ChatGPT.*

### High-Confidence Findings (Where All Models Aligned)

1. **Lead with Paper 2.** Every model — solo Claude, Gemini, ChatGPT, even Grok's attack — identified "The Digital Battle Staff" as the strongest asset. It is the most complete, rigorously edited, defensible piece. This is the flagship.

2. **Obsidian Publish is the lab, not the storefront.** Solo Claude called it "a portfolio, not a distribution channel." Gemini said "Second Brain, not First Impression." Grok called it "ghettoized to note-taking nerds." Universal agreement: host deep content here, but discovery happens elsewhere.

3. **Paper 1 is not ready for publication.** Solo Claude flagged 3 PAT blockers (Research Notes removal, Footnote 6 verification, Section 8.6 sourcing). Grok called it "amateur" in draft state. ChatGPT excluded it from early sequencing in 2 of 3 COAs. Consensus: do not post anywhere until editorial fixes are complete.

4. **Sequential release beats data dump.** Gemini's "Pulse strategy," ChatGPT's phased timelines, and solo Claude's staggered weeks all converge: sequencing creates audience-building momentum.

5. **The military angle is a differentiator, not a limiter.** Gemini was strongest here: "The Digital Battle Staff is the only viable metaphor for managing 100+ autonomous agents." Even Grok, attacking the military framing, acknowledged that "SOF + LSS + hands-on AI agent credentials" have no direct competitor.

### Unique Ensemble Insights (What the Ensemble Surfaced That Solo Missed)

Delta findings are rated by the Commander during synthesis: HIGH = immediately actionable with direct strategic impact; MEDIUM-HIGH = actionable but requiring further synthesis; MEDIUM = strategic insight requiring editorial integration into the papers.[^7]

**Delta Finding 1: The Serialization Strategy (Gemini) — HIGH Value**
Serialize Paper 2's 33,500 words into 10 "Operational Briefs" for Substack/LinkedIn. Solo Claude treated Paper 2 as monolithic. Gemini saw a content asset generating 10 weeks of material. This insight is tactical and immediately actionable — it converts a single publication event into a sustained content campaign.

**Delta Finding 2: The Hybrid COA (ChatGPT) — MEDIUM-HIGH Value**
"Doctrine → Articles → Case Studies → Book." ChatGPT synthesized multiple approaches into a hybrid generating more content touchpoints from the same material. Solo Claude's recommendation added LinkedIn but did not envision short-form article serialization.

**Delta Finding 3: The Consultancy Threat (Gemini) — MEDIUM Value**
McKinsey/Deloitte rebranding LSS for AI creates a specific, named competitive threat. Solo Claude treated competitor risk as generic. Gemini identified the precise threat vector and the urgency it creates.
**Delta Finding 4: The Credibility Attack Surface (Grok) — HIGH Value**
Detailed, specific credibility vulnerabilities that solo Claude avoided:

- "No PhD, no affiliations with labs" — ad hominem risk
- "CMDP pilot lacks reproducibility, sample size, or controls" — methodological vulnerability
- "Papers vary wildly in quality" — coherence problem
- "'Doctrine' is just fancy for protocols" — uniqueness challenge

These are the objections the author must pre-empt in the papers themselves, not just in publication strategy.

**Delta Finding 5: The "Professionalize or Shelve" Challenge (Grok) — MEDIUM Value**
Build a GitHub repo with CMDP code/simulations. Solo Claude never considered moving beyond essays. Grok's point — that in a field of PhDs building systems, essays compete on a weaker playing field — is a directional strategic challenge.

**Delta Finding 6: The DOW/SOCOM Alignment (Gemini) — MEDIUM Value**
Position the series as the "Field Manual for the Agent Network" to align with the Department of War's 2026 initiatives. Solo Claude mentioned SOCOM and DARPA timelines but did not make the explicit positioning recommendation.

---

## 6. WHAT SOLO DID BETTER

The ensemble is not universally superior. The solo baseline excelled in dimensions where the ensemble fell short.

**Operational Detail:** Solo Claude produced week-by-week execution plans with hour estimates, named tools (Postiz, Metricool), specific decision gates, and contingency branches. No ensemble member approached this level of actionable granularity. The solo baseline is a ready-to-execute operations order. The ensemble outputs are strategic analyses still requiring operational translation.

**Risk Register:** Solo Claude's eight-risk table with likelihood, impact, and specific mitigations is more actionable than Grok's attack vectors. Grok identified more vulnerabilities but provided fewer solutions.

**DARPA Integration:** Solo Claude uniquely factored the DARPA CLARA deadline (April 10) and SOCOM event (April 13–17) as hard constraints shaping the entire timeline. No ensemble member had this context.

**Assumption Validation:** Solo Claude identified five specific assumptions requiring validation (SWJ exclusivity policy, Military Review policy, etc.) and assigned deadlines. Ensemble members challenged assumptions abstractly but did not create actionable validation tasks.

The lesson: ensemble members think strategically but not executionally. A staff without an executive officer produces briefings, not orders.

![[Paper-6-Assets/figure-5-capability-comparison.png]]
*Figure 5: Solo vs Ensemble Capability Comparison — six dimensions: Strategic Breadth, Risk Identification, Operational Detail, Assumption Challenge, Contrarian Depth, Time Efficiency.*

---

## 7. THE REFINED THESIS

**Original Hypothesis:** "A structured multi-model ensemble outperforms an individual model applying the same framework."

**Refined After Proof-of-Concept:** "A doctrine-structured multi-model ensemble exposes strategic gaps that solo analysis misses, while solo analysis produces superior operational execution detail. The optimal pattern is ensemble for strategy, solo for operations — and doctrine is the constant that makes both work."

This maps directly to military organizational experience. The Commander's staff (ensemble) develops options, identifies threats, challenges assumptions, and surfaces risks. The Commander (solo decision-maker) selects a course of action and produces the order. The staff process (doctrine/MDMP) is what makes both work. A minimal sketch of this two-stage pattern follows.
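Here is a short Python sketch of the two-stage division of labor, assuming the same hypothetical `query_model` placeholder as the Section 2 sketch; the function name and prompt text are illustrative, not the proof-of-concept's actual prompts.

```python
def query_model(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a platform-specific model call."""
    raise NotImplementedError

def decide_and_order(briefing: str, staff_outputs: dict[str, str]) -> str:
    # Stage 1 (ensemble, strategic): the staff has already produced its
    # independent analyses; the commander synthesizes them into a selected
    # course of action, with convergence, divergence, and delta findings.
    staff_dump = "\n\n".join(f"## {role}\n{out}" for role, out in staff_outputs.items())
    coa = query_model(
        "claude-opus-4.6",
        "You are the Commander. Select and justify a course of action "
        f"from the staff analyses below.\n\n{briefing}\n\n{staff_dump}",
    )
    # Stage 2 (solo, operational): one model turns the chosen COA into an
    # executable order: week-by-week tasks, owners, hour estimates, and
    # decision gates. This is the dimension where solo outperformed.
    return query_model(
        "claude-opus-4.6",
        "Produce a week-by-week execution plan with hour estimates and "
        f"decision gates for this course of action:\n\n{coa}",
    )
```

The doctrine lives in the structure, not the models: remove the role boundary between the two calls and the pattern collapses back into one long unstructured prompt.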
Without doctrine, the ensemble is four models talking past each other, each outputting analysis nobody synthesizes. With doctrine, each model's unique training biases become complementary strengths rather than contradictory noise.

Claude brings training that emphasizes safety and balanced helpfulness. Gemini brings broad search capability and current-events awareness. ChatGPT brings operational planning experience. Grok brings aggressive truth-seeking and assumption challenging. Without structure, these differences produce incompatible outputs. With MDMP roles, these differences fill specific gaps in coverage.

The MDMP is not the only possible doctrine for multi-model coordination. It is not even the best doctrine for all problems. But it proves that doctrine — structure, role assignment, defined interfaces, and synthesis procedures — is the actual variable that determines whether multiple models cooperate or clash.

![[Paper-6-Assets/figure-6-thesis-architecture.png]]
*Figure 6: Refined Thesis Architecture — Doctrine (MDMP) as foundation, Ensemble pillar (strategic analysis) and Solo pillar (operational planning) supporting optimal decision.*

---

## 8. LIMITATIONS AND HONEST ASSESSMENT

The comparison in this paper is between ensemble coordination and a default solo baseline, not between ensemble and a maximally-prompted solo model. The solo baseline received a standard MDMP briefing. Some ensemble findings — the McKinsey/Deloitte threat, the GitHub professionalization challenge — could have been elicited from solo Claude with targeted domain questions ("What management consultancy threats exist in your competitive landscape?"). The proof-of-concept does not isolate whether ensemble value comes from multi-model diversity or from role-constraint prompting forcing coverage of specific analytical domains.

This does not invalidate the findings. In practice, knowing which domain questions to ask is itself the knowledge the ensemble structure provides. The S2 role tells you to conduct threat analysis even when you do not know to ask for it. But Paper 6 must frame this honestly: the ensemble outperformed a standard solo baseline. Whether a more thoroughly prompted solo model would close the gap is a research question this proof-of-concept does not resolve.

The roughly 2x time cost is real: the ensemble required approximately 15 minutes; solo required approximately 8 minutes. For a single strategic decision affecting multiple publication platforms and high-stakes timelines, this is an acceptable tradeoff. For operational planning or time-critical response, the ensemble cost is prohibitive.

The proof-of-concept represents n=1 decision problem. Replication with different decision types — technical architecture choices, resource allocation conflicts, personnel decisions — is required before generalizing the findings.

The findings may be model-version-sensitive. This proof-of-concept used frontier models (Claude Opus 4.6, Gemini 3, GPT-4o, SuperGrok). Whether the same ensemble structure produces comparable results with less capable models — or whether the value comes from the structure rather than raw capability — is an open question. If the findings are structure-dependent, they generalize broadly. If capability-dependent, they apply only at the frontier.

Browser-mediated interaction introduces latency and format constraints not present in API-driven ensemble coordination. The actual performance of a doctrine-structured ensemble operating via direct machine-to-machine protocols remains unexplored.
---

## CONCLUSION

Paper 6 confirms what Papers 1–5 hypothesized: AI agents are cats. They have different instincts shaped by different training philosophies. They have different failure modes. They have different blind spots. They have different strengths.

When you put four cats in a room without structure, you get chaos. This is the documented failure mode from UC Berkeley's multi-agent autonomy research: multi-agent systems fail 41–86.7% of the time when topology and interaction protocols are poorly designed.[^1] When you assign those four cats to a structured military staff with defined roles, assigned missions, and synthesis procedures — when you give them doctrine — you get a team. The ensemble produced 6 strategic insights that a solo model missed. The solo model produced superior operational detail. The combination is stronger than either alone.

The series arc from Paper 1 to Paper 6 is now complete: AI needs doctrine (Paper 1) because the military has it (Paper 2), and it works in practice (Paper 3), whereas without it you get failure (Paper 4). Two models can negotiate a coordination protocol (Paper 5). Four models can staff a decision in a structured timeframe (Paper 6).

The bridge to Paper 7 is obvious: if four models can staff a decision in 15 minutes using browser-mediated chat, what happens when you build a platform where they communicate directly via protocol? What happens when the synthesis loop is automated? What happens when the feedback from decision to outcome flows back into the ensemble to create learning?

Consider the proof-of-concept's most telling moment: Grok — a model trained to challenge everything — called the entire series "niche navel-gazing" and recommended scrapping it. That is exactly what a good devil's advocate is supposed to do. Without the Devil's Advocate role assignment, that attack would have been noise. With it, the attack became the paper's most valuable credibility stress test. Structure did not silence Grok's contrarian instinct. Structure aimed it.

The future is not "which AI model is smartest." The future is "how do we coordinate multiple AI systems to think like a general staff?" The cats have learned to herd themselves — but only when given doctrine to follow.

![[Paper-6-Assets/figure-7-series-arc-gemini.jpeg]]
*Figure 7: Series Arc — The journey from Paper 1 ("Problem Statement") through Paper 6 ("Ensemble Decision") to Paper 7 ("The Platform") on the horizon. Each landmark represents a paper in the Herding Cats series. Image generated by Gemini.*

![[Paper-6-Assets/figure-8-metrics-grok-b.jpg]]
*Figure 8: PoC Metrics Dashboard — Solo vs Ensemble comparison across six dimensions. COAs Generated (3 vs 7 — solo produced 3; the ensemble total counts ChatGPT's 3 formal COAs plus Gemini's 2 implicit strategy variants and Grok's 2 contrarian alternatives), Unique Risks (8 vs 16 — aggregate across all ensemble member outputs), Ensemble-Only Insights (0 vs 6 — insights surfaced by the ensemble that were absent from solo analysis; solo excelled in operational detail, not measured here), Contrarian Depth (2/10 vs 8/10), Time to Decision (8 vs 15 min), Operational Specificity (9/10 vs 7/10). Image generated by Grok.*

---

## FOOTNOTES

[^1]: UC Berkeley EECS-2025-164: "From Local Coordination to System-Level Strategies: Designing Reliable, Societal-Scale Multi-Agent Autonomy Across Scales," Victoria Tuck, 2025.
Identified failure modes across multiple categories in multi-agent systems, with failure rates ranging from 41% to 86.7% depending on system topology and inter-agent communication protocol. Available at: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-164.html

[^2]: "Towards a Science of Scaling Agent Systems," Yubin Kim, Ken Gu, et al. (Google Research, MIT, Google DeepMind), 2025. arXiv:2512.08296. Found multi-agent systems degrade sequential task performance by 39–70% while improving parallel task performance by 80.9%. Implication: agent count alone is not a sufficient performance lever; task structure determines the optimal ensemble size. Available at: https://arxiv.org/abs/2512.08296

[^3]: The Cross-Model Deliberation Protocol (CMDP) was first documented in Paper 5 of this series. The protocol consists of eight components: independent generation, blind critique, revealed-identity track, synthesis phase, live fact-check module, probability distributions, training prior disclosure, and open publication.

[^4]: "The Digital Battle Staff" (Paper 2) runs 33,500 words across 68 footnotes and covers four major sections: doctrine fundamentals, military staff structure applied to AI, field evidence from a 39-session PARA vault laboratory, and operational recommendations.

[^5]: DARPA CLARA proposal deadline: April 10, 2026. SOCOM Agentic AI Experimentation event: April 13–17, 2026. Both represent time-bound opportunity windows that shaped the publication strategy analysis.

[^6]: The solo baseline analysis was produced by Claude Opus 4.6 operating autonomously for approximately 8 minutes. The ensemble analysis involved three parallel AI systems operating across different platforms (Gemini 3, ChatGPT, Grok), each accessed via browser. The human operator collected outputs and provided them to the Commander (Claude) for synthesis, totaling approximately 15 minutes including collection time.

[^7]: Definition of Ensemble Value: findings that appeared in ensemble member outputs but were not present in the solo baseline analysis. Six such findings were identified and categorized by delta value (HIGH: immediately actionable, direct impact; MEDIUM-HIGH: actionable but requiring synthesis; MEDIUM: strategic insight requiring editorial integration; LOW: observed but not included in final synthesis).

[^8]: "The Coordination Tax" — the baseline overhead cost of multi-model coordination (time, latency, complexity). The term refers to the overhead cost, not to fidelity degradation: Paper 5's CMDP pilot documented a 15–20% fidelity *improvement* when doctrine-structured coordination was applied, suggesting that under structured conditions the tax is recoverable and generates positive return.

[^9]: Executive Order 14347, "Restoring the United States Department of War," signed September 5, 2025. The order directs the Department of Defense to adopt "Department of War" as a secondary title for non-statutory purposes. The statutory name remains "Department of Defense" pending Congressional action. See: https://www.whitehouse.gov/presidential-actions/2025/09/restoring-the-united-states-department-of-war/

[^10]: Full model outputs are archived in the project directory: solo baseline at `Paper-6-PoC-Solo-Baseline.md`, Gemini S2 at `Paper-6-PoC-Ensemble-Gemini-S2.md`, ChatGPT S3 at `Paper-6-PoC-Ensemble-ChatGPT-S3.md`, Grok Devil's Advocate at `Paper-6-PoC-Ensemble-Grok-DA.md`, and Commander's Synthesis at `Paper-6-PoC-Commander-Synthesis.md`.
Excerpts in this paper preserve original wording; selections were made for concision and representative coverage.

---

## Series Navigation

| | |
|---|---|
| **This paper** | Paper 6 of 7 |
| **Previous** | [[Paper-5-When-the-Cats-Talk-to-Each-Other\|← Paper 5: When the Cats Talk to Each Other]] |
| **Case Study** | [[Case-Study-Session-Close-Automation\|Case Study 1: Session Close Automation]] |
| **Home** | [[Home\|← Series Home]] |

## Related

- Herding Cats in the AI Age (series home)
- Paper 5: When the Cats Talk to Each Other