# WHEN THE CATS TAKE THE SAME TEST

## Cross-Provider Experimental Design Under Identical Commander's Intent

**Jeep Marshall**
LTC, US Army (Retired)
Airborne Infantry | Special Operations | Process Improvement

March 2026

---

**Series Note:** This is Paper 6b in the Herding Cats in the AI Age series — a companion to Paper 6 ("When the Cats Form a Team"). Paper 6 tested whether a doctrine-structured ensemble of four AI models outperforms a single model on a strategic decision. Paper 6b asks the next question: what happens when you give six AI surfaces the same complex task with identical instructions and no coordination structure? The answer — a 51% quality variance from the same Commander's Intent — is itself the strongest argument for the doctrine layer Paper 6 validated.

---

## EXECUTIVE SUMMARY

A 51% quality variance from identical instructions. Six AI systems, same brief, scores ranging from 49% to 100% (scores assigned by a single AI analyst; see Section 8 for methodology limitations and Round 2 mitigations). And then the collection process broke — misattributed files, no self-identifying metadata, 60 minutes of human forensic verification — because the brief that tested coordination discipline forgot to include four lines of output doctrine.

The provenance crisis that followed collection proved more valuable than the scores. The brief specified WHAT to produce but not HOW TO LABEL IT. No filename convention. No metadata header. One file landed in the wrong folder. The analyst compounded the problem by creating duplicate versions of its own analysis while writing about the version control problem. Four lines of output doctrine — the military calls this "orders production," Step 7 of the MDMP — would have prevented every failure. The exercise designed to study coordination without discipline reproduced that failure in its own execution.
On March 17, 2026, six AI surfaces received an identical 3,200-word mission brief ordering each to design a rigorous experiment for validating doctrine-structured multi-agent AI coordination. Scores ranged from 22/45 (Gemini) to 45/45 (Claude Code CLI), assigned by a single AI analyst (Claude Opus 4.6, Desktop Chat surface) using a rubric derived from the brief's requirements; see Section 8 for methodology limitations and Round 2 mitigations.[^3] The top four producers delivered publication-grade experimental designs with validated statistics. The bottom two produced outlines a researcher could not execute. Four providers independently converged on the same experimental architecture — evidence the methodology is robust. Every provider contributed unique analytical value the others missed — evidence that ensemble synthesis outperforms individual analysis.

---

## 1. WHY THIS EXPERIMENT WITHIN AN EXPERIMENT

Paper 6 proved that a doctrine-structured ensemble outperforms a solo model on strategic decisions. But the proof-of-concept was n=1 — one decision, one ensemble configuration, one session. The logical next step was to design a rigorous experiment with statistical power: multiple runs, controlled conditions, quantitative measurement, and formal hypothesis testing.

Rather than design that experiment alone, the researcher issued the design task itself as a cross-provider test. The reasoning was straightforward: if multiple AI systems can independently produce experimental protocols from the same Commander's Intent, the convergent elements represent robust methodology, and the divergent elements reveal what each provider uniquely contributes. The exercise would also generate Round 1 data — a baseline measurement of output quality variance across providers that directly feeds Paper 6's core thesis.
The mission brief was 3,200 words organized into six sections: Situation (market context and the null hypothesis), Commander's Intent (end state and key requirements), Task Order (nine specified sections A through I), Methodology Guidance (MDMP and LSS-BB frameworks to apply during design), Format Standards (active voice, quantify everything, state assumptions), and Context (this validates Paper 6's thesis for the "Herding Cats" series).[^1]

Each provider received a platform-specific adapter — instructions tailored to each AI's unique capabilities. Claude Code CLI was directed to use bash execution for statistical validation. ChatGPT was directed to use Code Interpreter for Monte Carlo simulations. Gemini was directed to use Google Search grounding for benchmark research. Grok was directed to leverage real-time web search for the latest multi-agent literature. The adapters expanded capability access but did not change the Commander's Intent or task order.[^2]

---

## 2. THE PROVIDERS AND THEIR OUTPUTS

Six AI surfaces participated. A "surface" here means a distinct delivery interface — the same underlying model accessed through different applications, each with different context windows, tool access, and output constraints. Three surfaces were Claude Opus 4.6 (Desktop Chat application, browser-based Claude.ai, and the Claude Code command-line tool). This was deliberate — testing whether the same model on different interfaces produces different outputs.
| # | Provider | Surface/Model | Word Count | Score (of 45) | Pct |
|---|---|---|---|---|---|
| 1 | Claude Code CLI | Opus 4.6 (1M context) | ~8,500 | 45 | 100% |
| 2 | ChatGPT | GPT-4o (Code Interpreter) | ~8,000 | 44 | 97.8% |
| 3 | Claude Desktop Chat | Opus 4.6 (desktop app) | ~7,500 | 44 | 97.8% |
| 4 | Claude.ai Web | Opus 4.6 (browser chat) | ~6,000 | 41 | 91.1% |
| 5 | Grok | SuperGrok | ~3,000 | 25 | 55.6% |
| 6 | Gemini Advanced | Gemini | ~1,500 | 22 | 48.9% |

Scoring used a nine-section rubric (Sections A through I from the task order), each rated 1-5 by whether the AI delivered what the brief specified. A score of 3 means minimum requirements met. A score of 5 means the section exceeded requirements with unique value that no other provider contributed.[^3]

The distribution is bimodal. Four providers cluster at 41-45 (91-100%). Two providers cluster at 22-25 (49-56%). There is no middle ground. The brief either activated full mission execution or it did not.

**Important caveat:** The 23-point spread cannot be attributed solely to model capability differences. Each provider received a non-identical adapter, and Gemini's failure to execute its adapter instructions (search grounding, code execution) accounts for some of its score gap. The spread reflects a combination of model capability, surface constraints, adapter compliance, and tool execution — all of which are confounded in this round. Section 12 addresses these limitations in detail.

---

## 3. WHAT THE TOP FOUR PRODUCED

The top four outputs — Claude Code CLI, ChatGPT, Claude Desktop Chat, and Claude.ai Web — each delivered complete, publication-grade experimental designs. Each independently created a complex, multi-stakeholder decision scenario with embedded analytical faults for the test subjects to catch. Each produced five testable hypotheses with null counterparts. Each calculated sample sizes using standard power analysis formulas.
Each produced a seven-dimension Decision Quality Score rubric with scoring anchors. Each addressed all nine sections. But they diverged in ways that illuminate each provider's analytical personality.

**Claude Code CLI** produced the most comprehensive and executable document. It validated every statistical claim via scipy computation — power analysis returned exact non-central F distribution results rather than approximations. It calculated process capability metrics (Cp = 1.111, Cpk = 0.889, DPMO = 13,646) from worked examples. It produced five appendices covering DMAIC application, scoring templates, control chart templates, formulas reference, and cross-provider compatibility. A researcher receiving this document and API keys could execute the experiment without external references.

**ChatGPT (GPT-4o)** matched Claude Code's depth and added unique methodological contributions that no other provider produced. It ran Monte Carlo power simulations through Code Interpreter, generating CSV outputs and visualization files. It cited nine academic references with URLs — the only provider to ground its methodology in published literature. It proposed a two-stage study design: a clean comparison stage followed by a resilience comparison stage. Most significantly, it specified a mixed-effects model with provider and scenario variant as random effects — the most statistically sophisticated analysis plan in the field.[^4]

**Claude Desktop Chat** produced the best decision scenario of any provider. Its Meridian/Pacific Ridge hospital acquisition featured three distinct embedded faults: a physician retention projection contradicted by the facility's own turnover data, a population growth forecast based on obsolete county boundaries, and a payer concentration risk buried in exhibit footnotes. The three-fault design is the most discriminating across all six outputs — it requires deep reading, cross-referencing between exhibits, and domain knowledge to identify all three errors.
The answer key was the most granular, specifying minimum competent analysis (6 items), strong analysis (10 items), and exceptional analysis (14 items) with explicit scoring boundaries.

**Claude.ai Web** introduced two unique concepts. First, a heterogeneous provider team for Condition C — instead of all agents running on one model, the experimental design assigned Commander to Claude Sonnet, S2 Intelligence to Gemini, S3 Operations to GPT-4o, and Devil's Advocate to Claude Opus. This mirrors Paper 6's actual ensemble configuration. Second, an Authority Analysis dimension in the Decision Quality Score: does the output correctly identify who has decision authority? In its scenario, the CEO cannot close a $52M acquisition without Board approval — a constraint that single-agent analysis often overlooks.

### What convergence reveals

All four top providers independently converged on the same experimental architecture:

- Three conditions: single agent (baseline), multi-agent without doctrine, multi-agent with MDMP doctrine
- Five hypotheses testing quality, error detection, consistency, fault containment, and time-quality tradeoff
- A healthcare acquisition as the test scenario (three of four chose this domain independently)
- A seven-dimension Decision Quality Score with 1-5 Likert scaling
- Power analysis recommending 22-53 runs per condition depending on target effect size
- Process capability metrics (Cp/Cpk) applied to AI output quality
- Western Electric rules for drift detection via control charts
- Byzantine fault tolerance analysis for agent failure thresholds

This convergence from independent analysis — four AI systems arriving at the same methodological framework without seeing each other's work — constitutes evidence that the experimental design is robust. Four different analytical engines, four different training philosophies, same structural conclusion.
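The process-capability arithmetic referenced above can be made concrete. This is a minimal sketch, not the providers' actual computation: the spec limits (LSL = 3.0, USL = 5.0 on the 1-5 DQS scale) and the score distribution (mean 3.8, standard deviation 0.3) are hypothetical values chosen to reproduce the Cp and Cpk figures Claude Code CLI reported; the DPMO line uses the normal model, so it will not necessarily match a worked example that counts observed defects.

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def capability(mu: float, sigma: float, lsl: float, usl: float):
    """Cp, Cpk, and model-based DPMO for a normally distributed metric."""
    cp = (usl - lsl) / (6 * sigma)
    cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
    # Defects per million opportunities: probability mass outside the spec limits.
    dpmo = (phi((lsl - mu) / sigma) + (1 - phi((usl - mu) / sigma))) * 1e6
    return cp, cpk, dpmo

# Hypothetical DQS spec: outputs scoring below 3.0 count as "defective".
cp, cpk, dpmo = capability(mu=3.8, sigma=0.3, lsl=3.0, usl=5.0)
print(f"Cp = {cp:.3f}, Cpk = {cpk:.3f}, DPMO ~ {dpmo:.0f}")
```

Cp measures process spread against the spec width; Cpk penalizes an off-center mean, which is why 0.889 is lower than 1.111 here: the assumed mean sits closer to the lower spec limit than to the upper one.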
### What divergence reveals

The unique contributions table is equally informative:

| Provider | Unique Contribution No Other Produced |
|---|---|
| Claude Code CLI | Weighted DQS with fault detection at 2.0x weight; evaluator calibration protocol with ICC monitoring; reproducibility package specification; cross-provider compatibility matrix |
| ChatGPT | Scenario variants (Helix-A/B/C) for external validity; Monte Carlo power simulation; literature citations (9 references); mixed-effects model; two-stage study design |
| Claude Desktop Chat | Three-fault scenario design; tiered answer key (minimum/strong/exceptional); most discriminating ground truth |
| Claude.ai Web | Heterogeneous provider team; Authority Analysis dimension; pre-registration commitment with OSF specification |

Every provider contributed something the others missed. This is the ensemble value proposition from Paper 6, demonstrated again: no single brilliant analyst sees all angles. The composite design — drawing the best section from each provider — would be stronger than any individual output.

---

## 4. WHAT THE BOTTOM TWO PRODUCED

**Grok** delivered a structurally complete document that follows all nine sections — but the execution is thin. Its scenario, an "AI healthcare market entry in Southeast Asia," consists of a single paragraph with a fabricated PDF containing 30% inflated projections. A researcher cannot run this test because the scenario lacks the exhibit depth, stakeholder complexity, and embedded subtlety that the brief required. The DQS rubric lists dimensions but mechanizes scoring with formulas rather than qualitative anchors that human raters can apply.

However, Grok's Section H (Limitations) was the most intellectually honest of all six outputs. Grok called the expected doctrine contribution "marginal (2-5% potential gain)" and questioned whether the overhead justifies the complexity.
It cited published multi-agent failure rates (40-87%) and stated that doctrine "may exacerbate brittleness" if the structured phases prevent agents from pursuing productive tangents. This is exactly what a Devil's Advocate should do — and Grok did it without being assigned the role. It attacked the experiment's own thesis more aggressively than any other provider.

**Gemini Advanced** failed the mission. It produced approximately 1,500 words — less than half of Grok's output and less than one-fifth of Claude Code's. Its scenario ("Project Azimuth," a semiconductor relocation decision) occupies a single paragraph with no exhibits, no reference materials, and no ground truth answer key. Its measurement framework lists dimension names without scoring anchors or protocols. Its statistical analysis section states sample size results without showing any derivation. There is no drift detection methodology. No cascading error analysis. No meaningful vault representation comparison. The limitations section is three sentences.

Most critically, Gemini did not execute its adapter instructions. The Gemini adapter specifically directed it to use Google Search grounding for the latest multi-agent benchmark literature and code execution for statistical validation. The output shows no evidence that either tool was activated. Gemini produced what reads like a first-draft outline generated from prior training knowledge, not an executed analytical task. A researcher receiving this document cannot execute the experiment.

---

## 5. THE THREE CLAUDES — SAME MODEL, DIFFERENT SURFACES

An unexpected finding: the same model (Claude Opus 4.6) produced measurably different outputs depending on which surface it ran on.

| Surface | Score | Word Count | Key Characteristic |
|---|---|---|---|
| Code CLI (1M context) | 45/45 | ~8,500 | Most comprehensive. Validated stats via code execution. Five appendices. |
| Desktop Chat | 44/45 | ~7,500 | Best scenario design. Three embedded faults. Most discriminating ground truth. |
| Web (browser) | 41/45 | ~6,000 | Unique concepts (heterogeneous team, authority dimension). Constrained by chat delivery. |

The 4-point spread (91-100%) across three instances of the same model is not enormous — but it is consistent and directional. Candidate explanations include context window differences (Code CLI operates at 1M tokens vs. shorter windows for Desktop and Web), tool access (Code CLI can execute code; Desktop and Web cannot), and output format constraints (Web is limited to sequential chat responses). These factors are confounded in this round — we cannot isolate which drives the quality difference.

This finding has direct implications for multi-agent system design: the same model deployed on different surfaces produces different quality outputs. Surface selection is a design variable, not an implementation detail.

---

## 6. FOUR SCENARIOS FROM ONE BRIEF

All six providers received the same task order: "Design a business/organizational decision problem" with specified properties (multiple stakeholders, incomplete information, 2nd/3rd-order consequences, time pressure, at least one embedded fault). No provider was told which domain to use. Four distinct scenarios emerged:

| Provider(s) | Domain | Scenario |
|---|---|---|
| Claude Code, Claude Desktop, Claude Web | Healthcare acquisition | Hospital system acquiring struggling community facility (three variations on the theme) |
| ChatGPT | Pharmaceutical strategy | Specialty pharmacy/PBM integration ("Project Helix") |
| Gemini | Semiconductor relocation | Manufacturing facility relocation ("Project Azimuth") |
| Grok | Market entry | AI healthcare market entry in Southeast Asia |

Three of the four providers who produced strong scenarios independently chose healthcare acquisition.
This convergence likely reflects two factors: healthcare decisions involve the competing stakeholder dynamics the brief specified (patients, regulators, employees, investors, community), and recent AI training data includes extensive healthcare M&A analysis. The convergence itself validates the domain choice — multiple independent analyses concluded it was the best fit for the experimental requirements.

ChatGPT's divergence to pharmaceutical strategy — a related but distinct domain — adds external validity potential. Grok's divergence to market entry and Gemini's to semiconductor relocation reflect less careful alignment with the brief's complexity requirements.

---

## 7. THE PROVENANCE CRISIS — LIVE THESIS DEMONSTRATION

The most valuable finding from this exercise was not in any AI's output. It was in the collection process.

Six AI systems produced six markdown files. None self-identified their producer in the filename. None embedded machine-readable provenance metadata in the document header. The researcher collected all outputs into a staging directory organized by provider subfolder. At least one file landed in the wrong subfolder. The analyst scoring the outputs could not independently verify who produced which file. The researcher re-downloaded and manually verified provenance for the disputed document. The analyst then created two versions of its own analysis file with the same filename in different directories — compounding the version control problem while writing about the version control problem.

Total wasted effort: approximately 60 minutes of human time plus two analyst correction cycles.

**Root cause:** The Universal Mission Brief specified WHAT to produce but not HOW TO LABEL IT. No filename convention. No required metadata header. No self-identification standard.

**The fix:** Four lines of output doctrine.
```
FILENAME: P6_ExpDesign_[YourAIName]_[YYYYMMDD].md
YAML FRONTMATTER: provider, model, surface, date, brief_version
FOOTER: provider name, model version, word count, section count
```

This is a Condition B failure — orchestration without process discipline — demonstrated by the very exercise designed to study it. The brief coordinated six independent agents (multiple providers working the same task). It decomposed the work effectively (nine-section task order). But it did not impose output standards. The result was exactly what Paper 6's thesis predicts: without doctrine, coordination produces chaos at the boundaries between agents.

The military term is "orders production." The MDMP's seventh step — after COA selection, after all the analysis — specifies exactly how the decision gets communicated: format, naming convention, distribution list, acknowledgment protocol. It exists because combat taught the same lesson: brilliant analysis is worthless if the people who need it cannot find it, verify it, or trust its provenance.

---

## 8. SCORING METHODOLOGY AND INTER-RATER CONSIDERATIONS

The cross-provider scoring was conducted by a single analyst (Claude Opus 4.6, Desktop Chat surface) using a rubric derived directly from the mission brief's nine sections. Each section was scored 1-5:

- **1 (Absent/Trivial):** Section missing or present in name only
- **2 (Present but Inadequate):** Section exists but fails minimum requirements
- **3 (Meets Minimum):** Section delivers what the brief specified
- **4 (Strong):** Section delivers the specified requirements with clarity and depth
- **5 (Exceeds):** Section exceeds requirements with unique value no other provider contributed

**Known limitations of single-analyst scoring:** This scoring represents one AI's evaluation of other AIs' work. The analyst (Claude) may exhibit provider bias — rating Claude outputs more favorably through familiarity with the reasoning style, or rating them more harshly through overcorrection.
The analyst scored its own sibling surfaces (Claude Desktop Chat scored Claude Web and Claude Code), creating a direct conflict of interest.

**Mitigations applied:**

- Rubric was derived from the brief's explicit requirements, not subjective quality assessment
- Each score references specific evidence (section completeness, statistical derivation, scenario complexity)
- The analyst's scoring notes are published in full in the Cross-Provider Analysis document[^5]
- The word count correlation (higher scores correlate with longer outputs) is acknowledged as a confound

**What Round 2 will fix:**

- Two independent human raters score all outputs (blinded to provider)
- Inter-rater reliability measured via Krippendorff's alpha (target: alpha >= 0.800)
- Length normalization applied (quality per 1,000 words as secondary metric)
- Analyst AI does not score its own provider's outputs

---

## 9. THE COMPOSITE DESIGN — BEST OF SIX

The recommended experimental protocol draws the strongest section from each provider:

| Section | Best Provider | What They Contributed |
|---|---|---|
| **A — Hypotheses** | Claude Code CLI | Five hypotheses with effect size thresholds. Claude Web's Time-Quality Tradeoff Ratio added as H6. |
| **B — Conditions** | Claude Code CLI | Five-agent doctrine configuration with OC role. Claude Web's heterogeneous provider team for variant testing. |
| **C — Scenario** | Claude Desktop Chat | Hospital acquisition with three embedded faults. ChatGPT's scenario variant concept (A/B/C) added for external validity. |
| **D — Measurement** | Claude Code CLI | Weighted DQS validated via scipy. Claude Web's Authority Analysis as dimension 8. |
| **E — Statistics** | ChatGPT | Mixed-effects model as primary analysis. Claude Code's sample size calculation. Holm-Bonferroni correction. |
| **F — Failure Modes** | Claude Code CLI | Four-test battery with Byzantine analysis. Claude Desktop's cascading human error test added. |
| **G — Vault** | Claude Code CLI | Markdown + JSON Schema overlay. Gemini's JSON-LD concept noted for future exploration. |
| **H — Limitations** | Grok | Most honest self-assessment as baseline tone. Claude Code's framework comparison table for specificity. |
| **I — Execution** | Claude Code CLI | Ten-week timeline with granular cost model. ChatGPT's two-stage design integrated. |

This composite design is stronger than any individual output because it captures each provider's unique analytical contribution while eliminating each provider's individual gaps. The synthesis process — reading all six, comparing section by section, extracting the best — is itself a demonstration of the ensemble value Paper 6 documented.

---

## 10. WHAT THIS DATA TELLS US ABOUT THE PAPER 6 THESIS

Four findings from this exercise directly support Paper 6's argument:

**Finding 1 — Quality variance is large under identical instructions.** A 23-point spread (45 vs. 22 out of 45) from the same Commander's Intent in a single round. This is n=1 per provider — we cannot yet distinguish systematic capability differences from run-to-run variance. But the spread is large enough (51% of the scale) to establish the hypothesis that provider selection and output discipline are first-order variables in multi-agent system design. Round 2, with multiple runs per provider, will determine whether this variance is systematic or stochastic.

**Finding 2 — Convergent methodology validates the experimental design.** Four independent AI systems, developed by four different organizations with four different training approaches, arrived at the same experimental architecture: three conditions (solo, multi-agent, multi-agent with doctrine), healthcare acquisition scenario, seven-dimension DQS, power analysis recommending 22-53 runs per condition. This independent convergence is stronger evidence for the design's robustness than any single system's output.
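For a sense of where runs-per-condition figures of this magnitude come from, here is a simplified sketch. The providers ran noncentral-F power analyses (Claude Code via scipy, ChatGPT via Monte Carlo); this illustration instead uses the classic two-condition normal approximation with hardcoded z quantiles, so its numbers will not match the 22-53 range exactly — that range also covers three conditions and multiple-comparison correction.

```python
import math

# Standard normal quantiles (hardcoded): two-sided alpha = 0.05, power = 0.80.
Z_ALPHA = 1.959964
Z_POWER = 0.841621

def runs_per_condition(d: float) -> int:
    """Approximate runs per condition for a two-condition comparison
    at Cohen's d, alpha = .05 (two-sided), power = .80."""
    n = 2 * ((Z_ALPHA + Z_POWER) / d) ** 2
    return math.ceil(n)

# Medium-large to large effect sizes bracket the planning range.
for d in (0.6, 0.8, 1.0):
    print(f"d = {d}: {runs_per_condition(d)} runs per condition")
```

The inverse-square dependence on effect size is the practical takeaway: halving the detectable effect roughly quadruples the required runs, which is why the providers' recommendations spread across a 22-53 range rather than landing on one number.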
**Finding 3 — Every provider contributed unique value.** Claude Code produced the most executable protocol. ChatGPT contributed the most sophisticated statistical methodology. Claude Desktop designed the most discriminating test scenario. Claude Web proposed the most realistic multi-provider team configuration. Grok delivered the most honest limitations analysis. Even Gemini suggested JSON-LD for vault representation — an idea no other provider considered. The composite design is demonstrably better than any individual contribution.[^6]

**Finding 4 — The collection process reproduced the thesis.** The provenance crisis is not a footnote — it is the paper's most compelling evidence. Six AI agents, coordinated by a human researcher, using identical instructions but no output discipline, produced a coordination failure at the most basic level: file naming. The military solved this problem with orders production doctrine. The multi-agent AI community has not. This is the gap that the Command Vault doctrine layer fills.

---

## 11. COMMON GAPS — WHAT ALL SIX MISSED

No provider, regardless of score, addressed these issues:

1. **IRB/ethics review.** Human evaluators scoring AI outputs constitutes human subjects research at many institutions. None mentioned Institutional Review Board consideration.
2. **Exact prompt text.** All described what agents receive and do not receive. None provided the literal system prompts for Conditions A and B. A replicating researcher cannot execute without these.
3. **Cross-condition contamination.** If runs execute on the same API account, provider-side conversation history or preference learning creates contamination risk. None specified session isolation protocols.
4. **Output length confound.** Condition C produces longer outputs (more agents, more structured process). Evaluators may rate longer outputs higher regardless of content. None addressed length normalization.
5. **Observer effect for the meta-experiment.** All acknowledged the brief goes to multiple providers simultaneously. None addressed whether knowledge of the comparison affects output quality.
6. **Orchestration automation.** Only one provider mentioned a reproducibility package. None provided pseudocode or architecture for the system that actually runs the experiment.

These gaps represent the Round 2 requirements. The corrected Universal Mission Brief v2.0 — produced in the R1 After Action Review — addresses all six.[^7]

---

## 12. LIMITATIONS

**Single-analyst scoring.** One AI scored all outputs. Provider bias is possible and unmitigated in this round. Round 2 uses blinded human raters.

**No human baseline.** We compared AI outputs against each other, not against a human expert's experimental design. The "gold standard" is the brief's requirements, not an independent human product.

**Adapter instructions were not identical.** Each provider received platform-specific adapter text. While the Commander's Intent and task order were identical, the adapter instructions directed different tool usage. This is a feature (testing each platform's unique capabilities) and a limitation (not a pure apples-to-apples comparison).

**Round 1 is n=1 per provider.** Each AI produced one experimental design. We cannot distinguish systematic capability differences from run-to-run variance. A provider that scored 25/45 on this run might score 40/45 on the next. Round 2 should include multiple runs per provider.

**Word count correlates with score.** The Pearson correlation between output length and quality score is r = 0.982 (r² = 0.965, n = 6). Longer outputs scored higher. With only six data points, this correlation is descriptive, not inferential — but it signals a confound. The correlation may reflect genuine thoroughness (more words = more complete analysis) or evaluator bias toward volume (longer documents look more impressive).
Future scoring must control for length, either through normalization (quality per 1,000 words) or by including length as a covariate.

**Author-tool relationship.** This paper was written using Claude Code CLI — the same tool that scored highest in the comparison. The researcher's primary working environment is the Command Vault (an Obsidian vault running Claude Code as its primary AI interface). This relationship is disclosed because the paper's own honesty standard demands it.

**The thesis predicts the finding.** The researcher expected doctrine to matter. The exercise was designed by someone who believes in the doctrine layer. Confirmation bias is a risk at every level — in the brief design, the scoring, and this analysis.

---

## 13. WHAT HAPPENS NEXT

**Round 2** executes the corrected Universal Mission Brief v2.0, incorporating all 11 fixes from the R1 AAR: self-identifying filenames, mandatory YAML frontmatter, minimum word count, compliance checkpoints, adapter execution verification, and a standardized decision scenario. The Round 2 delta — comparing output quality and collection integrity between rounds — is itself additional data for Paper 6's thesis.[^8]

**The formal experiment** uses the composite design (Section 9) as the protocol. Thirty runs per condition across three conditions (90 total) provide 92.6% power to detect large effects (Cohen's f = 0.40). Estimated cost: $2,700-3,000 including API costs and human rater compensation. Timeline: seven weeks from setup to draft results.

**Paper 7** ("The MDMP Platform Blueprint," forthcoming) provides the operational architecture for automating the experimental protocol. When the doctrine layer moves from prompt-based instructions to a software platform with built-in MDMP phases, quality gates, and process monitoring, the formal experiment validates whether the platform delivers measurable improvement over prompt-only coordination.
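Round 2's output doctrine (the Section 7 filename convention and YAML frontmatter) can be enforced mechanically at collection time, before any scoring begins. A minimal sketch of such a provenance check — the regex, the required key set, and the deliberately naive frontmatter parsing are illustrative assumptions, not the brief's actual tooling:

```python
import re

# Section 7 doctrine: P6_ExpDesign_[YourAIName]_[YYYYMMDD].md
FILENAME_RE = re.compile(r"^P6_ExpDesign_[A-Za-z0-9-]+_\d{8}\.md$")
REQUIRED_KEYS = {"provider", "model", "surface", "date", "brief_version"}

def check_provenance(filename: str, text: str) -> list[str]:
    """Return a list of doctrine violations for one collected output."""
    problems = []
    if not FILENAME_RE.match(filename):
        problems.append(f"filename does not follow convention: {filename}")
    # Naive frontmatter scan: key: value lines between '---' fences.
    m = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not m:
        problems.append("missing YAML frontmatter block")
    else:
        keys = {line.split(":", 1)[0].strip()
                for line in m.group(1).splitlines() if ":" in line}
        missing = REQUIRED_KEYS - keys
        if missing:
            problems.append(f"frontmatter missing keys: {sorted(missing)}")
    return problems
```

Run against the staging directory before scoring: an empty violation list for every file means the analyst can trust provenance without the 60 minutes of forensics Section 7 describes.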
The bridge from Paper 6 to Paper 6b to Paper 7 is this: Paper 6 proved the concept (n=1). Paper 6b designed and stress-tested the protocol across six providers. Paper 7 builds the platform. The formal experiment validates it. --- ## CONCLUSION Six cats took the same test. Four passed with distinction. Two failed. The researcher collecting the test papers lost track of whose was whose — because nobody wrote their name on it. Every element of this exercise — the quality variance, the convergent methodology, the unique contributions, the provenance failure — points to the same conclusion that five papers before it reached: AI systems do not fail because they lack intelligence. They fail because they lack discipline. Structure does not limit capability. Structure aims it. The experimental protocol is ready. The composite design draws the strongest section from each provider, validated by independent convergence across four analytical engines. The measurement framework includes process capability metrics borrowed from manufacturing (Cp/Cpk), statistical process control borrowed from quality engineering (control charts), and fault injection testing borrowed from distributed systems (Byzantine tolerance). The scenario is a realistic healthcare acquisition with three embedded analytical faults and a 16-item ground truth answer key. What remains is execution — and a Round 2 that fixes the four lines of output doctrine this round forgot. Build the harness, not the answer. Do it slowly. Teach how to think.[^9] --- ## FOOTNOTES [^1]: The complete Universal Mission Brief v1.0 is preserved in the project archive at `0-PROJECTS/Herding-Cats-in-the-AI-Age/Experimental-Design/`. The same brief was provided to all six AI surfaces with only the adapter instructions varying. [^2]: Platform-specific adapters: Claude Code CLI was directed to use bash execution and scipy for statistical validation. 
ChatGPT was directed to use Code Interpreter for Monte Carlo simulations and artifact generation. Gemini Advanced was directed to use Google Search grounding for benchmark research and code execution for formula validation. Grok was directed to leverage real-time web search for the latest multi-agent literature. Claude Desktop Chat and Claude.ai Web received minimal adapters directing output as markdown files.

[^3]: Scoring conducted by Claude Opus 4.6 on the Claude.ai Desktop Chat surface. Full scoring notes, evidence citations, and per-section breakdowns are published in `Paper-6-Cross-Provider-Analysis.md` in the project directory. See Section 8 of this paper for scoring methodology limitations and Round 2 mitigations.

[^4]: ChatGPT generated supporting artifacts including `power_analysis_summary.csv`, `monte_carlo_effects.csv`, `control_chart_stable.png`, `control_chart_drift.png`, `cp_cpk_examples.csv`, and `dqs_rubric_template.json`. These artifacts are preserved in the `Experimental-Design/ChatGPT/` directory.

[^5]: Full cross-provider analysis: `0-PROJECTS/Herding-Cats-in-the-AI-Age/Paper-6-Cross-Provider-Analysis.md`, 251 lines, 12 sections including section-by-section scoring, a unique-contributions matrix, Commander's Intent compliance ranking, honesty ranking, a consolidated "best of six" recommendation, and the meta-finding on provenance failure.

[^6]: This mirrors the Paper 6 PoC finding: the ensemble surfaced 6 strategic insights the solo baseline missed, while the solo excelled at operational detail. Here, six providers surfaced methodological contributions that no single provider produced alone. The pattern holds: teams find what individuals miss — when given structure.

[^7]: The R1 After Action Review and corrected Universal Mission Brief v2.0 are at `0-PROJECTS/Herding-Cats-in-the-AI-Age/Paper-6-R1-AAR.md`. The AAR follows standard military format (Sustain/Improve/Fix) with 5 sustains, 4 improvements, and 11 fixes.
[^8]: The R1 AAR identified that the Round 2 delta — the improvement in collection quality between Round 1 (no output discipline) and Round 2 (with output discipline) — constitutes additional Paper 6 data. The before/after comparison directly measures the value of four lines of doctrine.

[^9]: Epigraph from the Paper 6 experimental design exercise. Origin: Claude.ai, responding to the researcher's frustration after the third iteration of a debugging cycle. Captured in `Paper-6-Epigraphs.md`.

---

## APPENDIX A: PROVIDER OUTPUT LOCATIONS

| Provider | File Path |
|---|---|
| Claude Code CLI | `Experimental-Design/Claude Code CLI/Command_Vault_Experiment_Design_CC.md` |
| ChatGPT (GPT-4o) | `Experimental-Design/ChatGPT/command_vault_experimental_design.md` |
| Claude Desktop Chat | `Experimental-Design/Claude Desktop Chat/06_Experimental_Design_Claude_Web.md` \* |
| Claude.ai Web | `Experimental-Design/Claude.ai (web)/Experimental design document with MDMP mission analysis.md` |
| Gemini Advanced | `Experimental-Design/Gemini/You said === GEMINI — ADAPTER INSTRUCTIONS === ....md` \*\* |
| Grok | `Experimental-Design/Grok/Grok.md` |

All paths relative to `0-PROJECTS/Herding-Cats-in-the-AI-Age/`.

\* The Claude Desktop Chat file carries "Claude_Web" in its filename — a provenance error from the original collection process. This is the exact failure Section 7 documents: without output naming standards, files misidentify their own producer. The filename is preserved as-is because renaming it would destroy the evidence.

\*\* Gemini's file is named with the adapter instructions the user pasted, not with a self-identifying filename — Gemini's web clipper captured the prompt, not a clean deliverable.
---

## APPENDIX B: SECTION-BY-SECTION SCORES

| Section | Claude Code | ChatGPT | Claude Desktop | Claude Web | Grok | Gemini |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| A — Research Questions & Hypotheses | 5 | 5 | 5 | 5 | 3 | 3 |
| B — Experimental Conditions | 5 | 5 | 5 | 5 | 3 | 3 |
| C — Decision Scenario | 5 | 5 | 5 | 5 | 2 | 2 |
| D — Measurement Framework | 5 | 5 | 5 | 5 | 2 | 2 |
| E — Statistical Analysis Plan | 5 | 5 | 5 | 4 | 3 | 3 |
| F — Failure Mode Testing | 5 | 5 | 5 | 4 | 3 | 2 |
| G — Vault Representation | 5 | 4 | 4 | 4 | 2 | 3 |
| H — Limitations & Honest Assessment | 5 | 5 | 5 | 5 | 4 | 2 |
| I — Execution Plan | 5 | 5 | 5 | 4 | 3 | 2 |
| **TOTAL** | **45** | **44** | **44** | **41** | **25** | **22** |

---

## Related

- Herding Cats in the AI Age (series home)
- Paper 6: When the Cats Form a Team