Process Tracing (rival-explanation adjudication)
When one thing has happened and several stories compete to explain why, the instinct is to tally evidence: pile up what supports each story and pick the side with the bigger pile. That rewards whichever explanation attracted the most loosely-relevant chatter. Process tracing refuses the tally. It weighs each piece of within-case evidence by its diagnosticity - its power to eliminate or confirm an explanation - so that one decisive observation outranks any amount of weak, ambiguous support. The durable move is to make each rival explanation concrete as a causal mechanism chain (the step-by-step way that story would have produced this outcome, and the observable fingerprints each step would leave), state those expected fingerprints before weighing the evidence, then type each piece of evidence by certainty (if the explanation is true, must we see this?) and uniqueness (could the rivals also produce it?). A single failed hoop test eliminates a rival no matter how much straw-in-the-wind support it had. The output is a rival-explanation evidence ledger: the rivals, each one’s mechanism chain, every evidence item typed per rival, the surviving explanation with its residual uncertainty, and the single most decisive observation still missing. It is explicitly not a cross-case generalization, not a consistency-scoring matrix, and not a manufactured winner when nothing available is diagnostic.
When to Use
- One outcome has occurred and there are genuinely rival stories about why it happened: an incident postmortem with three competing root-cause theories, a churn spike (a pricing change versus a competitor launch versus an onboarding regression), a lost deal, a metric anomaly, a contested past decision.
- Mechanism-level evidence is available or obtainable - logs, timestamps, documents, sequence of events, who knew what when - that could discriminate the rivals rather than merely decorate them.
- The argument has become a shouting match between narratives, and the useful reframing is “what would I expect to see if THIS story were true that the others would not produce?”
When NOT to Use
- Do not use it when there are no rivals on the table. With a single causal story there is nothing to discriminate. Descend the levels of that one story with think-iceberg-model, or decompose its coverage with think-issue-tree. Process tracing needs at least two genuine rival explanations.
- Do not use it for cross-case generalization. “Does X generally cause Y?” or “which combination of conditions produces success across our markets?” is comparative and configurational work over many cases. Process tracing’s jurisdiction is one case, N equals one. (That cross-case space is QCA’s territory, rejected in this library for fit.)
- Do not run it on an all-straw-in-the-wind evidence pool. When nothing available is diagnostic, running the ritual anyway produces false confidence. The honest output is “non-diagnostic - here is the observation that would discriminate,” never a manufactured winner. This is the central wall.
- Do not let it degenerate into an evidence-by-hypothesis tally matrix. Scoring every item against every hypothesis for consistency and picking the least-inconsistent is Analysis of Competing Hypotheses, whose controlled record with professional analysts is null-to-negative (see evidence/dossier.md). If there is no single case and no mechanism chain - just generic multi-hypothesis scoring - decline rather than becoming that matrix under another name. The value lives in the mechanism chains and the per-item typing, not in a tally.
- Do not assign test types after seeing the evidence. Grading a found item as a “smoking gun” post hoc inflates its diagnosticity and invites motivated grading. The expected fingerprints for each rival must be stated before the evidence is weighed.
Instructions
When asked to figure out which of several explanations actually caused a single outcome, follow these steps:
- State the focal outcome and the case. Name, in one line, the specific thing that happened and the single case it happened in. If the question is really cross-case (“does this generally happen?”), stop - this is the wrong tool.
- Surface the rival explanations. List the genuinely competing stories for why the outcome occurred. There must be at least two; if there is only one, stop and use a single-story diagnosis instead. Keep the rivals genuinely distinct, not relabelings of one story.
- Make each rival a mechanism chain. For each rival, write the step-by-step causal chain by which that story would have produced this outcome in this case. Each chain is the spine the evidence will be tested against.
- State the expected fingerprints per step, before weighing evidence. For each step in each chain, name the observable traces it would have left if it actually operated (logs, timestamps, documents, sequence, who knew what when). Write these down before grading any actual evidence - this is what blocks post-hoc inflation.
- Type each evidence item per rival. For each piece of evidence, ask the two questions against each rival: certainty (if this rival is true, must we see this?) and uniqueness (could the other rivals also produce it?). Classify the resulting diagnosticity:
- Hoop (certain, not unique): failing it eliminates the rival; passing keeps it alive without confirming it.
- Smoking gun (unique, not certain): finding it strongly confirms the rival; not finding it only mildly weakens it.
- Straw in the wind (neither): a weak nudge, never decisive.
- Doubly decisive (both): confirms one rival and eliminates the others - rare, and what the search aims at.
- Update each rival item by item. Let the decisive items do the work. A single failed hoop eliminates a rival regardless of how much straw-in-the-wind support it had. Track the running status of each rival (alive / eliminated / confirmed).
- Read the result honestly. Name the surviving explanation and its residual uncertainty. If every available item is straw-in-the-wind and nothing discriminates, say so: the output is “non-diagnostic,” not a winner.
- Name the single most decisive missing observation. State the one observation - ideally a hoop or doubly-decisive test - that would most cut the remaining uncertainty if it could be obtained next.
- Emit the rival-explanation evidence ledger per references/TEMPLATE.md: the outcome and case, the rivals with their mechanism chains and expected fingerprints, the typed evidence table, the surviving explanation with residual uncertainty, and the most decisive missing observation. Carry the evidence caveat through into the artifact.
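The update and honest-read steps above can be sketched in code. This is a minimal illustration, not the skill's implementation: the names `TypedItem`, `update_ledger`, and `read_result`, the status strings, and the rivals "X" and "Y" are all hypothetical. It assumes each item has already been typed against one rival, with its expected fingerprint stated beforehand.

```python
from dataclasses import dataclass


@dataclass
class TypedItem:
    """One evidence item, typed against one rival (hypothetical structure)."""
    description: str
    rival: str       # the rival this item was typed against
    test_type: str   # "hoop" | "smoking gun" | "straw in the wind" | "doubly decisive"
    found: bool      # was the expected fingerprint actually observed?


def update_ledger(rivals, items):
    """Update each rival item by item; decisive items do the work."""
    status = {r: "alive" for r in rivals}
    for item in items:
        if status[item.rival] == "eliminated":
            continue  # nothing revives an eliminated rival in this sketch
        certain = item.test_type in ("hoop", "doubly decisive")
        unique = item.test_type in ("smoking gun", "doubly decisive")
        if certain and not item.found:
            # A required fingerprint is missing: the rival is eliminated,
            # however much straw-in-the-wind support it has collected.
            status[item.rival] = "eliminated"
        elif unique and item.found:
            status[item.rival] = "confirmed"
            if certain:  # doubly decisive also eliminates the other rivals
                for other in rivals:
                    if other != item.rival:
                        status[other] = "eliminated"
        # Straws and passed hoops leave the status unchanged.
    return status


def read_result(rivals, items):
    """Return ("non-diagnostic", status) when nothing discriminated."""
    status = update_ledger(rivals, items)
    if all(s == "alive" for s in status.values()):
        return "non-diagnostic", status
    return "diagnostic", status


# Hypothetical run: two supporting straws do not save rival X from a failed hoop.
items = [
    TypedItem("weak support 1", "X", "straw in the wind", True),
    TypedItem("weak support 2", "X", "straw in the wind", True),
    TypedItem("required trace missing", "X", "hoop", False),
    TypedItem("required trace present", "Y", "hoop", True),
]
verdict, status = read_result(["X", "Y"], items)
print(verdict, status)
```

Note the design choice: there is no counter anywhere. Straw-in-the-wind items never change a status, so a pool made only of straws (or only of passed hoops) falls through to "non-diagnostic" rather than producing a winner by tally.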
Output Format
Use the template in references/TEMPLATE.md. The deliverable is the filled rival-explanation evidence ledger - the rivals, their mechanism chains, the per-item diagnosticity typing, the surviving explanation with residual uncertainty, and the most decisive missing observation - not a prose essay and not a tally. Never select a winner by counting supporting items; let the typed, decisive items decide. When nothing is diagnostic, the ledger says “non-diagnostic” and names the discriminating observation to seek.
Quality Checklist
Before finalizing, verify:
- The focal outcome and single case are stated in one line, and the question is genuinely within-case (not cross-case generalization).
- There are at least two genuinely rival explanations, each made concrete as a step-by-step causal mechanism chain.
- The expected observable fingerprints for each step are stated before any evidence is graded (no post-hoc test-type assignment).
- Each evidence item is typed per rival by certainty and uniqueness into hoop / smoking gun / straw-in-the-wind / doubly decisive.
- Rivals are updated item by item, with a single failed hoop eliminating a rival - not selected by counting supporting items.
- If the evidence pool is all straw-in-the-wind, the output is “non-diagnostic” with the discriminating observation named - not a manufactured winner.
- It has not become a consistency-matrix tally across hypotheses (that is ACH, declined here); the work is in the mechanism chains and the per-item typing.
- The surviving explanation is reported with its residual uncertainty, and the single most decisive missing observation is named.
- The output is the rival-explanation evidence ledger artifact, not prose.
- No overclaiming: the evidence is practitioner-grade methodology and transferred from human case-study research; claim a structured-adjudication aid, not a measured gain in reasoning accuracy (see evidence/dossier.md).
Evidence
Tier P (governing). Process tracing has a deep, peer-reviewed, actively self-critical methodological literature (Van Evera 1997; George and Bennett 2005; Collier 2011; Mahoney 2012; Bennett and Checkel 2015; Beach and Pedersen 2019; the Bayesian formalization by Fairfield and Charman 2017), but that literature concerns inferential validity in case-study research - whether and when this logic licenses causal conclusions - not controlled human-reasoning outcomes. There is no randomized or controlled trial testing whether using process tracing improves judgment accuracy, for humans or for agents; one external research run graded it S on methodological pedigree and that grade is rejected here, because pedigree is not outcome evidence. The only nearby controlled evidence is negative and belongs to the cousin method ACH (Mandel, Karvetski and Dhami 2018; Dhami, Belton and Mandel 2019); it attaches to ACH’s matrix-tally procedure, sets no tier here in either direction, and motivates the hard anti-ACH wall. All of it is transferred from human methodological contexts and not validated for AI-augmented use, which independently caps the grade at P. The skill ships as a structured single-case adjudication aid with a hard “non-diagnostic is a valid answer” wall, never as a measured accuracy improver. Full grading, sources, and caveats: evidence/dossier.md.
Examples
See references/EXAMPLE.md for a completed rival-explanation evidence ledger on a real decision.
Deep dive: worked example
A full worked run (the shared Northwind scenario)
Rival-Explanation Evidence Ledger - Worked Example
A completed run of the process-tracing skill on a real, consequential question. This is the quality bar a generated ledger should meet.
Uses the shared recurring scenario (Northwind, a B2B SaaS that launched a self-serve free tier) so examples across skills read as one coherent product. Where think-premortem imagines one specified failure of the launch before it ships, and think-scenario-planning builds alternative external worlds the launch might land in, this skill takes one outcome that has ALREADY happened - a sharp churn spike six weeks after launch - and adjudicates the rival explanations of why. See docs/internal/AUTHORING.md.
Evidence is weighed by diagnosticity, not by count. The winner is decided by the decisive typed items (a failed hoop, a smoking gun), not by which explanation collected the most supporting mentions. The expected fingerprints were written down before the evidence was graded.
Focal outcome and case
- Outcome: Six weeks after Northwind launched its self-serve free tier, paid-conversion-cohort retention dropped sharply - the week-4 retention of users who started on the free tier and upgraded fell from a steady 88% to 71% in a single cohort, and has stayed there.
- Case: The one cohort of free-tier-originated paid accounts that activated in the two weeks after launch (the “launch cohort”). N equals one - this is “why did THIS cohort’s retention crater?”, not “what drives B2B retention in general.”
- Within-case check: Single-case, backward-looking “why did this specific drop happen?” question. Confirmed in scope. (A cross-case “what generally causes retention loss across all our cohorts?” question would be the wrong tool.)
Rival explanations and their mechanism chains
Three genuinely competing stories surfaced in the postmortem. Each is made concrete as a mechanism chain, with the observable fingerprints stated before any evidence was graded.
Rival A: Free-tier dilution (the funnel changed who upgrades)
- Mechanism chain: free tier opens -> a flood of low-intent, low-fit users sign up -> some upgrade impulsively without a real use case -> they never reach the habit-forming workflow -> they churn fast.
- Expected fingerprints (stated first): the launch-cohort upgraders should look demographically and behaviorally different from prior upgraders (smaller accounts, fewer seats, lower pre-upgrade activation); churn should concentrate in the never-activated segment; the effect should be present from day one of the cohort, not triggered by a later event.
Rival B: Onboarding regression (a launch-day code change broke activation)
- Mechanism chain: the free-tier launch shipped alongside a rewritten onboarding flow -> a regression in that flow silently broke a key first-run step for a subset of users -> those users never complete setup -> unset-up accounts churn.
- Expected fingerprints: an error-rate or drop-off spike at a specific onboarding step beginning at the launch deploy timestamp; the broken step’s failure logs; churn concentrated among accounts that hit the broken step; retention of accounts created just BEFORE the deploy should be unaffected.
Rival C: Competitor launch (an external pull, not an internal break)
- Mechanism chain: a competitor launched an aggressive free offering in the same window -> Northwind’s newest, least-committed users are the most poachable -> they leave for the competitor -> the launch cohort churns.
- Expected fingerprints: the competitor’s launch dated inside the window; churned accounts citing or switching to the competitor (support notes, cancel-reason field, win/loss); the churn should hit competitor-overlapping segments hardest; timing of churn should track the competitor’s launch date, not Northwind’s deploy.
Evidence typed per rival
Each item typed against the rivals by certainty (must we see this if the rival is true?) and uniqueness (could the others produce it too?). The expected-fingerprint column was predicted before each find.
| Evidence item | Expected fingerprint (stated first) | Diagnosticity | Test type | Effect on rivals |
|---|---|---|---|---|
| Onboarding step-3 completion rate fell from 94% to 61% starting exactly at the launch deploy timestamp; error logs show a null-state crash for accounts without a seeded workspace | Rival B predicted a step-level drop-off spike at the deploy timestamp with failure logs | If B is true we MUST see a broken step (certain); a dilution funnel or a competitor would NOT break step-3 completion with a deploy-timed crash (unique) | Doubly decisive for Rival B | B confirmed; A and C cannot easily produce a deploy-timed step-3 crash |
| Churned launch-cohort accounts are 4x concentrated among users who hit the step-3 crash | B predicted churn concentrated among accounts that hit the broken step | Certain for B; A would predict churn concentrated in never-activated low-fit users instead | Hoop for B (B passes) | B stays alive and strengthened; weak against A |
| Launch-cohort upgraders look slightly smaller (median seats 4 vs 6) than prior upgraders | A predicted demographically different, smaller upgraders | Consistent with A but a free tier always shifts the mix somewhat; not unique, not certain | Straw in the wind for A | A nudged up slightly - not decisive |
| Accounts created in the 2 weeks BEFORE the deploy retained at the normal 88% | B predicted pre-deploy accounts unaffected; A predicted dilution should also affect any free-adjacent funnel, less time-bound | Certain test for B (pre-deploy must be normal if B); discriminates B from A’s day-one prediction | Hoop for A (A fails) | A weakened: A predicted a from-day-one funnel effect, but the break is sharply deploy-timed, not cohort-wide |
| Cancel-reason field and support notes: 2 of 41 churned accounts mention a competitor; no competitor free launch is dated in the window | C predicted competitor citations and a dated competitor launch | If C drove the spike we MUST see meaningful competitor signal and a dated launch (certain); near-absence fails the hoop | Hoop for C (C fails) | C eliminated |
Running status per rival
| Rival | Status after the typed evidence | What decided it |
|---|---|---|
| Rival A: free-tier dilution | alive but minor, not the driver | failed the “from day one” hoop (the break is deploy-timed, not cohort-wide); only straw-in-the-wind support |
| Rival B: onboarding regression | confirmed | the doubly-decisive deploy-timed step-3 crash plus a passed hoop on churn concentration |
| Rival C: competitor launch | eliminated | failed its hoop - no dated competitor launch, negligible competitor citations |
Surviving explanation and residual uncertainty
- Surviving explanation: The churn spike was caused chiefly by an onboarding regression - a null-state crash at step 3 that shipped with the launch deploy and silently broke first-run setup for accounts without a seeded workspace (Rival B).
- Residual uncertainty: Free-tier dilution (Rival A) is a real but secondary effect (slightly smaller, lower-fit upgraders) that the data cannot fully separate from the regression’s footprint; it would not, on its own, explain the sharp deploy-timed drop. How much of the residual 71%-to-88% gap closes once the crash is fixed is not yet known and is the test of this conclusion.
Most decisive missing observation
Fix the step-3 null-state crash and measure week-4 retention of the NEXT post-fix cohort. If retention returns to roughly 88%, that is a doubly-decisive confirmation of Rival B and bounds Rival A as minor. If it recovers only partway (say to 80%), the gap is the live measure of the dilution effect (Rival A), and the funnel-quality question becomes the next thing to work. This single forward observation discriminates the surviving explanation from its residual rival better than any further mining of the existing logs.
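The arithmetic behind that forward observation can be made explicit. The sketch below uses the ledger's own figures (88% baseline, 71% launch cohort) and works in whole percentage points; the function name `split_gap` is hypothetical. Reading the recovered share as the regression and the remainder as dilution is the ledger's interpretation, and it assumes nothing else changes between the launch cohort and the post-fix cohort.

```python
def split_gap(post_fix_pct, baseline_pct=88, launch_pct=71):
    """Split the retention gap (in percentage points) around a post-fix reading.

    baseline_pct and launch_pct come from the worked example; post_fix_pct is
    the not-yet-observed week-4 retention of the next post-fix cohort.
    """
    return {
        "total_gap": baseline_pct - launch_pct,   # the drop being explained
        "recovered": post_fix_pct - launch_pct,   # closes once the crash is fixed
        "remaining": baseline_pct - post_fix_pct, # left over after the fix
    }


full_recovery = split_gap(88)
partial_recovery = split_gap(80)
print(full_recovery)
print(partial_recovery)
```

With a reading of 80, 9 of the 17 points would be recovered and 8 would remain; with a reading of 88, nothing remains for the residual rival to explain.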
Evidence caveat (carried from the artifact)
This ledger is a structured single-case adjudication aid, not a measured accuracy improver. The governing evidence tier is P (practitioner): process tracing’s methodological literature establishes that the diagnosticity logic is valid, but no controlled trial shows that using it improves reasoning accuracy, for humans or agents, and the evidence is transferred from human case-study methodology (not agent-validated). The nearby controlled evidence is negative and belongs to the cousin method ACH; it sets no tier here. The conclusion above is the best-supported rival given the typed evidence, not a proof - which is why the most decisive missing observation is named and run next. See evidence/dossier.md.
Note how this differs from its neighbors on the same Northwind launch. The think-premortem example imagines one specified way the launch could fail BEFORE it ships and reasons back to causes. The think-scenario-planning example builds alternative external futures the launch might land in. This ledger does neither: the outcome has already happened, and the work is to adjudicate the genuinely rival explanations of why - by typing each piece of within-case evidence by its diagnosticity and letting the decisive items eliminate or confirm, not by tallying support. The deliverable is the surviving explanation with its residual uncertainty and the next decisive observation, not a forecast and not a risk list.
Grounding: the full evidence dossier
What the research does and does not show, with graded sources
Evidence Dossier: Process Tracing
The single source of truth for the process-tracing skill. The SKILL.md, the sidecar (skill.meta.yml), and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill. Reformatted from _local/proposed-builds/process-tracing/dossier.md and admitted as a Build at tier P (the candidate’s preliminary verdict is upheld; the external wave-3 S-on-pedigree grade is rejected and stays rejected).
| Field | Value |
|---|---|
| Skill | thinking-framework-skills.process-tracing (installable name think-process-tracing) |
| Family | systems-and-consequences |
| Evidence tier | P governing (deep peer-reviewed methodology, no controlled reasoning-outcome trial - see “What the evidence shows”) |
| Confidence | Moderate that per-item diagnosticity typing against rival mechanism chains discriminates causal stories better than an evidence tally; low that any reasoning-accuracy effect transfers to agents (there is no such study, for humans or agents) |
| Status | cand (the v0.7.0 phase-2 reconciliation upheld Build at P; built as a skill here) |
1. The mechanism (what actually does the work)
Process tracing adjudicates rival causal explanations of a single case by weighing each piece of within-case evidence by its diagnosticity - its power to eliminate or confirm an explanation - rather than by how much evidence piles up on each side. The move has two coupled steps.
First, make each rival explanation concrete as a causal mechanism: the step-by-step chain that explanation claims links cause to outcome in this case, and the observable fingerprints each step would have left if it actually operated (logs, timestamps, documents, who knew what when). The fingerprints are stated before the evidence is weighed.
Second, type each piece of evidence by two questions (Van Evera 1997): certainty (if the explanation is true, must we see this?) and uniqueness (could the rivals also produce it?). The combinations give the four classic tests:
- Hoop test (certain, not unique): an explanation that fails it is eliminated; passing keeps the explanation alive without confirming it.
- Smoking gun (unique, not certain): finding it strongly confirms the explanation; not finding it only mildly weakens it.
- Straw in the wind (neither): a weak nudge either way, never decisive.
- Doubly decisive (both): one observation that confirms an explanation and eliminates its rivals - rare, but what the search is aimed at.
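The two-question typing above reduces to a small lookup. A minimal sketch, with hypothetical names (`TestType`, `classify`); the two booleans are the analyst's answers to the certainty and uniqueness questions for one item against one rival:

```python
from enum import Enum


class TestType(Enum):
    HOOP = "hoop"
    SMOKING_GUN = "smoking gun"
    STRAW_IN_THE_WIND = "straw in the wind"
    DOUBLY_DECISIVE = "doubly decisive"


def classify(certain: bool, unique: bool) -> TestType:
    """Map the certainty and uniqueness answers onto the four classic tests.

    certain: if the explanation is true, must we see this evidence?
    unique:  would the rival explanations fail to produce it?
    """
    if certain and unique:
        return TestType.DOUBLY_DECISIVE
    if certain:
        return TestType.HOOP
    if unique:
        return TestType.SMOKING_GUN
    return TestType.STRAW_IN_THE_WIND


# Print the full two-by-two.
for certain in (True, False):
    for unique in (True, False):
        print(certain, unique, classify(certain, unique).value)
```

The function is deliberately trivial: the hard part is answering the two questions honestly, before looking at what was found.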
Belief in each rival is updated item by item, with the decisive items doing the work; a single failed hoop removes a rival no matter how much straw-in-the-wind support it has accumulated. The Bayesian formalization (Fairfield and Charman 2017) treats the four tests as limiting cases of likelihood-ratio reasoning: how much more probable is this observation under one explanation than under the others.
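The likelihood-ratio reading can be written as the odds form of Bayes' rule. The notation here is an assumption for illustration (rivals H_i and H_j, an observation E), not the notation of the cited paper:

```latex
\[
\underbrace{\frac{P(H_i \mid E)}{P(H_j \mid E)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(H_i)}{P(H_j)}}_{\text{prior odds}}
\times
\underbrace{\frac{P(E \mid H_i)}{P(E \mid H_j)}}_{\text{likelihood ratio}}
\]
```

On this reading, a hoop for H_i has P(E | H_i) close to 1, so failing to observe E carries a likelihood ratio P(not E | H_i) / P(not E | H_j) close to 0 and all but eliminates H_i. A smoking gun for H_i has P(E | H_j) close to 0, so observing E carries a very large ratio in favor of H_i. A straw in the wind has a ratio near 1 and barely moves the odds; a doubly decisive item has both properties at once.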
The durable cognitive move is eliminating or confirming rival causal explanations of one case by typing each piece of within-case evidence by its diagnosticity - its necessity and its sufficiency - against each rival’s implied mechanism chain, rather than by tallying how much evidence supports each side. Two things distinguish it from ordinary evidence-weighing: the object is a single case with genuinely rival stories about why it happened (not a cross-case generalization, not a single accepted story), and the operation is per-item diagnosticity typing with elimination (not a consistency tally across hypotheses).
The output is a rival-explanation evidence ledger: the rivals, each rival’s implied mechanism chain, each evidence item typed per rival (hoop / smoking gun / straw-in-the-wind / doubly decisive), the surviving explanation with its residual uncertainty, and the single most decisive observation still missing.
2. Lineage
The term entered social science from the cognitive psychology of decision making: Alexander George imported it for within-case analysis of decision processes (George and McKeown 1985, “Case Studies and Theories of Organizational Decision Making”), and George and Bennett (2005), Case Studies and Theory Development in the Social Sciences, made it the centerpiece of qualitative causal inference. Stephen Van Evera (1997) named the four tests (certainty by uniqueness). David Collier (2011) turned them into the standard teaching framework as a necessity-by-sufficiency two-by-two. James Mahoney (2012) formalized their logic set-theoretically. Bennett and Checkel (2015) codified best-practice criteria. Derek Beach and Rasmus Brun Pedersen (2013; 2nd ed. 2019) split the method into theory-testing, theory-building, and outcome-explaining variants and operationalized mechanism evidence. The Bayesian turn runs through Fairfield and Charman (2017; 2022), with Sherry Zaks’s “Updating Bayesian(s)” as the sharpest internal critique. The method has substantial applied uptake in program and impact evaluation (for example Befani and Mayne 2014 on combining process tracing with contribution analysis).
The terms “process tracing,” “hoop test,” “smoking gun,” “straw in the wind,” and “doubly decisive” are generic and descriptive within the methodological literature; the durable move is named for what it does, and the skill ships documented descriptively with the lineage credited here rather than branded. The attribution string credits Van Evera (the four tests), Collier, Beach and Pedersen, and the Bayesian formalization by Fairfield and Charman.
Start with Collier (2011) for the tests in an afternoon; read Beach and Pedersen (2019) to do it properly; read Fairfield and Charman (2017) to understand what the tests really are underneath.
3. What the evidence shows, and what it does NOT show
The honest grade is P (practitioner). The methodological literature is deep, peer-reviewed, and actively self-critical, but it concerns inferential validity in case-study research - whether and when this logic licenses causal conclusions - not controlled human-reasoning outcomes. No randomized or controlled trial tests whether using process tracing improves judgment accuracy, for humans or agents.
What the record supports. The logic of the four diagnosticity tests is rigorously worked out and formally grounded.
- Stephen Van Evera (1997), Guide to Methods for Students of Political Science (Cornell University Press). Coined the four-test typology (certainty by uniqueness). Methodological prescription; grade P.
- Alexander George and Andrew Bennett (2005), Case Studies and Theory Development in the Social Sciences (MIT Press). The canonical statement of process tracing as within-case inference on causal mechanisms. Methodology; grade P.
- David Collier (2011), “Understanding Process Tracing,” PS: Political Science and Politics 44(4), 823-830. Systematized the four tests as a necessity-by-sufficiency two-by-two with teaching exercises; the standard accessible exposition. Grade P.
- James Mahoney (2012), “The Logic of Process Tracing Tests in the Social Sciences,” Sociological Methods and Research 41(4), 570-597. Set-theoretic formalization of what each test can and cannot establish. Formal methodology; grade P.
- Andrew Bennett and Jeffrey T. Checkel, eds. (2015), Process Tracing: From Metaphor to Analytic Tool (Cambridge University Press). Best-practice criteria for rigorous application across domains. Grade P.
- Derek Beach and Rasmus Brun Pedersen (2013; 2nd ed. 2019), Process-Tracing Methods: Foundations and Guidelines (University of Michigan Press). Distinguishes theory-testing, theory-building, and outcome-explaining variants and operationalizes mechanism evidence. Grade P.
- Tasha Fairfield and Andrew Charman (2017), “Explicit Bayesian Analysis for Process Tracing,” Political Analysis 25(3), 363-380. Formalizes the method as explicit Bayesian updating, and doubles as an internal critique: outside deductive limiting cases the four-test classification does not always sensibly classify evidence, so reason in likelihood ratios. Their 2022 book (Social Inquiry and Bayesian Inference, Cambridge) extends this; Sherry Zaks’s critical evaluation of Bayesian process tracing contests parts of the practice. Formal methodology plus live debate; grade P.
What the record does NOT support. Any claim that the procedure measurably improves reasoning outcomes. There is no randomized or controlled human study, and none for agents. One wave-3 external research run graded process tracing S on methodological pedigree; that grade was rejected in the registry adjudication and stays rejected here - pedigree is not outcome evidence. The governing grade is P, and it is honest: the literature establishes that the logic is valid, not that running the procedure makes a reasoner more accurate.
4. Transferred-evidence flag (required honesty for this library)
There is no on-target evidence in either direction. The only nearby controlled evidence is negative, and it belongs to a cousin method, not to this one.
The cousin is Analysis of Competing Hypotheses (ACH), which scores every evidence item against every hypothesis in a consistency matrix and selects the least-inconsistent by tally. A randomized study with professional intelligence analysts (Dhami, Belton and Mandel 2019, Applied Cognitive Psychology) found that trained analysts skipped steps and that the effects were mixed-to-negative, and Mandel, Karvetski and Dhami (2018, Judgment and Decision Making 13(6)) found that ACH failed to improve analysts’ probabilistic judgments, with slightly worse coherence and accuracy than no method. That record attaches to ACH’s matrix-tally procedure - a different operand and a different operation - and per this library’s rules transferred evidence sets no tier in either direction. It is recorded here as a documented caution for the whole structured rival-hypothesis genre and as the reason the built skill enforces a hard anti-ACH wall (no consistency matrix, no least-inconsistent tally; the move lives in the mechanism chains and the per-item necessity/sufficiency typing). The ACH X does not transfer onto process tracing’s P, and process tracing’s P does not rescue ACH.
All of the methodological evidence in section 3 is on human case-study research practice; none studies a process-tracing ledger produced by or with an AI agent. The evidence is transferred from human methodological contexts and not validated for AI-augmented use, which independently caps the grade at P. The AI value is mechanical and modest: an agent makes the discipline cheap to run (state each rival’s mechanism chain and its expected fingerprints before weighing evidence, type each item by certainty and uniqueness, let a single failed hoop eliminate a rival) and produces a durable, inspectable ledger - benefits that do not depend on any contested outcome claim.
Name collision, not evidence. In judgment-and-decision-making psychology, “process tracing” names laboratory data-collection methods (eye tracking, information boards, think-aloud protocols) for studying how people decide (see the handbook edited by Schulte-Mecklenbeck, Kuhberger and Ranyard). That literature is a homonym and lends this method no support.
5. When it works / when it fails (drives the eval negative cases and “When NOT to Use”)
Works best when:
- There is exactly one case with one outcome, genuinely rival stories about why that outcome occurred, and mechanism-level evidence available to discriminate them: an incident postmortem with three competing root-cause theories, a churn spike with rival explanations (a pricing change versus a competitor launch versus an onboarding regression), a lost deal, a metric anomaly, a contested historical decision.
- The question is “what would I expect to see if THIS story were true that the others would not produce?” - converting a shouting match between narratives into a search for discriminating observations.
Fails or misleads when (poor-fit / anti-patterns):
- There are no rivals on the table. With a single causal story, there is nothing to discriminate; use a level-descent diagnosis (iceberg-model) or a coverage decomposition (issue-tree) instead.
- The question is cross-case (“does X generally cause Y?”, “which combination of conditions produces success across our markets?”). That is the comparative and configurational space (QCA’s territory, rejected here for fit); process tracing’s jurisdiction is one case, N equals one.
- The evidence pool is all straw-in-the-wind. When nothing available is diagnostic, running the ritual anyway produces false confidence; the honest output is “non-diagnostic - here is the observation that would discriminate,” not a manufactured winner.
- It degenerates into a generic evidence-by-hypothesis tally matrix. Scoring every item against every hypothesis for consistency and picking the least-inconsistent is Analysis of Competing Hypotheses, whose controlled record with professional analysts is null to negative. Process tracing’s value lives in the mechanism chains and the per-item necessity/sufficiency typing, not in a tally. If there is no single case and no mechanism chain, decline rather than becoming an ACH matrix under another name.
- Test types are assigned after seeing the evidence. Grading a found item as a “smoking gun” post hoc inflates its diagnosticity; the typology invites motivated grading unless the expected fingerprints are stated before the evidence is weighed (the caution in Fairfield and Charman 2017, and the thrust of Zaks’s critique).
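Two of the failure modes above (an all-straw evidence pool, and test types assigned after the fact) lend themselves to mechanical guards. The following is a minimal sketch under stated assumptions: `Expectation`, `retype`, `record` and `verdict` are hypothetical names introduced for illustration, and "adjudicate" is simply a placeholder for "proceed to the elimination step".

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Expectation:
    """One expected fingerprint for one rival, typed before the evidence is weighed."""

    rival: str
    fingerprint: str
    certain: bool  # if the rival is true, must we see this?
    unique: bool  # would the other rivals fail to produce it?
    observed: Optional[bool] = None  # None until the evidence has been checked

    def retype(self, certain: bool, unique: bool) -> None:
        # Guard against post hoc grading: typing is frozen once the finding is known.
        if self.observed is not None:
            raise ValueError("cannot retype after the evidence has been seen")
        self.certain, self.unique = certain, unique

    def record(self, observed: bool) -> None:
        self.observed = observed


def verdict(expectations: list[Expectation]) -> str:
    """Return "non-diagnostic" when every item is straw-in-the-wind."""
    if not any(e.certain or e.unique for e in expectations):
        return "non-diagnostic"  # name the discriminating observation instead of a winner
    return "adjudicate"
```

The design choice worth noting is the ordering constraint: `retype` is allowed freely while `observed` is still `None`, and refused afterwards, so an item cannot be promoted to "smoking gun" once it is known to have been found.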
6. Output artifact
The skill must emit a rival-explanation evidence ledger, not prose: the focal outcome and case; the rival explanations, each made concrete as a causal mechanism chain with the observable fingerprints each step would leave; every evidence item typed per rival (hoop / smoking gun / straw-in-the-wind / doubly decisive), with its expected fingerprint stated before the evidence is examined; the running elimination/confirmation per rival; the surviving explanation with its residual uncertainty; and the single most decisive observation still missing. When the available evidence is all non-diagnostic, the honest ledger says “non-diagnostic” and names the discriminating observation to seek, rather than declaring a winner.
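One possible shape for that ledger is sketched below. The field names, the status strings, and the "unresolved" fallback for zero or several surviving rivals are all assumptions made for illustration; the source specifies the ledger's contents, not a schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Rival:
    name: str
    # Mechanism chain as (step, expected fingerprint) pairs, written before weighing evidence.
    chain: list[tuple[str, str]]
    status: str = "open"  # "open", "eliminated", or "confirmed"


@dataclass
class EvidenceItem:
    description: str
    # Test type per rival name: "hoop", "smoking gun", "straw-in-the-wind", "doubly decisive".
    typing: dict[str, str]


@dataclass
class Ledger:
    outcome: str
    rivals: list[Rival]
    evidence: list[EvidenceItem] = field(default_factory=list)
    residual_uncertainty: str = ""
    missing_observation: str = ""

    def survivor(self) -> str:
        diagnostic = any(
            t != "straw-in-the-wind" for e in self.evidence for t in e.typing.values()
        )
        if not diagnostic:
            return "non-diagnostic"  # never manufacture a winner from straws
        alive = [r.name for r in self.rivals if r.status != "eliminated"]
        return alive[0] if len(alive) == 1 else "unresolved"

    def render(self) -> str:
        lines = [f"Outcome: {self.outcome}"]
        for r in self.rivals:
            lines.append(f"Rival: {r.name} [{r.status}]")
            lines += [f"  step: {step} | fingerprint: {fp}" for step, fp in r.chain]
        for e in self.evidence:
            typed = "; ".join(f"{name}: {kind}" for name, kind in e.typing.items())
            lines.append(f"Evidence: {e.description} ({typed})")
        lines.append(f"Surviving explanation: {self.survivor()}")
        lines.append(f"Residual uncertainty: {self.residual_uncertainty}")
        lines.append(f"Most decisive missing observation: {self.missing_observation}")
        return "\n".join(lines)
```

Calling `render()` yields a plain-text ledger with one line per rival, chain step and evidence item, followed by the survivor, the residual uncertainty and the missing observation, so the last three fields are always present even when the answer is "non-diagnostic".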
7. Sources
- Stephen Van Evera, Guide to Methods for Students of Political Science (Cornell University Press, 1997). Coined the four-test typology (certainty by uniqueness). Methodological prescription. (P)
- Alexander L. George and Andrew Bennett, Case Studies and Theory Development in the Social Sciences (MIT Press, 2005). The canonical statement of process tracing as within-case inference on causal mechanisms. (P)
- David Collier, “Understanding Process Tracing,” PS: Political Science and Politics 44(4):823-830 (2011). Systematizes the four tests as a necessity-by-sufficiency two-by-two; the standard accessible exposition. (P)
- James Mahoney, “The Logic of Process Tracing Tests in the Social Sciences,” Sociological Methods and Research 41(4):570-597 (2012). Set-theoretic formalization of what each test can and cannot establish. (P)
- Andrew Bennett and Jeffrey T. Checkel, eds., Process Tracing: From Metaphor to Analytic Tool (Cambridge University Press, 2015). Best-practice criteria across domains. (P)
- Derek Beach and Rasmus Brun Pedersen, Process-Tracing Methods: Foundations and Guidelines, 2nd ed. (University of Michigan Press, 2019). Distinguishes theory-testing, theory-building, and outcome-explaining variants. (P)
- Tasha Fairfield and Andrew Charman, “Explicit Bayesian Analysis for Process Tracing,” Political Analysis 25(3):363-380 (2017). Formalizes the four tests as likelihood-ratio limiting cases, and is itself an internal critique of mechanical test-typing. (P)
- Barbara Befani and John Mayne, “Process Tracing and Contribution Analysis: A Combined Approach to Generative Causal Inference for Impact Evaluation,” IDS Bulletin 45(6) (2014). Applied uptake in program and impact evaluation. (Practitioner application.)
Adjacent evidence, transferred, sets no tier here: Mandel, Karvetski and Dhami, “Boosting intelligence analysts’ judgment accuracy: What works, what fails?,” Judgment and Decision Making 13(6) (2018), and Dhami, Belton and Mandel (2019), Applied Cognitive Psychology. These are the controlled, randomized, null-to-negative results on ACH’s matrix-tally procedure. They attach to ACH’s entry (a different operand and operation), not to process tracing, and motivate the anti-ACH wall here without setting or laundering a grade in either direction.
Excluded on the evidence rule: no decision-accuracy or reasoning-improvement effect size for process tracing is asserted as fact in this dossier, because no such controlled study exists. The external wave-3 S-on-pedigree grade is recorded as rejected: a deep methodological pedigree is not outcome evidence.