
Argumentation Schemes with Critical Questions

Most everyday arguments are not deductive proofs. They are defeasible, presumptive moves: an expert says X, so presumably X; this case is like that case, so presumably the same verdict; doing A leads to bad consequence B, so presumably do not do A. Douglas Walton’s insight is that these arguments come in a finite set of stereotyped patterns, that each pattern is legitimate (not a fallacy) when its conditions hold, and that each pattern has its own characteristic ways of failing. The durable move is classify-then-probe-with-keyed-defeaters: first identify which stereotyped scheme an argument instantiates, then interrogate it with the standard critical questions keyed to that scheme. Two things make this more than generic objection-raising. The defeaters are retrieved, not improvised - each scheme’s question set encodes the accumulated knowledge of how that specific pattern fails, so coverage of the standard vulnerabilities does not depend on what occurs to the evaluator in the moment. And the semantics are presumptive - the output is not “valid or invalid” but a burden-of-proof ledger: which questions were answered, which remain open, and whether the presumption survives. The output is a scheme critique sheet, not prose.

  • A single, usually short, defeasible argument has to be judged for whether its presumption deserves acceptance: a recommendation resting on an authority’s say-so, an analogy doing load-bearing work, a slippery-slope objection, a “users are asking for it” popular-practice appeal, or a consequence-based case for or against an action.
  • An argument map would be overkill. The scheme method is strongest where a map is weakest: a one-premise pattern argument (“the analyst report says the market is contracting, so we should not enter”) maps to a trivial two-node tree, while the scheme method immediately yields the standard expert-opinion probes.
  • Critique needs discipline in both directions - to block naive acceptance (“an expert said it”) and naive dismissal (“appeal to authority, ignored”). These patterns are not automatic fallacies.
  • Do not use it on structurally complex, multi-premise arguments. When the question is how the whole argument hangs together and where its weakest links are, that is think-argument-mapping’s job. The scheme method evaluates one typed inference at a time and has no view of overall structure. This is the central routing wall.
  • Do not force a deductive or statistical argument into a scheme. The schemes formalize presumptive reasoning; a mathematical proof or a regression result is not an instance of any of them, and forcing one in degrades the analysis. Route that material out.
  • Do not skip stating the classification, and do not trust a confident mis-type. Every downstream critical question is keyed to the scheme, so a wrong match (reading an argument from sign as an argument from cause) produces a confident interrogation of the wrong vulnerabilities. State the scheme explicitly, name the runner-up scheme, and flag a low-confidence match. Even the five most common schemes are machine-separable at only 63-91% one-against-others accuracy (Feng and Hirst 2011), so mis-typing is a live risk, not a rare one.
  • Do not run it as checklist theater. Walking the critical questions and recording shallow answers produces the appearance of scrutiny. The presumption verdict is only as good as the honesty of the answers; an answered checklist is not a soundness proof.
  • Do not treat naming the scheme as the verdict. “Appeal to authority” is a classification, not a refutation. Naming the scheme is the beginning of evaluation, not its end.
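
The routing rules in the list above can be sketched as a small gate function. This is a minimal illustration, not part of the skill: `ArgumentProfile` and `route` are hypothetical names, and whether an argument is multi-premise, deductive, or presumptive is a judgment the code takes as input rather than computes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgumentProfile:
    """Hypothetical summary of an argument, filled in by the evaluator."""
    multi_premise: bool             # structurally complex, many linked premises
    deductive_or_statistical: bool  # a proof or a regression-style result
    presumptive: bool               # a defeasible "so presumably..." inference


def route(profile: ArgumentProfile) -> str:
    """Return where the argument should go, following the routing walls."""
    if profile.deductive_or_statistical:
        # Not an instance of any scheme: route out rather than force a match.
        return "route-out"
    if profile.multi_premise:
        # Whole-argument structure is the mapping skill's job.
        return "think-argument-mapping"
    if profile.presumptive:
        return "scheme-critique"
    return "route-out"


analyst_note = ArgumentProfile(
    multi_premise=False, deductive_or_statistical=False, presumptive=True
)
print(route(analyst_note))  # scheme-critique
```

The gate only decides whether the scheme method applies; it says nothing about which scheme the argument instantiates.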

When asked to evaluate a single defeasible argument - an authority appeal, an analogy, a slippery slope, a consequence case, or similar - follow these steps:

  1. Check the gate. Confirm the argument is a single, short, presumptive inference. If it is structurally complex and multi-premise, route to think-argument-mapping and stop. If it is a deductive proof or a statistical result, say so and stop - it is not an instance of any scheme.
  2. Restate the argument. Extract its conclusion and its stated premises in plain form. This is the object the rest of the procedure works on.
  3. Classify the scheme - explicitly and contestably. Match the argument against the scheme catalog (appeal to expert opinion, argument from analogy, argument from sign, argument from cause to effect, argument from consequences, argument from popular opinion or practice, slippery slope, practical reasoning, and so on). Name the scheme, name the most plausible runner-up, and flag the match confidence. Do not skip this - every later step depends on it.
  4. Instantiate the premise slots. Fill the chosen scheme’s premise template against the argument. This mechanically exposes the implicit premises the pattern requires (for expert opinion: E is an expert, in the relevant field, E actually asserted X, E is credible and unbiased, X is consistent with other experts and the evidence). Name the implicit premises the argument left unstated.
  5. Put the scheme’s keyed critical questions to the argument. Retrieve the standard critical-question battery for the chosen scheme and answer each: answered (the argument or context discharges it), open (unaddressed), or defeated (a question shifts a burden the argument cannot meet). Record who carries the burden for each open question.
  6. Render the presumption verdict. State whether the presumption STANDS (critical questions answered or dischargeable), STANDS-PENDING (survives but with named open questions to discharge), or FALLS (a question shifts a burden that goes unmet). Name the single binding open question that most controls the verdict.
  7. Emit the scheme critique sheet per references/TEMPLATE.md: the restated argument, the named scheme with its runner-up and confidence, the instantiated premise slots with the implicit premises, the keyed critical questions with answer status and burden, and the presumption verdict with the binding open question. Carry the pre-printed evidence caveat into the sheet. Never present an answered checklist as a soundness proof, and never present the scheme name as a refutation.
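
Steps 5 and 6 can be sketched as a status ledger plus a verdict rule. This is a simplified illustration under stated assumptions: `Status` and `presumption_verdict` are hypothetical names, and the mechanical rule below stands in for what is in practice a judgment about whether each answer is honest and sufficient.

```python
from enum import Enum


class Status(Enum):
    ANSWERED = "answered"  # the argument or context discharges the question
    OPEN = "open"          # unaddressed
    DEFEATED = "defeated"  # the question shifts a burden that goes unmet


def presumption_verdict(ledger: dict[str, Status]) -> str:
    """Map critical-question statuses to a presumption verdict."""
    if not ledger:
        # No questions put means no evaluation happened; refuse to rule.
        raise ValueError("the keyed critical questions were never put")
    statuses = list(ledger.values())
    if any(s is Status.DEFEATED for s in statuses):
        return "FALLS"
    if any(s is Status.OPEN for s in statuses):
        return "STANDS-PENDING"
    return "STANDS"


ledger = {"CQ1": Status.ANSWERED, "CQ2": Status.OPEN}
print(presumption_verdict(ledger))  # STANDS-PENDING
```

A `STANDS` result here means only that every question has a recorded answer; it is not a soundness proof.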

Use the template in references/TEMPLATE.md. The deliverable is the filled scheme critique sheet - the restated argument, the contestable scheme classification, the instantiated premise slots, the keyed critical questions with answer status and burden, and the presumption verdict with its binding open question - not a prose essay. The verdict is a presumptive, burden-of-proof read, never a verdict of valid or invalid.

Before finalizing, verify:

  • The gate held: the argument is a single short presumptive inference, not a multi-premise structure (which routes to think-argument-mapping) and not a deductive or statistical proof.
  • The argument is restated as a conclusion plus stated premises.
  • The scheme is named explicitly, the runner-up scheme is named, and the match confidence is flagged - the classification is contestable, not hidden.
  • The scheme’s premise slots are instantiated, and the implicit premises the pattern requires are surfaced.
  • Each keyed critical question for that scheme is answered with a status (answered / open / defeated) and a burden note - not skipped, not given a shallow rubber-stamp answer.
  • A presumption verdict (stands / stands-pending / falls) is rendered, with the single binding open question named.
  • The scheme name is not presented as a refutation, and the answered checklist is not presented as a soundness proof.
  • No overclaiming: the verdict is a presumptive burden-of-proof read at tier P on transferred human-subjects evidence; it is not a soundness proof and not a measured gain in evaluation accuracy (see evidence/dossier.md).

Tier P (governing). The method rests on a 30-year theoretical literature (Walton 1996; Walton, Reed and Macagno 2008), a formal AI-and-law adoption line that models critical questions as typed premises carrying burden of proof (Gordon, Prakken and Walton 2007), software embodiments (Reed and Rowe 2004), an annotated corpus with an honest measure of how confusable the scheme types are (Feng and Hirst 2011: 63-91% one-against-others), and an active LLM benchmark showing the keyed-question apparatus is not already free in a plain-prompted model (Calvo Figueras and Agerri 2025: top system 67.6). There are two positive controlled classroom studies, but they measure weeks of scheme-and-critical-question INSTRUCTION on student writing and discussion (Song and Ferretti 2013, the cleanest; Nussbaum and Edwards 2011) - an adjacent claim to “applying the schemes once improves an evaluation,” and human-subjects only. No study measures single-application evaluation accuracy, and nothing is validated on AI agents; the transfer is an explicit, untested assumption. Per this library’s conservative rule the governing grade is P, not M. The skill ships as an argument-evaluation aid with hard walls, never as a soundness proof. Full grading, sources, and caveats: evidence/dossier.md.

See references/EXAMPLE.md for a completed scheme critique sheet on a real decision.

A full worked run (the shared Northwind scenario)

A completed run of the walton-argumentation-schemes skill on one short, load-bearing argument inside a real decision. This is the quality bar a generated scheme critique sheet should meet.

Uses the shared recurring scenario (Northwind, a B2B SaaS weighing a self-serve free-tier launch) so examples across skills read as one coherent product. Where think-scenario-planning builds the alternative external worlds Northwind’s free-tier bet must survive, this skill zooms all the way in: it takes ONE presumptive argument circulating in the room - an analyst’s say-so that has started to drive the call - and asks whether its presumption survives the standard defeaters for its pattern. See docs/internal/AUTHORING.md.

Evidence caveat (ships with every sheet, by construction). This verdict is a presumptive, burden-of-proof read, not a soundness proof. The method is graded tier P (practitioner) on this library’s conservative rule: the supporting evidence is a 30-year theoretical literature, a formal adoption line, and two controlled CLASSROOM-INSTRUCTION studies (Song and Ferretti 2013; Nussbaum and Edwards 2011) - an adjacent claim measured on human students over weeks of teaching, not on a single application of the method, and not on AI agents. The evidence is transferred from human contexts and untested for agent use. An answered checklist is not proof; naming the scheme is not a refutation. Treat the verdict as a disciplined defeater-coverage read, never a guarantee.


  • Single, short, presumptive argument? Yes. The argument in the room is a one-step inference from an authority’s say-so. It is not a multi-premise structure (no whole-argument tree to lay out, so not think-argument-mapping) and not a deductive or statistical proof. It is a textbook presumptive move, so the scheme method fits.

The context: Northwind’s leadership is debating whether to commit to a self-serve free tier as the primary growth motion. A circulated note from a respected industry analyst has started to settle the question. The note’s argument, isolated:

  • Conclusion: Northwind should not launch a self-serve free tier.
  • Stated premises: A senior analyst at a well-known industry research firm published a note concluding that free tiers no longer convert at viable rates in B2B SaaS, and that “the free-tier era is over.” Therefore Northwind should not launch one.

2. Scheme classification (stated and contestable)

  • Scheme: Appeal to expert opinion (argument from authority). The argument’s whole force is “an expert in the field asserted it, so presumably accept it.”
  • Runner-up scheme: Argument from popular practice (“the market is moving away from free tiers, so we should too”). The note gestures at a trend, so a reading as “everyone is abandoning free tiers” is available - but the load-bearing premise as stated is the analyst’s authority, not the count of firms, so appeal to expert opinion is the better fit.
  • Match confidence: High. The note is explicitly “an analyst says,” and the conclusion rides on that say-so. (The runner-up is named precisely because if the team were actually leaning on the trend data rather than the analyst, the keyed questions would shift to the popular-practice battery - so the classification is left contestable.)

The appeal-to-expert-opinion premise template, filled against the argument.

| Premise slot (required by the scheme) | Instantiated against this argument | Stated or implicit? |
| --- | --- | --- |
| Source E is an expert | The author is a senior analyst at a recognized research firm | Stated |
| E is an expert in the relevant field F | The relevant field is Northwind’s specific segment and buyer, not B2B SaaS in general | Implicit - the note speaks to B2B SaaS broadly, not to Northwind’s segment |
| E actually asserted proposition A | The note asserts “free tiers no longer convert at viable rates” and “the era is over” | Stated (but A is a general claim, not “Northwind specifically should not”) |
| A falls within E’s domain of competence | Free-tier conversion economics is within an industry analyst’s competence | Stated, plausibly |
| E is trustworthy and unbiased | Unaddressed - the firm’s funding model and the analyst’s incentives are not examined | Implicit / missing |
| A is consistent with what other experts say | Unaddressed - no second source is cited or compared | Implicit / missing |
| A is consistent with the available evidence | Unaddressed - no conversion data from comparable companies is given | Implicit / missing |

  • Implicit premises the pattern requires: that the analyst’s general-market claim transfers to Northwind’s specific segment; that the analyst is unbiased; that other experts and the underlying data agree. The argument as stated supplies none of these.
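
The slot-filling step above can be sketched as data plus two filters. This is an illustrative encoding of the table, not a format the skill prescribes: `Slot`, `unstated`, and `unaddressed` are hypothetical names, and the slot wording is abbreviated from the rows above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slot:
    requirement: str          # what the scheme requires
    filled_by: Optional[str]  # None when the argument leaves it unaddressed
    stated: bool              # True only if the argument says it outright


slots = [
    Slot("E is an expert", "senior analyst at a recognized research firm", True),
    Slot("E is an expert in the relevant field F",
         "Northwind's specific segment, not B2B SaaS in general", False),
    Slot("E actually asserted A", "free tiers no longer convert at viable rates", True),
    Slot("A falls within E's domain of competence",
         "free-tier conversion economics", True),
    Slot("E is trustworthy and unbiased", None, False),
    Slot("A is consistent with what other experts say", None, False),
    Slot("A is consistent with the available evidence", None, False),
]


def unstated(all_slots: list[Slot]) -> list[str]:
    """Premises the pattern requires but the argument never states."""
    return [s.requirement for s in all_slots if not s.stated]


def unaddressed(all_slots: list[Slot]) -> list[str]:
    """Premises with nothing at all offered in their support."""
    return [s.requirement for s in all_slots if s.filled_by is None]


print(len(unstated(slots)), "implicit;", len(unaddressed(slots)), "unaddressed")
# 4 implicit; 3 unaddressed
```

Separating "unstated" from "unaddressed" keeps the field-fit premise visible: the note does speak to a field, just not to Northwind's.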

The standard critical-question battery for appeal to expert opinion (the six Walton CQs), each answered.

| # | Critical question (keyed to the scheme) | Status | Who carries the burden / note |
| --- | --- | --- | --- |
| CQ1 - Expertise | How credible is E as an expert? | Answered | Genuinely a recognized analyst; expertise is not the weak point. |
| CQ2 - Field | Is E an expert in the field that A is in? | Open | E speaks to B2B SaaS broadly; the live question is Northwind’s specific segment and buyer, where free-tier economics can differ sharply. The proponent of the conclusion must show the general claim transfers. |
| CQ3 - Opinion | What did E actually assert, and does it imply A? | Open | E asserted a general market claim (“the era is over”), not “Northwind should not launch.” The leap from the general to Northwind’s specific case is unstated and is doing the real work. |
| CQ4 - Trustworthiness | Is E personally reliable and unbiased? | Open | The firm’s funding model and any vendor relationships were not examined. Not a defeat, but an unmet burden. |
| CQ5 - Consistency | Is A consistent with what other experts say? | Open | No second expert or contrary view was sought; several practitioner sources argue the opposite (free tiers still work for product-led bottom-up motions). |
| CQ6 - Backing evidence | Is A based on evidence? | Open | The note gives a verdict, not the conversion data behind it. The presumption cannot be discharged without seeing it. |

  • Verdict: STANDS-PENDING, leaning toward FALLS. The appeal creates a real presumption (CQ1 expertise is satisfied), but five of the six critical questions are open, and two of them - CQ2 (field fit to Northwind’s segment) and CQ3 (the unstated jump from a general claim to Northwind’s specific case) - shift a burden the note does not meet. The argument is not refuted, but it is nowhere near strong enough to settle the decision on its own.
  • Binding open question: CQ2/CQ3 together - does the analyst’s general-market claim actually transfer to Northwind’s specific segment and buyer? Until that is shown, the say-so is about a different population than the one being decided on.
  • What would change the verdict: segment-specific conversion data for companies like Northwind (discharges CQ2, CQ3, CQ6), plus a second independent expert view (discharges CQ5) and a check on the firm’s incentives (discharges CQ4). With those, the presumption either firms up or clearly falls - on evidence rather than on authority.
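
The "what would change the verdict" item can be read as a mapping from evidence to the questions it would address. The sketch below encodes that mapping with plain sets; `discharges` and `remaining_open` are hypothetical names, and "discharged" here means only that the question gets addressed, not that the answer favors the analyst.

```python
# Questions left open in the ledger above (CQ1 is already answered).
open_questions = {"CQ2", "CQ3", "CQ4", "CQ5", "CQ6"}

# Evidence item -> critical questions it would address, as listed in the text.
discharges = {
    "segment-specific conversion data": {"CQ2", "CQ3", "CQ6"},
    "second independent expert view": {"CQ5"},
    "check on the firm's incentives": {"CQ4"},
}


def remaining_open(open_qs: set[str], evidence: list[str]) -> set[str]:
    """Return the questions still open after the given evidence is gathered."""
    remaining = set(open_qs)
    for item in evidence:
        remaining -= discharges[item]
    return remaining


after_data = remaining_open(open_questions, ["segment-specific conversion data"])
print(sorted(after_data))  # ['CQ4', 'CQ5']

after_all = remaining_open(open_questions, list(discharges))
print(sorted(after_all))   # []
```

The segment-specific data addresses three of the five open questions on its own, including the binding CQ2/CQ3 pair.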

Note how this differs from its neighbors on the same Northwind decision. think-scenario-planning builds four external worlds the free-tier bet must survive; think-argument-mapping would lay out the full structure of Northwind’s case for the launch as a tree. This skill does neither: it isolates one presumptive argument (the analyst’s say-so), classifies it as a known type, and runs that type’s standard defeater battery to see whether its presumption survives. The deliverable is a burden-of-proof read on one inference - and the discipline that blocks both “a respected analyst said so, case closed” and “analysts are always wrong, ignored.” The scheme name (appeal to expert opinion) is the start of the evaluation, not the refutation. Re-read the evidence caveat above before acting on the verdict.

What the research does and does not show, with graded sources

Evidence Dossier: Argumentation Schemes with Critical Questions

The single source of truth for the walton-argumentation-schemes skill. The SKILL.md, the sidecar (skill.meta.yml), the references, and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill. Reformatted from the vetted proposed dossier (_local/proposed-builds/walton-argumentation-schemes/dossier.md) and admitted as a Build at tier P (confirming the wave-3 preliminary cand/build).

Skill: thinking-framework-skills.walton-argumentation-schemes (installable name think-walton-argumentation-schemes)
Family: reasoning-clarity (registry catalog family: synthesis-and-reasoning-clarity)
Evidence tier: P governing (honest read; an M-flavored controlled signal exists but only on an adjacent claim - see “What the evidence shows”)
Confidence: Moderate that scheme-keyed critical questions surface more relevant defeaters than improvised objection-raising; low that single-application evaluation accuracy or any decision-quality effect transfers to agents
Status: draft (admitted from the v0.7.0 phase-2 sweep; the sole survivor of the argumentation trio)

1. The mechanism (what actually does the work)

Most everyday arguments are not deductive proofs. They are defeasible, presumptive moves: an expert says X, so presumably X; this case is like that case, so presumably the same verdict; doing A leads to bad consequence B, so presumably do not do A. Douglas Walton’s insight is that these arguments come in a finite set of stereotyped patterns, that each pattern is legitimate (not a fallacy) when its conditions hold, and that each pattern has its own characteristic ways of failing. The method packages that insight into a two-step evaluation procedure: first identify WHICH stereotyped scheme an argument instantiates, then interrogate it with the standard critical questions keyed to that scheme.

The durable cognitive move is classify-then-probe-with-keyed-defeaters. Concretely:

  1. Extract the argument’s conclusion and stated premises.
  2. Match it against the scheme catalog - appeal to expert opinion, argument from analogy, argument from sign, argument from cause to effect, argument from consequences, argument from popular opinion or practice, slippery slope, practical reasoning, and so on (Walton 1996 defines 25 schemes; the Walton, Reed and Macagno 2008 compendium organizes roughly 60 main schemes plus about 44 sub-schemes, the count usually quoted as “96”).
  3. Instantiate the scheme’s premise slots, which mechanically exposes the implicit premises the pattern requires (for expert opinion: E is an expert, in the relevant field, E actually asserted X, E is credible and unbiased, X is consistent with what other experts say and with the evidence).
  4. Put the scheme’s critical questions to the argument.
  5. Render a presumption verdict: the argument creates a presumption that STANDS if its critical questions are answered or discharged, and FALLS where a question shifts the burden of proof and the burden goes unmet.

Two properties distinguish this from generic objection-raising. The defeaters are RETRIEVED, not improvised: each scheme’s question set encodes the accumulated knowledge of how that specific pattern fails, so coverage of the standard vulnerabilities does not depend on what occurs to the evaluator in the moment. And the semantics are presumptive: the output is not “valid/invalid” but a burden-of-proof ledger - which questions were answered, which remain open, and whether the presumption survives. The Carneades formal model (Gordon, Prakken and Walton 2007) makes this precise by modeling critical questions as typed premises (assumptions versus exceptions) that allocate the burden of proof differently per question.
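
The assumption/exception distinction can be illustrated with a small lookup. This is a simplified reading of the idea, not the Carneades system or its API: `Kind`, `burden_holder`, and the assignment of two expert-opinion questions to kinds are assumptions made for illustration.

```python
from enum import Enum


class Kind(Enum):
    ASSUMPTION = "assumption"  # taken to hold until someone questions it
    EXCEPTION = "exception"    # taken not to apply until someone shows it does


def burden_holder(kind: Kind, raised: bool) -> str:
    """Say who must act once a critical question of this kind is raised."""
    if not raised:
        return "nobody"
    if kind is Kind.ASSUMPTION:
        # Asking is enough: the proponent now has to back the premise.
        return "proponent"
    # Asking is not enough: the questioner has to show the exception holds.
    return "respondent"


# Illustrative assignment only; the text above does not specify it.
expert_opinion_questions = {
    "Is A based on evidence?": Kind.ASSUMPTION,
    "Is E biased?": Kind.EXCEPTION,
}

for question, kind in expert_opinion_questions.items():
    print(question, "->", burden_holder(kind, raised=True))
```

The point of the typing is that the same act, asking a critical question, has different effects depending on the kind of premise it targets.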

The scheme idea descends from Aristotle’s topoi and was revived in the 20th century by Perelman and Olbrechts-Tyteca and by Arthur Hastings’ 1962 dissertation, but the method in its usable form is Douglas Walton’s (University of Winnipeg, then the University of Windsor’s CRRAR), beginning with Argumentation Schemes for Presumptive Reasoning (1996), which paired 25 schemes with their critical questions and grounded evaluation in burden of proof. The mature reference is Walton, Reed and Macagno, Argumentation Schemes (Cambridge, 2008), with its compendium of roughly 60 main schemes and 44 sub-schemes; Walton and Macagno’s later work addresses classification. J. Anthony Blair (2001) wrote the standard critique. The computational line runs through Chris Reed’s Dundee group (Araucaria, 2004; the Argument Interchange Format; argument mining) and the AI-and-law formalizations (Gordon, Prakken and Walton’s Carneades, 2007; Prakken’s ASPIC+). The pedagogy line is E. Michael Nussbaum (critical questions in classroom argumentation) and Yi Song with Ralph Ferretti (writing instruction). The LLM-era line is Blanca Calvo Figueras and Rodrigo Agerri’s critical-questions-generation benchmark and the CQs-Gen shared task at ArgMining 2025.

The terms “argumentation scheme,” “presumptive reasoning,” and “critical question” are generic and descriptive; the durable move is named for what it does (classify-then-probe-with-keyed-defeaters), and the skill ships documented descriptively with the lineage credited here rather than branded. The attribution credits Douglas Walton (1996) and Douglas Walton, Chris Reed and Fabrizio Macagno (2008).

3. What the evidence shows, and what it does NOT show

The honest grade is P (practitioner), confirming the preliminary registry grade. The split, stated in full: the foundational literature is conceptual argumentation theory and the AI-and-law adoption is formal modeling, neither of which is outcome evidence; there ARE two controlled-ish classroom studies with positive results, but they measure weeks of scheme-and-critical-question INSTRUCTION on student writing and discussion quality - an adjacent claim to “applying the schemes once improves an evaluation,” and human-subjects only. By this library’s conservative rule the governing grade is P, not M.

What the record supports. This is a 30-year theoretical literature with a coherent, well-worked rationale; a formal AI-and-law adoption line; software embodiments; an annotated corpus; an active LLM benchmark; and two positive controlled classroom studies. The cleanest of those is Song and Ferretti (2013): college students in three conditions (critical questions for two schemes; the two schemes without their critical questions; no instruction); the group taught the critical questions wrote higher-quality essays with more counterarguments, alternative standpoints, and rebuttals than either contrasting condition. Notably, it was the critical questions, not the schemes alone, that carried the effect. Nussbaum and Edwards (2011), a multi-month design experiment in middle-school social-studies classes, found the critical-questions group produced more arguments integrating both sides and constructed more salient critical questions.

What the record does NOT support. No study measures single-application argument-evaluation accuracy - the actual move this skill performs. Both supportive studies measure weeks of writing and discussion instruction, which is an adjacent claim, not the move. Mis-typing is a live risk, not a corner case: Feng and Hirst (2011) found even the five most common schemes are only machine-separable at 63-91% one-against-others accuracy (against a 50% baseline), so a wrong scheme match - which corrupts every downstream critical question - is a real failure mode. And the keyed-question apparatus is NOT already free in a plain-prompted model: Calvo Figueras and Agerri (2025) built a benchmark of about 5,000 manually annotated critical questions and the companion CQs-Gen shared task, and the task is genuinely hard for current models (the top shared-task system reached only 67.6 accuracy).

Why not M: the only controlled results are two small studies of sustained classroom instruction in writing contexts; moderate-tier would require controlled evidence that applying the method once improves an evaluation or a decision, which does not exist. Why not C: the theoretical literature, the formal adoption line, the software, the corpus, the benchmark, and the two positive controlled classroom studies put it well past conceptually-plausible-but-untested. P is the honest governing grade.

4. Transferred-evidence flag (required honesty for this library)

Every study above is on human subjects - students in classroom and writing settings - and every positive result is on sustained INSTRUCTION, not on a single application of the method. None studies a scheme critique produced by or with an AI agent, nor whether an agent-produced critique improves a human’s judgment. The evidence is transferred from human contexts and not validated for AI-augmented use, which independently caps the grade at P. The Calvo Figueras and Agerri (2025) benchmark, while not human-reasoning outcome evidence, is directly relevant the other way: it indicates the keyed-question apparatus is not already free in a plain-prompted model, which is the operative question for an agent-skills library and the gap a scheme-keyed skill closes. The AI value is mechanical and modest: an agent makes the method cheap to run, forces the discipline (a stated and contestable scheme classification, the instantiated premise slots, the full keyed-question battery, the presumption verdict), and produces a durable, inspectable artifact - benefits that do not depend on any contested outcome claim. The skill ships honestly as a P-tier argument-evaluation aid with hard walls, never as a soundness proof.

5. When it works / when it fails (drives the eval negative cases and “When NOT to Use”)

Works best when:

  • A single, usually short, defeasible argument has to be evaluated for whether its presumption deserves acceptance: a recommendation resting on an authority’s say-so, an analogy doing load-bearing work in a proposal, a slippery-slope objection in a policy debate, a “users are asking for it” appeal, a consequence-based case for or against an action.
  • An argument map would be overkill or unhelpful. The scheme method is strongest exactly where a map is weakest: a one-premise pattern argument (“the analyst report says the market is contracting, so we should not enter”) maps to a trivial two-node tree, while the scheme method immediately yields the six standard expert-opinion probes.
  • Critique needs discipline in both directions. Walton’s central point is that these patterns are NOT automatic fallacies, so the method blocks both naive acceptance (“an expert said it”) and naive dismissal (“appeal to authority, ignored”).

Fails or misleads when (poor-fit / anti-patterns):

  • The case is structurally complex and multi-premise. When the question is how a whole argument hangs together and where its weakest links are, that is argument-mapping’s job; the scheme method evaluates one typed inference at a time and has no view of overall structure. This is the central routing wall.
  • The argument is deductive or statistical. The schemes formalize presumptive reasoning; a mathematical proof or a regression result is not an instance of any of them, and forcing one into a scheme degrades the analysis. Route such material out.
  • The scheme is mis-typed. Every downstream critical question is keyed to the classification, so a wrong match (reading an argument from sign as an argument from cause) produces a confident interrogation of the wrong vulnerabilities. The classification must be stated explicitly and be contestable, with the runner-up scheme named and low-confidence matches flagged (Feng and Hirst 2011 confirm mis-typing is live, not rare).
  • It becomes checklist theater. Walking the critical questions and recording shallow answers produces the appearance of scrutiny; the presumption verdict is only as good as the honesty of the answers. An answered checklist is not a soundness proof.
  • Naming the scheme is treated as the verdict. Naming the scheme is the beginning of evaluation, not a refutation. “Appeal to authority” is a classification, not a defeat.

The skill must emit a scheme critique sheet, not prose. It contains: the argument restated as conclusion plus stated premises; the identified scheme with the classification made contestable (named, with the runner-up scheme noted and a confidence flag); the instantiated premise slots, including the implicit premises the pattern requires; each keyed critical question with its answer status (answered / open / defeated) and the burden note (who must discharge it); and a presumption verdict (stands / stands-pending / falls) with the single binding open question named. A standing evidence caveat ships in the artifact by construction: the verdict is a presumptive, burden-of-proof read at tier P on transferred human-subjects evidence, never a soundness proof. The walls are enforced inside the sheet: presumptive arguments only, the scheme classification stated and contestable, no answered checklist presented as proof, no scheme name presented as a refutation.
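
The required contents of the sheet can be sketched as a record with a structural check. This is a hypothetical encoding, not the format in references/TEMPLATE.md: `CritiqueSheet`, `problems`, and the three confidence labels are assumptions, and the check catches omissions only, never shallow answers.

```python
from dataclasses import dataclass

ALLOWED_STATUS = {"answered", "open", "defeated"}
ALLOWED_VERDICT = {"stands", "stands-pending", "falls"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}  # assumed labels


@dataclass
class CritiqueSheet:
    """Hypothetical container for the fields the sheet must carry."""
    conclusion: str
    stated_premises: list[str]
    scheme: str
    runner_up: str
    confidence: str
    implicit_premises: list[str]
    questions: dict[str, tuple[str, str]]  # CQ id -> (status, burden note)
    verdict: str
    binding_open_question: str
    caveat: str

    def problems(self) -> list[str]:
        """List structural omissions; an empty list says nothing about soundness."""
        found = []
        if not self.scheme or not self.runner_up:
            found.append("scheme and runner-up must both be named")
        if self.confidence not in ALLOWED_CONFIDENCE:
            found.append("match confidence must be flagged")
        for cq, (status, note) in self.questions.items():
            if status not in ALLOWED_STATUS:
                found.append(f"{cq}: unknown status {status!r}")
            if not note.strip():
                found.append(f"{cq}: burden note is missing")
        if self.verdict not in ALLOWED_VERDICT:
            found.append("verdict must be stands, stands-pending, or falls")
        has_open = any(status == "open" for status, _ in self.questions.values())
        if has_open and not self.binding_open_question:
            found.append("binding open question must be named")
        if not self.caveat.strip():
            found.append("evidence caveat must ship with the sheet")
        return found


sheet = CritiqueSheet(
    conclusion="Northwind should not launch a self-serve free tier.",
    stated_premises=["A senior analyst says free tiers no longer convert."],
    scheme="appeal to expert opinion",
    runner_up="",  # deliberately left blank to trigger the check
    confidence="high",
    implicit_premises=["The general claim transfers to Northwind's segment."],
    questions={"CQ2": ("open", "proponent must show the claim transfers")},
    verdict="stands-pending",
    binding_open_question="CQ2",
    caveat="Presumptive burden-of-proof read at tier P; not a soundness proof.",
)
print(sheet.problems())  # ['scheme and runner-up must both be named']
```

A sheet that passes this check is complete in form only; whether the answers are honest is outside what code like this can see.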

The single durable move it adds: classify a defeasible argument as an instance of a stereotyped scheme, then test its presumptive standing with that scheme’s keyed critical questions; a question that shifts a burden which then goes unmet defeats the presumption.

  • Closest shipped skill, HIGH overlap face: think-argument-mapping (S, shipped). Honest accounting first: the skeleton is shared. Both take one argument, extract its conclusion and premises, surface implicit premises, and flag weaknesses - roughly a quarter to a third of the working whole, which presses this library’s overlap ceiling and is stated rather than hidden. The wall is in the evaluative engine, which is disjoint: argument-mapping lays out THIS argument’s particular structure as a tree and generates objections ad hoc from its content; the scheme method classifies the argument as an instance of a known TYPE and retrieves that type’s standard defeater battery, then renders a presumption verdict under burden-of-proof semantics that no map carries (a map flags weak links; it has no concept of a question shifting a burden that then goes unmet). The two also fail differently: mapping fails by garbage structure, scheme critique fails by mis-typing. Routing wall, usable by the advisor: a structurally complex multi-premise case, or “how does this whole argument hang together” - think-argument-mapping; a short typed presumptive argument, or “does this pattern’s presumption survive its standard defeaters” - this skill.
  • Why a mode or sequence cannot already produce it. Argument-mapping’s objection step contains no typology, no keyed retrieval, and no defeat semantics; think-red-team-light builds the strongest opposing CASE for a proposal (generative advocacy, not typed interrogation of one inference); think-ladder-of-inference-check audits one reasoning chain from data to conclusion with no pattern catalog; think-evidence-vs-inference-sort classifies statements, not argument types. No chain of these retrieves the expert-opinion battery when it sees an expert-opinion argument. The empirical point that this is not free in the model: Calvo Figueras and Agerri (2025) found models perform poorly at generating useful critical questions. The formal record agrees the apparatus is additional machinery, not a notational variant: Araucaria had to import Walton’s catalog wholesale, and Carneades had to invent typed premises and proof standards to model the critical questions.
  • The sole survivor of the argumentation trio (v0.7.0 phase-2 reconciliation). Of {walton-argumentation-schemes, toulmin-argument-model, issue-position-argument-mapping} versus shipped argument-mapping, at most one builds and exactly one does. toulmin-argument-model folds into argument-mapping (claim/data/warrant/rebuttal map about 1:1; argument-mapping’s own dossier names Toulmin as its ancestor). issue-position-argument-mapping (IBIS) becomes a recipe (its three node types each map to a shipped move). This candidate builds because it is the only one with a move argument-mapping lacks: scheme typing plus a retrieved per-scheme defeater battery plus a burden-of-proof presumption verdict.
  • The runner-up reading, recorded honestly. Fold-with-enrichment into argument-mapping (add “where an inference instantiates a known scheme, apply that scheme’s critical questions as the objection generator” to its objection step) captures a real fraction of the value. What it loses: the scheme catalog itself, the complete per-scheme question batteries, the presumption/burden verdict, and the named artifact - machinery that would roughly double the target skill and change its evaluation procedure, which is the signature of a second method rather than a mode. The dossier’s judgment is Build; the fold is the defensible second-place verdict if catalog parsimony is weighted over the residue.
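The difference between ad hoc objection generation and keyed retrieval, discussed in the bullets above, can be sketched as a catalog lookup. This is an illustrative sketch only: the catalog structure, the key `"expert_opinion"`, and the function name are hypothetical, and the question wording paraphrases the six critical questions commonly given for Walton's argument from expert opinion. Other schemes are omitted.

```python
from typing import Dict, List

# Hypothetical catalog: scheme name -> its standard critical questions.
# Only one scheme is filled in; wording is a paraphrase, not a quotation.
SCHEME_CATALOG: Dict[str, List[str]] = {
    "expert_opinion": [
        "Expertise: how credible is E as an expert source?",
        "Field: is E an expert in the field the claim belongs to?",
        "Opinion: what did E actually assert that implies the claim?",
        "Trustworthiness: is E personally reliable as a source?",
        "Consistency: is the claim consistent with what other experts say?",
        "Backup evidence: is E's assertion based on evidence?",
    ],
}


def retrieve_battery(scheme: str) -> List[str]:
    """Return the defeater battery keyed to the classified scheme.

    The questions are retrieved, not improvised: coverage depends on the
    catalog, not on what occurs to the evaluator. The retrieval is only
    as good as the classification, so a mis-typed argument gets the wrong
    battery, and an uncatalogued scheme fails loudly here.
    """
    if scheme not in SCHEME_CATALOG:
        raise KeyError(f"no battery catalogued for scheme: {scheme!r}")
    return list(SCHEME_CATALOG[scheme])  # copy, so callers cannot mutate the catalog


battery = retrieve_battery("expert_opinion")
```

A lookup that raises on an unknown scheme, instead of falling back to generic objections, keeps the typed method from silently degrading into the untyped one.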
  1. Douglas N. Walton (1996), Argumentation Schemes for Presumptive Reasoning, Lawrence Erlbaum. The canonical statement: defines 25 presumptive schemes and matches a set of critical questions to each, with the burden-shifting account of how a presumptive argument is evaluated. Foundational and conceptual; defines the method, measures nothing. (Foundational.)
  2. Douglas Walton, Chris Reed and Fabrizio Macagno (2008), Argumentation Schemes, Cambridge University Press. The mature compendium: systematic analysis of the major schemes plus a user’s compendium (roughly 60 main schemes and 44 sub-schemes, often summarized as “96”), with the classification problem treated head-on. (Foundational; conceptual.)
  3. J. Anthony Blair (2001), “Walton’s Argumentation Schemes for Presumptive Reasoning: A Critique and Development,” Argumentation 15: 365-379. The standard internal critique: presses on scheme individuation and the unsettled logical status of the critical questions. Evidence of serious methodological scrutiny, not of outcomes. (Critique; P.)
  4. Thomas F. Gordon, Henry Prakken and Douglas Walton (2007), “The Carneades model of argument and burden of proof,” Artificial Intelligence 171(10-15): 875-896. Formalizes critical questions as typed premises (assumptions, exceptions) allocating burden of proof per question, with proof standards. Demonstrates the critical-question apparatus required NEW machinery beyond argument structure. Not outcome evidence. (Formalization / field uptake; P.)
  5. Chris Reed and Glenn Rowe (2004), “Araucaria: Software for Argument Analysis, Diagramming and Representation,” International Journal on Artificial Intelligence Tools 14(3-4): 961-980. Argument-diagramming software designed from the outset to handle schemes; the key overlap fact is that the analysis tradition treats schemes as a layer that mapping tools had to import from Walton. (Adoption; P.)
  6. Vanessa Wei Feng and Graeme Hirst (2011), “Classifying arguments by scheme,” Proceedings of ACL-HLT 2011. Machine classification of arguments into the five most common schemes on the Araucaria corpus (about 660 annotated arguments): 63-91% one-against-others, 80-94% pairwise, against a 50% baseline. Computational feasibility plus an honest measure of how confusable the types are; the basis for the mis-typing wall. (Computational feasibility, not reasoning outcomes; P.)
  7. E. Michael Nussbaum and Ordene V. Edwards (2011), “Critical Questions and Argument Stratagems: A Framework for Enhancing and Analyzing Students’ Reasoning Practices,” Journal of the Learning Sciences 20(3): 443-488. Multi-month design experiment in middle-school social-studies classes: the critical-questions group produced more arguments integrating both sides and constructed more salient critical questions. Positive but small, quasi-experimental, classroom, sustained instruction; an adjacent claim to the single-application move. (Controlled instruction study.)
  8. Yi Song and Ralph P. Ferretti (2013), “Teaching critical questions about argumentation through the revising process: effects of strategy instruction on college students’ argumentative essays,” Reading and Writing 26: 67-90. College students, three conditions; the group taught the critical questions wrote higher-quality essays with more counterarguments, alternative standpoints, and rebuttals than either contrasting condition. The cleanest controlled result in the record, and notable that the critical questions, not the schemes alone, carried the effect; still writing instruction, small N, an adjacent claim. (Controlled instruction study.)
  9. Blanca Calvo Figueras and Rodrigo Agerri (2025), “Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models,” Findings of EMNLP 2025 (arXiv:2505.11341), with the companion CQs-Gen shared task at the 12th Workshop on Argument Mining (ACL 2025). About 5,000 manually annotated critical questions grounded in argumentation-scheme theory; the benchmark and shared task show the task is genuinely hard for current models (the top shared-task system reached only 67.6 accuracy). Not human-reasoning outcome evidence, but directly relevant: the keyed-question apparatus is not already free in a plain-prompted model. (LLM benchmark.)

Excluded on the evidence rule: no single-application argument-evaluation accuracy figure is asserted as fact, because no study measures it; the two positive controlled effects (Song and Ferretti; Nussbaum and Edwards) are reported with their sustained-instruction, writing-context, human-subjects limitations, and the field-feasibility and benchmark evidence is not laundered into an outcome claim. The governing grade is the conservative P.

Thinking Framework Skills v0.8.0 · 56 frameworks