Pairwise Comparison

Ranking a set of options holistically forces an unstable judgment: “rank these six from best to worst” or “score each from 1 to 10” demands a fixed internal scale and a memory of how earlier items were scored, and that scale wobbles. Pairwise comparison replaces the one hard holistic ranking with a series of isolated two-item judgments. For every pair it asks the single easier question - which of these two is better? - then tallies each item’s wins into a matrix and reads the ranking off the tally. The durable move is psychophysical: a person (or an agent) holds a far more stable signal for “A beats B” than for “A is a 7,” because the binary judgment needs no absolute scale and no criteria axis. The matrix also exposes its own quality - a cycle (A beats B, B beats C, C beats A) is a visible inconsistency to revisit, not a hidden error.

The output is a binary-vote comparison matrix: every pair judged A-beats-B, the derived ranking (each option’s win count, ties broken by head-to-head), and a consistency check that surfaces any cycles. There is deliberately no criteria column and no absolute score. This is the narrow, honest reading of the technique - rank when you cannot score - and it is scoped precisely to the case the weighted-matrix skills disclaim.

When to Use

Items must be ordered, but no one can defend a 1-to-10 scale and the criteria that would justify a score genuinely cannot be articulated (qualitative artifacts - writing samples, design submissions, shortlisted proposals - where holistic marking is noisy).
The decision-maker can reliably say “this one beats that one” for any pair, even though they cannot say “this one is a 7.”
The set is small enough to compare every pair by hand (roughly up to 6-8 items; the full set is n(n-1)/2 judgments - 15 for six items, 28 for eight).
A surfaced cycle (an intransitivity) would be useful information - a prompt to re-examine two judgments - rather than noise to suppress.
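
The judgment counts quoted for each set size follow from the n(n-1)/2 formula. A minimal Python sketch for checking a list before starting; the function name `pair_count` is an illustrative assumption, not something the skill defines:

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


# Set sizes mentioned in this document.
for n in (5, 6, 8, 10, 16):
    print(n, pair_count(n))
```

The count grows quadratically, which is why the by-hand ceiling sits at roughly eight items.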

When NOT to Use

Do not use it when criteria are nameable and a scale is defensible. That is think-decision-option-review - the criteria-weighted option matrix. Its procedure already defines and weights the criteria, scores each option on a stated scale, and surfaces tradeoffs. If you can say what a high score means, that skill is faster and more inspectable, and pairwise voting on the criteria is just an elaborate way to fill one weight column it already owns. Pairwise comparison occupies the case think-decision-option-review explicitly disclaims (“when the criteria genuinely cannot be articulated”).
Do not use it to weight criteria for a scoring model. Comparing two criteria at a time to set their relative importance (the AHP / PAPRIKA elicitation) produces no new artifact - it fills the weight vector of think-decision-option-review. Route criteria-weighting there as an optional elicitation, not here.
Do not use it to fix a repeatable formula over named cues. When you have explicit cues and want a weighted rule applied to many future cases (the same prediction made again and again), that is think-linear-model-aggregation. It needs the very cues and scale pairwise comparison refuses; pairwise produces a one-off ranking, not a reusable model.
Do not use it to anchor a number on a base rate. Estimating a quantity by comparing the case to a reference class of similar past cases is think-reference-class-forecasting. That anchors one number on outside-view data; pairwise comparison orders a fixed set by internal head-to-head votes.
Do not use it when the item count is large. Past roughly 8 items the n(n-1)/2 judgment count (45 for ten, 120 for sixteen) becomes punishing, and pruning the matrix while preserving the order needs adaptive or incomplete-design tooling a markdown-only agent cannot run by hand. Cut the set down first, or use a scored method.
Do not treat the derived ranking as objective output. A passing consistency check does not make a manufactured preference correct. Pairwise comparison launders subjective inputs into a clean-looking order; the cleanliness is presentational, and a near-duplicate option added to the set can shift the others’ ranking (a structural artifact, not a judgment error).

Instructions

When asked to rank a set of options where no criteria or scale can be defended, follow these steps:

State the ranking question and list the items. Name the single comparative question that will be asked of each pair (“which of these two better serves X?”) and list the items to be ordered. Confirm there is genuinely no defensible absolute scale and no articulable criteria axis - if there is, stop and route to think-decision-option-review.
Check the item count. Compute n(n-1)/2 for the list. If it is more than roughly 28 (eight items), say so and either cut the set down or hand off to a scored method - do not attempt a large matrix by hand.
Enumerate every unordered pair. For n items there are n(n-1)/2 pairs. List them so none is missed.
Judge each pair A-beats-B. For each pair, record which item wins the single comparative question. To dampen order effects, consider the pair in both orders before committing the winner. Ties are allowed only if genuinely undecidable; prefer forcing a winner.
Build the comparison matrix. Fill an n-by-n grid: cell (row, col) records whether the row item beat the column item. Each off-diagonal pair is one win and one loss (the matrix is the votes, not scores).
Derive the ranking. Count each item’s wins (its row total). Order by win count, highest first. Break ties by the head-to-head result between the tied items.
Run the consistency check. Scan for cycles - any A beats B, B beats C, C beats A triangle. Report each cycle found. A cycle is not auto-corrected: it is flagged as the pair(s) of judgments to revisit, because it signals the comparative question shifted or two judgments conflict.
Emit the artifact per references/TEMPLATE.md: the ranking question, the comparison matrix, the derived ranking with win counts, and the consistency check (cycles found or “transitive, none found”). State plainly that the order is a forced-choice ranking, not an absolute score.
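
The mechanical parts of the steps above (enumerating pairs, tallying wins, scanning for cycles) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's implementation: the function names, the `winner_of` mapping, and the three-item set X, Y, Z are all hypothetical. The judgments themselves still come from the decision-maker.

```python
from itertools import combinations


def all_pairs(items):
    """Every unordered pair, n(n-1)/2 of them, so none is missed."""
    return list(combinations(items, 2))


def win_counts(items, winner_of):
    """Tally one win per judged pair.

    winner_of maps each (a, b) pair to the item that won it.
    """
    wins = {item: 0 for item in items}
    for pair in all_pairs(items):
        wins[winner_of[pair]] += 1
    return wins


def find_cycles(items, winner_of):
    """Report every triangle whose three votes form a cycle."""
    def beats(x, y):
        key = (x, y) if (x, y) in winner_of else (y, x)
        return winner_of[key] == x

    cycles = []
    for a, b, c in combinations(items, 3):
        if beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))
        elif beats(a, c) and beats(c, b) and beats(b, a):
            cycles.append((a, c, b))
    return cycles


# Hypothetical three-item set with deliberately cyclic votes.
items = ["X", "Y", "Z"]
winner_of = {("X", "Y"): "X", ("Y", "Z"): "Y", ("X", "Z"): "Z"}

print(win_counts(items, winner_of))   # each item ends up with one win
print(find_cycles(items, winner_of))  # the X > Y > Z > X triangle
```

The scan only reports cycles; in keeping with the consistency-check step, it does not try to resolve them.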

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the filled artifact - the comparison matrix of binary A-beats-B votes, the derived ranking with win counts, and the consistency check - not a prose essay and not a scored or criteria-weighted matrix. Never attach an absolute score or a criteria column to the order.

Quality Checklist

Before finalizing, verify:

There is genuinely no defensible absolute scale and no articulable criteria axis - otherwise this is the wrong skill (route to think-decision-option-review).
The single comparative question is stated once and applied identically to every pair (a shifting question is what manufactures cycles).
Every one of the n(n-1)/2 pairs is judged - none skipped.
The item count was checked; a set too large to compare by hand was cut down or handed off, not forced.
The ranking is derived from win counts, with ties broken by head-to-head - no absolute scores and no criteria column appear anywhere.
The consistency check is run and reported: cycles are surfaced as judgments to revisit, not silently dropped or auto-resolved.
The output is framed as a forced-choice ranking, not an objective measurement; no false precision is claimed.
No overclaiming: the evidence is practitioner-grade and transferred; claim an easier-and-more-stable ranking aid, not a measured gain in decision quality (see evidence/dossier.md).

Evidence

Tier P (governing). Pairwise comparison has a deep lineage and a robust psychometric core for scaling - Thurstone’s Law of Comparative Judgment (1927) places stimuli on an interval scale from paired comparisons, and educational comparative-judgement work (Pollitt, from 2004) reports high scale-separation reliability for ranking open-ended work. But that evidence is for the reliability of a scaling artifact, not for better decisions by an agent that runs the method, and it is contested: Bramley (2015) and Bramley and Vitello (2019) show the adaptivity in Adaptive Comparative Judgement inflates the reliability statistic (a reported 0.97 deflating to 0.84), and Verhavert and colleagues (2022) question the rationales. The often-quoted “88 percent versus 74 percent” weight-elicitation figure is real but belongs to Direct Rating versus Point Allocation (Bottomley, Doyle and Green 2000) and does not test pairwise comparison at all - it is excluded. In the AHP tradition the apparatus is criticised on its own terms (rank reversal under consistent judgments; an over-strict consistency ratio). The classic necessity argument (Tversky 1969, intransitivity of preference) has largely failed to replicate (Regenwetter and colleagues, 2011). All of this is from human subjects; the nearest agent-relevant evidence is the LLM-as-judge literature, which finds pairwise evaluation approximates human preference better than pointwise scoring but suffers a position bias that must be corrected by running both orders (“Judging the Judges,” 2024) - and that is an agent evaluating outputs, not a decider running a decision. The evidence is transferred from human contexts and not validated for AI-augmented decision-making, which caps the grade at P. The skill ships as an easier-and-more-stable ranking aid, scoped to the no-scale case, never as an objective scorer. Full grading, sources, and the excluded figures: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed comparison matrix on a real decision.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Pairwise Comparison Matrix - Worked Example

A completed run of the pairwise-comparison skill on a real, consequential decision. This is the quality bar a generated comparison matrix should meet.

Uses the shared recurring scenario (Northwind, a B2B SaaS weighing a self-serve free-tier launch). Pairwise comparison does not fit the whole free-tier bet - that is a criteria-and-tradeoffs decision for think-decision-option-review (named criteria, a defensible scale). It fits a sub-decision inside the launch where no scale applies: choosing the name and core message for the free tier from five qualitative candidate concepts that marketing, product, and the founders each react to differently and that no one can defend a 1-to-10 score for. That is exactly the “rank qualitative artifacts where holistic marking is noisy and criteria cannot be articulated” case. See docs/internal/AUTHORING.md.

This is a forced-choice ranking, not an absolute score. There is no criteria column and no 1-to-10 scale below.

Ranking question and items

Ranking question: “For a developer or team lead landing on the free-tier signup page cold, which of these two concepts better makes them want to start using Northwind today?”
Why no scale: The team tried scoring each concept 1-to-10 on “clarity,” “appeal,” and “fit” and the scores drifted every session - no one could defend what a 7 meant, and the criteria themselves were contested (is “playful” a plus or a minus here?). But for any two concepts, people could reliably say which one they would rather land on. So: rank by head-to-head, no scale.
Items to rank (n = 5):
- A: “Northwind Free” - plain, literal, says exactly what it is
- B: “Start with Northwind” - action-framed, invitational
- C: “Northwind Spark” - playful sub-brand, hints at a smaller/lighter product
- D: “Northwind for Teams, on us” - leads with the team use case and the gift framing
- E: “Try Northwind free, no card” - leads with the friction-removal (no credit card)

Item-count check

Pairs to judge: n(n-1)/2 = 5(4)/2 = 10. Small enough to judge every pair by hand. No pruning or hand-off needed.

The comparison matrix (binary A-beats-B votes)

Read each cell as “does the ROW concept beat the COLUMN concept on the ranking question?”: W means the row concept won that pair, L means it lost. Each pair was considered in both orders before the winner was committed. The diagonal (a concept against itself) is marked “-”.

Concept	vs A	vs B	vs C	vs D	vs E	Wins
A “Northwind Free”	-	L	W	W	L	2
B “Start with Northwind”	W	-	W	W	L	3
C “Northwind Spark”	L	L	-	L	L	0
D “Northwind for Teams, on us”	L	L	W	-	W	2
E “Try Northwind free, no card”	W	W	W	L	-	3

Pair-by-pair record (the 10 judgments):

A vs B: B (the invitation beats the bare label)
A vs C: A (“Spark” reads as a toy; plain is safer)
A vs D: A (broad beats narrowing to teams for a cold visitor)
A vs E: E (removing the card objection beats just saying “Free”)
B vs C: B
B vs D: B (action framing beats the gift framing)
B vs E: E (no-card concreteness edged out the generic invitation)
C vs D: D
C vs E: E
D vs E: D (for a team landing, the team framing edged out no-card)
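
As a cross-check, the matrix and its win counts can be rebuilt from the ten recorded judgments. A minimal Python sketch; the names `WINNER`, `cell`, `matrix`, and `wins` are illustrative assumptions, and the data is copied from the pair-by-pair record above:

```python
ITEMS = "ABCDE"

# The ten recorded judgments: pair -> winner.
WINNER = {
    "AB": "B", "AC": "A", "AD": "A", "AE": "E", "BC": "B",
    "BD": "B", "BE": "E", "CD": "D", "CE": "E", "DE": "D",
}


def cell(row: str, col: str) -> str:
    """Matrix cell: 'W' if row beat col, 'L' if it lost, '-' on the diagonal."""
    if row == col:
        return "-"
    pair = "".join(sorted(row + col))
    return "W" if WINNER[pair] == row else "L"


matrix = {row: [cell(row, col) for col in ITEMS] for row in ITEMS}
wins = {row: cells.count("W") for row, cells in matrix.items()}

for row in ITEMS:
    print(row, " ".join(matrix[row]), wins[row])
```

Each printed row should match the corresponding row of the matrix: a mismatch would mean a vote was transcribed wrongly.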

Derived ranking

Order by win count, highest first; ties broken by head-to-head.

Rank	Item	Wins	Tie-break note
1 (tie on wins)	E “Try Northwind free, no card”	3	E and B both have 3 wins. E beat B head-to-head, so E ranks first.
2	B “Start with Northwind”	3	Lost to E head-to-head.
3 (tie on wins)	A “Northwind Free”	2	A and D both have 2 wins. A beat D head-to-head, so A ranks third - see the consistency note below.
4	D “Northwind for Teams, on us”	2	Lost to A head-to-head.
5	C “Northwind Spark”	0	Lost every pair - the clear cut.
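
The ordering rule (win count first, head-to-head on ties) can be sketched as a comparator over the recorded votes. This is an illustration, not the skill's implementation: the names are hypothetical, and the head-to-head tie-break is only well defined when a tie involves two items, which is the case here.

```python
from functools import cmp_to_key

ITEMS = "ABCDE"

# The ten recorded judgments from the pair-by-pair record: pair -> winner.
WINNER = {
    "AB": "B", "AC": "A", "AD": "A", "AE": "E", "BC": "B",
    "BD": "B", "BE": "E", "CD": "D", "CE": "E", "DE": "D",
}

wins = {item: sum(1 for w in WINNER.values() if w == item) for item in ITEMS}


def head_to_head(x: str, y: str) -> str:
    """Winner of the direct vote between x and y."""
    return WINNER["".join(sorted(x + y))]


def compare(x: str, y: str) -> int:
    """Negative when x should rank above y."""
    if wins[x] != wins[y]:
        return wins[y] - wins[x]          # more wins ranks higher
    return -1 if head_to_head(x, y) == x else 1


ranking = sorted(ITEMS, key=cmp_to_key(compare))
print(ranking)  # ['E', 'B', 'A', 'D', 'C']
```

A tie among three or more items whose votes form a cycle cannot be broken this way and would need to go back to the decision-maker.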

Consistency check

Scan for cycles (A beats B, B beats C, C beats A triangles).

Result: Two cycles found, both running through the D-beats-E vote. First, A beats D (A vs D), D beats E (D vs E), but E beats A (A vs E): a 3-cycle, A > D > E > A. Second, B beats D (B vs D), D beats E (D vs E), but E beats B (B vs E): a 3-cycle, B > D > E > B. The win counts hide both because they aggregate across all pairs, but the head-to-head votes are intransitive here.
Action: Revisit these judgments, starting with D vs E, the one vote both cycles share. The likely cause is that the ranking question quietly shifted - “D beats E” was judged with a team visitor in mind (“for a team landing, the team framing wins”), while “E beats A” and the rest were judged for a generic cold developer. The comparative question must be held fixed. Re-decide the audience for the page, then re-judge D vs E and the other votes in the two triangles under that single audience. Until then, treat every position that rests on those votes as unsettled: the A-vs-D ordering (ranks 3-4) and the order at the top (E, then B). Only the bottom (C) is unaffected, because C lost every pair and none of its votes sits in either cycle.
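
A mechanical scan over all ten triangles is a cheap way to make sure no cycle is missed by eye. A minimal Python sketch over the recorded votes (names are illustrative assumptions); on this data it reports the A/D/E triangle and also the B/D/E triangle, which share the D-beats-E vote:

```python
from itertools import combinations

ITEMS = "ABCDE"

# The ten recorded judgments from the pair-by-pair record: pair -> winner.
WINNER = {
    "AB": "B", "AC": "A", "AD": "A", "AE": "E", "BC": "B",
    "BD": "B", "BE": "E", "CD": "D", "CE": "E", "DE": "D",
}


def beats(x: str, y: str) -> bool:
    """True if x won the direct vote against y."""
    return WINNER["".join(sorted(x + y))] == x


cycles = []
for a, b, c in combinations(ITEMS, 3):
    # A triangle is cyclic in one of two orientations.
    if beats(a, b) and beats(b, c) and beats(c, a):
        cycles.append(f"{a} > {b} > {c} > {a}")
    elif beats(a, c) and beats(c, b) and beats(b, a):
        cycles.append(f"{a} > {c} > {b} > {a}")

print(cycles)  # ['A > D > E > A', 'B > D > E > B']
```

The scan flags judgments to revisit; it deliberately does not pick which vote to flip.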

Honest framing

This order is a forced-choice ranking derived from head-to-head votes, not an objective measurement of concept quality. The surfaced A > D > E > A cycle is the artifact earning its keep: it caught a drifting comparison question that a single holistic score would have buried. A passing consistency check would not have made any concept “correct,” and adding a sixth near-duplicate concept (say another no-card variant) could shift these rankings - so the matrix informs the naming call, it does not settle it.
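
The near-duplicate effect can be made concrete. The sketch below adds a hypothetical sixth concept F (another no-card variant) whose votes are invented purely for illustration: F is assumed to mirror E's results (beats A, B, and C, loses to D) and to lose to E directly. No existing vote changes, yet D overtakes A on win count.

```python
from itertools import combinations


def win_counts(items: str, winner: dict) -> dict:
    """One win per judged pair; winner maps 'XY' -> winning item."""
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        wins[winner[a + b]] += 1
    return wins


# The ten recorded judgments from the worked example.
WINNER = {
    "AB": "B", "AC": "A", "AD": "A", "AE": "E", "BC": "B",
    "BD": "B", "BE": "E", "CD": "D", "CE": "E", "DE": "D",
}
before = win_counts("ABCDE", WINNER)

# HYPOTHETICAL votes for an invented near-duplicate F of concept E.
with_f = dict(WINNER, AF="F", BF="F", CF="F", DF="D", EF="E")
after = win_counts("ABCDEF", with_f)

print(before["A"], before["D"])  # 2 2  (tie, A ahead on head-to-head)
print(after["A"], after["D"])    # 2 3  (D now ahead on win count)
```

D gains a win only because a second option resembling E entered the set, which is the structural artifact the paragraph above warns about.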

Note how this differs from its neighbors. think-decision-option-review would handle the larger free-tier bet - named criteria (reach, conversion, support cost), a defensible scale, weighted scores, and a recommendation with stated tradeoffs. This skill is for the sub-problem where that scale collapses: five qualitative concepts no one can score but anyone can compare two at a time. The deliverable is a consistency-checked order, not a weighted total - and crucially, the cycle it surfaced is information, not an error to hide.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Pairwise Comparison

The single source of truth for the pairwise-comparison skill. The SKILL.md, the sidecar (skill.meta.yml), and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill. Promoted from frameworks/_proposed/pairwise-comparison/dossier.md and admitted as a Build at tier P.


Skill	`thinking-framework-skills.pairwise-comparison` (installable name `think-pairwise-comparison`)
Family	decision-and-option-evaluation
Evidence tier	P governing (a recognised practitioner technique with a real psychometric core for scaling, but no clean controlled evidence that an agent who runs it decides better, and its strongest evidence belongs to assessment reliability and is partly deflated by its own field - see “What the evidence shows”)
Confidence	Moderate that forced binary comparison yields a more stable signal than holistic scoring and surfaces intransitivity; low that any decision-quality effect transfers to an agent
Status	draft (admitted as a guarded Build at tier P; scoped to the rank-without-a-scale reading only)

1. The mechanism (what actually does the work)

Pairwise comparison replaces one hard holistic ranking with a series of isolated two-item judgments. Instead of “rank these six options from best to worst” or “score each option from 1 to 10,” it asks, for every pair, the single easier question “which of these two is better?” - then tallies each item’s wins into a comparison matrix and reads a ranking (or a set of relative weights) off the matrix.

The durable claim is psychophysical: humans hold a far more stable signal for “A beats B” than for “A is a 7,” because the binary judgment needs no fixed internal scale and no memory of how earlier items were scored. The matrix also exposes its own quality - a cycle (A beats B, B beats C, C beats A) is a visible inconsistency to revisit rather than a hidden error.

The honest description has to separate two operations that share the binary-judgment core but land on different artifacts and different verdicts:

(1) Rank-the-options-with-no-usable-scale (the Thurstone / comparative-judgement sense): there are items to order - candidate essays, design submissions, shortlisted bids, options whose merit resists any agreed rubric - and no defensible way to score them absolutely. Judge pairs and derive the order. The product is a ranking (and a derived interval scale) built without criteria.
(2) Weight-the-criteria-by-pairwise-voting (the AHP / PAPRIKA sense): a criteria-and-options decision already exists, and pairwise comparison is used only to set the relative importance of the criteria (and sometimes of options within a criterion), feeding a weighted scoring model. The product is a vector of criterion weights for a decision matrix.

That split is the central fact for this library: reading (2) is a sub-procedure of a method the catalog already ships, while reading (1) is the one separable move. This skill ships only reading (1). It emits a binary-vote comparison matrix, a derived ranking, and a consistency check, with no criteria axis and no absolute scoring.

2. Lineage

The psychometric foundation is Louis Leon Thurstone, “A Law of Comparative Judgment,” Psychological Review 34 (1927) - the result that paired comparisons yield an interval scale - extended by the Bradley-Terry (1952) and Luce choice models. The decision-analytic popularisation is Thomas L. Saaty’s Analytic Hierarchy Process (from the 1970s), which uses ratio-scaled pairwise comparison matrices and a consistency ratio to derive criterion weights; it is read alongside its critics on rank reversal and consistency (Triantaphyllou; Dyer’s 1990 critique). For the modern decision-software lineage, Hansen and Ombler (2008) describe the PAPRIKA method (commercialised as 1000minds), which elicits weights through pairwise trade-off questions. The assessment revival and its reliability debate run from Alastair Pollitt, “Let’s Stop Marking Exams” (2004), to the counter-weight of Tom Bramley (Cambridge Assessment, 2015; Bramley and Vitello 2019) on reliability inflation, plus Verhavert et al. (2022) on the rationales. The necessity argument and its failure to replicate are Amos Tversky, “Intransitivity of Preferences” (1969), set against the modern transitivity literature (Regenwetter, Dana and Davis-Stober 2011).

“Pairwise comparison” and “paired comparison analysis” are generic descriptive terms in common use - no trademark or attribution required beyond crediting Thurstone and Saaty - so this entry is documented descriptively and is not flagged as branded. Specific commercial implementations (AHP toolchains, 1000minds / PAPRIKA) are their owners’ products and are cited as lineage, not shipped.

3. What the evidence shows, and what it does NOT show

The honest governing grade is P (practitioner), and this entry has to be unusually careful, because pairwise comparison is a case where genuinely strong-looking research exists but measures an adjacent claim - the reliability of psychometric scaling, not the quality of a decision - and is itself contested.

What the record supports. Pairwise comparison has a real, deep lineage and a robust psychometric core for scaling. Thurstone’s Law of Comparative Judgment (1927) established that a series of paired comparisons can place stimuli on an interval scale, and it is mathematically related to the Bradley-Terry-Luce model. In educational assessment, Pollitt’s comparative-judgement programme (from 2004) reports high scale-separation reliability (commonly above 0.80, reaching the mid-0.90s) for ranking open-ended work such as essays - higher than conventional marking in several reported studies. For weight elicitation specifically, the practitioner and decision-analysis literatures hold that pairwise judgments are cognitively easier than ranking or scoring a full list and discriminate well between many criteria. As a stance and a scaling tool, this is well attested.
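
For orientation, the standard textbook forms of the two models named above are given here; they are not taken from the dossier's sources. The symbols are the usual ones: \mu_i and \sigma are the mean and common dispersion of the discriminal processes, \Phi is the standard normal CDF, and \pi_i is the Bradley-Terry-Luce strength parameter. The relationship is that both map a difference in latent merit to a win probability, with a normal link in one case and a logistic link in the other.

```latex
% Thurstone Case V: equal, uncorrelated discriminal dispersions \sigma
P(i \succ j) = \Phi\!\left(\frac{\mu_i - \mu_j}{\sqrt{2}\,\sigma}\right)

% Bradley-Terry-Luce: the same structure with a logistic link
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}
             = \frac{1}{1 + e^{-(\lambda_i - \lambda_j)}},
\qquad \lambda_i = \ln \pi_i
```

Fitting either model needs repeated or multi-judge comparisons; the skill itself stops at win counts and does not estimate these parameters.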

What the record does NOT support, and the laundering traps. The strong evidence is for reliability of scaling artifacts, not for better decisions by an agent that runs the method, and the assessment evidence is contested:

Bramley (Cambridge Assessment, simulation work 2015; Bramley and Vitello 2019) showed that the adaptivity in Adaptive Comparative Judgement inflates the reported reliability statistic - in one GCSE-English study a reported 0.97 deflated to 0.84, and spurious separation appeared even on random data. Verhavert and colleagues’ 2022 “call for clarity” questions the rationales offered for comparative judgement outright. So even the headline reliability numbers cannot be taken at face value, and they are reliability of a ranking of essays, not decision quality.
The often-quoted weight-elicitation result - that test-retest weights reproduced the same chosen alternative 88 percent of the time versus 74 percent - is Bottomley, Doyle and Green (2000), and it compares Direct Rating against Point Allocation; it does not test pairwise comparison at all. It is frequently mis-cited as evidence for pairwise weighting; counted honestly it does not bear on this move and is excluded from the grade.
In the AHP tradition, the pairwise apparatus is empirically criticised on its own terms: the consistency ratio rejects judgments that are reasonable and non-random, and rank reversal can occur even under strictly consistent comparisons (Triantaphyllou and others). This is contra-evidence on the weighting reading, not support.
The classic case for the necessity of pairwise (Tversky 1969, systematic intransitivity of preference) has substantially failed to replicate; the contemporary consensus is that true preference cycles are vanishingly rare and earlier data are compatible with noisy-but-transitive responses. The strongest argument for “you must compare pairs because holistic preference is intransitive” is therefore weaker than its reputation.

Borrowing the Thurstone / Pollitt scaling reliability, or the Bottomley DR-vs-PA result, to lift this method to M would be laundering an adjacent claim’s robustness onto a move neither one tested. The conservative governing grade is therefore P: a recognised, well-lineaged practitioner technique with a real psychometric core for scaling, but no clean controlled evidence that an agent who runs pairwise comparison decides better, and with its strongest-looking evidence belonging to assessment reliability and partly deflated by its own field.

4. Transferred-evidence flag (required honesty for this library)

Every result above is from human subjects - psychophysics, exam marking, weight-elicitation surveys, and preference experiments. None studies a ranking produced by or with an AI agent. The nearest agent-relevant evidence is the LLM-as-judge literature, which finds pairwise evaluation approximates human preference better than pointwise scoring but suffers position bias that must be corrected by running both orders and aggregating (for example the 2024 “Judging the Judges” position-bias study). That is about an agent evaluating outputs, not a decider running a decision, and even there pairwise is not unambiguously better. The evidence is transferred from human contexts and not validated for AI-augmented decision-making, which independently caps the grade at P. The AI value is mechanical and modest: an agent makes the full pairwise pass cheap to run, forces the discipline (one fixed comparative question, every pair judged, a real consistency check), and produces a durable, inspectable artifact - benefits that do not depend on any contested outcome claim. The skill ships honestly as a P-tier ranking aid for the no-scale case, never as an objective scorer.
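
The both-orders correction can be sketched as a small wrapper. Everything here is hypothetical: `judge` stands in for whatever produces a verdict on an ordered presentation of two items, and the two toy judges exist only to show the behaviour. The sketch takes the conservative aggregation of committing a winner only when both orders agree, which is one possible reading of "aggregating."

```python
from typing import Callable, Optional


def judge_both_orders(a: str, b: str,
                      judge: Callable[[str, str], str]) -> Optional[str]:
    """Ask the same question in both presentation orders.

    Returns the winner when both orders agree, or None when the verdict
    flips with position and the pair needs another look.
    """
    first = judge(a, b)
    second = judge(b, a)
    return first if first == second else None


# Hypothetical judges, for illustration only.
def always_first(x: str, y: str) -> str:
    """A fully position-biased judge: whatever is shown first wins."""
    return x


def shorter_wins(x: str, y: str) -> str:
    """An order-independent judge (for unequal lengths): shorter wins."""
    return min(x, y, key=len)


print(judge_both_orders("Northwind Free", "Northwind Spark", always_first))
print(judge_both_orders("Northwind Free", "Start with Northwind", shorter_wins))
```

The first call returns None because the verdict depends on position alone; the second returns the same winner in both orders.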

5. When it works / when it fails (drives the eval negative cases and “When NOT to Use”)

Works best when:

Absolute scoring is the bottleneck: the criteria are subjective, vague, or competing, and no one can defend a 1-to-10 scale, but any two items can be compared head-to-head.
The items are qualitative artifacts (writing, designs, proposals) where holistic marking is noisy, and a defensible order is what is needed.
The set is small enough to compare every pair by hand (roughly up to 6-8 items).
Surfacing intransitivity is useful: a cycle is a prompt to re-examine, not a defect to hide.

Fails or misleads when (poor-fit / anti-patterns):

The criteria are nameable and a scale is defensible. A criteria-weighted matrix is then faster and more inspectable; this is think-decision-option-review’s job, and pairwise voting on its criteria is just an elaborate way to fill one weight column. Reaching for standalone pairwise here buys process cost for no new artifact.
The task is to weight criteria for a scoring model. That folds into think-decision-option-review as an optional weight-elicitation; it produces no separate artifact here.
A repeatable formula over named cues is wanted. Fixing a weighted rule applied to many future cases is think-linear-model-aggregation; it needs the cues and scale pairwise comparison refuses.
A number is wanted from a base rate. Anchoring an estimate on a reference class is think-reference-class-forecasting; pairwise comparison orders a fixed set, it does not estimate a quantity.
The item count is large. A full set is n(n-1)/2 judgments - 45 for ten, 120 for sixteen - which collapses under its own combinatorics without adaptive or incomplete-design tooling a markdown-only agent cannot run by hand.
The matrix is treated as objective output. A passing consistency check does not make a manufactured preference correct, and a near-duplicate option added to the set can flip the others’ ranking (rank reversal) - a structural artifact, not a judgment error. Pairwise comparison launders subjective inputs into a clean-looking scale; the cleanliness is presentational.

6. Distinctness (why this is a Build, and the wall that earns it)

The verdict is Build, narrowly, at tier P - scoped to the less obvious of the two readings. The Build survives on a wall that one specific shipped skill draws for it; the more obvious reading (criteria weighting) folds.

The closest shipped skill is think-decision-option-review (the criteria-weighted option matrix, which the registry records as having absorbed Multi-Criteria Decision Analysis).

The criteria-weighting reading folds into think-decision-option-review. “Compare two criteria at a time to set their relative importance” (AHP, PAPRIKA) is exactly that skill’s step “define the criteria that actually matter, and weight them,” performed with a more elaborate elicitation. It produces no new artifact: it fills the weight vector of a matrix think-decision-option-review already owns, and that skill already absorbed MCDA, of which AHP is a member. This reading is not a separable mechanism; it belongs inside think-decision-option-review as an optional weighting technique.
The rank-without-a-scale reading clears the wall, and it is think-decision-option-review that draws it. That skill’s own “When NOT to use” excludes the case “when the criteria genuinely cannot be articulated,” and its procedure requires an absolute scale (“score each option against each criterion … say what a high score means”). Pairwise comparison’s distinct move is precisely the disclaimed case: order items when you cannot name criteria and cannot defend an absolute score, by eliciting only binary “A beats B” judgments and deriving the scale (and its consistency check) from the matrix. There is no criteria axis and no absolute scoring - the mechanism is different in kind, not degree.

No other shipped skill produces it either:

think-linear-model-aggregation fixes a formula over named cues for a repeated prediction - it needs the cues and the scale pairwise comparison refuses, and it builds a reusable model, not a one-off ranking.
think-reference-class-forecasting anchors a number on a base rate (an outside-view quantity), not an order derived from internal head-to-head votes.

So the move that earns a place is narrow: rank when you cannot score, by forced binary comparison, emitting a pairwise comparison matrix and the ranking derived from it. That is a real artifact think-decision-option-review explicitly will not produce.

Why Build rather than Fold or Recipe. It is not a clean fold: the one shipped skill it is nearest to defines itself against the exact situation this move owns, so subsuming it would contradict that skill’s stated boundary. It is not a recipe: deriving a scale from a consistency-checked comparison matrix is a single integrated mechanism, not a fixed chain of existing moves. It is Build, but a guarded Build - P-tier, on transferred and contested evidence, scoped to the no-scale case. The learning value of this entry is the discipline it models: a famous, genuinely useful technique whose headline reliability evidence is real but measures an adjacent claim (essay-scaling reliability, not decision quality) and is partly deflated by its own field, and whose most-advertised use (criteria weighting) is already owned. The library documents all of that and ships only the thin, honest remainder.

7. Sources

Louis L. Thurstone, “A Law of Comparative Judgment,” Psychological Review 34(4) (1927): 273-286. Foundational: established that a series of paired comparisons places stimuli on an interval scale (the discriminal-process model). Measures scaling, not decision quality. (M, for scaling - not for decisions)
Alastair Pollitt, “Let’s Stop Marking Exams” (IAEA, 2004) and subsequent comparative-judgement work. Introduced comparative judgement to assessment; reports high scale-separation reliability (commonly >0.80) for ranking open-ended work versus conventional marking. Reliability of a ranking of artifacts, not of a decision. (M, contested - see Bramley)
Tom Bramley, “Investigating the Reliability of Adaptive Comparative Judgment” (Cambridge Assessment, 2015); Bramley and Vitello (2019). Showed by simulation that adaptivity inflates the scale-separation reliability statistic (e.g. reported 0.97 deflating to 0.84; spurious separation on random data). The key contra-evidence on the assessment reading. (M, critical)
Paul A. Bottomley, John R. Doyle and Rodney H. Green, “Testing the Reliability of Weight Elicitation Methods: Direct Rating versus Point Allocation,” Journal of Marketing Research 37(4) (2000): 508-513. The source of the often-quoted 88% vs 74% test-retest figure - but it compares Direct Rating to Point Allocation, NOT pairwise comparison. Cited to show the figure does not measure this move; excluded from the grade. (M, for an adjacent method - excluded)
Thomas L. Saaty, The Analytic Hierarchy Process (McGraw-Hill, 1980). The decision-analytic apparatus: ratio-scaled pairwise comparison matrices, eigenvector weights, consistency ratio. Practitioner / foundational for the weighting reading; empirically criticised for rank reversal and consistency paradoxes. (P)
Paul Hansen and Franz Ombler, “A new method for scoring additive multi-attribute value models using pairwise rankings of alternatives” (PAPRIKA), Journal of Multi-Criteria Decision Analysis (2008). The pairwise-trade-off weight-elicitation method behind 1000minds; the canonical modern criteria-weighting (folds into think-decision-option-review). Practitioner. (P)
Amos Tversky, “Intransitivity of Preferences,” Psychological Review 76(1) (1969): 31-48. The classic case that holistic preference can be intransitive (motivating pairwise methods); subsequently largely failed to replicate (Regenwetter et al. 2011 find cycles vanishingly rare). Cited to show the necessity argument is weaker than its reputation. (M, but substantially non-replicated)
“Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs” (arXiv 2406.07791, 2024). The agent-relevant transfer: pairwise LLM evaluation approximates human preference better than pointwise but suffers position bias requiring order-swapping. About agents evaluating outputs, not deciding; not validation of the decision move. (P, transferred to agents)

Excluded under the evidence rule: the “88% versus 74%” reliability figure is real but belongs to Direct Rating versus Point Allocation (Bottomley, Doyle and Green 2000), not pairwise comparison, and does not move this grade; and no free-floating “pairwise comparison improves decisions by N percent” statistic with a nameable primary source was located. Any such figure is excluded.

Thinking Framework Skills v0.8.0 · 56 frameworks