Dialectical Bootstrapping

A single estimate of a hard quantity is an anchor: the first number that comes to mind quietly fixes the answer, and merely staring at it again does not move it. Dialectical bootstrapping breaks that anchor by simulating a second opinion from inside one head and then harvesting it the way a crowd is harvested - by averaging. The durable move is to poll the inner crowd and force the synthesis to be arithmetic: make a first estimate, deliberately assume it is wrong and generate a second estimate that draws on at least partly different knowledge, then take the plain arithmetic mean of the two numbers as the committed answer. The statistical reason it works is the wisdom of crowds in miniature - averaging two estimates cancels random error, and when the two bracket the truth (one too high, one too low) it eats into systematic error too. The “dialectical” name is literal: thesis (first estimate), antithesis (the contrarian second estimate), synthesis (the average). The output is a dialectical estimate artifact, not prose: the applicability check, both numbered estimates with the assumed-wrong reasoning that produced the second, and the non-negotiable average.

When to Use

A one-off numeric estimate is about to be committed on a genuinely hard question - a date, a percentage, a count, a forecast - where being off matters.
No second human judge is available or consultation is impossible, so the only “second opinion” obtainable is a second pass from the same mind.
No genuine reference class of comparable past cases exists to anchor an outside view, so reference-class forecasting is not an option.
The quantity lives on a familiar or bounded scale (a year, a share, a percentage), where a deliberately different second guess can plausibly land on the other side of the truth.

When NOT to Use

Do not use it on an easy question or one well within competence. The strongest pre-registered evidence on the modern variant found a forced-different second estimate helps on difficult questions and actively HARMS accuracy on easy ones (Van de Calseyde and Efendic, 2025). When the first estimate is already close, the contrarian second mostly adds error that the average then bakes in.
Do not use it on an unbounded, order-of-magnitude unknown. Muller-Trede (2011) found the gains vanish for general numerical questions whose answers range over orders of magnitude. That is think-fermi-estimation’s home regime - decompose the magnitude into factors instead of re-sampling a holistic guess.
Do not use it when a real second judge or real data is available. Your own second opinion is worth about half of someone else’s (Herzog and Hertwig, 2009); and if a genuine reference class exists, think-reference-class-forecasting (the outside view) dominates simulating a crowd from one mind. The method is a fallback, and it must say so.
Do not use it when the error is one shared load-bearing assumption. Averaging two estimates from the same mind cannot remove a bias both estimates share; the inner crowd tops out near the value of only 1.5 independent judges (van Dolder and van den Assem, 2018). If the whole estimate hangs on one assumption, test that assumption instead of averaging over it.
Do not make the average optional. The discipline IS the mechanical average. Left free, most people cherry-pick the estimate they now prefer or extrapolate outside their own two numbers, and the realized gain disappears (Muller-Trede, 2011). The final answer is the mean of the two estimates - never a single number you liked better, never a value outside their range.
Do not use it on a qualitative judgment. The move is defined for quantitative point estimates only. There is no arithmetic mean of two opinions, so there is nothing to average.
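
The exclusions above amount to a routing decision that is made before any number is produced. A minimal Python sketch of that decision, assuming a hypothetical `Question` record and `route` function (neither is part of the skill's files); the returned skill names are the neighbors this document itself mentions, and the order of the checks is an illustrative choice:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record of the facts the exclusions ask about."""
    is_quantitative: bool            # a numeric point estimate, not an opinion
    is_hard: bool                    # genuinely hard, not easy or routine
    bounded_scale: bool              # a year, share, or percentage
    second_judge_available: bool     # a real second person can be consulted
    reference_class_available: bool  # comparable past cases or real data exist
    single_assumption: bool          # the estimate hangs on one assumption


def route(q: Question) -> str:
    """Return the tool that fits, or this skill if every exclusion is cleared."""
    if not q.is_quantitative:
        return "out of scope: nothing to average"
    if not q.is_hard:
        return "no method needed: keep the first estimate"
    if not q.bounded_scale:
        return "think-fermi-estimation"
    if q.second_judge_available:
        return "consult the real second judge"
    if q.reference_class_available:
        return "think-reference-class-forecasting"
    if q.single_assumption:
        return "test the assumption"
    return "think-dialectical-bootstrapping"
```

The function only runs the skill when every exclusion is cleared, which matches the "fallback" framing: dialectical bootstrapping is the last branch, not the default.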

Instructions

When asked to firm up or pressure-test a single numeric estimate, follow these steps:

Run the applicability check first, before touching any number. Confirm all four gates: the question is genuinely hard (not easy or routine); the quantity is a point estimate on a bounded or familiar scale (not an order-of-magnitude unknown); it is a one-off commitment; and no real second judge, no reference class, and no better data are available. If any gate fails, stop and route to the right tool (an easy question needs no method; an unbounded magnitude routes to fermi-estimation; available data routes to reference-class-forecasting; a single load-bearing assumption routes to assumption-testing).
Make the first estimate. State the single best point estimate of the quantity, with its units and the scale it lives on. This is the thesis.
Assume the first estimate is wrong and say why. Deliberately suppose the first number is off the mark. List the assumptions and considerations behind it that could have been mistaken, and what different knowledge a skeptic would bring. The goal is a genuinely different basis, not a token nudge.
Read the direction the doubts imply. From those reasons, judge whether the first estimate was more likely too high or too low. This direction is what gives the second estimate a chance to bracket the truth.
Make the second estimate from the changed perspective. Produce a second point estimate built on the doubts and the implied direction - the antithesis. It should be a real re-estimate from different assumptions, not the first number shaded slightly.
Mechanically average the two estimates. Take the plain arithmetic mean of the first and second estimates. This mean is the committed answer. Do not pick a favorite, do not weight them by which “feels” right, do not land outside the range between them.
Note bracketing and carry the caveat. Record whether the two estimates straddle a plausible truth (one high, one low) - bracketing is where the method earns its keep. Carry the evidence caveat into the artifact: this is an M-tier, human-subjects-validated, modest aid (a few percent error reduction at best), not a guarantee, and a real judge or real data would beat it.
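
The steps above can be sketched as a small record that refuses to exist unless the second estimate follows the stated direction, and that exposes only the mean as the committed answer. This is an illustrative Python sketch, not the skill's implementation; the class name `DialecticalEstimate` and its fields are hypothetical, and the applicability check is assumed to have passed already.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DialecticalEstimate:
    first: float             # thesis: the first point estimate
    doubts: tuple[str, ...]  # reasons the first estimate could be wrong
    direction: str           # "too high" or "too low", read from the doubts
    second: float            # antithesis: re-estimate from the changed basis
    unit: str

    def __post_init__(self) -> None:
        if not self.doubts:
            raise ValueError("state at least one reason the first estimate could be wrong")
        if self.direction not in ("too high", "too low"):
            raise ValueError("direction must be 'too high' or 'too low'")
        # The second estimate must differ and must move the way the doubts point.
        moved_down = self.second < self.first
        if self.second == self.first or moved_down != (self.direction == "too high"):
            raise ValueError("second estimate does not follow the stated direction")

    @property
    def committed(self) -> float:
        """Synthesis: the plain arithmetic mean, always inside the range of the two."""
        return mean([self.first, self.second])
```

There is deliberately no way to set `committed` by hand or to weight the two estimates; the only route to a different answer is a different, justified second estimate.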

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the filled dialectical estimate - the applicability check, the first estimate, the assumed-wrong reasoning and its direction, the second estimate, the mechanical average as the committed answer, and the carried caveat - not a prose argument. The final answer is always the arithmetic mean of the two estimates.

Quality Checklist

Before finalizing, verify:

The applicability check passed all four gates (hard question, bounded-scale point estimate, one-off, no better source), and the artifact records it.
There is a first estimate and a genuinely different second estimate - the second is built on assumed-wrong reasoning and a stated direction, not a token nudge from the first.
The committed answer is the plain arithmetic mean of the two estimates - not a cherry-picked single number and not a value outside their range.
The bracketing note states whether the two estimates straddle a plausible truth.
The artifact routes out rather than runs when a gate fails (easy question, unbounded magnitude, available judge or data, single load-bearing assumption, qualitative judgment).
No overclaiming: the carried caveat states the evidence is M-tier and transferred from human studies, the effect is modest, this is not a guarantee, and a real second judge or real reference class would beat it (see evidence/dossier.md).

Evidence

Tier M (moderate, governing). The record has two layers the honest grade keeps apart. The robust layer: averaging two self-generated estimates beats the first estimate (Vul and Pashler, 2008; the pre-registered replication Steegen et al., 2014, dz = 0.34 to 0.72; van Dolder and van den Assem, 2018, about 1.2 million estimates). The contested layer: how much the deliberate consider-the-opposite instruction adds over a plain second guess - the original experiment measured a 4.1 percentage-point gain at d = 0.53 (Herzog and Hertwig, 2009), but White and Antonakis (2013) found no advantage under a different accuracy measure, and the modern variant’s benefit is difficulty-dependent (helpful on hard questions, harmful on easy ones; Van de Calseyde and Efendic, 2025). M, not S: the effects are modest (about 4 percent error reduction at best), the instruction’s increment is measure- and difficulty-dependent, and the critique exchange is unresolved. All evidence is from human subjects (students, online panels, casino patrons); none validates the procedure performed by an AI agent, which independently caps the grade. The skill ships honestly as an estimate-improvement aid with hard walls, never as a guaranteed accuracy gain. Full grading, sources, and caveats: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed dialectical estimate on a real decision.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Dialectical Estimate - Worked Example

A completed run of the dialectical-bootstrapping skill on a real, consequential numeric estimate. This is the quality bar a generated dialectical estimate should meet.

Uses the shared recurring scenario (Northwind, a B2B SaaS weighing a self-serve free-tier launch) so examples across skills read as one coherent product. Here Northwind must commit a single hard number that feeds the launch’s financial model: the free-to-paid conversion rate. Where think-scenario-planning builds alternative external worlds for the same launch and think-premortem imagines the launch failing, this skill does something narrower and quantitative - it firms up one committed number by polling the inner crowd and averaging. See docs/internal/AUTHORING.md.

The committed answer below is the plain arithmetic mean of the two estimates. It is not the number that “felt right,” and it does not sit outside the range of the two. The discipline is the average.

Applicability check (run this first, before any number)

Northwind has never run a free tier, so there is no internal history. The PM needs a single conversion-rate number for the board financial model by Friday, and the analyst who could give an independent read is on leave. The four gates:

| Gate | Pass? | Note |
| --- | --- | --- |
| The question is genuinely hard | yes | No prior free tier; conversion is sensitive to product, pricing, and segment - genuinely uncertain |
| Point estimate on a bounded / familiar scale | yes | Free-to-paid conversion is a percentage, bounded 0-100, and realistically in a low single-digit band |
| One-off commitment | yes | One number is going into the model now; this is not a repeated weekly forecast off fixed cues |
| No second judge, reference class, or better data | yes | Analyst on leave; no internal history; published SaaS benchmarks are too heterogeneous to be a real reference class for Northwind’s exact motion |

Also stop if the estimate hangs on one assumption (it does not - several independent factors move it) or it is qualitative (it is not - it is a percentage).

Applicability verdict: proceed. (Had the quantity been “roughly how big could the free-tier user base get?” - an unbounded order-of-magnitude unknown - this would route to think-fermi-estimation instead. Had a clean cohort of truly comparable launches existed, it would route to think-reference-class-forecasting.)

The quantity

What is being estimated: the rate at which self-serve free-tier signups convert to a paid plan within 12 months of the free tier launching.
Units and scale: a percentage, 0-100, expected in the low single digits.

Estimate 1 (thesis)

First estimate: 4.0% free-to-paid within 12 months.
Basis: Northwind’s product solves a real, recurring pain, and current sales-led trials convert at a healthy clip. The PM mentally anchors on that trial-conversion intuition and the optimistic case the launch deck was built around - a generous free tier that showcases the core value should pull a meaningful slice of free users into paying.

Assume it is wrong - and why (the antithesis reasoning)

Suppose 4.0% is off the mark. Where could the basis be wrong, and what would a skeptic bring?
- The anchor is a sales-qualified-trial conversion rate. Free-tier signups are unqualified, top-of-funnel, and self-selected for “free” - a very different and much colder population than a sales-led trial.
- A generous free tier risks being good enough to keep many users from ever needing to pay (the free-tier cannibalization effect), which depresses conversion rather than lifting it.
- Industry experience puts self-serve freemium conversion in the 1-5% band for most products, with broad horizontal tools at the low end - the optimistic deck number ignored that gravity.
- The 12-month window includes a long tail of dormant signups that inflate the denominator without ever converting.
Direction the doubts imply: the first estimate was more likely too high, because every correction above pulls conversion down from a warm-trial intuition toward a cold-signup reality.

Estimate 2 (antithesis)

Second estimate: 2.0% free-to-paid within 12 months.
Basis: If free signups are treated as a cold, self-selected, partly dormant population, and a generous tier is assumed to satisfy a real share of users without payment, then a low-single-digit rate near the middle-low of the freemium band is the realistic read for a first launch with no optimization yet.

The synthesis (mechanical average)

Committed answer = ( 4.0% + 2.0% ) / 2 = 3.0% free-to-paid within 12 months.
Bracketing note: the two estimates do straddle a plausible truth - 4.0% is the warm-anchor optimistic read, 2.0% is the cold-population skeptical read, and the realized rate for a first launch most plausibly sits between them. Because they bracket, the average is doing the work it is meant to do: it cancels part of the optimism baked into the anchor without swinging all the way to the pessimistic floor. The 3.0% goes into the model, not the 4.0% the deck wanted and not the 2.0% the skeptic argued.
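
The arithmetic of the synthesis, and what bracketing buys, can be checked in a few lines. This is an illustrative sketch: the two estimates are the ones above, but the "realized" rates in the loop are hypothetical values chosen for illustration, not Northwind results.

```python
first, second = 4.0, 2.0            # percent, Estimates 1 and 2 above
committed = (first + second) / 2    # the mechanical average


def abs_errors(truth: float) -> tuple[float, float, float]:
    """Absolute error of the first, second, and averaged estimates."""
    return abs(first - truth), abs(second - truth), abs(committed - truth)


# Hypothetical realized rates, for illustration only.
for truth in (2.5, 3.5, 1.5):
    print(truth, abs_errors(truth))
```

For the two hypothetical truths that fall between the estimates (2.5 and 3.5), the average's error is smaller than the mean of the two individual errors. For the one that falls outside the range (1.5), the average's error exactly equals that mean, so averaging neither helps nor hurts relative to picking one of the two at random.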

Evidence caveat (carried into every artifact - do not delete)

This dialectical estimate is an M-tier (moderate) aid. The evidence is transferred from human-subjects studies (students, online panels, casino patrons; Herzog and Hertwig 2009 and the crowd-within line); none of it validates the procedure performed by an AI agent. The effect is modest - a few percent error reduction at best when it applies - and it is not a guarantee: in the original study roughly a quarter of individuals ended up worse off. A real second judge, a real reference class, or real data would beat this, and it does not apply to easy questions or unbounded order-of-magnitude unknowns. Treat the 3.0% as a better-anchored single number for the model, not as a validated forecast. If Northwind can run even a small private beta and measure actual conversion, that real data should replace this estimate immediately. See evidence/dossier.md.

Note how this differs from its neighbors on the same Northwind launch. think-scenario-planning builds several uncontrollable external futures and asks which moves survive them; think-premortem assumes the launch failed and reasons back to causes; think-fermi-estimation would decompose an unbounded magnitude (how many signups could there ever be?) into multiplied factors. This skill takes one hard, bounded number that has to be committed now, generates a deliberately contrarian second read of it, and averages - the durable move no other skill performs is the averaged pair of self-generated estimates.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Dialectical Bootstrapping

The single source of truth for the dialectical-bootstrapping skill. The SKILL.md, the sidecar (skill.meta.yml), and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill. Promoted from the vetted proposed-build dossier and admitted as a Build at tier M (the governing grade carried unchanged from the NAME-mode vetting of 2026-06-11).


| Field | Value |
| --- | --- |
| Skill | `thinking-framework-skills.dialectical-bootstrapping` (installable name `think-dialectical-bootstrapping`) |
| Family | decision-and-option-evaluation |
| Evidence tier | M governing (moderate; honest split kept explicit in “What the evidence shows”) |
| Confidence | Moderate that averaging a deliberately contrarian second self-estimate beats sticking with the first on a hard, bounded-scale, one-off numeric question; low that any specific gain transfers to an AI agent |
| Status | cand (verdict Build; the skill ships, the registry entry stays cand until shipped) |

1. The mechanism (what actually does the work)

Dialectical bootstrapping simulates a second opinion from inside one head and then harvests it the way a crowd is harvested: by averaging. The procedure, as Herzog and Hertwig (2009) ran it, has four steps. First, make your estimate of the quantity. Second, assume that estimate is off the mark and articulate why - which assumptions and considerations behind it could have been wrong. Third, read the direction those doubts imply (was the first estimate probably too high or too low?) and make a second, alternative estimate from that changed perspective. Fourth, take the arithmetic average of the two numbers as the answer. The “dialectical” gloss is the thesis (first estimate), the antithesis (the deliberately contrarian second estimate), and the synthesis (the average); the synthesis step is literally arithmetic, which is what separates this method from every qualitative use of the word dialectical.

The statistical rationale is the same one that powers the wisdom of crowds. A quantitative estimate decomposes into truth plus random error plus systematic error. Averaging two estimates cancels random error, and when the two estimates bracket the true value (one too high, one too low) it also eats into systematic error. Herzog and Hertwig’s gain-range analysis makes the tolerance explicit: if the two estimates bracket the truth, the second estimate can err almost three times as badly as the first and the average still beats the first estimate; without bracketing, averaging merely matches the expected accuracy of a random choice between the two. The consider-the-opposite-style instruction exists to raise the bracketing rate: it pushes the second estimate to draw on at least partly different knowledge, so that its error has a chance of carrying the opposite sign. The underlying crowd-within effect (that even an uninstructed second guess, averaged with the first, beats the first) is due to Vul and Pashler (2008).
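
The "almost three times" tolerance can be reconstructed with a line of algebra. The sketch below assumes accuracy is measured by absolute error and takes the case where the first estimate is too high; the symbols ($\theta$ for the truth, $e_1$ and $e_2$ for the sizes of the two errors) are introduced here for illustration and are not the paper's notation.

```latex
% Bracketing: x_1 = \theta + e_1 (too high), x_2 = \theta - e_2 (too low), with e_1, e_2 > 0.
\[
\left|\frac{x_1 + x_2}{2} - \theta\right| = \frac{|e_1 - e_2|}{2} < e_1
\iff -2e_1 < e_1 - e_2 < 2e_1
\iff e_2 < 3e_1
\]
% No bracketing: x_1 = \theta + e_1, x_2 = \theta + e_2 (errors of the same sign).
\[
\left|\frac{x_1 + x_2}{2} - \theta\right| = \frac{e_1 + e_2}{2}
= \tfrac{1}{2}\,|x_1 - \theta| + \tfrac{1}{2}\,|x_2 - \theta|
\]
```

In the bracketing case the average beats the first estimate whenever the second estimate's error is less than three times the first's. In the non-bracketing case the average's error is exactly the expected error of choosing one of the two estimates at random. The mirrored case (first too low, second too high) follows by symmetry.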

The durable cognitive move is to poll the inner crowd and force the synthesis to be arithmetic - produce a second, deliberately contrarian point estimate by assuming the first is wrong, then mechanically average the two numbers. The move is defined for quantitative point estimates only: dates, percentages, counts, forecasts expressed as numbers. It produces a small, checkable artifact: the first estimate, the assumed-wrong reasons with their directional reading, the second estimate, and the mechanical average, plus an applicability check before any of it runs.

2. Lineage

The method is Stefan M. Herzog and Ralph Hertwig’s - then at the University of Basel, later at the Max Planck Institute for Human Development’s Center for Adaptive Rationality - named in their 2009 Psychological Science paper. The “crowd within” base effect is Edward Vul and Harold Pashler’s (2008). The second-step elicitation descends from the consider-the-opposite debiasing line: Lord, Lepper and Preston (1984), Koriat, Lichtenstein and Fischhoff (1980), Hirt and Markman (1995), and Soll and Klayman (2004), whose two-step interval elicitation Herzog and Hertwig cite as the same nonredundant-knowledge mechanism.

“Dialectical bootstrapping” is an academic descriptive coinage - no trademark, no vendor. Attribution to Herzog and Hertwig (2009), with the crowd-within base effect credited to Vul and Pashler (2008), suffices. The skill ships documented descriptively with the lineage credited here rather than branded.

3. What the evidence shows, and what it does NOT show

Governing grade: M (moderate). The record has a robust layer and a contested layer, and the honest grade keeps them apart.

The robust layer: averaging two self-generated estimates beats the first estimate. Vul and Pashler (2008; N = 428) established the crowd-within effect: a second guess averaged with the first improves accuracy, modestly when immediate and substantially more after a three-week delay. Steegen, Dewitte, Tuerlinckx and Vanpaemel (2014) replicated it in a high-powered pre-registered study (N = 471 immediate, 140 delayed; effect sizes dz = 0.34 to 0.72), and the replication exchange surfaced that the original data had, if anything, understated the effect. van Dolder and van den Assem (2018) confirmed it at scale in three natural experiments (about 1.2 million casino-contest estimates): the effect is real, grows with delay and with more estimates, and is structurally modest - the within-person crowd converges on the value of roughly 1.5 independent judges, far short of a real crowd.

The contested layer: how much the dialectical instruction adds over a plain second guess. Herzog and Hertwig (2009; N = 101, 40 date-estimation items) measured a 4.1 percentage-point accuracy gain for averaging first and dialectical estimates against 0.3 points for averaging two uninstructed estimates - their own phrase is “an order of magnitude” - with d = 0.53 (medium) against d = 0.12, bracketing rates of 13.6 percent against 7.9 percent, and an incentive-scheme confound ruled out by a follow-up control (0.7 points, d = 0.20). White and Antonakis (2013) reanalyzed the same data with an accuracy-change measure decoupled from the proportion of identical responses and found no difference between the dialectical and reliability conditions; Herzog and Hertwig’s (2013) reply concedes that, for this elicitation technique, “the advantage of dialectical bootstrapping depends on the accuracy measure,” while defending the item-level robust measure. The modern variant (make the second estimate from the perspective of someone who often disagrees with you) beat a plain second guess in five pre-registered experiments (Van de Calseyde and Efendic 2022; N = 6,425), but Fiechter (2024) showed part of that evidence arises from anticonservative random-intercept models, and the authors’ own pre-registered follow-up (Van de Calseyde and Efendic 2025; N = 2,884) localized the effect: beneficial for difficult questions, harmful for easy ones.

What M covers and what it does not. M covers the package the skill would actually run: on a hard, bounded-scale, one-off quantitative question with no better source available, generating a deliberately different second estimate and mechanically averaging improves the committed number relative to sticking with the first estimate - supported by the original experiment, by the robust crowd-within floor underneath it (at worst, a dialectical second estimate is a second estimate), and by pre-registered work on the variant. M does NOT cover: easy questions (evidence of harm), unbounded order-of-magnitude quantities (evidence of no gain), any claim of matching a real crowd or a real second judge (evidence of clear inferiority), qualitative judgments (out of scope by construction), or the strong reading that the consider-the-opposite instruction itself is indispensable (measure-dependent; the instruction’s increment is the contested part). Not S because the effects are modest (about 4 percent error reduction at best), the increment over a plain second guess is measure- and difficulty-dependent, and the published critique exchange is unresolved on the instruction’s surplus value.

4. Transferred-evidence flag (required honesty for this library)

Every result above is from human subjects - students, online panels, casino patrons. Nothing here validates the procedure performed by or with an AI agent. The adjacent machine result, self-consistency sampling (sample several reasoning paths from one model and aggregate; Wang and colleagues 2022), shows within-one-mind aggregation transfers in spirit to language models, but it tests majority-vote over sampled reasoning chains, not the dialectical instruction or numeric averaging, and is excluded from the grade. The evidence is transferred from human contexts and not validated for AI-augmented use. The AI value is mechanical and modest: an agent makes the method cheap to run, forces the discipline (the applicability check, a genuinely different second estimate, a non-negotiable arithmetic average), and produces a durable, inspectable artifact - benefits that do not depend on any contested outcome claim. The skill ships honestly as an M-tier estimate-improvement aid with hard walls, never as a guaranteed accuracy gain.

5. When it works / when it fails (drives the eval negative cases and “When NOT to Use”)

Works best when:

A one-off numeric estimate is about to be committed on a hard question, no second judge is available (or consultation is impossible), no genuine reference class of comparable past cases exists, and the quantity lives on a familiar or bounded scale (a year, a percentage, a share). That is exactly the regime the experiments cover, and the regime where a single anchored number is most dangerous.

Fails or misleads when (poor-fit / anti-patterns):

The question is easy or well within competence. The strongest pre-registered evidence on the modern variant found that a disagreeing-perspective second estimate helps on difficult questions and actively harms accuracy on easy ones (Van de Calseyde and Efendic 2025): when the first estimate is already close, a forced-different second estimate mostly adds error that averaging then bakes in.
The quantity is an unbounded, order-of-magnitude unknown. Muller-Trede (2011) reproduced the gains for year and percentage questions and found no gains for general numerical questions whose answers range over orders of magnitude. That failure zone is precisely fermi-estimation’s home regime: decompose the magnitude into factors instead of re-sampling a holistic guess.
A real second judge or real data is available. In the original study, averaging with a random other person’s estimate gained 7.1 percentage points against 4.1 for the dialectical estimate - the own second opinion is worth about half of someone else’s (Herzog and Hertwig 2009). If a genuine reference class exists, the outside view (reference-class forecasting) dominates simulating a crowd from one mind.
The error is one shared systematic bias. Averaging two estimates from the same mind cannot remove the bias both estimates share; at the limit, an infinite inner crowd is worth only about 1.5 independent judges (van Dolder and van den Assem 2018). If the whole estimate hangs on one load-bearing assumption, test the assumption, do not average over it.
The average is treated as optional. The discipline IS the mechanical average. Left to choose, only about 10 percent of judges average consistently; over 30 percent of final answers extrapolate outside the range of their own two estimates, and realized gains vanish even where potential gains were real (Muller-Trede 2011). Cherry-picking the estimate now preferred forfeits the effect. Likewise do not expect crowd-scale gains from piling on more self-estimates: returns diminish sharply after the second (Rauhut and Lorenz 2011), and in the original data roughly a quarter of individuals ended up worse off (Herzog and Hertwig 2009) - the gain is an expectation, not a guarantee.
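
Why a shared bias survives averaging, and why returns fall off after the second estimate, can be seen in a deliberately simple error model. The model below is an assumption made for illustration and is not taken from the cited studies: each self-estimate equals the truth plus one bias shared by all of that person's estimates plus independent zero-mean noise. Real repeated estimates from one mind are likely to be less independent than this, so the sketch is an optimistic bound. The function name and parameters are hypothetical.

```python
def mse_of_inner_average(shared_bias: float, noise_sd: float, n: int) -> float:
    """Expected squared error of the mean of n estimates from one mind.

    Assumed model: estimate = truth + shared_bias + independent noise.
    Averaging divides the noise variance by n but leaves the bias untouched.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return shared_bias ** 2 + noise_sd ** 2 / n


# Hypothetical values: bias of 3 units, noise standard deviation of 4 units.
for n in (1, 2, 3, 10):
    print(n, mse_of_inner_average(3.0, 4.0, n))
```

With these hypothetical values the expected squared error falls from 25 to 17 when a second estimate is added, and then by progressively smaller steps. It can never fall below 9, the squared shared bias, however many self-estimates are piled on.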

6. Distinctness (why it is a skill here, not an existing one)

The distinct durable move: poll the inner crowd and force the synthesis to be arithmetic - produce a second, deliberately contrarian point estimate by assuming the first is wrong, then mechanically average the two numbers. No shipped skill emits an averaged pair of self-generated estimates; the averaging tail is owned by nobody. The walls, neighbor by neighbor:

fermi-estimation (shipped, M; medium-low overlap). Fermi builds one number from multiplicative decomposition into factor estimates; dialectical bootstrapping never decomposes - it re-samples the whole quantity under changed assumptions and averages. The boundary is empirical, not rhetorical: Muller-Trede’s no-gain question type (unbounded general numerical magnitudes) is exactly the fermi regime, so each method’s documented failure zone is the other’s home turf.
reference-class-forecasting (shipped, S; low overlap). RCF substitutes an outside view built from real comparable cases. Dialectical bootstrapping is the fallback for when no reference class and no second judge exists; where real data exists, RCF dominates and the skill must say so.
linear-model-aggregation (shipped, S; low overlap). LMA mechanically combines named predictive cues with fixed weights for a repeated judgment. This method mechanically combines two holistic self-estimates for a one-off quantity. They share only the Dawes-lineage principle that mechanical aggregation beats holistic adjustment - different inputs, different setting, different artifact.
red-team-light (shipped, P; medium overlap, the lineage cousin). Step two of the bootstrap (“assume it is wrong, list why”) is consider-the-opposite, the same debiasing family red-team-light operationalizes. But red-team-light takes a claim or plan, constructs the strongest objections, and adjudicates which land; dialectical bootstrapping uses the reasons only to read a direction (too high or too low), then emits a second number and an average. No objection-adjudication, no critique artifact; conversely red-team-light has no numeric tail. The shared machinery is the reasons-generation step alone.
decision-journal (shipped, P; low overlap). The journal records prediction and confidence now to score calibration across decisions later. No second estimate, no within-task improvement, no averaging.
estimate-talk-estimate (rejected candidate; the facilitation wall). Delphi is a multi-judge group protocol whose value is social governance an agent cannot reproduce. Dialectical bootstrapping is the explicitly solo, within-one-mind analog - which is exactly why it is agent-executable where the group protocol is not. The boundary cuts the other way too: the moment a real second judge exists, the dyadic average is worth about twice the dialectical one (Herzog and Hertwig 2009), so the skill routes multi-judge situations out rather than imitating them.

No recipe reconstructs the move: no chain of shipped skills produces a second self-estimate with a deliberately opposite error sign and then takes the mean. The strongest argument that this must ship as an enforced procedure rather than advice is in the evidence itself: Muller-Trede’s judges captured almost none of the available gain because, left free, they would not average - the artifact’s value is that the averaging step is mechanical and non-negotiable, with the applicability check (hard question? bounded scale? no better source?) run before the first estimate is touched.

7. Output artifact

The skill must emit a dialectical estimate, not prose: an applicability check (is this a hard, bounded-scale, one-off numeric question with no better source?); the first estimate; the assumed-wrong reasons with the direction each implies; the second estimate from the changed perspective; the mechanical arithmetic average as the committed answer; and a bracketing note plus the carried evidence caveat. The average is non-negotiable - the final answer is the mean of the two estimates, never a cherry-picked single number and never a value outside the range of the two.

8. Sources

Stefan M. Herzog and Ralph Hertwig, “The Wisdom of Many in One Mind: Improving Individual Judgments With Dialectical Bootstrapping,” Psychological Science 20(2) (2009): 231-237. The founding experiment (N = 101, date estimation): dialectical averaging gained 4.1 percentage points (d = 0.53) against 0.3 for uninstructed repetition (d = 0.12) and 7.1 for a real second person (d = 0.86); bracketing 13.6 vs 7.9 percent; 72 percent of participants benefited, 24 percent got worse. (M)
Edward Vul and Harold Pashler, “Measuring the Crowd Within: Probabilistic Representations Within Individuals,” Psychological Science 19(7) (2008): 645-647. The base crowd-within effect (N = 428): averaging two own guesses beats the first; a three-week delay substantially increases the benefit. (M, for the floor)
Sara Steegen, Laura Dewitte, Francis Tuerlinckx and Wolf Vanpaemel, “Measuring the crowd within again: a pre-registered replication study,” Frontiers in Psychology 5:786 (2014). High-powered pre-registered replication of Vul and Pashler (N = 471 immediate, 140 delayed): effect confirmed, dz = 0.34 to 0.72. Replicates the crowd-within floor, not the dialectical instruction. (M, for the floor)
Chris M. White and John Antonakis, “Quantifying Accuracy Improvement in Sets of Pooled Judgments: Does Dialectical Bootstrapping Work?,” Psychological Science 24(1) (2013): 115-116. Reanalysis of the 2009 data: with a measure decoupled from identical-response rates, no dialectical advantage over the reliability condition. The decisive caution on the instruction’s surplus value. (critical literature)
Stefan M. Herzog and Ralph Hertwig, “The Crowd Within and the Benefits of Dialectical Bootstrapping: A Reply to White and Antonakis (2013),” Psychological Science 24(1) (2013): 117-119. Concedes the advantage “depends on the accuracy measure” for this technique; defends the framework beyond consider-the-opposite. (reply)
Johannes Muller-Trede, “Repeated judgment sampling: Boundaries,” Judgment and Decision Making 6(4) (2011): 283-294. Gains replicate for year and percentage questions, none for general numerical magnitudes; only about 10 percent of judges average voluntarily and realized gains vanish - the case for mechanical enforcement. (M, boundary study)
Dennie van Dolder and Martijn J. van den Assem, “The wisdom of the inner crowd in three large natural experiments,” Nature Human Behaviour 2 (2018): 21-26. About 1.2 million estimates: the inner crowd is real, grows with delay, and tops out near the value of 1.5 independent judges. (M, scale and ceiling)
Philippe P. F. M. Van de Calseyde and Emir Efendic, “Taking a Disagreeing Perspective Improves the Accuracy of People’s Quantitative Estimates,” Psychological Science 33(6) (2022): 971-983. Five pre-registered experiments (N = 6,425): the disagreeing-peer second estimate beat a plain second guess. Qualified by Fiechter (2024). (M, with the 2024 qualification)
Joshua L. Fiechter, “Drawing Generalizable Conclusions From Multilevel Models: Commentary on Van de Calseyde and Efendic (2022),” Psychological Science (2024). Shows the 2022 evidence partly arises from anticonservative random-intercept models. (critical literature)
Philippe P. F. M. Van de Calseyde and Emir Efendic, “Disagreeing Perspectives Enhance Inner-Crowd Wisdom for Difficult (but Not Easy) Questions,” Psychological Science 36(3) (2025): 147-156. Three pre-registered experiments (N = 2,884): the difficulty moderation - beneficial on hard questions, harmful on easy ones. The load-bearing when-NOT wall. (M)
Hauke Rauhut and Jan Lorenz, “The wisdom of crowds in one mind,” Journal of Mathematical Psychology 55 (2011): 191-197. Five repeated estimates: gains confirmed for most questions, with sharply diminishing returns after the second estimate. (supporting)
Stefan M. Herzog and Ralph Hertwig, “Harnessing the wisdom of the inner crowd,” Trends in Cognitive Sciences 18(10) (2014): 504-506. The program review; cited as a reader pointer, no effects drawn from it. (reference)

Excluded under the evidence rule: nothing. Every effect in this dossier has a named primary source. The “worth half a second opinion” and “order of magnitude” phrasings are Herzog and Hertwig’s own; the 1.5-judges limit is van Dolder and van den Assem’s own summary statistic of their three natural experiments. The LLM self-consistency analog (Wang and colleagues 2022) is noted as agent-adjacent context and contributes nothing to the grade.
