
Estimate-talk-estimate (Delphi)

Status: Documented, not shipped · Evidence: P · Family: Facilitation and group structures · Verdict: reject (2026-06-11)

Estimate-talk-estimate is the working core of the Delphi family: a protocol for getting a number (or a probability) out of several judges without letting the loudest judge write it. Each panelist first estimates privately and independently. The facilitator then feeds back only the anonymized spread of estimates, often with the reasons behind the outliers. The panel discusses the variance - why the high estimates are high and the low ones low - and then every panelist re-estimates privately. The final answer is a statistical aggregate (typically the median or mean) of the second-round estimates, not a negotiated consensus number that anyone had to capitulate to in public.
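The bookkeeping around the talk phase is small enough to sketch. The following is a minimal illustration, not a reference implementation: the function names, the panelist labels, and every number are hypothetical, and the median is used only because the text names it as a typical aggregate.

```python
from statistics import median, quantiles

def anonymized_spread(estimates):
    """Summarize first-round estimates as a five-number spread, with no names attached."""
    values = sorted(estimates.values())
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"low": values[0], "q1": q1, "median": q2, "q3": q3, "high": values[-1]}

def aggregate(estimates):
    """Final answer: a statistical aggregate of the private second-round estimates."""
    return median(estimates.values())

# Hypothetical panel; the labels exist only so each judge's two rounds can be paired.
round_one = {"A": 120, "B": 150, "C": 400, "D": 160, "E": 140}
feedback = anonymized_spread(round_one)  # the only thing shared before the talk phase

# The talk phase happens here, among humans; it is not something this code models.
round_two = {"A": 140, "B": 160, "C": 260, "D": 170, "E": 150}
final = aggregate(round_two)
```

Note that `feedback` carries no panelist labels, and `final` is computed from private revisions rather than from anything agreed aloud in the room.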

Three load-bearing ingredients make it work, and all three are social: independence first (the initial estimates are formed before anyone speaks, so they are not anchored on a first voice), anonymity or controlled feedback (the spread is shared without names, so status and seniority cannot pull the number), and statistical aggregation (no one has to publicly back down for the group to land on an answer).

The family has three main variants. Classic RAND Delphi (Dalkey and Helmer, 1963) removes the talk entirely: iterated written questionnaires with controlled feedback over multiple rounds, originally so Cold War experts could converge without meeting. Estimate-talk-estimate proper (the name comes from Gustafson, Shukla, Delbecq and Walster, 1973) inserts one structured face-to-face discussion between the two private estimate rounds - and in their head-to-head comparison it was the talk, not the paper feedback, that carried the accuracy gain. The modern descendant is the IDEA protocol (Investigate, Discuss, Estimate, Aggregate; Hanea and colleagues, 2017; Hemming and colleagues, 2018), which brackets a discussion phase with two private estimate rounds and aggregates mathematically, developed for structured expert elicitation in ecology and geopolitical forecasting.

The durable cognitive move, named plainly: independence-preserving iteration. Keep the judges’ first estimates uncontaminated, expose everyone to the distribution of disagreement and its rationales, and let each judge revise privately, so the aggregate can harvest the information in the spread without paying the conformity tax of an open meeting.

It helps when several genuine experts hold partially non-overlapping knowledge about a question with no lookup-able answer and no usable base rate - demand forecasts, risk likelihoods, dose thresholds, time-to-event questions. It especially helps when the room is politically loaded: when a HIPPO (highest-paid person’s opinion), a dominant personality, or an entrenched faction would otherwise anchor the open discussion, the private-first/anonymous-feedback structure is the point. The evidence pattern says the gains come from two places: plain statistical aggregation of independent judgments (most of the benefit), and a structured talk phase that resolves linguistic ambiguity about what the question means and pools the reasons behind outlier views (the increment Gustafson and colleagues, 1973, and the IDEA results attribute to discussion).

It misleads in predictable ways:

  1. Convergence is not accuracy. Estimates reliably converge over rounds, and that convergence feels like validation. Woudenberg’s (1991) critical review concluded that consensus in Delphi studies is achieved mainly by group pressure to conformity, mediated by the fed-back statistical response - the panel huddles toward the median whether or not the median is right.
  2. Shared bias is immune to aggregation. If the panelists draw on the same information sources, their errors are correlated and no amount of independent estimating fixes the common blind spot. Expert selection dominates protocol mechanics.
  3. Social feedback can subtract value. Lorenz, Rauhut, Schweitzer and Helbing (2011, N = 144) showed that even mild social influence on estimation tasks narrows the diversity of estimates without improving group accuracy, undermining the crowd’s wisdom while increasing the group’s confidence. Whether the talk phase adds (Navajas and colleagues, 2018; the IDEA results) or subtracts (Lorenz) appears to depend on structure: deliberation inside a protected protocol helps; unstructured influence herds.
  4. The ritual can be a legitimacy laundry. A “Delphi panel” run to dress a predetermined number in expert-consensus clothing is a known failure mode of the method’s bureaucratic popularity.
  5. For an AI agent, the preconditions are absent. The method’s value is manufactured from genuinely independent knowledge bases and anonymity that defuses real social pressure. An agent that simulates a panel by spawning sub-agents samples correlated draws from one model: there are no independent knowledge bases to aggregate and no social pressure to defuse. The simulation does not merely lose the value - it counterfeits it, presenting one model’s prior, restated several ways, as independent expert convergence.
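Points 2 and 5 can be made concrete with a standard textbook identity, offered here as an illustration rather than as anything taken from the cited studies. Assume, as a simplification, that every judge's error has the same standard deviation `sigma` and that every pair of judges shares the same error correlation `rho`. The variance of the equal-weight mean is then `sigma**2 * (1/n + (1 - 1/n) * rho)`, which falls toward `rho * sigma**2` rather than toward zero as the panel grows.

```python
def variance_of_mean(n, sigma, rho):
    """Variance of the equal-weight mean of n estimates whose errors share
    standard deviation sigma and pairwise correlation rho (assumed uniform)."""
    return sigma**2 * (1 / n + (1 - 1 / n) * rho)

# Illustrative values only; the rho values are not measurements of any real panel.
for rho in (0.0, 0.3, 0.9):
    row = [round(variance_of_mean(n, 1.0, rho), 3) for n in (1, 5, 50)]
    print(f"rho={rho}: n=1,5,50 -> {row}")
```

With independent errors, adding judges keeps shrinking the variance. With highly correlated errors, which is the plausible case for several draws from one model, fifty "panelists" are barely better than one.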

Honest grade: P - practitioner; governing tier capped from a mixed M/P read because the evidence is doubly transferred. Controlled comparisons exist directly on the mechanism and lean positive (Gustafson and colleagues 1973; Rowe and Wright’s 12-to-2 tally; the IDEA results), but a credible critical line finds the advantage over simple averaging small, absent, or reversed (Woudenberg 1991; Graefe and Armstrong 2011; Lorenz and colleagues 2011), and heterogeneity across lab “Delphis” is large. That is a genuine M/P split on the human-panel claim alone. Two transfers then force the conservative half. First, the strongest positive finding - that aggregating independent estimates wins - is the mechanical-aggregation / wisdom-of-crowds effect, which is the move shipped here as linear-model-aggregation, not ETE’s signature talk-and-revise step; crediting ETE with that robustness is adjacent-claim laundering. Second, the entire body of evidence comes from human panels, none of it agent-validated, and (per the verdict below) an agent cannot even instantiate the panel preconditions. Under the library’s rule that a split or transferred read emits the lower grade, the governing tier is P, with the full split stated. The registry preliminary and a prior draft of this dossier carried M; this run overturns that downward to P. (For comparison, the published facilitation-wall reject note-and-vote grades the same class of famous-but-transferred group protocol at P.)

  • Dalkey and Helmer (1963, Management Science 9(3): 458-467). The original RAND publication of the 1951 Project Delphi experiment: iterated anonymous questionnaires with controlled feedback to elicit expert consensus. Foundational demonstration, not a controlled comparison.
  • Gustafson, Shukla, Delbecq and Walster (1973, Organizational Behavior and Human Performance 9(2): 280-291). The study that names this entry. Compared four elicitation structures on subjective likelihood estimation: individuals, talk-estimate (interacting group), estimate-feedback-estimate (a Delphi approximation), and estimate-talk-estimate. ETE was the most accurate; the feedback-only Delphi approximation performed about as well as solo individuals. Direct controlled evidence on the actual move, and evidence that the talk phase, not the paper feedback, carries the increment.
  • Van de Ven and Delbecq (1974, Academy of Management Journal 17(4)). Found nominal-group and Delphi processes more effective than conventional interacting discussion groups. Same research program; converging support for structure over open discussion.
  • Rowe and Wright (1999, International Journal of Forecasting 15(4): 353-375). The standard evaluative review: Delphi groups beat first-round staticized groups (the simple average of pre-iteration estimates) by a tally of 12 studies to 2, and tend to beat unstructured interacting groups. The same review documents how heterogeneous and unrepresentative the lab Delphis are (students, almanac questions, few rounds), which caps the human-panel read at M.
  • Woudenberg (1991, Technological Forecasting and Social Change 40(2): 131-150). The critical counterweight: found no evidence that Delphi is more accurate than other judgment methods, and attributed round-to-round consensus mainly to conformity pressure rather than information dissemination.
  • Graefe and Armstrong (2011, International Journal of Forecasting 27(1): 183-195). Lab comparison with 227 participants across face-to-face meetings, nominal groups, Delphi, and prediction markets on estimation tasks: no statistically significant overall accuracy differences among the four; the structured methods beat participants’ prior individual estimates but added little over a simple average of forecasts.
  • Hanea, McBride, Burgman, Wintle and colleagues (2017, International Journal of Forecasting 33(1): 267-279) and Hemming, Burgman, Hanea, McBride and Wintle (2018, Methods in Ecology and Evolution 9: 169-180). The IDEA protocol: private estimate, structured discussion, private re-estimate, mathematical aggregation, evaluated on geopolitical forecasting questions. Performed well against an equally weighted linear pool and a prediction market; the authors locate the discussion phase’s value in removing arbitrary linguistic uncertainty and sharing knowledge. The strongest modern validation of the ETE structure specifically.
  • Lorenz, Rauhut, Schweitzer and Helbing (2011, PNAS 108(22): 9020-9025). Experimental counter-evidence on the talk/feedback phase: social influence made estimates more similar without making the group more accurate (N = 144).
  • Navajas, Niella, Garbulsky, Bahrami and Sigman (2018, Nature Human Behaviour 2: 126-132). A 5,180-participant live experiment: averaging the consensus answers of many small deliberating groups beat the simple average of the whole crowd - evidence that structured deliberation can add accuracy beyond aggregation, in tension with Lorenz and colleagues.

What the evidence does NOT support: that any Delphi variant reliably beats a simple equal-weight average of independent estimates (the 12-to-2 tally is against first-round estimates, and Graefe and Armstrong found the iteration’s increment over a simple average statistically indistinguishable from zero); that convergence across rounds indicates accuracy (Woudenberg directly contradicts this); or anything at all about agent execution - every study above is human panels, so the entire grade is transferred. The mixed M/P read is for the human-panel claim “structured estimate-feedback/talk-re-estimate elicitation beats unstructured open discussion and adds a real but contested margin over one-shot averaging”; the governing tier emitted is the conservative P, because the read is split, the strong half is borrowed from mechanical aggregation, and none of it is agent-validated.
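The gap between convergence and accuracy is easy to check on any question whose true value is eventually known. The sketch below is a hypothetical diagnostic: the function name, the choice of population standard deviation as the spread measure, and all the numbers are assumptions made for illustration, not data from Lorenz and colleagues or any other cited study.

```python
from statistics import median, pstdev

def round_diagnostics(estimates, truth):
    """Report how tightly a round clusters and how far its median sits from the truth."""
    return {
        "spread": pstdev(estimates),
        "median_error": abs(median(estimates) - truth),
    }

truth = 100  # knowable only in retrospect, e.g. on a calibration question

# Invented rounds: the panel huddles together, but around the wrong value.
before = round_diagnostics([60, 80, 100, 120, 140], truth)
after = round_diagnostics([70, 75, 80, 85, 90], truth)

converged = after["spread"] < before["spread"]
improved = after["median_error"] < before["median_error"]
```

In this invented case `converged` is true while `improved` is false, which is the pattern the critical literature warns about. Tracking the two quantities separately keeps a shrinking spread from being read as validation.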

Self-check on numeric claims: the 12-to-2 tally (Rowe and Wright 1999), the four-structure ranking (Gustafson and colleagues 1973), N = 227 (Graefe and Armstrong 2011), N = 144 (Lorenz and colleagues 2011), and the 5,180-participant crowd (Navajas and colleagues 2018) each map to the named source. No unsourced figures are used.

Verdict: Reject (document-only). The dossier is the product. This confirms the preliminary registry verdict (cand/reject, evalDate 2026-06-11) and the wave-3 consensus, and it confirms the facilitation wall: every entry in facilitation-and-group-structures is fold, excl, or flag, with zero shipped, because the defining value of group protocols is social governance an agent cannot reproduce (the note-and-vote precedent). It overturns the preliminary tier downward, M to P, for the split-and-transferred reasons above; the verdict is unchanged.

The wall here is sharper than “an agent cannot facilitate a meeting.” Delphi’s three load-bearing ingredients - genuinely independent knowledge bases, anonymity that defuses status pressure, and dissent protected from social cost - are properties of a human panel. An agent that “runs a Delphi” by spawning sub-agent panelists produces correlated samples from a single model wearing different name tags. That is not a weaker version of the method; it is a counterfeit of it, because the artifact would present one model’s prior as independent expert convergence, which is precisely the evidential claim a real Delphi exists to earn. A skill whose honest output is fake corroboration is a harm vector, not a gap.

The agent-executable residue decomposes onto already-disposed homes, which is why no Build survives and no fold is honest:

  • Versus dialectical-bootstrapping (cand, build, M) - the closest collision, as the preliminary entry predicted. Dialectical bootstrapping is the within-one-mind crowd: estimate, assume the estimate is wrong and articulate why, re-estimate, average (Herzog and Hertwig 2009 explicitly framed it as simulating the wisdom of crowds inside one judge). ETE’s agent residue - generate several estimates, confront the spread and its rationales, revise, aggregate - is exactly that move scaled out across correlated samples. Hard wall in human terms: ETE requires several judges with genuinely independent knowledge; dialectical bootstrapping requires only one. In agent terms there are no independent knowledge bases, which is why the residue collapses into dialectical bootstrapping plus plain self-consistency sampling. If dialectical-bootstrapping ships, its dossier should absorb the within-mind iteration evidence; nothing here justifies a second skill.
  • Versus interval-calibration-check (cand, build, M). That candidate trains a single judge’s interval width against scored feedback until the hit rate matches stated confidence. ETE never scores a judge and never touches confidence intervals; it pools point estimates across judges. Hard wall: intra-judge calibration versus inter-judge aggregation. No collision.
  • Versus consider-the-unknowns (cand, build, M). That candidate maps the relevant variables a judge cannot observe before committing. ETE never inspects what is missing; it pools what each judge already believes. Hard wall: epistemic-absence mapping versus opinion aggregation. No collision.
  • Versus linear-model-aggregation (shipped, S). Shares the deep finding (mechanical aggregation beats unstructured holistic judgment) but not the move: linear-model-aggregation combines predictive cues by a fixed weighted formula for a recurring judgment; ETE combines judges’ holistic estimates for a one-off question and iterates. The aggregation step of an agent-run ETE is a degenerate equal-weight case of the shipped skill. Hard wall: cues versus judges as inputs; recurring scoring rule versus one-shot elicitation.
  • Versus fermi-estimation (shipped, P). Fermi decomposes one quantity into a chain of factor estimates by one judge; ETE never decomposes - it replicates whole-question estimates across judges. Hard wall: decomposition versus replication.
  • Versus reference-class-forecasting (shipped, S). Substitutes the outside-view base rate of comparable cases for the inside view; ETE pools inside views and has no reference class. Hard wall clear.
  • Versus decision-journal (shipped, P). Records one decision’s prediction and confidence for later review; ETE is synchronous multi-judge elicitation with no follow-up scoring. Hard wall clear.
  • Versus brainwriting (shipped, S). The real mechanism overlap: independence-before-discussion to prevent anchoring on the first voice is brainwriting’s core shipped move, applied there to idea generation. ETE is the same social mechanism applied to numeric estimates plus a re-estimate round. For an agent, the independence half of ETE is already shipped inside brainwriting’s parallel streams; the re-estimate half is the dialectical-bootstrapping candidate.
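The averaging step that the residue collapses into (see the dialectical-bootstrapping bullet above) can be sketched in a few lines. This assumes the second estimate has already been produced by the "assume the first is wrong, say why, re-estimate" step, which the code does not model; the function name and all numbers are hypothetical.

```python
def dialectical_average(first, second):
    """Average a judge's first estimate with their own 'consider I was wrong' revision."""
    return (first + second) / 2

truth = 100              # known only in retrospect
first, second = 130, 90  # invented estimates that happen to bracket the truth

combined = dialectical_average(first, second)
error_of_average = abs(combined - truth)
average_of_errors = (abs(first - truth) + abs(second - truth)) / 2
```

By the triangle inequality, the error of the average can never exceed the average of the two errors; it is strictly smaller only when the two estimates fall on opposite sides of the truth, as they do here. Two draws from the same mind bracket less often than two independent judges would, which is why the gain is smaller than a real panel's.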

Why not Fold: a fold must name one shipped skill that honestly captures the mechanism. Brainwriting captures independence-before-discussion but for ideation, not numeric aggregation; linear-model-aggregation captures mechanical aggregation but of cues, not judges; the defining panel-iteration loop has no shipped home (its nearest home, dialectical-bootstrapping, is a candidate, not shipped, and fold targets must be shipped). Folding anywhere would falsely claim coverage - the same three-halves-three-homes shape that made note-and-vote a reject rather than a fold.

Why not Recipe: the only honest chain (brainwriting-style parallel streams, then average, then revise) is the self-ensemble whose epistemics are the harm vector above, and its defensible core is the dialectical-bootstrapping candidate’s move, which is pending its own adjudication. Writing the counterfeit into a recipe would launder it.

Proposed registry transition: cand -> excl (excluded on the merits, with the published dossier as the learning artifact - the note-and-vote pattern for famous group protocols). The method is famous enough that readers will look for it; what they should find is this dossier explaining what a real Delphi buys, which studies say so, and why an agent imitation would fake the very thing the protocol exists to guarantee.

The method was born at RAND: Project Delphi dates to 1948-1951, a Cold War exercise in eliciting expert consensus without a meeting, run by Olaf Helmer and Norman Dalkey (with Nicholas Rescher in the surrounding methodology work); the founding experiment was published roughly a decade later as Dalkey and Helmer, “An Experimental Application of the Delphi Method to the Use of Experts” (Management Science, 1963). Dalkey’s RAND memorandum “The Delphi Method: An Experimental Study of Group Opinion” (RM-5888-PR, 1969) reports the larger experimental series. Harold Linstone and Murray Turoff’s edited volume “The Delphi Method: Techniques and Applications” (1975) is the canonical handbook.

The estimate-talk-estimate variant comes from the nominal-group-technique research program: David Gustafson, Ramesh Shukla, Andre Delbecq and G. William Walster (1973) ran the four-structure comparison that named the move, and Andre Delbecq, Andrew Van de Ven and David Gustafson codified the practice in “Group Techniques for Program Planning” (1975).

For the evaluative literature, read Gene Rowe and George Wright (“The Delphi technique as a forecasting tool: issues and analysis,” International Journal of Forecasting, 1999) with Fred Woudenberg (“An evaluation of Delphi,” Technological Forecasting and Social Change, 1991) as the counterweight, then Andreas Graefe and J. Scott Armstrong (2011) for the modern null. The contemporary practice line is the Melbourne structured-expert-judgment group: Mark Burgman (“Trusting Judgements: How to Get the Best out of Experts,” Cambridge, 2016), Anca Hanea and colleagues (the IDEA protocol, 2017), and Victoria Hemming and colleagues (the practical guide, 2018). For the social-influence dispute around the talk phase, pair Jan Lorenz and colleagues (PNAS, 2011) against Joaquin Navajas and colleagues (Nature Human Behaviour, 2018).

Thinking Framework Skills v0.8.0 · 56 frameworks