Interval Calibration Check

People state uncertainty as intervals - “two to four weeks, 90 percent sure” - and those intervals are reliably too narrow. Overprecision is the most robust form of overconfidence: stated 90 percent intervals contain the true value far less than 90 percent of the time, and subjective intervals are sometimes only a fraction as wide as the judge’s own information would warrant. A stated “90” that historically hits 50 is not a confidence level, it is a habit of speech, and everything downstream that takes the number literally - an expected-value calculation, a risk model, a commitment - inherits the error. This method interrogates the WIDTH of a stated uncertainty: does your 90 mean 90? It runs two coupled moves that both operate on the width and never on the location of the estimate - an equivalent-bet indifference test at elicitation time, and hit-rate scoring against resolved outcomes - and emits a calibration scorecard. The durable move is not asking “how sure are you?” again. It is converting that question into a concrete bet, widening until the bet is genuinely a toss-up, and scoring the stated confidence against the truths that actually arrive.

  • A consequential plan, forecast, or commitment rests on a stated interval or confidence number that has never been audited - the “90 percent sure we ship in Q3” plan, the cost range in a proposal, the confidence column in a decision journal or assumption ledger.
  • The same person or team makes repeated resolvable estimates, so a track record exists or can accumulate and the scored-feedback half has material to work with.
  • A method that consumes probability numbers at face value sits immediately downstream (an expected-value decision tree, a risk model) - calibrate the inputs before the arithmetic launders them.
  • The worry is that the stated confidence is too tight to trust (overprecision), not that the central number is in the wrong place.
  • Do not run it on the agent’s own confidence. An LLM posing an equivalent bet to itself has no felt indifference to reveal; the test becomes the same self-report in different words, and verbalized model confidence is itself systematically overconfident (Xiong et al., 2024). This calibrates a human’s stated intervals through elicitation. It is not a self-calibration device for the model. This is the central wall.
  • Do not use it when the problem is the location of the estimate, not the width. A wrong number, well calibrated, is still wrong. Route a wrong central estimate to think-reference-class-forecasting (anchor on the base rate of comparable cases) or think-fermi-estimation (build the number from factors). Wrong number, use those; untrustworthy “sure,” use this.
  • Do not calibrate an interval around a lookupable fact. Where the answer can simply be checked, or no genuine uncertainty exists, calibrating its interval is theater.
  • Do not present a one-shot bet-test as a full calibration. Without resolvable items only the bet-test half applies, and the bet device is the least-evidenced part of the protocol; say plainly that the scorecard is one-legged.
  • Do not promise full debiasing. The controlled record shows partial correction with a stubborn residue. Promise tighter honesty about uncertainty, not calibrated certainty.
  • Do not confuse it with content moves. It never asks what information is missing (that is the consider-the-unknowns move) and never generates a second estimate to average (that is think-dialectical-bootstrapping). It is content-blind: it only asks whether the stated number means what it claims.

When asked to pressure-test a stated confidence interval or audit whether a “90 percent sure” is worth its face value, follow these steps:

  1. Name the focal claim and confirm there is genuine uncertainty. State the quantity and its stated interval with the nominal confidence (“ship date 8-11 weeks out, 90 percent”). If the answer is lookupable or there is no real uncertainty, stop and say so - calibrating it is theater.
  2. Confirm the problem is width, not location. If the worry is that the central number is in the wrong place, route to think-reference-class-forecasting or think-fermi-estimation and stop. This method resizes the stated uncertainty; it never relocates the estimate.
  3. Confirm the judge is a human. Calibrate a human’s stated intervals only. Never present the agent’s own self-administered bet as calibration (Xiong et al., 2024). If no human judge is in the loop, say the method does not apply.
  4. Run the equivalent-bet test on each interval. Offer the judge a choice between (a) betting that the truth falls inside their stated interval and (b) a reference lottery that pays at exactly the nominal probability (the classic device is a wheel with a winning region the size of the nominal confidence). A preference for the wheel reveals that felt confidence is below the stated number: the judge is overconfident, so widen the interval. A preference for the interval bet reveals that felt confidence is above the stated number, so narrow it. Repeat the bet against the adjusted interval until the judge is genuinely indifferent.
  5. Record the bet verdict and the adjusted interval. For each claim, log the original interval, the bet verdict (wheel-preferred / interval-preferred / indifferent), and the adjusted interval at indifference.
  6. Score the track record wherever outcomes resolve. If the judge has a battery of resolvable items with known answers, or their own past predictions that have since resolved, score the hit rate against the nominal confidence. Diagnose over- or underprecision (90s that hit 50 are overprecise; 70s that hit 90 are underprecise) and feed the score back before the next round.
  7. Mark the scorecard one-legged when no items resolve. With no resolvable track record, only the bet-test half ran. Say so plainly; do not present a bet-only result as a verified calibration.
  8. Emit the calibration scorecard per references/TEMPLATE.md: each interval, its nominal confidence, the bet verdict, the adjusted interval, and the hit rate and over/underprecision diagnosis where a track record exists - with the pre-printed evidence caveat carried into the artifact.
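
Steps 4 and 5 can be sketched as a small loop. This is a minimal Python sketch under stated assumptions, not part of the skill itself: `run_equivalent_bet`, `BetRecord`, and the two callbacks are hypothetical names, and the callbacks stand in for a human judge answering in conversation. The agent never answers them on its own behalf.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Interval = Tuple[float, float]

# Verdict labels as logged in step 5.
VERDICTS = {
    "wheel": "wheel-preferred",        # felt confidence below nominal: widen
    "interval": "interval-preferred",  # felt confidence above nominal: narrow
    "indifferent": "indifferent",      # interval already held at nominal
}


@dataclass
class BetRecord:
    original: Interval
    nominal: float              # nominal confidence, e.g. 0.90
    verdict: str                # the judge's first-round preference
    adjusted: Interval          # interval when the loop stopped
    rounds: int
    reached_indifference: bool


def run_equivalent_bet(
    original: Interval,
    nominal: float,
    ask_preference: Callable[[Interval, float], str],
    ask_new_interval: Callable[[Interval, str], Interval],
    max_rounds: int = 10,
) -> BetRecord:
    """Iterate the bet until the (human) judge reports indifference."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    current = original
    verdict = ""
    for round_no in range(1, max_rounds + 1):
        # Hypothetical callback: the judge answers "wheel", "interval", or "indifferent".
        answer = ask_preference(current, nominal)
        if answer not in VERDICTS:
            raise ValueError(f"unexpected preference: {answer!r}")
        if round_no == 1:
            verdict = VERDICTS[answer]
        if answer == "indifferent":
            return BetRecord(original, nominal, verdict, current, round_no, True)
        direction = "widen" if answer == "wheel" else "narrow"
        # The judge, not the code, chooses the new bounds; we only check the width.
        proposed = ask_new_interval(current, direction)
        old_width = current[1] - current[0]
        new_width = proposed[1] - proposed[0]
        ok = new_width > old_width if direction == "widen" else new_width < old_width
        if not ok:
            raise ValueError(f"asked to {direction}, but width went {old_width} -> {new_width}")
        current = proposed
    # Ran out of rounds without indifference: report that honestly.
    return BetRecord(original, nominal, verdict, current, max_rounds, False)
```

The sketch checks only that the width moves in the requested direction. It deliberately has no logic for choosing bounds, because the adjusted interval has to come from the judge.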

Use the template in references/TEMPLATE.md. The deliverable is the filled calibration scorecard - the focal claims, each stated interval with its nominal confidence, the equivalent-bet verdict, the adjusted interval at indifference, and the hit-rate-versus-nominal diagnosis wherever outcomes resolve - not a prose essay. The evidence caveat ships inside the artifact by construction. Never present the agent’s own confidence as a calibrated reading, and never report the central estimate as corrected (this method resizes width, it does not relocate the number).
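
The actual layout lives in references/TEMPLATE.md, which is not reproduced here. As a rough sketch of the fields the deliverable has to carry, and of the one-legged rule from step 7, a row might look like this in Python. The names `ScorecardRow` and `scorecard_is_one_legged` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ScorecardRow:
    claim: str
    stated: Tuple[float, float]      # original interval
    nominal: float                   # nominal confidence, e.g. 0.90
    bet_verdict: str                 # wheel-preferred / interval-preferred / indifferent
    adjusted: Tuple[float, float]    # interval at indifference (width only)
    hits: Optional[int] = None       # resolved items where the truth fell inside
    resolved: Optional[int] = None   # resolved items scored at this nominal level

    def has_track_record(self) -> bool:
        # No resolved items (None or 0) means only the bet-test half ran.
        return bool(self.resolved)

    def hit_rate(self) -> Optional[float]:
        if not self.has_track_record() or self.hits is None:
            return None
        return self.hits / self.resolved


def scorecard_is_one_legged(rows: List[ScorecardRow]) -> bool:
    """True when no row has any resolved outcomes behind it."""
    return not any(row.has_track_record() for row in rows)
```

Keeping `hits` and `resolved` optional makes the missing track record visible in the artifact instead of letting a bet-only row pass as a scored one.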

Before finalizing, verify:

  • The focal claim has genuine uncertainty and is not lookupable, and the problem is width (overprecision), not a wrong central estimate.
  • The judge is a human; the agent’s own confidence was never presented as a calibrated reading.
  • Each interval ran the equivalent-bet test and was iterated to genuine indifference, with the original interval, the bet verdict, and the adjusted interval all recorded.
  • Where resolvable items exist, the hit rate is scored against the nominal confidence with an explicit over- or underprecision diagnosis; where they do not, the scorecard is marked one-legged.
  • Only the WIDTH of the uncertainty was adjusted; the location of the estimate was left alone.
  • The output is the calibration scorecard artifact, not prose.
  • No overclaiming: the evidence is practitioner-grade (P) and transferred from human studies; the caveat states that debiasing is partial and is not a guarantee of calibrated certainty (see evidence/dossier.md).

Tier P (governing; preliminary M overturned to P). The underlying phenomenon - interval overprecision - is established at strong-research level: 98 percent intervals cover roughly 60 percent of true values (Alpert and Raiffa, 1982), and 90 percent intervals contain the truth less than 45 percent of the time (Soll and Klayman, 2004). Scored feedback is among the few interventions with controlled evidence of improving calibration (Lichtenstein, Fischhoff and Phillips, 1982). But a robust bias is not evidence the specific fix works. The M-flavored training results sit on siblings, not this protocol: Lichtenstein and Fischhoff (1980) trained two-alternative half-range items with modest-to-nil transfer, and Klayman et al. (1999) show calibration is not unitary across formats, blocking that transfer; the controlled elicitation-time remedies for interval width that have been tested are fractile decomposition and full-range assignment (Soll and Klayman 2004; Haran, Moore and Morewedge 2010), not the equivalent bet; the strongest training gain (Chang et al., 2016) came from a bundled curriculum; and the equivalent-bet device itself has zero controlled outcome evidence - only interview doctrine (Spetzler and Stael von Holstein, 1975) and excluded vendor data (Hubbard). Grading M would launder cousins’ robustness onto the actual move, so the governing grade is P. All evidence is transferred from human subjects; nothing validates the protocol run by or on an AI agent, and the model cannot calibrate itself (verbalized LLM confidence is overconfident - Xiong et al., 2024). The skill ships as an uncertainty-honesty aid that promises partial correction, never calibrated certainty. Full grading, sources, and caveats: evidence/dossier.md.

See references/EXAMPLE.md for a completed calibration scorecard on a real decision.

A full worked run (the shared Northwind scenario)

A completed run of the interval-calibration-check skill on a real, consequential decision. This is the quality bar a generated calibration scorecard should meet.

Uses the shared recurring scenario (Northwind, a B2B SaaS weighing a self-serve free-tier launch) so examples across skills read as one coherent product. Where think-scenario-planning stress-tests the free-tier strategy against external futures and think-reference-class-forecasting relocates an estimate onto a base rate, this skill takes the confidence numbers Northwind’s team is already stating about the free-tier launch and asks whether those numbers mean what they claim - it resizes the WIDTH of stated uncertainty and never touches the central estimate. See docs/internal/AUTHORING.md.

Evidence caveat (ships with this artifact by construction). This method is graded P (practitioner), governing. The phenomenon it targets - interval overprecision - is strongly established, but the controlled evidence for this specific fix (equivalent-bet plus scored feedback on intervals) is partial and largely transferred from sibling formats; the equivalent-bet device itself has no controlled outcome evidence. All evidence is transferred from human studies, not agent-validated. Expect partial correction, not calibrated certainty. This scorecard calibrates a human’s stated intervals only - never the agent’s own confidence. It resizes the width of stated uncertainty; it does not relocate the central estimate. See evidence/dossier.md.


Focal claim and why its uncertainty matters

  • Claim / quantity: The free-to-paid conversion rate of Northwind’s planned self-serve free tier, and the time to reach 1,000 activated free accounts.
  • Stated interval and nominal confidence: The PM, Dana, states three numbers feeding the go/no-go: (1) “free-to-paid conversion will be 3 to 5 percent, 90 percent sure”; (2) “we hit 1,000 activated free accounts in 8 to 11 weeks, 90 percent sure”; (3) “incremental support load is at most 1.5 extra tickets per 100 free users, 80 percent sure.”
  • What rides on it: These three numbers are inputs to the free-tier business case. The conversion interval feeds an expected-value model that the board will see; the activation-time interval sets the launch milestone; the support-load interval sizes the support hire. If any interval is narrower than Dana’s real knowledge warrants, the business case looks more certain than it is.
  • Genuine uncertainty confirmed: Yes - none of the three is lookupable (the free tier has not launched), and the worry is that Dana’s stated 90s and 80 are too tight to trust (overprecision), not that her central numbers are in the wrong place. (If the worry were “3 to 5 percent is anchored on the wrong comparable,” that would route to think-reference-class-forecasting, not here.)
  • Human judge: Dana, Northwind’s PM who owns the free-tier business case. The agent plays the encoding analyst and the scorer; it never substitutes its own confidence for Dana’s.

Equivalent-bet test (the width adjustment)

For each interval the agent offered Dana the choice: bet that the true value lands inside her stated interval, versus a reference lottery paying at exactly her stated confidence (a wheel with a winning region the size of the nominal confidence). Wheel-preferred means her felt confidence is below the stated number (overconfident, widen). Interval-preferred means her felt confidence is above it (underconfident, narrow). Each interval was iterated until Dana was genuinely indifferent.

| # | Claim | Stated interval | Nominal conf. | Bet verdict | Direction | Adjusted interval (at indifference) |
|---|---|---|---|---|---|---|
| 1 | Free-to-paid conversion | 3 - 5% | 90% | wheel-preferred (Dana would rather take the 90% wheel than bet on 3-5%) | widen | 2 - 7% |
| 2 | Weeks to 1,000 activated accounts | 8 - 11 wks | 90% | wheel-preferred, strongly (Dana admits she would “obviously rather spin the wheel”) | widen | 7 - 16 wks |
| 3 | Incremental support load (tickets / 100 users) | up to 1.5 | 80% | interval-preferred (Dana would rather bet on her interval than spin the 80% wheel) | narrow | up to 1.2 |

  • Bet verdict legend: wheel-preferred = overconfident (widen the interval); interval-preferred = underconfident (narrow it); indifferent = held at its stated confidence.
  • Reading it: Two of Dana’s three intervals were too narrow - classic overprecision - and badly so on the activation timeline, where 8 to 11 weeks ignored the fat right tail of a launch that under-delivers. The support-load interval was, unusually, a touch too wide (Dana had padded it defensively), so the test narrowed it. The width moved in both directions; the central numbers did not move at all.

Hit-rate scoring (the track record, where outcomes resolve)

Northwind keeps a decision journal, so Dana has a track record of past 90 percent intervals on resolved launches and forecasts. The agent scored them.

| Nominal confidence band | Items in band | Items where truth fell inside | Actual hit rate | Diagnosis |
|---|---|---|---|---|
| 90% | 12 | 7 | 58% | overprecise - 90s historically land near 58 percent |
| 80% | 9 | 6 | 67% | overprecise - 80s land near 67 percent |
| 50% | 8 | 4 | 50% | well-calibrated at the wide end |

  • Overall diagnosis: Dana is systematically overprecise on her high-confidence intervals (her 90s behave like 58s, her 80s like 67s) and well-calibrated only when she lets an interval get genuinely wide. This is the most common calibration signature and it confirms the bet-test reading on claims 1 and 2: her tight high-confidence bands need widening.
  • Feedback to carry into the next round: “When you say 90 percent, your history says treat it as roughly 60 percent - widen the band by enough to feel slightly uncomfortable before you call it 90.”
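
The hit-rate arithmetic behind that table is simple enough to sketch in Python. The function name `score_band` and the 5-point tolerance are assumptions made for illustration; the method itself does not specify a threshold for calling a band well-calibrated.

```python
from typing import Tuple


def score_band(nominal: float, hits: int, total: int, tolerance: float = 0.05) -> Tuple[float, str]:
    """Return (hit rate, diagnosis) for one nominal-confidence band.

    `tolerance` is an illustrative assumption, not a value from the method.
    """
    if total <= 0:
        raise ValueError("need at least one resolved item to score a band")
    if not 0 <= hits <= total:
        raise ValueError("hits must be between 0 and total")
    rate = hits / total
    if rate < nominal - tolerance:
        diagnosis = "overprecise"      # intervals too narrow for the stated confidence
    elif rate > nominal + tolerance:
        diagnosis = "underprecise"     # intervals wider than the stated confidence needs
    else:
        diagnosis = "well-calibrated"
    return rate, diagnosis


# Dana's resolved items from the decision journal: nominal -> (hits, total).
journal = {0.90: (7, 12), 0.80: (6, 9), 0.50: (4, 8)}
for nominal, (hits, total) in journal.items():
    rate, diagnosis = score_band(nominal, hits, total)
    print(f"{nominal:.0%} band: {hits}/{total} = {rate:.0%} -> {diagnosis}")
```

With only 8 to 12 items per band the rates are noisy, so the diagnosis is best read as a direction rather than a precise measurement.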

Corrected intervals (the output to use downstream)

| Claim | Original interval / conf. | Corrected interval / conf. | Note |
|---|---|---|---|
| Free-to-paid conversion | 3 - 5% @ 90% | 2 - 7% @ 90% | width only; the 3-5% center of mass is unchanged, the band now honestly reflects a real 90% |
| Weeks to 1,000 activated accounts | 8 - 11 wks @ 90% | 7 - 16 wks @ 90% | width only; the long right tail of a slow launch is now inside the band |
| Incremental support load | up to 1.5 @ 80% | up to 1.2 @ 80% | narrowed; Dana’s defensive padding removed |

Dana’s free-tier business case rested on two confidence numbers that were tighter than her own knowledge - and her own track record - could support: the conversion band (3 to 5 percent) and especially the activation timeline (8 to 11 weeks), both stated at 90 percent but behaving like 60 percent. The equivalent-bet test and her 58-percent historical hit rate on 90s both pointed the same way, so the bands were widened to 2 to 7 percent and 7 to 16 weeks; the support-load band was nudged tighter where she had over-hedged. Crucially, none of the central estimates moved - this was a width correction, not a re-estimate. The corrected intervals make the expected-value model and the launch milestone honest about how much is genuinely unknown, which is the point: the board now sees a business case that does not pretend to a precision Northwind does not have. The residue is real - calibration training corrects partially, not completely, so Dana’s next round of intervals will still skew a little tight - and these numbers are now closer to honest, not certified accurate.


Note how this differs from its neighbors on the same Northwind decision. The think-reference-class-forecasting example would attack the LOCATION of the conversion estimate (is 3-5% the right base rate for comparable free tiers?). The think-fermi-estimation example would BUILD the activation number bottom-up from traffic, signup, and activation factors. This scorecard does neither: it leaves Dana’s central numbers where they are and asks only whether the stated confidence around them is worth its face value, resizing the width and scoring the track record. The deliverable is honesty about uncertainty (and a corrected set of intervals to feed downstream), not a new estimate.

What the research does and does not show, with graded sources

Evidence Dossier: Interval Calibration Check

The single source of truth for the interval-calibration-check skill. The SKILL.md, the sidecar (skill.meta.yml), and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill. Reformatted from the vetted proposal dossier (_local/proposed-builds/interval-calibration-check/dossier.md) and admitted as a Build at tier P (overturning the candidate’s stale M, per the v0.7.0 phase-2 conservative-split rule).

  • Skill: thinking-framework-skills.interval-calibration-check (installable name think-interval-calibration-check)
  • Family: meta-thinking-and-reflection
  • Evidence tier: P governing (preliminary M OVERTURNED to P - see “What the evidence shows”)
  • Confidence: Moderate that interval overprecision is real and that scored feedback partially corrects it; low that the equivalent-bet device specifically improves outcomes, and the whole protocol is transferred from human studies, not agent-validated
  • Status: draft (admitted from the v0.7.0 phase-2 tranche; tier corrected M -> P on the conservative-split rule)

1. The mechanism (what actually does the work)

People state uncertainty as intervals - “two to four weeks, 90 percent sure” - and those intervals are reliably too narrow. Overprecision is the most robust form of overconfidence: in controlled experiments, stated 90 percent intervals contain the true value far less than 90 percent of the time, and subjective intervals are sometimes only a fraction as wide as the judge’s own information would warrant. A stated “90” that historically hits 50 is not a confidence level, it is a habit of speech, and everything downstream that takes the number literally - an expected-value calculation, a risk model, a commitment - inherits the error.

The method is a correction protocol with two coupled moves, both operating on the WIDTH of stated uncertainty and never on its location:

  1. The equivalent-bet test at elicitation time. Offer the judge a choice between betting that the truth falls inside their stated interval and a reference lottery that pays at exactly the nominal probability (the classic device is a wheel with a 90 percent winning region). A preference for the wheel reveals that felt confidence is below the stated 90; a preference for the interval reveals it is above. Adjust the interval and repeat until the judge is genuinely indifferent. The device converts an abstract “how sure are you?” into a concrete preference between two gambles, which is the part of the encoding-interview doctrine designed to defeat anchoring on the first number and the social pressure to sound precise.
  2. Scored feedback wherever outcomes resolve. Collect the judge’s intervals on items whose answers arrive - a battery of questions with known answers, or the judge’s own resolvable predictions - score the hit rate against the nominal confidence, diagnose over- or underprecision, and feed the score back before the next round. Feedback is among the few interventions with controlled evidence of improving calibration.
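
Why a preference between the two gambles reveals felt confidence can be written out in a few lines. This is a sketch under stated assumptions, not notation taken from the encoding literature: the symbols p, c, and S are introduced here, both bets pay the same prize S and nothing otherwise, and the judge prefers whichever bet they believe is more likely to win.

```latex
\begin{aligned}
p &= \Pr\nolimits_{\text{judge}}\bigl(x \in [\ell, u]\bigr)
  && \text{felt probability that the truth lies in the stated interval} \\
c &= \text{nominal confidence (e.g. } 0.90\text{)}
  && \text{winning fraction of the wheel} \\
\mathbb{E}[\text{interval bet}] &= pS, \qquad \mathbb{E}[\text{wheel}] = cS \\
\text{wheel preferred} &\iff p < c
  && \text{(overprecise: widen } [\ell, u]\text{)} \\
\text{interval preferred} &\iff p > c
  && \text{(underprecise: narrow } [\ell, u]\text{)} \\
\text{indifferent} &\iff p = c
  && \text{(the interval is held at its stated confidence)}
\end{aligned}
```

Widening the interval raises p and narrowing lowers it, so repeating the choice against the adjusted interval moves p toward c.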

The durable cognitive move is to interrogate the WIDTH of a stated uncertainty - does your 90 mean 90? - by bet-equivalence at elicitation time and hit-rate scoring against resolved outcomes, emitting corrected intervals and a calibration scorecard. It leaves the location of the estimate alone; it resizes the stated uncertainty and audits whether the confidence number means what it claims.

The output is a calibration scorecard: each stated interval with its nominal confidence, the equivalent-bet verdict, the adjusted interval, and - wherever a track record exists - the actual hit rate against the nominal confidence with an explicit over- or underprecision diagnosis.

The equivalent-bet idea descends from the betting interpretation of subjective probability (Frank Ramsey’s “Truth and Probability,” 1926; Bruno de Finetti): a degree of belief IS a betting rate, so an interval whose bet you would not take is not actually held at its stated confidence. The working protocol comes from the Stanford / SRI decision-analysis school - Ronald Howard’s decision-analysis program and the SRI Decision Analysis Group - codified by Carl Spetzler and Carl-Axel Stael von Holstein (1975) as probability encoding: the structured interview, the probability wheel, and indifference-seeking reference bets. The calibration-research line runs through Sarah Lichtenstein, Baruch Fischhoff and Lawrence Phillips (Decision Research, Oregon) and Marc Alpert with Howard Raiffa (Harvard), through the elicitation-format work of Joshua Klayman and Jack Soll and the overprecision program of Don Moore, to the Good Judgment Project training results (Barbara Mellers, Philip Tetlock, Welton Chang). Douglas Hubbard (Hubbard Decision Research) is the popularizer who renamed the device the “equivalent bet test” and built commercial calibration training around it (How to Measure Anything, 2007). The LLM-era line (Xiong et al., 2024) measures verbalized model confidence, finds it overconfident, and explains why this method is administered BY the agent TO a human rather than to the model itself.

The terms are generic and descriptive; the durable move is named for what it does (interrogate the width of a stated uncertainty). The skill ships descriptively as think-interval-calibration-check, with the equivalent-bet device and the calibration-training format attributed honestly to their lineage and to Douglas Hubbard as the popularizer, and Hubbard’s “equivalent bet test” carried as an attributed term, not a brand claim.

3. What the evidence shows, and what it does NOT show

The honest grade is P (practitioner), overturning the preliminary registry grade of M. The split, stated plainly: the underlying phenomenon (interval overprecision) is established at strong-research level, and the trainability of calibration via feedback has real controlled support - but the M-flavored training studies measured sibling formats and bundled curricula rather than this protocol, the interval-specific training record is thin and partial, and the device the method is named for, the equivalent bet, has no controlled outcome evidence at all - only interview doctrine and vendor data. By this library’s conservative rule (a split read or transferred evidence takes the lower grade; a cousin’s robustness cannot be laundered onto the actual move), the governing grade is P.

What the record supports (the phenomenon, strongly). Interval overprecision is among the most replicated findings in judgment research. Alpert and Raiffa (1982, orig. 1969) found 98 percent intervals covered roughly 60 percent of true values. Soll and Klayman (2004) found 90 percent intervals contained the truth less than 45 percent of the time, with subjective intervals sometimes only about 40 percent as wide as the judge’s information warranted. The review literature (Lichtenstein, Fischhoff and Phillips 1982) establishes that overconfidence is systematic, worst on hard tasks, and that feedback is among the few interventions that reliably move it. This is robust, replicated support - for the BIAS, not for this specific fix.
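
One way to read those coverage figures is through the miss rate instead of the hit rate. The short Python sketch below only re-expresses the two numbers quoted above. The function name `surprise_ratio` is introduced here for illustration, and because 45 percent is an upper bound on the observed coverage, the ratio computed from it is a lower bound.

```python
def surprise_ratio(nominal: float, observed_coverage: float) -> float:
    """How many times more often the truth falls outside the interval
    than the nominal confidence level implies."""
    if not 0 < nominal < 1:
        raise ValueError("nominal must be strictly between 0 and 1")
    if not 0 <= observed_coverage <= 1:
        raise ValueError("observed_coverage must be between 0 and 1")
    expected_miss = 1 - nominal            # e.g. 2% for a 98% interval
    observed_miss = 1 - observed_coverage  # e.g. about 40% in Alpert and Raiffa
    return observed_miss / expected_miss


# 98% intervals covering roughly 60% of true values.
print(round(surprise_ratio(0.98, 0.60), 1))
# 90% intervals covering less than 45% (so this ratio is a floor).
print(round(surprise_ratio(0.90, 0.45), 1))
```

On the 98 percent intervals this comes to about 20: roughly 40 percent misses where 2 percent were expected.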

What the record does NOT support (the fix, at full strength). Moderate (M) would require controlled evidence that THIS protocol - bet-equivalence plus scored feedback on intervals - improves calibration. What exists instead:

  • Sibling-format training, with documented transfer limits. Lichtenstein and Fischhoff (1980) trained calibration on two-alternative half-range items with comprehensive feedback - considerable learning, almost all after the first round, but modest generalization to several related tasks and none at all to two others. The transfer limits are themselves the finding.
  • The transfer is explicitly blocked. Klayman, Soll, Gonzalez-Vallejo and Barlas (1999) show interval-production tasks exhibit far larger overconfidence than two-choice confidence tasks, and calibration is not a unitary trait across formats. This is the study that blocks transferring the two-alternative training evidence onto interval calibration at full strength - the decisive reason the trainability line cannot carry an M for this method.
  • Different devices, not the equivalent bet. Soll and Klayman (2004) show decomposed (fractile) elicitation reduces but does not eliminate overprecision; Haran, Moore and Morewedge (2010) show full-range probability assignment improves interval calibration. Both are controlled support for elicitation-time intervention on interval width - by a different device than the equivalent bet.
  • One favorable-conditions interval-feedback study. Bolger and Onkal-Atay (2004) found performance feedback improved judgmental interval forecasts under favorable conditions, by forecasters learning the noise level of the series rather than defensively widening everything. Direct interval-format feedback evidence, but a single study, task-specific.
  • A bundled forecasting curriculum. Chang, Chen, Mellers and Tetlock (2016) found a sub-one-hour debiasing module improved Brier-score accuracy by 6 to 11 percent - one of the most rigorous debiasing trials on record - but the module bundles base rates, comparison classes, and bias awareness; it is not an interval-bet calibration protocol. Adjacent support for “calibration-style training works,” not for this move.
  • The equivalent bet itself has zero controlled outcome evidence. Only interview doctrine (Spetzler and Stael von Holstein 1975) and vendor data (Hubbard). The device the method is named for is the least-evidenced part of the protocol.

The stubborn residue. Even where training works it is partial. Alpert and Raiffa’s feedback and exhortation reduced but came nowhere near eliminating interval overconfidence. Lichtenstein and Fischhoff’s trained gains generalized only modestly. The honest promise is tighter honesty about uncertainty, not calibrated certainty - partial debiasing is the documented outcome.

Why not C, why not M. Not C: the phenomenon is among the most replicated in judgment research, several nearby controlled interventions work, and the doctrine has fifty years of professional decision-analysis use; it is well past conceptually-plausible. Not M: grading M would launder cousins’ robustness (two-alternative training, fractile decomposition, full-range assignment, a bundled curriculum) onto the actual move, which is exactly what the conservative rule forbids. P is the honest governing grade.

4. Transferred-evidence flag (required honesty for this library)

Every study above is on human subjects. None validates the protocol run by or on an AI agent. The evidence is transferred from human contexts and not validated for AI-augmented use, which independently caps the grade at P. There is a hard agent-side wall specific to this method: an LLM posing an equivalent bet to itself has no felt indifference to reveal, and verbalized model confidence is itself systematically overconfident (Xiong et al. 2024). The skill calibrates a HUMAN’s stated intervals through elicitation; it is not a self-calibration device for the model. The agent’s value is mechanical and modest - it administers both halves conversationally (playing the encoding analyst who poses the bet and probes for indifference, and the scorer who tracks hits against nominal confidence), forces the discipline, and emits a durable, inspectable scorecard. The skill ships honestly as a P-tier uncertainty-honesty aid with hard walls, never as a measured-gain calibration guarantee.

5. When it works / when it fails (drives the eval negative cases and “When NOT to Use”)

Works best when:

  • A consequential plan, forecast, or commitment rests on a stated interval or confidence number that has never been audited - the “90 percent sure we ship in Q3” plan, the cost range in a proposal, the confidence column in a decision journal or assumption ledger.
  • The same person or team makes repeated resolvable estimates, so the scored-feedback half has material to work with and the scorecard becomes a track record.
  • A method that consumes probability numbers at face value (an expected-value decision tree, a risk model) sits immediately downstream - calibrate the inputs before the arithmetic launders them.

Fails or misleads when (poor-fit / anti-patterns):

  • On the agent’s own confidence. An LLM posing an equivalent bet to itself has no felt indifference to reveal; the test would be the same self-report in different words, and verbalized model confidence is itself overconfident (Xiong et al. 2024). This calibrates a human’s stated intervals; it is not a self-calibration device for the model. This is the central wall.
  • When the problem is the location of the estimate, not the width of the uncertainty. A wrong number, well calibrated, is still wrong. Route to think-reference-class-forecasting (anchor on the base rate of comparable cases) or think-fermi-estimation (build the number from factors). Wrong number, use those; untrustworthy “sure,” use this.
  • On lookupable facts or where no genuine uncertainty exists. Calibrating an interval around a number you could simply check is theater.
  • As a one-shot ritual with no feedback prospect. Without resolvable items, only the bet-test half applies - and the bet device’s specific contribution is the least-evidenced part of the protocol. Say plainly that the scorecard is one-legged in this case.
  • While expecting full debiasing. The controlled record shows partial correction with a stubborn residue. Promise tighter honesty about uncertainty, not calibrated certainty.

The single durable move it adds: test whether a stated confidence number is worth its face value and resize the interval accordingly. The catalog audit behind the claim runs as follows. The library produces estimates (fermi-estimation), relocates them to base rates (reference-class-forecasting), records confidence for later review (decision-journal), re-scores beliefs against new evidence (belief-update-routine), and replaces recurring judgments with a formula (linear-model-aggregation). Nothing tests whether a stated confidence number is worth its face value, and nothing adjusts an interval’s width. That corner of meta-thinking-and-reflection - the honesty of the confidence-stating process itself - is open.

Burden of proof against the closest shipped skills:

  • decision-journal (HIGH adjacency; the fold candidate). Both traffic in stated confidence attached to predictions, and both aim at calibration over time. The wall: the journal’s load-bearing move is contemporaneous CAPTURE - its own dossier states it supplies the recorded-prediction half of the calibration loop, and it never adjusts the confidence it records. This method’s entire mechanism IS the adjustment - bet-test, widen, score, correct. Folding it in would bolt an elicitation-correction protocol onto a capture skill, misrepresenting both. Honest mechanism overlap is in the 15 to 20 percent band (shared inputs and goal; disjoint operations). The two compose cleanly: calibrate the interval, then journal it.
  • belief-update-routine (MEDIUM). It re-scores belief confidence against NEW EVIDENCE on a cadence - content-driven updating of what you believe. This method is outcome-driven calibration of how you STATE confidence, and it is content-blind: a belief-update ledger never asks whether the judge’s 80s historically hit 80. (Note: belief-update-routine is a SHIPPED skill, correcting an error in the preliminary reasoning that called it “folded.”)
  • fermi-estimation (LOW to MEDIUM). It emits a low/high range, but builds it bottom-up from decomposed factors; it never takes a stated interval and tests its width against a bet or a track record.
  • reference-class-forecasting (LOW to MEDIUM). It fights optimism by RELOCATING the estimate to the outside-view base rate; this method leaves the location alone and resizes the stated uncertainty.
  • linear-model-aggregation (LOW). It replaces holistic judgment with a fixed formula for recurring predictions; this keeps the human judgment and calibrates its confidence.

Cluster walls (the estimation/calibration batch):

  • vs dialectical-bootstrapping. That method improves the ACCURACY of a point estimate by generating a contrarian second self-estimate and averaging the two; it never touches the stated confidence. Different failure attacked, different artifact; they compose without sharing mechanism.
  • vs consider-the-unknowns. That is a CONTENT move - enumerate the unobservable variables before judging. This method is content-blind: it never asks what is missing, only whether the stated number means what it claims.
  • vs estimate-talk-estimate (rejected). That is a multi-human group protocol whose value is anonymous social aggregation, which an agent cannot reproduce. This method does NOT face that wall: probability encoding was designed as an analyst-with-one-judge interview, and the agent is the analyst in the chat. Single-judge by construction.
  • Recipe reading rejected: no chain of shipped skills produces a widened interval or a hit-rate scorecard, because no shipped skill contains either operation. There is nothing to chain.

The honest runner-up, recorded: fold-with-enrichment into decision-journal. Defensible on parsimony, but it buries the elicitation protocol and the scorecard inside a capture skill whose mechanism it does not share, and would leave the journal claiming a correction move its own dossier disclaims. The Build verdict stands.

The skill must emit a calibration scorecard, not prose: the focal claim and why its uncertainty matters; each stated interval with its nominal confidence; the equivalent-bet verdict (wheel-preferred means overconfident and widen, interval-preferred means underconfident and narrow, indifferent means held); the adjusted interval after iterating to indifference; and - wherever a track record of resolved items exists - the actual hit rate against the nominal confidence with an explicit over- or underprecision diagnosis. The artifact carries a pre-printed evidence caveat (governing tier P, transferred from human studies, partial debiasing only, human-stated intervals only) by construction, so the honesty ships with the scorecard. When no resolvable items exist, the scorecard is explicitly marked one-legged (bet-test only).
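The scoring half of the scorecard can be sketched in a few lines of Python. This is a minimal illustration, not the skill's implementation: the names `BET_ACTIONS`, `hit_rate`, and `diagnose`, the `(low, high, truth)` tuple layout, and the 5-point `tolerance` band are all assumptions introduced here, and the sample intervals are invented for the example.

```python
from typing import Optional, Sequence, Tuple

# Equivalent-bet verdict -> action, as stated in the scorecard description.
BET_ACTIONS = {
    "wheel": "overconfident: widen",       # judge prefers the wheel bet
    "interval": "underconfident: narrow",  # judge prefers the interval bet
    "indifferent": "held",                 # genuine toss-up, keep the interval
}

Resolved = Tuple[float, float, float]  # (low, high, truth); hypothetical layout


def hit_rate(resolved: Sequence[Resolved]) -> Optional[float]:
    """Fraction of resolved items whose true value fell inside the stated interval.

    Returns None when there is no track record to score.
    """
    if not resolved:
        return None
    hits = sum(1 for low, high, truth in resolved if low <= truth <= high)
    return hits / len(resolved)


def diagnose(nominal: float, observed: Optional[float], tolerance: float = 0.05) -> str:
    """Compare the observed hit rate with the nominal confidence.

    The tolerance band is an assumption of this sketch, not part of the method.
    """
    if observed is None:
        return "one-legged (bet-test only)"
    if observed < nominal - tolerance:
        return "overprecision"
    if observed > nominal + tolerance:
        return "underprecision"
    return "roughly calibrated"


# Four invented 90 percent intervals with their resolved outcomes.
track_record = [(2, 4, 3), (2, 4, 5), (10, 20, 15), (1, 2, 3)]
observed = hit_rate(track_record)
print(observed, diagnose(0.90, observed))
print(diagnose(0.90, hit_rate([])))
```

With only a handful of resolved items the observed hit rate is noisy, so a diagnosis from a short track record is a prompt to keep scoring rather than a verdict.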

  1. Carl S. Spetzler and Carl-Axel S. Stael von Holstein, “Probability Encoding in Decision Analysis,” Management Science 22(3):340-358 (1975). The foundational SRI interview doctrine - structured encoding with a trained interviewer, the probability wheel, indifference-seeking reference bets, with explicit bias-reduction rationale. Defines the device; measures nothing comparative. (P, doctrine.)
  2. Marc Alpert and Howard Raiffa, “A progress report on the training of probability assessors,” in Kahneman, Slovic and Tversky (eds.), Judgment under Uncertainty: Heuristics and Biases, pp. 294-305 (1982; orig. 1969). The classic interval-overprecision demonstration - 98 percent intervals covered roughly 60 percent of true values; feedback plus exhortation reduced but did not eliminate the miss rate. The single most on-point training result, and honest about its own partial failure. (Phenomenon strong; training effect modest.)
  3. Sarah Lichtenstein and Baruch Fischhoff, “Training for calibration,” Organizational Behavior and Human Performance 26:149-171 (1980). Eleven sessions of two-alternative half-range items with comprehensive feedback - considerable learning, mostly after the first round; modest generalization to several related tasks, none to two others. Controlled and real, but a sibling FORMAT, and the transfer limits are the finding.
  4. Sarah Lichtenstein, Baruch Fischhoff and Lawrence D. Phillips, “Calibration of probabilities: The state of the art to 1980,” in Judgment under Uncertainty, pp. 306-334 (1982). The canonical review - overconfidence is systematic, worst on hard tasks, feedback is among the few interventions that reliably move it. (M for the phenomenon and the feedback direction.)
  5. Joshua Klayman, Jack B. Soll, Claudia Gonzalez-Vallejo and Sema Barlas, “Overconfidence: It depends on how, what, and whom you ask,” Organizational Behavior and Human Decision Processes 79(3):216-247 (1999). Interval-production tasks show far larger overconfidence than two-choice tasks, and calibration is not unitary across formats. The study that BLOCKS transferring two-alternative training evidence to interval calibration at full strength.
  6. Jack B. Soll and Joshua Klayman, “Overconfidence in interval estimates,” Journal of Experimental Psychology: Learning, Memory, and Cognition 30(2):299-314 (2004). 90 percent intervals contained the truth less than 45 percent of the time; subjective intervals systematically too narrow, sometimes only about 40 percent as wide as needed; decomposed elicitation reduces but does not eliminate overprecision. Strong controlled phenomenon evidence; the tested remedy is fractile decomposition, not the equivalent bet.
  7. Uriel Haran, Don A. Moore and Carey K. Morewedge, “A simple remedy for overprecision in judgment,” Judgment and Decision Making 5(7):467-476 (2010). Full-range probability assignment improves interval calibration versus standard 90 percent interval elicitation, with carryover. Controlled support for elicitation-time intervention on interval width, by a different device.
  8. Fergus Bolger and Dilek Onkal-Atay, “The effects of feedback on judgmental interval predictions,” International Journal of Forecasting 20(1):29-39 (2004). Performance feedback improved judgmental interval forecasts under favorable conditions, via learning the series noise level rather than defensive widening. Direct interval-format feedback evidence; one study, task-specific.
  9. Welton Chang, Eva Chen, Barbara Mellers and Philip Tetlock, “Developing expert political judgment: The impact of training and practice on judgmental accuracy in geopolitical forecasting tournaments,” Judgment and Decision Making 11(5):509-526 (2016) (with Mellers et al. 2014). A sub-one-hour debiasing module improved Brier-score accuracy by 6 to 11 percent over control - one of the most rigorous debiasing trials on record - but the module bundles base rates, comparison classes, and bias awareness; it is not an interval-bet protocol. Adjacent support for debiasing training in general, not evidence for this move.
  10. Douglas W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business (Wiley, 2007 and later editions); Hubbard Decision Research calibration training. The popularizer - the “equivalent bet test” name, the half-day training format, the widely quoted post-training hit rates. Those headline rates are vendor data without independent controlled publication and are EXCLUDED from the grade. (V.)
  11. Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He and Bryan Hooi, “Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs,” ICLR 2024 (arXiv:2306.13063). Verbalized LLM confidence is systematically overconfident, plausibly imitating human confident speech. Not evidence for the method; named because it defines the agent-side wall - the model cannot administer the bet to itself and trust the answer.

Excluded on the evidence rule: Hubbard Decision Research’s post-training calibration rates (vendor data, no independent primary source); any specific “calibrated forecasters reach N percent hit rates after training” figure - no controlled primary source exists for the equivalent-bet protocol specifically. The governing grade is set by the conservative split: strong phenomenon, partial and sibling-format fix evidence, an undertested signature device - P, not M.

Thinking Framework Skills v0.8.0 · 56 frameworks