Reference Class Forecasting

Picture it

Every estimate can be made two ways. The inside view builds on your own plan and your reasons, and tends to come out optimistic. The outside view asks how comparable past efforts actually went.

graph TD
  P["Estimate this project"] --> I["Inside view:<br/>our plan, our reasons<br/>-> optimistic number"]
  P --> O["Outside view:<br/>how did 20 similar<br/>past projects actually go?<br/>-> base-rate number"]
  I --> A["Anchor on the outside view,<br/>then adjust for what is<br/>genuinely different"]
  O --> A
  classDef inside fill:#fde7e7,stroke:#dc2626,color:#7f1d1d
  classDef outside fill:#e3f5e8,stroke:#16a34a,color:#14532d
  classDef ans fill:#e6e9ff,stroke:#6366f1,color:#1e1b4b,font-weight:bold
  class I inside
  class O outside
  class A ans

The correction is to start from the track record of the reference class, not from the inside-view story, then adjust.

People forecast from the inside view, building an estimate from their own plan’s details, which invites optimism and the planning fallacy. This skill replaces that with the outside view: find a reference class of similar past cases, take the base-rate distribution of how they actually turned out, and anchor the forecast on that, adjusting only cautiously for genuine specifics. The output is a reference-class estimate. The honest constraint: it requires real base-rate data; inventing a distribution is worse than admitting uncertainty.

When to Use

Forecasting cost, time, or odds of success for something with comparable precedents.
The inside-view estimate is likely optimistic.
High-stakes commitments prone to the planning fallacy.

When NOT to Use

Genuinely novel undertakings with no comparable reference class.
When you have no real base-rate data and would have to invent it.
When the specifics genuinely dominate and no class is comparable.
When a point certainty is expected (this produces a distribution).

Instructions

When asked to forecast with the outside view, follow these steps:

1. State what is being forecast (cost, duration, success odds) and the inside-view estimate if one exists.
2. Define the reference class. Identify the set of genuinely comparable past cases, and say why they are comparable. Resist a class that is too narrow or too flattering.
3. Get the base rates. State the distribution of outcomes for that class - typical and worst-case - with the data source. If real data is unavailable, say so explicitly and stop or downgrade to a clearly-flagged rough estimate; do not fabricate numbers.
4. Anchor on the outside view. Set the estimate from the base-rate distribution, not the plan’s details.
5. Adjust conservatively. Only then adjust for genuine specifics, in small amounts, and resist sliding back to the optimistic inside view.
6. Emit the reference-class estimate per references/TEMPLATE.md.
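
The base-rate and anchoring steps above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the function name `reference_class_estimate`, its parameters, and the returned field names are hypothetical, and using the median for "typical" and the maximum for "worst case" is an assumption (the steps only ask for a typical and a worst-case figure with a data source). The adjustment step is left out because it is a judgment call.

```python
from statistics import median


def reference_class_estimate(inside_estimate, overrun_ratios, data_source):
    """Anchor a forecast on past actual/estimate ratios (hypothetical helper).

    overrun_ratios: actual / initial-estimate ratios from comparable past cases.
    data_source: where those ratios came from, so the base rates are traceable.
    """
    # Guard: with no real base-rate data, flag it instead of inventing numbers.
    if not overrun_ratios or not data_source:
        return {
            "status": "no real base-rate data",
            "inside_estimate": inside_estimate,
            "note": "untested inside-view estimate; no outside-view range produced",
        }

    typical_ratio = median(overrun_ratios)  # assumption: median as "typical"
    worst_ratio = max(overrun_ratios)       # assumption: worst observed case as the tail

    # The range comes from the base rates, not from the plan's details.
    return {
        "status": "anchored",
        "inside_estimate": inside_estimate,
        "data_source": data_source,
        "typical": inside_estimate * typical_ratio,
        "worst_case": inside_estimate * worst_ratio,
    }
```

Returning a flagged result rather than a number is one way to make the "do not fabricate" rule hard to skip: without data, no range is produced at all.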

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the reference class, base rates, and the outside-anchored estimate as a range, not a single optimistic number.

Quality Checklist

Before finalizing, verify:

The reference class is genuinely comparable, not narrow or flattering.
Base rates come with a data source, or missing data is flagged (not invented).
The estimate is anchored on the distribution, then adjusted only conservatively.
The result is a range/distribution, not a point certainty.
The inside-view estimate did not sneak back in as the answer.
The output is the reference-class estimate artifact, not prose.
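
Some of the checks above can be partly automated. The sketch below is one possible shape, assuming a draft estimate is held as a dictionary; the function name and every field name (`data_source`, `data_missing_flag`, `low`, `high`, `inside_estimate`) are hypothetical. Whether the reference class is genuinely comparable is a judgment that code cannot make, so it is not checked here. The inside-view test is a crude proxy meant to prompt review, not a hard rule.

```python
def checklist_failures(estimate):
    """Return the checklist items a draft estimate fails (hypothetical field names)."""
    failures = []

    # Base rates need a data source, or an explicit missing-data flag.
    if not estimate.get("data_source") and not estimate.get("data_missing_flag"):
        failures.append("base rates have no data source and missing data is not flagged")

    low, high = estimate.get("low"), estimate.get("high")
    if low is None or high is None or not low < high:
        failures.append("result is not a range")
    else:
        inside = estimate.get("inside_estimate")
        # Assumed proxy: the inside-view number should not sit inside the final range.
        if inside is not None and low <= inside <= high:
            failures.append("inside-view estimate falls inside the final range")

    return failures
```

An empty list means only that the mechanical checks passed; the comparability and conservatism checks still need a human read.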

Evidence

Tier S. The planning fallacy is robustly demonstrated and the outside view measurably reduces forecast error (Kahneman & Lovallo 1993; Kahneman 2011). Bent Flyvbjerg’s reference class forecasting for infrastructure documented systematic overruns and has been adopted by institutions (e.g. UK guidance) - real-world validation, not only lab results. The strong evidence is from human forecasting; the AI use transfers the method, with the firm constraint that base rates must be real, not invented. Full grading: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed reference-class estimate.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Reference-Class Estimate - Worked Example

A completed run of think-reference-class-forecasting, on the shared Northwind scenario. This is the quality bar a generated estimate should meet.

Northwind is a B2B SaaS. The team estimates the free-tier build will take 6 weeks. Here the skill checks that with the outside view.

What is being forecast

Quantity: time to ship the self-serve free tier (build + billing + onboarding).
Inside-view estimate: 6 weeks (built up from the team’s task list).

Reference class

The class: Northwind’s last several “new self-serve surface” launches (billing/auth/onboarding changes of similar scope), plus comparable launches the team has data on.
Why comparable: all involved new billing states, auth edge cases, and onboarding flows under a deadline - the same risk profile, not cherry-picked easy projects.

Base rates

Data source: Northwind’s last 5 comparable launches (internal delivery records). [If those records did not exist, the honest move would be to flag “no real base-rate data” and treat the 6-week figure as an untested inside estimate, not to invent a multiplier.]
Typical outcome: comparable launches ran ~1.5x the initial estimate (so ~9 weeks for a “6-week” plan).
Worst-case / tail: the two launches that touched billing most heavily ran ~2x (~12 weeks), usually from billing/security rework discovered late.

Outside-anchored estimate

Anchored estimate (range): 9 to 12 weeks, from the typical ~1.5x outcome (~9 weeks) up to the ~2x tail (~12 weeks).
Conservative adjustment for specifics: the team is slightly more experienced with this billing system now; nudge the center down modestly, not back to 6. Resisted the inside-view pull to “but this time it is simpler.”
Final forecast: ~8 to 11 weeks (median ~9), with the main uncertainty in the billing/auth path. Implication: a 6-week, fixed-date commitment is likely to slip or ship rough; either move the date or cut scope now.
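
The arithmetic behind these figures can be checked in a few lines of Python. The multipliers and the 6-week figure come from the example; modeling the "modest nudge" as a flat one-week shift is an assumption made here for illustration, since the example states only the resulting range.

```python
inside_weeks = 6      # team's inside-view estimate
typical_ratio = 1.5   # typical overrun across the comparable launches
tail_ratio = 2.0      # the two billing-heavy launches

anchored_low = inside_weeks * typical_ratio   # 9.0 weeks
anchored_high = inside_weeks * tail_ratio     # 12.0 weeks

# Assumption: the conservative adjustment is a flat 1-week shift downward.
nudge_weeks = -1
final_low = anchored_low + nudge_weeks        # 8.0 weeks
final_high = anchored_high + nudge_weeks      # 11.0 weeks

# The adjusted range must still sit above the inside-view number.
assert final_low > inside_weeks

print(f"anchored: {anchored_low:g} to {anchored_high:g} weeks")
print(f"final:    {final_low:g} to {final_high:g} weeks")
```

The final assertion encodes the point of the adjustment step: the nudge may move the range, but not back to the 6-week plan.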

Note: the value is refusing the 6-week inside estimate and anchoring on what comparable launches actually took (~1.5x). The honesty rule is load-bearing here: if Northwind had no real delivery records, the right output is “we lack a reference class” - not a fabricated multiplier. This pairs naturally with a premortem on the slip risk.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Reference Class Forecasting

Single source of truth for the reference-class-forecasting skill. The SKILL.md, sidecar, and evals derive from this. This is one of the library’s strong-evidence anchors.


Skill	`thinking-framework-skills.reference-class-forecasting` (installable name `think-reference-class-forecasting`)
Family	risk-and-resilience
Evidence tier	S (strong; empirical + real-world institutional adoption)
Confidence	High - among the best-evidenced debiasing methods
Status	draft (authored 2026-05-31 from the discovery corpus)

1. The mechanism (what actually does the work)

People forecast from the inside view: they build an estimate from the specifics of their own plan, which invites optimism and the planning fallacy (systematic underestimation of cost, time, and risk). Reference class forecasting replaces that with the outside view: find a reference class of similar past cases, get the base-rate distribution of how they actually turned out (cost overruns, schedule slips, success rates), and anchor the forecast on that distribution, adjusting only cautiously for genuine specifics. The work is done by anchoring on real outcomes of comparable cases instead of on the inside story, which is what corrects the optimism.
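
As a rough sketch of what "get the base-rate distribution" can mean in practice, the Python below turns past (initial estimate, actual outcome) pairs into overrun ratios and summarizes them. The helper name `base_rate_summary`, the choice of quartiles, and the two-case minimum are assumptions for illustration; the input must be real records from the chosen reference class.

```python
from statistics import median, quantiles


def base_rate_summary(past_cases):
    """Summarize actual/estimate ratios for a reference class (illustrative helper).

    past_cases: iterable of (initial_estimate, actual_outcome) pairs in one unit.
    """
    cases = list(past_cases)
    if len(cases) < 2:
        # Too little data to call it a distribution; say so instead of guessing.
        raise ValueError("need at least two comparable past cases with real outcomes")

    ratios = sorted(actual / estimate for estimate, actual in cases)
    q1, _, q3 = quantiles(ratios, n=4)  # quartiles of the overrun ratio
    return {
        "n": len(ratios),
        "median": median(ratios),
        "q1": q1,
        "q3": q3,
        "worst": ratios[-1],
    }
```

Multiplying a new inside-view estimate by the median and by the worst ratio gives the typical and tail figures to anchor on.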

2. Lineage

Kahneman & Tversky named the planning fallacy and contrasted case-specific with distributional information; Kahneman & Lovallo (1993) framed this as the inside versus the outside view and made the case for the outside view.
Bent Flyvbjerg developed reference class forecasting for large infrastructure projects, documenting systematic cost and schedule overruns. The method has been adopted by institutions (for example UK Treasury / transport planning guidance) - real-world uptake, not just lab results.

No trademark. Named descriptively.

3. What the evidence shows, and what it does NOT show

Strongly supported (the S): the planning fallacy is robustly demonstrated, and taking the outside view via reference classes measurably reduces forecast error; Flyvbjerg’s work and its institutional adoption are real-world validation, not only experiments. This is a genuine strong-evidence method and a credibility anchor for the library.

Boundaries (still honest): it requires a genuine reference class with real base-rate data. Where no comparable class exists (a truly novel undertaking), or where the specifics genuinely dominate, the method weakens. And it forecasts a distribution, not a certainty.

4. Transferred-evidence flag

The strong evidence is from human forecasting and large projects, not AI-augmented use. Transferred, not AI-validated. The AI value is real and pointed: a model will happily produce an inside-view, optimistic estimate from a plan’s details; forcing it to construct a reference class and anchor on base rates is a direct counter, with the honest constraint that the agent must use real base-rate data, not invented numbers.

5. When it works / when it fails

Works best when: forecasting cost, time, or odds of success for something with comparable precedents; the inside view is likely optimistic; high-stakes commitments prone to the planning fallacy.

Fails or misleads when (poor-fit / anti-patterns):

No real base-rate data - inventing a distribution is worse than admitting uncertainty (the central failure mode for an AI).
Choosing a reference class that is too narrow, too flattering, or not actually comparable.
Over-adjusting back toward the optimistic inside view (“but we are different”).
Genuinely novel undertakings with no comparable class.
Treating the outside estimate as a point certainty rather than a distribution.

6. Output artifact

A reference-class estimate: the reference class defined (and why it is comparable), the base-rate distribution (typical and worst-case outcomes, with the data source or an explicit flag that data is missing), the original inside-view estimate, the outside-anchored estimate, and the adjustment rationale (kept conservative).
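
One way to hold this artifact in code is a small record type. The sketch below mirrors the fields listed above; the class name, field names, and the flag text are illustrative, and it does not reproduce references/TEMPLATE.md.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceClassEstimate:
    """Fields mirror the artifact described above; names are illustrative."""
    reference_class: str
    why_comparable: str
    typical_outcome: Optional[str]            # None when no real base-rate data exists
    worst_case_outcome: Optional[str]
    data_source: Optional[str]                # None means the data is missing
    inside_view_estimate: str
    outside_anchored_estimate: Optional[str]  # None when no anchoring was possible
    adjustment_rationale: str

    def data_flag(self) -> str:
        """Return the data source, or an explicit missing-data flag."""
        return self.data_source or "NO REAL BASE-RATE DATA"
```

Making the base-rate fields optional keeps the missing-data case representable, so an estimate without records is flagged rather than filled in.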

7. Sources

Kahneman, D., & Lovallo, D. (1993), Timid Choices and Bold Forecasts: A Cognitive Perspective on Risk Taking, Management Science - the planning fallacy and the outside view.
Flyvbjerg, B. - reference class forecasting for infrastructure; documented cost/schedule overruns; institutional adoption (e.g. UK guidance).
Kahneman, D. (2011), Thinking, Fast and Slow - inside vs outside view popularization.

Verification status: the planning-fallacy and Flyvbjerg findings, and the institutional adoption, are well-attested; the “S” grade is justified. Keep the “use real base rates, not invented ones” constraint front and center for AI use.

Thinking Framework Skills v0.3.0 · 38 frameworks