Natural-Frequency Bayesian Framing

Picture it

A conditional probability that feels impossible as a percentage becomes obvious once you count people. Imagine 1,000 people, a condition that affects 1% of them, and a test that catches 90% of true cases but also flags about 9% of people who do not have the condition:

graph TD
  A["1,000 people"] --> B["10 have it<br/>(1% base rate)"]
  A --> C["990 do not"]
  B --> D["9 test positive<br/>true positives"]
  C --> F["about 89 test positive<br/>false positives"]
  D --> H["About 98 positive tests,<br/>only 9 are real:<br/>roughly 9% truly have it"]
  F --> H
  classDef has fill:#fde7e7,stroke:#dc2626,color:#7f1d1d
  classDef hasnt fill:#e3f5e8,stroke:#16a34a,color:#14532d
  classDef ans fill:#e6e9ff,stroke:#6366f1,color:#1e1b4b,font-weight:bold
  class B,D has
  class C,F hasnt
  class H ans

Illustrative numbers. The point is the move: stated as natural frequencies (9 of 98), the base-rate trap that “a positive test means I probably have it” is visibly wrong.

People - including experts - reason badly about conditional probabilities stated as percentages, because they neglect the base rate. Re-expressing the same facts as natural frequencies over a concrete population makes the correct answer nearly visible: “Out of 1,000, 10 have it; 9 of those test positive; of the 990 without it, ~89 also test positive; so of ~98 positives, only 9 truly have it - about 9%.” The format does the work by keeping the base rate in the counts. The output is a natural-frequency breakdown. Honest constraint: the base rate, the hit rate, and the false-positive rate must all be real. The format makes correct reasoning tractable; it does not invent the inputs.
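The counting above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function name `frequency_breakdown` and its parameter names are assumptions introduced here, and the inputs are the illustrative figures from the text (1% base rate, 90% hit rate, 9% false-positive rate).

```python
def frequency_breakdown(population, base_rate, hit_rate, false_positive_rate):
    """Return natural-frequency counts and the posterior for a positive test.

    Hypothetical helper for illustration; all rates are fractions in [0, 1].
    """
    with_condition = population * base_rate
    without_condition = population - with_condition
    true_positives = with_condition * hit_rate
    false_positives = without_condition * false_positive_rate
    all_positives = true_positives + false_positives
    return {
        "with_condition": with_condition,
        "without_condition": without_condition,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "all_positives": all_positives,
        # Posterior = true positives / all positives.
        "posterior": true_positives / all_positives,
    }


# Illustrative inputs from the text, not real data.
result = frequency_breakdown(1000, 0.01, 0.90, 0.09)
print(
    f"{result['true_positives']:.0f} of {result['all_positives']:.0f} "
    f"positives are real ({result['posterior']:.0%})"
)
# prints: 9 of 98 positives are real (9%)
```

Keeping the intermediate counts in the result, rather than returning only the posterior, mirrors the point of the format: the base rate stays visible in the numbers.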

When to Use

Interpreting a test or screening result (medical, fraud, security, lead-scoring, A/B).
Any “given a positive signal, what is the actual probability the thing is true?” question.
Communicating risk to others so they do not over-read a positive.

When NOT to Use

When you do not have real input rates and would have to invent them.
When there is no conditional-probability structure to the question.
For general project forecasting (use reference-class forecasting).
When a single point estimate is wanted and the base-rate structure is irrelevant.

Instructions

When asked to reason about a conditional probability, follow these steps:

1. State the question precisely. Identify the posterior being asked for, usually P(condition | positive signal), and distinguish it from P(positive | condition), with which it is often confused.
2. Gather the real inputs: the base rate, the true-positive (hit) rate, and the false-positive rate. If any is unknown, say so and either stop or clearly flag the estimate as illustrative - do not fabricate numbers.
3. Build a frequency tree over a concrete population. Pick a round number (e.g., 1,000). Work out how many have the condition; of those, how many test positive; and of those without it, how many also test positive.
4. Compute the posterior as true positives / all positives, and state it plainly.
5. Name the wrong intuition it corrects. State the answer most people give (usually near the hit rate) and why it is wrong (base-rate neglect).
6. Emit the natural-frequency breakdown per references/TEMPLATE.md.
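The steps above, including the rule to stop when an input is unknown, might look like the following sketch. The function name `posterior_or_refusal`, its signature, and the wording of its messages are assumptions made for illustration. They are not taken from references/TEMPLATE.md.

```python
from typing import Optional


def posterior_or_refusal(
    base_rate: Optional[float],
    hit_rate: Optional[float],
    false_positive_rate: Optional[float],
    population: int = 1000,
) -> str:
    """Walk the steps; refuse to compute when any input rate is unknown.

    Hypothetical helper for illustration. Pass None for an unknown rate.
    """
    inputs = {
        "base rate": base_rate,
        "true-positive (hit) rate": hit_rate,
        "false-positive rate": false_positive_rate,
    }
    missing = [name for name, value in inputs.items() if value is None]
    if missing:
        # Gather-inputs step: say so and stop rather than fabricate a number.
        return "Cannot compute yet; missing: " + ", ".join(missing)

    # Frequency tree over a concrete population.
    have_it = population * base_rate
    true_positives = have_it * hit_rate
    false_positives = (population - have_it) * false_positive_rate

    # Posterior = true positives / all positives.
    all_positives = true_positives + false_positives
    if all_positives == 0:
        return "Cannot compute: no positives are expected with these rates"
    posterior = true_positives / all_positives

    # Name the wrong intuition alongside the answer.
    return (
        f"P(condition | positive) = {true_positives:.0f} / {all_positives:.0f} "
        f"= {posterior:.0%} (not the {hit_rate:.0%} hit rate)"
    )


print(posterior_or_refusal(None, 0.80, 0.10))
print(posterior_or_refusal(0.05, 0.80, 0.10))
```

Returning a refusal message instead of raising an error or substituting a default keeps the missing-data case visible in the output, which is the behavior the instructions ask for.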

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the frequency tree, the posterior, and the plain-language meaning, not a bare percentage.

Quality Checklist

Before finalizing, verify:

The question distinguishes P(condition | positive) from P(positive | condition).
The base rate, true-positive rate, and false-positive rate are real (or missing data is flagged, not invented).
A frequency tree over a concrete population is shown.
The posterior is computed as true positives / all positives.
The common wrong intuition (base-rate neglect) is named.
The output is the breakdown artifact, not a bare number.

Evidence

Tier S. Presenting conditional-probability information as natural frequencies substantially improves Bayesian-inference accuracy - accuracy on these problems rises from roughly 10% to 50-90% with the same facts in frequency format (Gigerenzer & Hoffrage 1995; Sedlmeier & Gigerenzer 2001), replicated across populations including physicians. The format does not supply the inputs; real rates are required. Evidence is from human reasoners, transferred to AI use, not AI-validated. Full grading: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed breakdown.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Natural-Frequency Breakdown - Worked Example

A completed run of think-natural-frequency-bayesian, on the shared Northwind scenario. This is the quality bar a generated breakdown should meet.

Northwind is a B2B SaaS. Sales treats every account its new model flags as “high-intent” as if it almost certainly is. This skill checks what a flag actually means.

Question

Posterior asked: P(truly high-intent | flagged “high-intent”) = ?
(This is NOT the model’s 80% sensitivity, which is P(flagged | truly high-intent). Sales is confusing the two.)

Inputs (real, with source)

Base rate P(high-intent): 5% - source: historical share of accounts that became opportunities.
True-positive (hit) rate P(flagged | high-intent): 80% - source: model validation set.
False-positive rate P(flagged | not high-intent): 10% - source: model validation set.

Frequency tree (per 1,000 accounts)

Of 1,000: 50 are truly high-intent; 950 are not.
- Of the 50 high-intent: 40 are flagged (80%).
- Of the 950 not high-intent: ~95 are also flagged (10%).
Total flagged: 40 + 95 = 135.

Posterior

P(high-intent | flagged) = 40 / 135 = ~30%.
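As a cross-check, the same posterior can be computed in probability format with Bayes' rule, using only the three inputs listed above. It should agree with the count-based 40 / 135:

```latex
\[
P(\text{high-intent} \mid \text{flagged})
  = \frac{0.80 \times 0.05}{0.80 \times 0.05 + 0.10 \times 0.95}
  = \frac{0.040}{0.135}
  \approx 0.296
\]
```

The numerator and denominator are the true positives and all positives per account (40 and 135 per 1,000), so the two formats differ only in presentation.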

What it means / the wrong intuition it corrects

Plain language: when the model flags an account, it is truly high-intent only about 30% of the time - so roughly 7 of every 10 flagged accounts are not.
Common wrong answer: ~80% (people read the flag as the sensitivity). Wrong because it ignores that high-intent accounts are rare (5% base rate), so the many false positives from the large low-intent pool swamp the true positives.

Note: the value is converting “the model is 80% accurate” into “a flag is right ~30% of the time,” which completely changes how Sales should treat flags (triage, not trust). The honest constraint held: the three input rates came from real validation data, not invented numbers - without them the right output would have been “we cannot compute this yet.”

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Natural-Frequency Bayesian Framing

Single source of truth for the natural-frequency-bayesian skill. The SKILL.md, sidecar, and evals derive from this. A strong-evidence anchor.


Skill: `thinking-framework-skills.natural-frequency-bayesian` (installable name `think-natural-frequency-bayesian`)
Family: reasoning-clarity
Evidence tier: S (well-replicated)
Confidence: High - the format effect is one of the most robust findings in judgment research
Status: draft (authored 2026-05-31 from the discovery corpus)

1. The mechanism (what actually does the work)

People - including experts - reason badly about conditional probabilities when they are stated as percentages or probabilities. Given “the test is 90% sensitive, the condition affects 1%, the false-positive rate is 9%,” most people (including doctors) wildly overestimate the chance that a positive result means the condition is present, because they neglect the base rate.

Re-expressing the identical information as natural frequencies over a concrete population makes the correct answer almost visible: “Out of 1,000 people, 10 have the condition; of those, 9 test positive. Of the 990 without it, about 89 also test positive. So of ~98 positives, only 9 truly have it - about 9%.” The format does the work: it preserves the base rate in the counts instead of hiding it in a rate. Accuracy on these problems jumps from roughly 10% to 50-90% when the same facts are presented as natural frequencies.
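A short derivation shows why the counts carry the base rate. The symbols are notation introduced here for illustration: N is the population size, b the base rate, h the true-positive (hit) rate, and f the false-positive rate.

```latex
\[
P(\text{condition} \mid \text{positive})
  = \frac{\text{true positives}}{\text{all positives}}
  = \frac{N b h}{N b h + N (1 - b) f}
  = \frac{b h}{b h + (1 - b) f}
\]
```

With the figures from the example (b = 0.01, h = 0.90, f = 0.09), this gives 0.009 / 0.0981, or about 9%. The population size cancels, so any round N works. The base rate b does not cancel, and it is exactly the term that percentage-format reasoning tends to drop.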

2. Lineage

Gigerenzer & Hoffrage (1995) on how natural-frequency formats improve Bayesian reasoning; Sedlmeier & Gigerenzer (2001) on teaching it; widely applied in medical decision-making and risk communication.

No trademark. Named descriptively.

3. What the evidence shows, and what it does NOT show

Strongly supported (the S): presenting conditional-probability information as natural frequencies substantially improves the accuracy of Bayesian inference (the ~10% to ~50-90% jump is well-replicated across studies and populations, including physicians).

What it does NOT do: it does not invent the inputs. The base rate, the true-positive rate, and the false-positive rate must be real; the format makes correct reasoning from those numbers tractable, it does not supply them. And it applies only where there is genuine conditional-probability structure - it is not a general forecasting tool (that is reference-class forecasting).

4. Transferred-evidence flag

The evidence is from human reasoners and is transferred to AI use, not validated on AI. The model can do the arithmetic unaided, but the value is the same as for humans, plus clearer communication: the format forces the base rate to be used (countering base-rate neglect in the model’s own answers and in how it explains risk), and it produces an inspectable frequency breakdown rather than a bare percentage. The model must still refuse to fabricate the input rates.

5. When it works / when it fails

Works best when: interpreting a test or screening result (medical, fraud, security, lead-scoring, A/B); any “given a positive signal, what is the real probability” question; communicating risk to others.

Fails or misleads when (poor-fit / anti-patterns):

No real input rates - inventing the base rate or hit rate is worse than admitting they are unknown (the central failure for an AI).
Ignoring the base rate (base-rate neglect) - the very error the method exists to fix.
Confusing P(positive | condition) with P(condition | positive) - state which is which.
No conditional-probability structure (then this is the wrong tool).
General project forecasting (use reference-class forecasting).

6. Output artifact

A natural-frequency breakdown: the question; the inputs (base rate, true-positive rate, false-positive rate) with sources or an explicit missing-data flag; a frequency tree over a concrete population (e.g., 1,000); the computed posterior; and a plain-language statement of what it means plus the common wrong intuition it corrects.
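One way to hold the pieces of this artifact together is a small data structure that refuses to exist without a source for each rate. The sketch below is an assumption about how that could look. The class names `Rate` and `Breakdown`, the field names, and the rendered wording are hypothetical and are not the layout of references/TEMPLATE.md. The inputs are the Northwind figures from the worked example.

```python
from dataclasses import dataclass


@dataclass
class Rate:
    """An input rate plus where it came from (hypothetical structure)."""
    value: float
    source: str

    def __post_init__(self):
        # A rate without a source is treated as missing data, not as an input.
        if not self.source.strip():
            raise ValueError("every input rate needs a source")


@dataclass
class Breakdown:
    question: str
    base_rate: Rate
    hit_rate: Rate
    false_positive_rate: Rate
    population: int = 1000

    def render(self) -> str:
        have = self.population * self.base_rate.value
        have_not = self.population - have
        tp = have * self.hit_rate.value
        fp = have_not * self.false_positive_rate.value
        posterior = tp / (tp + fp)
        lines = [
            f"Question: {self.question}",
            f"Base rate: {self.base_rate.value:.0%} (source: {self.base_rate.source})",
            f"Hit rate: {self.hit_rate.value:.0%} (source: {self.hit_rate.source})",
            f"False-positive rate: {self.false_positive_rate.value:.0%} "
            f"(source: {self.false_positive_rate.source})",
            f"Frequency tree (per {self.population:,}): {have:.0f} have it, "
            f"{tp:.0f} of them positive; {have_not:.0f} do not, {fp:.0f} of them positive",
            f"Posterior: {tp:.0f} / {tp + fp:.0f} = {posterior:.0%}",
            f"Wrong intuition corrected: about {self.hit_rate.value:.0%} "
            "(the hit rate read as the posterior)",
        ]
        return "\n".join(lines)


report = Breakdown(
    question="P(truly high-intent | flagged)",
    base_rate=Rate(0.05, "historical share of accounts that became opportunities"),
    hit_rate=Rate(0.80, "model validation set"),
    false_positive_rate=Rate(0.10, "model validation set"),
).render()
print(report)
```

Requiring a non-empty source on every rate is a design choice made for this sketch: it turns the "real inputs only" constraint into something the code enforces rather than something the author must remember.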

7. Sources

Gigerenzer, G., & Hoffrage, U. (1995) - improving Bayesian reasoning with natural-frequency formats.
Sedlmeier, P., & Gigerenzer, G. (2001) - teaching Bayesian reasoning (accuracy gains).

Verification status: the natural-frequency format effect and the rough 10%->50-90% accuracy gain are well-attested; confirm exact figures against the papers before a public quantified claim. The “must use real input rates” constraint is the honest core for AI use.

Thinking Framework Skills v0.3.0 · 38 frameworks