Natural-Frequency Bayesian Framing
Picture it
A conditional probability that feels impossible as a percentage becomes obvious once you count people. Imagine 1,000 people, a condition that affects 1%, and a test that catches 90% of true cases and falsely flags about 9% of everyone else:
graph TD
  A["1,000 people"] --> B["10 have it<br/>(1% base rate)"]
  A --> C["990 do not"]
  B --> D["9 test positive<br/>true positives"]
  C --> F["about 89 test positive<br/>false positives"]
  D --> H["About 98 positive tests,<br/>only 9 are real:<br/>roughly 9% truly have it"]
  F --> H
  classDef has fill:#fde7e7,stroke:#dc2626,color:#7f1d1d
  classDef hasnt fill:#e3f5e8,stroke:#16a34a,color:#14532d
  classDef ans fill:#e6e9ff,stroke:#6366f1,color:#1e1b4b,font-weight:bold
  class B,D has
  class C,F hasnt
  class H ans
Illustrative numbers. The point is the move: stated as natural frequencies (9 of 98), the base-rate trap that “a positive test means I probably have it” is visibly wrong.
People - including experts - reason badly about conditional probabilities stated as percentages, because they neglect the base rate. Re-expressing the same facts as natural frequencies over a concrete population makes the correct answer nearly visible: “Out of 1,000, 10 have it; 9 of those test positive; of the 990 without it, ~89 also test positive; so of ~98 positives, only 9 truly have it - about 9%.” The format does the work by keeping the base rate in the counts. The output is a natural-frequency breakdown. Honest constraint: the base rate and hit rates must be real - the format makes correct reasoning tractable, it does not invent the inputs.
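The counting in the quoted breakdown can be reproduced in a few lines. This is a minimal sketch using the illustrative numbers from this section (the 9% false-positive rate is the figure the example implies); the variable names are chosen here for readability and are not part of the skill.

```python
# Illustrative inputs from the example above; not real clinical data.
population = 1_000
base_rate = 0.01            # 1% have the condition
hit_rate = 0.90             # P(positive | condition)
false_positive_rate = 0.09  # P(positive | no condition)

# Keep everything as head counts so the base rate stays visible.
have_it = round(population * base_rate)
do_not = population - have_it
true_positives = round(have_it * hit_rate)
false_positives = round(do_not * false_positive_rate)
all_positives = true_positives + false_positives

# Posterior = true positives / all positives.
posterior = true_positives / all_positives

print(f"{true_positives} of {all_positives} positives are real: {posterior:.0%}")
```

Rounding to whole people is a presentation choice: it keeps the tree readable at the cost of a small approximation (89.1 false positives become 89).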
When to Use
- Interpreting a test or screening result (medical, fraud, security, lead-scoring, A/B).
- Any “given a positive signal, what is the actual probability the thing is true?” question.
- Communicating risk to others so they do not over-read a positive.
When NOT to Use
- When you do not have real input rates and would have to invent them.
- When there is no conditional-probability structure to the question.
- For general project forecasting (use reference-class forecasting).
- When a single point estimate is wanted and the base-rate structure is irrelevant.
Instructions
When asked to reason about a conditional probability, follow these steps:
- State the question precisely. Name the posterior being asked for, usually P(condition | positive signal), and distinguish it from P(positive | condition), with which it is often confused.
- Gather the real inputs. The base rate, the true-positive (hit) rate, and the false-positive rate. If any is unknown, say so and stop or clearly flag the estimate as illustrative - do not fabricate numbers.
- Build a frequency tree over a concrete population. Pick a round number (e.g., 1,000). Work out: how many have the condition; of those, how many test positive; of those without, how many also test positive.
- Compute the posterior as true positives / all positives, and state it plainly.
- Name the wrong intuition it corrects. State the answer most people give (usually near the hit rate) and why it is wrong (base-rate neglect).
- Emit the natural-frequency breakdown per references/TEMPLATE.md.
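The steps above could be sketched as a single function. This is one possible shape, not the skill's actual implementation: the function name, argument names, and dictionary keys are all hypothetical, and the layout in references/TEMPLATE.md is not reproduced here.

```python
from typing import Optional


def natural_frequency_breakdown(
    base_rate: Optional[float],
    hit_rate: Optional[float],
    false_positive_rate: Optional[float],
    population: int = 1_000,
) -> dict:
    """Build the frequency tree and posterior; refuse if any input is missing."""
    inputs = {
        "base_rate": base_rate,
        "hit_rate": hit_rate,
        "false_positive_rate": false_positive_rate,
    }
    # Gather the real inputs: stop rather than fabricate a number.
    missing = [name for name, value in inputs.items() if value is None]
    if missing:
        raise ValueError(f"cannot compute yet; missing inputs: {', '.join(missing)}")
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    # Frequency tree over a concrete population.
    with_condition = population * base_rate
    without_condition = population - with_condition
    true_positives = with_condition * hit_rate
    false_positives = without_condition * false_positive_rate
    all_positives = true_positives + false_positives
    if all_positives == 0:
        raise ValueError("no positives expected; the posterior is undefined")

    return {
        "population": population,
        "with_condition": with_condition,
        "without_condition": without_condition,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "all_positives": all_positives,
        # Posterior = true positives / all positives.
        "posterior": true_positives / all_positives,
        # The answer people tend to give instead (near the hit rate).
        "common_wrong_answer": hit_rate,
    }


# Illustrative inputs only (1% base rate, 90% hit rate, 9% false-positive rate).
result = natural_frequency_breakdown(0.01, 0.90, 0.09)
print(f"posterior {result['posterior']:.1%} vs. intuition {result['common_wrong_answer']:.0%}")
```

Passing `None` for any rate raises instead of substituting a default, which mirrors the "do not fabricate numbers" rule in the second step.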
Output Format
Use the template in references/TEMPLATE.md. The deliverable is the frequency tree, the posterior, and the plain-language meaning, not a bare percentage.
Quality Checklist
Before finalizing, verify:
- The question distinguishes P(condition | positive) from P(positive | condition).
- The base rate, true-positive rate, and false-positive rate are real (or missing data is flagged, not invented).
- A frequency tree over a concrete population is shown.
- The posterior is computed as true positives / all positives.
- The common wrong intuition (base-rate neglect) is named.
- The output is the breakdown artifact, not a bare number.
Evidence
Tier S. Presenting conditional-probability information as natural frequencies substantially improves Bayesian-inference accuracy - accuracy on these problems rises from roughly 10% to 50-90% with the same facts in frequency format (Gigerenzer & Hoffrage 1995; Sedlmeier & Gigerenzer 2001), replicated across populations including physicians. The format does not supply the inputs; real rates are required. Evidence is from human reasoners, transferred to AI use, not AI-validated. Full grading: evidence/dossier.md.
Examples
See references/EXAMPLE.md for a completed breakdown.
Deep dive: worked example
A full worked run (the shared Northwind scenario)
Natural-Frequency Breakdown - Worked Example
A completed run of think-natural-frequency-bayesian, on the shared Northwind scenario. This is the quality bar a generated breakdown should meet.
Northwind is a B2B SaaS. Sales treats every account its new model flags as “high-intent” as if it almost certainly is. This skill checks what a flag actually means.
Question
- Posterior asked: P(truly high-intent | flagged “high-intent”) = ?
- (This is NOT the model’s 80% sensitivity, which is P(flagged | truly high-intent). Sales is confusing the two.)
Inputs (real, with source)
- Base rate P(high-intent): 5% - source: historical share of accounts that became opportunities.
- True-positive (hit) rate P(flagged | high-intent): 80% - source: model validation set.
- False-positive rate P(flagged | not high-intent): 10% - source: model validation set.
Frequency tree (per 1,000 accounts)
- Of 1,000: 50 are truly high-intent; 950 are not.
- Of the 50 high-intent: 40 are flagged (80%).
- Of the 950 not high-intent: ~95 are also flagged (10%).
- Total flagged: 40 + 95 = 135.
Posterior
- P(high-intent | flagged) = 40 / 135 ≈ 30%.
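As a cross-check, the same posterior follows from Bayes' rule in probability form, using the three inputs listed earlier in this example. The symbols are shorthand introduced for this sketch: H means "truly high-intent" and F means "flagged".

```latex
\[
P(H \mid F)
= \frac{P(F \mid H)\,P(H)}{P(F \mid H)\,P(H) + P(F \mid \neg H)\,P(\neg H)}
= \frac{0.80 \times 0.05}{0.80 \times 0.05 + 0.10 \times 0.95}
= \frac{0.040}{0.135}
\approx 0.30
\]
```

Multiplying the numerator and denominator by 1,000 gives the 40 / 135 from the frequency tree, so the two routes agree.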
What it means / the wrong intuition it corrects
- Plain language: when the model flags an account, it is truly high-intent only about 30% of the time - so roughly 7 of every 10 flagged accounts are not.
- Common wrong answer: ~80% (people read the flag as the sensitivity). Wrong because it ignores that high-intent accounts are rare (5% base rate), so the many false positives from the large low-intent pool swamp the true positives.
Note: the value is converting “the model is 80% accurate” into “a flag is right ~30% of the time,” which completely changes how Sales should treat flags (triage, not trust). The honest constraint held: the three input rates came from real validation data, not invented numbers - without them the right output would have been “we cannot compute this yet.”
Grounding: the full evidence dossier
What the research does and does not show, with graded sources
Evidence Dossier: Natural-Frequency Bayesian Framing
Single source of truth for the natural-frequency-bayesian skill. The SKILL.md, sidecar, and evals derive from this. A strong-evidence anchor.
| Field | Value |
| --- | --- |
| Skill | thinking-framework-skills.natural-frequency-bayesian (installable name think-natural-frequency-bayesian) |
| Family | reasoning-clarity |
| Evidence tier | S (well-replicated) |
| Confidence | High - the format effect is one of the most robust findings in judgment research |
| Status | draft (authored 2026-05-31 from the discovery corpus) |
1. The mechanism (what actually does the work)
People - including experts - reason badly about conditional probabilities when they are stated as percentages or probabilities. Given “the test is 90% sensitive, the condition affects 1%, the false-positive rate is 9%,” most people (including doctors) wildly overestimate the chance that a positive result means the condition is present, because they neglect the base rate.
Re-expressing the identical information as natural frequencies over a concrete population makes the correct answer almost visible: “Out of 1,000 people, 10 have the condition; of those, 9 test positive. Of the 990 without it, about 89 also test positive. So of ~98 positives, only 9 truly have it - about 9%.” The format does the work: it preserves the base rate in the counts instead of hiding it in a rate. Accuracy on these problems jumps from roughly 10% to 50-90% when the same facts are presented as natural frequencies.
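Why the counts and the probability formula must agree can be shown in one line. In this sketch the symbols are assumptions introduced here: N is the population size, p the base rate, s the true-positive (hit) rate, and f the false-positive rate.

```latex
\[
\frac{\text{true positives}}{\text{all positives}}
= \frac{N\,p\,s}{N\,p\,s + N\,(1-p)\,f}
= \frac{p\,s}{p\,s + (1-p)\,f}
\]
```

The population size cancels, so the frequency tree is Bayes' rule with the base rate carried in the counts. With N = 1,000, p = 0.01, s = 0.90, and f = 0.09 this gives 9 / (9 + 89.1), or about 9%.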
2. Lineage
- Gigerenzer & Hoffrage (1995) on how natural-frequency formats improve Bayesian reasoning; Sedlmeier & Gigerenzer (2001) on teaching it; widely applied in medical decision-making and risk communication.
No trademark. Named descriptively.
3. What the evidence shows, and what it does NOT show
Strongly supported (the S): presenting conditional-probability information as natural frequencies substantially improves the accuracy of Bayesian inference (the ~10% to ~50-90% jump is well-replicated across studies and populations, including physicians).
What it does NOT do: it does not invent the inputs. The base rate, the true-positive rate, and the false-positive rate must be real; the format makes correct reasoning from those numbers tractable, it does not supply them. And it applies only where there is genuine conditional-probability structure - it is not a general forecasting tool (that is reference-class forecasting).
4. Transferred-evidence flag
The evidence is from human reasoners and is transferred to AI use, not validated on AI. The model can do the arithmetic, so the value for AI is the same as for humans, plus communication: the format forces the base rate to be used (countering base-rate neglect in the model’s own answers and in how it explains risk), and it produces an inspectable frequency breakdown rather than a bare percentage. The model must still refuse to fabricate the input rates.
5. When it works / when it fails
Works best when: interpreting a test or screening result (medical, fraud, security, lead-scoring, A/B); any “given a positive signal, what is the real probability” question; communicating risk to others.
Fails or misleads when (poor-fit / anti-patterns):
- No real input rates - inventing the base rate or hit rate is worse than admitting they are unknown (the central failure for an AI).
- Ignoring the base rate (base-rate neglect) - the very error the method exists to fix.
- Confusing P(positive | condition) with P(condition | positive) - state which is which.
- No conditional-probability structure (then this is the wrong tool).
- General project forecasting (use reference-class forecasting).
6. Output artifact
A natural-frequency breakdown: the question; the inputs (base rate, true-positive rate, false-positive rate) with sources or an explicit missing-data flag; a frequency tree over a concrete population (e.g., 1,000); the computed posterior; and a plain-language statement of what it means plus the common wrong intuition it corrects.
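The "sources or an explicit missing-data flag" requirement could be carried in a small record type. This is a hedged sketch: `RateInput`, its fields, and `render_inputs` are hypothetical names invented for illustration, not part of the skill or its template.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RateInput:
    """One input rate with its provenance (hypothetical structure)."""
    label: str
    value: Optional[float]  # None means the rate is unknown
    source: Optional[str]   # where the number came from, if known


def render_inputs(inputs: List[RateInput]) -> List[str]:
    """Render each input with its source, or an explicit missing-data flag."""
    lines = []
    for item in inputs:
        if item.value is None:
            # Flag the gap instead of inventing a plausible-looking rate.
            lines.append(f"- {item.label}: MISSING - not estimated")
        else:
            lines.append(f"- {item.label}: {item.value:.0%} - source: {item.source}")
    return lines


# Example call: one known rate and one unknown rate.
example = [
    RateInput("Base rate", 0.05, "historical share of accounts that became opportunities"),
    RateInput("False-positive rate", None, None),
]
for line in render_inputs(example):
    print(line)
```

Keeping `value` optional makes a missing rate a visible state of the artifact rather than something a default could silently paper over.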
7. Sources
- Gigerenzer, G., & Hoffrage, U. (1995) - improving Bayesian reasoning with natural-frequency formats.
- Sedlmeier, P., & Gigerenzer, G. (2001) - teaching Bayesian reasoning (accuracy gains).
Verification status: the natural-frequency format effect and the rough accuracy gain from about 10% to 50-90% are well-attested; confirm exact figures against the papers before making a public quantified claim. The “must use real input rates” constraint is the honest core for AI use.