Belief-Update Routine

A belief-update routine re-scores a standing inventory of open beliefs against newly arrived evidence, on a cadence. For each tracked belief it records a prior confidence, the evidence accrued since the last review, a revised confidence with an explicit delta and direction, a stated reason for the size of the move, and a next-review trigger. The load-bearing move is the disciplined, recurring re-score of a portfolio over time: it makes under-updating (the robust human tendency to revise too little for the evidence) visible and correctable, and it keeps beliefs that should track reality - open forecasts, strategic theses, standing assumptions - honest against new information rather than quietly stale. The output is a structured belief-update ledger, not prose, designed to be reopened and re-scored on the next cadence.

When to Use

When you hold a set of consequential, genuinely open beliefs, forecasts, or standing assumptions that should track reality over time, and you want a recurring, disciplined re-score rather than a one-off.
When genuinely new evidence has arrived since you last examined them, so there is something real to update on.
When you suspect you are under-updating - holding a belief sticky as evidence accumulates against it - and you want the size of each update made explicit and checkable.
When you want to build calibration on standing beliefs over many cycles, not judge a single decision.

When NOT to Use

To record a single decision at the moment it is made. That is think-decision-journal, which fixes one prediction in place at commit time and forbids editing it afterward. This routine is the opposite shape: it deliberately re-scores a portfolio of open beliefs repeatedly. Do not use it to capture a one-off decision (you lose the journal’s contemporaneous lock), and do not use the journal to track evolving beliefs (you violate its do-not-edit rule).
To review a finished episode against what was expected. That is think-after-action-review, which needs a resolved outcome and emits sustain/change process actions. This routine operates on beliefs that are still open and emits revised confidences, not action items.
To surface the conditions under which one contested claim would be the best choice. That is think-what-would-have-to-be-true, which decomposes a single claim into its load-bearing conditions at one sitting. This routine re-scores a portfolio of beliefs on a cadence against accrued evidence.
When no genuinely new evidence has arrived. Re-scoring on a calendar with nothing new is reflection theater: it manufactures motion or just re-anchors the prior. If nothing material has changed, the honest entry is “no new evidence; no update,” not an invented delta.
For beliefs that never resolve and carry no real stakes. Re-scoring trivia delivers no calibration and no decision value; the overhead is unrecovered.

Instructions

When asked to run a belief-update routine, follow these steps:

Assemble or load the belief inventory. List the open beliefs being tracked, each as a one-line claim with its prior confidence (a percentage or band) and the date it was last scored. If this is the first run, capture the current confidence as the prior. Keep beliefs that are genuinely open and consequential; drop trivia that will never resolve.
Gather the evidence accrued since the last review, per belief. For each belief, list what has actually happened or come to light since the prior score - dated, sorted into for and against. If nothing material has arrived for a belief, say so explicitly; that belief’s honest update is “no change.”
Re-score each belief, with an explicit delta and direction. State the revised confidence and the change from the prior (for example, “65% -> 50%, down 15”). The delta and its direction are load-bearing: a re-score with no stated change is not an update.
Justify the size of each move, guarding against under-updating. For each non-trivial change, say why the move is that size given the strength of the evidence - and explicitly ask whether it is large enough (the conservatism guard) or whether you are clinging to the prior. For “no change,” confirm that is because no real evidence arrived, not because the belief is sticky.
Set a next-review trigger per belief. A date, or the specific signal that should force an earlier re-score (a result lands, a metric crosses a line). This is what makes the ledger a recurring routine rather than a one-time note.
Flag beliefs that have effectively resolved. If a belief has now resolved, mark it for retirement from the open inventory and, where a decision rode on it, point to think-after-action-review to close the loop.
Emit the belief-update ledger per references/TEMPLATE.md.

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the filled belief-update ledger - a dated docket of beliefs, each with a prior confidence, the evidence accrued, a revised confidence with an explicit delta and direction, a reason for the size of the move, and a next-review trigger - not a prose essay about how your thinking has evolved.

Quality Checklist

Before finalizing, verify:

Every tracked belief is a concrete one-line claim with a prior confidence (percentage or band) and a date.
The evidence accrued since the last review is listed per belief, dated, and sorted for/against - or explicitly noted as “none material.”
Each revised confidence states an explicit delta and direction (not just a new number).
The size of each non-trivial move is justified against the evidence, with the under-updating guard applied (is the move large enough?).
“No update” entries are confirmed as “no new evidence,” not unexamined stickiness.
Each belief has a next-review trigger (a date or a forcing signal).
The output is the belief-update-ledger artifact, not prose.
No overclaiming: the ledger surfaces and guards against under-updating and enables calibration over time; it is not claimed to make beliefs more accurate by a stated amount, and it is honest that its direct evidence is limited (see evidence/dossier.md).

Evidence

Tier P (practitioner). The mechanism rests on a real, robust bias - people under-update relative to the evidence (conservatism; Edwards 1968) - and on the finding that incremental, evidence-weighted updating tracks accuracy in scored-forecasting regimes (Atanasov et al. 2020). But the routine itself is barely tested directly - the direct experimental tests of reflection-prompted belief revision are sparse and weak - and its typical use (fuzzy, non-resolving beliefs with no score) sits outside the scored-forecasting regime where the supporting evidence was gathered. So the skill claims the under-updating guard and the calibration-enabling record, and does not advertise an effect size or a guaranteed accuracy gain. The evidence is transferred from human studies and has not been validated for AI-augmented use. Full grading, sources, and caveats: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed belief-update ledger on a real set of open beliefs.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Belief-Update Ledger - Worked Example

A completed run of the belief-update-routine skill on a real set of open beliefs. This continues the shared Northwind thread: the think-decision-journal example locked in the free-tier launch decision on 2026-05-31 at 60% confidence with five named assumptions. This ledger is the first quarterly re-score of those still-open beliefs, three months post-launch - before the final outcome is in, which is the point (the final resolved review is an after-action review, not this). It shows a small move, a large move, a moderate move, and an honest no-change. This is the quality bar a generated ledger should meet.

Ledger header

Review date: 2026-08-31
Cadence: Quarterly, or earlier on a forcing signal (a metric crossing a pre-set line)
Owner: VP Product
Beliefs tracked: 4 open · 0 resolved-this-cycle

Belief portfolio (at-a-glance)

#	Belief (one line)	Prior	Revised	Delta	Next review
1	Free tier feeds paid rather than cannibalizing it	60%	55%	-5	2026-11-30 (Q4 AAR)
2	Support/infra cost per free user stays within the modeled ceiling	70%	35%	-35	2026-09-30 (cost review)
3	Free attracts ICP-fit users, not a never-converting segment	65%	45%	-20	2026-11-30
4	Sales comp + lead-routing redesign prevents rep resentment	75%	75%	no change	2026-11-30

Belief entries (the detail)

1. The free tier feeds paid (net-new paid conversions lift) rather than cannibalizing it

Prior confidence: 60% (scored 2026-05-31, at launch commitment)
Evidence accrued since last review:
- For: Sign-up volume is tracking ~3x as predicted (2026-07); sales-led paid MRR has held flat, not declined (2026-08), so the feared cannibalization has not shown up yet.
- Against: Free-to-paid conversion among ICP-fit users is running below the model (2026-08), so the “feeds paid” half is unproven and slower than hoped.
Revised confidence: 55% - down 5 points
Reason for the size of the move: Small on purpose. The big downside risk (cannibalization of paid) has not materialized, which is mild good news; the conversion shortfall is real but the data is early and noisy (one quarter, small paid-cohort). Under-updating guard: is -5 too small (am I clinging to the 60%)? No - the evidence genuinely cuts both ways and is thin, so a small net move is honest, not sticky. A bigger move would over-react to one noisy quarter.
Next-review trigger: 2026-11-30 Q4 after-action review (the decision-journal review date), when the conversion cohort is large enough to score.
Status: open

2. Support and infra cost per free user stays within the modeled ceiling

Prior confidence: 70% (assumption confidence at launch, 2026-05-31)
Evidence accrued since last review:
- Against: Support tickets from free users ran ~2x the model in month 1 (2026-07); measured cloud cost per free user is ~1.6x the modeled ceiling (2026-08).
- For: none material.
Revised confidence: 35% - down 35 points
Reason for the size of the move: Large, and warranted. This is direct, consistent, against-evidence on the exact quantity the assumption was about, two months running. Under-updating guard: the temptation is to soften it (“it’s early, costs will optimize”) and stay near 70% - that would be exactly the conservatism this routine exists to catch. The data is clear enough to move hard; a -35 is the honest size.
Next-review trigger: 2026-09-30 dedicated cost review (pulled earlier than the quarterly cadence because this belief crossed a pre-set cost line - the forcing signal).
Status: open

3. “Free” attracts ICP-fit users, not a different, never-converting segment

Prior confidence: 65% (assumption confidence at launch)
Evidence accrued since last review:
- Against: ~50% of free sign-ups are outside the ICP (students, evaluators, tire-kickers) per the 2026-08 cohort analysis.
- For: the ICP-fit half is engaging at expected depth.
Revised confidence: 45% - down 20 points
Reason for the size of the move: Moderate. The off-ICP share is a real, measured signal against the assumption, but the cohort is early and acquisition targeting is still being tuned, so some of the off-ICP mix may be fixable rather than structural. A -20 reflects a genuine update without over-committing to “free attracts the wrong crowd” on one quarter’s cohort.
Next-review trigger: 2026-11-30, after a quarter of tuned acquisition targeting.
Status: open

4. The sales comp and lead-routing redesign prevents rep resentment of the motion

Prior confidence: 75% (assumption confidence at launch)
Evidence accrued since last review: No material new evidence. The redesign shipped before launch; there have been no rep escalations or pipeline complaints, but it is too early and too quiet to read as confirmation.
Revised confidence: 75% - no change
Reason for the size of the move: Honest no-update. Nothing material has arrived for or against, so the confidence stays put. This is not stickiness - it is the absence of evidence, stated as such rather than dressed up as a small confirming bump.
Next-review trigger: 2026-11-30, or sooner if a rep escalation lands.
Status: open

Value added: three months after the launch decision was locked, this ledger re-scores the open beliefs against what has actually come in - and it does the two things an unaided “how’s it going?” review usually skips. It states an explicit delta and direction on every belief (so you can see it moved 1 small, 1 hard, 1 moderate, 1 not at all), and on the cost belief it applies the under-updating guard to force a large move the prior would have resisted. It pairs with the decision journal (which fixed the priors at commitment) and points forward to the Q4 after-action review (which will score the resolved outcome). The ledger makes no claim that re-scoring made the launch succeed; its value is keeping the belief portfolio honest against the evidence, and the calibration trail it builds over many such cycles.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Belief-Update Routine

The single source of truth for the belief-update-routine skill. The SKILL.md, the sidecar (skill.meta.yml), and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill.


Skill	`thinking-framework-skills.belief-update-routine` (installable name `think-belief-update-routine`)
Family	meta-thinking-and-reflection
Evidence tier	P (practitioner - sound mechanism, but the move itself is barely tested directly; see “What the evidence shows”)
Confidence	Moderate that under-updating is real and that incremental, evidence-weighted revision is the right correction; low that a periodic belief-review routine measurably improves outcomes outside scored-forecasting regimes
Status	draft (first authored 2026-06-03, from the multi-agent vetting round)

1. The mechanism (what actually does the work)

Counteracts conservatism (under-updating). People systematically revise their beliefs less than the evidence warrants - the robust “conservatism” finding. Forcing an explicit confidence delta with a stated reason for its size makes under-updating visible and correctable, rather than letting a belief quietly stay sticky as evidence piles up against it.
Turns a vague drift into a scored, checkable record. A prior confidence, a dated evidence tally, and a revised confidence convert “I guess I feel a bit less sure now” into a delta that can be inspected and, over many cycles, calibrated. It is the recorded-revision half of a calibration loop.
Keeps a belief portfolio honest over time. Beliefs that should track reality (open forecasts, strategic bets, standing assumptions) are revisited on a schedule against new evidence, so a stale belief is caught by the cadence rather than by a crisis.

The mechanism is what we implement. “Belief-update routine” is descriptive packaging; the durable move is the cadenced, evidence-weighted re-score of an inventory of open beliefs, each with an explicit confidence delta, a guard against under-updating, and a next-review trigger.

2. Lineage

Normative belief updating is Bayesian: a belief’s probability should move in proportion to the likelihood ratio of new evidence. The routine is the practitioner operationalization of “update toward the evidence, by an amount that reflects how strong it is.”
Conservatism in human updating - the finding that people update too little relative to Bayes - is the classic counter-bias this routine targets: Edwards, W. (1968), “Conservatism in human information processing,” in Kleinmuntz (ed.), Formal Representation of Human Judgment.
Incremental updating and forecasting accuracy: Atanasov, P., Witkowski, J., Ungar, L., Mellers, B., & Tetlock, P. (2020), “Small steps to accuracy: Incremental belief updaters are better forecasters,” Organizational Behavior and Human Decision Processes 160 - forecasters who update in frequent small increments are more accurate than those who update rarely in large jumps.
Analytic / actively open-minded thinking and belief revision: Tappin, B. M., Pennycook, G., & Rand, D. G. (2020) and the AOT literature relate a disposition to revise beliefs on evidence to more normative updating - a dispositional cousin, not a test of a routine.
The practitioner routine of periodically revisiting “what do I believe and has the evidence changed?” appears in forecasting practice (Tetlock’s superforecasters keep updating), rationalist writing, and strategy review.

No trademark. “Belief-update routine” is a generic, descriptive term; no attribution is required and none is claimed.

3. What the evidence shows, and what it does NOT show

This is the honest core of the dossier. The skill must not overclaim, and must not borrow the forecasting literature’s robustness for a move it does not test.

What is reasonably supported:

Under-updating (conservatism) is real and robust. Edwards (1968) and the subsequent literature show people revise probabilities less than a Bayesian would given the same evidence. The routine’s central guard - make the size of the update explicit and ask whether it is large enough - targets a genuine, well-documented bias.
In scored-forecasting regimes, incremental evidence-weighted updating tracks higher accuracy. Atanasov et al. (2020) found incremental updaters outperform; the calibration literature (Tetlock) shows recorded predictions plus scoring improve calibration. So where beliefs are forecasts that later resolve and get scored, the move has real support.

What is NOT shown (the caveat that keeps the skill honest):

The routine itself is barely tested directly. The direct experimental tests of reflection-prompted belief revision are sparse and weak (small samples, mixed and often non-significant results); there is no robust controlled evidence that a periodic belief-review routine improves beliefs. The leap from “under-updating is real and incremental updating helps forecasters” to “a periodic belief-review routine improves your beliefs” is plausible but not established by controlled study of the routine.
The forecasting evidence is regime-bound. Atanasov et al. and the calibration results live in scored, resolving forecasting (a probability, a deadline, a Brier score). The routine’s typical use - fuzzy, slow-moving, non-resolving beliefs (a strategy thesis, a standing assumption) with no score - sits outside that evidence. Do not advertise the forecasting effect sizes for it.
It only pays off when genuinely new evidence has arrived. Re-scoring on a calendar when nothing has changed is reflection theater: it manufactures motion (or anchors you to the prior) without information. The benefit is contingent on real new evidence between reviews.

Net grade: P (practitioner). A genuinely useful discipline with a sound mechanistic rationale (conservatism is real; incremental evidence-weighted updating helps where beliefs are scored) but limited and weak direct evidence that the routine improves beliefs, and a typical use that sits outside the regime where the supporting evidence was gathered. The skill should claim the under-updating-guard and the honest-record/calibration-enabling benefits, and must not advertise an effect size or a guaranteed accuracy gain.

4. Transferred-evidence flag (required honesty for this library)

All of the evidence above comes from human subjects - lab studies of conservatism, forecasting tournaments, and human reflection studies. There is no direct study of a belief-update routine run by, or with, an AI agent, nor of whether an agent-produced belief ledger improves a human’s later calibration. The evidence is therefore transferred from human contexts, not validated for AI-augmented use. This skill must say so. Treat the AI value as: the agent makes the recurring re-score cheap and structured (the friction of “sit down and re-examine your standing beliefs” is what kills the practice in humans), forces an explicit delta and a reason for its size (surfacing under-updating), and produces a durable, reviewable ledger - benefits that do not depend on the unproven accuracy-improvement claim.

5. When it works / when it fails (drives the eval negative cases)

Works best when:

There is a standing inventory of open beliefs (forecasts, strategic theses, key assumptions) that should track reality over time, and a cadence to revisit them.
Genuinely new evidence has arrived since the last review, so there is something to update on.
The beliefs are consequential and uncertain, and you want the under-updating guard and a scored record.

Fails or misleads when (poor-fit / anti-patterns - the hard walls):

Recording a single decision at the moment it is made is think-decision-journal, which fixes one prediction in place at commit time and forbids editing it afterward. The belief-update routine is the opposite shape: it deliberately re-scores a portfolio of open beliefs repeatedly over time. Using belief-update to capture a one-off decision loses the journal’s contemporaneous-lock; using the journal to track evolving beliefs violates its do-not-edit rule. (Key near-miss.)
Reviewing a finished episode against what was expected is think-after-action-review, which needs a resolved outcome and emits sustain/change process actions. Belief-update operates on beliefs that are still open (not yet resolved) and emits revised confidences, not action items. (Key near-miss.)
Surfacing the conditions under which one contested claim would be the best choice is think-what-would-have-to-be-true: it decomposes a single claim into its load-bearing conditions at one sitting. Belief-update re-scores a portfolio of beliefs on a cadence against accrued evidence; it is not a one-claim condition analysis. (Key near-miss.)
Reflection theater: re-scoring on a schedule when no new evidence has arrived. With nothing new, the routine either invents motion or just re-anchors the prior - cost with no information. If nothing has changed, the honest entry is “no material new evidence; no update,” not a manufactured delta.
Beliefs that never resolve and carry no real stakes - re-scoring trivia delivers no calibration and no decision value; the overhead is unrecovered.

6. Output artifact

The skill must emit a belief-update ledger, not prose: a dated docket where each tracked belief carries a one-line claim, a prior confidence (% or band), the evidence accrued since the last review (a for/against tally with dates), a revised confidence with an explicit delta and direction, the reason for the size of the move (naming the guard against under-updating - was the update large enough given the evidence?), and a next-review trigger (a date or the specific signal that forces a re-score). The artifact is the deliverable; a discursive “here’s how my thinking has evolved” essay is not. It is designed to be reopened on the next cadence and re-scored again.

7. Sources

Edwards, W. (1968), “Conservatism in human information processing,” in B. Kleinmuntz (ed.), Formal Representation of Human Judgment, Wiley - establishes systematic under-updating relative to Bayes.
Atanasov, P., Witkowski, J., Ungar, L., Mellers, B., & Tetlock, P. (2020), “Small steps to accuracy: Incremental belief updaters are better forecasters,” Organizational Behavior and Human Decision Processes 160:19-35 - incremental evidence-weighted updating tracks forecasting accuracy.
Tappin, B. M., Pennycook, G., & Rand, D. G. (2020), work relating analytic / actively open-minded thinking to more normative belief updating - a dispositional cousin.
Tetlock, P., & Gardner, D. (2015), Superforecasting - recorded probabilistic predictions plus scoring and frequent small updates as the basis for calibration (scored-regime evidence).

Verification status: citations 1, 2, and 4 are standard and well-attested in the judgment-and- decision-making and forecasting literature. Citation 3 (Tappin/Pennycook/Rand) is a dispositional correlate, cited as lineage, not as a test of the routine. The direct experimental evidence for the routine itself is sparse and weak (small, mixed studies, no robust controlled effect), so the skill takes the conservative reading and advertises no effect size. The load-bearing caveat - that the typical non-resolving use sits outside the scored-forecasting regime where the supporting evidence was gathered - does not depend on any single direct study.

Thinking Framework Skills v0.3.0 · 38 frameworks