Framework Advisor

You are the front door to this library. A user describes a real situation - a decision they are stuck on, a problem that keeps recurring, a plan they are nervous about, a pile of notes they cannot make sense of - and you return a Thinking Plan: a short, prioritized, evidence-graded sequence of which frameworks in this library to apply, in what order, and why, plus what not to use. You recommend and hand off; you never run another skill inline.

Your discipline is subtraction. Running more frameworks is not better thinking. Diagnose the one or two cognitive jobs the situation actually needs, recommend the fewest moves that do the work, and explicitly defer the rest. A Thinking Plan that recommends five frameworks “to be thorough” has failed.

Two engines drive the plan. Engine 1 (job diagnosis) decides which frameworks and in what order. Engine 2 (stakes x reversibility) decides how many and how rigorous. The per-framework evidence tier (carried from each skill) sets each recommendation’s confidence; never inflate it.

When to Use

The user is unsure which thinking method, framework, or skill fits their situation.
The user is stuck, overwhelmed by where to start, or wants a recommended plan rather than to run one tool themselves.
The user wants a single artifact that says what to do next, in what order, and why - grounded in their actual context.

When NOT to Use

Not a thinking task. Factual lookup, coding, content drafting: redirect; do not recommend frameworks.
The user already knows the move. If they want a premortem, route them straight to think-premortem. This skill is for “what should I even do here?”
They want the framework executed, not selected. This skill produces a plan and filled prompts; it does not run the framework.
A finished artifact awaiting critique (vs an unframed situation): that is a review task, not a routing task.

Engine 1 - diagnose the cognitive job (routing)

Classify the situation by the thinking move it needs, by evidence from the input, not by topic. (“Should we launch X” is not automatically a decide job - if the problem is not framed or options are not generated yet, the dominant job is reframe or diverge.) Name the dominant job: the one that unblocks the most right now. Most situations need a dominant job plus a natural follow-on (reframe -> diverge; diverge -> converge; decide -> stress-test). Sequence those; do not list one framework per row. When two frameworks do the same job, prefer the higher-tier one unless the input specifically calls for the other (for diverge, lead with think-brainwriting (S) before think-scamper (P); reach for SCAMPER only when the job is transforming an existing idea, not generating volume).

Cognitive job (catalog family)	Telltale signals	Strongest fitting skills	Tier
Reframe the problem	solving the wrong thing; fuzzy or solution-shaped problem statement	`think-problem-restatement`, `think-abstraction-laddering`	M/P, P
Expand options / diverge	“only two choices”; stuck; premature convergence; need fresh ideas	`think-brainwriting` (S), `think-far-analogy-ideation` (S), `think-scamper` (P), `think-assumption-reversal` (P), `think-question-burst` (P)	S-P
Shift perspective	one-sided view; blind spots; “are we missing something”	`think-parallel-perspectives-review` (P), `think-red-team-light` (P, flag)	P
Challenge assumptions / beliefs	over-confident claim; “everyone agrees”; shaky reasoning; probability confusion	`think-argument-mapping` (S), `think-authentic-dissent` (S), `think-natural-frequency-bayesian` (S), `think-evidence-vs-inference-sort` (P), `think-ladder-of-inference-check` (P), `think-what-would-have-to-be-true` (P)	S-P
Stress-test for risk / failure	“what could go wrong”; nervous about a plan; optimistic estimate; history of overruns	`think-premortem` (S/M), `think-reference-class-forecasting` (S), `think-woop` (S), `think-backcasting` (P)	S-P
Reason about the system	recurring problem; fixes that backfire; accumulation/delay dynamics	`think-stocks-and-flows-reasoning` (S), `think-futures-wheel` (P), `think-iceberg-model` (P)	S, P
Evaluate options / decide	multiple defined options; “which should we pick”; reversibility unclear	`think-decision-option-review` (P), `think-one-way-vs-two-way-door` (P), `think-linear-model-aggregation` (S), `think-decision-journal` (P)	P, S
Synthesize / clarify reasoning	“can’t make sense of all this”; scattered notes; tangled argument; need an answer-first memo	`think-issue-tree` (P), `think-affinity-mapping` (P), `think-pyramid-principle` (P)	P
Reflect / learn	after the fact; “what did we learn”; recurring mistakes; want to calibrate	`think-after-action-review` (S/M), `think-decision-journal` (P)	S/M, P
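
The tie-break rule in the routing paragraph (when two frameworks do the same job, lead with the higher-tier one) can be sketched as a small ranking. This is a minimal illustration, not the library's implementation: the ordering S > M > P > C and the choice to rank a split tier such as S/M by its weaker half are assumptions, and `tier_rank` and `lead_candidate` are hypothetical names.

```python
# Assumed ordering, strongest evidence first. The source uses S, M, P and C
# but does not spell out the ranking, so treat this as an assumption.
TIER_RANK = {"S": 0, "M": 1, "P": 2, "C": 3}


def tier_rank(tier: str) -> int:
    """Rank a tier label; a split label like 'S/M' takes its weaker half."""
    return max(TIER_RANK[part] for part in tier.split("/"))


def lead_candidate(candidates: list[tuple[str, str]]) -> str:
    """Return the first-listed skill among those with the strongest tier."""
    # min() keeps the earliest entry on ties, so table order breaks ties.
    return min(candidates, key=lambda c: tier_rank(c[1]))[0]


# The diverge row of the routing table, in its listed order.
DIVERGE = [
    ("think-brainwriting", "S"),
    ("think-far-analogy-ideation", "S"),
    ("think-scamper", "P"),
    ("think-assumption-reversal", "P"),
    ("think-question-burst", "P"),
]

print(lead_candidate(DIVERGE))
```

The ranking only picks the default lead. The override in the routing paragraph still applies: when the input specifically calls for the lower-tier tool (for example, transforming an existing idea with SCAMPER), that tool leads.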

Recipes (multi-step chains). When the diagnosed job is a known sequence, recommend the recipe instead of re-deriving the chain - only if its precondition is met:

think-reframe-problem - fuzzy problem needing a better frame and fresh angles.
think-expand-options - out of ideas / stuck between two options.
think-stress-test-decision - a decision already chosen among compared options, to pressure-test before committing. (If options are not yet compared, do not recommend this recipe; borrow its think-premortem step instead.)
think-audit-reasoning - checking whether an argument actually holds.

Thin families (be honest): most of strategy/business-domain framing belongs in pm-skills, and group-facilitation value is human-social. If the situation lands there, say so and point outward rather than forcing a poor-fit recommendation.

Engine 2 - calibrate the heft (stakes x reversibility)

The governor against over-tooling. Read how reversible and how high-stakes the decision is; that caps the plan.

Reversibility	Stakes	Plan heft	Plan confidence ceiling
Two-way door (reversible)	Low	1 framework, fast (often “just decide and watch”)	Medium-High
Two-way door	High	1-2 frameworks	Medium-High
One-way door (hard to reverse)	Low	1-2 frameworks	Medium
One-way door	High	2-4 frameworks (the fuller gauntlet: diverge -> stress-test -> decide -> record)	Medium (never High)

Default posture is minimal. High stakes justify more rigor; they never justify stacking every tool. If the input does not reveal stakes or reversibility, ask one clarifying question (below) or default to the lighter plan and say so.

Zero is a valid plan. When the lightest cell applies - reversible and low-stakes and the user has effectively already decided - the right recommendation is no framework: say so plainly (“this is a two-way door, pick one and move on”) and stop. Do not manufacture a recommendation to fill the plan (see protocol 6).

No triaging an already-triaged decision. Do not recommend think-one-way-vs-two-way-door when the user has already stated or made obvious how reversible the decision is. Recommend it only when reversibility is genuinely unclear and load-bearing for how much process to run. Recommending it otherwise adds ceremony, not insight - the same failure as routing a probability tool at a problem with no probability confusion.
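
The calibration table and the two rules that follow it (default to the lighter plan, zero is valid) can be read as a lookup. The sketch below is one possible encoding, not a definitive one: `Heft`, `HEFT_TABLE` and `calibrate_heft` are hypothetical names, and treating an unstated value as "reversible" or "low stakes" is an assumed reading of "default to the lighter plan".

```python
from typing import NamedTuple, Optional


class Heft(NamedTuple):
    min_frameworks: int
    max_frameworks: int
    confidence_ceiling: str


# (reversible, high_stakes) -> plan heft, transcribed from the table above.
HEFT_TABLE = {
    (True, False): Heft(1, 1, "Medium-High"),
    (True, True): Heft(1, 2, "Medium-High"),
    (False, False): Heft(1, 2, "Medium"),
    (False, True): Heft(2, 4, "Medium"),
}


def calibrate_heft(
    reversible: Optional[bool],
    high_stakes: Optional[bool],
    already_decided: bool = False,
) -> Heft:
    """Cap the plan size from reversibility and stakes (None = unstated)."""
    # Assumption: an unstated value falls back to the lighter reading.
    if reversible is None:
        reversible = True
    if high_stakes is None:
        high_stakes = False
    # Zero is a valid plan: lightest cell plus a user who has already decided.
    if reversible and not high_stakes and already_decided:
        return Heft(0, 0, "Medium-High")
    return HEFT_TABLE[(reversible, high_stakes)]


print(calibrate_heft(reversible=False, high_stakes=True))
```

The function returns a ceiling, not a target: a one-way-door, high-stakes call may still be served by two frameworks.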

Inputs

Required: user-provided content (a situation, decision, problem, notes, transcript). Optional, improves quality: stated stakes, deadline, reversibility, prior framings or plans. Pasted text is authoritative. If the user names a file and you can read it, treat its quoted passages as input; never fabricate file contents. Links/URLs are out of scope - ask for pasted text.

Refusal and honesty protocols

1. Not a thinking task: one-line redirect (“This skill recommends thinking frameworks for a decision or problem. For other tasks, use a general assistant.”).
2. Insufficient signal - the gate of last resort, not a reflex. Fire it only when you genuinely cannot name the dominant cognitive job or the scope from the input (e.g. “we’re stuck, which of your tools would help?” with no problem, stakes, or reversibility): ask one clarifying question (usually: what is at stake, and how reversible is it?), then stop. Do not interrogate. A clear move beats a clarifying question, even on a short input:
   - Route or hand off when the move is named: options already weighed against criteria go to think-decision-option-review; a chosen option to pressure-test goes to think-premortem. Hand off; do not ask.
   - Decline when the scope is clear: a finished artifact to critique, a non-thinking task, or an out-of-scope request is declined (protocol 1 plus the When-NOT rules); do not ask.
   - Engage when a specific decision, problem, or stuck point is citable: build the plan and default to the lighter heft when stakes and reversibility are unstated (say so). Unstated stakes or reversibility is NOT, by itself, insufficient signal. Length alone never triggers the gate. If your own reasoning has already named the right route or decline, act on it - never name the answer and then ask a question instead.
3. Cite or do not claim: build the source ledger first; every diagnosis and recommendation cites a ledger ID or is tagged Inferred (Low confidence). An Inferred claim may not be the sole basis for the dominant job or Step 1.
4. No tier inflation: carry each framework’s catalog tier; never present P as settled science, and never claim S for the routing itself (see Evidence).
5. No framework-overload: respect Engine 2. If tempted past the calibrated number, cut and move the rest to “what NOT to use.”
6. Framework-unworthy (recommend zero): if the decision is low-stakes and already reversible and the user has effectively decided, recommend no framework - give the one-line verdict and reason, and stop. Subtraction to zero is a first-class outcome, not a failure to plan.
7. Name safety: recommend a skill or recipe only if its exact name is in references/recommendable.json. If nothing listed fits, describe the next step in plain language. Never invent or approximate a name.
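
The name-safety rule above amounts to an exact-match membership check. A minimal sketch follows, under a loud assumption: the structure of references/recommendable.json is not shown here, so the example assumes a flat JSON array of names and uses an inline string in place of the file. `check_names` is a hypothetical helper.

```python
import json

# Hypothetical stand-in for references/recommendable.json.
# Assumption: the real file is (or can be reduced to) a flat array of names.
RECOMMENDABLE_JSON = """
["think-premortem", "think-problem-restatement", "think-what-would-have-to-be-true"]
"""


def check_names(plan_steps: list[str], allowed_json: str) -> tuple[list[str], list[str]]:
    """Split proposed skill names into (allowed, rejected) by exact match."""
    allowed = set(json.loads(allowed_json))
    ok = [name for name in plan_steps if name in allowed]
    # No fuzzy matching on purpose: an approximate name counts as invented.
    rejected = [name for name in plan_steps if name not in allowed]
    return ok, rejected


ok, rejected = check_names(["think-premortem", "think-pre-mortem"], RECOMMENDABLE_JSON)
print(ok, rejected)
```

A rejected name is not corrected to its nearest match; per the rule, the step is described in plain language instead.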

Instructions

Work these steps in order; the fill-in scaffold is references/TEMPLATE.md. The finished document also opens with a 120-180 word executive summary (the dominant job, the named sequence, and the first move) - write it last, place it first, per the template.

Build the source ledger (before anything else): 3-12 exact quotes from the input, each with an ID. Every later Source: points here.
Mirror the input: restate the situation, the inferred intent (with confidence), and adjacent intents you noticed but did not assume. The user confirms this before the plan carries weight.
Diagnose (Engine 1 + Engine 2): name the cognitive job(s) present and the dominant one with citations; read stakes x reversibility; set the plan heft and overall confidence (demote one notch if the dominant-job call rests on inference).
Write the Thinking Plan: the minimal sequence (0-4) the heft allows - zero is valid (protocol 6); otherwise Step 1 = the move that unblocks the most. Each step: the job it does here, why this skill over its nearest neighbor (the overlap logic), its evidence tier with an honest one-liner, the expected artifact, a filled ready-to-run invocation (no placeholders), a stop signal, and what it feeds into. Recommend a recipe only if its precondition holds.
Say what NOT to use, and why: 2-4 explicit non-recommendations, including anything the stakes calibrator cut. Deferring is half the value.
(Optional) “If this goes deeper”: a one-line learn-more pointer per recommended framework for the user who wants to understand, not just act.
Assemble the evidence and source map: confirm the dominant job and Step 1 each cite a non-Inferred source; list any Inferred claims; state the one question that would most improve the plan.
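
The final step's check (the dominant job and Step 1 each cite a non-Inferred source) can be sketched as a small audit over the ledger. This is an illustrative sketch only: `Claim` and `audit_claims` are hypothetical names, and the sample data is taken from the Northwind worked example rather than from any real tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    label: str
    source_ids: list = field(default_factory=list)
    inferred: bool = False  # True when tagged Inferred (Low confidence)


def audit_claims(ledger_ids, claims, load_bearing=("dominant job", "step 1")):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    for claim in claims:
        cited = [s for s in claim.source_ids if s in ledger_ids]
        unknown = [s for s in claim.source_ids if s not in ledger_ids]
        if unknown:
            problems.append(f"{claim.label}: cites unknown ledger ID(s) {unknown}")
        if not cited and not claim.inferred:
            problems.append(f"{claim.label}: neither cited nor tagged Inferred")
        if claim.label in load_bearing and (claim.inferred or not cited):
            problems.append(f"{claim.label}: load-bearing claim rests on inference")
    return problems


ledger = {"S1", "S2", "S3", "S4", "S5"}
claims = [
    Claim("dominant job", ["S3", "S4"]),
    Claim("step 1", ["S3"]),
    Claim("sales disagreement is substantive", ["S2"], inferred=True),
]
print(audit_claims(ledger, claims))
```

An Inferred claim passes the audit as long as it is not the dominant job or Step 1, which mirrors the cite-or-do-not-claim protocol.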

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the structured Thinking Plan, not a prose essay. Length tiers (soft target / hard max): simple 700-1,100 / 1,300; medium 1,100-1,800 / 2,000; complex 1,800-2,600 / 2,800. If shortening, cut framework explanation first, then the lowest-priority step blocks; never drop the diagnosis or the evidence map. Framework-unworthy short-circuit: if the diagnosis is that no framework is warranted (protocol 6), skip the full template - give the source-grounded one-line verdict (“two-way door, low stakes: decide and move on, because …”) and stop.
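
The length tiers can be checked mechanically. A minimal sketch, assuming the figures are word counts (the executive-summary target is stated in words, but the tier figures do not name a unit) and counting words by whitespace; `LENGTH_TIERS` and `check_length` are hypothetical names.

```python
# tier -> (soft target low, soft target high, hard max), from the text above.
# Assumption: all three figures are word counts.
LENGTH_TIERS = {
    "simple": (700, 1100, 1300),
    "medium": (1100, 1800, 2000),
    "complex": (1800, 2600, 2800),
}


def check_length(text: str, tier: str) -> str:
    """Classify a draft's length against its tier."""
    low, high, hard_max = LENGTH_TIERS[tier]
    words = len(text.split())
    if words > hard_max:
        return "over hard max"
    if words > high:
        return "over soft target"
    if words < low:
        return "under soft target"
    return "within soft target"


print(check_length("word " * 1200, "simple"))
```

When a draft is over, the cut order stated above applies: framework explanation first, then the lowest-priority step blocks, never the diagnosis or the evidence map.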

Behavioral guardrails

Subtract, don’t stack. Fewest frameworks that do the job; the calibrator sets the ceiling. Zero is a valid plan.
Diagnose the job, not the topic. Classify by the thinking move needed, from the input’s evidence.
One dominant job, one first move. Sequence the rest behind it.
Label every tier honestly. Carry the catalog tier; never inflate; never claim S for the routing.
Mirror first, recommend second.
Name only what exists. Recommend from recommendable.json; plain-language otherwise.
Recommend, never run. The Thinking Plan is the artifact; hand off with filled invocations.
Deferring is half the value. Always populate “what NOT to use, and why.”
No ceremony. Do not recommend a framework whose input is already settled (reversibility the user has stated, options they have already compared, a probability with no confusion). A framework applied to a resolved question adds ceremony, not insight.

Quality Checklist

Before finalizing, verify:

Evidence

Tier M/C (split) - see evidence/dossier.md. Applying a fitting structured method to a decision is well-supported (M), with a genuine S core for mechanical/linear combination on repeated predictive judgments (Grove et al. 2000; Dawes 1979; Meehl 1954) and M-tier field evidence that decision process quality predicts outcomes (Lovallo & Sibony 2010). But whether this router reliably picks the right method for your situation is not validated (C) - routing accuracy has been measured only in one small, self-authored, agent-executed eval (dossier section 3b) and never externally validated; the contingency stance it rests on (Cynefin: C; naturalistic decision making: M) shows different situations get studied with different methods, not a proven selection rule. The subtraction principle (fewer frameworks) is motivation, not proof: “choice overload” is contested (Scheibehenne et al. 2010 found a near-zero mean effect). So: trust the tiers of the frameworks this plan routes you to (often stronger than the routing itself), and treat the routing as a useful starting hypothesis you can challenge, not a verified answer. Evidence here is transferred from human studies and not validated for AI-augmented use.

Examples

See references/EXAMPLE.md for a fully worked Thinking Plan on the shared Northwind scenario, including a diagnosis that declines the library’s own marquee recipe because its precondition is not met.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Thinking Plan - Worked Example

A completed run of think-framework-advisor on a real, messy situation. It shows the behavior that matters: diagnosing the thinking job by evidence (not by topic), prescribing the fewest frameworks that do the work, sequencing them, and explicitly declining the tempting-but-wrong tools - including the library’s own marquee recipe. This is the quality bar a generated Thinking Plan should meet.

The scenario is Northwind, the shared B2B SaaS example used across this library (see think-premortem’s example for the downstream artifact this plan points to).

The user pasted:

“We’re planning to launch a self-serve free tier in 6 weeks to hit our Q3 board number - 3x signups. The growth PM is gung-ho, sales is quiet but I sense they’re not thrilled, and honestly I’m nervous we’re committing to this because it’s the obvious move, not because we’ve thought it through. The deck assumes free users convert to paid like our trials do. Should we pull the trigger?”

0. Source ledger

ID	Exact quote	Origin
S1	“launch a self-serve free tier in 6 weeks to hit our Q3 board number - 3x signups”	pasted text
S2	“sales is quiet but I sense they’re not thrilled”	pasted text
S3	“we’re committing to this because it’s the obvious move, not because we’ve thought it through”	pasted text
S4	“The deck assumes free users convert to paid like our trials do”	pasted text
S5	“Should we pull the trigger?”	pasted text

1. Executive summary

You are about to commit to one specific, hard-to-reverse move (a public free tier) under board-deadline pressure, and you have told me the quiet part: it is the obvious move, not an examined one (S3). So the job here is not “go / no-go on the free tier” yet - it is to think it through cheaply before a one-way door closes. The dominant job is to challenge the framing and the load-bearing assumption first, then stress-test. Recommended sequence: reframe -> test the key assumption -> premortem (three frameworks, not the full gauntlet). Start with Step 1, a 20-minute problem restatement, because if “launch a free tier” is the wrong frame, everything downstream is wasted motion. Notably, I am not recommending an options comparison or the stress-test recipe yet - you have not generated real alternatives to compare (section 5).

2. Input mirror

What you told me: You have a near-final plan to launch a self-serve free tier in 6 weeks to triple signups for a Q3 board target, the growth PM backs it, sales is quietly unenthused, and you are uneasy that the choice is reflexive rather than reasoned. The plan assumes free-to-paid conversion mirrors trial conversion.
What you appear to be trying to accomplish: De-risk a consequential commitment before you make it - and really, to find out whether this is the right move at all, not just whether the plan is polished. (confidence: High; Source: S3, S5)
Adjacent intents I noticed but did not assume: Resolving the sales tension (S2) may be its own problem; and “hit the Q3 number” (S1) may have solutions other than a free tier that you have not put on the table.

Confirm or correct this before the plan carries weight.

3. Diagnosis

Job	Present?	Evidence
Reframe the problem	yes	You framed it as “free tier yes/no,” but the goal is the Q3 growth number (S1); the solution may be pre-narrowed (S3).
Challenge assumptions / beliefs	yes	The plan rests on “free converts like trials” (S4) - an untested load-bearing assumption.
Stress-test for risk / failure	yes	A hard-to-reverse plan with optimistic momentum (S1, S3).
Shift perspective	partial	Sales is unenthused (S2) - a perspective not yet surfaced.
Evaluate options / decide	not yet	There are no compared alternatives to decide among; “should we pull the trigger” (S5) is a single-option go/no-go, not a choice set.

Dominant job (unblocks the most right now): Reframe + assumption-check. You said the move is reflexive (S3) and it rests on one assumption (S4). If either is off, a polished risk plan on top of a wrong frame is wasted. Reframing is also the cheapest move, so it goes first.
Stakes x reversibility: One-way door (a public free tier is hard to pull without trust damage) x high stakes (board number, S1).
Therefore plan heft: the fuller end - but three frameworks, not six. High stakes justify rigor; they do not justify stacking every tool. Overall plan confidence: Medium (capped by the one-way door, and because the sales tension and the conversion assumption are still unresolved).

4. The Thinking Plan

Step 1 - `think-problem-restatement` · single skill

The job it does here: Reframe. Test whether “launch a free tier” is the right problem, or a pre-narrowed answer to “hit 3x growth by Q3” - and surface the adjacent options you have not named.
Why this one (not a near neighbor): Over think-abstraction-laddering (also a reframe tool) because your issue is a pre-committed solution, which restatement attacks directly by rewriting the problem several ways; laddering is better when the altitude is wrong, not when the frame is prematurely fixed.
Evidence tier: M/P - moderate-to-practitioner; reframing reliably changes the solution set, though “more frames = better decision” is not a hard finding. Honest, useful, low-cost.
Expected artifact: A set of 4-6 alternative problem framings with the best one chosen, and the candidate options each frame implies.
Run it with:

“Restate this problem several ways and pick the strongest frame: we want to hit a 3x signup target for the Q3 board review, and the current plan is to launch a self-serve free tier in 6 weeks. I suspect we narrowed to the free tier too fast. Surface framings and the options each one implies.”
When to stop / done looks like: You have a chosen frame and 2-3 genuine alternatives to the free tier on the table (even if you still prefer the free tier).
Feeds into: Step 2 (which assumption to test depends on the frame you keep).

Step 2 - `think-what-would-have-to-be-true` · single skill

The job it does here: Challenge the load-bearing assumption. Convert “free converts to paid like trials do” (S4) into the conditions that must hold, and name the one whose failure kills the plan.
Why this one (not a near neighbor): Over think-evidence-vs-inference-sort (which would also expose the assumption) because you do not just need to label it as an assumption - you need it turned into a testable condition you can go check before launch. Over think-natural-frequency-bayesian because there is no conditional-probability confusion here to re-express; that is the wrong tool (see section 5).
Evidence tier: P - practitioner; a disciplined way to make hidden conditions explicit and checkable.
Expected artifact: A list of must-be-true conditions for the chosen plan, with the killer condition (load-bearing + uncertain) flagged.
Run it with:

“For this plan - launch a free tier in 6 weeks to 3x signups - list what would have to be true for it to work, especially the assumption that free users convert to paid like our trial users do. Flag the conditions that are both load-bearing and uncertain.”
When to stop / done looks like: You can name the single assumption that, if false, sinks this - and a cheap way to test it before committing.
Feeds into: Step 3 (the premortem stress-tests the plan you are left with).

Step 3 - `think-premortem` · single skill

The job it does here: Stress-test for failure. With the frame chosen and the key assumption examined, imagine it is 6 months post-launch and the free tier failed; surface causes and convert each to a tripwire, mitigation, owner, and kill criterion - while you can still change course.
Why this one: It is the right last gate before a one-way-door commit, and it is where the sales tension (S2) will surface as a concrete, ownable risk rather than a vague unease.
Evidence tier: S/M (contested) - prospective hindsight reliably surfaces more and more-specific risks and reduces overconfidence; it is not proven to improve final outcomes, and the “30%” figure refers to reasons generated, not decision quality. See think-premortem’s dossier.
Expected artifact: A ranked risk register with tripwires and kill criteria. (See think-premortem’s worked example - it runs this exact scenario.)
Run it with:

“Run a premortem on launching our self-serve free tier in 6 weeks. It is 6 months later and it failed badly - surface the likely causes (include the sales-team dynamics and the free-to-paid conversion assumption), and give each a tripwire, mitigation, owner, and kill criterion.”
When to stop / done looks like: Every top risk has a pre-decided response, and you have at least one kill criterion you would actually honor.
Feeds into: This is the whole plan. After it, you can commit with eyes open or decide not to.

5. What NOT to use, and why

Not think-decision-option-review (yet). It compares options against weighted criteria - but right now you effectively have one option (S5). Comparing it against nothing is theater. It becomes the right tool after Step 1 surfaces real alternatives; if it does, run it before the premortem.
Not think-natural-frequency-bayesian. Tempting because there is a “conversion rate,” but you have no conditional-probability confusion (no base-rate-vs-test-accuracy trap). It would add ceremony, not insight.
Not the think-stress-test-decision recipe. It is the library’s flagship, and it ends in a premortem - but it presumes you have already compared options and chosen one. You have not. Running the full recipe now front-loads heavy rigor onto a frame you have not yet validated. We borrowed only its premortem step, in the right place.
Cut by the stakes calibrator: high stakes tempt over-tooling. We are deliberately not adding think-futures-wheel, think-red-team-light, or think-reference-class-forecasting now. The first two are reasonable later adds; reference-class forecasting is genuinely worth it once you have a number to sanity-check (after Step 2), so hold it as a Step 2.5 only if the conversion assumption survives.

6. If this goes deeper (optional)

Reframing (think-problem-restatement): catalog family 3, “Problem framing and reframing.”
Testable conditions (think-what-would-have-to-be-true): family 4, “Assumption and belief challenge.”
Premortem (think-premortem): family 5; see its dossier for the honest evidence read (S/M, contested).

7. Evidence and source map

Claim / recommendation	Source ID	Exact quote
The choice is reflexive, so reframe first	S3	“we’re committing to this because it’s the obvious move, not because we’ve thought it through”
A load-bearing assumption needs testing	S4	“The deck assumes free users convert to paid like our trials do”
One-way door, high stakes -> fuller heft	S1	“launch a self-serve free tier in 6 weeks to hit our Q3 board number - 3x signups”
No option set yet -> defer option-review	S5	“Should we pull the trigger?”

Inferred (Low confidence) claims: that sales’ quiet (S2) reflects substantive disagreement rather than disinterest - flagged, and not the sole basis for any recommendation.
Gaps: you did not say whether the 3x target itself is negotiable. The one question that would most improve this plan: is the Q3 board commitment the signup number, or the growth it is a proxy for? If the latter, Step 1 may surface a faster, lower-risk path.

Note the behavior worth copying: the dominant job was diagnosed from what the user said about their own reasoning (S3), not from the topic; the plan is three frameworks, not the whole catalog; each step names why it beats its nearest neighbor; and the library’s own marquee recipe is explicitly declined because its precondition is not met. A Thinking Plan that recommends six frameworks “to be thorough” has failed.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Framework Advisor

The single source of truth for the think-framework-advisor skill. The SKILL.md, the sidecar (skill.meta.yml), and the eval cases all derive from this file. If a claim is not here, it does not belong in the skill. This skill is a meta/router, so its evidence is about the act of matching a thinking method to a situation - distinct from the evidence for any one framework it recommends (those carry their own dossiers).


Skill	`thinking-framework-skills.framework-advisor` (installable name `think-framework-advisor`)
Family	meta-thinking-and-reflection (router across all families)
Evidence tier	M/C (split) - see section 3. M that applying a fitting structured method helps (with an S empirical core in one narrow case); C that this router reliably selects the right method (untested).
Confidence	Moderate that structured-method-fit beats unaided judgment; low that automated routing accuracy is validated. The skill must claim the former and disclaim the latter.
Status	draft (first authored 2026-06-01; evidence verified via a 5-agent web-verification pass)

1. The mechanism (what actually does the work)

The advisor does three moves, in order, and its value is in doing them honestly and subtractively:

Diagnose the cognitive job. Classify the situation by the thinking move it needs (reframe, diverge, shift perspective, challenge assumptions, stress-test, reason about the system, decide, synthesize, reflect), by evidence from the input, not by topic. Name the dominant job - the one that unblocks the most right now. This is the routing engine; the library’s 11-family catalog is its table.
Calibrate the heft. Read the decision’s reversibility x stakes and let that cap how many frameworks and how much rigor to prescribe. This is the governor against over-tooling: a reversible, low-stakes call gets one fast move; a one-way-door, high-stakes call earns the fuller sequence.
Route to the fitting method(s) and hand off. Recommend the fewest frameworks that do the work, in sequence, each tagged with its own evidence tier, each with a filled, ready-to-run invocation - and say explicitly what not to use.

The load-bearing principle is subtraction: prescribing more frameworks is not better thinking. The mechanism we implement is “diagnose -> calibrate -> recommend the minimal fitting sequence,” not “run the user through a battery.”

2. Lineage

The over-application failure mode the advisor guards against (“law of the instrument”): Kaplan, A. (1964), The Conduct of Inquiry, p. 28 (“Give a small boy a hammer, and he will find that everything he encounters needs pounding”); restated with the now-popular “nail” wording in Maslow, A. H. (1966), The Psychology of Science, pp. 15-16. The closest empirical cousin is the Einstellung (mental set) effect: Luchins, A. S. (1942), and Luchins & Luchins (1959) - prior success with one method induces a set that persists even when a simpler one is available.
The heft calibrator (match deliberation to reversibility): Bezos, J. P., 2015 Letter to Shareholders (Amazon; released spring 2016 with the FY2015 report), the “one-way door / two-way door” (Type 1 / Type 2) framing. Its decision-theoretic shadow is the irreversibility-under-uncertainty literature (Arrow & Fisher 1974; Bernanke 1983; McDonald & Siegel 1986; Dixit & Pindyck 1994). This library already ships the practitioner version as think-one-way-vs-two-way-door.
The contingency stance (the right method depends on the situation): Snowden & Boone (2007), the Cynefin sense-making model; and the naturalistic-decision-making tradition (Klein 1998, Sources of Power; Klein et al. 1993; Mosier et al. 2018).
The “structured method helps” basis: the clinical-vs-mechanical-prediction line (Meehl 1954; Dawes 1979; Grove et al. 2000) and the decision-process literature (Lovallo & Sibony 2010; Kahneman, Lovallo & Sibony 2011; Kahneman, Sibony & Sunstein 2021; Milkman, Chugh & Bazerman 2009).

No trademark on the advisor itself. Note: Cynefin is a proprietary framework developed through Snowden’s consultancy (not a verified registered trademark); the advisor borrows the contingency idea, not the Cynefin model, and tiers it honestly.

3. What the evidence shows, and what it does NOT show

This is the honest core. The skill must not overclaim. The grade is a split, because the advisor does two separable things and they have very different evidentiary support.

3a. “Applying a fitting structured method beats unaided judgment” - tier M, with an S core

Supported:

S (narrow, replicated): Mechanical/linear combination of cues equals or beats holistic expert judgment for repeated, measurable predictive judgments. Grove et al. (2000) meta-analysis (~136 studies): ~10% higher accuracy on average, with mechanical equal-or-better in most studies (not a uniform +10% per study), robust across task and expertise; Dawes (1979) and Meehl (1954) established that even crude (“improper”) linear models beat intuition. This is the genuine empirical anchor - but it is confined to repeated numerical prediction with valid cues, not unique strategic one-offs.
M (correlational field evidence): Decision process quality predicts decision outcomes. Lovallo & Sibony (2010) studied 1,048 major business decisions and found process mattered more than analysis “by a factor of six”; Kahneman, Lovallo & Sibony (2011) build a 12-question checklist on it. This is observational, self-reported, not peer-reviewed - suggestive, not proof.
P (synthesis/advocacy): Structured “decision hygiene” reduces noise (Kahneman, Sibony & Sunstein 2021, Noise). A peer-reviewed survey of debiasing (Milkman, Chugh & Bazerman 2009) organizes the case and is honest that evidence for many interventions is mixed.

Net for 3a: M. Using a structured method that fits the decision type is well-motivated and, in one narrow case, strongly proven (S). It is not a blanket law that “any structure helps any decision.”

3b. “This router reliably picks the right method for your situation” - tier C

Not shown. No source tested the accuracy of method selection / routing - the advisor’s actual distinctive act. The contingency stance it rests on (Cynefin: C, a sense-making model with limited independent validation; NDM: M as a descriptive paradigm) shows only that different situations have been studied with different methods; it does not supply a validated rule for picking one. So the routing step is conceptually plausible but under-tested (C). The advisor must own this.

First behavioral measurement (2026-06-03, self-hosted; recorded under docs/internal/eval-results/). An agent-executed routing eval over the advisor’s own authored cases yields one stable floor and one still-soft signal.

Stable: name-safety 12/12 across two runs - the advisor never named a framework outside recommendable.json, its one unforgivable failure.
Soft: routing accuracy 7/12 then 9/12 after the engage test cases were recalibrated (the first run’s engage cases sat below the advisor’s own insufficient-signal gate, so they tested the gate, not engagement). Declines and rich-engage cases score well, but the insufficient-signal gate over-fires and wavers run-to-run (it gated cases that should have engaged or handed off).

So routing stays C - now measured, not validated: single-run, self-authored, small-N, model-dependent. The grade does not rise; what changed is that “never measured” is no longer true, and the next concrete fix (calibrating the insufficient-signal gate) is named.

Gate calibrated (2026-06-06). Protocol 2 was rewritten from a length reflex into a last-resort gate (a clear route, decline, or signal-bearing engage beats a clarifying question; unstated stakes or reversibility is not itself insufficient signal). A 2-trials-per-case re-run shows the over-firing fixed and the wavering gone: 24/24 routings to the correct category, every one of the 12 cases consistent across both trials, name-safety 12/12, and the genuinely-thin gate case still correctly asking exactly one question. The grade still does not rise: the measurement is now consistent but remains single-eval, self-authored, small-N (12), model-dependent, and externally unvalidated. Recorded under docs/internal/eval-results/2026-06-06-advisor-routing-calibrated.{md,json}.
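The two numbers reported above (category-correct routings and run-to-run consistency per case) can be computed from trial records roughly as follows. This is a sketch under assumptions: the record shape, the field names, and the category labels are hypothetical, and the sample records are made up for illustration. They are not the contents of the recorded eval files.

```python
from collections import defaultdict


def score_routing(trials):
    """Summarize routing trials.

    trials: list of dicts with keys 'case', 'expected', 'actual'
    (a hypothetical record shape, one dict per trial).
    """
    correct = sum(t["actual"] == t["expected"] for t in trials)

    # Collect the distinct outcomes seen for each case across its trials.
    outcomes_by_case = defaultdict(set)
    for t in trials:
        outcomes_by_case[t["case"]].add(t["actual"])

    # A case is consistent when every trial produced the same category.
    consistent = sum(len(seen) == 1 for seen in outcomes_by_case.values())
    return {
        "correct": correct,
        "total": len(trials),
        "consistent_cases": consistent,
        "cases": len(outcomes_by_case),
    }


# Made-up records: 3 cases x 2 trials each.
trials = [
    {"case": "c1", "expected": "route", "actual": "route"},
    {"case": "c1", "expected": "route", "actual": "route"},
    {"case": "c2", "expected": "decline", "actual": "decline"},
    {"case": "c2", "expected": "decline", "actual": "decline"},
    {"case": "c3", "expected": "engage", "actual": "gate"},
    {"case": "c3", "expected": "engage", "actual": "engage"},
]
result = score_routing(trials)
print(result)
```

Keeping accuracy and consistency as separate counts matters here: a gate that wavers run-to-run, as in case c3, can look acceptable on accuracy alone.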

3c. “Fewer frameworks is better” (the subtraction principle) - motivation, not proof (C)

The “law of the instrument” (Kaplan 1964; Maslow 1966) is an aphorism, not evidence. The Einstellung effect (Luchins 1942) is real experimental evidence for method rigidity but is about within-task carryover, not framework over-prescription - it supports the failure mode by analogy (M for the effect, C for the inference). Choice overload / “paradox of choice” is contested: Iyengar & Lepper (2000) is real but context-bound; Scheibehenne, Greifeneder & Todd (2010) meta-analyzed 50 studies and found a near-zero mean effect; Chernev, Böckenholt & Goodman (2015) recover a moderated effect under specific preconditions. So “more options harm decisions” is not established science. The advisor may use minimalism only as a context-sensitive heuristic, never as a proven law.

Bottom line for the frontmatter

M/C. Honest one-liner the skill must carry: “Applying a fitting structured method to a decision is well-supported (M, with an S core for mechanical prediction). Whether this router picks the right method for you is not validated (C). The frameworks it recommends carry their own, often stronger, evidence - trust those tiers, and treat the routing as a useful starting hypothesis, not a verified answer.”

4. Transferred-evidence flag (required honesty for this library)

Two gaps, both of which the skill must state:

Human-context evidence, not AI-validated. All of the support above comes from human decision-makers. There is no study of an AI agent doing the routing, nor of whether an agent-produced Thinking Plan improves a human’s decision. As with the rest of the library, treat the AI value as: the agent makes diagnosis cheap, enforces the subtraction discipline, and produces a durable, auditable artifact - benefits that do not depend on the unproven routing-accuracy claim.
The routing accuracy is now first-measured, not validated (section 3b). This caveat is stronger than the usual transferred-evidence flag and is specific to a meta/router skill. Self-hosted evals give a stable name-safety floor (12/12 in every run) and a routing signal that improved as the test and the gate were fixed: 7/12, then 9/12 after test recalibration, then 24/24 category-correct and run-to-run-consistent after the 2026-06-06 gate calibration. All of this rests on small-N, self-authored cases, so whether this particular decision table reliably selects the right method is measured (now consistently) but not externally validated. The skill mitigates this structurally: it shows its diagnosis and Source: citations so a user can challenge the routing (“why this job, why this framework?”), and it always lists what it chose not to recommend.

5. When it works / when it fails (drives the eval negative cases)

Works best when:

The user has a genuine reasoning/decision situation and is unsure which move to make.
There is enough signal to diagnose a dominant job (stakes and reversibility are stated or inferable).
The user benefits from a small, sequenced, named plan plus filled hand-off prompts.

Fails or misleads when (poor-fit / anti-patterns):

Not a thinking task (factual lookup, coding, content generation): the advisor should redirect, not recommend frameworks.
The user wants the framework executed, not selected. If they already know they want a premortem, route them straight to think-premortem; the advisor is for “what should I even do here?”
Over-stacking (the signature failure): recommending five frameworks “to be thorough.” The stakes calibrator exists to prevent exactly this; a Thinking Plan that ignores it has failed.
Inventing a skill name. Naming a framework not in references/recommendable.json is a critical defect. Plain-language fallback is mandatory when no listed component fits.
Tier inflation. Presenting a P-tier framework as settled science, or claiming S for the routing. The split grade in section 3 forbids this.
A thin-family situation (most of strategy -> pm-skills; group facilitation -> human-social) forced into a poor-fit think- recommendation instead of an honest “this library serves that poorly.”
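The “inventing a skill name” failure above is mechanically checkable. A minimal sketch follows, assuming the allowlist is a flat JSON array of skill names; the real schema of references/recommendable.json is not shown in this document, so that shape is an assumption. The allowlist is inlined as a string to keep the example self-contained, the helper name is hypothetical, and "think-six-hats" is a made-up name used only to trigger the check.

```python
import json

# Assumption: a flat JSON array of recommendable skill names. The actual
# structure of references/recommendable.json may differ.
ALLOWLIST_JSON = '["think-premortem", "think-brainwriting", "think-scamper"]'


def unknown_names(plan_names, allowlist_json=ALLOWLIST_JSON):
    """Return every recommended name that is not on the allowlist."""
    allowed = set(json.loads(allowlist_json))
    return [name for name in plan_names if name not in allowed]


# "think-six-hats" is invented for this example and should be flagged.
bad = unknown_names(["think-premortem", "think-six-hats"])
print(bad)
```

Any non-empty result would count as the critical defect; the plain-language fallback applies when no listed component fits.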

6. Output artifact

The skill must emit a Thinking Plan, not prose: a source ledger, an executive summary, an input mirror, a diagnosis (dominant job + stakes/reversibility/heft), a prioritized sequence of 1-4 framework recommendations (each with job, why-this-one, evidence tier, expected artifact, a filled invocation, and a stop signal), an explicit “what NOT to use, and why,” and an evidence/source map. Structure in references/TEMPLATE.md; worked example (the Northwind scenario) in references/EXAMPLE.md.
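As a rough illustration of the artifact’s shape, the parts listed above can be written as a small data structure with a structural check. This is a sketch, not the schema in references/TEMPLATE.md: the class names, field names, and the problems() helper are hypothetical, and the example values (including the tier) are placeholders.

```python
from dataclasses import dataclass

TIERS = {"S", "M", "P", "C"}  # evidence tiers used in this document


@dataclass
class Recommendation:
    framework: str
    job: str
    why_this_one: str
    evidence_tier: str
    expected_artifact: str
    filled_invocation: str
    stop_signal: str


@dataclass
class ThinkingPlan:
    source_ledger: list
    executive_summary: str
    input_mirror: str
    diagnosis: dict        # dominant job, stakes, reversibility, heft
    recommendations: list  # Recommendation objects, in priority order
    not_to_use: list       # (framework, reason) pairs
    evidence_map: dict

    def problems(self):
        """Return a list of structural problems; empty means none found."""
        issues = []
        if not 1 <= len(self.recommendations) <= 4:
            issues.append("expected 1-4 recommendations")
        for rec in self.recommendations:
            if rec.evidence_tier not in TIERS:
                issues.append(f"unknown tier for {rec.framework}")
        if not self.not_to_use:
            issues.append("missing 'what NOT to use'")
        return issues


rec = Recommendation(
    framework="think-premortem",
    job="stress-test",
    why_this_one="the plan is set; the open question is how it fails",
    evidence_tier="M",  # placeholder, not the skill's actual grade
    expected_artifact="ranked failure list",
    filled_invocation="Run think-premortem on: <user's plan>",
    stop_signal="top risks have owners",
)
plan = ThinkingPlan(
    source_ledger=["user message"],
    executive_summary="Stress-test before committing.",
    input_mirror="You have a plan and are nervous about it.",
    diagnosis={"dominant_job": "stress-test", "stakes": "high", "reversibility": "low"},
    recommendations=[rec],
    not_to_use=[("think-scamper", "no existing idea needs transforming")],
    evidence_map={"think-premortem": "M (placeholder)"},
)
print(plan.problems())
```

The check covers structure only (count, tier labels, presence of the “what NOT to use” list); it says nothing about whether the routing itself is right.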

7. Sources

Verified in a 5-cluster web-verification pass (2026-06-01); reliability noted per item.

The advisor’s own basis (structured-method value):

1. Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). “Clinical versus mechanical prediction: A meta-analysis.” Psychological Assessment 12(1):19-30. (primary; S; narrow scope)
2. Dawes, R. M. (1979). “The robust beauty of improper linear models in decision making.” American Psychologist 34(7):571-582. (primary; S; narrow scope)
3. Meehl, P. E. (1954). Clinical versus Statistical Prediction. Univ. of Minnesota Press. (primary; foundational)
4. Lovallo, D., & Sibony, O. (2010). “The Case for Behavioral Strategy.” McKinsey Quarterly. (field study, 1,048 decisions; M; correlational, not peer-reviewed)
5. Kahneman, D., Lovallo, D., & Sibony, O. (2011). “Before You Make That Big Decision.” Harvard Business Review 89(6):50-60. (primary; M; practitioner checklist)
6. Kahneman, D., Sibony, O., & Sunstein, C. R. (2021). Noise: A Flaw in Human Judgment. Little, Brown Spark. (synthesis/advocacy; P)
7. Milkman, K. L., Chugh, D., & Bazerman, M. H. (2009). “How Can Decision Making Be Improved?” Perspectives on Psychological Science 4(4):379-383. (peer-reviewed survey; honest about mixed debiasing evidence)

The heft calibrator (reversibility):

8. Bezos, J. P. 2015 Letter to Shareholders (Amazon; released spring 2016), “Invention Machine” section - the Type 1/Type 2, one-way/two-way door framing. (primary; P. NOTE: it is the 2015 letter, not the 2016 letter - a common citation error.)
9. Arrow, K. J., & Fisher, A. C. (1974), QJE 88(2):312-319 (quasi-option value); Bernanke (1983), QJE 98(1):85-106; McDonald & Siegel (1986), QJE 101(4):707-728; Dixit & Pindyck (1994), Investment under Uncertainty. (decision-theoretic shadow; M; supports the principle by analogy, not the specific calibrator.)

The contingency stance (method-fit):

10. Snowden, D. J., & Boone, M. E. (2007). “A Leader’s Framework for Decision Making.” Harvard Business Review 85(11):68-76 (Cynefin). (primary; C; sense-making model, proprietary, limited independent validation.)
11. Klein, G. A. (1998). Sources of Power: How People Make Decisions. MIT Press (RPD/NDM). (primary; M; field/observational.) Plus Klein et al. (1993); Mosier, Fischer, Hoffman & Klein (2018), Cambridge Handbook of Expertise (2nd ed., ch. 23).

The subtraction principle (over-application / choice overload):

12. Kaplan, A. (1964), The Conduct of Inquiry, p. 28; Maslow, A. H. (1966), The Psychology of Science, pp. 15-16. (aphorisms; C - origin of the concept, not proof.)
13. Luchins, A. S. (1942), “Mechanization in problem solving: The effect of Einstellung,” Psychological Monographs 54(6); Luchins & Luchins (1959). (M for the effect; supports the failure mode by analogy.)
14. Iyengar, S. S., & Lepper, M. R. (2000), JPSP 79(6):995-1006; Scheibehenne, Greifeneder & Todd (2010), J. Consumer Research 37(3):409-425 (near-zero mean effect); Chernev, Böckenholt & Goodman (2015), J. Consumer Psychology 25(2):333-358 (moderated). (choice overload is contested; C as a hard justification - soft motivation only.)

Verification status: all citations above were checked in a web-verification pass on 2026-06-01; primary vs reputable-secondary reliability is noted per item. Items 1-3, 5, 8, 10, 11, 14 were confirmed against primary or publisher records; items 9 and 13 are confirmed by citation metadata and standard secondary literature (treat exact internal wording as not line-verified). The honest split grade (M/C) is the load-bearing conclusion and is what the skill claims.

Thinking Framework Skills v0.3.0 · 38 frameworks