Linear-Model Aggregation

For a judgment you make over and over - screening candidates, scoring leads, triaging tickets - holistic expert intuition is unreliable mainly because it is inconsistent: the same expert scores the same case differently on different days. A simple mechanical rule removes that: pick a few predictive cues, weight them (even equal weights work), score each case, combine by a fixed formula, and apply it identically every time. The robust, counterintuitive result is that such rules match or beat holistic judgment, because consistency beats brilliance applied erratically. The output is a scoring model. Two honest limits: it is for repeated judgments (not one-off strategic choices), and it is only as good as its cues.

When to Use

The same kind of evaluative/predictive judgment recurs (screening, lead/deal scoring, triage, prioritizing a queue).
Gut calls on these are inconsistent or overconfident.
A few cues with real predictive signal exist.

When NOT to Use

A genuinely one-off decision among a few options, or a single strategic call rather than a repeatable rule (use decision-option-review).
No real predictive cues or data exist - do not invent cues and weights (false precision).
High-stakes judgments about individuals (hiring, lending, justice) where mechanical scoring raises fairness/legal/ethical issues - flag these, do not silently automate.

Instructions

When asked to build a scoring model, follow these steps:

State the recurring judgment and the outcome it predicts, and confirm the outcome is eventually measurable. If it is a one-off, stop and route to decision-option-review.
Choose a few predictive cues: three to six that plausibly carry real signal, with a stated reason for each. Resist adding cues that feel thorough but lack validity.
Assign weights. Default to equal weights unless real data justifies otherwise - the evidence says simple/equal weights capture most of the benefit; do not fake precision.
Define the per-cue rubric. How each cue is scored, so two people would score a case the same way.
Set the formula and threshold. How the cue scores combine, and the decision rule (e.g. above X -> advance).
Mandate consistency, and flag the caveats. State that the model must be applied the same way every time (overriding it on a hunch reintroduces the noise it removes), that it is only as good as its cues, and any fairness/ethical caveat for judgments about people.
Emit the scoring model per references/TEMPLATE.md.
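The steps above can be sketched as a small data structure. This is a minimal sketch under stated assumptions, not the layout in references/TEMPLATE.md: the names `Cue`, `ScoringModel`, `score`, and `advance` are hypothetical, and the "3 to 6 cues" and "stated reason" checks simply encode the guidance in the steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    name: str
    reason: str              # why the cue plausibly predicts the outcome
    rubric: dict[int, str]   # allowed score -> what earns that score
    weight: float = 1.0      # equal weights unless outcome data justifies otherwise


@dataclass(frozen=True)
class ScoringModel:
    judgment: str            # the recurring judgment
    outcome: str             # the measurable outcome it predicts
    cues: tuple[Cue, ...]
    threshold: float         # decision rule: score >= threshold -> advance
    caveats: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Hypothetical guard rails mirroring the instructions above.
        if not 3 <= len(self.cues) <= 6:
            raise ValueError("choose 3 to 6 cues")
        for cue in self.cues:
            if not cue.reason.strip():
                raise ValueError(f"cue {cue.name!r} has no stated reason")

    def score(self, case: dict[str, int]) -> float:
        """Weighted sum of rubric scores; rejects values the rubric does not define."""
        total = 0.0
        for cue in self.cues:
            value = case[cue.name]
            if value not in cue.rubric:
                raise ValueError(f"{cue.name}: {value!r} is not a rubric score")
            total += cue.weight * value
        return total

    def advance(self, case: dict[str, int]) -> bool:
        """Apply the fixed decision rule; there is deliberately no override hook."""
        return self.score(case) >= self.threshold
```

Keeping the rubric inside each cue means an out-of-rubric score fails loudly instead of being silently summed, which is one way to support the "two people would score a case the same way" requirement.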

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the scoring model (cues, weights, rubric, formula, threshold, caveats), not prose.

Quality Checklist

Before finalizing, verify:

The judgment is genuinely repeated and its outcome is eventually measurable; it is not a one-off.
Cues are few and each has a stated reason to be predictive.
Weights default to equal/simple unless data justifies otherwise (no fake precision).
The per-cue rubric is concrete enough for consistent scoring.
The model includes the “apply consistently” mandate and the cue-validity caveat.
Fairness/legal/ethical caveats are flagged for judgments about individuals.
The output is the scoring-model artifact, not prose.

Evidence

Tier S. Across decades and many domains, mechanical/actuarial combination of cues equals or beats holistic expert judgment in the large majority of studies (Meehl 1954; Grove et al. 2000 meta-analysis), and even equal-weight “improper” models capture most of the benefit (Dawes 1979); the driver is reduced inconsistency/noise (Kahneman, Noise, 2021). It applies to repeated judgments, is only as good as its cues, and raises fairness considerations for judgments about people. Evidence is from human expert-vs-model studies, transferred to AI use, not AI-validated. Full grading: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed scoring model.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Scoring Model - Worked Example

A completed run of think-linear-model-aggregation, on the shared Northwind scenario. This is the quality bar a generated model should meet.

Northwind is a B2B SaaS. Reps decide which inbound leads are “worth pursuing” by gut, inconsistently. (This is the recurring judgment behind the lead-scoring signal that natural-frequency-bayesian examined.) This skill replaces the gut call with a simple consistent rule.

Judgment

The recurring judgment: is an inbound lead worth a rep’s time to pursue?
Outcome it predicts: the lead becomes a closed-won opportunity within 90 days.

Cues (few, predictive)

Cue	Why it plausibly predicts the outcome	Weight	How it is scored (rubric)
ICP firmographic fit (size, segment)	Closed-won rate is far higher inside the ICP	1 (equal)	2 = strong fit, 1 = partial, 0 = outside ICP
Engagement depth (key-action completed in trial)	Activated trials convert far more often	1	2 = key action done, 1 = logged in, 0 = neither
Budget/authority signal (named buyer, stated budget)	Deals without a buyer rarely close in 90 days	1	2 = both, 1 = one, 0 = neither
Inbound source quality (referral/demo-request vs cold list)	Higher-intent sources close more	1	2 = referral/demo, 1 = content, 0 = cold

(Equal weights, per the evidence that simple weights capture most of the benefit. Revisit weights only if outcome data justifies it.)
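One way to see why equal weights give up little is a short calculation. The sketch below assumes the cues are standardized and uncorrelated, which is a simplification and not something the Northwind scenario states. Under that assumption, the correlation between an equal-weight composite and a composite built from any other weights w is sum(w) / sqrt(n * sum(w^2)). The "alternative" weights are hypothetical and chosen only for illustration.

```python
import math


def equal_weight_agreement(weights):
    """Correlation between an equal-weight sum and a weighted sum of the same cues.

    Assumes standardized, mutually uncorrelated cues (an illustrative assumption).
    """
    n = len(weights)
    return sum(weights) / math.sqrt(n * sum(w * w for w in weights))


# Hypothetical weights, not fitted to any data: the first cue counts 4x the last.
alternative = [0.4, 0.3, 0.2, 0.1]
print(round(equal_weight_agreement(alternative), 3))  # 0.913
```

Even with the first cue weighted four times the last, the two composites rank cases very similarly under these assumptions, so precise-looking weights need outcome data to earn their place.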

Formula and decision rule

Combine: sum of the four cue scores (range 0-8).
Threshold / rule: score >= 5 -> rep pursues now; 3-4 -> nurture queue; <= 2 -> auto-decline.
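The formula and decision rule above are small enough to write out directly. A minimal sketch: the dictionary keys and the function names `score_lead` and `route` are hypothetical labels for the four cues in the table, and each cue is assumed to have already been scored 0, 1, or 2 per its rubric.

```python
# Hypothetical key names for the four cues in the table above.
CUES = ("icp_fit", "engagement", "budget_authority", "source_quality")


def score_lead(lead: dict) -> int:
    """Sum the four rubric scores (each 0, 1, or 2), giving a total from 0 to 8."""
    for cue in CUES:
        if lead[cue] not in (0, 1, 2):
            raise ValueError(f"{cue} must be scored 0, 1, or 2")
    return sum(lead[cue] for cue in CUES)


def route(score: int) -> str:
    """Apply the fixed thresholds: >= 5 pursue, 3-4 nurture, <= 2 decline."""
    if score >= 5:
        return "pursue"
    if score >= 3:
        return "nurture"
    return "decline"


# Strong ICP fit, logged in only, one of buyer/budget, referral source.
lead = {"icp_fit": 2, "engagement": 1, "budget_authority": 1, "source_quality": 2}
print(score_lead(lead), route(score_lead(lead)))  # 6 pursue
```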

Mandate and caveats

Apply consistently: every lead scored the same way; no overriding the score on a hunch (“I have a good feeling”) - that gut override is exactly the inconsistency this removes.
Only as good as its cues: check quarterly whether high-scoring leads actually closed more than low-scoring ones; drop a cue that shows no predictive signal.
Fairness / ethics: this scores accounts, not individual people, and uses no protected attributes; keep it to firmographic/behavioral cues and review for proxy bias before any automation.
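The quarterly cue check in the caveats above could look something like the following. It is a rough sketch, not a statistical test: the record layout (one dict per lead with the four cue scores plus a `closed_won` flag) and the function names are assumptions, and it ignores sample size, so a lift computed on a handful of leads should not be read as evidence either way.

```python
def close_rate(records):
    """Share of records that closed; None when there are no records."""
    if not records:
        return None
    return sum(1 for r in records if r["closed_won"]) / len(records)


def band_close_rates(records, cues, pursue_at=5, decline_at=2):
    """Close rates for the high band (>= pursue_at) and low band (<= decline_at)."""
    def total(record):
        return sum(record[c] for c in cues)

    high = [r for r in records if total(r) >= pursue_at]
    low = [r for r in records if total(r) <= decline_at]
    return close_rate(high), close_rate(low)


def cue_lift(records, cue):
    """Close rate at rubric score 2 minus close rate at score 0 for one cue.

    A lift near zero over a reasonable sample suggests the cue carries
    little signal and is a candidate to drop. Returns None if either
    group is empty.
    """
    top = close_rate([r for r in records if r[cue] == 2])
    bottom = close_rate([r for r in records if r[cue] == 0])
    if top is None or bottom is None:
        return None
    return top - bottom
```

If the high band does not close more often than the low band, the model as a whole is not predicting the outcome and the cues need revisiting before the thresholds do.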

Note: the value is consistency, not cleverness. The reps’ holistic “good lead?” call varied day to day; per the Meehl/Dawes evidence, a flat 4-cue rule applied the same way every time should match or beat that gut judgment, and it is auditable. This is the right tool because the judgment recurs hundreds of times; a one-off “which of these 3 deals to chase” would instead use decision-option-review.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Linear-Model Aggregation (Mechanical Combination)

Single source of truth for the linear-model-aggregation skill. The SKILL.md, sidecar, and evals derive from this. A strong-evidence anchor (named empirical core).


Skill	`thinking-framework-skills.linear-model-aggregation` (installable name `think-linear-model-aggregation`)
Family	decision-and-option-evaluation
Evidence tier	S (one of the most replicated findings in judgment research)
Confidence	High that simple consistent rules match or beat holistic judgment for repeated predictions
Status	draft (authored 2026-05-31 from the discovery corpus)

1. The mechanism (what actually does the work)

For a repeated predictive or evaluative judgment - screening candidates, scoring leads, triaging tickets, rating applications - holistic expert intuition is unreliable mainly because it is inconsistent: the same expert, given the same case on different days, reaches different conclusions (noise), and weights cues differently each time. A simple mechanical rule removes that inconsistency: pick a few predictive cues, assign weights (even equal weights work), score each case on each cue, combine by a fixed formula, and apply it the same way every time.

The counterintuitive, robust result: such rules - including “improper” ones with equal or roughly-guessed weights - reliably match or beat holistic expert judgment, because consistency beats brilliance-applied-erratically. The skill’s value is producing that rule and committing to applying it consistently.
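The mechanism can be illustrated with a toy simulation. This is an illustration under invented assumptions, not a reproduction of any study: the simulated judge sees the same cues and even uses the same weights as the rule, and differs only by adding fresh random noise each time a case is read. All parameter values (`outcome_noise`, `judge_noise`, the number of cases) are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid any dependency."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_cases=5000, n_cues=4, outcome_noise=1.0, judge_noise=2.0, seed=0):
    rng = random.Random(seed)
    rule, judge, retest, outcome = [], [], [], []
    for _ in range(n_cases):
        signal = sum(rng.gauss(0, 1) for _ in range(n_cues))  # equal-weight cue sum
        outcome.append(signal + rng.gauss(0, outcome_noise))
        rule.append(signal)                                   # same answer every time
        judge.append(signal + rng.gauss(0, judge_noise))      # first reading
        retest.append(signal + rng.gauss(0, judge_noise))     # same case, another day
    return {
        "rule_validity": pearson(rule, outcome),
        "judge_validity": pearson(judge, outcome),
        "judge_retest": pearson(judge, retest),
    }


print(simulate())
```

With these settings the expected correlations work out to roughly 0.89 for the rule, 0.63 for the noisy judge, and 0.5 for the judge's agreement with itself on the same cases. The gap comes entirely from inconsistency, since the judge has no information or weighting disadvantage in this setup.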

Two honest constraints are built in: (1) the model is only as good as its cues - garbage cues give a confident garbage model; (2) this is for repeated judgments of the same kind, not unique strategic one-offs.

2. Lineage

Paul Meehl, Clinical versus Statistical Prediction (1954) - actuarial beats clinical. Robyn Dawes, “The robust beauty of improper linear models in decision making” (1979) - even equal-weight models beat experts. Grove & Meehl (1996) review and Grove et al. (2000) meta-analysis - mechanical prediction equals or beats clinical prediction across domains. Kahneman, Sibony & Sunstein, Noise (2021) - inconsistency (noise) as the mechanism.

No trademark. Named descriptively.

3. What the evidence shows, and what it does NOT show

Strongly supported (the S): across decades and many domains (clinical, hiring, lending, academic admission, parole), mechanical/actuarial combination of cues equals or outperforms holistic expert judgment in the large majority of studies (Grove et al. 2000 meta-analysis). Improper linear models (equal/unit weights) capture most of the benefit (Dawes 1979). The driver is reduced inconsistency.

What it does NOT show / boundaries (honest):

It applies to repeated judgments where outcomes are (eventually) measurable, not to genuinely unique strategic one-offs. For a one-time choice among a few options, use a decision-option review, not a predictive model.
The model is only as good as its cues: cue selection requires real predictive validity; a tidy formula on bad cues is worse than honest uncertainty.
Mechanical scoring of individual people (hiring, lending, justice) carries fairness, legal, and ethical considerations the skill must flag, not ignore.

4. Transferred-evidence flag

The evidence comes from studies comparing human expert judgment with statistical models; it is transferred to AI use, not validated on AI. The transfer rests on the observation that an LLM is itself prone to inconsistent holistic impressions across cases. The AI value: forcing an explicit, fixed, few-cue rule applied identically across cases removes that inconsistency and makes the judgment inspectable and auditable - the model produces the rule and then follows it rather than re-deciding holistically each time.

5. When it works / when it fails

Works best when: the same kind of evaluative judgment recurs (screening candidates, scoring leads/deals, triaging, prioritizing a queue); gut calls are inconsistent or overconfident; a few cues with real predictive signal exist.

Fails or misleads when (poor-fit / anti-patterns):

A genuinely one-off decision (use decision-option-review).
No predictive cues / data - inventing cues and weights produces false precision (the central failure).
Over-engineering the weights (the evidence says equal/simple weights are fine; do not fake precision).
Applying the model inconsistently or overriding it case-by-case on a hunch (which reintroduces the noise it removes).
High-stakes judgments about individuals where mechanical scoring raises fairness/legal/ethical issues - flag these, do not silently automate.

6. Output artifact

A scoring model: the judgment it is for; the few predictive cues (with why each is plausibly predictive); the weights (equal-weight as the honest default unless data justifies otherwise); the per-cue scoring rubric; the combination formula; a threshold/decision rule; and an explicit “apply consistently” note plus the cue-validity and fairness caveats.

7. Sources

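Meehl, P. E. (1954). Clinical versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. University of Minnesota Press.
Dawes, R. M. (1979). “The robust beauty of improper linear models in decision making.” American Psychologist, 34(7), 571-582.
Grove, W. M., & Meehl, P. E. (1996). “Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy.” Psychology, Public Policy, and Law, 2(2), 293-323.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). “Clinical versus mechanical prediction: A meta-analysis.” Psychological Assessment, 12(1), 19-30.
Kahneman, D., Sibony, O., & Sunstein, C. R. (2021). Noise: A Flaw in Human Judgment. Little, Brown Spark.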

Verification status: the Meehl/Dawes/Grove results are well-attested and frequently replicated; confirm the Grove 2000 meta-analytic specifics before a public quantified claim. The “only as good as its cues” and fairness caveats are mandatory honest framing.

Thinking Framework Skills v0.3.0 · 38 frameworks