Linear-Model Aggregation

For a judgment you make over and over - screening candidates, scoring leads, triaging tickets - holistic expert intuition is unreliable mainly because it is inconsistent: the same expert scores the same case differently on different days. A simple mechanical rule removes that: pick a few predictive cues, weight them (even equal weights work), score each case, combine by a fixed formula, and apply it identically every time. The robust, counterintuitive result is that such rules match or beat holistic judgment, because consistency beats brilliance applied erratically. The output is a scoring model. Two honest limits: it is for repeated judgments (not one-off strategic choices), and it is only as good as its cues.

Good fit:

  • The same kind of evaluative/predictive judgment recurs (screening, lead/deal scoring, triage, prioritizing a queue).
  • Gut calls on these are inconsistent or overconfident.
  • A few cues with real predictive signal exist.

Poor fit:

  • A genuinely one-off decision among a few options (use decision-option-review).
  • No real predictive cues or data exist - do not invent cues and weights (false precision).
  • High-stakes judgments about individuals (hiring, lending, justice) where mechanical scoring raises fairness/legal/ethical issues - flag these, do not silently automate.
  • A single strategic call rather than a repeatable rule.

When asked to build a scoring model, follow these steps:

  1. State the recurring judgment and the outcome it predicts (and confirm the outcome is eventually measurable). If it is a one-off, stop and route to a decision review.
  2. Choose a few predictive cues. 3 to 6 cues that plausibly carry real signal; say why each. Resist adding cues that feel thorough but lack validity.
  3. Assign weights. Default to equal weights unless real data justifies otherwise - the evidence says simple/equal weights capture most of the benefit; do not fake precision.
  4. Define the per-cue rubric. How each cue is scored, so two people would score a case the same way.
  5. Set the formula and threshold. How the cue scores combine, and the decision rule (e.g. above X -> advance).
  6. Mandate consistency, and flag the caveats. State that the model must be applied the same way every time (overriding it on a hunch reintroduces the noise it removes), that it is only as good as its cues, and any fairness/ethical caveat for judgments about people.
  7. Emit the scoring model per references/TEMPLATE.md.
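
The steps above can be sketched as a small data structure plus one scoring function. This is a minimal illustration, not the layout of references/TEMPLATE.md: the class names, the 0-2 rubric scale, the "advance"/"hold" labels, the use of >= for the threshold, and the ticket-triage cues are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    name: str            # step 2: the cue being scored
    reason: str          # step 2: why it plausibly predicts the outcome
    max_score: int = 2   # step 4: rubric runs 0..max_score (assumed scale)
    weight: float = 1.0  # step 3: equal weights unless data justifies otherwise


@dataclass(frozen=True)
class ScoringModel:
    judgment: str     # step 1: the recurring judgment
    outcome: str      # step 1: the measurable outcome it predicts
    cues: tuple       # step 2: a few Cue objects
    threshold: float  # step 5: cut-off for the decision rule

    def score(self, case: dict) -> float:
        """Step 5: fixed formula, a weighted sum of rubric scores."""
        total = 0.0
        for cue in self.cues:
            value = case[cue.name]  # KeyError if a cue was left unscored
            if not 0 <= value <= cue.max_score:
                raise ValueError(f"{cue.name}: {value} outside 0..{cue.max_score}")
            total += cue.weight * value
        return total

    def decide(self, case: dict) -> str:
        """Step 6: the same rule for every case, with no override path."""
        return "advance" if self.score(case) >= self.threshold else "hold"


# Hypothetical triage model; cues and reasons are illustrative only.
triage = ScoringModel(
    judgment="Should this support ticket be escalated to engineering?",
    outcome="Ticket is confirmed as a product defect within 30 days",
    cues=(
        Cue("customers_affected", "Widespread reports are rarely user error"),
        Cue("reproducible", "Reproducible issues are more often real defects"),
        Cue("recent_release", "Regressions cluster right after a release"),
    ),
    threshold=4,
)

ticket = {"customers_affected": 2, "reproducible": 1, "recent_release": 1}
print(triage.score(ticket), triage.decide(ticket))
```

The model deliberately has no "override" argument: any exception to the rule would have to be made outside the function, where it is visible.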

Use the template in references/TEMPLATE.md. The deliverable is the scoring model (cues, weights, rubric, formula, threshold, caveats), not prose.

Before finalizing, verify:

  • The judgment is genuinely repeated with a (measurable) outcome, not a one-off.
  • Cues are few and each has a stated reason to be predictive.
  • Weights default to equal/simple unless data justifies otherwise (no fake precision).
  • The per-cue rubric is concrete enough for consistent scoring.
  • The model includes the “apply consistently” mandate and the cue-validity caveat.
  • Fairness/legal/ethical caveats are flagged for judgments about individuals.
  • The output is the scoring-model artifact, not prose.
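
Most of this checklist can be run mechanically against a draft. The sketch below assumes the draft is held as a plain dictionary; the field names (`recurring`, `cues`, `weight_justification`, `about_individuals`, `caveats`) are hypothetical and not taken from the template. It cannot judge whether a stated reason is actually valid, only whether one was stated.

```python
def check_model(model: dict) -> list:
    """Return a list of checklist failures; an empty list means the draft passes."""
    problems = []
    if not model.get("recurring"):
        problems.append("judgment is not marked as recurring")
    if not model.get("outcome"):
        problems.append("no measurable outcome stated")

    cues = model.get("cues", [])
    if not 3 <= len(cues) <= 6:  # "3 to 6 cues" from the build steps
        problems.append(f"expected 3 to 6 cues, found {len(cues)}")
    for cue in cues:
        name = cue.get("name", "<unnamed>")
        if not cue.get("reason"):
            problems.append(f"{name}: no stated reason it is predictive")
        if not cue.get("rubric"):
            problems.append(f"{name}: no scoring rubric")

    # Unequal weights are allowed only with a stated, data-based justification.
    weights = {cue.get("weight", 1) for cue in cues}
    if len(weights) > 1 and not model.get("weight_justification"):
        problems.append("unequal weights without a data-based justification")

    caveats = model.get("caveats", {})
    required = ["apply_consistently", "cue_validity"]
    if model.get("about_individuals"):
        required.append("fairness")
    for key in required:
        if not caveats.get(key):
            problems.append(f"missing caveat: {key}")
    return problems


# A deliberately flawed, made-up draft.
draft = {
    "recurring": True,
    "outcome": "lead becomes closed-won within 90 days",
    "cues": [
        {"name": "icp_fit", "reason": "closed-won rate is higher inside the ICP",
         "rubric": "2/1/0", "weight": 1},
        {"name": "gut_feel", "rubric": "2/1/0", "weight": 3},
    ],
    "caveats": {"apply_consistently": "score every lead the same way"},
}

problems = check_model(draft)
for problem in problems:
    print(problem)
```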

Tier S. Across decades and many domains, mechanical/actuarial combination of cues equals or beats holistic expert judgment in the large majority of studies (Meehl 1954; Grove et al. 2000 meta-analysis), and even equal-weight “improper” models capture most of the benefit (Dawes 1979); the driver is reduced inconsistency/noise (Kahneman, Noise, 2021). It applies to repeated judgments, is only as good as its cues, and raises fairness considerations for judgments about people. Evidence is from human expert-vs-model studies, transferred to AI use, not AI-validated. Full grading: evidence/dossier.md.

See references/EXAMPLE.md for a completed scoring model.

A full worked run (the shared Northwind scenario)

A completed run of think-linear-model-aggregation, on the shared Northwind scenario. This is the quality bar a generated model should meet.

Northwind is a B2B SaaS company. Its reps decide by gut, and inconsistently, which inbound leads are “worth pursuing”. (This is the recurring judgment behind the lead-scoring signal that natural-frequency-bayesian examined.) This skill replaces the gut call with a simple consistent rule.


  • The recurring judgment: is an inbound lead worth a rep’s time to pursue?
  • Outcome it predicts: the lead becomes a closed-won opportunity within 90 days.

| Cue | Why it plausibly predicts the outcome | Weight | How it is scored (rubric) |
| --- | --- | --- | --- |
| ICP firmographic fit (size, segment) | Closed-won rate is far higher inside the ICP | 1 (equal) | 2 = strong fit, 1 = partial, 0 = outside ICP |
| Engagement depth (key-action completed in trial) | Activated trials convert far more often | 1 | 2 = key action done, 1 = logged in, 0 = neither |
| Budget/authority signal (named buyer, stated budget) | Deals without a buyer rarely close in 90 days | 1 | 2 = both, 1 = one, 0 = neither |
| Inbound source quality (referral/demo-request vs cold list) | Higher-intent sources close more | 1 | 2 = referral/demo, 1 = content, 0 = cold |

(Equal weights, per the evidence that simple weights capture most of the benefit. Revisit weights only if outcome data justifies it.)

  • Combine: sum of the four cue scores (range 0-8).
  • Threshold / rule: score >= 5 -> rep pursues now; 3-4 -> nurture queue; <= 2 -> auto-decline.
  • Apply consistently: every lead scored the same way; no overriding the score on a hunch (“I have a good feeling”) - that gut override is exactly the inconsistency this removes.
  • Only as good as its cues: check quarterly whether high-scoring leads actually closed more than low-scoring ones; drop a cue that shows no predictive signal.
  • Fairness / ethics: this scores accounts, not protected individual attributes; keep it to firmographic/behavioral cues and review for proxy bias before any automation.
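
The Northwind rule is small enough to write out in full. In this sketch the formula (sum of four cue scores, range 0-8) and the three bands come from the model above; the cue identifiers, function names, and the two sample leads are hypothetical.

```python
# Cue identifiers are made up for this sketch; each is scored 0, 1, or 2.
CUES = ("icp_fit", "engagement_depth", "budget_authority", "source_quality")


def score_lead(lead: dict) -> int:
    """Sum of the four rubric scores (equal weights), range 0-8."""
    for cue in CUES:
        if lead[cue] not in (0, 1, 2):  # KeyError if a cue was not scored
            raise ValueError(f"{cue} must be 0, 1, or 2, got {lead[cue]!r}")
    return sum(lead[cue] for cue in CUES)


def route(score: int) -> str:
    """Decision rule: >= 5 pursue now; 3-4 nurture queue; <= 2 auto-decline."""
    if score >= 5:
        return "pursue now"
    if score >= 3:
        return "nurture queue"
    return "auto-decline"


# Two invented leads, not real Northwind data.
lead_a = {"icp_fit": 2, "engagement_depth": 1, "budget_authority": 1, "source_quality": 2}
lead_b = {"icp_fit": 1, "engagement_depth": 1, "budget_authority": 0, "source_quality": 1}

for lead in (lead_a, lead_b):
    total = score_lead(lead)
    print(total, route(total))
```

Here `lead_a` sums to 6 and is pursued now, while `lead_b` sums to 3 and goes to the nurture queue. Because routing takes only the score, a rep's "good feeling" has no place to enter.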

Note: the value is consistency, not cleverness. The reps’ holistic “good lead?” call varied day to day; a flat 4-cue rule, applied the same way every time, can be expected to match or beat that gut judgment per Meehl/Dawes - and it is auditable. This is the right tool because the judgment recurs hundreds of times; a one-off “which of these 3 deals to chase” would instead use decision-option-review.

What the research does and does not show, with graded sources

Evidence Dossier: Linear-Model Aggregation (Mechanical Combination)

Single source of truth for the linear-model-aggregation skill. The SKILL.md, sidecar, and evals derive from this. A strong-evidence anchor (named empirical core).

Skill: thinking-framework-skills.linear-model-aggregation (installable name think-linear-model-aggregation)
Family: decision-and-option-evaluation
Evidence tier: S (one of the most replicated findings in judgment research)
Confidence: High that simple consistent rules match or beat holistic judgment for repeated predictions
Status: draft (authored 2026-05-31 from the discovery corpus)

1. The mechanism (what actually does the work)

For a repeated predictive or evaluative judgment - screening candidates, scoring leads, triaging tickets, rating applications - holistic expert intuition is unreliable mainly because it is inconsistent: the same expert, given the same case on different days, reaches different conclusions (noise), and weights cues differently each time. A simple mechanical rule removes that inconsistency: pick a few predictive cues, assign weights (even equal weights work), score each case on each cue, combine by a fixed formula, and apply it the same way every time.

The counterintuitive, robust result: such rules - including “improper” ones with equal or roughly-guessed weights - reliably match or beat holistic expert judgment, because consistency beats brilliance-applied-erratically. The skill’s value is producing that rule and committing to applying it consistently.
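
The mechanism can be illustrated with a toy simulation. This is not evidence and does not reproduce any cited study: the true weights, both noise levels, and the sample size are assumptions chosen only to show how day-to-day inconsistency can cost more than imperfect weights. The simulated "expert" knows the true weights exactly but adds random noise to each judgment; the rule uses equal weights and adds none.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)                 # fixed seed so the run is repeatable
TRUE_WEIGHTS = (0.4, 0.3, 0.2, 0.1)    # assumed; unknown in real settings
OUTCOME_NOISE = 0.5                    # assumed irreducible noise in the outcome
EXPERT_NOISE = 0.6                     # assumed day-to-day judgment noise
N_CASES = 5000

outcomes, expert, equal_rule = [], [], []
for _ in range(N_CASES):
    cues = [rng.gauss(0, 1) for _ in TRUE_WEIGHTS]
    signal = sum(w * c for w, c in zip(TRUE_WEIGHTS, cues))
    outcomes.append(signal + rng.gauss(0, OUTCOME_NOISE))
    expert.append(signal + rng.gauss(0, EXPERT_NOISE))  # right weights, noisy
    equal_rule.append(sum(cues))                         # wrong weights, consistent

r_expert = pearson(expert, outcomes)
r_rule = pearson(equal_rule, outcomes)
print(f"noisy expert r = {r_expert:.2f}, equal-weight rule r = {r_rule:.2f}")
```

Under these particular assumptions the expected correlations are roughly 0.50 for the noisy expert and roughly 0.67 for the equal-weight rule. Lowering `EXPERT_NOISE` far enough reverses the ordering, which is the point: the rule's advantage comes from the noise it removes, not from better weights.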

Two honest constraints are built in: (1) the model is only as good as its cues - garbage cues give a confident garbage model; (2) this is for repeated judgments of the same kind, not unique strategic one-offs.

  • Paul Meehl, Clinical versus Statistical Prediction (1954) - actuarial beats clinical.
  • Robyn Dawes, “The robust beauty of improper linear models in decision making” (1979) - even equal-weight models beat experts.
  • Grove & Meehl (1996) and the Grove et al. (2000) meta-analysis.
  • Kahneman, Noise (2021) - inconsistency (noise) as the mechanism.

No trademark. Named descriptively.

3. What the evidence shows, and what it does NOT show

Strongly supported (the S): across decades and many domains (clinical, hiring, lending, academic admission, parole), mechanical/actuarial combination of cues equals or outperforms holistic expert judgment in the large majority of studies (Grove et al. 2000 meta-analysis). Improper linear models (equal/unit weights) capture most of the benefit (Dawes 1979). The driver is reduced inconsistency.

What it does NOT show / boundaries (honest):

  • It applies to repeated judgments where outcomes are (eventually) measurable, not to genuinely unique strategic one-offs. For a one-time choice among a few options, use a decision-option review, not a predictive model.
  • The model is only as good as its cues: cue selection requires real predictive validity; a tidy formula on bad cues is worse than honest uncertainty.
  • Mechanical scoring of individual people (hiring, lending, justice) carries fairness, legal, and ethical considerations the skill must flag, not ignore.
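
The first two boundaries suggest a simple check once outcomes come in: for each cue, does the outcome rate rise with the rubric score? A minimal sketch follows, assuming each record pairs a case's cue scores with whether the outcome occurred. The function name and the eight records are made up for illustration, and a sample this small supports no real conclusion.

```python
from collections import defaultdict


def outcome_rate_by_level(records, cue):
    """Share of positive outcomes at each rubric level of one cue."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, total]
    for scores, outcome in records:
        bucket = counts[scores[cue]]
        bucket[0] += int(outcome)
        bucket[1] += 1
    return {level: pos / total for level, (pos, total) in sorted(counts.items())}


# Invented records: (cue scores, did the outcome occur?)
records = [
    ({"fit": 2, "source": 0}, True),
    ({"fit": 2, "source": 2}, True),
    ({"fit": 2, "source": 1}, False),
    ({"fit": 0, "source": 2}, False),
    ({"fit": 0, "source": 0}, False),
    ({"fit": 1, "source": 1}, True),
    ({"fit": 1, "source": 2}, False),
    ({"fit": 0, "source": 1}, False),
]

fit_rates = outcome_rate_by_level(records, "fit")
source_rates = outcome_rate_by_level(records, "source")
print("fit:", fit_rates)
print("source:", source_rates)
```

In this invented data the outcome rate climbs with `fit` (0, then 1/2, then 2/3) but not with `source` (1/2, then 1/3, then 1/3). On a real, adequately sized sample, a flat or reversed pattern like the second would make that cue a candidate for removal.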

The evidence comes from studies comparing human expert judgment with statistical models. It is transferred to AI use, not validated there: an LLM is itself prone to inconsistent holistic gestalt across cases. The AI value: forcing an explicit, fixed, few-cue rule applied identically across cases removes that inconsistency and makes the judgment inspectable and auditable - the model produces and then follows the rule rather than re-deciding holistically each time.

Works best when: the same kind of evaluative judgment recurs (screening candidates, scoring leads/deals, triaging, prioritizing a queue); gut calls are inconsistent or overconfident; a few cues with real predictive signal exist.

Fails or misleads when (poor-fit / anti-patterns):

  • A genuinely one-off decision (use decision-option-review).
  • No predictive cues / data - inventing cues and weights produces false precision (the central failure).
  • Over-engineering the weights (the evidence says equal/simple weights are fine; do not fake precision).
  • Applying the model inconsistently or overriding it case-by-case on a hunch (which reintroduces the noise it removes).
  • High-stakes judgments about individuals where mechanical scoring raises fairness/legal/ethical issues - flag these, do not silently automate.

A scoring model: the judgment it is for; the few predictive cues (with why each is plausibly predictive); the weights (equal-weight as the honest default unless data justifies otherwise); the per-cue scoring rubric; the combination formula; a threshold/decision rule; and an explicit “apply consistently” note plus the cue-validity and fairness caveats.

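  1. Meehl, P. E. (1954). Clinical versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. University of Minnesota Press.
  2. Dawes, R. M. (1979). “The robust beauty of improper linear models in decision making.” American Psychologist.
  3. Grove, W. M. et al. (2000). “Clinical versus mechanical prediction: A meta-analysis.” Psychological Assessment.
  4. Kahneman, D., Sibony, O., & Sunstein, C. R. (2021). Noise: A Flaw in Human Judgment. Identifies inconsistency (noise) as the mechanism.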

Verification status: the Meehl/Dawes/Grove results are well-attested and frequently replicated; confirm the Grove 2000 meta-analytic specifics before a public quantified claim. The “only as good as its cues” and fairness caveats are mandatory honest framing.

Thinking Framework Skills v0.3.0 · 38 frameworks