Evidence vs Inference Sort

Reasoning degrades when evidence (what is actually observed or verifiable) is blended with inference (what is deduced) and assumption (an unstated premise). Models are especially prone to this: they present fluent inference in the same confident register as fact. This skill separates them: it labels each claim in a body of text as evidence, inference, or assumption, records the basis for each, attaches a confidence level to inferences, and flags anything uncited. The output is an evidence/inference ledger. Note the boundary: this classifies claim type; it does not verify that the evidence is true (that is a separate, fact-checking job).

When to Use

A recommendation, plan, or conclusion must be trusted before it is acted on.
High-stakes contexts: legal, medical, financial, safety, architecture and planning.
Auditing the reasoning behind a conclusion, including the agent’s own.
As a step in a reasoning-audit workflow.

When NOT to Use

As a fact-checker. It labels what kind of claim something is, not whether it is true.
On creative or exploratory work where rigor is not the point.
On trivial claims, where sorting produces only noise.
When the claims are already well-sourced and the leaps are already explicit.

Instructions

When asked to sort evidence from inference, follow these steps:

Collect the claims. Break the prompt, document, or proposed conclusion into discrete claims. Keep each to one assertion.
Label each claim. Mark it Evidence (observed or verifiable, with a source), Inference (deduced from other claims), or Assumption (an unstated premise it depends on).
Record the basis. For evidence, name the source or observation. For inference, name what it is inferred from. For assumption, state the premise plainly.
Rate inference confidence. For each inference, assign high / medium / low and say why. Do not treat plausibility as verification.
Flag the gaps. Mark anything presented as fact but uncited, and any load-bearing assumption that is unexamined.
Surface the load-bearing unknowns. List the few unsupported claims that most need verification before the conclusion is trusted.
Emit the ledger per references/TEMPLATE.md.
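
The steps above can be sketched as a small data model. This is a minimal illustration, not the skill's actual implementation: `ClaimType`, `LedgerRow`, `flag`, and `load_bearing_unknowns` are hypothetical names, and the `load_bearing` field is an assumption of the sketch that the analyst sets by judgment. Labeling itself (step 2) still requires a reader; the code only records the label and derives the flags.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ClaimType(Enum):
    EVIDENCE = "Evidence"      # observed or verifiable, with a source
    INFERENCE = "Inference"    # deduced from other claims
    ASSUMPTION = "Assumption"  # a premise the conclusion depends on


@dataclass
class LedgerRow:
    claim: str                        # one assertion only
    type: ClaimType
    basis: Optional[str] = None       # source, inferred-from, or premise
    confidence: Optional[str] = None  # high / medium / low, inferences only
    reason: Optional[str] = None      # why that confidence was assigned
    load_bearing: bool = False        # hypothetical field: analyst's judgment


def flag(row: LedgerRow) -> Optional[str]:
    """Flag uncited evidence and load-bearing assumptions (step 5)."""
    if row.type is ClaimType.EVIDENCE and not row.basis:
        return "uncited"
    if row.type is ClaimType.ASSUMPTION and row.load_bearing:
        return "unexamined"
    return None


def load_bearing_unknowns(rows: list[LedgerRow]) -> list[LedgerRow]:
    """Return the flagged rows that need verification first (step 6)."""
    return [row for row in rows if flag(row) is not None]
```

Keeping the flag as a derived value rather than a stored field means a row cannot be marked as sourced evidence while its basis is empty.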

Output Format

Use the template in references/TEMPLATE.md. The deliverable is the ledger plus the load-bearing-unknowns list, not prose.

Quality Checklist

Before finalizing, verify:

Every claim is labeled evidence, inference, or assumption.
No confident inference is mislabeled as evidence.
Inferences carry a confidence level with a reason.
Uncited “facts” and unexamined assumptions are flagged.
The output does not claim to have verified truth, only sorted claim type.
The output is the ledger artifact, not prose.
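
Some of these checks are mechanical and could be automated. The sketch below assumes each ledger row is a plain dict with hypothetical keys `type`, `basis`, `confidence`, `reason`, and `flag`; the real row shape is whatever references/TEMPLATE.md defines. The check that no confident inference is mislabeled as evidence is deliberately absent, because it needs a reader's judgment.

```python
VALID_TYPES = {"Evidence", "Inference", "Assumption"}
VALID_CONFIDENCE = {"high", "medium", "low"}


def check_ledger(rows: list[dict]) -> list[str]:
    """Return checklist violations; an empty list means the mechanical checks pass."""
    problems = []
    for i, row in enumerate(rows, start=1):
        kind = row.get("type")
        # Every claim must carry one of the three labels.
        if kind not in VALID_TYPES:
            problems.append(f"row {i}: type must be Evidence, Inference, or Assumption")
        # Inferences need a confidence level and a reason for it.
        if kind == "Inference":
            if row.get("confidence") not in VALID_CONFIDENCE:
                problems.append(f"row {i}: inference lacks a high/medium/low confidence")
            if not row.get("reason"):
                problems.append(f"row {i}: inference confidence lacks a reason")
        # Evidence without a source must at least be flagged as uncited.
        if kind == "Evidence" and not row.get("basis") and not row.get("flag"):
            problems.append(f"row {i}: uncited evidence is not flagged")
    return problems
```

An empty result means only that the ledger is well-formed. It says nothing about whether the labels are right or the evidence is true.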

Evidence

Tier P (practitioner). The evidence/inference distinction is a foundational critical-thinking competence (Facione, Delphi Report 1990), and explicit critical-thinking instruction shows moderate gains broadly (strongest for argument mapping, an adjacent technique). The specific “sort into a ledger” method is practitioner-grade: its evidence is transferred from human contexts, not AI-validated, and the skill classifies claim type; it does not verify truth. Full grading: evidence/dossier.md.

Examples

See references/EXAMPLE.md for a completed ledger.

Deep dive: worked example

A full worked run (the shared Northwind scenario)

Evidence / Inference Ledger - Worked Example

A completed run of think-evidence-vs-inference-sort, on the shared Northwind scenario. This is the quality bar a generated ledger should meet.

Northwind is a B2B SaaS company weighing a self-serve free-tier launch. Here the skill audits the reasoning in the proposal that argues for it.

Subject

What was sorted: the proposal “We should build a free tier: our competitors all have one, it will triple signups, and more signups means more revenue, so it pays for itself.”

Ledger

#	Claim	Type	Basis or source	Confidence (inferences)	Flag
1	Our competitors all have a free tier	Evidence (claimed)	None given	-	uncited - presented as fact, not sourced
2	A free tier will triple signups	Inference	Extrapolated from competitor presence	low - no comparable baseline or test cited	-
3	More signups means more revenue	Assumption	Depends on free-to-paid conversion holding	-	unexamined - the load-bearing premise
4	The free tier “pays for itself”	Inference	Chains claims 2 and 3	low - inherits the weakness of both	-
5	Q3 growth target requires this approach	Assumption	No alternatives were compared	-	unexamined - assumes no cheaper option exists
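
Claim 4's rating ("inherits the weakness of both") reads like a weakest-link rule: a chained inference can be no more confident than its weakest premise. The sketch below is one way to express that. `chained_confidence` is a hypothetical helper, and treating an unexamined assumption such as claim 3 as "low" is an assumption of this sketch, not something the skill specifies.

```python
# Ordering of the three confidence levels, weakest first.
ORDER = {"low": 0, "medium": 1, "high": 2}


def chained_confidence(own: str, premises: list[str]) -> str:
    """Cap an inference's confidence at the weakest of its premises."""
    return min([own, *premises], key=ORDER.__getitem__)


# Claim 4 chains claim 2 (low) and claim 3 (unexamined, treated as low here).
claim_4 = chained_confidence("medium", ["low", "low"])
```

Under this rule a chain can only lose confidence as it grows, which is why the ledger rates claim 4 no higher than claims 2 and 3.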

Load-bearing unknowns

Free-to-paid conversion for Northwind’s ideal customer profile (ICP) (claim 3): the whole case rests on it; verify against current self-serve conversion data before committing.
Competitor free-tier reality (claim 1): presented as fact; confirm which competitors actually offer one and on what terms.
Cheaper alternatives (claim 5): untested assumption that a free tier is the only path to the target.

Note: the value is catching that claims 3 and 5 are unexamined assumptions doing the heavy lifting, and that claim 1 is an uncited assertion wearing the costume of evidence. The skill did not verify any fact; it exposed which “facts” still need verifying.

Grounding: the full evidence dossier

What the research does and does not show, with graded sources

Evidence Dossier: Evidence vs Inference Sort

Single source of truth for the evidence-vs-inference-sort skill. The SKILL.md, sidecar, and evals derive from this. If a claim is not here, it does not belong in the skill.


Skill	`thinking-framework-skills.evidence-vs-inference-sort` (installable name `think-evidence-vs-inference-sort`)
Family	reasoning-clarity
Evidence tier	P (practitioner; the underlying critical-thinking competence has broader support)
Confidence	High that the distinction matters; the specific sort is a practitioner technique
Status	draft (authored 2026-05-31 from the discovery corpus)

1. The mechanism (what actually does the work)

Reasoning degrades when evidence (what is actually observed or verifiable) is silently blended with inference (what is deduced) and assumption (an unstated premise taken for granted). Language models are especially prone to this: they are probabilistic generators that present fluent inference in the same confident register as fact. The skill forces a separation: take a body of claims (a prompt, a document, or a proposed conclusion) and label each unit as evidence, inference, or assumption, record the basis for each, attach a confidence level to inferences, and flag uncited or unsupported claims. The work is done by making the leaps visible, so they can be challenged before they are built on.

Important boundary: this skill classifies claim type; it does not verify that the evidence is true. It separates “this is presented as fact” from “this is a deduction”; confirming the facts is a different job.

2. Lineage

Facione’s critical-thinking consensus (the 1990 Delphi Report) defines evaluation (judging the credibility of statements and the strength of inferential relationships) and inference (drawing reasonable conclusions from evidence) as distinct core skills. This skill operationalizes that distinction.
The intelligence-analysis tradition (structured analytic techniques) similarly insists on separating evidence from judgment and surfacing assumptions.

No trademark. Named descriptively.

3. What the evidence shows, and what it does NOT show

Supported: distinguishing evidence from inference is a foundational critical-thinking competence, and explicit critical-thinking instruction shows moderate gains in reasoning broadly (the strongest related result is for argument mapping, effect sizes around 0.7-0.85; that is adjacent, not this exact technique).

NOT shown: there is no controlled evidence that this specific “sort into a ledger” technique improves decisions or that an AI performing it improves a human’s judgment. Grade the technique as practitioner, not as a proven intervention. Do not imply the sort verifies truth.

4. Transferred-evidence flag

Evidence is from human critical-thinking and analysis contexts, not AI-augmented use. Transferred, not AI-validated. The AI value is concrete: a model blends fact and inference by default, so an explicit sort is a direct counter, and the ledger is an inspectable artifact a reviewer can challenge.

5. When it works / when it fails

Works best when: a conclusion or proposal needs to be trusted; in legal, medical, financial, safety, or architecture-planning contexts; when auditing the reasoning behind a recommendation (the agent’s own or another’s).

Fails or misleads when (poor-fit / anti-patterns):

The reader mistakes it for fact-checking. It labels claim type; it does not confirm the evidence is true.
Applied to creative or exploratory work where rigor is not the point.
Over-applied to trivial claims, producing noise.
Confident inference is mislabeled as evidence (the central failure mode), or plausibility is treated as verification.

6. Output artifact

An evidence / inference ledger: a table where each claim is tagged Evidence | Inference | Assumption, with its basis or source, a confidence level for inferences, and a flag for anything uncited or unsupported, followed by a short list of the load-bearing unsupported claims that most need verification.
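
As a rough illustration of the artifact's shape, the sketch below renders rows into a tab-separated table. The column names are taken from the worked example's ledger. The function name `render_ledger`, the dict keys, and the "None given" and "-" placeholders for empty cells are assumptions. The authoritative layout is references/TEMPLATE.md, which is not reproduced here.

```python
COLUMNS = ["#", "Claim", "Type", "Basis or source", "Confidence (inferences)", "Flag"]


def render_ledger(rows: list[dict]) -> str:
    """Render ledger rows as a tab-separated table with a header line."""
    lines = ["\t".join(COLUMNS)]
    for n, row in enumerate(rows, start=1):
        cells = [
            str(n),
            row["claim"],
            row["type"],
            row.get("basis") or "None given",  # missing source is shown, not hidden
            row.get("confidence") or "-",      # only inferences carry confidence
            row.get("flag") or "-",
        ]
        lines.append("\t".join(cells))
    return "\n".join(lines)
```

Printing a missing basis as "None given" rather than leaving the cell blank keeps an uncited claim visible to a reviewer scanning the table.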

7. Sources

Facione, P. A. (1990). Critical Thinking: A Statement of Expert Consensus (the Delphi Report) - evaluation vs inference as distinct skills.
Structured analytic techniques literature (intelligence analysis) - separating evidence from judgment; key-assumptions checks.
(Adjacent) van Gelder and others on argument mapping effect sizes - supports the broader critical-thinking competence, not this exact technique.

Verification status: the Facione/Delphi distinction is well-attested. The argument-mapping effect sizes are adjacent evidence and should not be presented as evidence for this technique in any public claim; they support the family, not the sort.

Thinking Framework Skills v0.3.0 · 38 frameworks