Does this actually work?

Most thinking-tool collections ask you to take their value on faith. This library measures two things you can check, and publishes the raw results.

The two questions a library like this must answer

Does the right tool fire?

If you describe a situation, does the catalog route you to the framework that actually fits, and stay quiet when no tool fits? This is what the Framework Advisor depends on. Measured by the trigger eval.

Is the output any good?

When a framework runs, does it produce a structured artifact that meets its own quality checklist, not a wall of prose? Measured by the output eval.

Trigger eval: the catalog routes cleanly

Every trigger and anti-case from the shipped skills’ own eval/cases.md was pooled into a blind answer key. Blind router agents - which never see which skill authored a case or the expected answer - routed each situation against the public advisor catalog, exactly as the live advisor does. A deterministic scorer graded routed-versus-expected.
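A deterministic scorer of this kind can be quite small. The sketch below is an illustration only, not the harness's actual code: the `Case` type, the `score` function, and the rule that a false-fire means the anti-case's own skill appears as the router's first pick are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Case:
    # Hypothetical shape of one pooled answer-key entry.
    kind: str   # "trigger" (skill should fire) or "anti" (skill must not fire)
    skill: str  # the skill the case is about; never shown to the router


def score(cases: Sequence[Case], rankings: Sequence[Sequence[str]]) -> dict:
    """Grade routed-versus-expected.

    rankings[i] is the router's ranked picks for cases[i];
    an empty ranking means the router stayed quiet.
    """
    totals = {"triggers": 0, "top1": 0, "top3": 0, "anti": 0, "false_fires": 0}
    for case, ranked in zip(cases, rankings):
        first_pick = list(ranked[:1])
        if case.kind == "trigger":
            totals["triggers"] += 1
            totals["top1"] += first_pick == [case.skill]
            totals["top3"] += case.skill in ranked[:3]
        else:
            totals["anti"] += 1
            # Assumed rule: a false-fire is the forbidden skill winning the first pick.
            totals["false_fires"] += first_pick == [case.skill]
    return totals


# Toy data, invented for the example (not from the real answer key).
cases = [Case("trigger", "a"), Case("trigger", "b"), Case("anti", "c")]
rankings = [["a", "x"], ["x", "b"], []]
result = score(cases, rankings)
```

Because the scorer only compares strings, the same inputs always give the same totals; any judgment lives in the routers, not in the grading.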

0 false-fires

Across 280 anti-cases (deliberately wrong-tool or no-tool situations), no skill wrongly grabbed one: 280 of 280 were correctly left alone. This is the metric that matters most, because it shows the skills do not over-trigger.

99% top-1, 100% top-3

Of 281 “should fire” situations, 278 (98.9%) routed to the intended framework on the first pick, and all 281 had it in the top 3. The 3 first-pick misses were between near-twin frameworks.

The headline: 561 cases, zero false-fires, 99% top-1 routing. The catalog discriminates - the descriptions and anti-triggers send a situation to the right framework.
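The headline figures follow directly from the counts quoted above. A short arithmetic check, using only those published counts (the variable names are illustrative):

```python
# Counts as published in the trigger eval summary.
anti_cases = 280
trigger_cases = 281
top1_hits = 278
top3_hits = 281
false_fires = 0

total_cases = anti_cases + trigger_cases
top1_rate = top1_hits / trigger_cases
top3_rate = top3_hits / trigger_cases
false_fire_rate = false_fires / anti_cases

print(total_cases)               # 561
print(f"{top1_rate:.1%}")        # 98.9%
print(f"{top3_rate:.1%}")        # 100.0%
print(f"{false_fire_rate:.1%}")  # 0.0%
```

The 99% top-1 figure is 98.9% rounded to the nearest whole percent.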

Output eval: the artifacts hit their own bar

A separate measurement. For each skill, a producer agent runs it on a real prompt and emits the full artifact; a separate judge agent grades that artifact against the skill’s own “output checks,” check by check, so nothing grades itself.

99% of checks passed

311 of 315 individual output checks passed across freshly produced artifacts. 43 of 47 skills satisfied every single check.
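The two numbers count different things: one is per check, the other per skill. A minimal sketch of how per-check verdicts could roll up into both, assuming the judge's output is a mapping from skill name to a list of pass/fail booleans (that shape and the `summarize` name are assumptions, not the harness's format):

```python
from typing import Mapping, Sequence


def summarize(verdicts: Mapping[str, Sequence[bool]]) -> dict:
    """Roll per-check verdicts up into check-level and skill-level totals."""
    checks = [passed for per_skill in verdicts.values() for passed in per_skill]
    return {
        "checks_passed": sum(checks),
        "checks_total": len(checks),
        # A skill counts as clean only if every one of its checks passed.
        "skills_clean": sum(all(per_skill) for per_skill in verdicts.values()),
        "skills_total": len(verdicts),
    }


# Toy verdicts, invented for the example (not real results).
toy = {
    "skill-a": [True, True, True],
    "skill-b": [True, False, True],
    "skill-c": [True, True],
}
summary = summarize(toy)
```

Applied to the published counts, 311 of 315 checks is about 98.7%, and 43 of 47 skills is about 91.5%, so a single failed check is enough to take a skill out of the all-checks-passed group.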

The misses were specific

The four failures clustered on one element: the evidence caveat dropped from the artifact when a skill was run cold. A precise, fixable gap, not a vague score; the flagged skills were tightened so the caveat now ships by construction.

What this does not prove

Check it yourself

The methodology

How the blind routers, answer keys, and scorers work is documented in the harness (scripts/eval/). It runs without an API key and is reproducible.

The raw results

The full per-skill scorecards and machine records live in the repo (docs/internal/eval-results/) - every number above traces to a .json you can audit.

The evidence behind each method

Separately from behavior, every framework carries a graded dossier. Trace any claim to its source in the bibliography.

How a page is built

The pages are generated from the skills, so the docs cannot drift from what runs. See how to read a page.

Thinking Framework Skills v0.8.0 · 56 frameworks