Most thinking-tool collections ask you to take their value on faith. This library measures two things you can check, and publishes the raw results.
Does the right tool fire?
If you describe a situation, does the catalog route you to the framework that actually fits, and stay quiet when no tool fits? This is what the Framework Advisor depends on. Measured by the trigger eval.
Is the output any good?
When a framework runs, does it produce a structured artifact that meets its own quality checklist, not a wall of prose? Measured by the output eval.
Every trigger and anti-case from the shipped skills’ own eval/cases.md was pooled into a blind answer key. Blind router agents, which never see which skill authored a case or what the expected answer is, routed each situation against the public advisor catalog, exactly as the live advisor does. A deterministic scorer then compared each routed result with the expected one.
0 false-fires
Across 280 anti-cases (deliberately wrong-tool or no-tool situations), no skill wrongly fired on any of them: 280 of 280 were handled correctly (100%). This is the metric that matters most, because it shows the skills do not over-trigger.
99% top-1, 100% top-3
Of 281 “should fire” situations, 278 routed to the intended framework on the first pick (98.9%, reported as 99%), and all 281 had it in the top 3. The 3 first-pick misses are near-twins.
The headline: 561 cases (280 anti-cases plus 281 “should fire” situations), zero false-fires, 99% top-1 routing. The catalog discriminates: the descriptions and anti-triggers send a situation to the right framework.
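The routed-versus-expected grading described above can be sketched in a few lines of Python. This is an illustrative sketch, not the harness in scripts/eval/: the `Case` record, its field names, the skill names, and the rule that a false-fire means "the skill an anti-case was written against came back as the first pick" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    kind: str            # "trigger" (should fire) or "anti" (should not fire)
    skill: str           # skill the case was written for; hidden from routers
    routed: tuple = ()   # router's ranked picks; () means it stayed quiet


def pct(part, whole):
    """Whole-number percentage, matching how the headline figures are shown."""
    return round(100 * part / whole)


def score(cases):
    """Deterministic tally: no model calls, just comparisons against the key."""
    triggers = [c for c in cases if c.kind == "trigger"]
    antis = [c for c in cases if c.kind == "anti"]
    return {
        "top1": sum(c.routed[:1] == (c.skill,) for c in triggers),
        "top3": sum(c.skill in c.routed[:3] for c in triggers),
        # Assumed definition: the anti-case's own skill was the first pick.
        "false_fires": sum(c.routed[:1] == (c.skill,) for c in antis),
        "triggers": len(triggers),
        "antis": len(antis),
    }


# Toy data with made-up skill names, not the real answer key.
cases = [
    Case("trigger", "pre-mortem", ("pre-mortem", "five-whys")),
    Case("trigger", "five-whys", ("pre-mortem", "five-whys")),  # top-3 hit only
    Case("anti", "pre-mortem", ()),                              # router stayed quiet
    Case("anti", "five-whys", ("pre-mortem",)),                  # a different skill fired
]
result = score(cases)
print(result)
print(pct(278, 281), pct(281, 281))  # published trigger counts as percentages
```

Under a stricter reading, any pick at all on a no-tool anti-case would also count as a false-fire; the sketch does not distinguish wrong-tool from no-tool cases, which the real scorer may well do.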
A separate measurement. For each skill, a producer agent runs it on a real prompt and emits the full artifact; a separate judge agent grades that artifact against the skill’s own “output checks,” check by check, so nothing grades itself.
99% of checks passed
311 of 315 individual output checks passed across freshly produced artifacts. 43 of 47 skills satisfied every single check.
The misses were specific
The four failures clustered on one element: the evidence caveat was missing from the artifact when a skill was run cold. That is a precise, fixable gap rather than a vague score, and the flagged skills were tightened so the caveat now ships by construction.
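To show how check-by-check verdicts roll up into the figures in this section, here is a minimal Python sketch. The `verdicts` shape, the skill names, and the check names are hypothetical, chosen only for illustration; the real machine records may be organized differently.

```python
from collections import Counter


def summarize(verdicts):
    """verdicts maps skill name -> {check name: True if the judge passed it}."""
    total = sum(len(checks) for checks in verdicts.values())
    passed = sum(sum(checks.values()) for checks in verdicts.values())
    clean = sum(all(checks.values()) for checks in verdicts.values())
    # Grouping failures by check name is what exposes a cluster on one element.
    failed_by_check = Counter(
        name
        for checks in verdicts.values()
        for name, ok in checks.items()
        if not ok
    )
    return {
        "checks_passed": passed,
        "checks_total": total,
        "skills_clean": clean,
        "skills_total": len(verdicts),
        "failed_by_check": dict(failed_by_check),
    }


# Toy verdicts with made-up names, not the published results.
verdicts = {
    "pre-mortem": {"structured-artifact": True, "evidence-caveat": False},
    "five-whys": {"structured-artifact": True, "evidence-caveat": True},
    "swot": {"structured-artifact": True, "evidence-caveat": False},
}
summary = summarize(verdicts)
print(summary)
```

Counting failures per check name, rather than reporting only an overall pass rate, is what turns a score into a specific fix: in the toy data every failure lands on the same check.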
The methodology
How the blind routers, answer keys, and scorers work is documented in the harness (scripts/eval/). It runs without an API key and is reproducible.
The raw results
The full per-skill scorecards and machine records live in the repo (docs/internal/eval-results/); every number above traces to a .json you can audit.
The evidence behind each method
Separately from behavior, every framework carries a graded dossier. Trace any claim to its source in the bibliography.
How a page is built
The pages are generated from the skills, so the docs cannot drift from what runs. See how to read a page.