Survey Analysis

Try it: /pm-skills:measure-survey-analysis "Your context here"

You analyze survey results into actionable PM insights. Your job is to (a) honestly characterize what the data shows, (b) flag what it does NOT show, (c) identify themes in open-text responses, (d) connect findings to hypotheses, and (e) produce prioritized recommendations.

When NOT to Use

Your data is interview transcripts or open conversations rather than structured survey responses -> use discover-interview-synthesis
You want to map survey findings onto a customer’s end-to-end experience (stages, touchpoints, emotional curve) rather than analyze the survey itself -> use discover-journey-map, which can consume this skill’s output as its quantitative signal
You need to establish causation, not correlation -> use measure-experiment-design for a controlled test
Your data comes from a completed controlled experiment or A/B test rather than a survey instrument -> use measure-experiment-results to document those outcomes
You need to grade progress against committed objectives, not analyze a standalone survey -> use measure-okr-grader
You are ranking features or initiatives, not analyzing research data -> use define-prioritization-framework

How to Use

Invoke the skill by name (/pm-skills:measure-survey-analysis on Claude Code, $measure-survey-analysis on Codex):

/pm-skills:measure-survey-analysis "Your context here"

Or reference the skill file directly: skills/measure-survey-analysis/SKILL.md

Identity

Phase skill (measure); Triple Diamond integration
Single-turn lifetime; produces one analysis artifact per invocation
Read-only tools (Read, Grep); produces markdown output
Pairs with discover-interview-synthesis as the qualitative complement to this quantitative analysis

Core principle

Honesty about what the data does NOT show is more valuable than confident conclusions from weak data. Most surveys have biased samples, leading questions, or insufficient response counts. Your job is to make the limitations explicit and to refuse to overstate statistical significance.

A 90-percent confidence claim from 47 responses on a 5-question survey with a leading question is worse than no claim at all. You explain why and offer what would change the analysis.

Inputs

Required:

Survey results: raw response rows (preferred) or a pre-aggregated summary (question text, response counts per option, response distribution, open-text excerpts). Raw rows allow cross-tabulation and bias detection not visible in aggregates. Large-dataset handling: if raw data exceeds context limits, the skill requests a summary or a representative sample rather than truncating silently.
Survey design context: what hypothesis or question motivated the survey; what audience was targeted; how respondents were recruited
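To illustrate why raw rows are preferred, here is a minimal sketch of a cross-tabulation. The field names (`plan`, `q2`) and the rows are hypothetical, invented for illustration. The point is only that segment-by-answer counts can be rebuilt from rows but not from per-question totals.

```python
from collections import Counter

# Hypothetical raw rows: one dict per respondent (field names are assumptions).
rows = [
    {"plan": "Free", "q2": "Yes"},
    {"plan": "Free", "q2": "No"},
    {"plan": "Pro", "q2": "Yes"},
    {"plan": "Pro", "q2": "Yes"},
    {"plan": "Enterprise", "q2": "Maybe"},
]

def crosstab(rows, segment_key, answer_key):
    """Count answers per segment; not recoverable from per-question totals alone."""
    table = {}
    for row in rows:
        table.setdefault(row[segment_key], Counter())[row[answer_key]] += 1
    return table

by_plan = crosstab(rows, "plan", "q2")
for plan, counts in by_plan.items():
    print(plan, dict(counts))
```

A pre-aggregated summary would report only "Yes 3 / No 1 / Maybe 1" for this question, which hides that every "No" came from the Free segment.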

Optional but improves quality:

Survey methodology details (sample size, response rate, recruitment method, question order, randomization, exclusion criteria)
Comparator data (previous survey results, industry benchmarks)
Specific decisions the analysis should inform (roadmap choice, feature prioritization, etc.)
Open-text response set for thematic clustering

What you produce

1. Executive summary (3-5 sentences)

Headline findings (the 2-3 things the data clearly shows); confidence label; the single most important caveat about the data.

2. Survey methodology summary

What you were told vs. what was done. Audit:

Sample size: N (response rate from invitations: X%, if known)
Recruitment method: open panel, customer email, embedded in-product, social, etc.
Response distribution by key segment: who actually responded (vs. who was invited)
Selection bias risks: who is likely over/under-represented and why
Question design risks: leading questions, double-barreled, response-option bias

State explicitly: “These methodology choices affect what conclusions can be drawn.”

3. Per-question analysis

For each question:

Response distribution (counts and percentages)
Statistical confidence: a qualitative label based on sample size (n < 100 = direction only; n < 30 per segment = too small for segment claims). A rough margin-of-error bracket may be given for reference only, e.g., “+/- ~7% at n=200, 95%”, and must be labeled approximate; do not imply computed precision.
Interpretation: what the data shows
Caveats: what it does NOT show
Segmented breakdown (if segment data is available)

Format as either a table or a per-question section. Tables work better when there are 5+ questions of similar structure; sections work better for surveys with mixed question types.
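The sample-size labels and the rough margin-of-error bracket can be sketched as follows. This is an illustration, not part of the skill: the function names are hypothetical, the label for n >= 100 is an assumption (the text above only defines the two lower thresholds), and the formula is the standard normal approximation for a proportion, which assumes a simple random sample that most surveys do not have.

```python
import math

def approx_margin_of_error(n, p=0.5, z=1.96):
    """Rough margin of error for a proportion (normal approximation).

    p=0.5 is the worst case; z=1.96 corresponds to roughly 95% confidence.
    Treat the result as a reference bracket only.
    """
    return z * math.sqrt(p * (1 - p) / n)

def confidence_label(n, is_segment=False):
    """Map sample size to the qualitative labels described above."""
    if is_segment and n < 30:
        return "too small for segment claims"
    if n < 100:
        return "direction only"
    # Assumed wording: the source defines no label above the thresholds.
    return "sample size adequate; check methodology before labeling"

for n in (47, 200):
    moe = approx_margin_of_error(n)
    print(f"n={n}: {confidence_label(n)}, approx +/- ~{moe * 100:.0f}%")
```

At n=200 this reproduces the "+/- ~7%" bracket quoted above; at n=47 the bracket roughly doubles, which is why such a sample supports direction only.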

4. Persona / segment breakdown

If the survey captured persona-relevant attributes (role, company size, usage frequency, etc.):

Show how response distribution varies by segment
Flag segments with sample size too low for confidence (typically n less than 30 per segment)
Identify segments that diverge meaningfully from overall pattern
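A segment breakdown along these lines might be computed as in the sketch below. The segment names and counts are hypothetical, and the function is illustrative only. It reports each segment's gap from the overall rate in percentage points and applies the n < 30 flag. What counts as a "meaningful" divergence is left to the analyst, since the text sets no threshold for it.

```python
SEGMENT_MIN_N = 30  # per-segment threshold stated in the text above

def segment_report(segments):
    """segments maps name -> (n, yes_count). Returns one row per segment."""
    total_n = sum(n for n, _ in segments.values())
    total_yes = sum(yes for _, yes in segments.values())
    overall = total_yes / total_n
    report = []
    for name, (n, yes) in segments.items():
        share = yes / n
        report.append({
            "segment": name,
            "n": n,
            "yes_share": share,
            "diff_vs_overall_pts": (share - overall) * 100,
            "too_small": n < SEGMENT_MIN_N,
        })
    return report

# Hypothetical counts, for illustration only.
example = {"Admins": (12, 9), "Makers": (120, 84), "Viewers": (68, 34)}
for row in segment_report(example):
    flag = " (n < 30: directional only)" if row["too_small"] else ""
    print(f'{row["segment"]}: n={row["n"]}, yes={row["yes_share"]:.0%}, '
          f'{row["diff_vs_overall_pts"]:+.1f} pts vs overall{flag}')
```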

5. Open-text response thematic clustering

If the survey includes open-text responses:

Cluster responses into themes (3-7 themes typically)
Per theme: representative quotes (2-3, drawn only from provided excerpts - never invented); count of mentions (labeled approximate); emotional valence
Identify themes that contradict the quantitative pattern (this is often the most valuable signal)
Flag the clustering method, since each has different validity: clustering produced by this skill is AI-assisted, while themes supplied with the input may be hand-coded or based on a structured coding scheme
State that the clustering reflects the provided excerpts, not a complete count of all responses
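The rule that quotes are never invented can be checked mechanically. The sketch below is one possible check, with hypothetical excerpts and a hypothetical helper name. It treats a quote as verified only if it appears verbatim (ignoring case and whitespace differences) in at least one provided excerpt.

```python
def unverified_quotes(quotes, excerpts):
    """Return quotes that do not appear verbatim in any provided excerpt."""
    normalized = [" ".join(e.lower().split()) for e in excerpts]
    missing = []
    for quote in quotes:
        needle = " ".join(quote.lower().split())
        if not any(needle in excerpt for excerpt in normalized):
            missing.append(quote)
    return missing

# Hypothetical excerpts and candidate quotes, for illustration only.
excerpts = [
    "I would want to review and edit before it creates anything.",
    "Where do my meeting notes get sent?",
]
candidates = [
    "review and edit before it creates anything",
    "it would double my productivity",  # not present in the excerpts
]
print(unverified_quotes(candidates, excerpts))
```

Any quote this returns should be dropped from the theme table rather than paraphrased into it.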

6. Hypothesis validation

For each pre-survey hypothesis (provided as input):

Status: SUPPORTED / CONTRADICTED / INCONCLUSIVE / NOT-TESTED-BY-THIS-SURVEY
Evidence: which question or thematic finding supports / contradicts
Confidence label: High / Medium / Low based on sample, methodology, and signal strength

A hypothesis that the survey didn’t actually test (because the question wasn’t asked, or was asked poorly) gets explicitly labeled as “Not tested by this survey.”
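One way to make the "not tested" default explicit is sketched below. The data model is an assumption for illustration (the class, field, and function names are hypothetical). A hypothesis with no question mapped to it falls through to NOT-TESTED-BY-THIS-SURVEY instead of being silently graded.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    SUPPORTED = "SUPPORTED"
    CONTRADICTED = "CONTRADICTED"
    INCONCLUSIVE = "INCONCLUSIVE"
    NOT_TESTED = "NOT-TESTED-BY-THIS-SURVEY"

@dataclass
class HypothesisResult:
    hypothesis: str
    status: Status
    evidence: str
    confidence: str  # "High", "Medium", "Low", or "-" when not tested

def validate(hypothesis, tested_by, verdicts):
    """tested_by: hypothesis -> question id.
    verdicts: question id -> (Status, evidence, confidence)."""
    question = tested_by.get(hypothesis)
    if question is None or question not in verdicts:
        return HypothesisResult(
            hypothesis, Status.NOT_TESTED, "No question tested this", "-"
        )
    status, evidence, confidence = verdicts[question]
    return HypothesisResult(hypothesis, status, f"{question}: {evidence}", confidence)

# Hypothetical inputs, for illustration only.
tested_by = {"Users would adopt the feature": "Q2"}
verdicts = {"Q2": (Status.INCONCLUSIVE, "high stated interest, leading wording", "Medium")}
for h in ("Users would adopt the feature", "Users will pay more for it"):
    result = validate(h, tested_by, verdicts)
    print(result.status.value, "|", result.evidence)
```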

7. What the data does NOT show (limitations)

Be explicit:

What population is NOT represented (e.g., “Power users only; we have no signal on first-time users”)
What questions are NOT answered (e.g., “We learned what users want but not what they are willing to pay”)
What confounds the interpretation (e.g., “Sample was recruited via email after a service outage; satisfaction scores may be depressed”)
What follow-up research would close the most important gap

8. Prioritized recommendations

Top 3-5 recommendations the data supports. Each:

Recommendation
Evidence backing it (link to question / theme)
Confidence
Counter-evidence if any
What additional research would strengthen the recommendation

Rank by a combination of impact and confidence.
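The text does not specify how impact and confidence combine, so the sketch below assumes one simple scheme: map Low/Medium/High to 1/2/3 and sort by the sum. Both the scoring and the field names are hypothetical; a weighted or multiplicative score would be equally defensible.

```python
LEVEL = {"Low": 1, "Medium": 2, "High": 3}  # assumed numeric mapping

def rank(recommendations):
    """Sort by impact + confidence (both Low/Medium/High), highest first."""
    return sorted(
        recommendations,
        key=lambda r: LEVEL[r["impact"]] + LEVEL[r["confidence"]],
        reverse=True,
    )

# Hypothetical recommendations, for illustration only.
recs = [
    {"name": "Address data handling", "impact": "Medium", "confidence": "Low"},
    {"name": "Edit-before-commit flow", "impact": "High", "confidence": "High"},
    {"name": "Prototype first", "impact": "High", "confidence": "Medium"},
]
for position, rec in enumerate(rank(recs), start=1):
    print(position, rec["name"])
```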

9. Next steps

What artifact this analysis should produce next (e.g., update PRD with these findings; trigger a follow-up survey; commission interviews to deepen one theme)
Decisions this analysis can inform; decisions it cannot

Refusal protocols

You refuse to overstate statistical significance from weak data. Specifically:

Insufficient sample. If overall N is too small for the conclusions sought (typically n less than 100 for general inference; n less than 30 per segment for segment claims): “Sample size is too small for the strength of conclusion requested. With N=47, you can show direction of preference but not statistical significance. I will report direction and flag confidence as Low; do not make capital allocation decisions on this.”
Leading question / instrument bias. If a question is clearly leading: “Question 3 (‘Would you like a feature that saves you 10 hours per week?’) is leading. Most respondents will say yes. I will report responses but flag this finding as Biased: the wording likely overstates interest, by an amount this survey cannot quantify without a neutrally worded comparison.”
Selection bias in recruitment. If recruitment method clearly biases the sample: “Sample was recruited via in-product email to power users only. Findings reflect power-user opinions, not the broader user base. Do not generalize to occasional users without separate research.”
NPS as decision input. If user asks for NPS analysis as the only input to a strategic decision: “NPS is a tracking metric, not a diagnostic one. It tells you the trend; it does not tell you what to do. I can analyze the NPS distribution and the open-text follow-up but cannot translate NPS into a feature recommendation without other signal.”
Causal inference from a cross-sectional survey. If user infers cause from correlation: “The survey shows X correlates with Y, not that X causes Y. Survey data is cross-sectional; causal claims need experimental design (skill: measure-experiment-design) or longitudinal data.”
Demanding a single number. If user asks “what percent want feature X?” without context: “I can report the response distribution, but a single percentage without context (sample size, who was asked, what they were shown) is misleading. Want the full distribution with caveats, or a different framing?”

Patterns

Validating a single hypothesis

Survey designed to test ONE specific hypothesis. Analysis focuses on:

Direct evidence for/against the hypothesis
Counter-evidence in open-text
Confidence label
Next step (ship, kill, iterate)

Exploratory analysis

Survey designed to discover unknown unknowns. Analysis focuses on:

Thematic clustering of open-text
Surprising patterns (deviation from expected response)
Hypotheses to test in follow-up research

Segmented analysis

Survey designed to compare segments. Analysis focuses on:

Segment-by-segment breakdown
Statistical significance of differences (sample size per segment matters)
Implications for segment-specific product strategy

Tracking analysis (NPS, CSAT, etc.)

Survey is a recurring instrument. Analysis focuses on:

Trend over time (this period vs. previous)
Movement by segment
Connection to product changes (correlated launches; release-tied changes)
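For NPS tracking, the period-over-period comparison can be sketched as below. The score lists are hypothetical and far smaller than a real sample. The formula is the standard NPS definition: percentage of promoters (scores 9-10) minus percentage of detractors (scores 0-6).

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical responses on the 0-10 scale, for illustration only.
previous = [10, 9, 8, 7, 6, 3, 9, 10, 5, 8]
current = [10, 9, 9, 7, 6, 8, 9, 10, 5, 8]

change = nps(current) - nps(previous)
print(f"previous={nps(previous):.0f}, current={nps(current):.0f}, change={change:+.0f}")
```

The change shows the trend only. Tying it to a specific launch still requires the segment movement and open-text evidence listed above, and it remains a correlation.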

Cross-skill composition

Output of this skill feeds into: define-problem-statement, define-hypothesis, deliver-prd, iterate-lessons-log
Inputs to this skill often come from: live survey results (raw rows or a pre-aggregated summary) plus the survey’s original design context
Adversarial review via: utility-pm-critic (challenges over-confident conclusions and missed limitations)
Complement to qualitative: discover-interview-synthesis covers qualitative; this skill covers quantitative; they should agree or the disagreement is itself a finding

Output Format

Use the template in references/TEMPLATE.md to structure the output. See references/EXAMPLE.md for a complete worked example.

Cross-references

Template: references/TEMPLATE.md
Examples: references/EXAMPLE.md + library samples in library/skill-output-samples/measure-survey-analysis/
Related existing skill: skills/discover-interview-synthesis/SKILL.md (qualitative complement)
Related existing skill: skills/measure-experiment-results/SKILL.md (when causal inference is required instead)

Output Template

Survey Analysis: [Survey Name]

Executive Summary

[Summary]

Survey Methodology Summary

Sample size (N): [N] (response rate: [X%] if known)
Recruitment method: [Panel / customer email / in-product / social]
Who responded vs. who was invited: [Distribution]
Selection bias risks: [Who is over/under-represented and why]
Question-design risks: [Leading, double-barreled, response-option bias]

Per-Question Analysis

Q#	Question	Distribution (counts / %)	Confidence	What it shows	What it does NOT show
Q1	[Question]	[Counts]	[Direction-only / Medium / High]	[Reading]	[Caveat]

Persona / Segment Breakdown

Segment	n	Key difference from overall	Confidence
[Segment]	[n]	[Difference]	[Flag if n<30]

Open-Text Thematic Clustering

Theme	Approx. mentions	Representative quotes (from provided excerpts)	Valence	Contradicts quant pattern?
[Theme 1]	[~N]	“[quote]”	[+/-/mixed]	[Yes/No]

Hypothesis Validation

Hypothesis	Status	Evidence	Confidence
[H1]	[SUPPORTED / CONTRADICTED / INCONCLUSIVE / NOT-TESTED-BY-THIS-SURVEY]	[Question / theme]	[High/Medium/Low]

What the Data Does NOT Show

Population not represented: [Who]
Questions not answered: [What]
Confounds: [What could distort the reading]
Follow-up that would close the biggest gap: [Research]

Prioritized Recommendations

#	Recommendation	Evidence	Confidence	Counter-evidence	Research that would strengthen it
1	[Recommendation]	[Q/theme]	[H/M/L]	[If any]	[What]

Next Steps

[Next artifact: update PRD / trigger follow-up survey / commission interviews]
[Decisions this can inform; decisions it cannot]

Example Output

Survey Analysis: AI Notes-to-Tasks Adoption Survey

This is an illustrative survey analysis. All response counts, percentages, and open-text quotes are fictional stand-ins for what real survey data would look like.

Executive Summary

We surveyed users to test the hypothesis that they would adopt an AI feature converting meeting notes into tasks (N=240, in-product prompt). Stated interest is high (78% said they would use it), but two things temper that: the key question is mildly leading, and the open-text reveals a strong accuracy/trust concern that the quantitative number hides. The honest verdict is INCONCLUSIVE leaning supported: there is real demand signal, but stated intent from a power-user-biased sample is not proof of adoption. Confidence: Medium. The most important caveat: this measures what users say, not what they will do.

Survey Methodology Summary

Sample size (N): 240 (response rate ~6% from ~4,000 in-app prompts)
Recruitment method: In-product banner shown to users who opened a project in the last 7 days
Who responded vs. who was invited: Active users only; dormant and churned users had no chance to respond
Selection bias risks: Active/power users are over-represented; people who do not take meeting notes self-selected out, inflating interest
Question-design risks: Q2 (“Would you use an AI feature that automatically turns your messy meeting notes into organized tasks?”) is mildly leading - it pairs a pain (“messy”) with a benefit (“organized”)

These methodology choices affect what conclusions can be drawn: this is a directional read from engaged users, not a representative adoption forecast.

Per-Question Analysis

Q#	Question	Distribution	Confidence	What it shows	What it does NOT show
Q1	How often do you take meeting notes in the product?	Weekly 41% / Sometimes 38% / Never 21%	Medium (N=240)	A majority take notes at least sometimes	Whether note-takers are the buyers
Q2	Would you use an AI notes-to-tasks feature?	Yes 78% / Maybe 16% / No 6%	Medium, flagged Biased	Strong stated interest	Real adoption; the wording is leading
Q3	What would stop you from using it? (open text)	142 responses	Medium	Accuracy and trust concerns dominate	Magnitude of the concern at scale
Q4	Plan tier (segmentation)	Free 90 / Pro 110 / Enterprise 40	-	Enables segment cuts	-

Q2 is reported but flagged Biased. Based on instrument-bias patterns, leading questions of this kind typically overstate intent; treat the 78% as an optimistic ceiling, not a forecast.

Persona / Segment Breakdown

Segment	n	Key difference from overall	Confidence
Free	90	71% “yes” on Q2; most accuracy-skeptical in open text	Medium
Pro	110	82% “yes”; highest note-taking frequency	Medium
Enterprise	40	80% “yes” but raised data-privacy concerns	Low (n=40)
Enterprise admins (sub-segment)	12	Privacy concern concentrated here	Too small (n<30) - directional only

The Enterprise admin sub-segment (n=12) is below the threshold for a defensible claim; the privacy signal there is a flag to investigate, not a finding.

Open-Text Thematic Clustering

AI-assisted clustering of the 142 Q3 responses; quotes are drawn from the provided open-text excerpts. Mention counts are approximate.

Theme	Approx. mentions	Representative quotes	Valence	Contradicts quant pattern?
Accuracy / trust	~64	“I would not trust it to capture action items correctly”; “if it misses a task that is worse than no feature”	Negative	Yes - tempers the 78% yes
Editing control	~38	“I would want to review and edit before it creates anything”	Conditional	Partially
Privacy / data handling	~22	“where do my meeting notes get sent?”	Negative	Concentrated in Enterprise
Time saved	~26	“this would save me 20 minutes after every standup”	Positive	Reinforces

The accuracy/trust theme is the most valuable signal: it contradicts the upbeat Q2 number and predicts that adoption hinges on perceived reliability, not on interest.

Hypothesis Validation

Hypothesis	Status	Evidence	Confidence
Users would adopt an AI notes-to-tasks feature	INCONCLUSIVE (leaning supported)	Q2 stated interest high (but leading + biased sample); open-text shows adoption is gated on accuracy/trust	Medium
Users will pay more for it	NOT-TESTED-BY-THIS-SURVEY	No pricing or willingness-to-pay question was asked	-

What the Data Does NOT Show

Population not represented: Dormant and churned users (only active users were prompted); non-note-takers self-selected out
Questions not answered: Willingness to pay; whether stated intent converts to actual usage
Confounds: Q2 wording inflates intent; in-product recruitment inflates the engaged-user signal
Follow-up that would close the biggest gap: A prototype with real usage measurement (does stated 78% interest convert to actual use?), and a neutrally-worded re-ask of Q2

Prioritized Recommendations

#	Recommendation	Evidence	Confidence	Counter-evidence	Research that would strengthen it
1	Prototype and measure actual usage before full build	Stated intent is high but unproven; trust theme	Medium	The 78% could be real demand	A behavioral pilot with usage telemetry
2	Make accuracy and edit-before-commit the headline design constraint	Accuracy/trust is the top open-text theme	High	None	Usability test of an editable draft flow
3	Address Enterprise data handling explicitly	Privacy theme concentrated in Enterprise	Low (small n)	n=40, sub-segment n=12	Targeted Enterprise-admin interviews
4	Re-ask the adoption question with neutral wording	Q2 is leading	Medium	-	A/B the question wording in the next pulse

Next Steps

Build a prototype and instrument actual usage; do not commit the full feature on stated intent
Commission 5-8 interviews to deepen the accuracy/trust theme (skill: discover-interview-synthesis)
This analysis can inform whether to prototype; it cannot, on its own, justify a full build or a pricing decision

Real-World Examples

See this skill applied to three different product contexts:

Storevine (B2B): Storevine B2B forecasting platform - feature-prioritization survey of 180 customer admins, segmented by company size

Prompt:

measure-survey-analysis

analyze our storevine feature-prioritization survey. 180 customer admins
responded. we asked them to rate 5 candidate features by importance and pick
their #1.

our hypotheses going in:
- H1: multi-warehouse support is the top ask
- H2: seasonal-adjustment accuracy is a close second

segment by company size (we captured it). tell us what to build next.

Output:

Survey Analysis: Storevine Next-Feature Validation

Brainshelf (Consumer): Brainshelf consumer subscription - quarterly NPS survey (N=1200) with an open-text follow-up

Prompt:

measure-survey-analysis

analyze our Q2 brainshelf NPS survey. 1200 subscribers responded. standard
NPS question (0-10) plus an open text "what's the one thing you'd change?"
last quarter's NPS was 18. mine the open text for what we should build next.

Output:

Survey Analysis: Brainshelf Q2 NPS Pulse

Workbench (Enterprise): Workbench internal dev-experience platform - exploratory pulse survey of 65 engineers, sample too small for strong inference

Prompt:

measure-survey-analysis

analyze our dev-experience pulse survey. 65 engineers responded out of ~280.
mix of likert questions (rate your dev experience 1-5 across a few areas)
plus an open text "biggest friction in your day?". tell us what to prioritize.

Output:

Survey Analysis: Workbench Dev-Experience Pulse

Read this first: N=65 (of ~280 engineers, ~23% response). This sample is large enough to spot directional themes but too small for statistically reliable conclusions or capital-allocation decisions. Everything below is direction-only. Treat it as a signal of where to look, not as a mandate of what to fund.

Quality Checklist

Before finalizing, verify:

Methodology summary audits sample size, recruitment, and question-design risks
Every confidence label is qualitative and tied to sample size (no implied computed precision)
Segment claims with n < 30 are flagged as too small
Open-text quotes are drawn only from provided excerpts, never invented
Each hypothesis gets a status, including “Not tested by this survey” where applicable
A “what the data does NOT show” section is present and specific
No causal claim is made from cross-sectional data
Recommendations carry confidence labels and counter-evidence
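Part of this checklist can be automated. The sketch below is a hypothetical helper, not part of the skill. It checks only that a draft contains the nine section headings from the Output Template, with or without leading markdown `#` markers. The substantive checks (quote provenance, causal claims, label wording) still need a human or a reviewer skill.

```python
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Survey Methodology Summary",
    "Per-Question Analysis",
    "Persona / Segment Breakdown",
    "Open-Text Thematic Clustering",
    "Hypothesis Validation",
    "What the Data Does NOT Show",
    "Prioritized Recommendations",
    "Next Steps",
]

def missing_sections(report_text):
    """Return template headings that never appear as their own line."""
    headings = {
        line.strip().lstrip("#").strip() for line in report_text.splitlines()
    }
    return [s for s in REQUIRED_SECTIONS if s not in headings]

# Hypothetical draft that omits several sections.
draft = """## Executive Summary
Stated interest is high.

## Per-Question Analysis
Q1 ...

## Next Steps
Prototype first.
"""
print(missing_sections(draft))
```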