Quick facts
Phase: Measure | Version: 2.0.0 | Category: reflection | License: Apache-2.0
Experiment Results¶
An experiment results document captures what happened when you tested a hypothesis, including statistical outcomes, segment analysis, learnings, and clear recommendations. Good results documentation turns individual experiments into organizational knowledge that improves future decision-making.
When to Use¶
- After an A/B test or experiment reaches statistical significance
- When an experiment is ended early (for any reason)
- To communicate findings to stakeholders who weren't involved
- During decision-making about whether to ship, iterate, or kill a feature
- To build a repository of learnings that inform future experiments
How to Use¶
Use the /experiment-results slash command, or reference the skill file directly: skills/measure-experiment-results/SKILL.md
Instructions¶
When asked to document experiment results, follow these steps:
1. **Summarize the Experiment.** Provide context: what was tested, when it ran, how much traffic it received. Link to the original experiment design document if one exists.
2. **Restate the Hypothesis.** Remind readers what you believed would happen and why. This frames the interpretation of the results.
3. **Present Primary Results.** Show the primary metric outcome clearly: what were the values for control and treatment? Include statistical significance (p-value), confidence intervals, and sample sizes. Be honest about whether results are conclusive.
4. **Analyze Secondary Metrics.** Present guardrail metrics that ensure you didn't cause unintended harm. Note any secondary metrics that moved unexpectedly, both positive and negative.
5. **Segment the Data.** Look for differential effects across user segments (platform, tenure, plan type, etc.). Sometimes overall results mask important segment-level insights.
6. **Extract Learnings.** What did you learn beyond the numbers? Include surprising findings, questions raised, and implications for the product hypothesis. Negative results are valuable learnings.
7. **Make a Recommendation.** Be clear: should we ship, iterate, or kill? Support the recommendation with the evidence. If the decision is nuanced, explain the trade-offs.
8. **Define Next Steps.** Specify what happens now: engineering work to ship, follow-up experiments, metrics to continue monitoring, or documentation to update.
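The significance figures called for in step 3 are typically produced by a standard two-proportion z-test. A minimal stdlib-only sketch (the conversion counts below are illustrative, not taken from any real experiment):

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test plus a 95% CI for the absolute difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the hypothesis test (H0: the rates are equal).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal CDF
    # Unpooled standard error for the confidence interval around the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, z, p_value, ci

diff, z, p, ci = two_proportion_ztest(conv_a=1200, n_a=10000, conv_b=1300, n_b=10000)
print(f"diff={diff:+.4f}  z={z:.2f}  p={p:.4f}  95% CI=({ci[0]:+.4f}, {ci[1]:+.4f})")
```

Reporting the confidence interval alongside the p-value, as the template below requires, tells readers not just *whether* the treatment won but *how large* the effect plausibly is.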
Output Template¶
Experiment Results: [Experiment Name]¶
Summary¶
| Attribute | Value |
|---|---|
| Experiment ID | [ID] |
| Experiment Name | [Name] |
| Status | Completed / Ended Early / Inconclusive |
| Duration | [Start date] to [End date] ([X] days) |
| Traffic Allocation | [X]% control / [Y]% treatment |
| Total Sample Size | [N] users |
| Owner | [Name] |
| Design Doc | [Link to experiment design] |
Hypothesis Recap¶
Original Hypothesis:
We believed that [change/treatment] would [expected outcome] because [rationale].
Success Criteria:
- Primary metric: [Metric] improves by [X]%
- Statistical significance: p < 0.05
- Minimum sample size: [N]
Results¶
Primary Metric: [Metric Name]¶
| Variant | Value | Sample Size | Confidence Interval |
|---|---|---|---|
| Control | [X.XX%] | [N] | [Lower] - [Upper] |
| Treatment | [X.XX%] | [N] | [Lower] - [Upper] |
Observed Difference: [±X.XX%] ([Relative change]%)
Statistical Significance:
- p-value: [X.XXX]
- Confidence level: [XX]%
- Statistically significant: [Yes/No]
Interpretation:
[Plain language explanation of what these numbers mean]
Secondary Metrics¶
| Metric | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| [Metric 1] | [Value] | [Value] | [±X%] | [Yes/No] |
| [Metric 2] | [Value] | [Value] | [±X%] | [Yes/No] |
| [Metric 3] | [Value] | [Value] | [±X%] | [Yes/No] |
Guardrail Metrics¶
| Metric | Control | Treatment | Threshold | Status |
|---|---|---|---|---|
| [Metric 1] | [Value] | [Value] | No degradation > [X]% | Pass/Fail |
| [Metric 2] | [Value] | [Value] | No degradation > [X]% | Pass/Fail |
Segment Analysis¶
By [Segment Dimension 1]¶
| Segment | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| [Segment A] | [Value] | [Value] | [±X%] | [Yes/No] |
| [Segment B] | [Value] | [Value] | [±X%] | [Yes/No] |
| [Segment C] | [Value] | [Value] | [±X%] | [Yes/No] |
By [Segment Dimension 2]¶
| Segment | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| [Segment A] | [Value] | [Value] | [±X%] | [Yes/No] |
| [Segment B] | [Value] | [Value] | [±X%] | [Yes/No] |
Segment Insights¶
[Notable findings from segment analysis - where did treatment work better/worse?]
Visualization¶
Primary Metric Over Time¶
[Link to chart or describe trend]
Conversion Funnel Impact¶
[Link to chart or describe impact on funnel]
Learnings¶
What We Learned¶
- [Learning 1]: [Description and evidence]
- [Learning 2]: [Description and evidence]
- [Learning 3]: [Description and evidence]
Surprising Findings¶
- [Unexpected result 1]
- [Unexpected result 2]
What We Still Don't Know¶
- [Open question 1]
- [Open question 2]
Recommendation¶
Decision: [Ship / Iterate / Kill]¶
Rationale:
[Clear explanation of why this is the recommendation]
If Shipping¶
- [Engineering task 1]
- [Engineering task 2]
- [Documentation update]
- [Metrics to continue monitoring]
If Iterating¶
- What to change: [Changes based on learnings]
- Next experiment: [Brief description]
- Timeline: [When to run]
If Killing¶
- Why: [Clear explanation]
- Learnings to preserve: [What we take forward]
- Alternative approaches: [What else might work]
Next Steps¶
| Action | Owner | Due Date |
|---|---|---|
| [Action 1] | [Name] | [Date] |
| [Action 2] | [Name] | [Date] |
| [Action 3] | [Name] | [Date] |
Appendix¶
Raw Data¶
[Link to data/dashboard]
Statistical Methodology¶
- Test type: [Chi-squared / t-test / etc.]
- Power calculation: [Details]
- Multiple comparison correction: [If applicable]
Known Issues¶
- [Any data quality issues or caveats]
Results documented on [date]. Decision implemented on [date].
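The template's "Minimum sample size" field usually comes from a power calculation done before the experiment. A minimal sketch of the common approximation for a two-proportion test (the baseline rate and lift are illustrative assumptions):

```python
import math
from statistics import NormalDist

def min_sample_per_variant(p_base: float, abs_lift: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_alt = p_base + abs_lift
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_power) ** 2 * variance / abs_lift ** 2)

# e.g. 10% baseline conversion, aiming to detect a 2-percentage-point absolute lift
print(min_sample_per_variant(0.10, 0.02))
```

Note the quadratic relationship: halving the detectable lift roughly quadruples the required sample size, which is why the minimum detectable effect should be agreed before the test starts.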
Example Output¶
Experiment Results: One-Page Checkout vs. Multi-Step Checkout¶
Summary¶
| Attribute | Value |
|---|---|
| Experiment ID | EXP-2026-001 |
| Experiment Name | One-Page Checkout |
| Status | Completed |
| Duration | December 15, 2025 to January 10, 2026 (26 days) |
| Traffic Allocation | 50% control / 50% treatment |
| Total Sample Size | 47,832 users (23,891 control, 23,941 treatment) |
| Owner | Sarah Martinez, Product Manager |
| Design Doc | EXP-2026-001 Design (internal link) |
Hypothesis Recap¶
Original Hypothesis:
We believed that consolidating our 3-step checkout into a single page with accordion sections would increase checkout conversion rate because reducing perceived complexity and eliminating page loads would reduce drop-off at each step.
Success Criteria:
- Primary metric: Checkout conversion rate improves by at least 3%
- Statistical significance: p < 0.05 (two-tailed)
- Minimum sample size: 20,000 users per variant
Results¶
Primary Metric: Checkout Conversion Rate¶
| Variant | Value | Sample Size | Confidence Interval (95%) |
|---|---|---|---|
| Control (3-step) | 62.4% | 23,891 | 61.8% - 63.0% |
| Treatment (1-page) | 65.6% | 23,941 | 65.0% - 66.2% |
Observed Difference: +3.2 percentage points (+5.1% relative improvement)
Statistical Significance:
- p-value: 0.0003
- Confidence level: 95%
- Statistically significant: Yes
Interpretation:
The one-page checkout significantly outperformed the 3-step checkout. We can be 95% confident that the true improvement is between 1.8 and 4.6 percentage points. The result exceeds our 3% threshold for success.
Secondary Metrics¶
| Metric | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| Cart-to-checkout start | 78.2% | 79.1% | +0.9% | No (p=0.12) |
| Average order value | $87.42 | $86.91 | -0.6% | No (p=0.34) |
| Items per order | 2.31 | 2.28 | -1.3% | No (p=0.22) |
| Time in checkout | 4.2 min | 3.1 min | -26.2% | Yes (p<0.001) |
| Payment errors | 2.1% | 1.8% | -14.3% | Yes (p=0.03) |
Guardrail Metrics¶
| Metric | Control | Treatment | Threshold | Status |
|---|---|---|---|---|
| Return rate (7-day) | 8.2% | 8.4% | No increase > 1% | Pass |
| Customer support tickets | 0.9% | 0.7% | No increase > 0.5% | Pass |
| Payment failure rate | 3.4% | 3.2% | No increase > 0.5% | Pass |
Segment Analysis¶
By Device Type¶
| Segment | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| Desktop | 68.3% | 69.5% | +1.2% | No (p=0.18) |
| Mobile | 54.2% | 59.3% | +5.1% | Yes (p<0.001) |
| Tablet | 61.7% | 64.1% | +2.4% | No (p=0.09) |
By Customer Type¶
| Segment | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| New customers | 51.8% | 57.2% | +5.4% | Yes (p<0.001) |
| Returning customers | 71.3% | 72.1% | +0.8% | No (p=0.28) |
By Cart Value¶
| Segment | Control | Treatment | Difference | Significant? |
|---|---|---|---|---|
| < $50 | 65.4% | 68.9% | +3.5% | Yes (p=0.008) |
| $50-$100 | 62.1% | 65.2% | +3.1% | Yes (p=0.02) |
| > $100 | 58.7% | 61.4% | +2.7% | No (p=0.07) |
Segment Insights¶
The treatment effect was strongest on mobile devices (+5.1% vs +1.2% desktop) and for new customers (+5.4% vs +0.8% returning). This makes sense: new customers benefit most from reduced cognitive load, and mobile users benefit most from fewer page loads and scrolling.
Returning customers showed minimal improvement, likely because they've already learned the existing checkout flow.
Visualization¶
Conversion Rate Over Time¶
The treatment consistently outperformed control throughout the experiment period. No significant interaction with time was observed (e.g., novelty effect wearing off).
```
Conversion Rate by Day

70% |
    |        ████████████████████████ (Treatment)
65% |████████████████████████████
    |
60% |████████████████████████████ (Control)
    |
55% |________________________________
     Dec 15                    Jan 10
```
Funnel Drop-off Comparison¶
| Step | Control Drop-off | Treatment Drop-off | Improvement |
|---|---|---|---|
| Cart → Checkout | 21.8% | 20.9% | +0.9% |
| Shipping info | 8.4% | N/A (combined) | — |
| Payment info | 5.2% | N/A (combined) | — |
| Review → Purchase | 4.1% | 3.5% | +0.6% |
| Total checkout drop | 37.6% | 34.4% | +3.2% |
Learnings¶
What We Learned¶
- **Reducing page loads matters more than expected.** The 26% reduction in checkout time suggests page load latency was a bigger friction point than the form complexity itself. Users were abandoning during transitions between steps.
- **Mobile optimization is high-leverage.** Mobile users saw 4x the improvement of desktop users. Our multi-step checkout was particularly problematic on smaller screens, with more scrolling and tapping between pages.
- **New user experience is critical.** New customers showed 5.4% improvement vs 0.8% for returning customers. First-time checkout experience has outsized impact on conversion; returning customers have already overcome the learning curve.
- **Form length matters less than perceived progress.** The one-page design actually shows MORE fields at once, but the accordion UI creates a sense of progress and control. Perception matters more than raw field count.
Surprising Findings¶
- Payment errors decreased by 14%: We didn't expect this. Hypothesis: single-page reduces context-switching errors where users forget information between steps.
- No impact on AOV or items per order: We worried that faster checkout might mean less consideration, leading to smaller orders. This didn't materialize.
What We Still Don't Know¶
- Will the mobile improvement persist on different device types (older phones, low-bandwidth connections)?
- Is there a specific accordion section that causes the most friction? (We didn't instrument that level of detail)
- How does this interact with guest checkout vs. account checkout?
Recommendation¶
Decision: Ship to 100%¶
Rationale:
The experiment delivered a clear, statistically significant improvement that exceeded our success threshold. The +3.2% lift in checkout conversion represents approximately $840K in additional annual revenue at current traffic levels. No guardrail metrics were violated, and the improvement was consistent across the experiment duration.
The segment analysis reveals even higher impact on mobile and new customers—our strategic growth priorities.
Shipping Plan¶
- Merge feature branch to main (Eng: Chen, by Jan 17)
- Update checkout analytics events for new flow (Eng: Chen, by Jan 17)
- Remove experiment infrastructure and flags (Eng: Chen, by Jan 20)
- Update help center documentation (Support: Lisa, by Jan 22)
- Communicate change to CS team (PM: Sarah, by Jan 17)
- Monitor conversion rate daily for 2 weeks post-launch (PM: Sarah, ongoing)
Metrics to Continue Monitoring¶
- Checkout conversion rate (daily for 2 weeks, then weekly)
- Mobile vs. desktop conversion gap
- Payment error rate
- Customer support ticket volume related to checkout
Next Steps¶
| Action | Owner | Due Date |
|---|---|---|
| Ship to 100% traffic | Chen (Eng) | Jan 17 |
| Update analytics events | Chen (Eng) | Jan 17 |
| Remove experiment flags | Chen (Eng) | Jan 20 |
| Update help documentation | Lisa (Support) | Jan 22 |
| Post-launch monitoring report | Sarah (PM) | Jan 31 |
| Plan follow-up: guest checkout optimization | Sarah (PM) | Feb 1 |
Appendix¶
Raw Data¶
- Experiment Dashboard (Looker link)
- Raw data export (CSV link)
- Statistical analysis notebook (analysis link)
Statistical Methodology¶
- Test type: Two-proportion z-test for primary metric
- Multiple comparison correction: Benjamini-Hochberg for segment analysis
- Power calculation: 80% power to detect 3% lift at α=0.05 required 18,500 per variant
- Sequential analysis: Not used; experiment ran to planned end date
Known Issues¶
- Dec 24-26 excluded: Holiday traffic anomaly removed from analysis (both variants affected equally)
- Bot traffic: Standard bot filters applied; no unusual activity detected
- One variant had 0.2% more iOS traffic: Not statistically significant; analysis not adjusted
Results documented on January 14, 2026. Full rollout completed January 20, 2026.
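The example's methodology names the Benjamini-Hochberg procedure for the segment comparisons. A minimal sketch of how it controls the false discovery rate, run on the fictional segment p-values from the tables above (treating "<0.001" as 0.001):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Which hypotheses to reject while controlling the false discovery rate at q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... then reject every hypothesis ranked at or below k.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k_max
    return rejected

# Fictional segment p-values from the example, in table order:
# desktop, mobile, tablet, new, returning, <$50, $50-$100, >$100
segment_p = [0.18, 0.001, 0.09, 0.001, 0.28, 0.008, 0.02, 0.07]
print(benjamini_hochberg(segment_p))  # mobile, new, <$50, $50-$100 survive
```

Without a correction like this, running eight segment comparisons at α = 0.05 would give a substantial chance of at least one false positive, which is why the methodology section should always state how multiple comparisons were handled.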
Real-World Examples¶
See this skill applied to three different product contexts:
Storevine (B2B): Storevine B2B ecommerce platform — Campaigns guided first-campaign flow A/B test results
Prompt:
/experiment-results
Experiment: Campaigns guided first-campaign flow A/B test
Ran: April 28 – June 27, 2026 (60 days from GA)
Results:
- Control (blank canvas): 13.4% first-send rate, N=1,204 merchants [fictional]
- Treatment (guided flow): 31.7% first-send rate, N=1,198 merchants [fictional]
- p < 0.001, statistically significant
Secondary metrics:
- Median time to first send: 43 min (control) vs. 12 min (treatment) [fictional]
- Second campaign rate (90 days): 9.1% control vs. 22.8% treatment [fictional]
Guardrail metrics all within acceptable range.
Segment finding: treatment effect strongest in fashion/home merchants
(38.2% vs. 14.6% control) [fictional]; weakest in specialty food (24.1%
vs. 11.8% control) [fictional]
Decision: ship the guided flow as default. Need full results document
for stakeholder review and decision record.
Output:
Experiment Results: Campaigns Guided First-Campaign Flow A/B Test¶
Brainshelf (Consumer): Brainshelf consumer PKM app — Resurface A/B test results and ship decision
Prompt:
/experiment-results
resurface a/b test results. test ran mar 9 - apr 5 (4 weeks).
800 users, 50/50 split [fictional].
headline: treatment won.
primary metric (7-day return rate):
- control: 18.1% [fictional]
- treatment: 23.4% [fictional]
- delta: +5.3pp, p = 0.008 [fictional]
secondary (email CTR, treatment only): 17.2% [fictional]
opt-in rate: 41% of treatment group (164/400) [fictional]
guardrails all passed:
- unsub: 1.3%/week [fictional] (under 2% threshold)
- app uninstall: no difference
- save rate: no difference
segments:
- heavy savers (100+ items): return rate lift = +7.1pp [fictional]
- light savers (10-50 items): return rate lift = +3.8pp [fictional]
- daily cadence users: CTR 18.6% [fictional]
- 3x/week users: CTR 14.1% [fictional]
surprising: opt-in rate was 41% [fictional], way above the 10% target.
also surprising: week 4 return rate was HIGHER than week 1 (not
novelty decay — habit formation).
recommendation: ship to all eligible users.
Output:
Experiment Results: Resurface Daily Digest A/B Test¶
Workbench (Enterprise): Workbench enterprise collaboration platform — Blueprints required-section enforcement A/B test results
Prompt:
/experiment-results
Leo M. just finished the A/B test analysis. I need a full experiment-results document for the April 14 ship-decision meeting. Here are the results:
**Experiment details:**
- Name: Required-Section Enforcement A/B Test
- Duration: Mar 9 -- Apr 5, 2026 (28 days)
- Accounts: 80 closed-beta enterprise accounts [fictional]; 40 treatment, 40 control
- Randomization: Account-level (all users within an account see the same variant)
- Total Blueprints created during test: 612 (298 treatment, 314 control) [fictional]
**Primary metric:**
- Median time-to-approved: Treatment 2.4 days [fictional] vs. Control 4.0 days [fictional]
- Relative change: -40% [fictional]
- Mann-Whitney U test: p = 0.001
- 95% CI for median difference: 1.2 -- 2.0 days [fictional]
**Secondary metrics:**
- Empty-section submission rate: Treatment 6% [fictional] vs. Control 37% [fictional]
- Approval cycle count: Treatment 1.4 [fictional] vs. Control 2.3 [fictional]
- Blueprint completion rate (% of started Blueprints that reach approval): Treatment 78% [fictional] vs. Control 71% [fictional]
**Guardrail metrics:**
- Author satisfaction (in-app survey, 1-5 scale): Treatment 3.8 [fictional] vs. Control 3.9 [fictional] (not significant, p=0.42)
- Blueprint creation rate (Blueprints started per account per week): Treatment 3.7 [fictional] vs. Control 3.9 [fictional] (not significant, p=0.31)
- Average section word count: Treatment 142 words [fictional] vs. Control 128 words [fictional] (significant, p=0.03)
**Segment analysis:**
- By company size: Large (200+ employees): -44% time-to-approved [fictional]; Mid (50-199): -38% [fictional]; Small (10-49): -31% [fictional]
- By template type: Project kickoff templates: -45% [fictional]; Process documentation: -35% [fictional]; Meeting notes: -22% [fictional]
**Surprising finding:** Authors in the treatment group wrote 11% more words per section on average [fictional]. We think the required-section badges acted as a psychological prompt to write more thoroughly.
**Recommendation:** Ship. Required-section enforcement exceeded our MDE of 1.0 day with high confidence.
Please generate the full experiment-results document for the ship-decision meeting.
Output:
Experiment Results: Required-Section Enforcement A/B Test¶
Quality Checklist¶
Before finalizing, verify:
- Statistical methods and significance are clearly stated
- Confidence intervals are included (not just p-values)
- Segment analysis checked for differential effects
- Secondary/guardrail metrics are reported
- Learnings go beyond just the numbers
- Recommendation is clear and actionable
- Negative or inconclusive results are reported honestly
Output Format¶
Use the template in references/TEMPLATE.md to structure the output.