Experiment Design

Try it: /pm-skills:measure-experiment-design "Your context here"

An experiment design document defines all parameters needed to run a rigorous A/B test or controlled experiment. It ensures the team aligns on what you’re testing, how you’ll measure success, and how long to run the test before drawing conclusions. Good experiment design prevents common pitfalls: underpowered tests, unclear success criteria, and decisions based on noise rather than signal.

When to Use

Before launching an A/B test to validate a product change
When testing a hypothesis that requires quantitative validation
After solution design to validate assumptions before full rollout
When stakeholders want data-driven evidence for a decision
To establish a culture of experimentation and learning

When NOT to Use

The hypothesis itself is not yet articulated -> use define-hypothesis first; this skill designs the test for a claim you already have
You are analyzing a completed experiment -> use measure-experiment-results
You need the event tracking that will measure the experiment -> use measure-instrumentation-spec
You are gathering opinions rather than running a controlled test -> use measure-survey-analysis

How to Use

Invoke the skill by name (/pm-skills:measure-experiment-design on Claude Code, $measure-experiment-design on Codex):

/pm-skills:measure-experiment-design "Your context here"

Or reference the skill file directly: skills/measure-experiment-design/SKILL.md

Instructions

When asked to design an experiment, follow these steps:

Articulate the Hypothesis Write a clear, testable hypothesis in the format: “We believe [change] for [users] will [outcome] as measured by [metric].” One hypothesis per experiment - if you’re testing multiple things, run multiple experiments.
Define the Variants Describe the control (current experience) and treatment (new experience) in sufficient detail. Include screenshots, mockups, or precise descriptions so anyone can understand what users will see.
Choose Primary and Secondary Metrics Select one primary metric that will determine success or failure. Add 2-3 secondary metrics to understand the broader impact. Include guardrail metrics to catch unintended negative effects.
Calculate Sample Size Determine how many users you need per variant to detect your minimum detectable effect (MDE) with statistical significance. Specify your significance level (typically 0.05) and power (typically 0.80).
Estimate Duration Based on sample size and available traffic, calculate how long the experiment needs to run. Account for weekly patterns - avoid ending mid-week if behavior varies by day.
Define Targeting and Allocation Specify which users are eligible for the experiment and how traffic is split between variants. Document any exclusions (e.g., employees, specific segments).
Set Success Criteria Define upfront what constitutes a win, a loss, or an inconclusive result. This prevents post-hoc rationalization and moving goalposts.
Document Risks and Mitigations Identify what could go wrong and how you’ll detect/address it. Include monitoring plans and rollback criteria.

Output Format

Use the template in references/TEMPLATE.md to structure the output. A complete design fills every template section: Overview; Hypothesis; Background; Variants; Metrics; Sample Size & Duration; Audience Targeting; Success Criteria; Risks & Mitigations; Implementation Notes; and References.

Examples

See references/EXAMPLE.md for a completed example.

Output Template

Experiment Design: [Experiment Name]

Overview

Field	Value
Experiment Name	[Short descriptive name]
Owner	[Name, role]
Start Date	[Planned start date]
End Date	[Planned end date]
Status	Draft / Ready / Running / Completed

Hypothesis

We believe [proposed change]

for [target user segment]

will [expected outcome]

as measured by [primary metric]

Background

[Context explaining why this experiment matters and what led to the hypothesis]

Variants

Control (A)

Description: [What users currently experience]

Details:

[Specific element 1]
[Specific element 2]

Screenshot/Mockup: [Link or embed]

Treatment (B)

Description: [What users will experience in the new variant]

Details:

[Specific change 1]
[Specific change 2]

Screenshot/Mockup: [Link or embed]

Metrics

Primary Metric

Metric	Definition	Current Baseline	Minimum Detectable Effect
[Metric name]	[How it’s calculated]	[Current value]	[Smallest change we want to detect]

Secondary Metrics

Metric	Definition	Purpose
[Metric 1]	[Definition]	[Why we’re tracking this]
[Metric 2]	[Definition]	[Why we’re tracking this]

Guardrail Metrics

Metric	Definition	Threshold
[Metric 1]	[Definition]	[Must not decrease by more than X%]
[Metric 2]	[Definition]	[Must not decrease by more than X%]

Sample Size & Duration

Sample Size Calculation

Parameter	Value
Baseline conversion rate	[X%]
Minimum detectable effect (MDE)	[X% relative / X pp absolute]
Statistical significance (alpha)	[0.05 typical]
Statistical power (1-beta)	[0.80 typical]
Users per variant	[Calculated number]
Total users needed	[Sum across variants]

Duration Estimate

Parameter	Value
Daily eligible traffic	[X users/day]
Traffic allocation	[X% to experiment]
Users per day in experiment	[Calculated]
Minimum duration	[X days to reach sample size]
Recommended duration	[X days, accounting for weekly patterns]

Audience Targeting

Inclusion Criteria

[Criterion 1: e.g., “All logged-in users”]
[Criterion 2: e.g., “Users in US and Canada”]
[Criterion 3: e.g., “Users on mobile web”]

Exclusion Criteria

[Exclusion 1: e.g., “Employees and internal testers”]
[Exclusion 2: e.g., “Users in active experiments that may conflict”]

Traffic Allocation

Variant	Allocation
Control (A)	[X%]
Treatment (B)	[X%]

Success Criteria

Win (Ship Treatment)

[Conditions that constitute a clear win for the treatment, e.g., “Primary metric improves by >= MDE with p < 0.05, and no guardrail metrics regress beyond threshold”]

Loss (Keep Control)

[Conditions that indicate treatment is worse, e.g., “Primary metric decreases with p < 0.05, OR any guardrail metric regresses beyond threshold”]

Inconclusive (More Data Needed)

[Conditions that require further investigation, e.g., “Primary metric change is not statistically significant after full duration”]

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
[Risk 1]	High/Med/Low	High/Med/Low	[How we’ll address it]
[Risk 2]	High/Med/Low	High/Med/Low	[How we’ll address it]

Monitoring Plan

[What we’ll monitor during the experiment]
[Alert thresholds that would trigger early review]
[Rollback criteria]

Implementation Notes

[Feature flag name/ID]
[Instrumentation requirements]
[Any technical considerations]

References

[Link to related hypothesis document]
[Link to design mockups]
[Link to previous related experiments]

Example Output

Experiment Design: One-Page Checkout

Overview

Field	Value
Experiment Name	checkout-single-page-v1
Owner	Maria Santos, Product Manager
Start Date	January 20, 2026
End Date	February 3, 2026
Status	Ready

Hypothesis

We believe replacing our 3-step checkout with a single-page checkout

for mobile web users

will increase checkout completion rate

as measured by checkout conversion rate (orders / checkout starts)

Background

Mobile checkout abandonment is at 73%, compared to 45% on desktop. User research identified friction points: confusion navigating between steps, anxiety about hidden costs appearing late in the flow, and form field frustration on small screens. A single-page checkout addresses these by showing all information upfront and reducing navigation.

Competitor analysis shows that Amazon, Shopify stores, and Apple use single-page or minimal-step checkouts. Our hypothesis is that reducing cognitive load and providing visibility into the full process will improve conversion.

Variants

Control (A)

Description: Current 3-step checkout flow

Details:

Step 1: Shipping address entry
Step 2: Shipping method selection + payment information
Step 3: Order review and confirmation
Each step loads a new page
Progress indicator shows current step

Screenshot/Mockup: [designs/checkout-control-v3.png]

Treatment (B)

Description: Single-page checkout with accordion sections

Details:

All sections visible on one page (shipping, payment, review)
Accordion UI: one section expanded at a time, others collapsed but visible
Express payment buttons (Apple Pay, Google Pay) at top
Shipping cost shown immediately based on cart
Edit any section without losing data

Screenshot/Mockup: [designs/checkout-treatment-v2.png]

Metrics

Primary Metric

Metric	Definition	Current Baseline	Minimum Detectable Effect
Checkout conversion rate	Orders completed / Checkout sessions started	27%	5% relative (1.35 pp absolute)

Secondary Metrics

Metric	Definition	Purpose
Checkout completion time	Median seconds from checkout start to order	Confirm UX improvement
Express payment usage	% of orders via Apple Pay / Google Pay	Track new feature adoption
Cart abandonment rate	Carts abandoned before checkout start	Ensure we’re not shifting drop-off

Guardrail Metrics

Metric	Definition	Threshold
Average order value	Total revenue / Orders	Must not decrease by more than 2%
Payment failure rate	Failed payments / Payment attempts	Must not increase by more than 1pp
Support tickets (checkout)	CS tickets tagged “checkout”	Must not increase by more than 20%

Sample Size & Duration

Sample Size Calculation

Parameter	Value
Baseline conversion rate	27%
Minimum detectable effect (MDE)	5% relative (28.35% target)
Statistical significance (alpha)	0.05 (one-tailed)
Statistical power (1-beta)	0.80
Users per variant	9,800
Total users needed	19,600

Calculation performed using Evan Miller’s sample size calculator, assuming one-tailed test (we only ship if treatment is better).

Duration Estimate

Parameter	Value
Daily eligible traffic	18,000 mobile checkout sessions/day
Traffic allocation	80% to experiment (20% holdout for safety)
Users per day in experiment	14,400
Minimum duration	2 days to reach sample size
Recommended duration	14 days to capture weekly patterns and ensure stability

We’re running for 14 days despite reaching sample size in 2 days to:

Capture full weekly shopping patterns (weekday vs. weekend)
Allow time to detect delayed effects on returns/chargebacks
Ensure novelty effect wears off

Audience Targeting

Inclusion Criteria

Logged-in or guest users
Mobile web (not native apps - app experiment runs separately)
Users in US and Canada (payment methods configured)
Cart value >= $10 (exclude micro-purchases)

Exclusion Criteria

Employees (identified by email domain)
Users enrolled in conflicting experiments (cart-upsell-test, payment-method-test)
Users who have participated in checkout experiments in past 30 days

Traffic Allocation

Variant	Allocation
Control (A)	50%
Treatment (B)	50%

Success Criteria

Win (Ship Treatment)

Primary metric (checkout conversion) improves by >= 5% relative with p < 0.05, AND:

Average order value does not decrease by more than 2%
Payment failure rate does not increase by more than 1pp
No critical bugs or UX issues identified

Action: Roll out to 100% of mobile web, begin desktop adaptation.

Loss (Keep Control)

Any of:

Checkout conversion decreases with p < 0.05
Any guardrail metric breaches threshold
Critical UX issues discovered affecting > 1% of users

Action: Revert to control, analyze learnings, iterate on design.

Inconclusive (More Data Needed)

Primary metric change is between -5% and +5% relative and not statistically significant after 14 days.

Action: Extend experiment to 21 days if traffic allows. If still inconclusive, treat as neutral and decide based on qualitative factors (user feedback, operational simplicity).

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
Apple Pay integration issues	Medium	High	Test extensively in staging; 5% initial ramp with monitoring
Page load performance degradation	Low	Medium	Performance budget enforced; lazy-load payment forms
Users confused by accordion UX	Low	Medium	Added progress indicators and clear “Continue” buttons
Increased payment failures	Low	High	Same payment provider; monitor failure rate hourly

Monitoring Plan

Real-time dashboard: Checkout funnel by variant, updated every 5 minutes
Alert thresholds:
- Payment failure rate > 5% in any 1-hour window
- Checkout conversion drops > 20% relative to control
- Error rate > 1% on any checkout action
Rollback criteria: Any alert sustained for 30 minutes triggers automatic rollback
Daily check-ins: Team reviews metrics at 10am daily for first 5 days

Implementation Notes

Feature flag: checkout_single_page_v1 in LaunchDarkly
Instrumentation: New events checkout_section_opened, checkout_section_completed for funnel analysis
Cache invalidation: Clear checkout page cache at experiment start
Mobile detection: User-Agent parsing; tablet = desktop experience

References

Hypothesis Document: Mobile Checkout Improvement (internal doc link)
Design Mockups (Figma link)
Previous Experiment: Guest Checkout (results/guest-checkout-q3-2025.md) - 3% lift, informed this design
User Research: Checkout Friction Study (research PDF)

Real-World Examples

See this skill applied to three different product contexts:

Storevine (B2B): Storevine B2B ecommerce platform . Campaigns guided first-campaign flow A/B experiment design

Prompt:

measure-experiment-design

Project: Campaigns . Campaigns guided first-campaign flow
Experiment: Does the guided first-campaign flow increase first-send rate
            for non-adopter merchants?

Hypothesis (from Define phase doc):
- We believe pre-populated templates for non-adopter merchants (<250
  customers [fictional], no external email tool) will drive first-send
  rate from 12% [fictional] to ≥30% [fictional] within 60 days of GA

Variants:
- Control: Standard Campaigns creation flow (blank template editor,
  named segment library, no pre-population)
- Treatment: Guided first-campaign flow (product-seeded template,
  audience defaulted to "Customers who purchased in the last 90 days")

Sample: ~6,800 eligible non-adopter merchants [fictional]; need enough
  per variant to detect a 8 pp improvement with 80% power

Run period: April 28  -  June 27, 2026 (60 days from GA)

Need: full experiment design with sample size calculation, success
criteria, risks, and implementation notes.

Output:

Experiment Design: Guided First-Campaign Flow for Non-Adopter Merchants

Brainshelf (Consumer): Brainshelf consumer PKM app . Resurface A/B test experiment design

Prompt:

measure-experiment-design

resurface a/b test. feature is shipped behind a flag. need the full
experiment design for chloe to set up in amplitude.

hypothesis: daily digest → higher 7-day return rate.

design: intent-to-treat. treatment gets the opt-in prompt + digest.
control gets nothing (current experience). measure 7-day return rate
for both groups.

secondary metric: email CTR (treatment only . control doesn't get
email). guardrail: unsub rate ≤2%/week.

sample: 400 per variant from the 9,800 eligible users [fictional].
duration: 4 weeks (mar 9 - apr 5). 50/50 split on enrollment cohort.

want to have the design doc locked before the setup week (mar 2-8).

Output:

Experiment Design: Resurface Daily Digest A/B Test

Workbench (Enterprise): "Workbench enterprise collaboration platform: Blueprints required vs. optional sections A/B test"

Prompt:

measure-experiment-design

Experiment: Required vs. optional Blueprint sections
Product: Workbench Blueprints (enterprise doc templates with approval gates)
Stage: Closed beta shipped; need to A/B test before expanding to full 500-customer base [fictional]

Context:
- Blueprints allows admins to create doc templates with sections
- Currently all sections are optional -- authors can submit incomplete Blueprints for approval
- Data: 38% of Blueprints reach approval with ≥1 empty section [fictional]; most rejections are for missing content, not quality
- Hypothesis from Define phase: making sections required (must complete before submitting) reduces time to first approval
- Baseline: median time to first approved Blueprint = 4.0 days [fictional]
- Goal: reduce to ≤2.5 days [fictional]

Treatment: Required sections -- authoring UI blocks submission if any required section is empty. Show inline validation message, highlight empty sections.
Control: Current optional sections -- authors can submit with empty sections as today.

Primary metric: median time-to-first-approval (days)
Secondary: approval rejection rate, Blueprint completion rate
Guardrail: don't tank author-side NPS or increase abandonment

Audience: Project leads at enterprise customers in closed beta (excludes IT admins and approvers).
Stakeholders: Head of Product, Data Science, Engineering Lead (Blueprints)

Output:

Experiment Design: Blueprints Required vs. Optional Sections

Quality Checklist

Before finalizing, verify:

Hypothesis is falsifiable and specific
Only one primary metric is defined
Sample size calculation is documented with assumptions
Duration accounts for traffic patterns and statistical requirements
Success criteria are defined before the experiment starts
Guardrail metrics protect against unintended harm