Experiment Design
Quick facts
Phase: Measure | Version: 2.0.0 | Category: validation | License: Apache-2.0
Try it: /experiment-design "Your context here"
An experiment design document defines all parameters needed to run a rigorous A/B test or controlled experiment. It ensures the team aligns on what you’re testing, how you’ll measure success, and how long to run the test before drawing conclusions. Good experiment design prevents common pitfalls: underpowered tests, unclear success criteria, and decisions based on noise rather than signal.
When to Use
- Before launching an A/B test to validate a product change
- When testing a hypothesis that requires quantitative validation
- After solution design to validate assumptions before full rollout
- When stakeholders want data-driven evidence for a decision
- To establish a culture of experimentation and learning
How to Use
Use the /experiment-design slash command:
/experiment-design "Your context here"Or reference the skill file directly: skills/measure-experiment-design/SKILL.md
Instructions
When asked to design an experiment, follow these steps:
-
Articulate the Hypothesis Write a clear, testable hypothesis in the format: “We believe [change] for [users] will [outcome] as measured by [metric].” One hypothesis per experiment . if you’re testing multiple things, run multiple experiments.
-
Define the Variants Describe the control (current experience) and treatment (new experience) in sufficient detail. Include screenshots, mockups, or precise descriptions so anyone can understand what users will see.
-
Choose Primary and Secondary Metrics Select one primary metric that will determine success or failure. Add 2-3 secondary metrics to understand the broader impact. Include guardrail metrics to catch unintended negative effects.
-
Calculate Sample Size Determine how many users you need per variant to detect your minimum detectable effect (MDE) with statistical significance. Specify your significance level (typically 0.05) and power (typically 0.80).
-
Estimate Duration Based on sample size and available traffic, calculate how long the experiment needs to run. Account for weekly patterns . avoid ending mid-week if behavior varies by day.
-
Define Targeting and Allocation Specify which users are eligible for the experiment and how traffic is split between variants. Document any exclusions (e.g., employees, specific segments).
-
Set Success Criteria Define upfront what constitutes a win, a loss, or an inconclusive result. This prevents post-hoc rationalization and moving goalposts.
-
Document Risks and Mitigations Identify what could go wrong and how you’ll detect/address it. Include monitoring plans and rollback criteria.
Output Template
Experiment Design: [Experiment Name]
Overview
| Field | Value |
|---|---|
| Experiment Name | [Short descriptive name] |
| Owner | [Name, role] |
| Start Date | [Planned start date] |
| End Date | [Planned end date] |
| Status | Draft / Ready / Running / Completed |
Hypothesis
We believe [proposed change]
for [target user segment]
will [expected outcome]
as measured by [primary metric]
Background
[Context explaining why this experiment matters and what led to the hypothesis]
Variants
Control (A)
Description: [What users currently experience]
Details:
- [Specific element 1]
- [Specific element 2]
Screenshot/Mockup: [Link or embed]
Treatment (B)
Description: [What users will experience in the new variant]
Details:
- [Specific change 1]
- [Specific change 2]
Screenshot/Mockup: [Link or embed]
Metrics
Primary Metric
| Metric | Definition | Current Baseline | Minimum Detectable Effect |
|---|---|---|---|
| [Metric name] | [How it’s calculated] | [Current value] | [Smallest change we want to detect] |
Secondary Metrics
| Metric | Definition | Purpose |
|---|---|---|
| [Metric 1] | [Definition] | [Why we’re tracking this] |
| [Metric 2] | [Definition] | [Why we’re tracking this] |
Guardrail Metrics
| Metric | Definition | Threshold |
|---|---|---|
| [Metric 1] | [Definition] | [Must not decrease by more than X%] |
| [Metric 2] | [Definition] | [Must not decrease by more than X%] |
Sample Size & Duration
Sample Size Calculation
| Parameter | Value |
|---|---|
| Baseline conversion rate | [X%] |
| Minimum detectable effect (MDE) | [X% relative / X pp absolute] |
| Statistical significance (alpha) | [0.05 typical] |
| Statistical power (1-beta) | [0.80 typical] |
| Users per variant | [Calculated number] |
| Total users needed | [Sum across variants] |
Duration Estimate
| Parameter | Value |
|---|---|
| Daily eligible traffic | [X users/day] |
| Traffic allocation | [X% to experiment] |
| Users per day in experiment | [Calculated] |
| Minimum duration | [X days to reach sample size] |
| Recommended duration | [X days, accounting for weekly patterns] |
Audience Targeting
Inclusion Criteria
- [Criterion 1: e.g., “All logged-in users”]
- [Criterion 2: e.g., “Users in US and Canada”]
- [Criterion 3: e.g., “Users on mobile web”]
Exclusion Criteria
- [Exclusion 1: e.g., “Employees and internal testers”]
- [Exclusion 2: e.g., “Users in active experiments that may conflict”]
Traffic Allocation
| Variant | Allocation |
|---|---|
| Control (A) | [X%] |
| Treatment (B) | [X%] |
Success Criteria
Win (Ship Treatment)
[Conditions that constitute a clear win for the treatment, e.g., “Primary metric improves by >= MDE with p < 0.05, and no guardrail metrics regress beyond threshold”]
Loss (Keep Control)
[Conditions that indicate treatment is worse, e.g., “Primary metric decreases with p < 0.05, OR any guardrail metric regresses beyond threshold”]
Inconclusive (More Data Needed)
[Conditions that require further investigation, e.g., “Primary metric change is not statistically significant after full duration”]
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [Risk 1] | High/Med/Low | High/Med/Low | [How we’ll address it] |
| [Risk 2] | High/Med/Low | High/Med/Low | [How we’ll address it] |
Monitoring Plan
- [What we’ll monitor during the experiment]
- [Alert thresholds that would trigger early review]
- [Rollback criteria]
Implementation Notes
- [Feature flag name/ID]
- [Instrumentation requirements]
- [Any technical considerations]
References
- [Link to related hypothesis document]
- [Link to design mockups]
- [Link to previous related experiments]
Example Output
Experiment Design: One-Page Checkout
Experiment Design: One-Page Checkout
Overview
| Field | Value |
|---|---|
| Experiment Name | checkout-single-page-v1 |
| Owner | Maria Santos, Product Manager |
| Start Date | January 20, 2026 |
| End Date | February 3, 2026 |
| Status | Ready |
Hypothesis
We believe replacing our 3-step checkout with a single-page checkout
for mobile web users
will increase checkout completion rate
as measured by checkout conversion rate (orders / checkout starts)
Background
Mobile checkout abandonment is at 73%, compared to 45% on desktop. User research identified friction points: confusion navigating between steps, anxiety about hidden costs appearing late in the flow, and form field frustration on small screens. A single-page checkout addresses these by showing all information upfront and reducing navigation.
Competitor analysis shows that Amazon, Shopify stores, and Apple use single-page or minimal-step checkouts. Our hypothesis is that reducing cognitive load and providing visibility into the full process will improve conversion.
Variants
Control (A)
Description: Current 3-step checkout flow
Details:
- Step 1: Shipping address entry
- Step 2: Shipping method selection + payment information
- Step 3: Order review and confirmation
- Each step loads a new page
- Progress indicator shows current step
Screenshot/Mockup: [designs/checkout-control-v3.png]
Treatment (B)
Description: Single-page checkout with accordion sections
Details:
- All sections visible on one page (shipping, payment, review)
- Accordion UI: one section expanded at a time, others collapsed but visible
- Express payment buttons (Apple Pay, Google Pay) at top
- Shipping cost shown immediately based on cart
- Edit any section without losing data
Screenshot/Mockup: [designs/checkout-treatment-v2.png]
Metrics
Primary Metric
| Metric | Definition | Current Baseline | Minimum Detectable Effect |
|---|---|---|---|
| Checkout conversion rate | Orders completed / Checkout sessions started | 27% | 5% relative (1.35 pp absolute) |
Secondary Metrics
| Metric | Definition | Purpose |
|---|---|---|
| Checkout completion time | Median seconds from checkout start to order | Confirm UX improvement |
| Express payment usage | % of orders via Apple Pay / Google Pay | Track new feature adoption |
| Cart abandonment rate | Carts abandoned before checkout start | Ensure we’re not shifting drop-off |
Guardrail Metrics
| Metric | Definition | Threshold |
|---|---|---|
| Average order value | Total revenue / Orders | Must not decrease by more than 2% |
| Payment failure rate | Failed payments / Payment attempts | Must not increase by more than 1pp |
| Support tickets (checkout) | CS tickets tagged “checkout” | Must not increase by more than 20% |
Sample Size & Duration
Sample Size Calculation
| Parameter | Value |
|---|---|
| Baseline conversion rate | 27% |
| Minimum detectable effect (MDE) | 5% relative (28.35% target) |
| Statistical significance (alpha) | 0.05 (one-tailed) |
| Statistical power (1-beta) | 0.80 |
| Users per variant | 9,800 |
| Total users needed | 19,600 |
Calculation performed using Evan Miller’s sample size calculator, assuming one-tailed test (we only ship if treatment is better).
Duration Estimate
| Parameter | Value |
|---|---|
| Daily eligible traffic | 18,000 mobile checkout sessions/day |
| Traffic allocation | 80% to experiment (20% holdout for safety) |
| Users per day in experiment | 14,400 |
| Minimum duration | 2 days to reach sample size |
| Recommended duration | 14 days to capture weekly patterns and ensure stability |
We’re running for 14 days despite reaching sample size in 2 days to:
- Capture full weekly shopping patterns (weekday vs. weekend)
- Allow time to detect delayed effects on returns/chargebacks
- Ensure novelty effect wears off
Audience Targeting
Inclusion Criteria
- Logged-in or guest users
- Mobile web (not native apps . app experiment runs separately)
- Users in US and Canada (payment methods configured)
- Cart value >= $10 (exclude micro-purchases)
Exclusion Criteria
- Employees (identified by email domain)
- Users enrolled in conflicting experiments (cart-upsell-test, payment-method-test)
- Users who have participated in checkout experiments in past 30 days
Traffic Allocation
| Variant | Allocation |
|---|---|
| Control (A) | 50% |
| Treatment (B) | 50% |
Success Criteria
Win (Ship Treatment)
Primary metric (checkout conversion) improves by >= 5% relative with p < 0.05, AND:
- Average order value does not decrease by more than 2%
- Payment failure rate does not increase by more than 1pp
- No critical bugs or UX issues identified
Action: Roll out to 100% of mobile web, begin desktop adaptation.
Loss (Keep Control)
Any of:
- Checkout conversion decreases with p < 0.05
- Any guardrail metric breaches threshold
- Critical UX issues discovered affecting > 1% of users
Action: Revert to control, analyze learnings, iterate on design.
Inconclusive (More Data Needed)
Primary metric change is between -5% and +5% relative and not statistically significant after 14 days.
Action: Extend experiment to 21 days if traffic allows. If still inconclusive, treat as neutral and decide based on qualitative factors (user feedback, operational simplicity).
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Apple Pay integration issues | Medium | High | Test extensively in staging; 5% initial ramp with monitoring |
| Page load performance degradation | Low | Medium | Performance budget enforced; lazy-load payment forms |
| Users confused by accordion UX | Low | Medium | Added progress indicators and clear “Continue” buttons |
| Increased payment failures | Low | High | Same payment provider; monitor failure rate hourly |
Monitoring Plan
- Real-time dashboard: Checkout funnel by variant, updated every 5 minutes
- Alert thresholds:
- Payment failure rate > 5% in any 1-hour window
- Checkout conversion drops > 20% relative to control
- Error rate > 1% on any checkout action
- Rollback criteria: Any alert sustained for 30 minutes triggers automatic rollback
- Daily check-ins: Team reviews metrics at 10am daily for first 5 days
Implementation Notes
- Feature flag:
checkout_single_page_v1in LaunchDarkly - Instrumentation: New events
checkout_section_opened,checkout_section_completedfor funnel analysis - Cache invalidation: Clear checkout page cache at experiment start
- Mobile detection: User-Agent parsing; tablet = desktop experience
References
- Hypothesis Document: Mobile Checkout Improvement (internal doc link)
- Design Mockups (Figma link)
- Previous Experiment: Guest Checkout (results/guest-checkout-q3-2025.md) . 3% lift, informed this design
- User Research: Checkout Friction Study (research PDF)
Real-World Examples
See this skill applied to three different product contexts:
Storevine (B2B): Storevine B2B ecommerce platform . Campaigns guided first-campaign flow A/B experiment design
Prompt:
/experiment-design
Project: Campaigns . Campaigns guided first-campaign flowExperiment: Does the guided first-campaign flow increase first-send rate for non-adopter merchants?
Hypothesis (from Define phase doc):- We believe pre-populated templates for non-adopter merchants (<250 customers [fictional], no external email tool) will drive first-send rate from 12% [fictional] to ≥30% [fictional] within 60 days of GA
Variants:- Control: Standard Campaigns creation flow (blank template editor, named segment library, no pre-population)- Treatment: Guided first-campaign flow (product-seeded template, audience defaulted to "Customers who purchased in the last 90 days")
Sample: ~6,800 eligible non-adopter merchants [fictional]; need enough per variant to detect a 8 pp improvement with 80% power
Run period: April 28 - June 27, 2026 (60 days from GA)
Need: full experiment design with sample size calculation, successcriteria, risks, and implementation notes.Output:
Experiment Design: Guided First-Campaign Flow for Non-Adopter Merchants
Brainshelf (Consumer): Brainshelf consumer PKM app . Resurface A/B test experiment design
Prompt:
/experiment-design
resurface a/b test. feature is shipped behind a flag. need the fullexperiment design for chloe to set up in amplitude.
hypothesis: daily digest → higher 7-day return rate.
design: intent-to-treat. treatment gets the opt-in prompt + digest.control gets nothing (current experience). measure 7-day return ratefor both groups.
secondary metric: email CTR (treatment only . control doesn't getemail). guardrail: unsub rate ≤2%/week.
sample: 400 per variant from the 9,800 eligible users [fictional].duration: 4 weeks (mar 9 - apr 5). 50/50 split on enrollment cohort.
want to have the design doc locked before the setup week (mar 2-8).Output:
Experiment Design: Resurface Daily Digest A/B Test
Workbench (Enterprise): "Workbench enterprise collaboration platform: Blueprints required vs. optional sections A/B test"
Prompt:
/experiment-design
Experiment: Required vs. optional Blueprint sectionsProduct: Workbench Blueprints (enterprise doc templates with approval gates)Stage: Closed beta shipped; need to A/B test before expanding to full 500-customer base [fictional]
Context:- Blueprints allows admins to create doc templates with sections- Currently all sections are optional -- authors can submit incomplete Blueprints for approval- Data: 38% of Blueprints reach approval with ≥1 empty section [fictional]; most rejections are for missing content, not quality- Hypothesis from Define phase: making sections required (must complete before submitting) reduces time to first approval- Baseline: median time to first approved Blueprint = 4.0 days [fictional]- Goal: reduce to ≤2.5 days [fictional]
Treatment: Required sections -- authoring UI blocks submission if any required section is empty. Show inline validation message, highlight empty sections.Control: Current optional sections -- authors can submit with empty sections as today.
Primary metric: median time-to-first-approval (days)Secondary: approval rejection rate, Blueprint completion rateGuardrail: don't tank author-side NPS or increase abandonment
Audience: Project leads at enterprise customers in closed beta (excludes IT admins and approvers).Stakeholders: Head of Product, Data Science, Engineering Lead (Blueprints)Output:
Experiment Design: Blueprints Required vs. Optional Sections
Quality Checklist
Before finalizing, verify:
- Hypothesis is falsifiable and specific
- Only one primary metric is defined
- Sample size calculation is documented with assumptions
- Duration accounts for traffic patterns and statistical requirements
- Success criteria are defined before the experiment starts
- Guardrail metrics protect against unintended harm
Output Format
Use the template in references/TEMPLATE.md to structure the output.