¶
Quick facts
Phase: Measure | Version: 2.0.0 | Category: validation | License: Apache-2.0
Experiment Design¶
An experiment design document defines all parameters needed to run a rigorous A/B test or controlled experiment. It ensures the team aligns on what you're testing, how you'll measure success, and how long to run the test before drawing conclusions. Good experiment design prevents common pitfalls: underpowered tests, unclear success criteria, and decisions based on noise rather than signal.
When to Use¶
- Before launching an A/B test to validate a product change
- When testing a hypothesis that requires quantitative validation
- After solution design to validate assumptions before full rollout
- When stakeholders want data-driven evidence for a decision
- To establish a culture of experimentation and learning
How to Use¶
Use the /experiment-design slash command:
Or reference the skill file directly: skills/measure-experiment-design/SKILL.md
Instructions¶
When asked to design an experiment, follow these steps:
-
Articulate the Hypothesis Write a clear, testable hypothesis in the format: "We believe [change] for [users] will [outcome] as measured by [metric]." One hypothesis per experiment — if you're testing multiple things, run multiple experiments.
-
Define the Variants Describe the control (current experience) and treatment (new experience) in sufficient detail. Include screenshots, mockups, or precise descriptions so anyone can understand what users will see.
-
Choose Primary and Secondary Metrics Select one primary metric that will determine success or failure. Add 2-3 secondary metrics to understand the broader impact. Include guardrail metrics to catch unintended negative effects.
-
Calculate Sample Size Determine how many users you need per variant to detect your minimum detectable effect (MDE) with statistical significance. Specify your significance level (typically 0.05) and power (typically 0.80).
-
Estimate Duration Based on sample size and available traffic, calculate how long the experiment needs to run. Account for weekly patterns — avoid ending mid-week if behavior varies by day.
-
Define Targeting and Allocation Specify which users are eligible for the experiment and how traffic is split between variants. Document any exclusions (e.g., employees, specific segments).
-
Set Success Criteria Define upfront what constitutes a win, a loss, or an inconclusive result. This prevents post-hoc rationalization and moving goalposts.
-
Document Risks and Mitigations Identify what could go wrong and how you'll detect/address it. Include monitoring plans and rollback criteria.
Output Template¶
Experiment Design: [Experiment Name]¶
Overview¶
| Field | Value |
|---|---|
| Experiment Name | [Short descriptive name] |
| Owner | [Name, role] |
| Start Date | [Planned start date] |
| End Date | [Planned end date] |
| Status | Draft / Ready / Running / Completed |
Hypothesis¶
We believe [proposed change]
for [target user segment]
will [expected outcome]
as measured by [primary metric]
Background¶
[Context explaining why this experiment matters and what led to the hypothesis]
Variants¶
Control (A)¶
Description: [What users currently experience]
Details: - [Specific element 1] - [Specific element 2]
Screenshot/Mockup: [Link or embed]
Treatment (B)¶
Description: [What users will experience in the new variant]
Details: - [Specific change 1] - [Specific change 2]
Screenshot/Mockup: [Link or embed]
Metrics¶
Primary Metric¶
| Metric | Definition | Current Baseline | Minimum Detectable Effect |
|---|---|---|---|
| [Metric name] | [How it's calculated] | [Current value] | [Smallest change we want to detect] |
Secondary Metrics¶
| Metric | Definition | Purpose |
|---|---|---|
| [Metric 1] | [Definition] | [Why we're tracking this] |
| [Metric 2] | [Definition] | [Why we're tracking this] |
Guardrail Metrics¶
| Metric | Definition | Threshold |
|---|---|---|
| [Metric 1] | [Definition] | [Must not decrease by more than X%] |
| [Metric 2] | [Definition] | [Must not decrease by more than X%] |
Sample Size & Duration¶
Sample Size Calculation¶
| Parameter | Value |
|---|---|
| Baseline conversion rate | [X%] |
| Minimum detectable effect (MDE) | [X% relative / X pp absolute] |
| Statistical significance (alpha) | [0.05 typical] |
| Statistical power (1-beta) | [0.80 typical] |
| Users per variant | [Calculated number] |
| Total users needed | [Sum across variants] |
Duration Estimate¶
| Parameter | Value |
|---|---|
| Daily eligible traffic | [X users/day] |
| Traffic allocation | [X% to experiment] |
| Users per day in experiment | [Calculated] |
| Minimum duration | [X days to reach sample size] |
| Recommended duration | [X days, accounting for weekly patterns] |
Audience Targeting¶
Inclusion Criteria¶
- [Criterion 1: e.g., "All logged-in users"]
- [Criterion 2: e.g., "Users in US and Canada"]
- [Criterion 3: e.g., "Users on mobile web"]
Exclusion Criteria¶
- [Exclusion 1: e.g., "Employees and internal testers"]
- [Exclusion 2: e.g., "Users in active experiments that may conflict"]
Traffic Allocation¶
| Variant | Allocation |
|---|---|
| Control (A) | [X%] |
| Treatment (B) | [X%] |
Success Criteria¶
Win (Ship Treatment)¶
[Conditions that constitute a clear win for the treatment, e.g., "Primary metric improves by >= MDE with p < 0.05, and no guardrail metrics regress beyond threshold"]
Loss (Keep Control)¶
[Conditions that indicate treatment is worse, e.g., "Primary metric decreases with p < 0.05, OR any guardrail metric regresses beyond threshold"]
Inconclusive (More Data Needed)¶
[Conditions that require further investigation, e.g., "Primary metric change is not statistically significant after full duration"]
Risks & Mitigations¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [Risk 1] | High/Med/Low | High/Med/Low | [How we'll address it] |
| [Risk 2] | High/Med/Low | High/Med/Low | [How we'll address it] |
Monitoring Plan¶
- [What we'll monitor during the experiment]
- [Alert thresholds that would trigger early review]
- [Rollback criteria]
Implementation Notes¶
- [Feature flag name/ID]
- [Instrumentation requirements]
- [Any technical considerations]
References¶
- [Link to related hypothesis document]
- [Link to design mockups]
- [Link to previous related experiments]
Example Output¶
Experiment Design: One-Page Checkout
Experiment Design: One-Page Checkout¶
Overview¶
| Field | Value |
|---|---|
| Experiment Name | checkout-single-page-v1 |
| Owner | Maria Santos, Product Manager |
| Start Date | January 20, 2026 |
| End Date | February 3, 2026 |
| Status | Ready |
Hypothesis¶
We believe replacing our 3-step checkout with a single-page checkout
for mobile web users
will increase checkout completion rate
as measured by checkout conversion rate (orders / checkout starts)
Background¶
Mobile checkout abandonment is at 73%, compared to 45% on desktop. User research identified friction points: confusion navigating between steps, anxiety about hidden costs appearing late in the flow, and form field frustration on small screens. A single-page checkout addresses these by showing all information upfront and reducing navigation.
Competitor analysis shows that Amazon, Shopify stores, and Apple use single-page or minimal-step checkouts. Our hypothesis is that reducing cognitive load and providing visibility into the full process will improve conversion.
Variants¶
Control (A)¶
Description: Current 3-step checkout flow
Details: - Step 1: Shipping address entry - Step 2: Shipping method selection + payment information - Step 3: Order review and confirmation - Each step loads a new page - Progress indicator shows current step
Screenshot/Mockup: [designs/checkout-control-v3.png]
Treatment (B)¶
Description: Single-page checkout with accordion sections
Details: - All sections visible on one page (shipping, payment, review) - Accordion UI: one section expanded at a time, others collapsed but visible - Express payment buttons (Apple Pay, Google Pay) at top - Shipping cost shown immediately based on cart - Edit any section without losing data
Screenshot/Mockup: [designs/checkout-treatment-v2.png]
Metrics¶
Primary Metric¶
| Metric | Definition | Current Baseline | Minimum Detectable Effect |
|---|---|---|---|
| Checkout conversion rate | Orders completed / Checkout sessions started | 27% | 5% relative (1.35 pp absolute) |
Secondary Metrics¶
| Metric | Definition | Purpose |
|---|---|---|
| Checkout completion time | Median seconds from checkout start to order | Confirm UX improvement |
| Express payment usage | % of orders via Apple Pay / Google Pay | Track new feature adoption |
| Cart abandonment rate | Carts abandoned before checkout start | Ensure we're not shifting drop-off |
Guardrail Metrics¶
| Metric | Definition | Threshold |
|---|---|---|
| Average order value | Total revenue / Orders | Must not decrease by more than 2% |
| Payment failure rate | Failed payments / Payment attempts | Must not increase by more than 1pp |
| Support tickets (checkout) | CS tickets tagged "checkout" | Must not increase by more than 20% |
Sample Size & Duration¶
Sample Size Calculation¶
| Parameter | Value |
|---|---|
| Baseline conversion rate | 27% |
| Minimum detectable effect (MDE) | 5% relative (28.35% target) |
| Statistical significance (alpha) | 0.05 (one-tailed) |
| Statistical power (1-beta) | 0.80 |
| Users per variant | 9,800 |
| Total users needed | 19,600 |
Calculation performed using Evan Miller's sample size calculator, assuming one-tailed test (we only ship if treatment is better).
Duration Estimate¶
| Parameter | Value |
|---|---|
| Daily eligible traffic | 18,000 mobile checkout sessions/day |
| Traffic allocation | 80% to experiment (20% holdout for safety) |
| Users per day in experiment | 14,400 |
| Minimum duration | 2 days to reach sample size |
| Recommended duration | 14 days to capture weekly patterns and ensure stability |
We're running for 14 days despite reaching sample size in 2 days to: - Capture full weekly shopping patterns (weekday vs. weekend) - Allow time to detect delayed effects on returns/chargebacks - Ensure novelty effect wears off
Audience Targeting¶
Inclusion Criteria¶
- Logged-in or guest users
- Mobile web (not native apps — app experiment runs separately)
- Users in US and Canada (payment methods configured)
- Cart value >= $10 (exclude micro-purchases)
Exclusion Criteria¶
- Employees (identified by email domain)
- Users enrolled in conflicting experiments (cart-upsell-test, payment-method-test)
- Users who have participated in checkout experiments in past 30 days
Traffic Allocation¶
| Variant | Allocation |
|---|---|
| Control (A) | 50% |
| Treatment (B) | 50% |
Success Criteria¶
Win (Ship Treatment)¶
Primary metric (checkout conversion) improves by >= 5% relative with p < 0.05, AND: - Average order value does not decrease by more than 2% - Payment failure rate does not increase by more than 1pp - No critical bugs or UX issues identified
Action: Roll out to 100% of mobile web, begin desktop adaptation.
Loss (Keep Control)¶
Any of: - Checkout conversion decreases with p < 0.05 - Any guardrail metric breaches threshold - Critical UX issues discovered affecting > 1% of users
Action: Revert to control, analyze learnings, iterate on design.
Inconclusive (More Data Needed)¶
Primary metric change is between -5% and +5% relative and not statistically significant after 14 days.
Action: Extend experiment to 21 days if traffic allows. If still inconclusive, treat as neutral and decide based on qualitative factors (user feedback, operational simplicity).
Risks & Mitigations¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Apple Pay integration issues | Medium | High | Test extensively in staging; 5% initial ramp with monitoring |
| Page load performance degradation | Low | Medium | Performance budget enforced; lazy-load payment forms |
| Users confused by accordion UX | Low | Medium | Added progress indicators and clear "Continue" buttons |
| Increased payment failures | Low | High | Same payment provider; monitor failure rate hourly |
Monitoring Plan¶
- Real-time dashboard: Checkout funnel by variant, updated every 5 minutes
- Alert thresholds:
- Payment failure rate > 5% in any 1-hour window
- Checkout conversion drops > 20% relative to control
- Error rate > 1% on any checkout action
- Rollback criteria: Any alert sustained for 30 minutes triggers automatic rollback
- Daily check-ins: Team reviews metrics at 10am daily for first 5 days
Implementation Notes¶
- Feature flag:
checkout_single_page_v1in LaunchDarkly - Instrumentation: New events
checkout_section_opened,checkout_section_completedfor funnel analysis - Cache invalidation: Clear checkout page cache at experiment start
- Mobile detection: User-Agent parsing; tablet = desktop experience
References¶
- Hypothesis Document: Mobile Checkout Improvement (internal doc link)
- Design Mockups (Figma link)
- Previous Experiment: Guest Checkout (results/guest-checkout-q3-2025.md) — 3% lift, informed this design
- User Research: Checkout Friction Study (research PDF)
Real-World Examples¶
See this skill applied to three different product contexts:
Storevine (B2B): Storevine B2B ecommerce platform — Campaigns guided first-campaign flow A/B experiment design
Prompt:
/experiment-design
Project: Campaigns — Campaigns guided first-campaign flow
Experiment: Does the guided first-campaign flow increase first-send rate
for non-adopter merchants?
Hypothesis (from Define phase doc):
- We believe pre-populated templates for non-adopter merchants (<250
customers [fictional], no external email tool) will drive first-send
rate from 12% [fictional] to ≥30% [fictional] within 60 days of GA
Variants:
- Control: Standard Campaigns creation flow (blank template editor,
named segment library, no pre-population)
- Treatment: Guided first-campaign flow (product-seeded template,
audience defaulted to "Customers who purchased in the last 90 days")
Sample: ~6,800 eligible non-adopter merchants [fictional]; need enough
per variant to detect a 8 pp improvement with 80% power
Run period: April 28 – June 27, 2026 (60 days from GA)
Need: full experiment design with sample size calculation, success
criteria, risks, and implementation notes.
Output:
Experiment Design: Guided First-Campaign Flow for Non-Adopter Merchants¶
Brainshelf (Consumer): Brainshelf consumer PKM app — Resurface A/B test experiment design
Prompt:
/experiment-design
resurface a/b test. feature is shipped behind a flag. need the full
experiment design for chloe to set up in amplitude.
hypothesis: daily digest → higher 7-day return rate.
design: intent-to-treat. treatment gets the opt-in prompt + digest.
control gets nothing (current experience). measure 7-day return rate
for both groups.
secondary metric: email CTR (treatment only — control doesn't get
email). guardrail: unsub rate ≤2%/week.
sample: 400 per variant from the 9,800 eligible users [fictional].
duration: 4 weeks (mar 9 - apr 5). 50/50 split on enrollment cohort.
want to have the design doc locked before the setup week (mar 2-8).
Output:
Experiment Design: Resurface Daily Digest A/B Test¶
Workbench (Enterprise): Workbench enterprise collaboration platform: Blueprints required vs. optional sections A/B test
Prompt:
/experiment-design
Experiment: Required vs. optional Blueprint sections
Product: Workbench Blueprints (enterprise doc templates with approval gates)
Stage: Closed beta shipped; need to A/B test before expanding to full 500-customer base [fictional]
Context:
- Blueprints allows admins to create doc templates with sections
- Currently all sections are optional -- authors can submit incomplete Blueprints for approval
- Data: 38% of Blueprints reach approval with ≥1 empty section [fictional]; most rejections are for missing content, not quality
- Hypothesis from Define phase: making sections required (must complete before submitting) reduces time to first approval
- Baseline: median time to first approved Blueprint = 4.0 days [fictional]
- Goal: reduce to ≤2.5 days [fictional]
Treatment: Required sections -- authoring UI blocks submission if any required section is empty. Show inline validation message, highlight empty sections.
Control: Current optional sections -- authors can submit with empty sections as today.
Primary metric: median time-to-first-approval (days)
Secondary: approval rejection rate, Blueprint completion rate
Guardrail: don't tank author-side NPS or increase abandonment
Audience: Project leads at enterprise customers in closed beta (excludes IT admins and approvers).
Stakeholders: Head of Product, Data Science, Engineering Lead (Blueprints)
Output:
Experiment Design: Blueprints Required vs. Optional Sections¶
Quality Checklist¶
Before finalizing, verify:
- Hypothesis is falsifiable and specific
- Only one primary metric is defined
- Sample size calculation is documented with assumptions
- Duration accounts for traffic patterns and statistical requirements
- Success criteria are defined before the experiment starts
- Guardrail metrics protect against unintended harm
Output Format¶
Use the template in references/TEMPLATE.md to structure the output.