Free tool
Plug in your baseline conversion rate, smallest detectable lift, and confidence level. See exactly how many visitors per variant you need and how long the test will run. Two-tailed proportions test, marketing-standard defaults.
How it works
Plug in baseline, pick the smallest lift you care about, see if the test is feasible at your traffic.
baseline = 5% · MDE = 10% (relative) · confidence = 95% · power = 80%
Baseline rate from your data. Smallest lift that would actually change your roadmap. Standard 95/80 confidence and power.
The calculator runs a two-tailed proportions test. Sample size grows with smaller MDE, lower baseline, higher confidence and power.
Test plan, ready
If the test takes 60+ days, bump MDE or pick a higher-traffic page. Stopping early to "peek" invalidates the math.
Inputs explained
Six inputs. The first two drive sample size more than anything else. Power and confidence are usually left at defaults.
Baseline conversion rate
The conversion rate of the control group, taken from your historical data. Use the same period and audience as the planned test, otherwise sample size estimates lie.
Minimum detectable effect (MDE)
The smallest improvement worth detecting. Smaller MDE means a much larger sample size. A 10% relative MDE on a 5% baseline sizes the test to detect a move from 5.0% to 5.5%.
Statistical power
The probability of detecting a real effect when one exists. 80% is the marketing standard. Higher power needs more sample. 90% is closer to the scientific norm.
Significance level
The false-positive tolerance. 95% confidence (5% alpha) is standard. 99% is much stricter and increases the required sample by roughly half. Lower confidence is rarely worth the speed.
Why two-tailed not one-tailed
Real tests can move in either direction. One-tailed tests cut sample size by roughly 20 percent but assume you know which direction matters, which is rarely true. Two-tailed is the honest default.
Daily traffic
Visitors arriving per day at the page being tested, split across both variants. Used to compute days-to-significance. Optional, but helps you decide whether the test is feasible.
Best practices
Calculate sample size before launching
Running a test "until it looks significant" is peeking. The math only works if sample size is fixed in advance. Skipping this step invalidates everything downstream.
Pick MDE based on what you'd act on
If a 5% lift wouldn't change your decision, do not size for 5%. Most marketing teams overstate how much they care about small lifts and undersize their tests as a result.
Run for full weekly cycles
Conversion rates vary by day of week. A 4-day test can land on a weekend slump. Always round duration up to the next full week.
Don't peek at p-values daily
Frequentist tests assume a single look at the data at the predetermined sample size. Daily peeking inflates false-positive rates badly. Either commit to one final read or use Bayesian methods.
Track downstream conversions, not just micro-conversions
A "click rate" winner is not necessarily a revenue winner. Tie test results back to closed-won pipeline or net revenue, not just the metric you optimized.
Built by the team behind SourceLoop
Guide
A/B tests work by detecting a difference between two variants that's larger than what random noise alone would produce. Without enough sample, real effects look like noise and noise looks like real effects. The sample size formula tells you how many visitors per variant you need before the math can reliably distinguish signal from noise at your chosen confidence and power thresholds.
For a two-tailed proportions test:

```
n = (z_a + z_b)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2

where:
  p1  = baseline conversion rate
  p2  = p1 * (1 + MDE)            // relative MDE
  z_a = critical value at the confidence level (1.96 for 95%)
  z_b = critical value at the power level (0.84 for 80%)
```

The numerator (z_a + z_b)^2 grows with confidence and power. The denominator (p1 - p2)^2 shrinks rapidly as MDE shrinks, which is why detecting tiny lifts requires huge samples. Doubling MDE roughly quarters the sample size. Halving the baseline conversion rate roughly doubles the sample size.
One-tailed tests assume you only care about detecting an effect in one direction. If the variant performs worse than the control, you treat that as "no effect" and move on. That's almost never how marketing tests actually run. You care about both wins and losses (a loss tells you to keep the control). Two-tailed tests are the honest default. They require roughly 25 percent more sample but produce a legitimate read on either direction of effect.
The most common A/B testing mistake is checking p-values before the predetermined sample is reached. Each peek is essentially another statistical test, and the false-positive rate inflates with every look: a nominal 5% rate roughly triples across five evenly spaced peeks and keeps climbing with more frequent looks. Either commit to checking exactly once at the predetermined sample, or use methods designed for sequential testing (Bayesian, or frequentist with O'Brien-Fleming style alpha spending).
If the calculator returns 100,000+ per variant and you only get 200 visitors a day, the test is not feasible at your chosen MDE. Three options: increase MDE (often the right answer because you usually wouldn't act on a 2% lift anyway), pool tests across higher-traffic segments, or use Bayesian methods which can stop earlier when the effect is clearly large. Don't run an underpowered test and pretend the result is meaningful.
Your landing page converts at 5 percent. You want to detect a 10 percent relative lift (5.0 to 5.5). At 95 percent confidence and 80 percent power, the formula gives you n ≈ 31,012 per variant, 62,024 total. At 1,000 daily visitors split evenly between control and variant, the test runs ~62 days. If 62 days is too long, raise the MDE to 20 percent (detect 5.0 to 6.0): the sample shrinks to ~7,800 per variant and the test runs ~16 days. Same data, different question, very different feasibility.
FAQ
How does the calculator work?
Plug in your baseline conversion rate, the smallest lift you care about (MDE), and your confidence and power thresholds. The calculator returns the number of visitors you need per variant for the test to detect that lift reliably. If you also enter daily traffic, it computes how many days the test will need to run.
What is minimum detectable effect (MDE)?
MDE is the smallest change you want the test to be able to detect. A relative MDE of 10% on a 5% baseline means you can detect any change to 5.5% or higher (or 4.5% or lower). Smaller MDEs require dramatically larger sample sizes: detecting a 1% relative lift takes roughly 100x the sample of detecting a 10% lift. Pick the smallest lift that would actually change your roadmap.
What confidence level should I use?
95% is the marketing standard and the right default for almost every test. 99% is appropriate for irreversible decisions or pricing changes where false positives are expensive. 90% is acceptable only for early-stage experimentation where speed matters more than rigor. Never use less than 90%.
What statistical power should I use?
80% is the marketing standard. It means there's a 20% chance you miss a real effect. For most landing page or copy tests this is fine. For high-stakes decisions (pricing, onboarding flow, paid ad creative at scale), bump to 90%. Anything above 90% is overkill outside scientific contexts.
Why is my required sample size so large?
Three reasons, in order: smaller MDE requires bigger samples (the relationship is roughly inverse-square), lower baseline conversion rates need bigger samples, and higher confidence and power both increase sample. If your sample size is unreasonable, the lever to pull is usually MDE: ask whether a 5% lift is really worth detecting, or whether you'd act on 10%.
Can I use Bayesian methods instead?
Bayesian methods can stop a test as soon as the probability of a winner crosses a threshold (e.g., 95% probability that B beats A). They don't require a fixed sample size in advance, but the math is different; most marketing tools (Optimizely, VWO, Google Optimize legacy) use frequentist methods, which is what this calculator models.
Is the calculator really free?
Yes. No signup, no email gate. We host it because the same teams that run A/B tests also need real attribution: knowing which test winners actually drove revenue, not just which variant got more clicks. That's what SourceLoop does.
Capture and send full attribution data from every signup, lead, booking, and sale to your CRM and ad platforms, so you know exactly what's driving revenue.