Free tool
Plug in your baseline conversion rate, smallest detectable lift, and confidence level. See exactly how many visitors per variant you need and how long the test will run. Two-tailed proportions test, marketing-standard defaults.
How it works
Plug in baseline, pick the smallest lift you care about, see if the test is feasible at your traffic.
baseline = 5% · MDE = 10% (relative) · confidence = 95% · power = 80%
Baseline rate from your data. Smallest lift that would actually change your roadmap. Standard 95/80 confidence and power.
The calculator runs a two-tailed proportions test. Sample size grows with smaller MDE, lower baseline, higher confidence and power.
Test plan, ready
If the test takes 60+ days, bump MDE or pick a higher-traffic page. Stopping early to "peek" invalidates the math.
Inputs explained
Six inputs. The first two drive sample size more than anything else. Power and confidence are usually left at defaults.
Baseline conversion rate
The conversion rate of the control group, taken from your historical data. Use the same period and audience as the planned test, otherwise sample size estimates lie.
Minimum detectable effect (MDE)
The smallest improvement worth detecting. Smaller MDE means a much larger sample size. A 10% relative MDE on a 5% baseline sizes the test to detect a move from 5.0% to 5.5%.
Statistical power
The probability of detecting a real effect when one exists. 80% is the marketing standard. Higher power needs more sample. 90% is closer to the scientific norm.
Significance level
The false-positive tolerance. 95% confidence (5% alpha) is standard. 99% is much stricter and increases the required sample by roughly half. Lower confidence is rarely worth the speed.
Why two-tailed not one-tailed
Real tests can move in either direction. One-tailed tests cut sample size by roughly 20 percent but assume you know which direction matters, which is rarely true. Two-tailed is the honest default.
Daily traffic
Visitors arriving per day at the page being tested, split across both variants. Used to compute days-to-significance. Optional, but helps you decide whether the test is feasible.
Best practices
Calculate sample size before launching
Running a test "until it looks significant" is peeking. The math only works if sample size is fixed in advance. Skipping this step invalidates everything downstream.
Pick MDE based on what you'd act on
If a 5% lift wouldn't change your decision, do not size for 5%. Most marketing teams overstate how much they care about small lifts and undersize their tests as a result.
Run for full weekly cycles
Conversion rates vary by day of week. A 4-day test can land on a weekend slump. Always round duration up to the next full week.
Don't peek at p-values daily
Frequentist tests assume a single look at the data at the predetermined sample size. Daily peeking inflates false-positive rates badly. Either commit to one final read or use Bayesian methods.
Track downstream conversions, not just micro-conversions
A "click rate" winner is not necessarily a revenue winner. Tie test results back to closed-won pipeline or net revenue, not just the metric you optimized.
Built by the team behind SourceLoop
Guide
A/B tests work by detecting a difference between two variants that's larger than what random noise alone would produce. Without enough sample, real effects look like noise and noise looks like real effects. The sample size formula tells you how many visitors per variant you need before the math can reliably distinguish signal from noise at your chosen confidence and power thresholds.
For a two-tailed proportions test:

```
n = (z_a + z_b)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2

where:
  p1  = baseline conversion rate
  p2  = p1 * (1 + MDE)            // relative MDE
  z_a = critical value at the confidence level (1.96 for 95%)
  z_b = critical value at the power level (0.84 for 80%)
```

The numerator (z_a + z_b)^2 grows with confidence and power. The denominator (p1 - p2)^2 shrinks rapidly as MDE shrinks, which is why detecting tiny lifts requires huge samples. Doubling MDE roughly quarters the sample size. Halving the baseline conversion rate roughly doubles the sample size.
One-tailed tests assume you only care about detecting an effect in one direction. If the variant performs worse than the control, you treat that as "no effect" and move on. That's almost never how marketing tests actually run. You care about both wins and losses (a loss tells you to keep the control). Two-tailed tests are the honest default. They require roughly 25 percent more sample but produce a legitimate read on either direction of effect.
The most common A/B testing mistake is checking p-values before the predetermined sample is reached. Each peek is essentially another statistical test, and the false-positive rate inflates with every look: a nominal 5% rate roughly triples across five evenly spaced peeks and keeps climbing with more frequent looks. Either commit to checking exactly once at the predetermined sample, or use methods designed for sequential testing (Bayesian, or frequentist with O'Brien-Fleming style alpha spending).
If the calculator returns 100,000+ per variant and you only get 200 visitors a day, the test is not feasible at your chosen MDE. Three options: increase MDE (often the right answer because you usually wouldn't act on a 2% lift anyway), pool tests across higher-traffic segments, or use Bayesian methods which can stop earlier when the effect is clearly large. Don't run an underpowered test and pretend the result is meaningful.
Your landing page converts at 5 percent. You want to detect a 10 percent relative lift (5.0 to 5.5). At 95 percent confidence and 80 percent power, the formula gives you n ≈ 31,012 per variant, 62,024 total. At 1,000 daily visitors split evenly between control and variant, the test runs ~62 days. If 62 days is too long, raise the MDE to 20 percent (detect 5.0 to 6.0): the sample shrinks to ~7,800 per variant and the test runs ~16 days. Same data, different question, very different feasibility.
FAQ
How does the calculator work?
Plug in your baseline conversion rate, the smallest lift you care about (MDE), and your confidence and power thresholds. The calculator returns the number of visitors you need per variant for the test to detect that lift reliably. If you also enter daily traffic, it computes how many days the test will need to run.
What is minimum detectable effect (MDE)?
MDE is the smallest change you want the test to be able to detect. A relative MDE of 10% on a 5% baseline means you can detect any change to 5.5% or higher (or 4.5% or lower). Smaller MDEs require dramatically larger sample sizes: detecting a 1% relative lift takes roughly 100x the sample of detecting a 10% lift. Pick the smallest lift that would actually change your roadmap.
What confidence level should I use?
95% is the marketing standard and the right default for almost every test. 99% is appropriate for irreversible decisions or pricing changes where false positives are expensive. 90% is acceptable only for early-stage experimentation where speed matters more than rigor. Never use less than 90%.
What statistical power should I use?
80% is the marketing standard. It means there's a 20% chance you miss a real effect. For most landing page or copy tests this is fine. For high-stakes decisions (pricing, onboarding flow, paid ad creative at scale), bump to 90%. Anything above 90% is overkill outside scientific contexts.
Why is my required sample size so large?
Three reasons, in order: smaller MDE requires bigger samples (the relationship is roughly inverse-square), lower baseline conversion rates need bigger samples, and higher confidence and power both increase sample. If your sample size is unreasonable, the lever to pull is usually MDE: ask whether a 5% lift is really worth detecting, or whether you'd act on 10%.
Can I use Bayesian methods instead?
Bayesian methods can stop a test as soon as the probability of a winner crosses a threshold (e.g., 95% probability that B beats A). They don't require a fixed sample size in advance, but the math is different; most marketing tools (Optimizely, VWO, Google Optimize legacy) use frequentist methods, which is what this calculator models.
Is the calculator really free?
Yes. No signup, no email gate. We host it because the same teams that run A/B tests also need real attribution: knowing which test winners actually drove revenue, not just which variant got more clicks. That's what SourceLoop does.
Capture and send full attribution data from every signup, lead, booking, and sale to your CRM and ad platforms, so you know exactly what's driving revenue.