Standard Definition
An A/B test (also: split test) is the controlled comparison of two variants to determine, with statistical rigor, which performs better on a defined goal. In a classical A/B test, traffic is split evenly between variant A (the original) and variant B (the challenger), and a success metric (conversion rate, click-through rate, average order value) is measured over a fixed period. Statistical significance is typically assessed with a chi-squared test or Bayesian methods, at a customary 95 percent confidence level (p < 0.05). Tools: Google Optimize (discontinued 2023, partly replaced by GA4 integrations), VWO, Optimizely, AB Tasty, Convert. Multivariate tests (MVT) vary multiple elements simultaneously and are statistically more demanding.
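For illustration, here is a minimal sketch of such a chi-squared significance check using scipy.stats.chi2_contingency; all visitor and conversion counts are hypothetical illustration values.

```python
# Minimal sketch: chi-squared significance check for a classic A/B test.
# All counts below are hypothetical, not real data.
from scipy.stats import chi2_contingency

visitors_a, conversions_a = 10_000, 200  # variant A: 2.0% conversion rate
visitors_b, conversions_b = 10_000, 260  # variant B: 2.6% conversion rate

# 2x2 contingency table: rows = variants, columns = [converted, not converted]
table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("significant at the 95% level" if p_value < 0.05 else "not significant")
```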
What this means in mandate practice
A/B tests are the methodological foundation of serious conversion rate optimization, yet in practice they are frequently flawed in execution.
First, statistical power is the most common weakness. Many mandate A/B tests run with too little traffic or for too short a duration to support statistically robust conclusions. Rule of thumb: at least 100 conversions per variant and at least 2 complete weeks of runtime (to capture weekday fluctuations), ideally 4 weeks. Reporting "significant" results on less data risks systematic misconclusions, most notably the "peeking problem": checking interim results repeatedly and stopping at the first significant readout sharply inflates the false positive rate.
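The peeking effect is easy to demonstrate by simulation. Below is a minimal sketch that runs A/A tests (both variants share the same true conversion rate, so every "significant" result is a false positive) with repeated interim chi-squared checks; the batch size, number of peeks, and trial count are illustrative assumptions, not recommendations.

```python
# Minimal sketch of the peeking problem: simulate A/A tests with repeated
# interim significance checks and count how often one look hits p < 0.05.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
true_rate = 0.02   # identical conversion rate in both variants (A/A test)
batch = 500        # visitors per variant between interim looks (assumption)
peeks = 20         # number of interim significance checks (assumption)
trials = 2_000     # simulated experiments (assumption)

false_positives = 0
for _ in range(trials):
    conversions = np.zeros(2)
    visitors = np.zeros(2)
    for _ in range(peeks):
        conversions += rng.binomial(batch, true_rate, size=2)
        visitors += batch
        # rows = variants, columns = [converted, not converted]
        table = np.stack([conversions, visitors - conversions], axis=1)
        if table.min() >= 5:  # crude validity guard for the chi-squared test
            _, p_value, _, _ = chi2_contingency(table)
            if p_value < 0.05:  # any "significant" interim look stops the test
                false_positives += 1
                break

print(f"false positive rate with peeking: {false_positives / trials:.1%}")
# typically lands well above the nominal 5 percent
```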
Second, small effects require far more volume than commonly assumed. A test meant to detect a conversion rate improvement of 0.5 percentage points (from 2.0 to 2.5 percent) requires on the order of 14,000 visitors per variant at 80 percent power and a two-sided 5 percent significance level. For many sites, that means several months of test duration. Consequence: A/B tests are only economically sensible when relevant effect sizes are plausible; small cosmetic optimizations (button color, headline phrasing) often end in non-significant tests and wasted testing resources.
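The figure above can be reproduced with a standard power calculation; here is a minimal sketch using statsmodels, assuming 80 percent power and a two-sided 5 percent significance level.

```python
# Minimal sketch: visitors needed per variant to detect a lift from 2.0%
# to 2.5% conversion rate (alpha = 0.05, power = 0.8, two-sided).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.025, 0.02)  # Cohen's h for the two rates

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"visitors needed per variant: {n_per_variant:,.0f}")  # roughly 13,800
# At a ~2% base rate that is a few hundred conversions per variant,
# but tens of thousands of visitors in total.
```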
Third, the most common Calvarius mandate finding is over-interpretation of test results. A test with 95 percent significance and an 8 percent conversion rate improvement gets reported as a "clear winner", even when the confidence interval ranges from -1 to +17 percent. Honest test interpretation considers not just the point estimate but the whole bandwidth. Practice recommendation: communicate A/B test results as confidence intervals, not point values; this sets realistic expectations and avoids disappointing rollout performance after the test ends.
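One way to implement that recommendation: report the relative lift together with an approximate 95 percent confidence interval. The sketch below uses a delta-method interval on the log of the rate ratio; the helper name relative_lift_ci and all counts are our own illustrative assumptions, not a prescribed reporting standard.

```python
# Minimal sketch: report the relative lift of B over A with an approximate
# 95% confidence interval instead of a bare point estimate.
import numpy as np

def relative_lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for the relative lift of B over A (delta method)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    log_ratio = np.log(p_b / p_a)
    # delta-method variance of log(p_hat) is roughly (1 - p) / conversions
    se = np.sqrt((1 - p_a) / conv_a + (1 - p_b) / conv_b)
    lift = p_b / p_a - 1.0
    lo = np.exp(log_ratio - z * se) - 1.0
    hi = np.exp(log_ratio + z * se) - 1.0
    return lift, lo, hi

# hypothetical counts: an "8% winner" whose interval still spans zero
lift, lo, hi = relative_lift_ci(conv_a=400, n_a=20_000, conv_b=432, n_b=20_000)
print(f"lift: {lift:+.1%}  (95% CI: {lo:+.1%} to {hi:+.1%})")
# prints roughly "+8.0% (95% CI: -5.6% to +23.6%)"; the interval, not the
# point estimate, is what belongs in the client report.
```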
