Sequential Probability Ratio Test (SPRT) Methodology for A/B Testing
Overview
The Sequential Probability Ratio Test (SPRT) is a statistical method that enables real-time decision making in A/B tests by evaluating data continuously as it arrives, rather than waiting for a predetermined sample size. Compared with traditional fixed-sample hypothesis testing, it can stop early while keeping error rates under control.
Mathematical Foundation
Core SPRT Formula
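The SPRT is based on the likelihood ratio between two competing hypotheses:
$$\Lambda_n = \frac{\prod_{i=1}^{n} f(x_i \mid H_1)}{\prod_{i=1}^{n} f(x_i \mid H_0)}$$
Where:
- $\Lambda_n$ is the likelihood ratio after $n$ observations
- $H_0$ is the null hypothesis (no difference between variants)
- $H_1$ is the alternative hypothesis (difference of at least $\delta$)
- $f(x \mid H_i)$ is the probability density under hypothesis $H_i$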
Decision Boundaries
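SPRT uses two decision boundaries based on desired error rates:
$$A = \frac{1 - \beta}{\alpha}, \qquad B = \frac{\beta}{1 - \alpha}$$
Where:
- $\alpha$ = Type I error rate (false positive probability)
- $\beta$ = Type II error rate (false negative probability)
- Power = $1 - \beta$
Decision Rules:
- If $\Lambda_n \ge A$: Reject $H_0$ (declare winner)
- If $\Lambda_n \le B$: Accept $H_0$ (declare no difference)
- If $B < \Lambda_n < A$: Continue testing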
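The decision rules above can be sketched in a few lines of Python. This is a minimal illustration that works on the log scale and assumes Wald's standard thresholds $A = (1-\beta)/\alpha$ and $B = \beta/(1-\alpha)$; the function names and return labels are hypothetical, not taken from any particular implementation.

```python
import math


def sprt_boundaries(alpha: float, beta: float) -> tuple[float, float]:
    """Return (ln A, ln B), Wald's SPRT thresholds on the log scale."""
    if not (0 < alpha < 1 and 0 < beta < 1):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    upper = math.log((1 - beta) / alpha)   # ln A: reject H0 at or above this
    lower = math.log(beta / (1 - alpha))   # ln B: accept H0 at or below this
    return upper, lower


def sprt_decision(llr: float, alpha: float = 0.05, beta: float = 0.20) -> str:
    """Map a log-likelihood ratio to one of three (hypothetical) labels."""
    upper, lower = sprt_boundaries(alpha, beta)
    if llr >= upper:
        return "reject_h0"    # declare winner
    if llr <= lower:
        return "accept_h0"    # declare no difference
    return "continue"         # keep collecting data
```

Comparing the log-likelihood ratio against $\ln A$ and $\ln B$ is equivalent to comparing $\Lambda_n$ against $A$ and $B$, and avoids overflow when many observations are multiplied together.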
Log-Likelihood Ratio for Binomial Data
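For A/B testing with conversion data, we work with the log-likelihood ratio:
$$\text{LLR}_n = \ln \Lambda_n = \ln L(\text{data} \mid H_1) - \ln L(\text{data} \mid H_0)$$
For binomial distributions:
$$\text{LLR}_n = x_A \ln\frac{p_A}{p_0} + (n_A - x_A)\ln\frac{1 - p_A}{1 - p_0} + x_B \ln\frac{p_B}{p_0} + (n_B - x_B)\ln\frac{1 - p_B}{1 - p_0}$$
Where:
- $n_A, n_B$ = sample sizes for variants A and B
- $x_A, x_B$ = conversion counts for variants A and B
- $p_0$ = baseline conversion rate under $H_0$
- $p_A, p_B$ = conversion rates under $H_1$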
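As an illustration, the binomial log-likelihood ratio can be computed directly from the counts. The sketch below assumes the binomial coefficients cancel between numerator and denominator (they depend only on the counts, not on the rates), so only the rate terms remain; `binomial_llr` and its argument names are hypothetical.

```python
import math


def binomial_llr(x_a: int, n_a: int, x_b: int, n_b: int,
                 p0: float, p_a: float, p_b: float) -> float:
    """Log-likelihood ratio of H1 (rates p_a, p_b) versus H0 (common rate p0)."""
    for p in (p0, p_a, p_b):
        if not 0 < p < 1:
            raise ValueError("all rates must lie strictly between 0 and 1")

    def arm(x: int, n: int, p_alt: float) -> float:
        # Contribution of one variant: conversions plus non-conversions.
        return (x * math.log(p_alt / p0)
                + (n - x) * math.log((1 - p_alt) / (1 - p0)))

    return arm(x_a, n_a, p_a) + arm(x_b, n_b, p_b)


# Example: A converts 120/1000, B converts 100/1000, baseline 10%,
# and H1 says A is two points better than baseline.
llr = binomial_llr(120, 1000, 100, 1000, p0=0.10, p_a=0.12, p_b=0.10)
```

When the rates under $H_1$ equal the baseline, every term is $\ln 1 = 0$ and the ratio carries no evidence either way.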
Hypothesis Specification
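Null Hypothesis ($H_0$)
Both variants have the same conversion rate:
$$H_0: p_A = p_B = p_0$$
Alternative Hypothesis ($H_1$)
Variants differ by at least the minimum detectable effect $\delta$:
$$H_1: |p_A - p_B| \ge \delta$$
For directional testing based on current leader:
- If variant A leads: $p_A = p_0 + \delta$, $p_B = p_0$
- If variant B leads: $p_A = p_0$, $p_B = p_0 + \delta$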
Baseline Rate Calculation
Pooled Baseline Rate
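When no prior baseline is known:
$$p_0 = \frac{x_A + x_B}{n_A + n_B}$$
Warmup Period Baseline
For tests with a warmup period of $w$ days, the same pooled rate is computed from warmup data only:
$$p_0 = \frac{x_A^{(w)} + x_B^{(w)}}{n_A^{(w)} + n_B^{(w)}}$$
where the superscript $(w)$ denotes counts collected during the first $w$ days.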
This baseline is then applied to post-warmup data analysis, providing more stable baseline estimates and reducing the impact of early-test variance.
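A small sketch of both baseline calculations follows. It assumes daily data is available as a list of `(x_a, n_a, x_b, n_b)` tuples, one per day; that layout and the function names are hypothetical and chosen only for illustration.

```python
def pooled_baseline(x_a: int, n_a: int, x_b: int, n_b: int) -> float:
    """Pooled conversion rate across both variants."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("no samples collected yet")
    return (x_a + x_b) / total


def warmup_baseline(daily: list[tuple[int, int, int, int]],
                    warmup_days: int) -> float:
    """Pooled rate using only the first `warmup_days` daily records."""
    warmup = daily[:warmup_days]
    x_a = sum(day[0] for day in warmup)
    n_a = sum(day[1] for day in warmup)
    x_b = sum(day[2] for day in warmup)
    n_b = sum(day[3] for day in warmup)
    return pooled_baseline(x_a, n_a, x_b, n_b)


# Three days of (x_a, n_a, x_b, n_b); the first two form the warmup period.
days = [(10, 100, 12, 100), (9, 100, 9, 100), (15, 100, 20, 100)]
baseline = warmup_baseline(days, warmup_days=2)   # (10+9+12+9) / 400
```

Records after the warmup window are deliberately ignored here, so the baseline stays fixed while post-warmup data accumulates.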
Expected Sample Size
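SPRT's expected sample size under each hypothesis can be approximated using Wald's formula:
$$E[N \mid H_0] \approx \frac{(1 - \alpha)\ln B + \alpha \ln A}{E[z \mid H_0]}, \qquad E[N \mid H_1] \approx \frac{\beta \ln B + (1 - \beta)\ln A}{E[z \mid H_1]}$$
where $z$ is the log-likelihood ratio contribution of a single observation, and $A$ and $B$ are the upper and lower decision boundaries.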
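The conservative estimate uses the larger of the two expected sample sizes:
$$N_{\text{conservative}} = \max\left(E[N \mid H_0],\; E[N \mid H_1]\right)$$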
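To make Wald's approximation concrete, the sketch below evaluates it for the simplest setting: a single stream of Bernoulli observations with rate $p_0$ under $H_0$ and $p_1 = p_0 + \delta$ under $H_1$. This one-sample simplification is an assumption made for illustration and is not the two-variant setup used elsewhere in this document; the function name is hypothetical.

```python
import math


def wald_expected_sample_size(p0: float, p1: float,
                              alpha: float = 0.05,
                              beta: float = 0.20) -> tuple[float, float]:
    """Approximate (E[N | H0], E[N | H1]) for a one-sample Bernoulli SPRT."""
    ln_a = math.log((1 - beta) / alpha)
    ln_b = math.log(beta / (1 - alpha))

    def mean_step(p_true: float) -> float:
        # Expected per-observation LLR increment when the true rate is p_true.
        return (p_true * math.log(p1 / p0)
                + (1 - p_true) * math.log((1 - p1) / (1 - p0)))

    n_h0 = ((1 - alpha) * ln_b + alpha * ln_a) / mean_step(p0)
    n_h1 = (beta * ln_b + (1 - beta) * ln_a) / mean_step(p1)
    return n_h0, n_h1


n_h0, n_h1 = wald_expected_sample_size(p0=0.10, p1=0.15)
conservative = max(n_h0, n_h1)
```

The approximation ignores overshoot past the boundaries, so actual averages tend to be somewhat larger than these figures.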
Sample Size and Time Estimation
Remaining Sample Size Estimation
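The additional samples needed to reach each decision boundary can be estimated based on current LLR trajectory:
$$N_{\text{significance}} \approx \frac{\ln A - \text{LLR}_n}{|r_n|}, \qquad N_{\text{futility}} \approx \frac{\text{LLR}_n - \ln B}{|r_n|}$$
Where $\ln A$ and $\ln B$ are the upper and lower log boundaries, and the LLR accumulation rate $r_n = \text{LLR}_n / n$ is approximated by the current LLR divided by the current sample size $n = n_A + n_B$.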
Leading Conclusion Determination
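At any point in the test, the "leading conclusion" is determined by comparing the LLR distance to each boundary, $d_{\text{sig}} = \ln A - \text{LLR}_n$ and $d_{\text{fut}} = \text{LLR}_n - \ln B$:
$$\text{leading conclusion} = \begin{cases} \text{significance} & \text{if } d_{\text{sig}} < d_{\text{fut}} \\ \text{futility} & \text{otherwise} \end{cases}$$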
This indicates which decision boundary the test is more likely to reach first.
Time-to-Decision Estimation
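Given test velocity data, time estimates can be calculated:
$$\text{days remaining} = \frac{\text{samples remaining}}{\text{samples per day}}$$
Where samples per day is calculated from historical test data:
$$\text{samples per day} = \frac{n_A + n_B}{\text{days elapsed}}$$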
For tests using warmup periods, velocity should be calculated from post-warmup data for more accurate future projections.
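The estimation steps in this section can be combined into one helper. This is a sketch under stated assumptions: the leading conclusion is taken to be the boundary with the smaller LLR distance, the accumulation rate is the absolute value of the current LLR divided by the current sample size, and the test has not yet crossed a boundary. The function name and the returned dictionary keys are hypothetical.

```python
import math


def estimate_time_to_decision(llr: float, n: int, days_elapsed: float,
                              alpha: float = 0.05,
                              beta: float = 0.20) -> dict:
    """Project samples and days until the leading boundary is reached."""
    ln_a = math.log((1 - beta) / alpha)
    ln_b = math.log(beta / (1 - alpha))

    dist_sig = max(ln_a - llr, 0.0)   # LLR distance to the upper boundary
    dist_fut = max(llr - ln_b, 0.0)   # LLR distance to the lower boundary
    leading = "significance" if dist_sig < dist_fut else "futility"

    rate = abs(llr) / n               # LLR accumulated per sample so far
    distance = min(dist_sig, dist_fut)
    samples = distance / rate if rate > 0 else math.inf

    per_day = n / days_elapsed        # observed test velocity
    return {
        "leading": leading,
        "samples_remaining": samples,
        "days_remaining": samples / per_day,
    }


estimate = estimate_time_to_decision(llr=1.5, n=3000, days_elapsed=10)
```

Because the rate is extrapolated linearly from the data seen so far, the projection is unstable early in a test, when the LLR is near zero, and should be read as a rough planning figure.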
Key Implementation Details
Binomial Log-Likelihood Calculation
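For computational stability, we use:
$$\ln L(x; n, p) = \ln\binom{n}{x} + x \ln p + (n - x)\ln(1 - p)$$
With Stirling's approximation for large factorials:
$$\ln n! \approx n \ln n - n + \tfrac{1}{2}\ln(2\pi n)$$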
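In Python, the same quantity can be sketched with the standard library. Note the swap: the version below uses `math.lgamma` for the log-factorials instead of Stirling's approximation, since it is accurate for small counts as well; a Stirling helper is included only for comparison. Both function names are hypothetical.

```python
import math


def log_binom_pmf(x: int, n: int, p: float) -> float:
    """Log of the binomial probability of x successes in n trials."""
    # ln C(n, x) via log-gamma, since ln(k!) = lgamma(k + 1).
    log_coeff = (math.lgamma(n + 1)
                 - math.lgamma(x + 1)
                 - math.lgamma(n - x + 1))
    # log1p(-p) computes ln(1 - p) accurately when p is small.
    return log_coeff + x * math.log(p) + (n - x) * math.log1p(-p)


def stirling_log_factorial(n: int) -> float:
    """Stirling's approximation to ln(n!), shown for comparison only."""
    if n < 2:
        return 0.0   # 0! = 1! = 1
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)


# The approximation error shrinks roughly like 1 / (12 n).
gap = abs(stirling_log_factorial(1000) - math.lgamma(1001))
```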
Decision Boundary Values
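With $\alpha = 0.05$ and $\beta = 0.20$ (80% power):
- Upper boundary: $\ln A = \ln(0.80 / 0.05) = \ln 16 \approx 2.773$
- Lower boundary: $\ln B = \ln(0.20 / 0.95) \approx -1.558$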
Advantages Over Traditional Testing
1. Early Stopping
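SPRT can reach a decision much earlier than a fixed-sample test with the same error rates, often requiring 50-70% fewer samples on average. The actual saving depends on how far the true effect lies from the hypothesized values.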
2. Flexible Sample Sizes
No need to pre-specify sample sizes. Testing continues until sufficient evidence accumulates.
3. Controlled Error Rates
Keeps Type I and Type II error rates controlled at approximately their nominal levels regardless of when the test stops, unlike "peeking" in traditional tests, which inflates the false positive rate.
4. Real-Time Decisions
Provides continuous monitoring with immediate decision recommendations.
5. Futility Detection
Can detect when variants are practically equivalent, saving resources on inconclusive tests.
6. Actionable Timeline Insights
Provides real-time estimates of when decisions will be reached, enabling better resource planning and stakeholder communication.
Statistical Guarantees
Error Rate Control
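Under SPRT with Wald's boundaries, the probability of Type I error is approximately $\alpha$ and of Type II error approximately $\beta$. The realized rates $\alpha'$ and $\beta'$ satisfy $\alpha' \le \alpha / (1 - \beta)$ and $\beta' \le \beta / (1 - \alpha)$, and because the LLR usually overshoots a boundary when the test stops, the test tends to be slightly conservative. These bounds hold regardless of:
- When the test stops
- How often decisions are checked
- The actual effect size, as long as it is at least $\delta$ (for the Type II bound)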
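One way to check error-rate control empirically is a small Monte Carlo simulation under the null hypothesis. The sketch below again assumes the simplified one-sample Bernoulli setting (true rate $p_0$, alternative $p_0 + \delta$) with fixed, known rates, which is an illustrative assumption and not the two-variant procedure with an estimated baseline. The function name, the default parameters, and the `max_n` truncation are all hypothetical.

```python
import math
import random


def simulate_false_positive_rate(p0: float = 0.10, delta: float = 0.05,
                                 alpha: float = 0.05, beta: float = 0.20,
                                 runs: int = 2000, max_n: int = 5000,
                                 seed: int = 0) -> float:
    """Fraction of null-hypothesis runs that wrongly cross the upper boundary."""
    rng = random.Random(seed)
    ln_a = math.log((1 - beta) / alpha)
    ln_b = math.log(beta / (1 - alpha))
    p1 = p0 + delta
    step_hit = math.log(p1 / p0)                # LLR increment on a conversion
    step_miss = math.log((1 - p1) / (1 - p0))   # increment on a non-conversion

    rejections = 0
    for _run in range(runs):
        llr = 0.0
        for _obs in range(max_n):
            # Data are generated with the true rate p0, so H0 holds.
            llr += step_hit if rng.random() < p0 else step_miss
            if llr >= ln_a:       # false positive
                rejections += 1
                break
            if llr <= ln_b:       # correct acceptance of H0
                break
    return rejections / runs


rate = simulate_false_positive_rate()
```

The check is made after every single observation, which is the most aggressive form of continuous monitoring, and the observed rate should still stay near or below the nominal level.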
Optimality
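Among all tests whose Type I and Type II error rates are no larger than the SPRT's, the SPRT minimizes the expected sample size under both $H_0$ and $H_1$, for simple hypotheses with independent, identically distributed observations (Wald-Wolfowitz theorem).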
Practical Considerations
Minimum Detectable Effect ($\delta$)
Choose $\delta$ based on:
- Business impact threshold
- Practical significance
- Available traffic volume
Typical values: 0.01-0.05 (1-5 percentage points)
Power and Significance
Standard choices:
- $\alpha = 0.05$ (95% confidence)
- $\beta = 0.20$ (80% power)
Higher power reduces false negatives but increases required sample sizes.
Warmup Period Benefits
Using a warmup period:
- Stabilizes baseline estimates
- Reduces variance in early samples
- Improves decision accuracy
- Accounts for novelty effects
- Provides cleaner data for velocity calculations
Time Estimation Considerations
- Time estimates assume current test velocity continues
- Velocity calculations exclude warmup periods when specified
- Estimates are most accurate for tests with consistent traffic patterns
- Weekend/holiday effects may impact actual timelines
- Leading conclusion indicates the most probable outcome direction
Implementation Formula Summary
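Decision boundaries:
$$\ln A = \ln\frac{1 - \beta}{\alpha}, \qquad \ln B = \ln\frac{\beta}{1 - \alpha}$$
Current decision:
- If $\text{LLR} \ge \ln A$: Declare winner
- If $\text{LLR} \le \ln B$: Declare no difference
- Otherwise: Continue testing
Distance to decision:
- To significance: $d_{\text{sig}} = \ln A - \text{LLR}$
- To futility: $d_{\text{fut}} = \text{LLR} - \ln B$
Leading conclusion: significance if $d_{\text{sig}} < d_{\text{fut}}$, otherwise futility
- Samples to leading: $N_{\text{leading}} = d_{\text{leading}} / |\text{LLR} / n|$
- Time to leading: $N_{\text{leading}} / \text{samples per day}$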
This methodology provides a robust, efficient framework for sequential A/B testing that adapts to data as it arrives while maintaining rigorous statistical control and providing actionable timeline insights for decision-making.