Sequential Probability Ratio Test (SPRT) Methodology for A/B Testing

Overview

The Sequential Probability Ratio Test (SPRT) is a statistical method that enables real-time decision making in A/B tests by evaluating data continuously as it arrives, rather than waiting for a predetermined sample size. Because the test can stop as soon as the evidence crosses a decision boundary, it typically needs fewer observations than a fixed-sample hypothesis test with the same error rates.

Mathematical Foundation

Core SPRT Formula

The SPRT is based on the likelihood ratio between two competing hypotheses:

\Lambda_n = \frac{L(H_1 | \text{data})}{L(H_0 | \text{data})} = \prod_{i=1}^{n} \frac{f(x_i | H_1)}{f(x_i | H_0)}

Where:

$\Lambda_n$ is the likelihood ratio after $n$ observations
$H_0$ is the null hypothesis (no difference between variants)
$H_1$ is the alternative hypothesis (difference of at least $\delta$ )
$f(x_i | H)$ is the probability density under hypothesis $H$

Decision Boundaries

SPRT uses two decision boundaries based on desired error rates:

A = \frac{1-\beta}{\alpha} \quad \text{and} \quad B = \frac{\beta}{1-\alpha}

Where:

$\alpha$ = Type I error rate (false positive probability)
$\beta$ = Type II error rate (false negative probability)
Power = $1 - \beta$

Decision Rules:

If $\Lambda_n \geq A$ : Reject $H_0$ (declare winner)
If $\Lambda_n \leq B$ : Accept $H_0$ (declare no difference)
If $B < \Lambda_n < A$ : Continue testing
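The decision rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `sprt_decision` and its return labels are assumptions made for this example.

```python
def sprt_decision(likelihood_ratio: float, alpha: float = 0.05, beta: float = 0.2) -> str:
    """Apply Wald's decision rule to a likelihood ratio (hypothetical helper)."""
    upper = (1 - beta) / alpha   # boundary A
    lower = beta / (1 - alpha)   # boundary B
    if likelihood_ratio >= upper:
        return "reject_h0"       # declare winner
    if likelihood_ratio <= lower:
        return "accept_h0"       # declare no difference
    return "continue"            # keep collecting data
```

With the default error rates, $A = 16$ and $B \approx 0.21$, so a likelihood ratio of 1 (no evidence either way) falls in the continuation region.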

Log-Likelihood Ratio for Binomial Data

For A/B testing with conversion data, we work with the log-likelihood ratio:

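\text{LLR}_n = \ln(\Lambda_n) = \sum_{i=1}^{n_A} \ln\left(\frac{f(x_i^A | p_{1A})}{f(x_i^A | p_0)}\right) + \sum_{j=1}^{n_B} \ln\left(\frac{f(x_j^B | p_{1B})}{f(x_j^B | p_0)}\right)

Each observation is a Bernoulli outcome, so $f(x | p) = p^x (1-p)^{1-x}$ with $x = 1$ for a conversion and $x = 0$ otherwise. Grouping conversions and non-conversions within each variant gives: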

\text{LLR} = x_A \ln\left(\frac{p_{1A}}{p_0}\right) + (n_A - x_A)\ln\left(\frac{1-p_{1A}}{1-p_0}\right) + x_B \ln\left(\frac{p_{1B}}{p_0}\right) + (n_B - x_B)\ln\left(\frac{1-p_{1B}}{1-p_0}\right)

Where:

$n_A, n_B$ = sample sizes for variants A and B
$x_A, x_B$ = conversion counts for variants A and B
$p_0$ = baseline conversion rate under $H_0$
$p_{1A}, p_{1B}$ = conversion rates under $H_1$
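The grouped binomial formula translates directly into code. The sketch below assumes all rates lie strictly between 0 and 1; the name `binomial_llr` and its argument order are illustrative choices, not part of the methodology.

```python
import math

def binomial_llr(x_a, n_a, x_b, n_b, p0, p1a, p1b):
    """Log-likelihood ratio of H1 versus H0 for two variants (hypothetical helper)."""
    def arm(x, n, p1):
        # Conversions contribute ln(p1/p0); non-conversions contribute ln((1-p1)/(1-p0)).
        return x * math.log(p1 / p0) + (n - x) * math.log((1 - p1) / (1 - p0))
    return arm(x_a, n_a, p1a) + arm(x_b, n_b, p1b)
```

When the alternative rates equal the baseline, every log term is zero and the LLR is 0, which is a quick sanity check for an implementation.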

Hypothesis Specification

Null Hypothesis ( $H_0$ )

Both variants have the same conversion rate:

p_A = p_B = p_0

Alternative Hypothesis ( $H_1$ )

Variants differ by at least the minimum detectable effect $\delta$ :

|p_A - p_B| \geq \delta

For directional testing based on current leader:

If variant A leads: $p_A = p_0 + \frac{\delta}{2}$ , $p_B = p_0 - \frac{\delta}{2}$
If variant B leads: $p_A = p_0 - \frac{\delta}{2}$ , $p_B = p_0 + \frac{\delta}{2}$
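One way to express the directional rule is shown below. It is a sketch under two assumptions not stated in the text: ties in the observed rates are resolved in favor of variant A, and both sample sizes are positive. The name `alternative_rates` is hypothetical.

```python
def alternative_rates(p0, delta, x_a, n_a, x_b, n_b):
    """Return (p_A, p_B) under H1, shifted toward the current leader (hypothetical helper)."""
    half = delta / 2
    if not (0 < p0 - half and p0 + half < 1):
        raise ValueError("p0 +/- delta/2 must stay strictly inside (0, 1)")
    a_leads = x_a / n_a >= x_b / n_b   # assumption: a tie counts as A leading
    if a_leads:
        return p0 + half, p0 - half
    return p0 - half, p0 + half
```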

Baseline Rate Calculation

Pooled Baseline Rate

When no prior baseline is known:

p_0 = \frac{x_A + x_B}{n_A + n_B}

Warmup Period Baseline

For tests with a warmup period of $d$ days:

p_0^{\text{warmup}} = \frac{x_A^{\text{warmup}} + x_B^{\text{warmup}}}{n_A^{\text{warmup}} + n_B^{\text{warmup}}}

This baseline is then applied to post-warmup data analysis, providing more stable baseline estimates and reducing the impact of early-test variance.
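Both baseline variants use the same pooled-rate calculation and differ only in which counts are passed in. A minimal sketch, with the hypothetical name `pooled_rate` and made-up counts in the usage lines:

```python
def pooled_rate(x_a, n_a, x_b, n_b):
    """Pooled conversion rate across both variants (hypothetical helper)."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("at least one observation is required")
    return (x_a + x_b) / total

# Illustrative counts only: baseline from the warmup window,
# later applied when scoring post-warmup data.
p0_warmup = pooled_rate(x_a=48, n_a=500, x_b=52, n_b=500)
```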

Expected Sample Size

SPRT's expected sample size under each hypothesis can be approximated using Wald's formula:

E[N | H_0] \approx \frac{\alpha \ln(A) + (1-\alpha)\ln(B)}{E[\text{LLR per observation} | H_0]}

E[N | H_1] \approx \frac{(1-\beta) \ln(A) + \beta\ln(B)}{E[\text{LLR per observation} | H_1]}

The conservative estimate uses: $N_{\text{conservative}} = \max(E[N | H_0], E[N | H_1])$
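To make Wald's approximation concrete, the sketch below evaluates it for Bernoulli data. It assumes equal traffic allocation between the variants and the symmetric alternative $p_0 \pm \frac{\delta}{2}$, so the per-observation drift is the average over the two arms. The function names are illustrative, and the result counts observations across both variants combined.

```python
import math

def drift(p_true, p0, p1):
    """Expected LLR contribution of one Bernoulli observation with true rate p_true."""
    return p_true * math.log(p1 / p0) + (1 - p_true) * math.log((1 - p1) / (1 - p0))

def expected_sample_sizes(p0, delta, alpha=0.05, beta=0.2):
    """Wald's approximate E[N | H0], E[N | H1], and their maximum (hypothetical helper)."""
    log_a = math.log((1 - beta) / alpha)
    log_b = math.log(beta / (1 - alpha))
    p_hi, p_lo = p0 + delta / 2, p0 - delta / 2
    # Assumption: 50/50 allocation, so average the drift of the two arms.
    drift_h0 = 0.5 * (drift(p0, p0, p_hi) + drift(p0, p0, p_lo))      # negative
    drift_h1 = 0.5 * (drift(p_hi, p0, p_hi) + drift(p_lo, p0, p_lo))  # positive
    n_h0 = (alpha * log_a + (1 - alpha) * log_b) / drift_h0
    n_h1 = ((1 - beta) * log_a + beta * log_b) / drift_h1
    return n_h0, n_h1, max(n_h0, n_h1)
```

Under $H_0$ both numerator and drift are negative, and under $H_1$ both are positive, so each ratio is a positive sample count.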

Sample Size and Time Estimation

Remaining Sample Size Estimation

The additional samples needed to reach each decision boundary can be estimated based on current LLR trajectory:

N_{\text{additional}} = \frac{|\text{Target Boundary} - \text{Current LLR}|}{|\text{LLR}| / N_{\text{current}}}

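Here the LLR accumulation rate is approximated by the absolute value of the current LLR divided by the current sample size. This is a linear extrapolation, so the estimate is unreliable early in a test or when the LLR is close to zero.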
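A direct transcription of the extrapolation might look as follows. The name `additional_samples` is hypothetical, and returning infinity when no rate can be computed is a design choice of this sketch. Because the formula uses the absolute rate, it does not distinguish an LLR moving toward the target boundary from one moving away from it.

```python
import math

def additional_samples(current_llr, n_current, target_boundary):
    """Estimated extra observations needed to reach a boundary (hypothetical helper)."""
    if n_current <= 0 or current_llr == 0:
        return math.inf  # assumption: no usable accumulation rate yet
    rate = abs(current_llr) / n_current  # |LLR| gained per observation so far
    return abs(target_boundary - current_llr) / rate
```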

Leading Conclusion Determination

At any point in the test, the "leading conclusion" is determined by:

\text{Leading} = \begin{cases} \text{Significance} & \text{if } (\text{Upper} - \text{LLR}) \leq (\text{LLR} - \text{Lower}) \\ \text{Futility} & \text{if } (\text{LLR} - \text{Lower}) < (\text{Upper} - \text{LLR}) \end{cases}

This indicates which decision boundary the test is more likely to reach first.
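The case distinction reduces to comparing two distances. In this sketch the function name and the string labels are assumptions; `upper` and `lower` are the log-scale boundaries.

```python
def leading_conclusion(llr, upper, lower):
    """Return which boundary the current LLR is closer to (hypothetical helper)."""
    to_upper = upper - llr
    to_lower = llr - lower
    # Ties go to significance, matching the "<=" in the first case.
    return "significance" if to_upper <= to_lower else "futility"
```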

Time-to-Decision Estimation

Given test velocity data, time estimates can be calculated:

\text{Time}_{\text{decision}} = \frac{N_{\text{additional}}}{\text{Samples per day}}

Where samples per day is calculated from historical test data:

\text{Samples per day} = \frac{\text{Total data points}}{\text{Days elapsed}}

For tests using warmup periods, velocity should be calculated from post-warmup data for more accurate future projections.
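The two formulas combine into a single division chain. The sketch below uses the hypothetical name `days_to_decision`; for a test with a warmup period, the caller would pass post-warmup totals.

```python
def days_to_decision(n_additional, total_points, days_elapsed):
    """Projected days until a boundary is reached at current velocity (hypothetical helper)."""
    if days_elapsed <= 0 or total_points <= 0:
        raise ValueError("velocity needs positive data points and elapsed days")
    samples_per_day = total_points / days_elapsed
    return n_additional / samples_per_day
```

For example, 14,000 data points over 7 days is 2,000 samples per day, so 5,000 additional samples correspond to 2.5 days.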

Key Implementation Details

Binomial Log-Likelihood Calculation

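The binomial coefficient is identical under $H_0$ and $H_1$, so it cancels in the likelihood ratio and is only needed when a full binomial log-likelihood is evaluated. For computational stability, compute it on the log scale: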

\ln \binom{n}{x} = \ln(n!) - \ln(x!) - \ln((n-x)!)

With Stirling's approximation for large factorials:

\ln(n!) \approx n\ln(n) - n + \frac{1}{2}\ln(2\pi n)
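In Python, the standard library's `math.lgamma` gives $\ln(n!)$ as `lgamma(n + 1)` without overflow, which can stand in for Stirling's approximation. The sketch below shows both side by side; the function names are illustrative.

```python
import math

def log_binom(n, x):
    """ln C(n, x) via the log-gamma function (hypothetical helper)."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def stirling_log_factorial(n):
    """Stirling's approximation to ln(n!), intended for large n."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)
```

The leading error term of Stirling's formula is roughly $\frac{1}{12n}$, so the two approaches agree closely once $n$ is in the hundreds.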

Decision Boundary Values

With $\alpha = 0.05$ and $\beta = 0.2$ (80% power):

Upper boundary: $\ln(A) = \ln\left(\frac{0.8}{0.05}\right) = \ln(16) \approx 2.773$
Lower boundary: $\ln(B) = \ln\left(\frac{0.2}{0.95}\right) \approx -1.558$

Advantages Over Traditional Testing

1. Early Stopping

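SPRT can detect significant differences much earlier than fixed-sample tests, often requiring 50-70% fewer samples on average. The actual savings depend on how the true effect compares with the hypothesized $\delta$: effects much larger than $\delta$ stop quickly, while effects between 0 and $\delta$ can take longer.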

2. Flexible Sample Sizes

No need to pre-specify sample sizes. Testing continues until sufficient evidence accumulates.

3. Controlled Error Rates

Keeps Type I and Type II error rates approximately at their nominal levels regardless of when the test stops, unlike "peeking" in traditional tests, which inflates the false positive rate.

4. Real-Time Decisions

Provides continuous monitoring with immediate decision recommendations.

5. Futility Detection

Can detect when variants are practically equivalent, saving resources on futile tests.

6. Actionable Timeline Insights

Provides real-time estimates of when decisions will be reached, enabling better resource planning and stakeholder communication.

Statistical Guarantees

Error Rate Control

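With Wald's boundaries, the realized error rates $\alpha'$ and $\beta'$ are controlled approximately rather than exactly. They satisfy $\alpha' \leq \frac{\alpha}{1-\beta}$ , $\beta' \leq \frac{\beta}{1-\alpha}$ , and $\alpha' + \beta' \leq \alpha + \beta$ , regardless of:

When the test stops
How often decisions are checked

The Type II guarantee applies at the effect size specified under $H_1$ ; true effects smaller than $\delta$ are missed more often. Estimating $p_0$ from the test data, rather than fixing it in advance, makes these guarantees approximate.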

Optimality

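For simple hypotheses with independent, identically distributed observations, SPRT minimizes the expected sample size under both $H_0$ and $H_1$ among all tests whose error rates are no larger (Wald-Wolfowitz theorem).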

Practical Considerations

Minimum Detectable Effect ( $\delta$ )

Choose $\delta$ based on:

Business impact threshold
Practical significance
Available traffic volume

Typical values: 0.01-0.05 (1-5 percentage points)

Power and Significance

Standard choices:

$\alpha = 0.05$ (95% confidence)
$1-\beta = 0.8$ (80% power)

Higher power reduces false negatives but increases required sample sizes.

Warmup Period Benefits

Using a warmup period:

Stabilizes baseline estimates
Reduces variance in early samples
Improves decision accuracy
Accounts for novelty effects
Provides cleaner data for velocity calculations

Time Estimation Considerations

Time estimates assume current test velocity continues
Velocity calculations exclude warmup periods when specified
Estimates are most accurate for tests with consistent traffic patterns
Weekend/holiday effects may impact actual timelines
Leading conclusion indicates the most probable outcome direction

Implementation Formula Summary

Decision boundaries:

\text{Upper} = \ln\left(\frac{1-\beta}{\alpha}\right), \quad \text{Lower} = \ln\left(\frac{\beta}{1-\alpha}\right)

Current decision:

If $\text{LLR} \geq \text{Upper}$ : Declare winner
If $\text{LLR} \leq \text{Lower}$ : Declare no difference
Otherwise: Continue testing

Distance to decision:

To significance: $\max(0, \text{Upper} - \text{LLR})$
To futility: $\max(0, \text{LLR} - \text{Lower})$

Leading conclusion:

Samples to leading: $\min(\text{Samples to significance}, \text{Samples to futility})$
Time to leading: $\frac{\text{Samples to leading}}{\text{Samples per day}}$
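The boundary, decision, and distance formulas in this summary can be gathered into one log-scale status check. This is a sketch only; the name `sprt_status`, the dictionary keys, and the decision labels are assumptions for illustration.

```python
import math

def sprt_status(llr, alpha=0.05, beta=0.2):
    """Summarize the current SPRT state on the log scale (hypothetical helper)."""
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    if llr >= upper:
        decision = "declare_winner"
    elif llr <= lower:
        decision = "declare_no_difference"
    else:
        decision = "continue"
    return {
        "upper": upper,
        "lower": lower,
        "decision": decision,
        "to_significance": max(0.0, upper - llr),  # distance to upper boundary
        "to_futility": max(0.0, llr - lower),      # distance to lower boundary
    }
```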

This methodology provides a robust, efficient framework for sequential A/B testing that adapts to data as it arrives while maintaining rigorous statistical control and providing actionable timeline insights for decision-making.
