
Sequential Probability Ratio Test (SPRT) Methodology for A/B Testing

Overview

Sequential Probability Ratio Test (SPRT) is a statistical method that enables real-time decision making in A/B tests by continuously evaluating data as it arrives, rather than waiting for a predetermined sample size. This approach offers significant advantages over traditional fixed-sample hypothesis testing.

Mathematical Foundation

Core SPRT Formula

The SPRT is based on the likelihood ratio between two competing hypotheses:

$$\Lambda_n = \frac{L(H_1 | \text{data})}{L(H_0 | \text{data})} = \prod_{i=1}^{n} \frac{f(x_i | H_1)}{f(x_i | H_0)}$$

Where:

  • $\Lambda_n$ is the likelihood ratio after $n$ observations
  • $H_0$ is the null hypothesis (no difference between variants)
  • $H_1$ is the alternative hypothesis (difference of at least $\delta$)
  • $f(x_i | H)$ is the probability density under hypothesis $H$
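As a concrete illustration, here is a minimal Python sketch of this likelihood ratio for Bernoulli (conversion) data, accumulated on the log scale for numerical stability. The function name and the example rates are illustrative assumptions, not part of any particular implementation.

```python
import math

def likelihood_ratio(observations, p0, p1):
    """Likelihood ratio Lambda_n for Bernoulli observations (1 = conversion, 0 = none).

    p0: conversion rate under H0; p1: conversion rate under H1.
    Accumulated on the log scale, then exponentiated.
    """
    log_lr = 0.0
    for x in observations:
        log_lr += math.log(p1 if x else 1 - p1) - math.log(p0 if x else 1 - p0)
    return math.exp(log_lr)

# 5 conversions out of 20 visitors, H0: p = 0.10 vs H1: p = 0.15
print(likelihood_ratio([1] * 5 + [0] * 15, p0=0.10, p1=0.15))
```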

Decision Boundaries

SPRT uses two decision boundaries based on desired error rates:

$$A = \frac{1-\beta}{\alpha} \quad \text{and} \quad B = \frac{\beta}{1-\alpha}$$

Where:

  • $\alpha$ = Type I error rate (false positive probability)
  • $\beta$ = Type II error rate (false negative probability)
  • Power = $1 - \beta$

Decision Rules:

  • If $\Lambda_n \geq A$: Reject $H_0$ (declare winner)
  • If $\Lambda_n \leq B$: Accept $H_0$ (declare no difference)
  • If $B < \Lambda_n < A$: Continue testing
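A small sketch of these boundaries and the three-way decision rule, using Wald's approximations $A = (1-\beta)/\alpha$ and $B = \beta/(1-\alpha)$ from above; the function names are illustrative.

```python
def wald_boundaries(alpha, beta):
    """Wald's boundaries A = (1 - beta) / alpha and B = beta / (1 - alpha)."""
    return (1 - beta) / alpha, beta / (1 - alpha)

def sprt_decision(lambda_n, A, B):
    """Apply the three-way SPRT decision rule to the current likelihood ratio."""
    if lambda_n >= A:
        return "reject H0 (declare winner)"
    if lambda_n <= B:
        return "accept H0 (declare no difference)"
    return "continue testing"

A, B = wald_boundaries(alpha=0.05, beta=0.2)
print(A, B)                      # 16.0 and ~0.2105
print(sprt_decision(2.5, A, B))  # continue testing
```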

Log-Likelihood Ratio for Binomial Data

For A/B testing with conversion data, we work with the log-likelihood ratio:

$$\text{LLR}_n = \ln(\Lambda_n) = \sum_{i=1}^{n_A} \ln\frac{f(x_i | H_1)}{f(x_i | H_0)} + \sum_{j=1}^{n_B} \ln\frac{f(x_j | H_1)}{f(x_j | H_0)}$$

where the first sum runs over observations from variant A (evaluated at $p_{1A}$) and the second over observations from variant B (evaluated at $p_{1B}$).

For binomial distributions:

$$\text{LLR} = x_A \ln\left(\frac{p_{1A}}{p_0}\right) + (n_A - x_A)\ln\left(\frac{1-p_{1A}}{1-p_0}\right) + x_B \ln\left(\frac{p_{1B}}{p_0}\right) + (n_B - x_B)\ln\left(\frac{1-p_{1B}}{1-p_0}\right)$$

Where:

  • $n_A, n_B$ = sample sizes for variants A and B
  • $x_A, x_B$ = conversion counts for variants A and B
  • $p_0$ = baseline conversion rate under $H_0$
  • $p_{1A}, p_{1B}$ = conversion rates under $H_1$
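A minimal sketch of this binomial log-likelihood ratio; the function name `binomial_llr` and the example counts are assumptions made for illustration.

```python
import math

def binomial_llr(x_a, n_a, x_b, n_b, p0, p1a, p1b):
    """Binomial log-likelihood ratio for the two-arm test described above."""
    def arm(x, n, p1):
        return x * math.log(p1 / p0) + (n - x) * math.log((1 - p1) / (1 - p0))
    return arm(x_a, n_a, p1a) + arm(x_b, n_b, p1b)

# A: 120/1000 conversions, B: 150/1000; p0 pooled at 0.135, delta = 0.03
print(binomial_llr(120, 1000, 150, 1000, p0=0.135, p1a=0.12, p1b=0.15))
```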

Hypothesis Specification

Null Hypothesis ($H_0$)

Both variants have the same conversion rate:

$$p_A = p_B = p_0$$

Alternative Hypothesis ($H_1$)

Variants differ by at least the minimum detectable effect $\delta$:

$$|p_A - p_B| \geq \delta$$

For directional testing based on current leader:

  • If variant A leads: $p_A = p_0 + \frac{\delta}{2}$, $p_B = p_0 - \frac{\delta}{2}$
  • If variant B leads: $p_A = p_0 - \frac{\delta}{2}$, $p_B = p_0 + \frac{\delta}{2}$
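One possible way to encode this directional choice, assuming ties default to variant A; the helper name and example counts are hypothetical.

```python
def hypothesis_rates(x_a, n_a, x_b, n_b, p0, delta):
    """H1 rates placed symmetrically around p0 in the direction of the current leader."""
    if x_a / n_a >= x_b / n_b:   # variant A leads (ties default to A in this sketch)
        return p0 + delta / 2, p0 - delta / 2
    return p0 - delta / 2, p0 + delta / 2

p1a, p1b = hypothesis_rates(120, 1000, 150, 1000, p0=0.135, delta=0.03)
print(p1a, p1b)  # 0.12 0.15, since variant B currently leads
```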

Baseline Rate Calculation

Pooled Baseline Rate

When no prior baseline is known:

$$p_0 = \frac{x_A + x_B}{n_A + n_B}$$

Warmup Period Baseline

For tests with a warmup period of $d$ days:

$$p_0^{\text{warmup}} = \frac{x_A^{\text{warmup}} + x_B^{\text{warmup}}}{n_A^{\text{warmup}} + n_B^{\text{warmup}}}$$

This baseline is then applied to post-warmup data analysis, providing more stable baseline estimates and reducing the impact of early-test variance.
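A small sketch of the pooled baseline calculation; the warmup counts shown are hypothetical and stand in for data collected during the first $d$ days.

```python
def pooled_baseline(x_a, n_a, x_b, n_b):
    """Pooled conversion rate across both variants."""
    return (x_a + x_b) / (n_a + n_b)

# Hypothetical warmup-period counts from the first d days of the test
p0_warmup = pooled_baseline(x_a=40, n_a=300, x_b=44, n_b=310)
print(p0_warmup)  # ~0.138, then held fixed for the post-warmup analysis
```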

Expected Sample Size

SPRT's expected sample size under each hypothesis can be approximated using Wald's formula:

$$E[N | H_0] \approx \frac{\alpha \ln(A) + (1-\alpha)\ln(B)}{E[\text{LLR per observation} | H_0]}$$

$$E[N | H_1] \approx \frac{(1-\beta)\ln(A) + \beta\ln(B)}{E[\text{LLR per observation} | H_1]}$$

The conservative estimate uses $N_{\text{conservative}} = \max(E[N | H_0], E[N | H_1])$.
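For Bernoulli data, the expected per-observation LLR can be computed directly from the hypothesized rates, which gives a quick planning sketch of Wald's approximations. This is shown for a single conversion stream with illustrative rates; the function names are assumptions.

```python
import math

def expected_llr_per_obs(p_true, p0, p1):
    """Expected per-observation LLR when the true conversion rate is p_true."""
    return p_true * math.log(p1 / p0) + (1 - p_true) * math.log((1 - p1) / (1 - p0))

def expected_sample_sizes(alpha, beta, p0, p1):
    """Wald's approximations for E[N | H0] and E[N | H1]."""
    log_a = math.log((1 - beta) / alpha)
    log_b = math.log(beta / (1 - alpha))
    n_h0 = (alpha * log_a + (1 - alpha) * log_b) / expected_llr_per_obs(p0, p0, p1)
    n_h1 = ((1 - beta) * log_a + beta * log_b) / expected_llr_per_obs(p1, p0, p1)
    return n_h0, n_h1

n_h0, n_h1 = expected_sample_sizes(alpha=0.05, beta=0.2, p0=0.10, p1=0.13)
print(n_h0, n_h1, max(n_h0, n_h1))  # the max is the conservative planning estimate
```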

Sample Size and Time Estimation

Remaining Sample Size Estimation

The additional samples needed to reach each decision boundary can be estimated based on current LLR trajectory:

$$N_{\text{additional}} = \frac{|\text{Target Boundary} - \text{Current LLR}|}{|\text{LLR}| / N_{\text{current}}}$$

Where the LLR accumulation rate is approximated by the current LLR divided by current sample size.
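A minimal sketch of this extrapolation; the example LLR, sample count, and boundary value are illustrative.

```python
def additional_samples(llr, n_current, boundary):
    """Samples needed to reach a boundary, extrapolating the current LLR accumulation rate."""
    rate = abs(llr) / n_current        # average LLR contribution per observation so far
    if rate == 0:
        return float("inf")            # no trend yet, nothing to extrapolate
    return abs(boundary - llr) / rate

# LLR = 1.1 after 2,400 observations, upper boundary ~2.773
print(additional_samples(llr=1.1, n_current=2400, boundary=2.773))  # ~3650 more samples
```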

Leading Conclusion Determination

At any point in the test, the "leading conclusion" is determined by:

$$\text{Leading} = \begin{cases} \text{Significance} & \text{if } (\text{Upper} - \text{LLR}) \leq (\text{LLR} - \text{Lower}) \\ \text{Futility} & \text{if } (\text{LLR} - \text{Lower}) < (\text{Upper} - \text{LLR}) \end{cases}$$

This indicates which decision boundary the test is more likely to reach first.
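A one-line sketch of this comparison, with illustrative boundary values:

```python
def leading_conclusion(llr, upper, lower):
    """Return the boundary the test is currently closer to."""
    return "significance" if (upper - llr) <= (llr - lower) else "futility"

print(leading_conclusion(llr=1.1, upper=2.773, lower=-1.558))  # significance
```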

Time-to-Decision Estimation

Given test velocity data, time estimates can be calculated:

$$\text{Time}_{\text{decision}} = \frac{N_{\text{additional}}}{\text{Samples per day}}$$

Where samples per day is calculated from historical test data:

$$\text{Samples per day} = \frac{\text{Total data points}}{\text{Days elapsed}}$$

For tests using warmup periods, velocity should be calculated from post-warmup data for more accurate future projections.
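A small sketch of the velocity and time-to-decision calculation, assuming the counts come from post-warmup data; the numbers are illustrative.

```python
def samples_per_day(total_points, days_elapsed):
    """Observed test velocity; for warmup tests, use post-warmup counts and days only."""
    return total_points / days_elapsed

def days_to_decision(additional_samples_needed, velocity):
    """Projected days until a boundary is reached at the current velocity."""
    return additional_samples_needed / velocity

velocity = samples_per_day(total_points=2400, days_elapsed=6)
print(days_to_decision(additional_samples_needed=3650, velocity=velocity))  # ~9.1 days
```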

Key Implementation Details

Binomial Log-Likelihood Calculation

For computational stability, we use:

$$\ln \binom{n}{x} = \ln(n!) - \ln(x!) - \ln((n-x)!)$$

With Stirling's approximation for large factorials:

$$\ln(n!) \approx n\ln(n) - n + \frac{1}{2}\ln(2\pi n)$$
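A sketch comparing Stirling's approximation against the exact log-gamma function in Python's standard library (`math.lgamma(n + 1)` equals $\ln(n!)$), which is a ready-made alternative when available; the example values are illustrative.

```python
import math

def log_factorial_stirling(n):
    """Stirling's approximation to ln(n!)."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def log_binom(n, x):
    """ln C(n, x) via exact log-gamma values: lgamma(n + 1) == ln(n!)."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

print(log_factorial_stirling(1000), math.lgamma(1001))  # both ~5912.128
print(log_binom(1000, 120))
```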

Decision Boundary Values

With $\alpha = 0.05$ and $\beta = 0.2$ (80% power):

  • Upper boundary: $\ln(A) = \ln\left(\frac{0.8}{0.05}\right) = \ln(16) \approx 2.773$
  • Lower boundary: $\ln(B) = \ln\left(\frac{0.2}{0.95}\right) \approx -1.558$
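These values can be reproduced directly:

```python
import math

alpha, beta = 0.05, 0.2
upper = math.log((1 - beta) / alpha)  # ln(16)         ~  2.773
lower = math.log(beta / (1 - alpha))  # ln(0.2 / 0.95) ~ -1.558
print(upper, lower)
```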

Advantages Over Traditional Testing

1. Early Stopping

SPRT can detect significant differences much earlier than fixed-sample tests, often requiring 50-70% fewer samples on average.

2. Flexible Sample Sizes

No need to pre-specify sample sizes. Testing continues until sufficient evidence accumulates.

3. Controlled Error Rates

Controls Type I and Type II error rates at their nominal levels regardless of when the test stops, unlike repeated "peeking" at a fixed-sample test, which inflates the false-positive rate.

4. Real-Time Decisions

Provides continuous monitoring with immediate decision recommendations.

5. Futility Detection

Can detect when variants are practically equivalent, saving resources on inconclusive tests.

6. Actionable Timeline Insights

Provides real-time estimates of when decisions will be reached, enabling better resource planning and stakeholder communication.

Statistical Guarantees

Error Rate Control

Under SPRT with Wald's boundaries, the Type I error rate is controlled at (approximately) $\alpha$ and the Type II error rate at (approximately) $\beta$, regardless of:

  • When the test stops
  • How often decisions are checked
  • The actual effect size (if non-zero)

Optimality

Among all tests with the same (or lower) error rates, SPRT minimizes the expected sample size under both $H_0$ and $H_1$ (Wald-Wolfowitz theorem).

Practical Considerations

Minimum Detectable Effect ($\delta$)

Choose $\delta$ based on:

  • Business impact threshold
  • Practical significance
  • Available traffic volume

Typical values: 0.01-0.05 (1-5 percentage points)

Power and Significance

Standard choices:

  • $\alpha = 0.05$ (95% confidence)
  • $1 - \beta = 0.8$ (80% power)

Higher power reduces false negatives but increases required sample sizes.

Warmup Period Benefits

Using a warmup period:

  • Stabilizes baseline estimates
  • Reduces variance in early samples
  • Improves decision accuracy
  • Accounts for novelty effects
  • Provides cleaner data for velocity calculations

Time Estimation Considerations

  • Time estimates assume current test velocity continues
  • Velocity calculations exclude warmup periods when specified
  • Estimates are most accurate for tests with consistent traffic patterns
  • Weekend/holiday effects may impact actual timelines
  • Leading conclusion indicates the most probable outcome direction

Implementation Formula Summary

Decision boundaries:

$$\text{Upper} = \ln\left(\frac{1-\beta}{\alpha}\right), \quad \text{Lower} = \ln\left(\frac{\beta}{1-\alpha}\right)$$

Current decision:

  • If $\text{LLR} \geq \text{Upper}$: Declare winner
  • If $\text{LLR} \leq \text{Lower}$: Declare no difference
  • Otherwise: Continue testing

Distance to decision:

  • To significance: $\max(0, \text{Upper} - \text{LLR})$
  • To futility: $\max(0, \text{LLR} - \text{Lower})$

Leading conclusion:

  • Samples to leading: $\min(\text{Samples to significance}, \text{Samples to futility})$
  • Time to leading: $\frac{\text{Samples to leading}}{\text{Samples per day}}$
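A compact sketch that consolidates the summary formulas above into a single status report; the function name, default error rates, and return structure are illustrative assumptions, not a prescribed interface.

```python
import math

def sprt_status(llr, alpha=0.05, beta=0.2):
    """Consolidated status: decision, distance to each boundary, and leading conclusion."""
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    if llr >= upper:
        decision = "declare winner"
    elif llr <= lower:
        decision = "declare no difference"
    else:
        decision = "continue testing"
    to_significance = max(0.0, upper - llr)
    to_futility = max(0.0, llr - lower)
    leading = "significance" if to_significance <= to_futility else "futility"
    return {"decision": decision,
            "to_significance": to_significance,
            "to_futility": to_futility,
            "leading": leading}

print(sprt_status(llr=1.1))
```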

This methodology provides a robust, efficient framework for sequential A/B testing that adapts to data as it arrives while maintaining rigorous statistical control and providing actionable timeline insights for decision-making.