AIDS vaccine and statistical significance

This morning came word that the results of the AIDS vaccine trial, reported as a modest success last month, were, on further analysis, not statistically significant. I’ve been waiting a couple of weeks for this shoe to drop, as even the originally reported results were pretty weak. I’m not claiming any expertise in vaccine research, but I do know how to do elementary statistics.

According to the LA Times article, the switch from statistically significant to not statistically significant came when seven of the test subjects were reclassified. That’s seven subjects out of over 16,000. How can such a small change flip the results from success to failure? Because the original results were on the edge of statistical significance to begin with, as anyone with a little statistics background could have calculated.

Here are the results as they were first given:

Observed        Vaccine   No vaccine
HIV positive         51           74
HIV negative      8,146        8,124

This is what’s known as a 2×2 contingency table. There are two treatments, vaccine and no vaccine, and two outcomes, HIV positive and HIV negative. The basic test of statistical significance poses the following question:

Assume that the vaccine has no effect, that there would have been 125 HIV-positive test subjects even if all 16,395 subjects had been untreated. This “pretend there’s no effect” assumption is called the null hypothesis. If one had then arbitrarily split the test population into two groups, 8,197 in one group and 8,198 in the other, what is the probability that the distribution of HIV-positive subjects between the groups would be at least as skewed as it turned out to be?

By “at least as skewed” I mean a distribution of the HIV-positive subjects between the vaccinated and unvaccinated that’s at least as far from an equal distribution as the test results were. Since the test results were 51-74, distributions like 50-75, 49-76, 48-77, etc., all the way to 0-125, would all fall into the “at least as skewed” category. (So, by the way, would results skewed the opposite way: 74-51, 75-50, and so on. Results like these would suggest the vaccine has a negative effect, but that’s also a violation of the null hypothesis.)
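For small enough tables, that question can be answered by direct counting rather than by the approximation worked through in this post. Here is a minimal sketch in Python (not the Octave I used for my own calculations) that adds up the hypergeometric probability of every split at least as lopsided as 51-74 in either direction. The function and parameter names are made up for illustration, and because this is an exact count, its answer need not agree digit for digit with the chi-squared figure.

```python
from math import comb  # requires Python 3.8+

def prob_at_least_as_skewed(positives=125, group_a=8197, group_b=8198, low=51):
    """Chance that a random split puts <= low or >= positives - low
    of the HIV-positive subjects in group A, assuming no vaccine effect."""
    total = group_a + group_b
    high = positives - low                  # 74, the mirror-image split
    all_splits = comb(total, group_a)       # ways to choose who goes in group A
    ways = 0
    for k in range(positives + 1):
        if k <= low or k >= high:
            # k positives and (group_a - k) negatives land in group A
            ways += comb(positives, k) * comb(total - positives, group_a - k)
    return ways / all_splits                # ratio of exact big integers

print(prob_at_least_as_skewed())
```

The counting is exact integer arithmetic, so the only rounding happens in the final division.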

If the calculated probability is sufficiently low—in other words, if it’s quite unlikely that the test results were due to chance alone—we say that the results are statistically significant. It’s fairly common to use 5%, one chance in twenty, as the upper limit for statistical significance, but lower values are sometimes used. In fact, it’s often thought to be good practice to provide that probability instead of just reporting whether the results are significant or not.

Which leaves us with the problem of calculating that probability. Fortunately, the 2×2 contingency table is a very well-studied problem, and the procedure is straightforward. First, we rewrite the table of test results (often called the observed values) including the row and column sums.

Observed        Vaccine   No vaccine      Sum
HIV positive         51           74      125
HIV negative      8,146        8,124   16,270
Sum               8,197        8,198   16,395

If the null hypothesis is true, then the probability of any subject in the test group becoming HIV positive is

$$p = \frac{125}{16{,}395} = 0.007624$$

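or about three-quarters of one percent. We then make a similar table filled, not with the test results, but with the expected values: each column sum multiplied by that probability for the HIV-positive row, and by one minus that probability for the HIV-negative row.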

Expected          Vaccine   No vaccine      Sum
HIV positive       62.496       62.504      125
HIV negative    8,134.504    8,135.496   16,270
Sum                 8,197        8,198   16,395

These are the expected results under the null hypothesis. Note that the row and column sums remain the same.
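Put another way, each expected value is its row sum times its column sum, divided by the grand total. Writing $R_i$ for the sum of row $i$, $C_j$ for the sum of column $j$, and $N$ for the grand total (my notation, introduced only for this illustration), two of the cells work out as

$$E_{ij} = \frac{R_i \, C_j}{N}, \qquad E_{11} = \frac{125 \times 8{,}197}{16{,}395} \approx 62.496, \qquad E_{21} = \frac{16{,}270 \times 8{,}197}{16{,}395} \approx 8{,}134.504$$

Summing $E_{ij}$ across a row gives back $R_i$, and summing down a column gives back $C_j$, which is why the row and column sums are unchanged.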

Now that we have the expected values, we calculate the deviation of the observed values from the expected values, which is just a subtraction. We’ll leave out the row and column sums this time, because they don’t deviate.

Deviation       Vaccine   No vaccine
HIV positive    -11.496       11.496
HIV negative     11.496      -11.496

These deviations are usually called errors, even though no mistakes have been made. Statisticians spend a lot of time studying the properties of errors. For a 2×2 contingency table with a sufficiently high count in each table cell, statisticians have shown that if we square the errors,

Square error    Vaccine   No vaccine
HIV positive     132.16       132.16
HIV negative     132.16       132.16

standardize the square errors by dividing by the expected values,

Standardized    Vaccine   No vaccine
HIV positive    2.11473      2.11447
HIV negative    0.01625      0.01625

and add all the values in the table,

$$\chi^2 = 2.11473 + 2.11447 + 0.01625 + 0.01625 = 4.2617$$

we get a number known as the chi-squared statistic, so called because, under the null hypothesis, the sum of the standardized square errors is a random variable with a chi-squared distribution.
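The whole chain of steps (expected values, errors, squared errors, standardizing, summing) fits in a few lines of code. This is a minimal sketch in Python using only built-ins; the function name is my own invention for illustration, and the expected values are computed as row sum times column sum over the grand total, which gives the same numbers as the tables above.

```python
def chi_squared_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total   # count under the null hypothesis
            stat += (obs - expected) ** 2 / expected        # standardized square error
    return stat

observed = [[51, 74],        # HIV positive: vaccine, no vaccine
            [8146, 8124]]    # HIV negative: vaccine, no vaccine
print(round(chi_squared_statistic(observed), 4))   # 4.2617
```

Nothing in the function is specific to 2×2 tables, but the interpretation that follows in this post (one degree of freedom) is.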

The chi-squared distribution is actually a family of distributions, parameterized by the number of degrees of freedom. For a 2×2 contingency table, the number of degrees of freedom is 1. Why is it called the degrees of freedom and why is it 1 in this case? Consider our 2×2 contingency table with none of the observed values filled in, but with the row and column sums fixed.

Observed        Vaccine   No vaccine      Sum
HIV positive                              125
HIV negative                           16,270
Sum               8,197        8,198   16,395

If I were to give you just one of the missing values, you’d be able to fill in the rest of the table through subtraction from the row and column sums. In a sense, then, I am free to fill in only one of the values; all the others are contingent on that one. That’s why there’s only one degree of freedom in a 2×2 contingency table.
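To make the subtraction concrete, here is a small Python sketch that rebuilds the whole table from a single cell. The function name and its defaults are hypothetical, chosen just for this illustration; the defaults are the fixed row and column sums from the table above.

```python
def fill_table(vaccine_positive, row_sums=(125, 16270), col_sums=(8197, 8198)):
    """Recover all four cells of a 2x2 table from one cell plus the fixed sums."""
    a = vaccine_positive
    b = row_sums[0] - a        # HIV positive, no vaccine
    c = col_sums[0] - a        # HIV negative, vaccine
    d = row_sums[1] - c        # HIV negative, no vaccine
    return [[a, b], [c, d]]

print(fill_table(51))   # [[51, 74], [8146, 8124]]
```

Once the one free value is chosen, every other cell is forced.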

(If you’re wondering why the row and column sums are fixed, it’s because those sums are bound up in the null hypothesis, and we are doing our calculations under that hypothesis.)

We’re now just one step away from our answer. The probability, under the null hypothesis, that the distribution of HIV-positive test subjects would be at least as skewed as the observed results is equal to the probability that a chi-squared random variable with one degree of freedom would be larger than 4.2617. In the old days, when I was a student, we’d look this number up in a chi-squared table, but today we have more convenient options.

I used the chi2cdf function in Octave to get an answer of 0.03898. You could also use this Wolfram Alpha page to get the same answer.[1] So there’s a 4% chance of getting results at least as skewed as those observed in the study. This is close to the usual 5% boundary, so it’s not surprising that reclassifying a few subjects would make the results not statistically significant.
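If you have neither Octave nor Wolfram Alpha handy, the one-degree-of-freedom case can be done with nothing more than the complementary error function. The sketch below is Python, and the function name is mine. It relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable, so it works only for one degree of freedom and is not a general replacement for chi2cdf.

```python
from math import erfc, sqrt

def chi2_tail_1dof(x):
    """P(X > x) for a chi-squared variable X with one degree of freedom."""
    # With one degree of freedom, X is the square of a standard normal Z,
    # so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return erfc(sqrt(x / 2))

print(round(chi2_tail_1dof(4.2617), 4))   # 0.039
```

Tables with more rows or columns have more degrees of freedom and need a proper chi-squared routine.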

Here’s a transcript of the Octave session in which I made all the calculations. It takes a lot less time to do it than to explain it. Note that in some cases I’m using regular matrix operators (-, *), and in other cases I’m using element-by-element operators (.^, ./).

octave-3.2.3:1> obs = [51, 74; 8146, 8124]
obs =

     51     74
   8146   8124

octave-3.2.3:2> sum(obs,1)
ans =

   8197   8198

octave-3.2.3:3> sum(obs,2)
ans =

     125
   16270

octave-3.2.3:4> p = sum(obs,2)(1)/sum(sum(obs))
p =  0.0076243
octave-3.2.3:5> exp = [p; 1-p]*sum(obs,1)
exp =

     62.496     62.504
   8134.504   8135.496

octave-3.2.3:6> err = obs - exp
err =

  -11.496   11.496
   11.496  -11.496

octave-3.2.3:7> sqerr = err.^2
sqerr =

   132.16   132.16
   132.16   132.16

octave-3.2.3:8> stdsqerr = sqerr./exp
stdsqerr =

   2.114726   2.114468
   0.016247   0.016245

octave-3.2.3:9> chi2stat = sum(sum(stdsqerr))
chi2stat =  4.2617
octave-3.2.3:10> 1 - chi2cdf(chi2stat,1)
ans =  0.038981


  1. Note that I used 10,000 as the “right endpoint” on the Wolfram Alpha page. What I really wanted as the right endpoint was infinity, but I couldn’t figure out how to tell Alpha that (neither “infinity” nor “Infinity” worked). Because the chi-squared density function drops off rapidly with increasing values, 10,000 was a good proxy for infinity. Even 100 would have given four digits of accuracy in the calculated probability.
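A quick way to check that claim, sketched in Python with a helper name of my own choosing: for one degree of freedom, the probability beyond a cutoff x is erfc(sqrt(x/2)), so the amount lost by stopping at a finite right endpoint can be computed directly.

```python
from math import erfc, sqrt

def tail(x):
    """P(X > x) for chi-squared with one degree of freedom: erfc(sqrt(x / 2))."""
    return erfc(sqrt(x / 2))

print(tail(100))      # probability ignored by stopping at 100
print(tail(10000))    # probability ignored by stopping at 10,000
```

Both values are far smaller than the fourth decimal place of the calculated probability.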