Hypergeometric Obama

On Friday, Barack Obama will speak at the University of Illinois as part of a ceremony at which he will receive the Paul Douglas Award for Ethics in Government. The speech will be at the Auditorium at the south end of the Quad, which doesn’t seat nearly as many people as wanted to see him.

According to this report, 22,611 students signed up for a ticket lottery for just 1,300 openings. My two sons were among the horde, and they learned yesterday, along with 21,309 of their friends, that they didn’t win.

How likely was it that neither would get a seat? The chance of any individual student being picked was

1,30022,611=0.0575

which means that the chance of any individual student not being picked was 10.0575=0.9425. So the probability that neither would be picked was

(0.9425)(0.9425)=0.8883

or just under 90%. So the fact that neither of them got in wasn’t a surprise. We can also calculate the probability that both were lucky,

(0.0575)(0.0575)=0.00331

which is extremely low, and the probability that one got in and one didn’t,

2(0.0575)(0.9425)=0.1084

which isn’t too bad, but unfortunately it didn’t work out.

These calculations are simple and give us answers that are accurate enough for our purposes, given the large numbers (1,300 and 22,611) involved. But we’ve made some approximations that won’t be reasonable if the numbers are small.

For example, let’s assume the same general problem, but this time we’ll assume there are 10 applicants—two of whom are my sons—for 2 tickets. What are the probabilities of zero, one, and two of my sons getting tickets?

If we follow the procedure above, we get

(0.20)(0.20)=0.040

for both sons getting a ticket,

2(0.20)(0.80)=0.32

for one son getting a ticket and one not, and

(0.80)(0.80)=0.64

for neither getting a ticket. These calculations are internally consistent, in that they add up to unity, but they’re wrong.

The problem is that these calculations are based on an assumption that one son winning a ticket is independent of the other son winning, but they are not independent.

The equation for calculating the probability of an intersection of two events, call them A and B, is

P(AB)=P(BA)P(A)=P(AB)P(B)

where P(BA) is the probability of event B given that event A occurs, and similarly, P(AB) is the probability of event A given that event B occurs. If the events are independent, then the conditions don’t matter,

P(BA)=P(B)andP(AB)=P(A)

and

P(AB)=P(A)P(B)

But that’s not the situation here. If Son A gets a ticket, then Son B’s chance of getting a ticket isn’t 2/10=0.20, it’s 1/9=0.1111 because with Son A having a ticket, there’s only one ticket left for the nine remaining applicants. Which means the probability that both sons get a ticket is

19210=145=0.02222

which just over half of our earlier, mistaken, calculation of 0.040.

Similarly, the probability that one son gets a ticket and the other doesn’t is

89210+29810=1645=0.3556

and the probability that neither get a ticket is

79810=2845=0.6222

(And now we see why it was OK to play a little loose with the numbers back in our original calculations with 22,611 applicants for 1,300 tickets. There just isn’t enough difference between

1,30022,6111,30022,611

and

1,30022,6111,29922,610

to make it worth even the minimal effort.)

All of the denomimators in the previous results were 45. It will probably not surprise you that 45 is the number of combinations of ten things taken two at a time. It’s also the binomial coefficient, and is usually written like a fraction but without the horizontal dividing line and surrounded by parentheses. It’s calculated though factorials:

(nk)=n!k!(nk)!

In our particular case of n=10 and k=2,

(102)=10!2!8!=(10)(9)2=45

Imagine ten lottery balls with the numbers 0 through 9 printed on them. Mix them up and draw two. There are 45 different results you can get if you don’t care about the order of the two balls. Here they are:

0-1  0-2  0-3  0-4  0-5  0-6  0-7  0-8  0-9
     1-2  1-3  1-4  1-5  1-6  1-7  1-8  1-9
          2-3  2-4  2-5  2-6  2-7  2-8  2-9
               3-4  3-5  3-6  3-7  3-8  3-9
                    4-5  4-6  4-7  4-8  4-9
                         5-6  5-7  5-8  5-9
                              6-7  6-8  6-9
                                   7-8  7-9
                                        8-9

If two of the balls—say 3 and 7, just as an example—represent winning a ticket, scanning the list shows that there’s one combination with both 3 and 7; 16 combinations with a 3 or a 7 but not both; and 28 combinations with neither a 3 nor a 7. These are the numerators in the answers above.

There’s another way to visualize this that might make more sense to you. Image a line of ten people. Two of them get handed tickets (X), the other eight don’t (O). Here are the 45 ways the two tickets can be distributed:

XXOOOOOOOO  XOXOOOOOOO  XOOXOOOOOO  XOOOXOOOOO  XOOOOXOOOO  XOOOOOXOOO  XOOOOOOXOO  XOOOOOOOXO  XOOOOOOOOX
            OXXOOOOOOO  OXOXOOOOOO  OXOOXOOOOO  OXOOOXOOOO  OXOOOOXOOO  OXOOOOOXOO  OXOOOOOOXO  OXOOOOOOOX
                        OOXXOOOOOO  OOXOXOOOOO  OOXOOXOOOO  OOXOOOXOOO  OOXOOOOXOO  OOXOOOOOXO  OOXOOOOOOX
                                    OOOXXOOOOO  OOOXOXOOOO  OOOXOOXOOO  OOOXOOOXOO  OOOXOOOOXO  OOOXOOOOOX
                                                OOOOXXOOOO  OOOOXOXOOO  OOOOXOOXOO  OOOOXOOOXO  OOOOXOOOOX
                                                            OOOOOXXOOO  OOOOOXOXOO  OOOOOXOOXO  OOOOOXOOOX
                                                                        OOOOOOXXOO  OOOOOOXOXO  OOOOOOXOOX
                                                                                    OOOOOOOXXO  OOOOOOOXOX
                                                                                                OOOOOOOOXX

Imagine now that my two sons are at the left end of the line (they could be anywhere—like positions 3 and 7—the results won’t change). There’s one arrangement where they get both tickets (the upper left corner), 16 arrangements where one of them gets a ticket and the other doesn’t (the top two rows except for the upper left corner), and 28 arrangements where neither of them get a ticket (everywhere else).

There is, as you might imagine, a generalization of this problem to account for any number of applications, tickets, and sons. It’s called the hypergeometric distribution. Using the nomenclature in the linked Wikipedia article, we’ll assume a population of N things (applications), of which K are successes (tickets). If we drawn n samples (sons) from that population (without replacement), then the probability that k of the samples will be successes is

(Kk)(NKnk)(Nn)

The denominator should look familiar, it’s the number of combinations of N things taken n at a time.

The first term in the numerator is the number of ways the successes in the sample (k) can be taken from the successes in the population (K). The second term in the numerator is the number of ways the failures in the sample (nk) can be taken from the failures in the population (NK). This product is a formalization of the counting we did above.

Let’s use this formula to repeat the example with ten applications for tickets (N=10), two tickets (k=2), and two sons (n=2). The probability that both sons will get a ticket is

(22)(80)(102)=(1)(1)45=0.0222

The probability that exactly one son will get a ticket is

(21)(81)(102)(2)(8)45=0.3556

And the probability that neither one will get a ticket is

(20)(82)(102)=(1)(28)45=0.6222

Just like before, but with less counting.

The SciPy library for Python has a sublibrary called stats with a set of functions for handling the hypergeometric distribution. Here’s how to do this last calculation in Python:

python:
>>> from scipy.stats import hypergeom
>>> hypergeom.pmf(0, 10, 2, 2)
0.62222222222222201

The pmf function stands for “probability mass function,” and it represents the formula defined above. The first argument is the number of successes in the sample, the second is the size of the population, the third is the number of successes in the population, and the fourth is the sample size.

I think this ordering of the arguments was an incredibly stupid choice by the designers of the library, as it puts the population numbers in between the sample numbers and the order of the population numbers is the reverse of the order of the sample numbers. It’s hard to imagine a less intuitive way to define the argument order. And don’t get me started on the symbols they use in the documentation. But at least the function works.

We can now go back to our original problem of 1,300 tickets, 22,611 applications, and 2 sons and do it right.

python:
>>> hypergeom.pmf(2, 22611, 1300, 2)
0.0033031794730568834
>>> hypergeom.pmf(1, 22611, 1300, 2)
0.10838192109526341
>>> hypergeom.pmf(0, 22611, 1300, 2)
0.88831489943503728

As expected, no practical difference between these answers and the approximations we started out with. But it’s the journey, not the destination, right?