Hypergeometric Trek

Lately, I’ve been doing probability calculations using the hypergeometric distribution at work, so when I saw these tweets from Jason Snell,

Some big news from the @Random_Trek universe. The random number generator appears to have finally chosen one of the movies.
Jason Snell (@jsnell) Dec 30 2015 12:26 PM


By my calculations @Random_Trek should do a Star Trek movie every 60 episodes. It’s been 80, but at least we got a good one.
Jason Snell (@jsnell) Jan 3 2016 7:28 PM

I knew I could whip out a quick blog post that would make me look smarter than I am.

First, some background. Random Trek is a podcast in which host Scott McNulty uses a random number generator to pick a Star Trek TV episode or movie, which he and a guest then watch and discuss. According to the show’s FAQ, the TV episodes he draws from are from the original series (i.e. real Star Trek), The Next Generation, Deep Space Nine, Voyager, and Enterprise. I believe the movies he draws from are the ten that came before the J.J. Abrams reboot.

As for the hypergeometric distribution, it’s what arises when we want to calculate probabilities for a sample taken (without replacement) from a finite population, where each item in the sample can be described as occupying one of two states, e.g. on/off, pass/fail, or movie/TV episode. It differs from the binomial distribution in that the probability associated with each item (or trial) in the sample changes as the sampling proceeds. In a situation governed by the binomial distribution, the probability associated with each trial is constant.

Here’s an example. Suppose we have a jar with ten marbles in it, five white and five red. The jar is shaken to mix the marbles up, and you close your eyes and reach in to pull out three marbles. What’s the probability that all three are white?

Well, the probability that the first marble is white is easy—that’s \(\frac{5}{10}\) or \(\frac{1}{2}\). But the probability that the second marble is white (given that the first was white) is a little different. We now have nine marbles left, of which four are white, so its probability is \(\frac{4}{9}\). Similarly, the probability that the third marble is also white is \(\frac{3}{8}\). Putting this all together, the probability of drawing three white marbles is

\[\frac{5}{10} \times \frac{4}{9}\times \frac{3}{8} = \frac{60}{720} = \frac{1}{12}\]
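That arithmetic is easy to check in a couple of lines of Python, using the standard library’s fractions module to keep the answer exact:

```python
from fractions import Fraction

# Multiply the three conditional probabilities from the marble example.
p = Fraction(5, 10) * Fraction(4, 9) * Fraction(3, 8)
print(p)  # 1/12
```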

Compare this to the similar-seeming problem of flipping a coin three times and having it come up heads each time. Here, the probability of a head in each trial is \(\frac{1}{2}\)—the same as the initial probability in the marble example—but because the probability doesn’t change as the trials continue, the probability of three heads in three tosses is

\[\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}\]

This is a higher probability than in the marble example, which it should be. In the marble example, each successive marble is less likely to be white than its predecessor, driving the overall probability down.

In general, if we have a population of \(N\) items, of which \(M\) possess some characteristic, and we draw a sample of \(n\) items from that population, the probability that \(m\) items in the sample will have that characteristic is

\[\frac{ \binom{M}{m} \binom{N-M}{n-m} }{ \binom{N}{n} }\]

where those things in parentheses are binomial coefficients,1 defined as

\[\binom{a}{b} = \frac{a!}{b! (a-b)!}\]

where the exclamation points represent factorials,

\[a! = 1 \times 2 \times 3 \times \ldots \times a\]
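If you’d like to see the formula in action before reaching for scipy, here’s a short sketch (my own helper function, not anything from a library) built on math.comb, which computes binomial coefficients directly:

```python
from math import comb

def hyper_pmf(m, N, M, n):
    """Probability that m of the n sampled items have the
    characteristic, given M marked items in a population of N."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

# Marble example: 3 white marbles in a sample of 3,
# drawn from 10 marbles of which 5 are white.
print(hyper_pmf(3, 10, 5, 3))  # 0.08333...
```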

Frankly, it’s much easier to just use the hypergeom distribution from Python’s scipy.stats library. Its pmf (probability mass function) method does the trick. For the marble example, we’d do this:

from scipy.stats import hypergeom
print(hypergeom.pmf(3, 10, 5, 3))

and get the answer in decimal form: 0.08333.

OK, time to do the Random Trek problem. I got the TV episode counts from Wikipedia and followed Scott McNulty’s rule that two-part stories count as one. Here’s the code:

#!/usr/bin/env python

from scipy.stats import hypergeom as h

# Episode counts for each series. Subtract second of two-parters.
tos = 29 + 26 + 24 - 1
tng = 22 + 26*6 - 10
ds9 = 20 + 26*6 - 5
voy = 16 + 26*6 - 12
ent = 26*2 + 24 + 22 - 4

# All episodes and movies.
episodes = tos + tng + ds9 + voy + ent
movies = 10
population = episodes + movies

# Probability that podcasts 1-80 are all TV episodes.
print(h.pmf(80, population, episodes, 80))

And the answer is: 0.284 or 28.4%. So it isn’t all that unlikely that there’d be 80 TV episodes before a movie came up.

But wait, there’s more! When you’re taking a relatively small number of samples from a relatively large population, you can use the binomial distribution (which is often easier to calculate) to approximate the hypergeometric distribution. When I first read Jason’s tweets, I opened up PCalc and ran this calculation:

\[\left( \frac{59}{60} \right)^{80} = 0.261\]

which is pretty close. This is 80 consecutive trials, each of which has a probability of \(\frac{59}{60}\). Because \(\frac{1}{60}\) was Jason’s estimate of the proportion of movies, \(\frac{59}{60}\) would be the proportion of TV episodes.

As it turns out™, there are 10 movies and 671 TV episodes, so the proportion of movies is more like \(\frac{1}{68}\). Had I used that estimate, the binomial approximation would have been

\[\left( \frac{67}{68} \right)^{80} = 0.306\]

which is only slightly closer to the exact answer, but it’s larger than the exact answer, as it should be: each TV episode drawn without replacement makes the remaining pool slightly richer in movies, so a run of 80 straight episodes is less likely than independent trials would suggest. Knowing how to bound a result is often just as useful as knowing how to get the exact value.
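To put the two numbers side by side, here’s a quick check (a sketch using only the standard library, with the 671-episode/10-movie split given above) comparing the exact hypergeometric probability with the binomial approximation at the true episode proportion:

```python
from math import comb

episodes, movies = 671, 10
population = episodes + movies  # 681

# Exact: all 80 podcast draws are TV episodes (no replacement).
exact = comb(episodes, 80) * comb(movies, 0) / comb(population, 80)

# Binomial approximation: 80 independent trials, each with p = 671/681.
approx = (episodes / population) ** 80

print(round(exact, 3), round(approx, 3))  # 0.284 0.306
```

As expected, the binomial value overshoots the exact one.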

All of these results, whether approximate or not, are good enough to answer the fundamental question: Is it weird that there hasn’t been a movie episode of Random Trek yet? The answer is that it’s a little weird, but not much.

  1. Yes, binomial coefficients get used even though this isn’t the binomial distribution. Deal with it.