Big numbers

The other day I needed to shuffle a list of about 200 items as part of a random sampling program. The obvious solution in Python is


which shuffles the list in place. But there’s a caveat in the random module’s documentation that gave me pause:

random.shuffle(x[, random])
Shuffle the sequence x in place. The optional argument random is a 0-argument function returning a random float in [0.0, 1.0); by default, this is the function random().

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

How small is “even rather small”? Is 200 “a long sequence”? The answer was actually pretty easy to figure out.

First, how many permutations of a 200-item list are there? I fired up IPython and got the number.

In [1]: import math

In [2]: math.factorial(200)
Out[2]: 7886578673647905035523632139321850622951359776871732

Python nicely switches to arbitrary precision arithmetic to give the answer. That’s the kind of number that begs to be put in scientific notation, but unfortunately it’s beyond the capacity of float. So how do I get the exponent? The same way I learned back in junior high: count the number of places the decimal point has to move to the left. I converted it to a string and counted how many digits there were to the right of the first one.

In [3]: len(str(math.factorial(200))[1:])
Out[3]: 374

So the answer is [7.89 \times 10^{374}].

How does this compare to the period of the random number generator Python uses? The documentation says

Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1.

OK, it should be pretty obvious that [2^{19937}] is a lot bigger than [10^{374}], but I wanted to see the period in scientific notation, too. Applying the same technique, I got

In [4]: len(str(2**19937 - 1)[1:])
Out[4]: 6001

In [5]: str(2**19937 - 1)[:5]
Out[5]: '43154'

In scientific notation, then, the period of the Mersenne Twister is [4.32 \times 10^{6001}], which is about 5600 orders of magnitude bigger than the number of permutations.

Another way to check would be to take the base-2 logarithm of [200!]:

In [6]: math.log(factorial(200), 2)
Out[6]: 1245.3805070592084

This means that [200! \approx 2^{1245}], which can be compared directly to [2^{19937}]. Either way, it’s clearly safe to use random.shuffle() on a 200-item list.

How big can a list be before the number of permutations is greater than the period of the Mersenne Twister? I could have dug into SciPy’s root-finding functions to solve

[\log_2(x!) - 19937 = 0]

but it seemed easier to just use trial and error. I soon found the answer to be 2080:

In [13]: math.log(factorial(2080),2)
Out[13]: 19932.556355821675

In [14]: math.log(factorial(2081),2)
Out[14]: 19943.579417071407

That doesn’t mean it’s safe to use random.shuffle() on a 2080-item list. I’m not combinatorially sophisticated enough to know how much “cushion” is necessary between the number of permutations and the period of the random number generator, but I do feel safe now using random.shuffle() on lists of several hundred items.