Big numbers

The other day I needed to shuffle a list of about 200 items as part of a random sampling program. The obvious solution in Python is

python:
random.shuffle(theList)

which shuffles the list in place. But there’s a caveat in the random module’s documentation that gave me pause:

random.shuffle(x[, random])
Shuffle the sequence x in place. The optional argument random is a 0-argument function returning a random float in [0.0, 1.0); by default, this is the function random().

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

How small is “even rather small”? Is 200 “a long sequence”? The answer was actually pretty easy to figure out.

First, how many permutations of a 200-item list are there? I fired up IPython and got the number.

In [1]: import math

In [2]: math.factorial(200)
Out[2]: 7886578673647905035523632139321850622951359776871732
632947425332443594499634033429203042840119846239041772121389
196388302576427902426371050619266249528299311134628572707633
172373969889439224456214516642402540332918641312274282948532
775242424075739032403212574055795686602260319041703240623517
008587961789222227896237038973747200000000000000000000000000
00000000000000000000000L

Python nicely switches to arbitrary precision arithmetic to give the answer. That’s the kind of number that begs to be put in scientific notation, but unfortunately it’s beyond the capacity of float. So how do I get the exponent? The same way I learned back in junior high: count the number of places the decimal point has to move to the left. I converted it to a string and counted how many digits there were to the right of the first one.

In [3]: len(str(math.factorial(200))[1:])
Out[3]: 374

So the answer is 7.89×103747.89 \times 10^{374}.

How does this compare to the period of the random number generator Python uses? The documentation says

Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1.

OK, it should be pretty obvious that 2199372^{19937} is a lot bigger than 1037410^{374}, but I wanted to see the period in scientific notation, too. Applying the same technique, I got

In [4]: len(str(2**19937 - 1)[1:])
Out[4]: 6001

In [5]: str(2**19937 - 1)[:5]
Out[5]: '43154'

In scientific notation, then, the period of the Mersenne Twister is 4.32×1060014.32 \times 10^{6001}, which is about 5600 orders of magnitude bigger than the number of permutations.

Another way to check would be to take the base-2 logarithm of 200!200!:

In [6]: math.log(factorial(200), 2)
Out[6]: 1245.3805070592084

This means that 200!21245200! \approx 2^{1245}, which can be compared directly to 2199372^{19937}. Either way, it’s clearly safe to use random.shuffle() on a 200-item list.

How big can a list be before the number of permutations is greater than the period of the Mersenne Twister? I could have dug into SciPy’s root-finding functions to solve

log2(x!)19937=0\log_2(x!) - 19937 = 0

but it seemed easier to just use trial and error. I soon found the answer to be 2080:

In [13]: math.log(factorial(2080),2)
Out[13]: 19932.556355821675

In [14]: math.log(factorial(2081),2)
Out[14]: 19943.579417071407

That doesn’t mean it’s safe to use random.shuffle() on a 2080-item list. I’m not combinatorially sophisticated enough to know how much “cushion” is necessary between the number of permutations and the period of the random number generator, but I do feel safe now using random.shuffle() on lists of several hundred items.