Sampling

Recently, I’ve had to inspect, measure, or in some way test sets of devices taken randomly from some larger population. I’ve been using Python and its random library to make the choices for me.[1] The library makes these scripts very easy to write.

Say I have a hundred devices, each identified by a unique serial number, and I want to run a test on ten of them. This is how I choose the ten:

python:
 1:  #!/usr/bin/python
 2:  
 3:  from random import sample, seed
 4:  
 5:  # Setting the seed like this will give the same set of samples every
 6:  # time the script is run. Omitting this line will give a different
 7:  # set every time the script is run.
 8:  seed(1)
 9:  
10:  # This is where I'd enter the real serial numbers. For illustration,
11:  # I'm just using 300-399 with some extra leading zeros.
12:  devices = map(lambda x: "%05d" % x, range(300, 400))
13:  
14:  testUnits = sample(devices, 10)
15:  
16:  print '\n'.join(testUnits)

The most difficult part is entering the list of serial numbers. I’m usually given the list in a spreadsheet, so I copy it out of there, paste it into my script, and use a regex find/replace to turn it from a column of numbers into a comma-separated Python list of strings. I don’t want to waste your time with stuff like that here, so I just set the list of serial numbers to 00300 to 00399 in Line 12.
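One way to skip the find/replace step is to leave the pasted column as it is inside a triple-quoted string and split it when the script runs. This is only a minimal sketch, assuming Python 3; the name `pasted` and the five serial numbers are made up for illustration.

```python
# Hypothetical serial numbers standing in for a column pasted from a spreadsheet.
pasted = """
00300
00301
00302
00303
00304
"""

# split() with no arguments breaks on any run of whitespace, newlines included,
# and ignores the blank space at the start and end of the string.
devices = pasted.split()

print(devices)
```

Because the serial numbers never stop being strings, the leading zeros survive intact.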

I don’t usually set the seed (Line 8), but in more complicated scripts that rely on random number generation or sampling, setting the seed can be very helpful for debugging because the script then generates the same set every time it’s run. You can keep the seed line in the script until you know it’s running correctly, then comment it out (or change its argument) to generate a new set of values. The argument to seed can be any hashable Python object: numbers are probably the most commonly used seeds, but you can also use strings:

python:
 8:  seed('corn')
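The repeatability is easy to check within a single run. The following is a small sketch, assuming Python 3 (so the device list is built with a list comprehension rather than map); the names `first` and `second` are just for illustration.

```python
from random import sample, seed

devices = ["%05d" % x for x in range(300, 400)]

seed('corn')
first = sample(devices, 10)

# Reseeding with the same value puts the generator back in the same state,
# so the second draw repeats the first.
seed('corn')
second = sample(devices, 10)

print(first == second)  # prints True
```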

The key line is Line 14, which uses the aptly named sample function to draw a random subset of items from the given list. The output is

00313
00384
00376
00325
00349
00344
00365
00378
00309
00302
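The script above is written for Python 2. For anyone running Python 3, here is a sketch of what an equivalent might look like. Two things have to change: print is a function, and map returns an iterator that sample won't accept, so the list is built with a comprehension instead. The serial numbers drawn with seed(1) may not match the ones listed above, because the random module's internals differ between the two versions.

```python
#!/usr/bin/env python3

from random import sample, seed

# Same seed as the original script; remove this line for a fresh draw each run.
seed(1)

# A list comprehension produces a real list, which is what sample() needs.
devices = ["%05d" % x for x in range(300, 400)]

testUnits = sample(devices, 10)

print('\n'.join(testUnits))
```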

As a practical matter, I sometimes generate a few more samples than I plan to test. I do this only if I suspect that some of the devices I’ve been given can’t be tested for one reason or another. If a device in the list turns out to be untestable, it’s nice to have an extra serial number or two to use as a random replacement.
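Drawing the extras in the same call keeps the replacements from duplicating units already chosen. A sketch of one way to do it, assuming Python 3; the count of two spares and the names `planned`, `spares`, `drawn`, and `replacements` are hypothetical.

```python
from random import sample

devices = ["%05d" % x for x in range(300, 400)]

planned = 10   # units to test
spares = 2     # hypothetical number of extras

drawn = sample(devices, planned + spares)

# sample() returns items in selection order, so any slice of the result
# is itself a valid random sample.
testUnits = drawn[:planned]
replacements = drawn[planned:]

print('Test:')
print('\n'.join(testUnits))
print('Spares:')
print('\n'.join(replacements))
```

If a unit in the first group turns out to be untestable, the next serial number from the spares takes its place.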


  1. Yes, this is related to the confidence limits calculations I did in the SciPy v. Octave post of a few days ago. But random is part of the Python Standard Library, so there’s no need for SciPy.