Fake survey entries, with and without typos
January 11, 2021 at 9:11 AM by Dr. Drang
Here’s a quick one on how I made the synthetic data files used in yesterday’s post.
First, I got the canonical iMac color names from this post by Stephen Hackett. I figured Stephen was about as authoritative a source as any on the subject. I copied all the text of the post, pasted it into a BBEdit document, and used some of the techniques discussed yesterday (how meta!) to boil it down to just the list of thirteen colors.
The list became part of this script:
python:
1: #!/usr/bin/env python
2:
3: import random
4:
5: # Initialize
6: numEntries = 100
7: numTypos = 15
8: colors = ['Bondi Blue', 'Blueberry', 'Lime', 'Tangerine',
9:           'Strawberry', 'Grape', 'Graphite', 'Sage', 'Ruby',
10:           'Indigo', 'Snow', 'Blue Dalmation', 'Flower Power']
11: wgts = [3, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, .5, .5]
12:
13: # Make list of random iMac colors
14: canEntries = random.choices(colors, k=numEntries, weights=wgts)
15:
16: # Write list to file, one per line
17: ctext = '\n'.join(canEntries)
18: with open('cancolors.txt', 'w') as f:
19:     f.write(ctext)
20:
21: # Misspell a few of the entries
22: badEntries = canEntries[:]
23: alphabet = [ chr(97+i) for i in range(26) ]
24: typos = [ random.randrange(numEntries) for i in range(numTypos) ]
25: for e in typos:
26:     color = badEntries[e]
27:     letter = random.randrange(1, len(color))
28:     color = color[:letter] + random.choice(alphabet) + color[letter+1:]
29:     badEntries[e] = color
30:
31: # Write misspelled list to file, one per line
32: ctext = '\n'.join(badEntries)
33: with open('colors.txt', 'w') as f:
34:     f.write(ctext)
The first section of the script initializes the variables used later. In addition to the list of colors, there’s the number of entries we’ll generate, the number of errors we’ll introduce, and a set of weights we’ll use for the random selection of colors.
The wgts variable created in Line 11 is a list of relative likelihoods for each color. It’s parallel to the colors list, so you can see that Bondi Blue is three times as likely to be chosen as Blueberry, Tangerine is twice as likely, Flower Power is half as likely, and so on. The wgts variable is like a probability mass function for the colors, except that it isn’t normalized to a sum of one. There wasn’t really a need to weight the colors, but I wanted the output to look realistic, and certainly some iMac colors were more popular than others. The weights weren’t based on any data, just my arbitrary choices with a little boost given to Bondi Blue because it was the original.
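To make the comparison with a probability mass function concrete, here is a minimal sketch, separate from the script above, that normalizes the weights. The names total and probs are introduced only for this illustration; random.choices does the equivalent scaling internally, which is why the script never needs it.

```python
colors = ['Bondi Blue', 'Blueberry', 'Lime', 'Tangerine',
          'Strawberry', 'Grape', 'Graphite', 'Sage', 'Ruby',
          'Indigo', 'Snow', 'Blue Dalmation', 'Flower Power']
wgts = [3, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, .5, .5]

# Divide each relative weight by the sum of all weights (they add up to 19)
# so the results sum to one, as a proper probability mass function would.
total = sum(wgts)
probs = {c: w / total for c, w in zip(colors, wgts)}

for color, p in probs.items():
    print(f'{color:15s} {p:.3f}')
```

With a total of 19, Bondi Blue’s chance on any single draw is 3/19, or a bit under 16%.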
The next section generates a list of colors through a random selection process. Line 14 uses the choices function from Python’s random module to generate a list of numEntries colors. The list is then joined together with linefeed characters and written out to the cancolors.txt file.
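If you want the same “random” list every time you run it, one option is a seeded generator. This is a sketch under assumptions, not something the script above does: the seed value, the shortened color list, and the names rng, picks, and tally are all made up for the example. It also tallies the picks so you can check that the weights are having the expected effect.

```python
import random
from collections import Counter

# Shortened, hypothetical list just to keep the tally readable
colors = ['Bondi Blue', 'Blueberry', 'Flower Power']
wgts = [3, 1, .5]

# A private generator with an arbitrary seed gives repeatable output
rng = random.Random(2021)
picks = rng.choices(colors, weights=wgts, k=1000)

# Count how often each color was picked
tally = Counter(picks)
for color, n in tally.most_common():
    print(f'{color:12s} {n}')
```

Over a thousand draws, the counts should land roughly in the 3 : 1 : 0.5 proportions of the weights, though any single run will wobble around them.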
The next section makes a new list of entries with typos. Line 22 copies the list of properly spelled entries into a new list that we’ll add typos to. Line 23 creates a list of lower case letters from which we’ll choose at random to make the typos, and Line 24 generates a random list of numTypos integers that represent the entries we’ll be messing up.
The loop in Lines 25–29 replaces random letters at random locations. The index of the letter to be replaced is chosen in Line 27 using the randrange function. Note that the random index chosen starts at 1 rather than 0. I did this because I thought it was more realistic for typos to come after the first character. Line 28 then replaces the character at that index with a randomly chosen letter. Finally, the misspelled word is put back in the list of entries with typos. When the loop is done, the list is joined together with linefeeds and written out to the colors.txt file.
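Two quirks of this approach are worth knowing about. Because Line 24 draws each index independently, the same entry can be picked more than once, and the replacement letter can happen to match the one it replaces, so the number of visibly misspelled entries may come out below numTypos. The sketch below is a variation, not the script’s method: it swaps in random.sample to pick distinct entries, and the add_typo helper, the short entries list, and the seed are hypothetical names and values chosen for the example.

```python
import random
import string

def add_typo(word, rng=random):
    """Replace one character after the first with a random lowercase letter."""
    i = rng.randrange(1, len(word))
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i+1:]

entries = ['Grape', 'Snow', 'Indigo', 'Ruby', 'Sage', 'Lime']
rng = random.Random(11)   # arbitrary seed, only for repeatability

# sample() never repeats an index, so three different entries are touched
bad = entries[:]
for e in rng.sample(range(len(entries)), k=3):
    bad[e] = add_typo(bad[e], rng)

for good, maybe_bad in zip(entries, bad):
    print(f'{good:8s} {maybe_bad}')
```

This still doesn’t guarantee three visible typos, since a letter can be replaced by itself; ruling that out would take a retry loop or a filtered alphabet.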
I should note here that my Macs have a recent version of Python 3 (installed via Anaconda) and that my PATH environment variable is such that the Anaconda Python is the default. If you want to play with this script, you may need to change the shebang line to
#!/usr/bin/python3
or do something else to make sure you’re running it through Python 3 instead of Python 2. The random module in Python 2 doesn’t have a choices function (nor do versions of Python 3 before 3.6).
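If you’re not sure which Python a script will end up running under, a guard near the top can fail with a clear message instead of an AttributeError. This is only a sketch of one way to do it; the has_choices helper is a hypothetical name, not part of the script above.

```python
import sys

def has_choices(version=sys.version_info):
    """Return True if this Python version is new enough for random.choices."""
    return tuple(version[:2]) >= (3, 6)

# Stop early with a readable message on older interpreters
if not has_choices():
    sys.exit('This script needs Python 3.6 or later for random.choices')
```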
As is usually the case, this little script took longer to explain than it did to write.