January 11, 2021 at 9:11 AM by Dr. Drang
Here’s a quick one on how I made the synthetic data files used in yesterday’s post.
First, I got the canonical iMac color names from this post by Stephen Hackett. I figured Stephen was about as authoritative a source as any on the subject. I copied all the text of the post, pasted it into a BBEdit document, and used some of the techniques discussed yesterday (how meta!) to boil it down to just the list of thirteen colors.
The list became part of this script:
python:
 1:  #!/usr/bin/env python
 2:  
 3:  import random
 4:  
 5:  # Initialize
 6:  numEntries = 100
 7:  numTypos = 15
 8:  colors = ['Bondi Blue', 'Blueberry', 'Lime', 'Tangerine',
 9:            'Strawberry', 'Grape', 'Graphite', 'Sage', 'Ruby',
10:            'Indigo', 'Snow', 'Blue Dalmation', 'Flower Power']
11:  wgts = [3, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, .5, .5]
12:  
13:  # Make list of random iMac colors
14:  canEntries = random.choices(colors, k=numEntries, weights=wgts)
15:  
16:  # Write list to file, one per line
17:  ctext = '\n'.join(canEntries)
18:  with open('cancolors.txt', 'w') as f:
19:    f.write(ctext)
20:  
21:  # Misspell a few of the entries
22:  badEntries = canEntries[:]
23:  alphabet = [ chr(97+i) for i in range(26) ]
24:  typos = [ random.randrange(numEntries) for i in range(numTypos) ]
25:  for e in typos:
26:    color = badEntries[e]
27:    letter = random.randrange(1, len(color))
28:    color = color[:letter] + random.choice(alphabet) + color[letter+1:]
29:    badEntries[e] = color
30:  
31:  # Write misspelled list to file, one per line
32:  ctext = '\n'.join(badEntries)
33:  with open('colors.txt', 'w') as f:
34:    f.write(ctext)
The first section of the script initializes the variables used later. In addition to the list of colors, there’s the number of entries we’ll generate, the number of errors we’ll introduce, and a set of weights we’ll use for the random selection of colors.
The wgts variable created in Line 11 is a list of relative likelihoods for each color. It’s parallel to the colors list, so you can see that Bondi Blue is three times as likely to be chosen as Blueberry, Tangerine is twice as likely, Flower Power is half as likely, and so on. The wgts variable is like a probability mass function for the colors, except that it isn’t normalized to a sum of one. There wasn’t really a need to weight the colors, but I wanted the output to look realistic, and certainly some iMac colors were more popular than others. The weights weren’t based on any data, just my arbitrary choices with a little boost given to Bondi Blue because it was the original.
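To make the probability mass function comparison concrete, here is a minimal sketch that normalizes the weights by their sum. It is not part of the script above; total and probs are names introduced only for this illustration, and random.choices does not need this step because it accepts relative weights directly.

```python
# Relative weights from the script, parallel to the colors list
colors = ['Bondi Blue', 'Blueberry', 'Lime', 'Tangerine',
          'Strawberry', 'Grape', 'Graphite', 'Sage', 'Ruby',
          'Indigo', 'Snow', 'Blue Dalmation', 'Flower Power']
wgts = [3, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, .5, .5]

# Dividing each weight by the total gives probabilities that sum to one
# (total and probs are hypothetical names, not from the script)
total = sum(wgts)
probs = {c: w/total for c, w in zip(colors, wgts)}

for c, p in probs.items():
    print(f'{c:15s} {p:.3f}')
```

The weights add up to 19, so Bondi Blue has a probability of 3/19, or about 16%, on any given draw.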
The next section generates a list of colors through a random selection process. Line 14 uses the choices function from Python’s random module to generate a list of numEntries colors. The list is then joined together with linefeed characters and written out to the cancolors.txt file.
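A quick way to check that the weighted selection behaves as expected is to tally the draws. The following is only a sketch: the seed value is arbitrary (it just makes reruns repeatable), and sample and tally are names introduced for this illustration.

```python
import random
from collections import Counter

colors = ['Bondi Blue', 'Blueberry', 'Lime', 'Tangerine',
          'Strawberry', 'Grape', 'Graphite', 'Sage', 'Ruby',
          'Indigo', 'Snow', 'Blue Dalmation', 'Flower Power']
wgts = [3, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, .5, .5]

random.seed(2021)   # arbitrary seed, only so reruns give the same draws
sample = random.choices(colors, weights=wgts, k=100)

# Count how often each color was drawn and list the most common first
tally = Counter(sample)
for color, count in tally.most_common():
    print(f'{count:3d}  {color}')
```

With only 100 draws the counts will wander around the expected proportions, and a low-weight color may not appear at all in a given run.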
The next section makes a new list of entries with typos. Line 22 copies the list of properly spelled entries into a new list that we’ll add typos to. Line 23 creates a list of lowercase letters from which we’ll choose at random to make the typos, and Line 24 generates a random list of numTypos integers that represent the entries we’ll be messing up.
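One property of Line 24 worth knowing: each index is drawn independently, so the same entry can be picked more than once and the file may end up with fewer than numTypos altered entries. The sketch below contrasts that with random.sample, which draws without replacement. The sample version is shown purely as an alternative and is not what the script does; withRepeats and distinct are hypothetical names.

```python
import random

numEntries = 100
numTypos = 15

# As in the script: independent draws, so an index can appear twice
withRepeats = [random.randrange(numEntries) for i in range(numTypos)]

# Alternative (an assumption, not the script's approach):
# sampling without replacement guarantees distinct indices
distinct = random.sample(range(numEntries), numTypos)

print(len(set(withRepeats)), 'unique indices with randrange')
print(len(set(distinct)), 'unique indices with sample')
```

For test data like this, the occasional repeat is harmless; it just means the number of typos is at most numTypos rather than exactly numTypos.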
The loop in Lines 25–29 replaces random letters at random locations. The index of the letter to be replaced is chosen in Line 27 using the randrange function. Note that the random index chosen starts at 1 rather than 0. I did this because I thought it was more realistic for typos to come after the first character. Line 28 then replaces the character at that index with a randomly chosen letter. Finally, the misspelled word is put back in the list of entries with typos. When the loop is done, the list is joined together with linefeeds and written out to the colors.txt file.
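The substitution step can be isolated into a small function to see what it does to a single word. This is a sketch of the same slicing technique used in Line 28; the misspell function name and the seed are introduced here for illustration, and string.ascii_lowercase stands in for the alphabet list.

```python
import random
import string

def misspell(word):
    """Replace one character of word (never the first) with a random lowercase letter."""
    # randrange(1, len(word)) picks an index from 1 through len(word)-1
    i = random.randrange(1, len(word))
    # Keep everything before and after index i, swap in the new letter
    return word[:i] + random.choice(string.ascii_lowercase) + word[i+1:]

random.seed(1)   # arbitrary seed, only so reruns give the same typos
for color in ['Bondi Blue', 'Tangerine', 'Snow']:
    print(color, '->', misspell(color))
```

Because the replacement is chosen blindly, it can occasionally match the letter already there (leaving the word unchanged) or overwrite the space in a two-word name.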
I should note here that my Macs have a recent version of Python 3 (installed via Anaconda) and that my PATH environment variable is such that the Anaconda Python is the default. If you want to play with this script, you may need to change the shebang line to one that invokes Python 3 explicitly, or do something else to make sure you’re running it through Python 3 instead of Python 2. The random module in Python 2 doesn’t have a choices function (nor do versions of Python 3 before 3.6).
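If you are unsure which interpreter will pick up the script, one option is a guard near the top that stops early with a clear message. This is a sketch, not something the original script includes; has_choices is a hypothetical helper name. A shebang such as #!/usr/bin/env python3 is one common way to request Python 3, assuming a python3 executable is on your PATH.

```python
import sys

def has_choices(version=sys.version_info):
    """Return True if this Python version's random module provides choices()."""
    # random.choices was added in Python 3.6
    return tuple(version[:2]) >= (3, 6)

# Stop before doing any work if the interpreter is too old
if not has_choices():
    sys.exit('This script needs Python 3.6 or later for random.choices')
```

Checking the version up front gives a more helpful failure than the AttributeError you would otherwise get when Line 14 runs.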
As is usually the case, this little script took longer to explain than it did to write.