Another small adventure in data cleaning

In a project at work, I had a folder with dozens of engineering drawings for a piece of construction equipment. As part of a report, I needed to generate a list of all the drawings I had according to their drawing numbers, and I also needed to make sure my list corresponded to the drawings in a list of documents that the client had compiled. This would be simple if

  1. The file names were just the drawing numbers.
  2. The file names had a consistent format.
  3. The drawings from the client weren’t in some awful nested list in a Word document.

Of course, if any of these were true, I wouldn’t be writing this post.

Let’s start with the drawing files I had. They had come to me via email and cloud links, and I had saved them all in a folder cleverly named “drawings.” Although their file names weren’t consistent, they did start with the file number, a string of 7–10 digits, and they were all PDFs. So I was able to put them all on the clipboard with this command:

ls *.pdf | egrep '^\d{7,10}' | pbcopy

I used egrep to get what the grep man page calls “extended” regular expressions and what I call “not brain dead” regular expressions.

If you’re wondering why I had files that weren’t drawings in the drawings folder, I can only say in my defense that the non-drawing files I put in that folder—like material specifications—were close enough to drawings that I thought they belonged there.

I then pasted the list of drawing files into a fresh BBEdit document. Although there are ways to do the next few filtering steps directly from the command line, I find that things typically go faster if I can see the results (and can undo when I make a mistake). The list of drawings looked like this (except there were about 100 drawings, not 7):

0123456_B - Base Weldment.pdf
12345678_C - Base Assembly.pdf
23456789_C_1_2 - Carriage Casting.pdf
3002345678_C_Steel Bushing.pdf
3102345678 Wheel Bearings.pdf
5123456789 rev B.pdf
6987654321_A_2_2.pdf

Most of the file names had an underscore and a capital letter following the drawing number. These were the revision codes1 and were necessary to include in my list. So the first drawing above would go in my list as

0123456 Rev. B

As you can see, though, some files didn’t indicate any revisions and some had an explicit “rev” in their name. I wanted to handle all of them, if possible.

I brought up BBEdit’s Find dialog and built the regular expression necessary to do most of the extraction.

BBEdit Find and replace to extract drawing numbers

The Find regex,

^(\d{7,10})((_| rev )([A-Z]))?(.+)$

collects the drawing number in the first set of parentheses and uses alternation to collect the optional revision code after either an underscore or “rev”. The final set of parentheses collects the rest of the file name, including the extension. Looking back on it, that last bit didn’t need to be in parentheses, but they didn’t hurt.

The Replace string,

\1 Rev. \4

puts the drawing number and revision code in the format I want.

After clicking the Replace All button, I got this,

0123456 Rev. B
12345678 Rev. C
23456789 Rev. C
3002345678 Rev. C
3102345678 Rev. 
5123456789 Rev. B
6987654321 Rev. A

which is obviously wrong for the drawings that had no revision code but worked for all the others. I’m sure there’s a clever way to make a regex that avoids the “dangling Rev,” but clever takes time, and BBEdit already has a simple way to delete strings at the end of a line. It’s the Prefix/Suffix Lines… command in the Text menu.

Prefix Suffix Lines setup

This removes the ” Rev.” from all the lines that end with that, leaving the others untouched.

Now we move on to the unfortunate Word document with multilevel lists that contained, in addition to all the drawings the client thought I had, a long list of other documents. I opened the document, copied all the text, and pasted it into a new BBEdit document. To get rid of all the lines that didn’t include a drawing number, I used BBEdit’s Process Lines Containing… command and set it up to delete lines that did not contain a string of 7 to 10 digits.

Process Lines setup

Because Word creates nested lists by changing the left margin and indentation, there were no leading tabs in any of the remaining lines. But every line had some combination of leading numbers, letters, periods, and closing parentheses followed by a tab. Like this:

1)  0123456B
d)  12345678C - Base Assembly.pdf
3.  Carriage Casting Drawing 23456789C
a.  3002345678C
t.  Drawing 3102345678
aa. Dwg 5123456789B
cc. 6987654321A

While this is a mess of inconsistency, it’s not that hard to pull out the drawing number and revision code.

Find and replace for list from client

The Find regex is

^.+?(\d{7,10})([A-Z])?.*$

and the Replace string is

\1 Rev. \2

As before, this left several “dangling Revs,” but I was able to use the Prefix/Suffix Lines… command to get rid of them.

With two files of clean drawing lists in hand, I used the comm command, which has become a favorite, to get the list of drawings that I had but were not in the client’s list,

comm -23 in-folder.txt client-list.txt

and vice-versa,

comm -13 in-folder.txt client-list.txt

BBEdit is great at this sort of interactive data cleaning. It puts the power of traditional command-line text filtering tools into a nice GUI. For example, the Process Lines Containing… command I used above is basically a sed command, but with an easy way to undo my mistakes.


  1. Engineering drawings get changed as a design evolves and those changes are tracked through revision codes. Sometimes the revision codes are numbers, but in my experience letters are more common. 


Outlining and Bike

I’ve been planning to write this post for some time, but things kept coming up. Today, though, Jesse Grosjean is running a sale on Bike, his outlining app for the Mac, and it seemed like the right day to finally push this out. The sale is a pretty good one—$10 instead of the usual $30—and all you have to do to get that price is reply to this Mastodon toot with a photo of your hometown.

I’ve had a long an difficult relationship with outlining and outlining apps. I much prefer outlining to mind-mapping (so don’t write to me with mind-mapping recommendations), but I keep running into practical problems when using outlining apps. I categorize these problems as me problems and them problems.

The me problems have to do with converting an outline into a finished piece of writing. I’ve always had this silly belief that I should be able to convert an outline into the skeleton of a report (or a blog post or whatever, but it’s usually a report) more or less automatically and then flesh it out into a final product. This doesn’t work because, except for the items at the top-level, the various items and subitems in outlines don’t correspond perfectly to sections and subsections of a report. Some outline items are subsections, but most are paragraphs or lists within a subsection. There’s no general way of knowing what an outline item is; its level doesn’t offer enough information to slot it into the proper place in the report.

My solution to the me problem has been to stop trying to do the conversion automatically. I now write my reports from scratch starting with a blank text file while referring to my outline. The outline could be in a window on my Mac or open on my iPad propped up next to the Mac. If the outline happens to have paragraphs or lists that would fit nicely in the final report, I copy and paste them. Otherwise, I just type away, following the outline’s structure.

I confess this way of working still nags at me. Surely, the back of my brain says, there must be a way to avoid the repetition. But the front of my brain argues back that years of trying have never led to that magical solution. There’s no way to avoid the actual work of writing.

The them problems are about sharing my outlines with the people I’m working with. Quite often, when a report is in its early stages, sharing an outline with a colleague and going through its structure is a good way to organize and divvy up the work. But the people I collaborate with are seldom Mac users, and even if they were, they’re unlikely to have the same outlining software I have. When I was using OmniOutliner, I’d print my outline to a PDF document and share that. But getting my outline into a form I liked for review and sharing was never as easy as I thought it should be. I like my outlines to be very spare and unadorned as I’m working on them, but to have numbered sections and specific types of spacing when printed or displayed for review. OmniOutliner, being a WYSIWYG app, forced me to change my outline before printing to PDF and then change it back afterward. I suppose I could have automated a lot of this, but it just seemed wrong to have to do so.

In some ways, Bike seems worse than OmniOutliner for both the me and them problems. It’s Mac-only, so I can’t just open a Bike outline on my iPad to look at while I write. It doesn’t even have a Print menu item, so I can’t turn my outlines into PDFs for reviewing and sharing. But what it does have is a file format that makes it easy to get around these deficiencies.

A Bike outline in its native format is just an HTML file. Here’s a screenshot of a simple example:

Example Bike outline

And here’s the source of the Example.bike file:

xml:
<?xml version="1.0" encoding="UTF-8"?>
<html>
    <head>
        <meta charset="utf-8"/>
    </head>
    <body>
        <ul id="le4V0ize">
            <li id="a0">
                <p>First item</p>
                <ul>
                    <li id="7R9">
                        <p>Subitem 1</p>
                    </li>
                    <li id="cCF">
                        <p>Subitem 2</p>
                        <ul>
                            <li id="J8w">
                                <p>Subsubitem A</p>
                            </li>
                            <li id="A-L">
                                <p>Subsubitem B</p>
                            </li>
                        </ul>
                    </li>
                    <li id="WxM">
                        <p>Subitem 3</p>
                    </li>
                </ul>
            </li>
            <li id="iU">
                <p>Second item</p>
                <ul>
                    <li id="8C3">
                        <p>Subitem 1</p>
                    </li>
                    <li id="jIH">
                        <p>Subitem 2</p>
                    </li>
                </ul>
            </li>
            <li id="aaD">
                <p>Third item</p>
            </li>
        </ul>
    </body>
</html>

As you can see, it’s basically just a bunch of nested unordered lists. You can open a Bike file in a browser, and it’s perfectly readable, albeit a bit on the vanilla side.

Bike outline opened in Safari

Since vanilla is not what I want, I wrote a short script to add a CSS section to the file that gives me the style I want.

Example outline in Safari after styling

Here’s the script, called bike2html:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import sys
 4:  from docopt import docopt
 5:  import sys
 6:  
 7:  usage = """Usage:
 8:    bike2html [options] BIKEFILE
 9:  
10:  Convert a Bike outline to HTML with hierarchically numbered items.
11:  
12:  Options:
13:    -t TTTT   title [default: Outline]
14:    -h        show this help message
15:  
16:  """
17:  
18:  # Handle the command line option.
19:  args = docopt(usage)
20:  title = args['-t']
21:  bike = args['BIKEFILE']
22:  
23:  # CSS to insert after <head>
24:  css = '''    <style type="text/css">
25:        body {
26:          font-family: Helvetica, Arial, Sans-Serif;
27:          font-weight: normal;
28:          font-size: 12pt;
29:          line-height: 1.8em;
30:          margin: 0;
31:        }
32:        h1 {
33:          font-weight: bold;
34:          font-size: 20pt;
35:          line-height: 2em;
36:          text-align: center;
37:        }
38:        ul {
39:          list-style-type: none;
40:          counter-reset: item;
41:        }
42:        li {
43:          margin-top: .9em;
44:          counter-increment: item;
45:        }
46:        li::before {
47:          display: inline;
48:          content: counters(item, ".");
49:          padding-right: .75em;
50:        }
51:        li > p {
52:          display: inline;
53:        }
54:        @page {
55:          size: Letter;
56:          margin: 1in 1in .75in .5in;
57:        }
58:      </style>
59:  '''.splitlines(keepends=True)
60:  
61:  # Convert the input (first argument) to HTML with CSS
62:  with open(bike) as f:
63:    htmlLines = f.readlines()
64:  
65:  # Don't include the <?xml> line
66:  del htmlLines[:1]
67:  
68:  # Put the <style> section after <head> and the title after <body>
69:  headLine = htmlLines.index('  <head>\n')
70:  htmlLines[headLine+1:headLine+1] = css
71:  bodyLine = htmlLines.index('  <body>\n')
72:  htmlLines[bodyLine+1:bodyLine+1] = [f'    <h1>{title}</h1>\n']
73:  
74:  print(''.join(htmlLines), end='')

As you can see, much of the script is the CSS <style> section (Lines 24–59) that gets inserted into the file in Line 70. I use docopt to handle command-line options; currently, the only option is -t, which I use to set the title of the outline (with an <h1> tag) in Line 72. The script also deletes the <xml> line at the top of the original Bike file.

The only clever part of the script is the CSS that does the item numbering. That’s in Lines 38–50. I’m not sure where I learned how to do nested counters, but it was probably this Mozilla Developer Network page. You’ll note that even though Bike defines its outline items with <ul> tags, you can still assign numbers to them without changing them to <ol> tags.

Using bike2html is easy:

bike2html -t 'Example' Example.bike > Example.html

I suppose I should make the script smarter by using the filename as the default title.

I can send Example.html to anyone, and they’ll be able to open it. The nice thing about the “1.2.3” style of numbering the items is that it makes it easy for everyone who has the outline to refer to particular items on the phone or in an email.

You may be wondering how I can show Example.html on my iPad as I’m writing a report. Unlike Safari on the Mac, Safari on the iPad cannot open local files. There are two three ways to get around this:

  1. I can upload it to a server I control and open that file in Safari.
  2. I can open it in the Documents app from Readdle, which knows how to display HTML.
  3. I can start a local web server on my iPad via WorldWideWeb from The Iconfactory and view the page in Safari.

Update 11/25/2022 4:46 PM
Thanks to Andrew Kerr (on Mastodon!) for reminding me of WorldWideWeb. I bought WWW when it came out and used it for this very purpose a couple of months ago. Not sure why I stopped using it; it’s ideal for viewing this sort of static page.

Continuity allows me to select and copy text on the iPad and paste it on my Mac. It’s a nice way to work.

I should mention that I do enjoy outlining in Bike. It doesn’t have a huge number of features, but the features it has are what I need. I can see why other people might find it offputting for a writing app to not have a Print command, but it’s just right for me.


Hair splitting

A few days ago, Brady Haran at Numberphile released a new video with Ben Sparks. Like most of Ben’s videos, it’s interesting without being super heavy with math. In this case, I think the psychology behind the video is the most interesting part.

The puzzle is this:

What are the chances that there are two people in London with the same number of hairs on their head?

There’s a bit of misdirection in posing the problem in probability terms, as it might lead the listener to think he’s asked about the chances of two randomly selected people in London having the same number of hairs on their heads. A more straightforward—and therefore less tricky—question would be

Are there two people in London with the same number of hairs on their heads?

The answer comes from being able to estimate, within an order of magnitude or so, the number of people in London and the number of hairs on people’s heads. The former is a specific number that’s continually changing, so no one knows it except as a range. And the latter is a range by definition.

I think most people know that the population of London is at least several million. The tougher estimate is of the range of hairs on people’s heads. Ben suggests up to around 100,000, based on a hair density of 100 hairs per square centimeter over a 30 cm × 30 cm area. This hair density is equivalent to a hair spacing of about 1 mm, which seems reasonable to me.

(I’m no hair follicle expert, but any parent who’s gotten a message from their child’s school about instances of head lice showing up in class knows what it’s like to go through their kid’s scalp hair by hair.)

So with the range of hair counts at least an order of magnitude less than the population of London, there have to be at least two people with the same number of hairs. The “chances” asked for is 100%. Puzzle solved.

But Ben didn’t really ask for the solution itself. He wanted your gut reaction—before you did any hair-density calculations. And it’s probably not immediately obvious to most people that the number of hairs on someone’s is well under a million.

This is where the psychology comes in. While watching the video, I thought of a mathematically very similar question, but one that would, I believe, get instant correct answers from almost everyone:

What are the chances that there are two people in London who were born on the same day?

The number of possible birth dates of living people has to be around 40,000, which is within an order of magnitude of the hairs-on-head number. So the answer to this question is also 100%, but I bet most people would answer it correctly without hesitation and without calculation.

Even people who’ve never seen a birth notice in a newspaper probably know they exist. And they know that it’s common for there to be multiple notices every day in a big city. And those who don’t know about birth notices probably know that it’s common for large hospitals to have more than birth per day—and that big cities have many hospitals.

It’s the combination of the familiar—the hair on our heads—with the unfamiliar—how many hairs are there?—that makes Ben’s question interesting. My question, because it’s so easy, wouldn’t be interesting, even though it’s mathematically the same. Good thing Brady has Ben instead of me.


Announcing new posts on Mastodon

With lots of people, including me, either starting up new Mastodon accounts or dusting off the old ones they created during a previous Twitter crisis, I decided to write a script that would autotoot to Mastodon whenever I publish a new post here. It didn’t take long, as the Mastodon API is pretty easy to use. I put everything together while watching college football.1

If you want to write your own autotooting script, you’d do well to start by reading this post at DEV instead of following the Getting Started section of the API docs. The DEV post’s author, Joseph, makes the very useful suggestion to get your script’s authorization credentials through the Mastodon web interface instead of interacting with the API directly. Had I done that to begin with, it would have saved me about a quarter’s worth of time and frustration.

You get the authorization credentials by navigating to the Development page of your Mastodon account profile, clicking the NEW APPLICATION button, and filling in a couple of fields. Don’t bother changing the redirects or the scopes; the default entries will be fine.

User account Development page

With that done, you’ll have an access token to use in your script. The part of the script that creates a new status is really short, thanks to the wonders of Kenneth Reitz’s Requests module. Here are the key parts of my autotooting script:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import requests
 4:  
 5:  [ more imports ]
 6:  
 7:  # Mastodon information.
 8:  murl = 'https://mastodon.cloud/api/v1/statuses'
 9:  auth = {'Authorization': 'Bearer XXXXXXXXXXXXXXXXXXXX'}
10:  
11:  [ Stuff to collect the info I want in the toot. ]
12:  [ The pieces of text I need are the summary ]
13:  [ and URL of the post. They're stored in variables ]
14:  [ named "summary" and "url." ]
15:  
16:  toot = {'status': f'''☃️ {summary}\n{url}'''}
17:  
18:  # Send the toot and return its URL.
19:  r = requests.post(murl, data=toot, headers=auth)
20:  print(r.json()['uri'])

The Mastodon URL you post to (Line 8) will depend on which server you’re hosted at. The bunch of Xs in Line 9 is replaced by the access token described above. The text of your toot goes into the status field of a dictionary (Line 16) that’s passed as the data parameter to the Requests post command (Line 19). That’s all.

Well, there is one other thing you may want to know. As you’re writing and debugging your script, you probably don’t want your followers to keep seeing your test toots. But you do want to see them yourself so you can make sure they’re working. To do that, add a visibility field to the toot dictionary.

16:  toot = {'status': f'''☃️ {summary}\n{url}''', 'visibility': 'direct'}

By setting the visibility to direct, it will act like a direct message to no one. Only you will see it. After debugging, just remove that part. Thanks to mdhughes for the tip.


  1. I will not mention the game, because the results were disgusting. I guess schools that are academically inferior need to win football games to make their alumni feel better about themselves.