Another small adventure in data cleaning
December 4, 2022 at 10:27 PM by Dr. Drang
In a project at work, I had a folder with dozens of engineering drawings for a piece of construction equipment. As part of a report, I needed to generate a list of all the drawings I had according to their drawing numbers, and I also needed to make sure my list corresponded to the drawings in a list of documents that the client had compiled. This would be simple if
- The file names were just the drawing numbers.
- The file names had a consistent format.
- The drawings from the client weren’t in some awful nested list in a Word document.
Of course, if any of these were true, I wouldn’t be writing this post.
Let’s start with the drawing files I had. They had come to me via email and cloud links, and I had saved them all in a folder cleverly named “drawings.” Although their file names weren’t consistent, they did start with the file number, a string of 7–10 digits, and they were all PDFs. So I was able to put them all on the clipboard with this command:
ls *.pdf | egrep '^\d{7,10}' | pbcopy
I used egrep
to get what the grep
man page calls “extended” regular expressions and what I call “not brain dead” regular expressions.
If you’re wondering why I had files that weren’t drawings in the drawings
folder, I can only say in my defense that the non-drawing files I put in that folder—like material specifications—were close enough to drawings that I thought they belonged there.
I then pasted the list of drawing files into a fresh BBEdit document. Although there are ways to do the next few filtering steps directly from the command line, I find that things typically go faster if I can see the results (and can undo when I make a mistake). The list of drawings looked like this (except there were about 100 drawings, not 7):
0123456_B - Base Weldment.pdf
12345678_C - Base Assembly.pdf
23456789_C_1_2 - Carriage Casting.pdf
3002345678_C_Steel Bushing.pdf
3102345678 Wheel Bearings.pdf
5123456789 rev B.pdf
6987654321_A_2_2.pdf
Most of the file names had an underscore and a capital letter following the drawing number. These were the revision codes1 and were necessary to include in my list. So the first drawing above would go in my list as
0123456 Rev. B
As you can see, though, some files didn’t indicate any revisions and some had an explicit “rev” in their name. I wanted to handle all of them, if possible.
I brought up BBEdit’s Find dialog and built the regular expression necessary to do most of the extraction.
The Find regex,
^(\d{7,10})((_| rev )([A-Z]))?(.+)$
collects the drawing number in the first set of parentheses and uses alternation to collect the optional revision code after either an underscore or “rev”. The final set of parentheses collects the rest of the file name, including the extension. Looking back on it, that last bit didn’t need to be in parentheses, but they didn’t hurt.
The Replace string,
\1 Rev. \4
puts the drawing number and revision code in the format I want.
After clicking the Replace All button, I got this,
0123456 Rev. B
12345678 Rev. C
23456789 Rev. C
3002345678 Rev. C
3102345678 Rev.
5123456789 Rev. B
6987654321 Rev. A
which is obviously wrong for the drawings that had no revision code but worked for all the others. I’m sure there’s a clever way to make a regex that avoids the “dangling Rev,” but clever takes time, and BBEdit already has a simple way to delete strings at the end of a line. It’s the Prefix/Suffix Lines… command in the Text menu.
This removes the ” Rev.” from all the lines that end with that, leaving the others untouched.
Now we move on to the unfortunate Word document with multilevel lists that contained, in addition to all the drawings the client thought I had, a long list of other documents. I opened the document, copied all the text, and pasted it into a new BBEdit document. To get rid of all the lines that didn’t include a drawing number, I used BBEdit’s Process Lines Containing… command and set it up to delete lines that did not contain a string of 7 to 10 digits.
Because Word creates nested lists by changing the left margin and indentation, there were no leading tabs in any of the remaining lines. But every line had some combination of leading numbers, letters, periods, and closing parentheses followed by a tab. Like this:
1) 0123456B
d) 12345678C - Base Assembly.pdf
3. Carriage Casting Drawing 23456789C
a. 3002345678C
t. Drawing 3102345678
aa. Dwg 5123456789B
cc. 6987654321A
While this is a mess of inconsistency, it’s not that hard to pull out the drawing number and revision code.
The Find regex is
^.+?(\d{7,10})([A-Z])?.*$
and the Replace string is
\1 Rev. \2
As before, this left several “dangling Revs,” but I was able to use the Prefix/Suffix Lines… command to get rid of them.
With two files of clean drawing lists in hand, I used the comm
command, which has become a favorite, to get the list of drawings that I had but were not in the client’s list,
comm -23 in-folder.txt client-list.txt
and vice-versa,
comm -13 in-folder.txt client-list.txt
BBEdit is great at this sort of interactive data cleaning. It puts the power of traditional command-line text filtering tools into a nice GUI. For example, the Process Lines Containing… command I used above is basically a sed
command, but with an easy way to undo my mistakes.
-
Engineering drawings get changed as a design evolves and those changes are tracked through revision codes. Sometimes the revision codes are numbers, but in my experience letters are more common. ↩