February 15, 2017 at 11:18 PM by Dr. Drang
I believe it was on this episode of The Talk Show from back in November that John Gruber and Jason Snell talked about quickly editing large spreadsheets that need their cell data reformatted or otherwise adjusted. The trick is to export the spreadsheet as either tab-separated or comma-separated text, then open it in BBEdit, and use the regular expression facility1 in BBEdit’s Find window to edit the whole table in one fell swoop (or a few fell swoops if you have to edit a few columns). Then import the edited TSV or CSV file back into the spreadsheet app and resave it. Boom. Potentially an hour or more of tedious, error-prone work done in a minute or two.
I’ve done the same thing many times, and because it’s such a useful technique, I have to fight the temptation to go straight to BBEdit when I need to make wholesale changes to a spreadsheet.2 Often, it’s more efficient to use a variety of techniques to transform your data. Today, for example, I used a combination of BBEdit, internal spreadsheet functions, and Keyboard Maestro to whip a spreadsheet into shape.
An engineer from another firm sent me a largish (500+ rows, 10–20 columns) spreadsheet of test results that I wanted to analyze using Pandas. The spreadsheet was in Excel format (naturally), so my first thought was to open it in Excel and immediately export it to CSV, which Pandas prefers. Luckily for me, Excel screwed up the export terribly,3 which forced me to slow down and do a little thinking about how I wanted the data arranged before importing it to Pandas.
I opened the
.xlsx file in Numbers and started the transformation there. Here’s a much simplified version of the sheet:
The first item of business was to change entries in the A column from single numbers that span multiple rows (what Numbers calls merged cells) into repeated entries for each row. The simplest way to do that is to select the merged cell, choose thecommand to turn it into a single entry, and then choose to fill that single entry into all the rows that used to be merged.
While this is the “simplest way,” it’s incredibly tedious to keep selecting cells and then choosing menu items, both of which are near the end of a long menu and one of which is in a submenu. Much faster and more reliable to use Keyboard Maestro to do the menu selection for me. Here’s the macro I made:
With this macro in place, I just clicked and hit ⌃⌥⌘N. After doing this for every merged cell in Column A, I had a spreadsheet that looked like this:
The next thing to tackle was the redundancy of Columns D and E. There’s no need to have two columns for what is, at heart, a single piece of information. I opened up a new column to the right of E for the item’s orientation:
The cells of the new column were filled with this simple formula, which checked the neighboring cells to the left and entered the appropriate text:
=IF(D2="x", "Horizontal", IF(E2="x", "Vertical", ""))
This new column contains all the information of the two columns to its left, which can be deleted. Well, not quite yet. If I delete them now, the formula won’t work. I first need to select this new column, copy it, and use thecommand to change the content of these cells from the formula to the results of the formula. Now I can delete the two redundant cells.
This may seem like a worthless optimization, but it ends up being much easier analyze the data in Pandas if there’s only one field governing an item’s orientation instead of two. Also, the real spreadsheet I was working with today had three pairs of redundant columns like this, so the improvement was even greater.
The last thing I wanted to do was split the address field. The data come from a multi-tower apartment complex, and the address field includes both the apartment number and the tower number. 305-T1 is apartment 305 in Tower 1. Since I want to be able to analyze the results on a tower-by-tower basis, a separate field for the tower number will make the Pandas commands much easier to write.
Although Numbers has text manipulation formulas, I find them clumsy and hard to use compared to the spare elegance of regular expressions. So now is the time to emulate Gruber and Snell and export the spreadsheet as a CSV file for manipulation in BBEdit.
Here’s the exported file and the Find window, ready to transform the Address field. You may notice that I’ve already edited the header line at the top of the file in anticipation of the new fields.
The Find and Replace regexes are
respectively, and here’s the data in CSV format after the replacement.
Index,Address,Unit,Tower,Item number,Orientation,Notes 1,305-T1,305,1,1,Horizontal, 1,305-T1,305,1,2,Horizontal,"25"" offset with brass extension." 1,305-T1,305,1,3,Vertical, 1,305-T1,305,1,4,Vertical, 1,305-T1,305,1,5,Vertical, 1,305-T1,305,1,6,Horizontal,"Offset: 13.5""" 1,305-T1,305,1,7,Vertical, 1,305-T1,305,1,8,Vertical, 1,305-T1,305,1,9,Vertical, 2,702-T2,702,2,1,Horizontal, 2,702-T2,702,2,2,Horizontal, 2,702-T2,702,2,3,Horizontal, 2,702-T2,702,2,4,Vertical,"No knurling. Left side height reduced by 2.6""." 2,702-T2,702,2,5,Horizontal, 2,702-T2,702,2,6,Vertical,
Yes, I decided to keep the full address field even though it’s redundant and I just went through a set of manipulations to eliminate the orientation redundancy.
Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)
Unlike the orientation redundancy, this address redundancy I just created is not going to get in my way when issuing Pandas commands. In fact, it’s likely to make certain summary reporting easier because I won’t have to recompose the full address from the tower and apartment numbers. As with all things, a little experience with the frustrations of doing things wrong helps you do them right the next time.
The key to efficiently whipping your data into shape—especially when it comes in different forms from different sources—is to stay flexible and think about all the tools at your disposal before attacking the problem with your current favorite. It’s a lesson I have to keep reteaching myself.
These are spreadsheets of data, of course. Formulas get lost when you export a spreadsheet to TSV or CSV. ↩︎
Although I’m not a fan of Excel’s, I was very surprised at this. The row count of the exported file didn’t match that of the original and many of the fields were just wrong. I think it got confused by quotation marks and/or long entries in one of the fields. ↩︎