October 31, 2019 at 8:02 PM by Dr. Drang
If you/be been reading John D. Cook’s blog recently, you know he’s been writing about “computational survivalism,” using the basic Unix text manipulation tools to process data. I know Kieran Healy has also written about using these tools instead of his beloved R for preprocessing data; this article talks a little about it, but I feel certain there’s a longer discussion somewhere else. You should probably just read all his posts. And John Cook’s, too. I’ll be here when you’re done.
I use Unix tools to manipulate data all the time. Even though a lot of my analysis is done via Python and Pandas, those old command-line programs just can’t be beat when it comes to wrangling a mass of text-formatted data into a convenient shape for analysis. I have a particular example from work I was doing last week.
I needed to analyze dozens of data files from a series of tests run on a piece of equipment. The data consisted of strain gauge and pressure gauge readings taken as the machine was run. After a brief but painful stint of having the test data provided as a set of Excel files, I got the technician in charge of the data acquisition to send me the results as CSV. That’s where we’ll start.
Each data file starts with 17 lines of preamble information about the software, the data acquisition channels, the calibration, and so on. This is important information for documenting the tests—and I make sure the raw files are saved for later reference—but it gets in the way of importing the data as a table.
I figured a simple
sed command would delete these preamble lines, but I figured wrong. For God knows what reason, the CSV file that came from the computer that collected the data was in UTF-16 format (is this common in Windows?), even though there wasn’t a single non-ASCII character in the file. UTF-16 is not something
So I took a quick look at the
iconv man page and wrote this one-liner to get the files into a format I could use:
for f in *.csv; do iconv -f UTF-16 -t UTF-8 "$f" > "$f.new"; done
I suppose I could have chosen ASCII as the “to” (
-t) format, but I’m in the habit of calling my files UTF-8 even when there’s nothing outside of the ASCII range.
After a quick check with BBEdit to confirm that the files had indeed converted, I got rid of the UTF-16 versions and replaced them with the UTF-8 versions
rm *.csv rename 's/\.new//' *.new
With the files in the proper format, it was time to delete the 17-line preamble. The easiest way to do this is with the GNU version
sed that can be installed through Homebrew:
gsed -i '1,17d' *.csv
sed that comes with macOS requires a bit more typing:
for f in *.csv; do sed -i.bak '1,17d' "$f"; done
sed forces you to do two things:
1. Include an extension with the `-i` switch so you have backups of the original files. 2. Use a loop to go through the files. A command like `sed -i.bak '1,17d' *.csv` would concatenate all the CSV files together and delete the first 17 lines of that. The upshot is that only the first CSV file would have its preamble removed.
Hence my preference for
With the preambles gone, I now had a set of real CSV files. Each one had a header line followed by many lines of data. I could have stopped the preprocessing here, but there were a couple more things I wanted to change.
First, the data acquisition software inserts an “Alarm” item after every data channel, effectively giving me twice as many columns as necessary. To get rid of the Alarm columns, I needed to know which ones they were. I could have opened one of the files in BBEdit and started counting through the header line, but it was more conveniently done via John Cook’s header numbering one-liner:
head -1 example.csv | gsed 's/,/\n/g' | nl
Because all the CSVs have the same header line, I could run this on any one of them. The
head -1 part extracts just the first line of the file. That gets piped to the
sed substitution command, which converts every comma into a newline. Finally,
nl prefixes each of the lines sent to it with its line number. What I got was this:
1 Scan 2 Time 3 101 <SG1> (VDC) 4 Alarm 101 5 102 <SG2> (VDC) 6 Alarm 102 7 103 <SG3> (VDC) 8 Alarm 103 9 104 <SG4> (VDC) 10 Alarm 104 11 105 <SG5> (VDC) 12 Alarm 105 13 106 <SG6> (VDC) 14 Alarm 106 15 107 <SG7> (VDC) 16 Alarm 107 17 108 <SG8> (VDC) 18 Alarm 108 19 109 <Pressure> (VDC) 20 Alarm 109
With the headers laid out in in numbered rows, it was easy to construct a
cut command to pull out only the columns I needed:
for f in *.csv; do cut -d, -f 1-3,5,7,9,11,13,15,17,19 "$f" > "$f.new"; done
I’d like to tell you I did something clever like
seq -s, 3 2 19 | pbcopy
to get the list of odd numbers from 3 to 19 and then pasted them into the
cut command, but I just typed them out like an animal.
Once again, I ran
rm *.csv rename 's/\.new//' *.new
to get rid of the old versions of the files.
The last task was to change the header line to a set of short names that I’d find more convenient when processing the data in Pandas. For that, I used
sed’s “change” command:
gsed -i '1cScan,Datetime,G1,G2,G3,G4,G5,G6,G7,P' *.csv
With this change, I could access the data in Pandas using simple references like
instead of long messes like
df['101 <SG1> (VDC)']
All told, I used nine commands to clean up the data. If I had just a few data files, I might have done the editing by hand. It is, after all, quite easy to delete rows and columns of a spreadsheet. But I had well over a hundred of these files, and even if I had the patience to edit them all by hand, there’s no way I could have done it without making mistakes here and there.
Which isn’t to say I wrote each command correctly on the first try. But after testing each command on a single file, I could then apply it with confidence to the full set.1 The tests were run in several batches over a few days. Once I had the commands set up, I could clean each new set of data in no time.
And it took a lot less time to develop and test the commands than it did to write this post. ↩