A small hint of big data

Shortly before Christmas, I got a few gigabytes of test data from a client and had to make sense of it. The first step was being able to read it.

The data came from a series of sensors installed in the some equipment manufactured by the client but owned by one of its customers. It was the customer who had collected the data, and precise information about it was limited at best. Basically, all I knew going in was that I had a handful of very large files, most of them about half a gigabyte, and that they were almost certainly text files of some sort.

One of the files was much smaller than the other, only about 50 MB. I decided to start there and opened it in BBEdit, which took a little time to suck it all in but handled it flawlessly. Scrolling through it, I learned that the first several dozen lines described the data that was being collected and the units of that data. At the end of the header section was a line with just the string

[data]


and after that came line after line of numbers. Each line was about 250 characters long and used DOS-style CRLF line endings. All the fields were numeric and were separated by single spaces. The timestamp field for each data record looked like a floating point number, but after some review, I came to understand that it was an encoding of the clock time in hhmmss.ssss format. This also explained why the files were so big: the records were 0.002 seconds apart, meaning the data had been collected at 500 Hz, much faster than was necessary for the type of information being gathered.

Anyway, despite its excessive volume, the data seemed pretty straightforward, a simple format that I could do a little editing of to get it into shape for importing into Pandas. So I confidently right-clicked one of the larger files to open it in BBEdit, figuring I’d see the same thing. But BBEdit wouldn’t open it.

As the computer I was using has 32 GB of RAM, physical memory didn’t seem like the cause of this error. I had never before run into a text file that BBEdit couldn’t handle, but then I’d never tried to open a 500+ MB file before. I don’t blame BBEdit for the failure—data files like this aren’t what it was designed to edit—but it was surprising. I had to come up with Plan B.

Plan B started with running head -100 on the files to make sure they were all formatted the same way. I learned that although the lengths of the header sections were different, they were collecting same type of data and using the same space-separated format for the data itself. Also, in each file the header and data were separated by a [data] line.

The next step was stripping out the header lines and transforming the data into CSV format. Pandas can certainly read space-separated data, but I figured that as long as I had to do some editing of the files, I might as well put them into a form that lots of software can read. I considered using a pipeline of standard Unix utilities and maybe Perl to do the transformation, but settled on a writing a Python script. Even though such a script was likely to be longer than the equivalent pipeline, my familiarity with Python would make it easier to write.

Here’s the script:

python:
1: #!/usr/bin/env python
2:
3: import sys
4:
5: f = open(sys.argv[1], 'r')
6: for line in f:
7:   if line.rstrip() == '[data]':
8:     break
9:
11:
12: for line in f:
13:   print line.rstrip().replace(' ', ',')


(You can see from the print commands that this was done back before I switched to Python 3.)

The script, data2csv, was run from the command line like this for each data file in turn:

data2csv file01.dat > file01.csv


The script takes advantage of the way Python iterates through an open file line-by-line, keeping track of where it left off. The first loop, Lines 6–8, runs through the header lines, doing nothing and then breaking out of the loop when the [data] line is encountered.

Line 10 prints a CSV header line of my own devising. This information was in the original file, but its field names weren’t useful, so it made more sense for me to create my own.

Finally, the loop in Lines 12–13 picks up the file iteration where the previous loop left off and runs through to the end of the file, stripping off the DOS-style line endings and replacing the spaces with commas before printing each line in turn.

Even on my old 2012 iMac, this script took less than five seconds to process the large files, generating CSV files with over two million lines.

I realize my paltry half-gigabyte files don’t really qualify as big data, but they were big to me. I’m usually not foolish enough to run high frequency data collection processes on low frequency equipment for long periods of time. Since the usual definition of big data is something like “too voluminous for traditional software to handle,” and my traditional software is BBEdit, this data set fit the definition for me.