July 26, 2016 at 12:23 AM by Dr. Drang
The best thing about writing posts like last night’s is what I learn from the people who read it. Let’s do a little followup.
First, I can always count on Aristotle Pagaltzis to improve my Perl. His suggestion for the one-liner that turns the data copied from the PDF table into a TSV,
perl -pe 's/\n/\t/ if $. % 21'
is much cleaner than my clunky solution.
Similarly, Nathan Grigg improved my Python by pointing out that
enumerate can take an optional
start parameter, so the index term doesn’t have to start at 0. In my original Python script, I could’ve written
python: for i, b in enumerate(a, start=1): if i % 21 > 0:
python: for i, b in enumerate(a): if (i + 1) % 21 > 0:
making the script just a little easier to understand. The
start parameter was added in Python 2.6, which is probably why I didn’t know about it. I’d been using Python so long by the time 2.6 came along, I thought I knew everything there was to know about
enumerate. A cautionary tale.
The best tip I got was this one:
.@drdrang Perhaps Tabula would help for extracting tables from PDFs. I’ve used it with great success before: tabula.technology
— Wired-up Wrong (@BrokenWiring) Jul 24 2016 11:16 PM
- Select one or more PDFs with the data you want.
- Identify the area of the page from which to extract the data.
- Save the data in CSV, TSV, or JSON format.
I gave Tabula a try on the same PDF tables I wrote about last night, and it worked perfectly. You may recall that I didn’t like the column headings in the original table. Well, Tabula let me drag a rectangle to select just the data portion of the table, leaving the stuff I didn’t want out of the extracted CSV file.
I can imagine using this same technique to grab just a few rows or a few columns of data from a large table.
The Tabula people don’t oversell its capabilities. They recognize that some table layouts give its parser fits, and they offer suggestions to work around its limitations. I found it to work very quickly and accurately and look forward to using it more. Thanks to both Wired-up Wrong and Bill Eccles for suggesting it.
David Emmons suggested using PDFpen Pro’s Export to Excel feature, which is similar to Tabula in that it scans the PDF and tries to parse the table layout. Although I own PDFpen Pro, I hadn’t tried that, mainly because my few experiences with its Export to Word had been disappointing. But after David’s tweet, I decided to give it a try.
At first, things didn’t go well. The exported files raised error messages when opened in Excel and were littered with extra columns. But this was when I let PDFpen Pro try to parse the entire table. I then used the rectangle selection tool to crop the table down to just the data portion. The files created after that were perfect.
Still, I don’t actually want an Excel file—it’s only a means to an end. And I especially don’t want to see stupid warning messages like this when I quit an application after copying no more than a few hundred cells of a spreadsheet.
What I really want is a CSV or TSV, and because Tabula can give me those directly, it’s the tool I’ll be using.
What this means is that Tabula starts up a web server on your computer and you interact with it through a browser. While you might think this leads to an awkward user interface, it actually works quite smoothly and allows Tabula to have a consistent user interface on all platforms. ↩︎