PDFtk
January 29, 2017 at 12:35 PM by Dr. Drang
Although there are several good GUI applications for working with PDFs on the Mac (including the built-in Preview until the recent unpleasantness), sometimes a command-line tool is just more efficient. For years, I’ve been using PDFtk Server. Despite its name, there’s no networking aspect to it—it’s just a regular command-line program.
Unfortunately, the “Mac OS X” version of PDFtk Server you can download straight from the PDFtk site stopped working recently. Based on this thread at Stack Overflow, I don’t think the problem is associated with Sierra’s buggy PDF framework. It doesn’t ruin your PDFs, it just sits there after you invoke it and doesn’t do anything at all.
But if you use Homebrew, there is a solution, and it’s given by obh’s contribution to that Stack Overflow thread. This Homebrew formula on GitHub for building PDFtk Server installs a version that works in Sierra. The instructions on GitHub don’t work for me, but obh’s do:
brew install https://raw.githubusercontent.com/turforlag/homebrew-cervezas/master/pdftk.rb
With that out of the way, here’s a problem I ran into recently that PDFtk solved quickly. I was given a very large PDF, over a thousand pages, that had photographs and test results for samples of material taken from dozens of buildings. The PDF pages were in a logical order inasmuch as all the photos and results for a particular building were on consecutive pages, but the file was so large that it bogged down my computer whenever I opened it to review the data.
What I wanted was a set of separate PDFs, one for each building, named after the address of the building. I could have done this by opening the file in Preview or PDFpen Pro and creating new files by
- selecting all the pages associated with a particular building in the thumbnail sidebar;
- dragging them out into a Finder window to create a new file of just those pages; and
- renaming the new file with the building’s address.
This was, I soon found, a terribly slow process, because selecting the thumbnails required scrolling, and that took a lot of time because of the size of the PDF. It was also easy to deselect pages by mistake and have to scroll back and start the selection over again.
A more robust way is to use PDFtk’s cat feature to extract the pages and save them into a new file. The form of the command is this:
pdftk big-file.pdf cat 1-26 output 'building address 1.pdf'
What I needed to do was repeat a command like this for each building. Obviously, I had no intention of typing this out by hand dozens of times, but if I could go through the big file once, getting the starting page and address of each building in turn, I could create an input file that looked like this,
1 178 Alameda Court
23 3235 Kings Point Court
52 57 Byerrum Street
71 3513 Brown Court
81 2579 South Whispering Hills Drive
108 2611 Sunburst Lane
134 1101 Nanak Court
153 507 Rosewood Avenue
163 124 Perth Drive
173 2363 Salt Meadow Road
195 1574 Monmouth Avenue
212 2666 Westbrook Circle
233 428 Selby Road
262 1661 Indian Knoll Road
292 1291 Cripple Creek Court
319 443 Cordula Circle
340 3108 Pipestone Court
362 355 Roanoake Court
389 553 Buckley Court
408 505 Keim Drive
with each line consisting of the starting page followed by the address, and then write a short program that generated all the PDFtk commands I needed.
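Because each building’s last page is inferred from the next building’s first page, a list like this only works if the starting pages are in increasing order. Here is a minimal sketch of reading the list and checking that order. The function names read_starts and check_increasing are made up for illustration and are not part of the script shown later.

```python
def read_starts(lines):
    """Parse 'page address' lines into (start_page, address) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first space only, so the address stays in one piece
        start, address = line.split(' ', 1)
        entries.append((int(start), address))
    return entries


def check_increasing(entries):
    """Return the entries whose start page is not greater than the previous one."""
    problems = []
    for (prev_start, _), (start, address) in zip(entries, entries[1:]):
        if start <= prev_start:
            problems.append((start, address))
    return problems


sample = [
    "1 178 Alameda Court",
    "23 3235 Kings Point Court",
    "52 57 Byerrum Street",
]
entries = read_starts(sample)
print(entries)
print(check_increasing(entries))  # an empty list means the order looks fine
```

A mistyped page number would show up in the second list before any PDFtk command is generated.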
Now it’s true that I still had to work my way entirely through the huge PDF, but this proved much faster than the scrolling back and forth that was needed for the dragging solution. Once I had the list of page numbers and addresses, it was easy to write a short Python script to print out all the PDFtk commands needed to create the new files:
pdftk input.pdf cat 1-22 output '178 Alameda Court.pdf'
pdftk input.pdf cat 23-51 output '3235 Kings Point Court.pdf'
pdftk input.pdf cat 52-70 output '57 Byerrum Street.pdf'
pdftk input.pdf cat 71-80 output '3513 Brown Court.pdf'
pdftk input.pdf cat 81-107 output '2579 South Whispering Hills Drive.pdf'
pdftk input.pdf cat 108-133 output '2611 Sunburst Lane.pdf'
pdftk input.pdf cat 134-152 output '1101 Nanak Court.pdf'
pdftk input.pdf cat 153-162 output '507 Rosewood Avenue.pdf'
pdftk input.pdf cat 163-172 output '124 Perth Drive.pdf'
pdftk input.pdf cat 173-194 output '2363 Salt Meadow Road.pdf'
pdftk input.pdf cat 195-211 output '1574 Monmouth Avenue.pdf'
pdftk input.pdf cat 212-232 output '2666 Westbrook Circle.pdf'
pdftk input.pdf cat 233-261 output '428 Selby Road.pdf'
pdftk input.pdf cat 262-291 output '1661 Indian Knoll Road.pdf'
pdftk input.pdf cat 292-318 output '1291 Cripple Creek Court.pdf'
pdftk input.pdf cat 319-339 output '443 Cordula Circle.pdf'
pdftk input.pdf cat 340-361 output '3108 Pipestone Court.pdf'
pdftk input.pdf cat 362-388 output '355 Roanoake Court.pdf'
pdftk input.pdf cat 389-407 output '553 Buckley Court.pdf'
pdftk input.pdf cat 408-end output '505 Keim Drive.pdf'
Here’s the script:
python:
1: #!/usr/bin/env python
2:
3: from fileinput import input
4:
5: pages = []
6: for line in input():
7:     pagestart, address = line.rstrip().split(' ', 1)
8:     pages.append([int(pagestart), 0, address])
9:
10: for i in range(len(pages) - 1):
11:     pages[i][1] = pages[i+1][0] - 1
12: pages[len(pages)-1][1] = 'end'
13:
14: for i in range(len(pages)):
15:     print("pdftk input.pdf cat {}-{} output '{}.pdf'".format(*pages[i]))
There are probably cleverer ways to do this, but I didn’t want to spend time trying to be clever. The loop in Lines 6–8 goes through the input list, getting the starting page number and address for each building. The pages list is a list of lists that looks like this after the first loop:
[[1, 0, '178 Alameda Court'],
[23, 0, '3235 Kings Point Court'],
[52, 0, '57 Byerrum Street'],
…
[408, 0, '505 Keim Drive']]
The second item in each sublist is going to be the ending page number for that building, but because that has to come from the next line in the input list (which I don’t know the first time through), I stuck a dummy value of 0 in there.
Lines 10–11 are a second loop through the data. Now I can look ahead and figure out the end page for each building. I don’t know the last page number, but I don’t have to. PDFtk understands the word end to mean the last page number. After Line 12, pages looks like this,
[[1, 22, '178 Alameda Court'],
[23, 51, '3235 Kings Point Court'],
[52, 70, '57 Byerrum Street'],
…
[408, 'end', '505 Keim Drive']]
and it’s ready to be used in the final loop of Lines 14–15 to print out the PDFtk commands.
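One alternative way to get the same ranges, offered only as a sketch and not as what the script above does, is to pair each starting page with the next one using zip. The function name page_ranges and the use of tuples are assumptions made for this example.

```python
def page_ranges(entries):
    """entries: list of (start_page, address) tuples in page order.

    Returns a list of (start, end, address) tuples, where the final
    range ends at the string 'end', which PDFtk reads as the last page.
    """
    starts = [start for start, _ in entries]
    # Each range ends one page before the next one starts
    ends = [s - 1 for s in starts[1:]] + ['end']
    return [(start, end, address)
            for (start, address), end in zip(entries, ends)]


entries = [
    (1, '178 Alameda Court'),
    (23, '3235 Kings Point Court'),
    (52, '57 Byerrum Street'),
]
ranges = page_ranges(entries)
for start, end, address in ranges:
    print("pdftk input.pdf cat {}-{} output '{}.pdf'".format(start, end, address))
```

This avoids the dummy value and the second pass, at the cost of being a little less obvious on first reading.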
There are a couple of things in the script that relatively new Python programmers may not have seen before. First, you can limit the number of substrings created by split by giving it a second argument. On Line 7, I tell it to split the line on only the first space encountered. That preserves all the spaces in the address part of the line and keeps it all together as a single string.
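A quick comparison, using one line from the input list, shows the difference the second argument makes. The variable names here are just for illustration.

```python
line = "81 2579 South Whispering Hills Drive"

# With no limit, every space is a split point
everything = line.split(' ')
print(everything)

# With a limit of 1, only the first space is used
first_only = line.split(' ', 1)
print(first_only)

pagestart, address = first_only
print(int(pagestart), address)
```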
Second, putting a * before a list expands it into its component parts. In Line 15, pages[i] would mean the single entity
[1, 22, '178 Alameda Court']
the first time through the loop. But *pages[i] means the three separate items,
1, 22, '178 Alameda Court'
which is what the format method wants.
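Here is a small sketch of the same idea outside the loop. The names template and entry are made up for the example.

```python
template = "pdftk input.pdf cat {}-{} output '{}.pdf'"
entry = [1, 22, '178 Alameda Court']

# The * spreads the list into three separate arguments
command = template.format(*entry)
print(command)

# This is the same as writing the items out by hand
by_hand = template.format(entry[0], entry[1], entry[2])

# Without the *, format gets one argument (the whole list) for three
# placeholders and raises an IndexError
try:
    template.format(entry)
    failed = False
except IndexError:
    failed = True
print(failed)
```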
I suppose I could’ve run the PDFtk commands directly from the script using the subprocess library, but I wanted to print out the commands as I was writing the script so I could debug it. Once I had the script creating the correct commands, adding a new section for running them seemed like a waste of time. It was faster to just copy the output, paste it into its own file, and source it from the command line to run all the individual commands. If this weren’t a one-off script, I might have taken the time to change the print in Line 15 into a subprocess command.
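For what it’s worth, a subprocess version might look something like the sketch below. It is not something I ran for this job. The function names build_command and run_all and the dry_run flag are made up for illustration, and it assumes pdftk is on the PATH when dry_run is turned off.

```python
import subprocess


def build_command(start, end, address, source='input.pdf'):
    """Return the pdftk argument list for one building's page range."""
    return ['pdftk', source, 'cat', '{}-{}'.format(start, end),
            'output', '{}.pdf'.format(address)]


def run_all(ranges, dry_run=True):
    """Print each command (dry run) or run it; return the argument lists."""
    commands = [build_command(*r) for r in ranges]
    for cmd in commands:
        if dry_run:
            # For inspection only; this printed form is not shell-quoted
            print(' '.join(cmd))
        else:
            # Arguments are passed as a list, so spaces in the address
            # need no quoting. Raises CalledProcessError if pdftk fails.
            subprocess.check_call(cmd)
    return commands


ranges = [(1, 22, '178 Alameda Court'), (408, 'end', '505 Keim Drive')]
commands = run_all(ranges)
```

Keeping a dry-run mode preserves the ability to look over the commands before anything gets written to disk.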