PDFtk

Although there are several good GUI applications for working with PDFs on the Mac (including the built-in Preview until the recent unpleasantness), sometimes a command-line tool is just more efficient. For years, I’ve been using PDFtk Server. Despite its name, there’s no networking aspect to it—it’s just a regular command-line program.

Unfortunately, the “Mac OS X” version of PDFtk Server you can download straight from the PDFtk site stopped working recently. Based on this thread at Stack Overflow, I don’t think the problem is associated with Sierra’s buggy PDF framework. PDFtk doesn’t ruin your PDFs; it just sits there after you invoke it and doesn’t do anything at all.

But if you use Homebrew, there is a solution, and it’s given by obh’s contribution to that Stack Overflow thread. This Homebrew formula on GitHub for building PDFtk Server installs a version that works in Sierra. The instructions on GitHub don’t work for me, but obh’s do:

brew install https://raw.githubusercontent.com/turforlag/homebrew-cervezas/master/pdftk.rb

With that out of the way, here’s a problem I ran into recently that PDFtk solved quickly. I was given a very large PDF, over a thousand pages, that had photographs and test results for samples of material taken from dozens of buildings. The PDF pages were in a logical order inasmuch as all the photos and results for a particular building were on consecutive pages, but the file was so large that it bogged down my computer whenever I opened it to review the data.

What I wanted was a set of separate PDFs, one for each building, named after the address of the building. I could have done this by opening the file in Preview or PDFpen Pro and creating new files by

  1. selecting all the pages associated with a particular building in the thumbnail sidebar;
  2. dragging them out into a Finder window to create a new file of just those pages; and
  3. renaming the new file with the building’s address.

This was, I soon found, a terribly slow process, because selecting the thumbnails required scrolling, and that took a lot of time because of the size of the PDF. It was also easy to deselect pages by mistake and have to scroll back and start the selection over again.

A more robust way is to use PDFtk’s cat feature to extract the pages and save them into a new file. The form of the command is this:

pdftk big-file.pdf cat 1-26 output 'building address 1.pdf'

What I needed to do was repeat a command like this for each building. Obviously, I had no intention of typing this out by hand dozens of times, but if I could go through the big file once, getting the starting page and address of each building in turn, I could create an input file that looked like this,

1 178 Alameda Court
23 3235 Kings Point Court
52 57 Byerrum Street
71 3513 Brown Court
81 2579 South Whispering Hills Drive
108 2611 Sunburst Lane
134 1101 Nanak Court
153 507 Rosewood Avenue
163 124 Perth Drive
173 2363 Salt Meadow Road
195 1574 Monmouth Avenue
212 2666 Westbrook Circle
233 428 Selby Road
262 1661 Indian Knoll Road
292 1291 Cripple Creek Court
319 443 Cordula Circle
340 3108 Pipestone Court
362 355 Roanoake Court
389 553 Buckley Court
408 505 Keim Drive

with each line consisting of the starting page followed by the address, and then write a short program that generated all the PDFtk commands I needed.

Now it’s true that I still had to work my way entirely through the huge PDF, but this proved much faster than the scrolling back and forth that was needed for the dragging solution. Once I had the list of page numbers and addresses, it was easy to write a short Python script to print out all the PDFtk commands needed to create the new files:

pdftk input.pdf cat 1-22 output '178 Alameda Court.pdf'
pdftk input.pdf cat 23-51 output '3235 Kings Point Court.pdf'
pdftk input.pdf cat 52-70 output '57 Byerrum Street.pdf'
pdftk input.pdf cat 71-80 output '3513 Brown Court.pdf'
pdftk input.pdf cat 81-107 output '2579 South Whispering Hills Drive.pdf'
pdftk input.pdf cat 108-133 output '2611 Sunburst Lane.pdf'
pdftk input.pdf cat 134-152 output '1101 Nanak Court.pdf'
pdftk input.pdf cat 153-162 output '507 Rosewood Avenue.pdf'
pdftk input.pdf cat 163-172 output '124 Perth Drive.pdf'
pdftk input.pdf cat 173-194 output '2363 Salt Meadow Road.pdf'
pdftk input.pdf cat 195-211 output '1574 Monmouth Avenue.pdf'
pdftk input.pdf cat 212-232 output '2666 Westbrook Circle.pdf'
pdftk input.pdf cat 233-261 output '428 Selby Road.pdf'
pdftk input.pdf cat 262-291 output '1661 Indian Knoll Road.pdf'
pdftk input.pdf cat 292-318 output '1291 Cripple Creek Court.pdf'
pdftk input.pdf cat 319-339 output '443 Cordula Circle.pdf'
pdftk input.pdf cat 340-361 output '3108 Pipestone Court.pdf'
pdftk input.pdf cat 362-388 output '355 Roanoake Court.pdf'
pdftk input.pdf cat 389-407 output '553 Buckley Court.pdf'
pdftk input.pdf cat 408-end output '505 Keim Drive.pdf'

Here’s the script:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from fileinput import input
 4:  
 5:  pages = []
 6:  for line in input():
 7:    pagestart, address = line.rstrip().split(' ', 1)
 8:    pages.append([int(pagestart), 0, address])
 9:  
10:  for i in range(len(pages) - 1):
11:    pages[i][1] = pages[i+1][0] - 1
12:  pages[len(pages)-1][1] = 'end'
13:  
14:  for i in range(len(pages)):
15:    print("pdftk input.pdf cat {}-{} output '{}.pdf'".format(*pages[i]))

There are probably cleverer ways to do this, but I didn’t want to spend time trying to be clever. The loop in Lines 6–8 goes through the input list, getting the starting page number and address for each building. The pages list is a list of lists that looks like this after the first loop:

[[1, 0, '178 Alameda Court'],
 [23, 0, '3235 Kings Point Court'], 
 [52, 0, '57 Byerrum Street'],
  …
 [408, 0, '505 Keim Drive']]

The second item in each sublist is going to be the ending page number for that building, but because that has to come from the next line in the input list (which I don’t know the first time through), I stuck a dummy value of 0 in there.

Lines 10–11 are a second loop through the data. Now I can look ahead and figure out the end page for each building. I don’t know the last page number, but I don’t have to. PDFtk understands the word end to mean the last page number. After Line 12, pages looks like this,

[[1, 22, '178 Alameda Court'],
 [23, 51, '3235 Kings Point Court'], 
 [52, 70, '57 Byerrum Street'],
  …
 [408, 'end', '505 Keim Drive']]

and it’s ready to be used in the final loop of Lines 14–15 to print out the PDFtk commands.
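For comparison, here is a sketch of one way to get the same ranges without the dummy value or the second loop, by pairing each starting page with the one that follows it. It is written for Python 3, and the function name page_ranges and the three-entry sample list are made up for illustration; they are not part of the script above.

```python
def page_ranges(starts):
    """Turn (start_page, address) pairs into (start, end, address) triples.

    Each building ends one page before the next one starts; the last
    building runs to PDFtk's special page name 'end'.
    """
    first_pages = [start for start, _ in starts]
    # Ending pages: one before each following start, then 'end' for the last.
    last_pages = [nxt - 1 for nxt in first_pages[1:]] + ['end']
    return [(start, stop, address)
            for (start, address), stop in zip(starts, last_pages)]


# Hypothetical sample: the first three lines of the input file.
sample = [(1, '178 Alameda Court'),
          (23, '3235 Kings Point Court'),
          (52, '57 Byerrum Street')]

for start, stop, address in page_ranges(sample):
    print("pdftk input.pdf cat {}-{} output '{}.pdf'".format(start, stop, address))
```

Because the sample stops after three buildings, the third one gets the end range here; with the full list, only 505 Keim Drive would.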

There are a couple of things in the script that relatively new Python programmers may not have seen before. First, you can limit the number of substrings created by split by giving it a second argument. On Line 7, I tell it to split the line on only the first space encountered. That preserves all the spaces in the address part of the line and keeps it all together as a single string.
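A small illustration of the difference, using one line from the input file (Python 3 syntax; the variable names mirror those in the script):

```python
line = "81 2579 South Whispering Hills Drive\n"

# Without a limit, every space is a split point and the address falls apart.
pieces = line.rstrip().split(' ')
print(pieces)

# With a limit of 1, only the first space is used as a split point.
pagestart, address = line.rstrip().split(' ', 1)
print(pagestart)
print(address)
```

The first call yields six separate strings; the second yields exactly two, the page number and the whole address.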

Second, putting a * before a list expands it into its component parts. In Line 15, pages[i] would mean the single entity

[1, 22, '178 Alameda Court']

the first time through the loop. But *pages[i] means the three separate items,

1, 22, '178 Alameda Court'

which is what the format method wants.
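To see why the star matters, here is a short sketch (Python 3; the names entry and template are just for illustration) that calls format both ways:

```python
entry = [1, 22, '178 Alameda Court']
template = "pdftk input.pdf cat {}-{} output '{}.pdf'"

# Unpacked: the three items fill the three {} placeholders in order.
command = template.format(*entry)
print(command)

# Not unpacked: format receives a single argument (the whole list), so the
# second placeholder has nothing to fill it and Python raises IndexError.
try:
    template.format(entry)
except IndexError as err:
    print("IndexError:", err)
```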

I suppose I could’ve run the PDFtk commands directly from the script using the subprocess library, but I wanted to print out the commands as I was writing the script so I could debug it. Once I had the script creating the correct commands, adding a new section for running them seemed like a waste of time. It was faster to just copy the output, paste it into its own file, and source it from the command line to run all the individual commands. If this weren’t a one-off script, I might have taken the time to change the print in Line 15 into a subprocess command.
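For the record, a subprocess version might look something like the sketch below. It was not part of the original workflow, and the function names build_command and run_all and the dry_run flag are invented for the example. Passing the command as a list of arguments means the spaces in the addresses need no shell quoting.

```python
import subprocess


def build_command(start, stop, address, source='input.pdf'):
    """Return one pdftk invocation as an argument list."""
    return ['pdftk', source, 'cat', '{}-{}'.format(start, stop),
            'output', '{}.pdf'.format(address)]


def run_all(pages, dry_run=True):
    """Print each command; also run it unless dry_run is True."""
    for start, stop, address in pages:
        command = build_command(start, stop, address)
        # Joined for display only; this string is not shell-quoted.
        print(' '.join(command))
        if not dry_run:
            # check_call raises CalledProcessError if pdftk exits with an error.
            subprocess.check_call(command)


# Two entries from the finished pages list, as a stand-in for the full list.
pages = [[1, 22, '178 Alameda Court'], [408, 'end', '505 Keim Drive']]
run_all(pages)  # dry run: only prints, so it works without pdftk installed
```

Keeping the dry run as the default preserves the debugging step of checking the commands before anything is written to disk.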


The refugee ban

As I type this, my Twitter timeline is filled with photos of protests of Trump’s latest executive order, Protecting the Nation From Foreign Terrorist Entry Into the United States.1 Like me, you’ve probably read descriptions of it, but it often helps to read the actual text for yourself. Even though documents like this are often difficult to read because they’re filled with references to references to references, it’s still worth the effort because it’s a way of checking on how straight the news you read is being with you. And sometimes you’ll see things that aren’t emphasized in the news reports.

As I read the order, a couple of things (apart from the obvious and what’s already been well-reported) stood out:

  1. Section 5(g) says

    It is the policy of the executive branch that, to the extent permitted by law and as practicable, State and local jurisdictions be granted a role in the process of determining the placement or settlement in their jurisdictions of aliens eligible to be admitted to the United States as refugees. To that end, the Secretary of Homeland Security shall examine existing law to determine the extent to which, consistent with applicable law, State and local jurisdictions may have greater involvement in the process of determining the placement or resettlement of refugees in their jurisdictions, and shall devise a proposal to lawfully promote such involvement.

    A “role in the process of determining the placement or settlement” should be read as “the right of refusal.” In other words, the administration that is dead set against cities using the principles of federalism to act as sanctuaries for refugees—as shown in this executive order from just two days earlier—is perfectly happy to allow those same principles to be used by other cities to refuse any and all refugees that the federal government may allow in.

  2. Section 10 sets up a program for the reporting of crimes committed by foreign nationals. I can’t tell if this is different from the provision of that earlier executive order that creates an Office for Victims of Crimes Committed by Removable Aliens, but it’s the justification of the program that I find interesting:

    To be more transparent with the American people, and to more effectively implement policies and practices that serve the national interest…

    Yes, the president who refuses to release his tax returns and whose Cabinet appointees have repeatedly failed to fill out complete disclosure forms is establishing a new bureaucracy for reporting those special crimes committed by foreigners. And he’s doing it because of his commitment to transparency.

I see there’s a new executive order that came out today on Ethics Commitments by Executive Branch Appointees. That should be a good one.

Update 01/28/2017 9:47 PM
While I was writing this, the ACLU got a stay on deportations under the executive order. If you follow that link, you can see the news and donate.


  1. I’d prefer to provide an official White House link to the executive order, but it’s not there. Maybe it’ll show up in a day or two.


Nothing new

America has a poor collective memory and it’s getting worse. It took about 50 years for us to forget the lessons of the Great Depression and decide that deregulating financial institutions was a good idea. That led to the savings and loan crisis of the 80s and—because we didn’t learn from that—the huge meltdown of 2007–8. The lessons of Vietnam took only 30 years to forget, leading to Iraq. The irony there was that Colin Powell, of the Vietnam-inspired Powell Doctrine, was one of the people ushering us into exactly what he had warned against.

Now we’ve gone and elected a manifestly unqualified dope as president, just as we did 16 years ago, and we are for some reason acting surprised at the terrible and stupid things he’s doing. But they all have parallels with the actions of the last Republican administration.

We haven’t seen a plan for privatizing Social Security yet, but that’s only because destroying the ACA is a higher priority.


Mary, Queen of Scots

In Our Time has not been enjoying a particularly good series. Melvyn Bragg seems more interested in arguing with his guests than bringing out their best, and when he isn’t arguing, he’s barely interested at all. But this week’s episode on Mary, Queen of Scots brought the show back in all its glory.

It’s unsurprising, I suppose, that a very British topic would revive both Melvyn and the show. Not only is it the sort of thing that gets him going, but the guests are exactly what fans of the show want. There’s the Cambridge don who stammers when he gets excited talking about the plot to kill Darnley, and the curator from a Scottish museum whose accent is so thick you expect him to burst out with an “Och, laddy!” What could be better?

I’ll tell you what could be better. The third guest, from the University of Edinburgh, while introducing us to Mary’s second husband, the Earl of Bothwell, says he’s “a hard man from the borders” and “not keen on the English.” I nearly drove off the road.

And yet, despite how fun this episode was, it’s still only the second best BBC radio show on Mary, Queen of Scots.