Rewriting

After a couple of weeks of using my pdfpages script and after seeing this tweet from John Gruber Sunday night,

I’d been putting it off for years, but I’ve switched to fish as my shell, and I’m loving it. Scripting syntax is beautiful:

/cc @drdrang

John Gruber (@gruber) May 7 2017 7:37 PM

I decided it was time to change some things that had been bugging me about how the script worked. Despite the cleanliness of Gruber’s version, I didn’t rewrite it as a fish script because I just don’t like the idea of writing complicated scripts in any shell language. As I said in the first pdfpages post:

This is about as complicated a shell script as I would ever want to write. If I find myself wanting to add features, I’ll probably rewrite it as a Python script using the subprocess module to run the mdls command.

The first change I wanted to make was to align the right edges of the page counts. The original version of pdfpages just dumped out the numbers, which left them left-aligned, like this:

$ pdfpages *.pdf
330     File 1.pdf
53      File 2.pdf
527     File 3.pdf
420     File 4.pdf
1330    Total

This is OK, but I prefer this:

$ pdfpages *.pdf
 330    File 1.pdf
  53    File 2.pdf
 527    File 3.pdf
 420    File 4.pdf
1330    Total

I also wanted an option that would sort the list of files according to their page counts:

$ pdfpages -s *.pdf
  53    File 2.pdf
 330    File 1.pdf
 420    File 4.pdf
 527    File 3.pdf
1330    Total

As I rewrote the script in Python, I tested it against a new set of PDFs recently sent to me by a client. During the test, I learned that there are PDFs that just don’t reveal their page counts to mdls. I don’t know why, but when you run

mdls -name kMDItemNumberOfPages

on these files, you get

kMDItemNumberOfPages = (null)

which isn’t very helpful. This didn’t lead to errors in my original pdfpages shell script, or in Gruber’s, because they’d just print out the “(null)” and treat it as a zero when calculating the page sum.

That didn’t make the original pdfpages output correct, though. I could open these problem files in Preview and see how many pages they had, and it wasn’t zero. Luckily, pdftk could figure out how many pages they had.

Should I just replace the calls to mdls with calls to pdftk? No, because pdftk is r e a l l y s l o w, especially for large files. I didn’t want a script that ran slowly on all files just to handle the rare files that mdls failed on.

The solution was to run mdls first and then run pdftk only if mdls failed. My first shot at this worked fine on my iMac at work, but crapped out on my ancient MacBook Air. I needed to change the way the output of pdftk was buffered to keep the lesser machine from being overwhelmed.

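These improvements to pdfpages had ballooned its size, so I figured it wouldn’t hurt to add a few more features: a -r switch to reverse the file list, a -t switch to suppress the totals, a count of the files alongside the page total, and a -h switch to show a help message.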

When I was done (am I done yet?), my simple 15-line script had turned into this:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from docopt import docopt
 4:  import subprocess
 5:  
 6:  usage = """Usage:
 7:    pdfpages [-srth] FILE...
 8:  
 9:  Get the number of pages in one or more PDF files.
10:  
11:  Options:
12:    -s    sort the file list in ascending order of page count
13:    -r    reverse the file list
14:    -t    don't report the total numbers of pages and files
15:    -h    show this help message and exit
16:  
17:  If one file is given, return just the count of pages in that file.
18:  If more than one file is given, return a list of page counts and
19:  file names followed by the total and the number of files.
20:  
21:  The options are useful only when more than one file is given."""
22:  
23:  # Process the options and arguments.
24:  args = docopt(usage)
25:  nfiles = len(args['FILE'])
26:  
27:  # Get page count for PDF file f. Use mdls by default because it's
28:  # fast. Fall back to pdftk if mdls fails. Return 0 if nothing works.
29:  def pcount(f):
30:    mdlscmd = ['mdls', '-name', 'kMDItemNumberOfPages', '-raw', f] 
31:    try:
32:      count = int(subprocess.check_output(mdlscmd))
33:    except ValueError:
34:      try:
35:        pdftkcmd = ['pdftk', f, 'dump_data']
36:        proc = subprocess.Popen(pdftkcmd, stdout=subprocess.PIPE)
37:        for line in proc.stdout:
38:          if 'NumberOfPages' in line:
39:            count = int(line.split()[1])
40:            proc.terminate()
41:            break
42:      except:
43:        count = 0
44:    return count
45:  
46:  # Print just the page count for one argument. 
47:  if nfiles == 1:
48:    print pcount(args['FILE'][0])
49:  
50:  # List all the page counts and file names otherwise.
51:  else:
52:    # initialize
53:    sum = 0
54:    pages = []
55:    
56:    # collect all file info
57:    for f in args['FILE']:
58:      count = pcount(f)
59:      pages.append((count, f))
60:      sum += count
61:  
62:    # handle sorting
63:    if args['-s']:
64:      if args['-r']:
65:        pages.sort(reverse=True)
66:      else:
67:        pages.sort()
68:    else:
69:      if args['-r']:
70:        pages.reverse()
71:    
72:    # prepare for printing
73:    width = len(str(sum))
74:    fmt = '{{:{}d}}  {{}}'.format(width)
75:    
76:    # print results
77:    for p in pages:
78:      print fmt.format(*p)
79:    if not args['-t']:
80:      print fmt.format(sum, 'Pages')
81:      print fmt.format(nfiles, 'Files')

Despite the 5X increase in size, I think this version of pdfpages is easier to read than the original. The cute tricks and cryptic variables are gone. It does require installation of the nonstandard docopt library, but to me that just makes it clearer. An explicit usage message at the top of the source file (Lines 6–21) gives context to all the code below it.

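Line 24 runs docopt on the usage message to create the args dictionary. The keys of args are the four switches ('-s', '-r', '-t', and '-h'), each mapped to True or False, and 'FILE', which is mapped to the list of filename arguments.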

The formatting of the usage message tells docopt that all the switches are optional and at least one filename argument is required. If pdfpages is invoked incorrectly, docopt halts the script and prints the first two lines of the usage message. If pdfpages is invoked with the -h switch, the entire usage message is printed.
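To make the shape of args concrete, here is the kind of dictionary docopt would be expected to hand back for `pdfpages -s "File 1.pdf" "File 2.pdf"`. It is written out by hand rather than produced by calling docopt, so treat it as a sketch of the structure and not as captured output; the file names and the `sort_requested` variable are made up for the example.

```python
# Hand-written stand-in for the result of docopt(usage) when the script
# is run as: pdfpages -s "File 1.pdf" "File 2.pdf"
args = {
    '-s': True,    # switch given on the command line
    '-r': False,   # switches that were not given come back as False
    '-t': False,
    '-h': False,
    'FILE': ['File 1.pdf', 'File 2.pdf'],  # FILE... collects a list
}

nfiles = len(args['FILE'])      # the same lookup the script makes on Line 25
sort_requested = args['-s']     # hypothetical name, for illustration only
```

Everything after Line 24 in the script is just lookups into a dictionary like this one.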

The pcount function on Lines 29–44 is the workhorse of the script. It takes a single filename and returns the page count for that file. It uses the subprocess library to run mdls on the file and get its page count. If mdls returns a string that can’t be converted into an integer, a ValueError is raised and the script moves into the except block that starts on Line 33. This is where it runs the pdftk command and tests its output one line at a time. When it finds a line with “NumberOfPages” in it, it puts that value into the count variable and stops the pdftk process. I don’t think it’s necessary to have both the terminate on Line 40 and the break on Line 41, but the script seems to run faster when they’re both there.

If both mdls and pdftk fail to provide a page count, we call it a zero and move on. In the testing I’ve done so far, the only files that returned zero weren’t actually PDFs.
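The same fast-path-then-fallback idea can be sketched in Python 3 (the listing above is Python 2). This is an illustrative variant, not the script itself: the helpers `parse_mdls` and `parse_pdftk` are hypothetical names introduced here so the parsing can be checked without running either command, and the sketch assumes that a missing or failing command should count as "no answer" and not be treated as an error.

```python
import subprocess

def parse_mdls(raw):
    """Page count from `mdls -raw` output, or None for '(null)' and the like."""
    try:
        return int(raw)
    except ValueError:
        return None

def parse_pdftk(lines):
    """Page count from the first 'NumberOfPages' line, or None if there is none."""
    for line in lines:
        if 'NumberOfPages' in line:
            return int(line.split()[1])
    return None

def pcount(f):
    """Try mdls first because it is fast; fall back to pdftk; return 0 if both fail."""
    count = None
    try:
        raw = subprocess.check_output(
            ['mdls', '-name', 'kMDItemNumberOfPages', '-raw', f], text=True)
        count = parse_mdls(raw)
    except (OSError, subprocess.CalledProcessError):
        pass  # mdls is missing or exited with an error
    if count is None:
        try:
            proc = subprocess.Popen(['pdftk', f, 'dump_data'], text=True,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.DEVNULL)
            try:
                # Read one line at a time and stop at the first match.
                count = parse_pdftk(proc.stdout)
            finally:
                proc.terminate()
                proc.stdout.close()
                proc.wait()
        except (OSError, ValueError, IndexError):
            pass  # pdftk is missing or its output was not as expected
    return count or 0
```

Starting from `count = None` means the function always has a defined value to return, including the case where pdftk runs but never prints a NumberOfPages line.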

With pcount defined, the rest of the script is simple. If only one file argument is given to pdfpages, Line 48 runs pcount on that filename and prints the page count that’s returned.

If more than one file argument is given, Lines 57–60 run pcount on all of them, collect the output in a list of tuples, and keep a running sum of the total page count. Each tuple in the list consists of the page count and name of a file. Putting the page count first in the tuple means the sorting commands in Lines 63–70 will sort on that. The presence or absence of the -s and -r switches determines whether and how the list gets sorted.
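The effect of the two switches on a list of (count, name) tuples can be tried in isolation. The function name `order_pages` and its keyword arguments are made up for this sketch, and it returns a new list where the script sorts in place; the sample data are the four files from the earlier example output.

```python
def order_pages(pages, sort=False, reverse=False):
    """Order (count, name) tuples the way the -s and -r switches would."""
    if sort:
        # Tuples compare element by element, so the page count decides
        # the order and the file name only breaks ties.
        return sorted(pages, reverse=reverse)
    if reverse:
        return pages[::-1]   # no sorting, just flip the given order
    return list(pages)

pages = [(330, 'File 1.pdf'), (53, 'File 2.pdf'),
         (527, 'File 3.pdf'), (420, 'File 4.pdf')]

ascending = order_pages(pages, sort=True)
descending = order_pages(pages, sort=True, reverse=True)
flipped = order_pages(pages, reverse=True)
```

With these inputs, `ascending` starts with File 2.pdf, `descending` starts with File 3.pdf, and `flipped` starts with File 4.pdf because it simply reverses the order the files were given in.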

The total number of pages, sum, will be the largest number we have to print. Line 73 determines its width, and Line 74 creates a formatting string we use to print out all the lines of output. Because we’re using the format function to create a string that will be used in a later format command, some of the braces have to be doubled in order to show up as single braces in the fmt string.
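The brace doubling is easier to see with the intermediate string spelled out. A small sketch using the total from the earlier example (the variable names `total` and `line` are chosen for the example):

```python
total = 1330
width = len(str(total))               # 4 digits in the total

# First format call: '{{' and '}}' collapse to literal braces,
# while the lone '{}' is filled in with the width.
fmt = '{{:{}d}}  {{}}'.format(width)  # fmt is now '{:4d}  {}'

# Second format call: right-align the count in a 4-character field.
line = fmt.format(53, 'File 2.pdf')   # '  53  File 2.pdf'
```

Because the width comes from the total, every individual count is guaranteed to fit in the field.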

Finally, Lines 77–78 print out all the file info and Lines 79–81 print out the page and file counts if they weren’t suppressed by the -t switch.

What I expect will be my typical use of pdfpages is to cd into the top of a complicated directory tree of documents and run

find . -iname "*.pdf" -print0 | xargs -0 pdfpages

A few seconds later, I’ll have the list and the totals. I just did it for a set of documents sent to me last week: 17,200 pages in just over 600 PDF files.

Should I have started pdfpages as a Python script instead of a shell script? Maybe, but I don’t think the effort I put into the original pdfpages was wasted. I had to learn how mdls and pdftk worked, and that’s more efficiently done in a shell than in a subprocess call. Rewriting a script in a different language isn’t all that hard, especially when the language you’re moving to is more familiar to you. What’s hard—what’s always hard—is deciding what the script should do and then, after you’ve used it a while, deciding that your first decision was wrong.