Rewriting
May 11, 2017 at 10:33 AM by Dr. Drang
After a couple of weeks of using my pdfpages script and after seeing this tweet from John Gruber Sunday night,
I’d been putting it off for years, but I’ve switched to fish as my shell, and I’m loving it. Scripting syntax is beautiful:
/cc @drdrang
— John Gruber (@gruber) May 7 2017 7:37 PM
I decided it was time to change some things that had been bugging me about how the script worked. Despite the cleanliness of Gruber’s version (you can click on the image to see it bigger), I didn’t rewrite it as a fish script because I just don’t like the idea of writing complicated scripts in any shell language. As I said in the first pdfpages post:
This is about as complicated a shell script as I would ever want to write. If I find myself wanting to add features, I’ll probably rewrite it as a Python script using the subprocess module to run the mdls command.
The first change I wanted to make was to align the right edges of the page counts. The original version of pdfpages just dumped out the numbers, which left them left-aligned, like this:
$ pdfpages *.pdf
330 File 1.pdf
53 File 2.pdf
527 File 3.pdf
420 File 4.pdf
1330 Total
This is OK, but I prefer this:
$ pdfpages *.pdf
 330 File 1.pdf
  53 File 2.pdf
 527 File 3.pdf
 420 File 4.pdf
1330 Total
I also wanted an option that would sort the list of files according to their page counts:
$ pdfpages -s *.pdf
  53 File 2.pdf
 330 File 1.pdf
 420 File 4.pdf
 527 File 3.pdf
1330 Total
As I rewrote the script in Python, I tested it against a new set of PDFs recently sent to me by a client. During the test, I learned that there are PDFs that just don’t reveal their page counts to mdls. I don’t know why, but when you run
mdls -name kMDItemNumberOfPages
on these files, you get
kMDItemNumberOfPages = (null)
which isn’t very helpful. This didn’t lead to errors in my original pdfpages shell script, or in Gruber’s, because they’d just print out the “(null)” and treat it as a zero when calculating the page sum.
That didn’t make the original pdfpages output correct, though. I could open these problem files in Preview and see how many pages they had, and it wasn’t zero. Luckily, pdftk could figure out how many pages they had.
Should I just replace the calls to mdls with calls to pdftk? No, because pdftk is r e a l l y s l o w, especially for large files. I didn’t want a script that ran slowly on all files just to handle the rare files that mdls failed on.
The solution was to run mdls first and then run pdftk only if mdls failed. My first shot at this worked fine on my iMac at work, but crapped out on my ancient MacBook Air. I needed to change the way the output of pdftk was buffered to keep the lesser machine from being overwhelmed.
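The fast-first, slow-fallback idea can be sketched on its own. This is a minimal Python 3 sketch, not the script's code: first_count, fake_mdls, and fake_pdftk are hypothetical names, and the two stand-in functions only imitate the kind of text the real commands report.

```python
def first_count(path, sources):
    """Return the first page count that any source can supply, else 0.

    Each source is a callable that takes a path and returns text that
    should be an integer; sources are tried in order, cheapest first.
    """
    for source in sources:
        try:
            return int(source(path))
        except ValueError:
            # Text such as "(null)" is not a number; try the next source.
            continue
    return 0


# Hypothetical stand-ins for the real commands, for illustration only.
def fake_mdls(path):
    return "(null)" if path == "odd.pdf" else "330"


def fake_pdftk(path):
    return "53"


print(first_count("plain.pdf", [fake_mdls, fake_pdftk]))  # fast source answers
print(first_count("odd.pdf", [fake_mdls, fake_pdftk]))    # falls back to the slow one
```

Ordering the sources from cheapest to most expensive means the slow command is only paid for on the files that need it.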
These improvements to pdfpages had ballooned its size, so I figured it wouldn’t hurt to add a few more features:
- An option to reverse the sort order of the files.
- A line of output giving the number of files at the end of the file list.
- An option to suppress the reporting of the file and page count totals.
When I was done (am I done yet?), my simple 15-line script had turned into this:
python:
 1: #!/usr/bin/env python
 2:
 3: from docopt import docopt
 4: import subprocess
 5:
 6: usage = """Usage:
 7:   pdfpages [-srth] FILE...
 8:
 9: Get the number of pages in one or more PDF files.
10:
11: Options:
12:   -s   sort the file list in ascending order of page count
13:   -r   reverse the file list
14:   -t   don't report the total numbers of pages and files
15:   -h   show this help message and exit
16:
17: If one file is given, return just the count of pages in that file.
18: If more than one file is given, return a list of page counts and
19: file names followed by the total and the number of files.
20:
21: The options are useful only when more than one file is given."""
22:
23: # Process the options and arguments.
24: args = docopt(usage)
25: nfiles = len(args['FILE'])
26:
27: # Get page count for PDF file f. Use mdls by default because it's
28: # fast. Fall back to pdftk if mdls fails. Return 0 if nothing works.
29: def pcount(f):
30:   mdlscmd = ['mdls', '-name', 'kMDItemNumberOfPages', '-raw', f]
31:   try:
32:     count = int(subprocess.check_output(mdlscmd))
33:   except ValueError:
34:     try:
35:       pdftkcmd = ['pdftk', f, 'dump_data']
36:       proc = subprocess.Popen(pdftkcmd, stdout=subprocess.PIPE)
37:       for line in proc.stdout:
38:         if 'NumberOfPages' in line:
39:           count = int(line.split()[1])
40:           proc.terminate()
41:           break
42:     except:
43:       count = 0
44:   return count
45:
46: # Print just the page count for one argument.
47: if nfiles == 1:
48:   print pcount(args['FILE'][0])
49:
50: # List all the page counts and file names otherwise.
51: else:
52:   # initialize
53:   sum = 0
54:   pages = []
55:
56:   # collect all file info
57:   for f in args['FILE']:
58:     count = pcount(f)
59:     pages.append((count, f))
60:     sum += count
61:
62:   # handle sorting
63:   if args['-s']:
64:     if args['-r']:
65:       pages.sort(reverse=True)
66:     else:
67:       pages.sort()
68:   else:
69:     if args['-r']:
70:       pages.reverse()
71:
72:   # prepare for printing
73:   width = len(str(sum))
74:   fmt = '{{:{}d}} {{}}'.format(width)
75:
76:   # print results
77:   for p in pages:
78:     print fmt.format(*p)
79:   if not args['-t']:
80:     print fmt.format(sum, 'Pages')
81:     print fmt.format(nfiles, 'Files')
Despite the 5X increase in size, I think this version of pdfpages is easier to read than the original. The cute tricks and cryptic variables are gone. It does require installation of the nonstandard docopt library, but to me that just makes it clearer. An explicit usage message at the top of the source file (Lines 6–21) gives context to all the code below it.
Line 24 runs docopt on the usage message to create the args dictionary. The keys of args are
- the various switches, the values of which are True or False, depending on whether the switch was invoked on the command line; and
- FILE, the value of which is a list of all the filename arguments.
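To make the shape of args concrete, here is roughly what the dictionary should look like for the command pdfpages -s a.pdf b.pdf. The dictionary is typed out by hand rather than produced by docopt, so the sketch runs without the library installed; the file names are made up.

```python
# Hand-written stand-in for the dictionary docopt would build from
#   pdfpages -s a.pdf b.pdf
# (the file names are hypothetical).
args = {
    '-s': True,     # given on the command line
    '-r': False,    # not given
    '-t': False,    # not given
    '-h': False,    # not given
    'FILE': ['a.pdf', 'b.pdf'],
}

nfiles = len(args['FILE'])
print(nfiles)        # number of filename arguments
print(args['-s'])    # whether sorting was requested
```

Because every switch is always present as a key, the rest of the script can test args['-s'] and friends directly without checking whether the key exists.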
The formatting of the usage message tells docopt that all the switches are optional and at least one filename argument is required. If pdfpages is invoked incorrectly, docopt halts the script and prints the first two lines of the usage message. If pdfpages is invoked with the -h switch, the entire usage message is printed.
The pcount function on Lines 29–44 is the workhorse of the script. It takes a single filename and returns the page count for that file. It uses the subprocess library to run mdls on the file and get its page count. If mdls returns a string that can’t be converted into an integer, a ValueError is raised and the script moves into the except block that starts on Line 33. This is where it runs the pdftk command and tests its output one line at a time. When it finds a line with “NumberOfPages” in it, it puts that value into the count variable and stops the pdftk process. I don’t think it’s necessary to have both the terminate on Line 40 and the break on Line 41, but the script seems to run faster when they’re both there.
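The line-by-line scan can be tried without pdftk by feeding it canned text. In this Python 3 sketch, pages_from_dump is a hypothetical helper, and the sample lines are made up to imitate the general look of dump_data output; the real output has many more fields.

```python
def pages_from_dump(lines):
    """Return the page count from dump_data-style lines, or None if absent."""
    for line in lines:
        if 'NumberOfPages' in line:
            # The line looks like "NumberOfPages: 527"; the count is the
            # second whitespace-separated word.
            return int(line.split()[1])
    return None


# Made-up sample text, not captured from a real pdftk run.
sample = [
    "InfoKey: Title",
    "InfoValue: Example",
    "NumberOfPages: 527",
    "InfoKey: Author",
]

print(pages_from_dump(sample))
```

Returning as soon as the line turns up plays the role of the break in the script: nothing after the matching line is read.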
If both mdls and pdftk fail to provide a page count, we call it a zero and move on. In the testing I’ve done so far, the only files that returned zero weren’t actually PDFs.
With pcount defined, the rest of the script is simple. If only one file argument is given to pdfpages, Line 48 runs pcount on that filename and prints the page count that’s returned.
If more than one file argument is given, Lines 57–60 run pcount on all of them, collect the output in a list of tuples, and keep a running sum of the total page count. Each tuple in the list consists of the page count and name of a file. Putting the page count first in the tuple means the sorting commands in Lines 63–70 will sort on that. The presence or absence of the -s and -r switches determines whether and how the list gets sorted.
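The effect of the two switches on the list of tuples can be sketched in isolation. This is a Python 3 illustration under assumed names: arrange is a hypothetical function, not part of the script, and the data are the counts from the example output near the top of the post.

```python
# (page count, file name) tuples in the order the files were given.
pages = [(330, 'File 1.pdf'), (53, 'File 2.pdf'),
         (527, 'File 3.pdf'), (420, 'File 4.pdf')]


def arrange(pages, sort=False, reverse=False):
    """Return a new list ordered the way the -s and -r switches request."""
    if sort:
        # Tuples compare element by element, so the page count decides first.
        return sorted(pages, reverse=reverse)
    if reverse:
        # Without -s, -r simply flips the order the files were given in.
        return pages[::-1]
    return list(pages)


print(arrange(pages, sort=True))                # ascending page count
print(arrange(pages, sort=True, reverse=True))  # descending page count
print(arrange(pages, reverse=True))             # original order, flipped
```

If two files have the same page count, the tuple comparison moves on to the file names, so ties come out in alphabetical order.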
The total number of pages, sum, will be the largest number we have to print. Line 73 determines its width, and Line 74 creates a formatting string we use to print out all the lines of output. Because we’re using the format function to create a string that will be used in a later format command, some of the braces have to be doubled in order to show up as single braces in the fmt string.
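The two-stage formatting is easier to see with the intermediate string printed out. A small sketch, using the total from the example output as an assumed value:

```python
total = 1330
width = len(str(total))

# The first format call fills the lone {} with the width and collapses
# each doubled brace into a single one, leaving a reusable template.
fmt = '{{:{}d}} {{}}'.format(width)
print(fmt)

# The template right-aligns each count in a field as wide as the total.
print(fmt.format(53, 'File 2.pdf'))
print(fmt.format(total, 'Total'))
```

Since the field width comes from the total, every count lines up on its right edge no matter how many digits the largest number has.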
Finally, Lines 77–78 print out all the file info and Lines 79–81 print out the page and file counts if they weren’t suppressed by the -t switch.
What I expect will be my typical use of pdfpages is to cd into the top of a complicated directory tree of documents and run
find . -iname "*.pdf" -print0 | xargs -0 pdfpages
A few seconds later, I’ll have the list and the totals. I just did it for a set of documents sent to me last week: 17,200 pages in just over 600 PDF files.
Should I have started pdfpages as a Python script instead of a shell script? Maybe, but I don’t think the effort I put into the original pdfpages was wasted. I had to learn how mdls and pdftk worked, and that’s more efficiently done in a shell than in a subprocess call. Rewriting a script in a different language isn’t all that hard, especially when the language you’re moving to is more familiar to you. What’s hard, what’s always hard, is deciding what the script should do and then, after you’ve used it a while, deciding that your first decision was wrong.