CMD-D

By now you’ve read about CMD-D (or ⌘D), a conference on Mac and iOS automation created and hosted by Sal Soghoian and scheduled for August 9. There’s also a “scripting boot camp” the day before.

I like the idea of this and hope it’s successful, but I really dislike where it’s being held. I understand that Santa Clara is close to where most of the speakers live, and it’s probably easy to get an audience for this type of event there, but if there’s a place on Earth that doesn’t need to be taught the value of automation, it’s the Valley, where you can’t swing a cat without hitting a programmer. It’s like sending Catholic missionaries to Vatican City.

Maybe if this first one is successful, Sal will take it on the road and spread the Good News to the heathen.


Tables again

A couple of months ago, I wrote a couple of posts about my frustrations with including complex tables in my reports for work. I write my reports in Markdown (MultiMarkdown, actually) and export them to LaTeX before generating the PDFs I send to my clients. As I said in the first post, even though MultiMarkdown has facilities for handling certain complexities, like column spans, I often find myself editing the LaTeX of my tables directly to get the look I want. This gets the Markdown and LaTeX source out of sync with each other—not the worst problem in the world, but something I’d rather avoid.

The bigger problem is that LaTeX’s table syntax is awful, so filled with ampersands and braces that it’s nearly unreadable. And something I like to include in most of my tables—a column of right-aligned numbers centered under its header—is not part of standard LaTeX table syntax. Even packages like dcolumn and siunitx, which are supposed to handle this situation, have given me trouble, often moving columns well away from where they should be.

The solution I began toying with was to consider tables as graphic elements, to build them outside of the report text and import them with the \includegraphics command from LaTeX’s graphicx package. I outlined a few possibilities for building the table graphics in the first post, and presented an OmniGraffle example in the second. In that example, I used a rather fragile Keyboard Maestro macro to copy the data from a spreadsheet into an OmniGraffle table.

To be honest, this solution sucked. Not the formatting of the table in OmniGraffle; that was fine. But the importing macro was just too delicate, too easily thrown off by small changes in the original data format. I never felt comfortable using it in production.

Last week, I sent out a report with several tables using a faster and more robust technique. The steps are:

  1. Build the table in Numbers. Now, I use Jupyter and Pandas for most of my analytical work, so “building a table in Numbers” really means exporting the data from Jupyter as a CSV file and then opening it in Numbers.
  2. Don’t try to add too much formatting to the table in Numbers, but make sure the font is what I want in my report (Times 12, typically) and that the column alignment matches what I want for the bulk of the table. Like this:

    Table in Numbers

  3. Generate a PDF from the table using the standard Save as PDF… popup feature in the Print sheet.

    Save as PDF

  4. Open the new PDF in OmniGraffle, ungroup the elements, and delete all the crap: the gridlines, the page number, the big white background rectangle. Everything must go except the table data itself.

    Spreadsheet imported into OmniGraffle

  5. Now start grouping and spacing the data. At this point, I’m still working mostly by eye, but I’m keeping notes on what seems to look good for row spacing and the placement of horizontal rules.

    Final table in OmniGraffle

Eventually, I’ll turn these notes into macros or JXA scripts for automating some of the process, but I don’t expect everything to be automated. One of the reasons I’m doing this is to make tables that are tweaked to be better, more communicative, than the usual rigid grids.

Here’s the example in final form:

Example table

Note that the numeric columns are right-justified but are centered under their headings. The three percentage columns are clustered together, set apart from the Count column. Also, the percentage columns are evenly spaced in the body of the table but not in the headings. In a normal LaTeX table, the extra width of “estimate” would push its column away from the other two.[1] Finally, the first five rows are spaced uniformly and the last row is set off a bit from the others because the first five are individual defect criteria and the last is the union of the five.

In my recent report, I had several tables structured like the one you see above. Once I had the skeleton worked out, I could duplicate it and change out the data very quickly. Even the first table didn’t take very long to make, no longer than it would have taken to tweak and retweak a comparable table in LaTeX.

I haven’t quite worked out how I’m going to specify tables like this in Markdown, but the LaTeX for including one is easy:

tex:
\begin{table}[htbp]
\centering
\vspace{.125in}
\includegraphics{building-a-defects.pdf}
\vspace{-.0625in}
\caption{Defect summary for Building A.}
\label{counts}
\end{table}

The vspace above the graphic separates the table from the text above it by an amount about the same as what I get with a normal tabular table. Similarly, the vspace below the graphic shifts the caption to more or less the position it would have below a tabular table.
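
Until there’s a Markdown syntax for this, the boilerplate could be generated rather than retyped. What follows is only a sketch in Python 3: the function name table_graphic and its parameters are hypothetical, and the default spacings are simply the values from the example above.

```python
def table_graphic(pdf, caption, label, above='.125in', below='-.0625in'):
    """Return a LaTeX table float that wraps a PDF graphic of a table."""
    lines = [
        r'\begin{table}[htbp]',
        r'\centering',
        r'\vspace{%s}' % above,        # space between the text and the graphic
        r'\includegraphics{%s}' % pdf,
        r'\vspace{%s}' % below,        # pull the caption up toward the graphic
        r'\caption{%s}' % caption,
        r'\label{%s}' % label,
        r'\end{table}',
    ]
    return '\n'.join(lines)


if __name__ == '__main__':
    print(table_graphic('building-a-defects.pdf',
                        'Defect summary for Building A.', 'counts'))
```

Called with the file name, caption, and label from the example, it should reproduce the block shown above; the two spacing arguments are there because the right values will probably vary with the document class.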

An obvious downside to including tables as graphic elements is that the data are no longer in the source of the report. I don’t see this as a serious problem, as I’ll still have the data set in a spreadsheet and, even better, a CSV file. It’ll still be accessible to me and easy to share with others.

Now I need to work on a Markdown-like syntax for table graphics.


  1. Yes, I know how to get those columns spaced evenly in LaTeX, but it adds even more code to something that’s already messy. 


Scene from a marriage

Early morning in a bedroom in Kalorama. A cell phone lights up and vibrates on one of the nightstands.

I: Uhhh. Jared… Jared? He’s up already. Can you go over and settle him back down? Take the phone away.

J: Mmmf. I went yesterday. Your turn.

I: Pleeease? I was up late bookmarking Wikiquotes.

J: He’s your father.

I: Let’s not talk about fathers. At least mine never went to jail.

J: (mumbles) Not yet.

I: What?

J: I said “I’ll get it.” Call the car while I get dressed.


Rewriting

After a couple of weeks of using my pdfpages script and after seeing this tweet from John Gruber Sunday night,

I’d been putting it off for years, but I’ve switched to fish as my shell, and I’m loving it. Scripting syntax is beautiful:

/cc @drdrang

John Gruber (@gruber) May 7 2017 7:37 PM

I decided it was time to change some things that had been bugging me about how the script worked. Despite the cleanliness of Gruber’s version, I didn’t rewrite it as a fish script because I just don’t like the idea of writing complicated scripts in any shell language. As I said in the first pdfpages post:

This is about as complicated a shell script as I would ever want to write. If I find myself wanting to add features, I’ll probably rewrite it as a Python script using the subprocess module to run the mdls command.

The first change I wanted to make was to align the right edges of the page counts. The original version of pdfpages just dumped out the numbers, which left them left-aligned, like this:

$ pdfpages *.pdf
330     File 1.pdf
53      File 2.pdf
527     File 3.pdf
420     File 4.pdf
1330    Total

This is OK, but I prefer this:

$ pdfpages *.pdf
 330    File 1.pdf
  53    File 2.pdf
 527    File 3.pdf
 420    File 4.pdf
1330    Total

I also wanted an option that would sort the list of files according to their page counts:

$ pdfpages -s *.pdf
  53    File 2.pdf
 330    File 1.pdf
 420    File 4.pdf
 527    File 3.pdf
1330    Total

As I rewrote the script in Python, I tested it against a new set of PDFs recently sent to me by a client. During the test, I learned that there are PDFs that just don’t reveal their page counts to mdls. I don’t know why, but when you run

mdls -name kMDItemNumberOfPages

on these files, you get

kMDItemNumberOfPages = (null)

which isn’t very helpful. This didn’t lead to errors in my original pdfpages shell script, or in Gruber’s, because they’d just print out the “(null)” and treat it as a zero when calculating the page sum.

That didn’t make the original pdfpages output correct, though. I could open these problem files in Preview and see how many pages they had, and it wasn’t zero. Luckily, pdftk could figure out how many pages they had.

Should I just replace the calls to mdls with calls to pdftk? No, because pdftk is r e a l l y s l o w, especially for large files. I didn’t want a script that ran slowly on all files just to handle the rare files that mdls failed on.

The solution was to run mdls first and then run pdftk only if mdls failed. My first shot at this worked fine on my iMac at work, but crapped out on my ancient MacBook Air. I needed to change the way the output of pdftk was buffered to keep the lesser machine from being overwhelmed.

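These improvements to pdfpages had ballooned its size, so I figured it wouldn’t hurt to add a few more features: a -r switch to reverse the file list, a -t switch to suppress the totals, a count of the files reported along with the page total, and a -h switch to print a help message.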

When I was done (am I done yet?), my simple 15-line script had turned into this:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from docopt import docopt
 4:  import subprocess
 5:  
 6:  usage = """Usage:
 7:    pdfpages [-srth] FILE...
 8:  
 9:  Get the number of pages in one or more PDF files.
10:  
11:  Options:
12:    -s    sort the file list in ascending order of page count
13:    -r    reverse the file list
14:    -t    don't report the total numbers of pages and files
15:    -h    show this help message and exit
16:  
17:  If one file is given, return just the count of pages in that file.
18:  If more than one file is given, return a list of page counts and
19:  file names followed by the total and the number of files.
20:  
21:  The options are useful only when more than one file is given."""
22:  
23:  # Process the options and arguments.
24:  args = docopt(usage)
25:  nfiles = len(args['FILE'])
26:  
27:  # Get page count for PDF file f. Use mdls by default because it's
28:  # fast. Fall back to pdftk if mdls fails. Return 0 if nothing works.
29:  def pcount(f):
30:    mdlscmd = ['mdls', '-name', 'kMDItemNumberOfPages', '-raw', f] 
31:    try:
32:      count = int(subprocess.check_output(mdlscmd))
33:    except ValueError:
34:      try:
35:        pdftkcmd = ['pdftk', f, 'dump_data']
36:        proc = subprocess.Popen(pdftkcmd, stdout=subprocess.PIPE)
37:        for line in proc.stdout:
38:          if 'NumberOfPages' in line:
39:            count = int(line.split()[1])
40:            proc.terminate()
41:            break
42:      except:
43:        count = 0
44:    return count
45:  
46:  # Print just the page count for one argument. 
47:  if nfiles == 1:
48:    print pcount(args['FILE'][0])
49:  
50:  # List all the page counts and file names otherwise.
51:  else:
52:    # initialize
53:    sum = 0
54:    pages = []
55:    
56:    # collect all file info
57:    for f in args['FILE']:
58:      count = pcount(f)
59:      pages.append((count, f))
60:      sum += count
61:  
62:    # handle sorting
63:    if args['-s']:
64:      if args['-r']:
65:        pages.sort(reverse=True)
66:      else:
67:        pages.sort()
68:    else:
69:      if args['-r']:
70:        pages.reverse()
71:    
72:    # prepare for printing
73:    width = len(str(sum))
74:    fmt = '{{:{}d}}  {{}}'.format(width)
75:    
76:    # print results
77:    for p in pages:
78:      print fmt.format(*p)
79:    if not args['-t']:
80:      print fmt.format(sum, 'Pages')
81:      print fmt.format(nfiles, 'Files')

Despite the 5X increase in size, I think this version of pdfpages is easier to read than the original. The cute tricks and cryptic variables are gone. It does require installation of the nonstandard docopt library, but to me that just makes it clearer. An explicit usage message at the top of the source file (Lines 6–21) gives context to all the code below it.

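Line 24 runs docopt on the usage message to create the args dictionary. The keys of args are the switches (-s, -r, -t, and -h), each with a value of True or False according to whether it was given on the command line, and FILE, whose value is the list of filenames.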

The formatting of the usage message tells docopt that all the switches are optional and at least one filename argument is required. If pdfpages is invoked incorrectly, docopt halts the script and prints the first two lines of the usage message. If pdfpages is invoked with the -h switch, the entire usage message is printed.

The pcount function on Lines 29–44 is the workhorse of the script. It takes a single filename and returns the page count for that file. It uses the subprocess library to run mdls on the file and get its page count. If mdls returns a string that can’t be converted into an integer, a ValueError is raised and the script moves into the except block that starts on Line 33. This is where it runs the pdftk command and tests its output one line at a time. When it finds a line with “NumberOfPages” in it, it puts that value into the count variable and stops the pdftk process. I don’t think it’s necessary to have both the terminate on Line 40 and the break on Line 41, but the script seems to run faster when they’re both there.

If both mdls and pdftk fail to provide a page count, we call it a zero and move on. In the testing I’ve done so far, the only files that returned zero weren’t actually PDFs.
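
To make the fallback logic easier to see (and to test away from a Mac), the parsing can be pulled out of the subprocess calls. This is a sketch in Python 3, not the script above: the helper names parse_mdls and parse_pdftk are hypothetical, and I’m assuming pdftk reports the count on a line of the form “NumberOfPages: N”, which is what the split in the script relies on. Unlike the script, it returns zero explicitly when pdftk’s output has no such line.

```python
import subprocess


def parse_mdls(raw):
    """Convert raw mdls output to an int, or None if it isn't a number."""
    try:
        return int(raw)
    except ValueError:
        return None   # mdls printed "(null)" or something else unusable


def parse_pdftk(lines):
    """Scan pdftk dump_data output, one line at a time, for the page count."""
    for line in lines:
        if 'NumberOfPages' in line:
            return int(line.split()[1])
    return 0          # no page count found; call it zero


def pcount(f):
    """Page count for file f: mdls first, pdftk only if mdls fails."""
    mdlscmd = ['mdls', '-name', 'kMDItemNumberOfPages', '-raw', f]
    count = parse_mdls(subprocess.check_output(mdlscmd, text=True))
    if count is not None:
        return count
    pdftkcmd = ['pdftk', f, 'dump_data']
    try:
        with subprocess.Popen(pdftkcmd, stdout=subprocess.PIPE,
                              text=True) as proc:
            count = parse_pdftk(proc.stdout)
            proc.terminate()   # stop pdftk once we have what we need
        return count
    except (OSError, ValueError, IndexError):
        return 0               # pdftk missing or its output unparseable
```

Because parse_pdftk reads from the pipe lazily, it stops at the first matching line, which is the same early exit the script gets from its break.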

With pcount defined, the rest of the script is simple. If only one file argument is given to pdfpages, Line 48 runs pcount on that filename and prints the page count that’s returned.

If more than one file argument is given, Lines 57–60 run pcount on all of them, collect the output in a list of tuples, and keep a running sum of the total page count. Each tuple in the list consists of the page count and name of a file. Putting the page count first in the tuple means the sorting commands in Lines 63–70 will sort on that. The presence or absence of the -s and -r switches determines whether and how the list gets sorted.
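
Here’s a small illustration of why the count goes first, using the page counts from the example listing earlier in the post. It’s a Python 3 sketch; the arrange function is a hypothetical stand-in for the branching in Lines 63–70, with keyword arguments playing the roles of the -s and -r switches.

```python
pages = [(330, 'File 1.pdf'), (53, 'File 2.pdf'),
         (527, 'File 3.pdf'), (420, 'File 4.pdf')]


def arrange(pages, sort=False, reverse=False):
    """Return a copy of pages ordered the way -s and -r would order it."""
    result = list(pages)
    if sort:
        result.sort(reverse=reverse)   # tuples compare on page count first
    elif reverse:
        result.reverse()               # just flip the command-line order
    return result


print(arrange(pages, sort=True))
print(arrange(pages, reverse=True))
```

With sort=True, the 53-page file comes first and the 527-page file last; with only reverse=True, the list is flipped from the order given on the command line and the counts play no part.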

The total number of pages, sum, will be the largest number we have to print. Line 73 determines its width, and Line 74 creates a formatting string we use to print out all the lines of output. Because we’re using the format function to create a string that will be used in a later format command, some of the braces have to be doubled in order to show up as single braces in the fmt string.
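
The doubled braces are easier to follow with actual numbers. This snippet (Python 3, with the total and two of the rows borrowed from the example listing earlier in the post) builds the format string the same way Line 74 does and then uses it.

```python
total = 1330
width = len(str(total))                 # the total has 4 digits
fmt = '{{:{}d}}  {{}}'.format(width)    # doubled braces survive as single ones

rows = [(53, 'File 2.pdf'), (330, 'File 1.pdf')]
lines = [fmt.format(*row) for row in rows]
lines.append(fmt.format(total, 'Pages'))
print(fmt)
print('\n'.join(lines))
```

After the first format call, fmt is the string '{:4d}  {}', so every count is padded on the left to the width of the total.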

Finally, Lines 77–78 print out all the file info and Lines 79–81 print out the page and file counts if they weren’t suppressed by the -t switch.

What I expect will be my typical use of pdfpages is to cd into the top of a complicated directory tree of documents and run

find . -iname "*.pdf" -print0 | xargs -0 pdfpages

A few seconds later, I’ll have the list and the totals. I just did it for a set of documents sent to me last week: 17,200 pages in just over 600 PDF files.
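
If I ever wanted that search inside a Python script instead of a pipeline, something like the following would probably do. It’s a Python 3 sketch under my own assumptions: find_pdfs is a hypothetical name, and unlike the find command above it skips directories that happen to be named like PDFs.

```python
from pathlib import Path


def find_pdfs(top):
    """Return every file under top whose extension is .pdf, in any case."""
    return sorted(str(p) for p in Path(top).rglob('*')
                  if p.is_file() and p.suffix.lower() == '.pdf')


if __name__ == '__main__':
    for name in find_pdfs('.'):
        print(name)
```

Lowercasing the suffix is what stands in for the case-insensitive matching of -iname.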

Should I have started pdfpages as a Python script instead of a shell script? Maybe, but I don’t think the effort I put into the original pdfpages was wasted. I had to learn how mdls and pdftk worked, and that’s more efficiently done in a shell than in a subprocess call. Rewriting a script in a different language isn’t all that hard, especially when the language you’re moving to is more familiar to you. What’s hard—what’s always hard—is deciding what the script should do and then, after you’ve used it a while, deciding that your first decision was wrong.