Reducing the size of large PDFs

This morning I was writing a report and realized that it was going to be too big to email to the client on Monday. The problem was that it contained 15–20 graphs, each of which was around 2 MB. When this has happened before, I’ve just sent the client a Dropbox link or used whatever Apple Mail does to deal with large attachments. But today I decided to fix the problem.

Like all of the graphs I make for work, these were built in Python using Matplotlib. And although there is a fair amount of data being plotted in each graph, it’s always seemed to me that the PDF files produced were a lot bigger than they should be. My search for ways to reduce their size returned lots of web pages that will thin your PDFs for you, but I had no interest in uploading my files to some possibly sketchy site. The trick to getting the kind of answer I wanted was adding “ghostscript” to my search terms.

The solution came from adapting a decade-old Gist from Guilherme Rodrigues. The result was this shell script, which I named reduceMPL:

 1:  #!/usr/bin/env bash
 3:  # Reduces the size of PDF plots created by Matplotlib.
 4:  # Assumes that the files to be reduced are named mpl[Something].pdf
 5:  # and that it's called via
 6:  #
 7:  #     reduceMPL mpl*.pdf
 8:  #
 9:  # The results are a set of smaller files named [Something].pdf
10:  # The original files are *not* deleted.
12:  for mpl do
13:    new=$(cut -c 4- <<< "$mpl")
14:    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$new" "$mpl"
15:  done

The key, of course, is Line 14, which opens the fat file in Ghostscript and spits out a skinny version. How skinny? My 1.9 MB inputs were turned into 23–25 KB outputs. That’s 25 kilobytes, the kind of file size you see only in plain text files nowadays. I could easily store all my graphs for this report on a 3½″ diskette, if I still had any 3½″ diskettes. And I don’t see any difference between the originals and the smaller versions.
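If you want numbers rather than an eyeball comparison, something like the following sketch would report the before and after sizes. The function names size_bytes and compare_sizes are made up for this illustration and are not part of reduceMPL; the sketch also assumes the original file is not empty, since it divides by its size.

```shell
#!/usr/bin/env bash

# Hypothetical helper: size of a file in bytes.
# wc -c may pad its output with spaces on some systems, so strip them.
size_bytes() {
  wc -c < "$1" | tr -d ' '
}

# Hypothetical helper: report the sizes of an original and a reduced file,
# plus the reduced size as an integer percentage of the original.
compare_sizes() {
  local before after
  before=$(size_bytes "$1")
  after=$(size_bytes "$2")
  printf '%s: %d bytes -> %s: %d bytes (%d%% of original)\n' \
    "$1" "$before" "$2" "$after" $(( 100 * after / before ))
}

# Example usage, assuming both files exist in the current directory:
#   compare_sizes mplHoopStress.pdf HoopStress.pdf
```

The percentage uses integer arithmetic, so it is rounded down.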

The script assumes the input files will be prefixed with “mpl,” and the output files will have the same name but with the “mpl” prefix stripped off. For example, mplHoopStress.pdf will lead to the much smaller HoopStress.pdf. All I had to do to meet this assumption was alter one line in the Python code that generates the graphs. So when I have a bunch of graphs that need thinning, a simple

reduceMPL mpl*.pdf

is all I need to get them converted. As the comment block at the top of the script says, reduceMPL does not delete the original files.

There are a couple of other interesting things in the script. First, you’ll note that Line 12 is

for mpl do

instead of

for mpl in "$@"; do

I learned how to do this just a few days ago from a hint on the Bash Pitfalls page of Greg’s Wiki.[1] What makes it nice is that it loops through the arguments without my having to remember whether the variable to use there is $@ or $*. And it handles the quoting of the variable automatically, too.
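To see why the quoting matters, here is a small sketch that is separate from reduceMPL. The function names show_args and show_args_unquoted are hypothetical, and the file name with a space in it is invented for the demonstration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each argument on its own line, in brackets.
# "for arg do" behaves like: for arg in "$@"; do
show_args() {
  for arg do
    printf '[%s]\n' "$arg"
  done
}

# Hypothetical helper: the same loop written with an unquoted $*,
# which lets the shell split arguments on whitespace.
show_args_unquoted() {
  for arg in $*; do
    printf '[%s]\n' "$arg"
  done
}

echo "Implicit argument list:"
show_args "mplHoop Stress.pdf" mplStrain.pdf

echo "Unquoted \$*:"
show_args_unquoted "mplHoop Stress.pdf" mplStrain.pdf
```

The first loop sees two arguments, with the space in the first file name preserved. The second sees three, because the unquoted expansion splits the first name at the space.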

The other trick was using cut with a here-string to get the new file name from the original. I wouldn’t have thought to use a here-string if not for Jason Snell’s recent post reminding me of how useful they can be.
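For comparison, here is a sketch of the cut approach next to an alternative that the script does not use: Bash’s prefix-stripping parameter expansion. The variable names new_cut and new_expansion are invented for this example.

```shell
#!/usr/bin/env bash

mpl="mplHoopStress.pdf"

# The approach used in reduceMPL: feed the name to cut through a
# here-string and keep everything from the fourth character onward.
new_cut=$(cut -c 4- <<< "$mpl")

# An alternative (not what the script uses): remove a literal leading
# "mpl" with parameter expansion, with no external command involved.
new_expansion=${mpl#mpl}

printf '%s\n' "$new_cut" "$new_expansion"
```

Both lines of output are the same for this input. The two differ when a name doesn’t start with “mpl”: cut drops the first three characters no matter what they are, while the expansion leaves such a name unchanged.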

I’m still not sure why Matplotlib produces such overstuffed PDFs, but at least I now have a way to fix them.

  1. It was linked to in this tweet from John Cook’s excellent Unix tool tips Twitter account.