Dealing with a recalcitrant PDF

As part of the project at work that’s been keeping me from blogging recently, I received a multipage PDF containing the scans of a set of drawings for a building. The building was designed back in the 80s, so there are no native electronic versions of these drawings—they were all done by hand, so they had to be scanned. Unfortunately, Preview choked on the file. It would open it, but any attempt I made to zoom, pan, or flip to another page would set the beachball spinning for 30 seconds or more. This made it impossible to work with, so I set about fixing it.

My first thought was that the number of pages in the document—nearly 100—was the problem. Breaking it into a bunch of single-page files seemed like a good idea, so I put it in a folder of its own and ran the command-line version of PDFtk with the burst operation:

pdftk drawings.pdf burst

This gave me a set of new files named pg_0001.pdf, pg_0002.pdf, and so on, one for every page in the original document. But even with the document broken up like this, trying to pan and zoom within any of these single-page documents also brought up the spinning beachball. The drawings were just black-and-white, but they were pretty big: about 6400×4500 pixels. My guess is that whatever compression technique was used to create the PDF is something that Preview isn’t very good at dealing with at that size.

So I decided to turn the files into JPEGs, which I knew Preview could handle smoothly. My tool of choice for this is sips, Apple’s scriptable image processing system. But because the default behavior of sips is to change files in-place, and because it doesn’t rename the files to reflect their new format, I first renamed the files to give them a .jpg extension, using an old Perl script I’d cribbed from Larry Wall:

rename 's/\.pdf$/.jpg/' pg*

With the files’ extensions set to their future format, I turned them into JPEGs with

sips -s format jpeg pg*

It took a while to convert all those files, but when it was done I had a set of JPEGs in which I could pan and zoom with no delays.
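The rename-first step can probably be avoided by giving sips an explicit output path with its --out option, which also leaves the original PDFs untouched. A minimal sketch, assuming the burst files sit in the current directory; convert_pages is a made-up helper name, not part of sips:

```shell
# Convert each burst page to JPEG, writing a new file next to the PDF.
# convert_pages is a hypothetical helper; sips and --out are the real tool and option.
convert_pages() {
    for f in pg_*.pdf; do
        [ -e "$f" ] || continue                        # glob matched nothing
        sips -s format jpeg "$f" --out "${f%.pdf}.jpg" # pg_0001.pdf -> pg_0001.jpg
    done
}

# Run only where sips is available (macOS).
if command -v sips >/dev/null 2>&1; then
    convert_pages
fi
```

The trade-off is disk space: both the PDFs and the JPEGs are kept until the PDFs are deleted by hand.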

There was just one problem left to be solved: the file names gave no indication of what each file contained. I wanted the file names to match the names given in the drawings’ title blocks.

Building drawings tend to follow a common naming/numbering convention. Each discipline has its own prefix: architectural drawings start with A, structural drawings with S, mechanical drawings with M, plumbing drawings with P, electrical drawings with E, and so on. Each prefix is then followed by a drawing number, usually separated from the prefix by a hyphen, but not always. I knew how many drawings of each type I had, and I knew their order in the original PDF, so I used a series of jot commands to generate a text file with all the names that the JPEG files should have:

jot -w 'A-%02d.jpg' 28 > names.txt
jot -w 'P-%02d.jpg' 8 >> names.txt
jot -w 'M-%02d.jpg' 10 >> names.txt
jot -w 'E-%02d.jpg' 21 >> names.txt
jot -w 'FP-%02d.jpg' 4 >> names.txt
jot -w 'S-%02d.jpg' 22 >> names.txt
jot -w 'C-%02d.jpg' 3 >> names.txt

The %02d is a printf directive to make the number that comes after the hyphen zero-padded and two digits long. That wasn’t strictly necessary, but it makes for a consistent naming scheme that alphabetizes correctly.
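On a system without jot, the same list could be built with the shell's own printf. This is only a sketch: it assumes the prefixes, counts, and page order given above, and make_names is a hypothetical helper rather than a standard command.

```shell
# Print COUNT zero-padded drawing names for one discipline prefix.
# make_names is a hypothetical helper, not a standard command.
make_names() {
    prefix=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s-%02d.jpg\n' "$prefix" "$i"   # %02d pads to two digits
        i=$((i + 1))
    done
}

# Disciplines in the order they appear in the original PDF, with sheet counts.
for spec in A:28 P:8 M:10 E:21 FP:4 S:22 C:3; do
    make_names "${spec%%:*}" "${spec##*:}"      # split "A:28" into "A" and "28"
done > names.txt
```

Keeping the prefix and count together in one list makes it easier to see that the counts add up to the 96 pages in the PDF.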

The next step was to create a shell script that renamed each file in turn. I’m no shell scripter, so I brute-forced it using jot and the paste command:

jot -w 'mv pg_%04d.jpg' 96 | paste -d ' ' - names.txt > mv.sh

This gave me a 96-line file with lines that looked like

mv pg_0001.jpg A-01.jpg
mv pg_0002.jpg A-02.jpg
mv pg_0003.jpg A-03.jpg
mv pg_0004.jpg A-04.jpg
mv pg_0005.jpg A-05.jpg
.
.
.
mv pg_0092.jpg S-21.jpg
mv pg_0093.jpg S-22.jpg
mv pg_0094.jpg C-01.jpg
mv pg_0095.jpg C-02.jpg
mv pg_0096.jpg C-03.jpg
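Before running a generated script like this, a few quick checks can catch a miscounted discipline. The following is a sketch, assuming names.txt and mv.sh are in the current directory alongside the JPEGs; check_plan is a hypothetical helper name.

```shell
# check_plan is a hypothetical helper: it inspects the name list and the
# generated script and prints anything that looks wrong.
check_plan() {
    names=$1
    script=$2
    # Both files should have one line per page.
    n=$(wc -l < "$names" | tr -d ' ')
    s=$(wc -l < "$script" | tr -d ' ')
    [ "$n" = "$s" ] || echo "line counts differ: $n names, $s mv commands"
    # A repeated target name would overwrite an earlier rename.
    sort "$names" | uniq -d | sed 's/^/duplicate: /'
    # Every source file named in the script (second field) should exist.
    awk '{print $2}' "$script" | while read -r src; do
        [ -e "$src" ] || echo "missing: $src"
    done
}

if [ -f names.txt ] && [ -f mv.sh ]; then
    check_plan names.txt mv.sh
fi
```

No output means the counts match, no name is repeated, and every page file is present.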

After making it executable,

chmod +x mv.sh

I ran it,

./mv.sh

and all the files were renamed.

I’m sure a shell wizard could do this in a far more elegant and compact way, but I needed to get back to work, so I used the commands I already knew.
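One more compact possibility skips the intermediate script and reads the new names while looping over the page files. This is a sketch, not a claim about the best way. It assumes names.txt lists the new names in page order and that the glob pg_*.jpg expands in that same order, which it does with zero-padded numbers. rename_pages is a hypothetical helper name.

```shell
# rename_pages is a hypothetical helper. It pairs each page file, in glob
# (sorted) order, with the next line of the names file given as $1.
rename_pages() {
    for f in pg_*.jpg; do
        [ -e "$f" ] || continue     # glob matched nothing
        read -r new || break        # stop if the names run out
        mv -n "$f" "$new"           # -n: never overwrite an existing file
    done < "$1"
}

if [ -f names.txt ]; then
    rename_pages names.txt
fi
```

The glob is expanded once, before any file is renamed, so the loop never sees its own output.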