April 25, 2017 at 10:49 PM by Dr. Drang
I deal with lots of PDFs at work, and sometimes I want to report on the total number of pages I’ve reviewed on a given project. I’ve used various ad hoc methods to collect and sum up these page counts but have never come up with an automated technique. When there’s deadline pressure to get a report out, there’s no time—or it seems like there’s no time—to put together a decent automated solution. Today I decided to make time.
There are several ways to get the number of pages in a PDF. You can open it in Preview and see the page count in the title bar.
You can do a Get Info on the file.
If you have PDFtk installed, you can run it from the command line using the dump_data operation. That’ll get you a crapload of info on the file (over 2,000 lines for the file I’m using as an example), but you can limit it to just the number of pages by filtering the output.[1]
    $ pdftk File\ 3.pdf dump_data | grep NumberOfPages
    NumberOfPages: 420
I decided to use one of the command-line tools Apple provides to access the metadata used by Spotlight. These tools all begin with “md,” and the one that can output the page count is mdls (metadata list).
By default, mdls will spit out a lot of information on a file.
    $ mdls File\ 3.pdf
    _kMDItemOwnerUserID = 501
    kMDItemContentCreationDate = 2017-04-26 01:29:28 +0000
    kMDItemContentModificationDate = 2017-04-26 01:29:42 +0000
    kMDItemContentType = "com.adobe.pdf"
    kMDItemContentTypeTree = (
        "com.adobe.pdf",
        "public.item",
        "com.adobe.pdf",
        "public.data",
        "public.composite-content",
        "public.content"
    )
    kMDItemDateAdded = 2017-04-26 01:29:28 +0000
    kMDItemDisplayName = "File 3.pdf"
    kMDItemEncodingApplications = (
        "Pixel Translations (PIXPDF 22.214.171.1242)"
    )
    kMDItemFSContentChangeDate = 2017-04-26 01:29:42 +0000
    kMDItemFSCreationDate = 2017-04-26 01:29:28 +0000
    kMDItemFSCreatorCode = ""
    kMDItemFSFinderFlags = 0
    kMDItemFSHasCustomIcon = (null)
    kMDItemFSInvisible = 0
    kMDItemFSIsExtensionHidden = 0
    kMDItemFSIsStationery = (null)
    kMDItemFSLabel = 0
    kMDItemFSName = "File 3.pdf"
    kMDItemFSNodeCount = (null)
    kMDItemFSOwnerGroupID = 20
    kMDItemFSOwnerUserID = 501
    kMDItemFSSize = 41753844
    kMDItemFSTypeCode = ""
    kMDItemKind = "Portable Document Format (PDF)"
    kMDItemLastUsedDate = 2017-04-26 01:42:08 +0000
    kMDItemLogicalSize = 41753844
    kMDItemNumberOfPages = 420
    kMDItemPageHeight = 807.36
    kMDItemPageWidth = 606
    kMDItemPhysicalSize = 41754624
    kMDItemSecurityMethod = "None"
    kMDItemUseCount = 1
    kMDItemUsedDates = (
        "2017-04-25 05:00:00 +0000"
    )
    kMDItemVersion = "1.4"
The only line we want is the one that starts with kMDItemNumberOfPages. We can limit the output to just that line by using the -name option and get rid of the label with the -raw option.
    $ mdls -name kMDItemNumberOfPages -raw File\ 3.pdf
    420
The thing about -raw is that it doesn’t put a newline after the output, which is sometimes good and sometimes bad. We’ll deal with that in a bit.
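Here’s a small sketch of the good and the bad that runs without a PDF at hand. It assumes a hypothetical function, fake_mdls, that stands in for mdls with -raw by printing 420 with no trailing newline; the byte count at the end is just a way to make the missing newline visible.

```shell
#!/bin/bash
# fake_mdls is a hypothetical stand-in for
# `mdls -name kMDItemNumberOfPages -raw FILE`: digits only, no newline.
fake_mdls() {
  printf '420'
}

# Sometimes good: captured in a variable, the value is ready for arithmetic.
count=$(fake_mdls)
doubled=$(( count * 2 ))

# Sometimes bad: printed straight to a terminal, nothing ends the line.
# Counting bytes shows only the three digits, with no newline after them.
raw_bytes=$(fake_mdls | wc -c | tr -d '[:space:]')

echo "count=$count doubled=$doubled raw_bytes=$raw_bytes"
```

Run as written, this should print count=420 doubled=840 raw_bytes=3.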
Now we’re ready to build a shell script, called pdfpages, that does the following:
- Prints a usage string if you don’t give it any arguments.
- Prints just the number of pages in the file if you give it one argument.
- Prints the number of pages and the name for each file and the grand total of pages if you give it more than one argument.
    $ pdfpages
    Usage: pdfpages <files>
    $ pdfpages File\ 3.pdf
    420
    $ pdfpages *.pdf
    330   File 1.pdf
    532   File 2.pdf
    420   File 3.pdf
    1282  Total
Here’s the source code for pdfpages:
     1:  #!/bin/bash
     2:  
     3:  if [ $# == 0 ]; then
     4:    echo "Usage: pdfpages <files>"
     5:  elif [ $# == 1 ]; then
     6:    echo `mdls -name kMDItemNumberOfPages -raw "$1"`
     7:  else
     8:    sum=0
     9:    for f in "$@"; do
    10:      count=`mdls -name kMDItemNumberOfPages -raw "$f"`
    11:      echo -e "$count\t$f"
    12:      (( sum += count ))
    13:    done
    14:    echo -e "$sum\tTotal"
    15:  fi
$# provides the number of arguments to the script, so Lines 3 and 5 test for the no-argument and one-argument conditions, respectively. Line 4 is the usage message printed if there are no arguments. Line 6 runs when there’s one argument and is basically what we showed above. Putting the mdls command in backticks runs it and feeds the output to echo, which adds the trailing newline and prevents the next command prompt from appearing on the same line as the output.[2]
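For comparison, here is a sketch of three ways to tack a newline onto output that lacks one. The first is the script’s own approach; the other two are alternatives offered as possibilities, not as improvements. fake_mdls and byte_count are hypothetical helpers: the first imitates mdls with -raw, the second counts bytes so the added newline shows up as a fourth byte.

```shell
#!/bin/bash
# Hypothetical stand-in for `mdls -name kMDItemNumberOfPages -raw FILE`.
fake_mdls() {
  printf '420'
}

# byte_count reads stdin and prints the number of bytes, digits only.
byte_count() {
  wc -c | tr -d '[:space:]'
}

# 1. The script's approach: backticks feed the output to echo.
a=$(echo `fake_mdls` | byte_count)

# 2. The same idea with $( ) and printf supplying the newline.
b=$(printf '%s\n' "$(fake_mdls)" | byte_count)

# 3. Run the command directly, then print a bare newline.
c=$({ fake_mdls; echo; } | byte_count)

echo "$a $b $c"
```

All three should report 4 bytes: three digits plus the newline.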
Lines 8–14 handle the case of more than one argument. We start by initializing the running sum in Line 8. Then we loop through all the arguments with the for on Line 9. For each file, Line 10 runs the mdls command and puts the output in the variable count. Line 11 prints the page count and file name for the current file, and Line 12 increments the running sum.[3] When the loop is finished, Line 14 prints out the total.
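The arithmetic on Line 12 is one of several ways bash can add to a running sum. A short sketch of three equivalent forms follows; the starting value of 100 and the count of 420 are made up for illustration.

```shell
#!/bin/bash
count=420   # made-up page count for illustration

# 1. Arithmetic command, as in the script: no dollar signs needed inside (( )).
sum_a=100
(( sum_a += count ))

# 2. Arithmetic expansion, which also works in plain POSIX sh.
sum_b=100
sum_b=$(( sum_b + count ))

# 3. The let builtin (bash-specific).
sum_c=100
let "sum_c += count"

echo "$sum_a $sum_b $sum_c"
```

Each variable should end up at 520.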
Throughout the script, please note the use of double quotes around "$1", "$@", and "$f". This keeps the script from shitting the bed when file paths include spaces.
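To see what the quotes buy you, here is a minimal sketch that counts arguments with and without them. count_args is a hypothetical helper, and set -- is used only to simulate calling a script with two file names that contain spaces.

```shell
#!/bin/bash
# count_args reports how many arguments it received.
count_args() {
  echo $#
}

# Simulate calling a script with two file names that contain spaces.
set -- "File 1.pdf" "File 2.pdf"

quoted=$(count_args "$@")    # each name stays one argument
unquoted=$(count_args $@)    # names are split at the spaces

echo "quoted: $quoted, unquoted: $unquoted"
```

The quoted form sees 2 arguments; the unquoted form sees 4, because each name is broken in two at its space.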
This is about as complicated a shell script as I would ever want to write. If I find myself wanting to add features, I’ll probably rewrite it as a Python script using the subprocess module to run the mdls command.
With pdfpages written, I wondered how it would have worked on an older project in which I gave up trying to count all the PDF pages I was sent because there were just too many spread over too many files. Getting the number of PDF files (just over 1,000) in a nested folder structure was easy using standard tools:
    find . -iname '*.pdf' | wc -l
Now I could get the number of pages in those files with
    find . -iname '*.pdf' -print0 | xargs -0 pdfpages
It chugged away for about half a minute and told me I’d been given nearly 16,000 pages to review. Maybe it was better I didn’t know.
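One thing to watch with the xargs pipeline: when the list of files is long enough, xargs may split it across several runs of pdfpages, and each run then prints its own Total line (or a bare number, if a run happens to get a single file). The sketch below is one possible way to get a single grand total regardless of batching. grand_total is a hypothetical helper, and the sample text is made up to imitate two batches using the page counts from the earlier example.

```shell
#!/bin/bash
# grand_total reads pdfpages-style output on stdin and prints the sum of the
# per-file counts, skipping the per-batch "Total" lines.
grand_total() {
  awk -F '\t' '$2 != "Total" { sum += $1 } END { print sum + 0 }'
}

# Made-up output: a batch of two files with its Total line, followed by a
# batch of one file, which pdfpages prints as a bare number.
batches=$(printf '330\tFile 1.pdf\n532\tFile 2.pdf\n862\tTotal\n420\n')

total=$(printf '%s\n' "$batches" | grand_total)
echo "$total"
```

With a real folder, the filter would go on the end of the pipeline, as in find . -iname '*.pdf' -print0 | xargs -0 pdfpages | grand_total. A file actually named Total would be skipped by this filter, so treat it as a sketch rather than a finished tool.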
[2] This seems more complicated than necessary, but I’m not enough of a shell scripter to know a better way. Unlike the grep situation above, suggestions for improvement here are welcome.
[3] There are other ways to increment the sum. I chose this one because it struck me as weird. How often do you see shell variables dereferenced without a dollar sign? This line reminds me of how perverse shell scripting syntax can be.