PDF page counts and metadata

I deal with lots of PDFs at work, and sometimes I want to report on the total number of pages I’ve reviewed on a given project. I’ve used various ad hoc methods to collect and sum up these page counts but have never come up with an automated technique. When there’s deadline pressure to get a report out, there’s no time—or it seems like there’s no time—to put together a decent automated solution. Today I decided to make time.

There are several ways to get the number of pages in a PDF. You can open it in Preview and see the page count in the title bar.

[Image: Title bar of PDF in Preview]

You can do a Get Info on the file.

[Image: Get Info]

If you have PDFtk installed, you can run it from the command line using the dump_data operation. That’ll get you a crapload of info on the file (over 2,000 lines for the file I’m using as an example), but you can limit it to just the number of pages by filtering the output.[1]

$ pdftk File\ 3.pdf dump_data | grep NumberOfPages
NumberOfPages: 420
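
If you want the bare number rather than the labeled line, awk can do the filtering and the trimming in one pass. This is only a sketch: pages_from_dump is a made-up function name, and it assumes the NumberOfPages line always looks like the one above.

```shell
# Read pdftk dump_data output on stdin and print only the page count.
# Assumes a line of the form "NumberOfPages: N", as in the output above.
pages_from_dump() {
  awk '$1 == "NumberOfPages:" { print $2; exit }'
}
```

It would be used as pdftk File\ 3.pdf dump_data | pages_from_dump.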

I decided to use one of the command-line tools Apple provides to access the metadata used by Spotlight. These tools all begin with “md,” and the one that can output the page count is mdls (metadata list).

Like pdftk, mdls will, by default, spit out a lot of information on a file.

$ mdls File\ 3.pdf 
_kMDItemOwnerUserID            = 501
kMDItemContentCreationDate     = 2017-04-26 01:29:28 +0000
kMDItemContentModificationDate = 2017-04-26 01:29:42 +0000
kMDItemContentType             = "com.adobe.pdf"
kMDItemContentTypeTree         = (
    "com.adobe.pdf",
    "public.item",
    "com.adobe.pdf",
    "public.data",
    "public.composite-content",
    "public.content"
)
kMDItemDateAdded               = 2017-04-26 01:29:28 +0000
kMDItemDisplayName             = "File 3.pdf"
kMDItemEncodingApplications    = (
    "Pixel Translations (PIXPDF 58.5.1.1422)"
)
kMDItemFSContentChangeDate     = 2017-04-26 01:29:42 +0000
kMDItemFSCreationDate          = 2017-04-26 01:29:28 +0000
kMDItemFSCreatorCode           = ""
kMDItemFSFinderFlags           = 0
kMDItemFSHasCustomIcon         = (null)
kMDItemFSInvisible             = 0
kMDItemFSIsExtensionHidden     = 0
kMDItemFSIsStationery          = (null)
kMDItemFSLabel                 = 0
kMDItemFSName                  = "File 3.pdf"
kMDItemFSNodeCount             = (null)
kMDItemFSOwnerGroupID          = 20
kMDItemFSOwnerUserID           = 501
kMDItemFSSize                  = 41753844
kMDItemFSTypeCode              = ""
kMDItemKind                    = "Portable Document Format (PDF)"
kMDItemLastUsedDate            = 2017-04-26 01:42:08 +0000
kMDItemLogicalSize             = 41753844
kMDItemNumberOfPages           = 420
kMDItemPageHeight              = 807.36
kMDItemPageWidth               = 606
kMDItemPhysicalSize            = 41754624
kMDItemSecurityMethod          = "None"
kMDItemUseCount                = 1
kMDItemUsedDates               = (
    "2017-04-25 05:00:00 +0000"
)
kMDItemVersion                 = "1.4"

The only line we want is the one that starts with kMDItemNumberOfPages. We can limit the output to just that line by using the -name option and get rid of the label with the -raw option:

$ mdls -name kMDItemNumberOfPages -raw File\ 3.pdf 
420

The thing about -raw is that it doesn’t put a newline after the output, which is sometimes good and sometimes bad. We’ll deal with that in a bit.

Now we’re ready to build a shell script, called pdfpages, that does the following:

  1. Prints a usage string if you don’t give it any arguments.
  2. Prints just the number of pages in the file if you give it one argument.
  3. Prints the number of pages and the name for each file and the grand total of pages if you give it more than one argument.

To demonstrate:

$ pdfpages
Usage: pdfpages <files>

$ pdfpages File\ 3.pdf 
420

$ pdfpages *.pdf
330     File 1.pdf
532     File 2.pdf
420     File 3.pdf
1282    Total

Here’s the source code for pdfpages:

bash:
 1:  #!/bin/bash
 2:  
 3:  if [ $# == 0 ]; then
 4:    echo "Usage: pdfpages <files>"
 5:  elif [ $# == 1 ]; then
 6:    echo `mdls -name kMDItemNumberOfPages -raw "$1"`
 7:  else
 8:    sum=0
 9:    for f in "$@"; do
10:      count=`mdls -name kMDItemNumberOfPages -raw "$f"`
11:      echo -e "$count\t$f"
12:      (( sum += count ))
13:    done
14:    echo -e "$sum\tTotal"
15:  fi

Recall that $# provides the number of arguments to the script, so Lines 3 and 5 test for the no-argument and one-argument conditions, respectively. Line 4 is the usage message printed if there are no arguments. Line 6 runs when there’s one argument and is basically what we showed above. Putting the mdls command in backticks runs it and feeds the output to echo, which adds the trailing newline and prevents the next command prompt from appearing on the same line as the output.[2]
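
One possible simplification of Line 6, offered as a sketch rather than a tested replacement: let mdls write straight to standard output and follow it with a bare echo, which prints nothing but a newline. The function name page_count is hypothetical, and mdls exists only on macOS.

```shell
# Print the page count of one file, followed by a newline.
# mdls (macOS only) with -raw emits no trailing newline, so a bare echo adds it.
page_count() {
  mdls -name kMDItemNumberOfPages -raw "$1"
  echo
}
```

Called as page_count File\ 3.pdf, this should behave like Line 6 without the backticks.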

Lines 8–14 handle the case of more than one argument. We start by initializing the running sum in Line 8. Then we loop through all the arguments with the for on Line 9. For each file, Line 10 runs the mdls command and puts the output in the variable count. Line 11 prints the page count and file name for the current file, and Line 12 increments the running sum.[3] When the loop is finished, Line 14 prints out the total.
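
Line 12’s (( sum += count )) is a bash construct. For comparison, here are two other spellings of the same increment that should work in any POSIX shell. It’s a sketch using the page counts from the demonstration above; the variable names sum_a and sum_b are just for illustration.

```shell
# Page counts borrowed from the example run above.
sum_a=0
sum_b=0
for count in 330 532 420; do
  sum_a=$(( sum_a + count ))     # arithmetic expansion, then a plain assignment
  : "$(( sum_b += count ))"      # assignment inside the expansion; ":" discards the value
done
echo "$sum_a $sum_b"
```

Both variables end up holding the same total as the script’s running sum.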

Throughout the script, please note the use of double quotes around the $@, $1, and $f variables. This keeps the script from shitting the bed when file paths include spaces.
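
To see what the quotes buy you, here’s a small sketch that counts how many arguments a function receives with and without them. The count_args function and the two file names are stand-ins for illustration.

```shell
# Print how many arguments were received.
count_args() {
  echo "$#"
}

# Two hypothetical file names, each containing a space.
set -- "File 1.pdf" "File 2.pdf"

unquoted=$(count_args $@)      # word splitting breaks each name in two
quoted=$(count_args "$@")      # each name stays a single argument

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With the default word splitting, the unquoted call sees four arguments (File, 1.pdf, File, 2.pdf), none of which is a real file, while the quoted call sees the two names intact.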

This is about as complicated a shell script as I would ever want to write. If I find myself wanting to add features, I’ll probably rewrite it as a Python script using the subprocess module to run the mdls command.

After writing pdfpages, I wondered how it would have worked on an older project in which I gave up trying to count all the PDF pages I was sent because there were just too many spread over too many files. Getting the number of PDF files (just over 1,000) in a nested folder structure was easy using standard tools:

find . -iname '*.pdf' | wc -l

Now I could get the number of pages in those files with

find . -iname '*.pdf' -print0 | xargs -0 pdfpages
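
One caveat: when the argument list gets long enough, xargs can split it across several invocations of pdfpages, and each invocation prints its own Total line. A possible workaround, sketched here with a made-up function name, is to ignore those lines and add up the per-file counts again. It assumes the tab-separated output format shown earlier and will skip any file that happens to be named Total.

```shell
# Read pdfpages-style output on stdin and print a single grand total.
# Adds every line whose first tab-separated field is a whole number,
# except the per-batch lines whose second field is exactly "Total".
grand_total() {
  awk -F '\t' '$2 != "Total" && $1 ~ /^[0-9]+$/ { sum += $1 } END { print sum + 0 }'
}
```

It would go at the end of the pipeline: find . -iname '*.pdf' -print0 | xargs -0 pdfpages | grand_total. Lines with a non-numeric count are left out of the sum, and a batch that contained a single file (which prints a bare number) is still counted.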

It chugged away for about half a minute and told me I’d been given nearly 16,000 pages to review. Maybe it was better I didn’t know.


  1. Don’t write to tell me I should be using ack or ag or rg or any of the other grep replacements that have sprung up in recent years. This is just an example and speed really doesn’t matter here.

  2. This seems more complicated than necessary, but I’m not enough of a shell scripter to know a better way. Unlike the grep situation above, suggestions for improvement here are welcome.

  3. There are other ways to increment the sum. I chose this one because it struck me as weird. How often do you see shell variables dereferenced without a dollar sign? This line reminds me of how perverse shell scripting syntax can be.