Photos, dates, and xargs

I ran into an interesting problem earlier this week. I was given a hard disk with a jumble of digital photo files buried at various subdirectory levels, and I had to come up with a way to determine which, if any, of the photos had been taken on a particular day. My solution was a three-part pipeline using find, exiftool, grep, and xargs.

The disk came from a client and, as is often the case, had a very messy directory structure with file and folder names that were unhelpful at best and misleading at worst. I couldn’t dig in to reorganize the files, as I would later need to communicate file locations with that client and others who had identical copies of the hard disk. We all had the same mess, and it had to be maintained.

The first step was to find all the photo files. They came from digital cameras of various makes and models, but I knew they all had file extensions of either JPG or jpg. Finding them all, then, was just a matter of using find’s -iname switch to do a case-insensitive search on the file names. I navigated to the top-level directory of the mess and ran this command in Terminal:

find . -iname "*.jpg"

This spewed out a ridiculously long list of files, one per line with names like

./Brian Kernighan Photographs/July - BWK/DSCN0161.JPG
./Brian Kernighan Photographs/July - BWK/DSCN0162.JPG
./Brian Kernighan Photographs/July - BWK/DSCN0163.JPG
./Brian Kernighan Photographs/July - BWK/DSCN0164.JPG
./Brian Kernighan Photographs/July - BWK/DSCN0165.JPG

I cut off the output with a quick ⌃C. To learn just how many files I was dealing with, I piped the output of find to the wc command with the -l switch:

find . -iname "*.jpg" | wc -l

Over 15,000 photos.
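
Here is a minimal sketch of that counting step, run against a scratch directory instead of the client's disk. The folder and file names are invented for the demonstration, and the `tr` call is only there because some versions of wc pad the number with spaces.

```shell
# Build a throwaway tree with mixed-case extensions (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/July - BWK" "$demo/misc"
touch "$demo/July - BWK/DSCN0161.JPG" "$demo/misc/beach.jpg" "$demo/misc/notes.txt"

# -iname matches *.jpg regardless of case; wc -l counts the lines find prints.
count=$(find "$demo" -iname "*.jpg" | wc -l | tr -d ' ')
echo "$count"

# Clean up the scratch tree.
rm -rf "$demo"
```

The text file is skipped and both photo files are counted, whichever case their extension uses.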

Having established a system for getting all the photo files, I turned to extracting the dates on which they were taken. The best utility I know for this is Phil Harvey’s amazingly comprehensive Perl program, exiftool. Exiftool normally prints out every bit of metadata it can find, but you can limit it to just the information you want by adding switches named after the metadata fields. In my case, I was looking for the EXIF field named DateTimeOriginal, so my command for an individual file would look like this:

exiftool -DateTimeOriginal DSCN0161.JPG

(Assuming I execute the command within the directory that contains the file.)

The one complaint I have against exiftool is that its default output is a little verbose, especially when it’s fed a list of files. For example, the output of

exiftool -DateTimeOriginal */*/DSCN016*

is

======== Brian Kernighan Photographs/July - BWK/DSCN0161.JPG
Date/Time Original              : 2011:07:09 11:47:17
======== Brian Kernighan Photographs/July - BWK/DSCN0162.JPG
Date/Time Original              : 2011:07:09 12:12:38
======== Brian Kernighan Photographs/July - BWK/DSCN0163.JPG
Date/Time Original              : 2011:07:09 12:12:42
======== Brian Kernighan Photographs/July - BWK/DSCN0164.JPG
Date/Time Original              : 2011:07:09 12:12:49
======== Brian Kernighan Photographs/July - BWK/DSCN0165.JPG
Date/Time Original              : 2011:07:09 12:12:55
======== Brian Kernighan Photographs/July - BWK/DSCN0166.JPG
Date/Time Original              : 2011:07:09 12:13:00
======== Brian Kernighan Photographs/July - BWK/DSCN0167.JPG
Date/Time Original              : 2011:07:09 12:13:07
======== Brian Kernighan Photographs/July - BWK/DSCN0168.JPG
Date/Time Original              : 2011:07:09 12:13:11
======== Brian Kernighan Photographs/July - BWK/DSCN0169.JPG
Date/Time Original              : 2011:07:09 12:13:14
9 image files read

with each file name on its own line and the info requested put underneath. This is a good output format when you’re asking for lots of metadata, but it takes up more space than necessary when you want only one piece of information per file.

Fortunately, exiftool has an option, -p, that lets you specify the format of the output using tags. For example,

exiftool -p '$Directory/$Filename  $DateTimeOriginal' */*/DSCN016*

gives this output

Brian Kernighan Photographs/July - BWK/DSCN0161.JPG  2011:07:09 11:47:17
Brian Kernighan Photographs/July - BWK/DSCN0162.JPG  2011:07:09 12:12:38
Brian Kernighan Photographs/July - BWK/DSCN0163.JPG  2011:07:09 12:12:42
Brian Kernighan Photographs/July - BWK/DSCN0164.JPG  2011:07:09 12:12:49
Brian Kernighan Photographs/July - BWK/DSCN0165.JPG  2011:07:09 12:12:55
Brian Kernighan Photographs/July - BWK/DSCN0166.JPG  2011:07:09 12:13:00
Brian Kernighan Photographs/July - BWK/DSCN0167.JPG  2011:07:09 12:13:07
Brian Kernighan Photographs/July - BWK/DSCN0168.JPG  2011:07:09 12:13:11
Brian Kernighan Photographs/July - BWK/DSCN0169.JPG  2011:07:09 12:13:14
9 image files read

I combined the finding and exiftooling using xargs, a command that lets you use the output of one command as the argument list (not standard input) of another. The way this should work is

find . -iname "*.jpg" | xargs exiftool -q -m -p '$Directory/$Filename  $DateTimeOriginal'

where the -q suppresses the “n image files read” message at the end and the -m suppresses warnings for minor errors found in the metadata.

Unfortunately, xargs is a little too liberal in what it considers to be list item separators. By default it splits its input on any whitespace, spaces and tabs as well as newlines, which works when the file and folder names have no spaces in them, but not when you have the kind of dog’s breakfast I was given.
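
The splitting is easy to see without touching any real files. This sketch feeds one of the paths shown earlier to xargs and counts how many separate arguments come out the other side. Using `echo` with `-n 1` is just a convenient way to put each argument on its own line.

```shell
# One path containing spaces, exactly as find would print it.
path='./Brian Kernighan Photographs/July - BWK/DSCN0161.JPG'

# xargs -n 1 runs echo once per argument, so each argument gets its own line.
pieces=$(printf '%s\n' "$path" | xargs -n 1 echo | wc -l | tr -d ' ')
echo "$pieces"
```

The single path is broken at each of its four spaces, so the command on the receiving end sees five arguments, none of which names a real file.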

The GNU version of xargs lets you specify one particular character to be the delimiter (which would be great, as I could tell it to use only newlines), but OS X’s xargs isn’t as smart. It does, however, let you specify the null character (also known as NUL or \0) as the delimiter by including the -0 switch. This works in conjunction with the find command’s -print0 switch, which separates find’s output using null characters instead of newlines.

The upshot is that my pipeline gets a little longer:

find . -iname "*.jpg" -print0 | xargs -0 exiftool -q -m -p '$Directory/$Filename  $DateTimeOriginal'
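
A quick way to check that the null-delimited version keeps each path in one piece is to swap exiftool for `echo` and count the arguments. This is only a sketch on a scratch directory with made-up file names, not the real disk.

```shell
# Scratch tree whose folder name contains spaces (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/July - BWK"
touch "$demo/July - BWK/DSCN0161.JPG" "$demo/July - BWK/DSCN0162.JPG"

# With -print0 and -0, each path arrives as a single argument, spaces and all.
args=$(find "$demo" -iname "*.jpg" -print0 | xargs -0 -n 1 echo | wc -l | tr -d ' ')
echo "$args"

# Clean up the scratch tree.
rm -rf "$demo"
```

Two files go in and two arguments come out, despite the spaces in the folder name.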

The final step is to add the filter to the end of the pipeline so that only photos taken on a particular day are printed. I know there are faster tools like ack and ag—I have both of them installed—but I just can’t break the grep habit. My fingers type it even when my brain knows better.

The ultimate pipeline, then, is

find . -iname "*.jpg" -print0 | xargs -0 exiftool -q -m -p '$Directory/$Filename  $DateTimeOriginal' | grep '2011:07:09'

which gives me the file name and directory path to every photo taken on July 9, 2011.
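
One possible refinement, offered as a sketch and not as what I actually ran: the bare pattern would also match a path that happened to contain the date string, so the pattern can include the two-space separator from the -p format and the space before the time. The first sample line below is taken from the output above; the second is invented to supply a different date.

```shell
# Two lines in the -p output format; the second is a made-up example.
sample='Brian Kernighan Photographs/July - BWK/DSCN0161.JPG  2011:07:09 11:47:17
Hypothetical Folder/IMG_0001.JPG  2011:07:10 08:00:00'

# Match the date only where it follows the two-space separator.
hits=$(printf '%s\n' "$sample" | grep -c '  2011:07:09 ')
echo "$hits"
```

Only the July 9 line is counted. Dropping the `-c` prints the matching lines themselves, as in the pipeline above.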

Of course, this assumes that the clocks in all the cameras that took the photos were set correctly. But that’s another problem.