New module for old

My iMac recently got a new hard drive from Apple through the 3TB Hard Drive Replacement Program. Because my old hard drive was still working fine, I thought Apple was going to clone it onto the new one, but that didn’t happen. Probably a privacy concern. Then I thought I’d be able to use either my Time Machine or SuperDuper backup to make the new drive like the old one, but for reasons I’m still not sure of—and which are too painful to recount, anyway—those options didn’t work out, either. So I’m in the process of rebuilding my system starting with a stock version of Yosemite 10.10.4. Copying over my home directory was simple enough, and I’m slowly reinstalling the various applications, utilities, and other digital toolsets my work computer has accreted over the years.

Python modules are usually pretty easy to install: pip install modname almost always does the trick. But I decided not to install the pyexiv2 library that a few of my photo-related scripts use. Pyexiv2 accesses the EXIF metadata embedded in JPEG files and is a wrapper around the exiv2 C++ library, which in turn relies on Boost and SCons. I’ve been successful in installing all these in the past, but I really didn’t want to go through that rigamarole again. I’d much rather have a pure Python solution, because it’s easier to install now and easier to reinstall later.

I decided to go with Ben Leslie’s pexif module. It’s under active development, and unlike the exifread package, it allows for both reading and writing of the EXIF metadata. Although I don’t need writing for the script presented below, I do for another photo script. So although I think exifread is a bit easier to use, I’m going with pexif because I’ll need it eventually.

The first script I’m rewriting to use pexif is called canonize, which renames photo files according to the date on which they were taken. This is a script I use almost every week at work, so it’s important that I get a working version of it up and running right away. I’m sure there are ways to use Hazel or some other utility to do the same thing, but I’ve been using one form of canonize or another for about 15 years, and I’m comfortable with it. The basic logic has remained the same for all that time—starting with a Perl version and now through three Python versions—so rewriting it to use a new EXIF library was no big deal.

And after writing that option parsing article a couple of weeks ago, it seemed like I should also rip out the getopt parsing code and replace it with equivalent code that uses docopt.

Here’s the latest version of canonize:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import docopt
 4:  import pexif
 5:  import os
 6:  import os.path
 7:  import sys
 8:  
 9:  usage = """Usage:
10:    canonize [options] FILE...
11:    canonize [options] -f
12:  
13:  Rename JPEG photo files according to the date taken.
14:  
15:  Options:
16:    -f        get filenames from STDIN instead of command line
17:    -s SSS    optional suffix [default: drang]
18:    -n NNN    start with this number [default: 1]
19:    -t        show the renaming but don't do it
20:    -h        show this help message
21:  
22:  The format for the file name is yyyymmddsss-nnn.jpg, where
23:  yyyy is the year, mm is the month number, dd is the day, sss
24:  is the optional suffix (which can be any length), and nnn is
25:  the (zero-padded) photo number for that day. By default, the
26:  original file names are given on the command line; if the -f
27:  option is used, the original file names are taken from
28:  STDIN."""
29:  
30:  # Handle the command line options.
31:  args = docopt.docopt(usage)
32:  suffix = args['-s']
33:  start = int(args['-n'])
34:  test = args['-t']
35:  filtrate = args['-f']
36:  
37:  # Get the file list and create a list of (filedate, filename) tuples.
38:  if filtrate:
39:    filenames = sys.stdin.read().split()
40:  else:
41:    filenames = args['FILE']
42:  filedates = []
43:  for f in filenames:
44:    info = pexif.JpegFile.fromFile(f).exif.primary
45:    try:
46:      d = info.ExtendedEXIF.DateTimeOriginal
47:      filedates.append((d, f))
48:    except AttributeError:          # skip over files without EXIF info
49:      continue
50:  
51:  # Don't bother going on if there aren't any files in the list.
52:  if len(filedates) == 0:
53:    sys.exit()
54:  
55:  # Some background info:
56:  # DateTimeOriginal is a string in the form 'yyyy:mm:dd hh:mm:ss'.
57:  # All the numbers use leading zeros if necessary; the hours use a
58:  # 24-hour clock format. An alphabetic sort on strings in this form
59:  # also sorts on date and time. Running split() on this string yields
60:  # a (date, time) tuple.
61:  
62:  # Sort the files according to date and time taken.
63:  filedates.sort()
64:  
65:  # Create a list of (oldfilename, newfilename) tuples.
66:  newnames = []
67:  i = start - 1                       # initialize the sequence number
68:  prev = filedates[0][0].split()[0]   # initialize the date
69:  for date, old in filedates:
70:    current = date.split()[0]
71:    if current == prev:               # still on same date
72:      i += 1
73:    else:                             # starting new date
74:      i = 1
75:      prev = current
76:    path = os.path.dirname(old)
77:    new = os.path.join(path,
78:      "%s%s-%03d.jpg" % (current.replace(':', ''), suffix, i))
79:    if new in filenames:
80:      sys.stderr.write("Error: %s is already being used\n" % new)
81:      sys.exit(1)
82:    else:
83:      newnames.append((old, new))
84:  
85:  # Rename the files or print out how they would be renamed.
86:  if test:
87:    for o,n in newnames:
88:      print "%s -> %s" % (o, n)
89:  else:
90:    for o,n in newnames:
91:      os.rename(o,n)

What canonize does is pretty simple and is, I think, fully explained in the usage string on Lines 9–28. It renames each photo file according to the date on which it was taken and the order in which it was taken on that date. (I know some people like to include the time in their photo names, but I’ve always preferred a simple counter.) There are some options that I seldom use for changing the details of the renaming or the source of the filenames.

On Line 31, docopt uses the usage string to parse the options and arguments and returns them all in a dictionary named args. Lines 32–35 turn the items of args into simple variables with nicer names.

The pexif module gets used in the loop in Lines 43–49. For each file, the EXIF information is read in Line 44 and the DateTimeOriginal field is plucked out in Line 46. This is the date and time, in yyyy:mm:dd hh:mm:ss format, at which the photo was taken. That date/time stamp is then used to sort the files in chronological order in Line 63 and determine their new names in Lines 66–83.
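
The sorting and numbering in Lines 63–83 can also be tried out on their own, without pexif or any actual photos. What follows is only a sketch under some assumptions, not part of canonize itself: plan_renames is a made-up name, the timestamps and file names are invented, and the check for name collisions in Lines 79–81 is left out. It relies on the same fact the script does, namely that an alphabetic sort of yyyy:mm:dd hh:mm:ss strings is also a chronological sort.

```python
import os.path

def plan_renames(filedates, suffix="drang", start=1):
    """Return a list of (oldname, newname) tuples.

    filedates is a list of (DateTimeOriginal, filename) tuples, where
    DateTimeOriginal is a string of the form 'yyyy:mm:dd hh:mm:ss'.
    (Hypothetical helper for illustration; not in canonize.)
    """
    newnames = []
    prev = None                 # date of the previous photo
    count = start - 1           # sequence number within the current date
    # Sorting the tuples sorts on the timestamp string first, which
    # puts the photos in chronological order.
    for stamp, old in sorted(filedates):
        day = stamp.split()[0]  # the 'yyyy:mm:dd' part
        if prev is None or day == prev:
            count += 1          # still on the same date
        else:
            count = 1           # new date, so restart the counter
        prev = day
        new = os.path.join(os.path.dirname(old),
                           "%s%s-%03d.jpg" % (day.replace(":", ""), suffix, count))
        newnames.append((old, new))
    return newnames

# Invented sample data, deliberately out of order.
sample = [
    ("2015:07:04 18:30:00", "pics/IMG_0003.JPG"),
    ("2015:07:03 09:15:42", "pics/IMG_0001.JPG"),
    ("2015:07:04 08:00:05", "pics/IMG_0002.JPG"),
]

for old, new in plan_renames(sample):
    print("%s -> %s" % (old, new))
```

As in canonize, the -n starting number applies only to the first date in the batch; the counter goes back to 1 whenever the date changes.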

Lines 44 and 46 show why I’m not entirely thrilled with pexif. While it makes perfect sense that the attribute I’m after is called DateTimeOriginal, it’s not at all obvious that it should be buried under the three layers of exif, primary, and ExtendedEXIF. I understand that this structure comes from the EXIF spec, but I shouldn’t have to dig through the spec (or pexif’s source code) to find clues to the module’s attribute hierarchy. That’s what documentation is for, and it’s something pexif is sorely lacking.

On the other hand, pexif has some very nice helper functions for getting and setting GPS data, and they’ll be very helpful when I rewrite my other photo-handling script.

And docopt really does work as well as Rob Wells said. The only thing I’m uncomfortable with is its name—I’m not sure why.

[Image: Spider-man 157 cover. Image from Any Eventuality.]