Links as filesystem database

I returned to my office this past week after nearly two weeks away doing building inspections. I came back with nearly 7,000 photos—over 30 GB—that needed to be reviewed, and I faced a file organization problem: I wanted to organize the photos in two distinct and separate ways, but I didn’t want to take up 60 GB of disk space to do so. I solved the problem with a little Unix knowledge and a simple Python script.

Let me start by noting that I’m a little surprised to be in this situation. Ten years ago, “disk space is cheap” was the mantra and it seemed like that would be true forever. And if you’re talking about disks as we knew them back then—spinning metal platters—the mantra certainly is still true. But many of our computers stopped using spinning platters in favor of SSDs, and “disk space” suddenly became more precious. The MacBook Air on which I’m typing this—and on which I intend to do some of the above-mentioned photo review—has 128 GB of “disk space.” It doesn’t have a free 30 GB to spare for a second copy of every photo.

The solution is to use an old Unix trick, developed back when even spinning disk space was valuable. A hard link is a way for several files to all point to the same chunk of data on a disk. The usual way to make a hard link from the command line is the ln command, which creates a new file that points to the same data as an existing file. But because the two directory structures I wanted to use were a little more complicated than my feeble shell scripting skills could handle, I ended up using Python’s link function from the os module, which works essentially the same way.
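To make the idea concrete, here is a minimal sketch of what a hard link looks like from Python. It is not part of my workflow; the function name and the file names are made up for illustration, and it assumes a filesystem that supports hard links (as macOS and Linux filesystems normally do).

```python
import os
import tempfile

def demo_hard_link():
    """Create a file, hard-link it, and report what the two names share."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "photo.jpg")       # hypothetical file name
        linked = os.path.join(tmp, "photo-link.jpg")    # hypothetical second name

        with open(original, "wb") as f:
            f.write(b"not really a JPEG")

        # Same idea as `ln photo.jpg photo-link.jpg` at the command line
        os.link(original, linked)

        # Both names refer to one underlying file, and its link count is now 2
        same_file = os.path.samefile(original, linked)
        link_count = os.stat(original).st_nlink

        # Removing one name leaves the data reachable through the other
        os.remove(original)
        with open(linked, "rb") as f:
            survived = f.read() == b"not really a JPEG"

    return same_file, link_count, survived
```

Calling demo_hard_link() should return three values: whether the two names are the same file, the link count, and whether the data survived deleting the first name. Neither name is the "real" one; the data goes away only when the last name is removed.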

Here are the two directory structures I wanted to use.

Directory structure before linking

(Everything I’m showing here is fake, of course, made up for the purposes of this post. But it parallels the much larger set of real data.)

The photos were initially stored in a two-tiered system, with a folder for each day (using a yyyymmdd naming structure), and a set of subfolders for each building inspected on that day. The subfolders were named according to the building addresses. The photos of each building were saved in the subfolders.

I needed to keep this system, because it would be important at times to be able to quickly get at the photos taken on a certain day. But I also wanted to organize the photos according to building classifications. Here in the fake data, I’ve called those classifications small, medium, and large. I created the “small,” “medium,” and “large” folders and populated them with empty subfolders named after the building addresses—the same names used in the subfolders of the dated directories.

At this point, I needed a way to make hard links to all the JPEG files, putting them into the proper subfolders. As I said, the normal way to do this would be a shell script, but the thought of writing a shell script that walked through the dated directories, manipulating strings and making sure directory names were properly quoted to handle the embedded spaces, was too daunting. So I fell back on the familiar: Python.

Here’s the script, makelinks, that did the job.

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import os
 4:  import glob
 5:  
 6:  sourcedirs = []
 7:  dateddirs = glob.glob('201611*')
 8:  for d in dateddirs:
 9:    sourcedirs += [ d + '/' + x for x in os.listdir(d) ]
10:  
11:  destdirs = []
12:  sizedirs = 'small medium large'.split()
13:  for d in sizedirs:
14:    destdirs += [ d + '/' + x for x in os.listdir(d) ]
15:  
16:  for s in sourcedirs:
17:    addr = s.split('/')[1]
18:    jpgs = os.listdir(s)
19:    for d in destdirs:
20:      if d.endswith(addr):
21:        destdir = d
22:        break
23:    for j in jpgs:
24:      os.link(s+'/'+j, destdir+'/'+j)

The source directories are gathered into a list in Lines 6–9. I used the standard glob library to first get the dated directories themselves, and then used the listdir function from the os module to build the relative path to each of their subdirectories.

Lines 11–14 do a similar thing to get a list of paths to all the destination directories.

Lines 16–24 loop through all the source directories, getting the names of all the JPEG files within them and then searching for the appropriate destination directory using the inner loop in Lines 19–22. There’s probably a more elegant way to do that, but this was easy to write and I was more interested in getting the job done than doing it elegantly. Finally, the os.link call on Line 24 creates the hard links.
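One possibly more elegant way to do the search is with a dictionary that maps each address to its destination folder, so the inner loop goes away. This is a sketch, not the script I ran. The function name make_links and its parameters are invented for illustration, and two behaviors are my own assumptions rather than anything in makelinks: it skips addresses that have no classification folder, and it skips links that already exist so it can be rerun.

```python
import glob
import os

def make_links(root, date_pattern="201611*", size_dirs=("small", "medium", "large")):
    """Hard-link every file in root/<date>/<address>/ into root/<size>/<address>/.

    Returns the number of links created.
    """
    # Map each address to its destination, e.g. '123 Main St' -> 'root/small/123 Main St'
    dest_for = {}
    for size in size_dirs:
        size_path = os.path.join(root, size)
        for addr in os.listdir(size_path):
            dest_for[addr] = os.path.join(size_path, addr)

    made = 0
    # glob.escape protects any wildcard characters that happen to be in root
    for dated in sorted(glob.glob(os.path.join(glob.escape(root), date_pattern))):
        for addr in os.listdir(dated):
            src_dir = os.path.join(dated, addr)
            dest_dir = dest_for.get(addr)
            if dest_dir is None or not os.path.isdir(src_dir):
                continue  # assumption: skip anything without a matching folder
            for name in os.listdir(src_dir):
                target = os.path.join(dest_dir, name)
                if not os.path.exists(target):  # assumption: makes reruns harmless
                    os.link(os.path.join(src_dir, name), target)
                    made += 1
    return made
```

Using os.path.join instead of string concatenation sidesteps the worry about spaces in the address names, since the paths never pass through a shell. Note that skipping existing targets would silently ignore a second photo with the same file name for the same building, where os.link on its own would raise an error.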

After makelinks, the directory structure looks like this.

Directory structure after linking

The small, medium, and large directories are now fully populated. In the Finder, it looks for all the world as if I’ve made copies of all the JPEGs, but it’s easy to prove that’s not the case. The du command displays disk usage for files and/or directories, and we can use some command line switches to get just the summary of the contents of the overall Photos directory. Doing this before and after running makelinks,

$ du -sh .
4.2M    .
$ ./makelinks
$ du -sh .
4.2M    .

shows that the disk usage remained steady at 4.2 MB.
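The reason du doesn’t double-count is that it tallies each underlying file once, no matter how many names point to it. Here is a rough sketch of that bookkeeping in Python. The function name disk_usage is made up, and as a simplification it adds up file sizes in bytes rather than the allocated disk blocks that du actually reports.

```python
import os

def disk_usage(root):
    """Sum the sizes of files under root, counting each underlying file once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)   # identifies the data, not the name
            if key in seen:
                continue                   # another name for data already counted
            seen.add(key)
            total += st.st_size
    return total
```

Run before and after creating hard links, a function like this should return the same total, just as du did above.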

Why, you might be asking, didn’t I just make aliases for each of the subfolders and move those aliases into the small, medium, and large folders? Wouldn’t that have been much quicker?

Yes, it would have been quicker, but folder aliases don’t work the way I want to work. Let’s say, for example, I made an alias of the “123 Main St” subfolder in “20161101” and moved it into the “small” folder. Double-clicking on the alias folder would open the original folder, which means that when I hit ⌘↑ to go up one directory level, I’d be in the “20161101” folder instead of the “small” folder. That’s not how I want to navigate these folders. In fact, it’s exactly what I want to avoid.

The hard link solution, although more difficult to set up, will be a time-saver in the long run. And if I find myself needing to use another classification system for the buildings, I can add another directory structure with hard links using the same technique. In this way, I can use the file system as a sort of database, organizing my files in multiple ways without using up more space.

Update 11/20/2016 7:55 AM
A couple of readers on Twitter have suggested using tags. Tags are now the common way to organize files and folders according to multiple criteria, but they have the same navigation problem as aliases.

I could, for example, have given the “213 Elm St” and “789 North Ave” subfolders in the dated folders the “medium” tag and then created a Smart Folder that contained all the folders tagged “medium.” Had I done that, as with aliases, double-clicking on the “213 Elm St” folder in “medium” would open the original “213 Elm St” folder, which means that when I hit ⌘↑ to go up one level, I’d be in the “20161101” folder instead of the “medium” folder—not what I want when I’m trying to investigate all the medium buildings.

Like aliases, tags are almost what I want, but not quite. Hard links give me exactly what I want.