Sorting by parts

A few days ago, T.J. Luoma wrote a short, clever post on how he used the rev command in a pipeline to sort a list of domains and subdomains by domain. It’s a good solution, brief and easy to remember, and it introduced me to a command I’d never seen before.[1] It also got me thinking about other ways to solve the problem.

T.J.’s original list is

foo.tjluoma.com
a.luo.ma
bar.luo.ma
b.tjluoma.com

and his one-liner produces

a.luo.ma
bar.luo.ma
b.tjluoma.com
foo.tjluoma.com

This is sorted by top-level domain, then regular domain, then subdomain—which is what T.J. wanted—but it isn’t alphabetically sorted. I wondered how hard an alphabetical sorting by parts would be.

For T.J.’s list, or any list in which every domain has three parts, it’s not hard at all. The sort command has options to break each line into fields and use these fields to sort on multiple keys. With the list of domains on the clipboard, the command

pbpaste | sort -t. -k3 -k2 -k1

returns

b.tjluoma.com
foo.tjluoma.com
a.luo.ma
bar.luo.ma

which is exactly what I was looking for. The -t. option tells sort to use the period as a field separator. The three -k options tell it to sort first on field 3, then on field 2, and then on field 1. Boom.
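For comparison, here’s a rough Python sketch of the same field-by-field sort. It assumes every line has exactly three fields, and the names (domains, three_field_key, by_fields) are made up for illustration. Strictly speaking, -k3 starts the key at field 3 and runs to the end of the line, but for three-field lines that comes to the same thing. Python compares strings by code point, which is roughly what sort does in the C locale; other locales may order things differently.

```python
# Sketch of `sort -t. -k3 -k2 -k1` for lines with exactly three fields.
domains = ["foo.tjluoma.com", "a.luo.ma", "bar.luo.ma", "b.tjluoma.com"]


def three_field_key(line):
    """Return (field 3, field 2, field 1) as the sort key."""
    # Unpacking raises ValueError if the line doesn't have exactly three parts.
    first, second, third = line.split(".")
    return (third, second, first)


by_fields = sorted(domains, key=three_field_key)
print("\n".join(by_fields))
```

The unpacking fails loudly on a two-part domain, where sort would quietly treat the missing field as empty.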

Unfortunately, this command won’t work if some of the domains don’t have a subdomain. If, for example, our input is

foo.tjluoma.com
a.luo.ma
bar.luo.ma
b.tjluoma.com
leancrew.com
drdrang.com
daringfireball.net
6by6.5by5.fm
4by4.5by5.fm
5by5.fm
tjluoma.com
atp.fm
wordpress.com
wordpress.net
wordpress.co

the above command gives us

wordpress.co
drdrang.com
leancrew.com
tjluoma.com
wordpress.com
5by5.fm
atp.fm
daringfireball.net
wordpress.net
b.tjluoma.com
foo.tjluoma.com
4by4.5by5.fm
6by6.5by5.fm
a.luo.ma
bar.luo.ma

which is a mess. Because nine of the domains have no subdomain—and therefore only two fields—field 3 is blank for those domains and they get sorted before all the others. Using T.J.’s solution on this mixed list, we get

a.luo.ma
bar.luo.ma
5by5.fm
4by4.5by5.fm
6by6.5by5.fm
atp.fm
tjluoma.com
b.tjluoma.com
foo.tjluoma.com
drdrang.com
wordpress.com
leancrew.com
wordpress.co
daringfireball.net
wordpress.net

which (again) isn’t alphabetical but is clearly much better because it keeps the TLDs and the regular domains together.
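T.J.’s exact one-liner isn’t reproduced here, so the following Python sketch assumes it amounts to reversing each line, sorting, and reversing again (rev | sort | rev). The function name sort_by_reversed_text is mine. It shows why the result groups by TLD without being alphabetical: the comparison runs over the reversed characters, so “gnardrd” (drdrang) lands before “wercnael” (leancrew).

```python
def sort_by_reversed_text(lines):
    """Order lines by their reversed text, like rev | sort | rev (assumed)."""
    return sorted(lines, key=lambda line: line[::-1])


mixed = [
    "foo.tjluoma.com", "a.luo.ma", "bar.luo.ma", "b.tjluoma.com",
    "leancrew.com", "drdrang.com", "daringfireball.net",
    "6by6.5by5.fm", "4by4.5by5.fm", "5by5.fm", "tjluoma.com",
    "atp.fm", "wordpress.com", "wordpress.net", "wordpress.co",
]

reversed_order = sort_by_reversed_text(mixed)
print("\n".join(reversed_order))
```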

To alphabetize in top-level domain/regular domain/subdomain order, we’ll need to

  1. Split each line into fields at the periods.
  2. Reorder the fields from back to front.
  3. Sort the lines according to the reordered fields.
  4. Put the fields back in their original order.

Because awk automatically splits lines into fields, it’s the natural tool for the first two steps. Step 3 is obviously sort’s job. And with a simple trick, I got awk to handle Step 4 without much fuss. Here’s the command:

pbpaste |
awk -F . '{for(i=NF;i>0;i--)printf("%s ",$i);printf(" %s\n",$0)}' |
sort |
awk '{print $NF}'

where I’ve split it across four lines to make it easier to read. Running this with the longer list of domains on the clipboard gives

wordpress.co
drdrang.com
leancrew.com
tjluoma.com
b.tjluoma.com
foo.tjluoma.com
wordpress.com
5by5.fm
4by4.5by5.fm
6by6.5by5.fm
atp.fm
a.luo.ma
bar.luo.ma
daringfireball.net
wordpress.net

which is what I was looking for. The sorting is a little easier to see when the list is aligned according to the last dot:

     wordpress.co
       drdrang.com
      leancrew.com
       tjluoma.com
     b.tjluoma.com
   foo.tjluoma.com
     wordpress.com
          5by5.fm
     4by4.5by5.fm
     6by6.5by5.fm
           atp.fm
         a.luo.ma
       bar.luo.ma
daringfireball.net
     wordpress.net
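An aligned listing like the one above can be produced with a few lines of Python. This is only a sketch, and the function name align_on_last_dot is made up for illustration. It pads everything before the last period to a common width.

```python
def align_on_last_dot(domains):
    """Right-align the text before each domain's last period."""
    # rpartition splits at the last period: (text before, ".", TLD).
    pieces = [d.rpartition(".") for d in domains]
    width = max((len(head) for head, _, _ in pieces), default=0)
    return [head.rjust(width) + dot + tld for head, dot, tld in pieces]


for line in align_on_last_dot(["atp.fm", "daringfireball.net", "6by6.5by5.fm"]):
    print(line)
```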

The heavy lifting is done by the first awk command. The -F . option tells it to split fields on the period instead of (the default) whitespace. On each line, the for loop counts down from the number of fields (NF) to one. Awk identifies each field with a symbol like $1, $2, $3, etc. The $i in the first printf statement takes on one of these values each time through, so we print each field in reverse order with a space after it. This reversing loop is one of Eric Pement’s handy awk hints.

Then (and here’s the trick), the next printf, which is outside the for loop, prints another space and then the full text of the original line ($0). The advantage of adding this to the end of the line is that we won’t have to reconstruct the original domain from its parts.

The output of the first awk command is

com tjluoma foo  foo.tjluoma.com
ma luo a  a.luo.ma
ma luo bar  bar.luo.ma
com tjluoma b  b.tjluoma.com
com leancrew  leancrew.com
com drdrang  drdrang.com
net daringfireball  daringfireball.net
fm 5by5 6by6  6by6.5by5.fm
fm 5by5 4by4  4by4.5by5.fm
fm 5by5  5by5.fm
com tjluoma  tjluoma.com
fm atp  atp.fm
com wordpress  wordpress.com
net wordpress  wordpress.net
co wordpress  wordpress.co

You can see how sort will put this in the order we want:

co wordpress  wordpress.co
com drdrang  drdrang.com
com leancrew  leancrew.com
com tjluoma  tjluoma.com
com tjluoma b  b.tjluoma.com
com tjluoma foo  foo.tjluoma.com
com wordpress  wordpress.com
fm 5by5  5by5.fm
fm 5by5 4by4  4by4.5by5.fm
fm 5by5 6by6  6by6.5by5.fm
fm atp  atp.fm
ma luo a  a.luo.ma
ma luo bar  bar.luo.ma
net daringfireball  daringfireball.net
net wordpress  wordpress.net

Note that there are two spaces before the domain. The second space keeps the domain from interfering with the sort because the space character is lower in ASCII order than any character that might be in a domain name.

Because of our trick of adding the domain to the end of each line, we can now get just the part we want by using awk to print the last (whitespace-separated) field of each line ($NF).
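To make the three stages of the pipeline concrete, here’s a minimal Python sketch that mimics them: decorate each line with its reversed fields, sort the decorated lines as plain text, then keep only the last whitespace-separated field. The function names decorate and sort_domains are mine, and Python’s code-point string comparison stands in for sort in the C locale.

```python
def decorate(domain):
    """Mimic the first awk command: reversed fields, an extra space, the original."""
    parts = domain.split(".")
    # Each reversed field is followed by one space, like printf("%s ", $i).
    prefix = "".join(part + " " for part in reversed(parts))
    # The second printf adds one more space and then the whole line ($0).
    return prefix + " " + domain


def sort_domains(domains):
    """Decorate, sort as text, then strip the decoration (awk '{print $NF}')."""
    decorated = sorted(decorate(d) for d in domains)
    return [line.split()[-1] for line in decorated]


sample = ["foo.tjluoma.com", "tjluoma.com", "wordpress.co", "5by5.fm", "4by4.5by5.fm"]
print(decorate("foo.tjluoma.com"))
print("\n".join(sort_domains(sample)))
```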

I’d like to end the explanation with “boom,” but this solution, while it works, has too many parts for “boom.” In fact, when a one-liner starts to stretch out like this, I usually abandon it before finishing and just write a Python or Perl script. Here’s a short Python script that’s longer but easier to read:

python:
#!/usr/bin/env python3

import fileinput

# Read in the domains and create a list of lists
# with the domain parts in reverse order.
domainparts = []
for d in fileinput.input():
  dlist = d.strip().split('.')
  dlist.reverse()
  domainparts.append(dlist)

# Lists compare element by element, so this sorts by
# top-level domain, then regular domain, then subdomain.
domainparts.sort()

# Print out the sorted domains.
for d in domainparts:
  d.reverse()
  print('.'.join(d))

There’s no need for any tricks in this script, because in Python it’s easy to reconstruct the original domain from its parts. I should mention that the advantage of using the fileinput module is that the script can be fed its input either through standard input (e.g., from a pipe) or by giving it a file name argument.
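A more compact variant, offered only as a sketch, hands sorted a key function so the domains never have to be taken apart and put back together. The name domain_key is mine. The key is the list of parts in reverse order, and Python compares lists element by element, with a shorter list sorting before a longer one that starts the same way.

```python
def domain_key(domain):
    """Return the domain's parts from TLD down to subdomain."""
    return domain.strip().split(".")[::-1]


sample = ["tjluoma.com", "b.tjluoma.com", "wordpress.co",
          "5by5.fm", "4by4.5by5.fm", "atp.fm"]

sorted_sample = sorted(sample, key=domain_key)
print("\n".join(sorted_sample))
```

Because the key is computed from each string and then thrown away, the original text comes through untouched, which is the same idea as tacking $0 onto the end of each line in the awk version.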

I can’t honestly say this is any better than T.J.’s solution, but I had fun working it out.


  1. It reminds me of tac, which, unfortunately, Apple doesn’t include with OS X, but which can be installed through Homebrew as part of the coreutils package.