Sorting by parts
May 4, 2014 at 12:56 AM by Dr. Drang
A few days ago, T.J. Luoma wrote a short, clever post on how he used the rev command in a pipeline to sort a list of domains and subdomains by domain. It’s a good solution, brief and easy to remember, and it introduced me to a command I’d never seen before.¹ It also got me thinking about other ways to solve the problem.
T.J.’s original list is
foo.tjluoma.com
a.luo.ma
bar.luo.ma
b.tjluoma.com
and his one-liner produces
a.luo.ma
bar.luo.ma
b.tjluoma.com
foo.tjluoma.com
This is sorted by top-level domain, then regular domain, then subdomain—which is what T.J. wanted—but it isn’t alphabetically sorted. I wondered how hard an alphabetical sorting by parts would be.
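Why the result groups correctly without being alphabetical is easier to see in a few lines of Python 3. This is only a sketch. It assumes the one-liner amounts to reversing each line, sorting, and reversing back (the original command isn't reproduced here), and sort_by_reversed_text is a made-up name.

```python
# Assumption: the rev-based pipeline reverses each line, sorts, and reverses back.
def sort_by_reversed_text(domains):
    """Sort domains by their character-reversed spelling."""
    return sorted(domains, key=lambda d: d[::-1])

domains = ["foo.tjluoma.com", "a.luo.ma", "bar.luo.ma", "b.tjluoma.com"]

for d in sort_by_reversed_text(domains):
    # Show the reversed string that is actually being compared.
    print(d[::-1], "<-", d)
```

Because the comparison starts from the last character of each domain, "am.oul..." sorts ahead of "moc...", which is why the .ma entries come before the .com entries.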
For T.J.’s list, or any list in which every domain has three parts, it’s not hard at all. The sort command has options to break each line into fields and use these fields to sort on multiple keys. With the list of domains on the clipboard, the command
pbpaste | sort -t. -k3 -k2 -k1
returns
b.tjluoma.com
foo.tjluoma.com
a.luo.ma
bar.luo.ma
which is exactly what I was looking for. The -t. option tells sort to use the period as a field separator. The three -k options tell it to sort first on field 3, then on field 2, and then on field 1. Boom.
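The same multi-key idea can be sketched in Python 3 with a tuple as the sort key. The function name sort_three_part is hypothetical. Strictly speaking the tuple is closer to -k3,3 -k2,2 -k1,1 than to -k3 -k2 -k1, since a bare -k3 key runs to the end of the line, but for a list of three-part domains like this one the result is the same.

```python
def sort_three_part(domains):
    """Sort three-part domains on field 3, then field 2, then field 1."""
    def key(d):
        # Unpacking raises ValueError unless there are exactly three parts.
        sub, name, tld = d.split(".")
        return (tld, name, sub)
    return sorted(domains, key=key)

domains = ["foo.tjluoma.com", "a.luo.ma", "bar.luo.ma", "b.tjluoma.com"]
print("\n".join(sort_three_part(domains)))
```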
Unfortunately, this command won’t work if some of the domains don’t have a subdomain. If, for example, our input is
foo.tjluoma.com
a.luo.ma
bar.luo.ma
b.tjluoma.com
leancrew.com
drdrang.com
daringfireball.net
6by6.5by5.fm
4by4.5by5.fm
5by5.fm
tjluoma.com
atp.fm
wordpress.com
wordpress.net
wordpress.co
the above command gives us
wordpress.co
drdrang.com
leancrew.com
tjluoma.com
wordpress.com
5by5.fm
atp.fm
daringfireball.net
wordpress.net
b.tjluoma.com
foo.tjluoma.com
4by4.5by5.fm
6by6.5by5.fm
a.luo.ma
bar.luo.ma
which is a mess. Because nine of the domains have no subdomain—and therefore only two fields—field 3 is blank for those domains and they get sorted before all the others. Using T.J.’s solution on this mixed list, we get
a.luo.ma
bar.luo.ma
5by5.fm
4by4.5by5.fm
6by6.5by5.fm
atp.fm
tjluoma.com
b.tjluoma.com
foo.tjluoma.com
drdrang.com
wordpress.com
leancrew.com
wordpress.co
daringfireball.net
wordpress.net
which (again) isn’t alphabetical but is clearly much better because it keeps the TLDs and the regular domains together.
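The empty third field that scrambles the field-based sort can be made visible with a short Python 3 sketch. It assumes the usual POSIX behavior, in which a key given as -kN runs from the start of field N to the end of the line and a missing field produces an empty key, and it assumes plain byte-order comparison (as in the C locale). The helper name sort_keys is made up for the illustration.

```python
def sort_keys(line, sep="."):
    """Approximate the three keys used by `sort -t. -k3 -k2 -k1`.

    Assumption: each -kN key runs from field N to the end of the line,
    and a line with fewer than N fields gets an empty key.
    """
    fields = line.split(sep)
    return tuple(sep.join(fields[n - 1:]) for n in (3, 2, 1))

mixed = [
    "foo.tjluoma.com", "a.luo.ma", "bar.luo.ma", "b.tjluoma.com",
    "leancrew.com", "drdrang.com", "daringfireball.net", "6by6.5by5.fm",
    "4by4.5by5.fm", "5by5.fm", "tjluoma.com", "atp.fm",
    "wordpress.com", "wordpress.net", "wordpress.co",
]

for d in sorted(mixed, key=sort_keys):
    # Two-part domains show '' as their first key, so they sort first.
    print(sort_keys(d)[:2], d)
```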
To alphabetize in top-level domain/regular domain/subdomain order, we’ll need to
1. Split each line into fields at the periods.
2. Reorder the fields from back to front.
3. Sort the lines according to the reordered fields.
4. Put the fields back in their original order.
Because awk automatically splits lines into fields, it’s the natural tool for the first two steps. Step 3 is obviously sort’s job. And with a simple trick, I got awk to handle Step 4 without much fuss. Here’s the command:
pbpaste |
awk -F . '{for(i=NF;i>0;i--)printf("%s ",$i);printf(" %s\n",$0)}' |
sort |
awk '{print $NF}'
where I’ve split it across four lines to make it easier to read. Running this with the longer list of domains on the clipboard gives
wordpress.co
drdrang.com
leancrew.com
tjluoma.com
b.tjluoma.com
foo.tjluoma.com
wordpress.com
5by5.fm
4by4.5by5.fm
6by6.5by5.fm
atp.fm
a.luo.ma
bar.luo.ma
daringfireball.net
wordpress.net
which is what I was looking for. The sorting is a little easier to see when the list is aligned according to the last dot:
     wordpress.co
       drdrang.com
      leancrew.com
       tjluoma.com
     b.tjluoma.com
   foo.tjluoma.com
     wordpress.com
          5by5.fm
     4by4.5by5.fm
     6by6.5by5.fm
           atp.fm
         a.luo.ma
       bar.luo.ma
daringfireball.net
     wordpress.net
The heavy lifting is done by the first awk command. The -F . option tells it to split fields on the period instead of (the default) whitespace. On each line, the for loop counts down from the number of fields (NF) to one. Awk identifies each field with a symbol like $1, $2, $3, etc. The $i in the first printf statement takes on one of these values each time through, so we print each field in reverse order with a space after it. This reversing loop is one of Eric Pement’s handy awk hints.
Then (and here’s the trick), the next printf, which is outside the for loop, prints another space and then the full text of the original line ($0). The advantage of adding this to the end of the line is that we won’t have to reconstruct the original domain from its parts.
The output of the first awk command is
com tjluoma foo  foo.tjluoma.com
ma luo a  a.luo.ma
ma luo bar  bar.luo.ma
com tjluoma b  b.tjluoma.com
com leancrew  leancrew.com
com drdrang  drdrang.com
net daringfireball  daringfireball.net
fm 5by5 6by6  6by6.5by5.fm
fm 5by5 4by4  4by4.5by5.fm
fm 5by5  5by5.fm
com tjluoma  tjluoma.com
fm atp  atp.fm
com wordpress  wordpress.com
net wordpress  wordpress.net
co wordpress  wordpress.co
You can see how sort will put this in the order we want:
co wordpress  wordpress.co
com drdrang  drdrang.com
com leancrew  leancrew.com
com tjluoma  tjluoma.com
com tjluoma b  b.tjluoma.com
com tjluoma foo  foo.tjluoma.com
com wordpress  wordpress.com
fm 5by5  5by5.fm
fm 5by5 4by4  4by4.5by5.fm
fm 5by5 6by6  6by6.5by5.fm
fm atp  atp.fm
ma luo a  a.luo.ma
ma luo bar  bar.luo.ma
net daringfireball  daringfireball.net
net wordpress  wordpress.net
Note that there are two spaces before the domain. The second space keeps the domain from interfering with the sort because the space character is lower in ASCII order than any character that might be in a domain name.
Because of our trick of adding the domain to the end of each line, we can now get just the part we want by using awk to print the last (whitespace-separated) field of each line ($NF).
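The effect of that second space is easy to check with a rough Python 3 imitation of the awk/sort/awk pipeline. This is a sketch rather than a translation of the one-liner: decorate and sort_decorated are hypothetical names, and the comparison is plain code-point order, which matches sort only under byte-wise (C locale) collation.

```python
def decorate(domain, gap="  "):
    """Reversed fields joined by single spaces, then `gap`, then the domain."""
    return " ".join(reversed(domain.split("."))) + gap + domain

def sort_decorated(domains, gap="  "):
    """Decorate, sort the whole lines, then keep the last whitespace field."""
    decorated = sorted(decorate(d, gap) for d in domains)
    return [line.split()[-1] for line in decorated]

sample = ["foo.tjluoma.com", "tjluoma.com", "b.tjluoma.com"]

# Two spaces: the bare domain sorts ahead of its subdomains.
print(sort_decorated(sample, gap="  "))
# → ['tjluoma.com', 'b.tjluoma.com', 'foo.tjluoma.com']

# One space: the appended domain leaks into the comparison.
print(sort_decorated(sample, gap=" "))
# → ['b.tjluoma.com', 'foo.tjluoma.com', 'tjluoma.com']
```

With a single space, "com tjluoma tjluoma.com" is compared against "com tjluoma b b.tjluoma.com" at the letters t and b, so the bare domain drops below its subdomains.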
I’d like to end the explanation with “boom,” but this solution, while it works, has too many parts for “boom.” In fact, when a one-liner starts to stretch out like this, I usually abandon it before finishing and just write a Python or Perl script. Here’s a short Python script that’s longer but easier to read:
python:
#!/usr/bin/python

import fileinput

# Read in the domains and create a list of lists
# with the domain parts in reverse order.
domainparts = []
for d in fileinput.input():
    dlist = d.strip().split('.')
    dlist.reverse()
    domainparts.append(dlist)

domainparts.sort()

# Print out the sorted domains.
for d in domainparts:
    d.reverse()
    print('.'.join(d))
There’s no need for any tricks in this script, because in Python it’s easy to reconstruct the original domain from its parts. I should mention that the advantage of using the fileinput module is that the script can be fed its input either through standard input (e.g., from a pipe) or by giving it a file name argument.
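A more compact variant, offered only as an alternative sketch and not as the script above, hands the reversed parts to sorted as a key, so the domains themselves are never taken apart and rebuilt. It is written for Python 3, and sort_domains is a made-up name.

```python
def sort_domains(domains):
    """Sort by TLD, then domain, then subdomain, using reversed parts as the key."""
    return sorted(domains, key=lambda d: d.split(".")[::-1])

mixed = [
    "foo.tjluoma.com", "a.luo.ma", "bar.luo.ma", "b.tjluoma.com",
    "leancrew.com", "drdrang.com", "daringfireball.net", "6by6.5by5.fm",
    "4by4.5by5.fm", "5by5.fm", "tjluoma.com", "atp.fm",
    "wordpress.com", "wordpress.net", "wordpress.co",
]

print("\n".join(sort_domains(mixed)))
```

Python compares lists element by element and treats a shorter list as smaller when it is a prefix of a longer one, which is what puts tjluoma.com ahead of b.tjluoma.com.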
I can’t honestly say this is any better than T.J.’s solution, but I had fun working it out.