The key to sorting in Python

November 23, 2019 at 1:09 PM by Dr. Drang

A couple of weeks ago, a member of The Incomparable Slack complained that the list of offerings on the new Disney+ service wasn’t convenient for scanning. Too many titles are clumped together in the Ds because they start with “Disney’s.” To waste some time when I should have been raking leaves, I played around with ways to sort titles that would get around that problem. Later that week, I was able to recoup some of that wasted time by using what I’d learned to sort a long list in a program I was writing for work.

When computer scientists discuss sorting, they usually focus on the efficiency of various algorithms. Here, I’m more interested in the conventions we use to decide which items should be sorted before others. In the US, for example, the convention is that people’s names should be sorted alphabetically in lastname firstname order, even when they are presented in firstname lastname order. When you are used to that convention, it’s jarring to see it reversed. I’ve noticed this happening more often as it’s become more common for lists to be sorted by programs than by people.

When I look for Kindle book deals at Amazon, for example, I often scan the lists of authors whose books are on sale. Amazon presents the author list sorted by first name.

[Image: Amazon authors sorted by first name]

Ben Winters doesn’t belong in B; he should be in W with Donald Westlake. Obviously, Amazon knows the right way to do this but has decided it isn’t worth the effort.¹

A simple way to avoid the “too many Disneys” problem is to extend the longstanding convention of moving definite and indefinite articles to the end of a title: “A Winter’s Tale” gets alphabetized as “Winter’s Tale A” and ends up in the W section instead of the A section. All we have to do is treat “Disney’s” the same way we treat “a,” “an,” and “the.”

As I was thinking about implementing this in Python, the apostrophe in “Disney’s” alerted me to another problem: non-ASCII characters. I wasn’t sure how Python treated them and whether that was how I wanted them treated. So I did some experimenting.

Sorting in Python is done by either the sort method (if you want to sort a list in place) or the sorted function (if you want to create a new list that’s sorted). By default, a list of strings will be sorted by comparing each character in turn, but you can change that by specifying a key function through the key parameter:

mylist.sort(key=myfunction)

The key function must be written to take a single instance of the kind of thing being sorted and return a value. The returned value can be a number, a string, a list, or anything else that Python already knows how to sort. Items are then ordered by comparing those returned values rather than the items themselves.
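
As a minimal illustration of the key parameter (the word list and the lowercase_key name are made up for this example and don’t appear in the scripts below), the function is called once per item and the items are ordered by what it returns:

```python
# Hypothetical sample data, chosen only to illustrate the key parameter.
words = ['pear', 'apple', 'Fig', 'banana']

# The default sort compares strings character by character, so the
# uppercase 'F' sorts before every lowercase letter.
default_order = sorted(words)

# A key function takes one item and returns the value to sort by.
def lowercase_key(word):
    return word.lower()

by_letter = sorted(words, key=lowercase_key)

# Any one-argument callable works, including built-ins such as len.
# sort() rearranges the list in place and returns None.
by_length = list(words)
by_length.sort(key=len)

print(default_order)  # ['Fig', 'apple', 'banana', 'pear']
print(by_letter)      # ['apple', 'banana', 'Fig', 'pear']
print(by_length)      # ['Fig', 'pear', 'apple', 'banana']
```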

Here was my first script for sorting Disney titles:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import re
 4:  import sys
 5:  
 6:  articles = ['a', 'an', 'the', 'disney', 'disneys']
 7:  punct = re.compile(r'[^\w ]')
 8:  
 9:  titles = sys.stdin.readlines()
10:  titles = [ t.strip() for t in titles ]
11:  
12:  def titlekey(title):
13:          s = title.casefold()
14:          s = punct.sub('', s)
15:          w = s.split()
16:          while w[0] in articles:
17:                  w = w[1:] + w[:1]
18:          return w
19:  
20:  titles.sort(key=titlekey)
21:  print('\n'.join(titles))

It expects standard input to consist of a series of titles, one per line, and outputs a similar series of lines but with titles in alphabetical order. This input:

A Tiger Walks
The Løve Bug
Måry Poppîns
That Darn Cát!
One Hundred and One Dalmatians
Pollyanna
Kidnapped
Dumbo
The Sign of “Zörro”
The Prinçess and the Frog
The Parent Trap
Kim Poßible
Adventures in Babysitting
Boy Meets World
Disney’s The Kid
Disney’s The Christmas Carol
Disney’s A Christmas Carol
Disney’s Fairy Tale Weddings
James and the Giant Peach
Moana
Melody Time
Mulan
The Many Adventures of Winnie the Pooh

produces this output:

Adventures in Babysitting
Boy Meets World
Disney’s A Christmas Carol
Disney’s The Christmas Carol
Dumbo
Disney’s Fairy Tale Weddings
James and the Giant Peach
Disney’s The Kid
Kidnapped
Kim Poßible
The Løve Bug
The Many Adventures of Winnie the Pooh
Melody Time
Moana
Mulan
Måry Poppîns
One Hundred and One Dalmatians
The Parent Trap
Pollyanna
The Prinçess and the Frog
The Sign of “Zörro”
That Darn Cát!
A Tiger Walks

Note that I’ve changed a character or two in many titles to see how non-ASCII characters get sorted.

The bulk of the script’s logic is in the titlekey function, which gets passed as the key parameter in the sort call on Line 20. titlekey starts by applying the casefold method to the input, which the documentation describes as “similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string.” We don’t want uppercase letters sorting before lowercase, and I was hoping casefold would also handle non-ASCII characters gracefully.
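
A quick sketch of what casefold does and doesn’t do, using titles from the test input. The variable names are just for illustration:

```python
# casefold() and lower() agree on ordinary ASCII letters...
same = 'The Parent Trap'.casefold() == 'The Parent Trap'.lower()

# ...but casefold() also expands the German sharp s, which lower() leaves alone.
folded = 'Kim Poßible'.casefold()
lowered = 'Kim Poßible'.lower()

# casefold() does not remove diacritical marks.
accented = 'Måry Poppîns'.casefold()

print(same)      # True
print(folded)    # kim possible
print(lowered)   # kim poßible
print(accented)  # måry poppîns
```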

Line 14 then gets rid of all the punctuation in the title. For our purposes, punctuation is defined on Line 7 as everything that isn’t a word character (letter, numeral, or underscore) or a space. I thought I could use the punctuation item defined in Python’s string module, but it doesn’t include curly quotes, em and en dashes, or other non-ASCII punctuation.
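
To see the difference between the regular expression and string.punctuation, here is a small check using the same pattern as Line 7. The other variable names are assumptions for the example:

```python
import re
import string

punct = re.compile(r'[^\w ]')  # same pattern as Line 7 of the script

# The regex drops the curly quotes and the curly apostrophe but keeps the
# accented letter, because \w matches Unicode word characters in Python 3.
cleaned_zorro = punct.sub('', 'The Sign of “Zörro”')
cleaned_kid = punct.sub('', 'Disney’s The Kid')

# string.punctuation holds only ASCII punctuation.
has_curly = '“' in string.punctuation
has_straight = '"' in string.punctuation

print(cleaned_zorro)  # The Sign of Zörro
print(cleaned_kid)    # Disneys The Kid
print(has_curly)      # False
print(has_straight)   # True
```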

Line 15 splits the string into words, and Lines 16–17 loop through the words, moving articles (as defined in Line 6 to include both “disney” and “disneys” in addition to actual articles) to the end of the list. Line 18 returns the list of rearranged title words.

An interesting thing about titlekey is that it can deal with more than one article. As you can see from the sorted list, “Disney’s The Kid” was put in the K group. When passed to titlekey, it returned the list ['kid', 'disneys', 'the'].
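
One caveat about the loop in Lines 16–17: it assumes every title has at least one word that isn’t an article. A blank input line would raise an IndexError on w[0], and a title made up entirely of articles would rotate forever. The following is a sketch of one way to bound the loop. The rotate_articles name is hypothetical and this is not part of the original script:

```python
articles = ['a', 'an', 'the', 'disney', 'disneys']

def rotate_articles(words):
    """Move leading articles to the end, rotating at most len(words) times."""
    w = list(words)
    for _ in range(len(w)):       # an empty list skips the loop entirely
        if w[0] not in articles:
            break
        w = w[1:] + w[:1]
    return w

print(rotate_articles(['disneys', 'the', 'kid']))  # ['kid', 'disneys', 'the']
print(rotate_articles([]))                         # []
print(rotate_articles(['the', 'a']))               # ['the', 'a']
```

For ordinary titles it gives the same result as the while loop, so it could be dropped in without changing the sorted output shown above.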

But all is not well with titlekey. Note that “Måry Poppîns” got placed after “Mulan,” which doesn’t seem right to me. Clearly, Python thinks the non-ASCII å comes after the ASCII u, which is not how I think of it. I think characters with diacritical marks should sort like their unadorned cousins.²
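
The reason is that Python compares strings by Unicode code point, one character at a time. A short check, with variable names chosen for this example:

```python
# Code points for the three characters in question.
code_points = {ch: ord(ch) for ch in 'auå'}

# 'mulan' and 'måry' first differ at index 1, where 'u' (117) < 'å' (229).
order = sorted(['måry', 'mulan', 'moana'])

print(code_points)  # {'a': 97, 'u': 117, 'å': 229}
print(order)        # ['moana', 'mulan', 'måry']
```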

So here’s a revised version of the script:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from unidecode import unidecode
 4:  import re
 5:  import sys
 6:  
 7:  articles = ['a', 'an', 'the', 'disney', 'disneys']
 8:  punct = re.compile(r'[^\w ]')
 9:  
10:  titles = sys.stdin.readlines()
11:  titles = [ t.strip() for t in titles ]
12:  
13:  def titlekey(title):
14:          s = unidecode(title).lower()
15:          s = punct.sub('', s)
16:          w = s.split()
17:          while w[0] in articles:
18:                  w = w[1:] + w[:1]
19:          return w
20:  
21:  titles.sort(key=titlekey)
22:  print('\n'.join(titles))

There are only two changes: I imported the unidecode module in Line 3 and used its unidecode function in Line 14 instead of casefold. What unidecode does is transliterate non-ASCII characters into their ASCII “equivalent.” It’s a Python port of a Perl module and shares its advantages and disadvantages. For accented characters it does a good job.
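
unidecode is a third-party package, so it has to be installed separately. If that isn’t an option, a rough standard-library approximation is to decompose each character with unicodedata and throw away the combining marks. To be clear, this is a different and narrower technique than unidecode’s transliteration, and the strip_accents name is hypothetical:

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (standard library only)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents('Måry Poppîns').lower()

# Letters with no decomposition, such as ø, pass through unchanged,
# so this handles fewer characters than a real transliteration.
unchanged = strip_accents('Løve')

print(plain)      # mary poppins
print(unchanged)  # Løve
```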

The new version of titlekey returns ['mary', 'poppins'] when given Måry Poppîns, so it sorts the list of titles the way I expect.

Of course, unidecode’s transliteration is not an unalloyed success. If we feed the new script

Måry Poppîns
Märy Poppîns II
Máry Poppîns II
Märy Poppîns
Måry Poppîns II
Máry Poppîns
Moana
Melody Time
Mulan
The Many Adventures of Winnie the Pooh

we get back

The Many Adventures of Winnie the Pooh
Måry Poppîns
Märy Poppîns
Máry Poppîns
Märy Poppîns II
Máry Poppîns II
Måry Poppîns II
Melody Time
Moana
Mulan

This is pretty good, but notice that because the accented versions of a are all treated the same, they don’t get sorted among themselves. Note that in the “Mary Poppins” set, the order goes

å ä á

while in the “Mary Poppins II” set, the order goes

ä á å

This is because Python’s sort is stable—items that have the same value come out of the sort in the same order they went in. In the original list, “Måry Poppîns” came before “Märy Poppîns” but “Märy Poppîns II” came before “Måry Poppîns II,” so that’s the order they come out.
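
Stability is easy to see with a small made-up list in which several items share the same key (the labels here are only for illustration):

```python
# Hypothetical (key, label) pairs; three of them share the key 'mary'.
items = [('mary', 'first'), ('moana', 'x'), ('mary', 'second'), ('mary', 'third')]

# Sort on the key alone. Items with equal keys keep their input order.
result = sorted(items, key=lambda pair: pair[0])
labels = [label for key, label in result if key == 'mary']

print(labels)      # ['first', 'second', 'third']
print(result[-1])  # ('moana', 'x')
```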

Because our older version of the script—the one that uses casefold—doesn’t replace the accented characters, it does what might be considered a better job with all the “Mary Poppins” titles. Here’s how it sorts the list:

The Many Adventures of Winnie the Pooh
Melody Time
Moana
Mulan
Máry Poppîns
Máry Poppîns II
Märy Poppîns
Märy Poppîns II
Måry Poppîns
Måry Poppîns II

Obviously, it still has the problem of putting them all after “Mulan,” but there’s a nice regularity to its “Mary Poppins” series.

What would be the best sorting? I’d say

The Many Adventures of Winnie the Pooh
Máry Poppîns
Märy Poppîns
Måry Poppîns
Máry Poppîns II
Märy Poppîns II
Måry Poppîns II
Melody Time
Moana
Mulan

This puts all the “Marys” before “Melody Time,” puts all the “IIs” after the originals, and puts the variously accented versions of a in the same order in both the “Mary Poppins” and “Mary Poppins II” sections. Can we do this? Yes, by making two lists in titlekey, one that transliterates via unidecode and another that just does casefold. Then we sort according to a compound list of lists:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from unidecode import unidecode
 4:  import re
 5:  import sys
 6:  
 7:  articles = ['a', 'an', 'the', 'disney', 'disneys']
 8:  suffixes = {'ii':2, 'iii':3, 'iv':4, 'v':5, 'vi':6, 'vii':7, 'viii':8, 'ix':9, 'x':10}
 9:  punct = re.compile(r'[^\w ]')
10:  
11:  titles = sys.stdin.readlines()
12:  titles = [ t.strip() for t in titles ]
13:  
14:  def titlekey(title):
15:          s = unidecode(title).lower()
16:          r = title.casefold()
17:          s = punct.sub('', s)
18:          r = punct.sub('', r)
19:          w = s.split()
20:          v = r.split()
21:          if w[-1] in suffixes:
22:                  number = f'{suffixes[w[-1]]:02d}'
23:                  w = w[:-1] + [number]
24:                  v = v[:-1]
25:          while w[0] in articles:
26:                  w = w[1:] + w[:1]
27:          while v[0] in articles:
28:                  v = v[1:] + v[:1]
29:          return [w, v]
30:  
31:  titles.sort(key=titlekey)
32:  print('\n'.join(titles))

Note also the dictionary suffixes, which we use to better handle the possibility of Roman numerals at the end of a title. The idea is to convert them to Arabic so IX doesn’t come before V. I stopped at X, but that dictionary could easily be extended if necessary.³

The value returned by titlekey in Line 29 is a list of two lists. The first is basically what the unidecode version of the script gave us but with a two-digit Arabic number (see Lines 22–23) at the end if the title ended with a Roman numeral. The second is the list returned by the casefold version of the script, but with any Roman numeral stripped off.
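
A short sketch of why the numbers are zero-padded to two digits. It uses a subset of the suffixes dictionary, and the other names are assumptions for the example:

```python
suffixes = {'ii': 2, 'v': 5, 'ix': 9, 'x': 10}  # subset of the script's dictionary
romans = ['v', 'x', 'ix']

# As plain strings, Roman numerals sort alphabetically.
roman_order = sorted(romans)

# Two-digit strings sort in numeric order.
padded = sorted(f'{suffixes[r]:02d}' for r in romans)

# Without padding, '10' would sort before '5' and '9'.
unpadded = sorted(str(suffixes[r]) for r in romans)

print(roman_order)  # ['ix', 'v', 'x']
print(padded)       # ['05', '09', '10']
print(unpadded)     # ['10', '5', '9']
```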

By setting up the return value in this nested way, sort compares titles first by their unidecoded version and then by their casefolded version. That gives the ordering I like, with accented characters generally treated as unaccented but with different accents sorted consistently.
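
Python compares lists element by element, and that is what makes the nested key work. The keys below are written out by hand in the shape titlekey returns, so no unidecode import is needed to run the check:

```python
# Keys in the shape titlekey returns: [unidecoded words, casefolded words].
key_a = [['mary', 'poppins'], ['máry', 'poppîns']]
key_b = [['mary', 'poppins'], ['måry', 'poppîns']]
key_c = [['mary', 'poppins', '02'], ['máry', 'poppîns']]
key_d = [['melody', 'time'], ['melody', 'time']]

first_lists_tie = key_a[0] == key_b[0]
a_before_b = key_a < key_b  # tie broken by the second list: á (225) < å (229)
b_before_c = key_b < key_c  # a list that is a prefix of another sorts first
c_before_d = key_c < key_d  # 'mary' < 'melody' decides it immediately

print(first_lists_tie, a_before_b, b_before_c, c_before_d)  # True True True True
```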

I’m certain you can come up with lists that won’t sort properly with this final version of the script. If you feel compelled to send them to me, make sure you include a script that can handle them.

I mentioned at the top that I used what I learned during my raking avoidance in a script for work. The work script had nothing to do with alphabetizing non-ASCII characters, but it did use the key parameter.

What I had was a list of strings that looked like “402-1,” “804-13,” and “1201-2.” Because of how they were entered, they were jumbled up, and I needed them sorted numerically by the first number, then the second number. Because there were no leading zeros, an alphabetical sort wouldn’t work. I did it by passing a lambda function as the key:

python:
mylist.sort(key=lambda x: [ int(y) for y in x.split('-') ])

The key function splits the string on the hyphen, converts each substring to an integer, and returns the resulting pair of integers as a list. Simple and easy to write, but something I would have spent more time on if I hadn’t been thinking about key just a few days earlier.
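
Here is that lambda applied to a made-up list in the same format (the actual work data isn’t shown, so these labels are hypothetical), alongside the plain alphabetical sort for comparison:

```python
# Hypothetical labels in the "number-number" format described above.
labels = ['804-13', '1201-2', '402-1', '804-2', '402-10']

alphabetical = sorted(labels)
numeric = sorted(labels, key=lambda x: [int(y) for y in x.split('-')])

print(alphabetical)  # ['1201-2', '402-1', '402-10', '804-13', '804-2']
print(numeric)       # ['402-1', '402-10', '804-2', '804-13', '1201-2']
```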

¹ Or they’ve done some horrible A/B comparison and decided that people stay on the site longer when lists are alphabetized the wrong way.
² An argument certainly can be made for leaving “Måry Poppîns” where it is. An å is not the same as an a, and maybe it should be sorted after u. But because I’m writing this for my own purposes (to avoid raking leaves), I get to decide the most reasonable sorting order. By revising the script to get å to sort like a, I stay at my computer longer.
³ I suspect there’s a Roman numeral conversion module, too, which would be better than a dictionary.
