The key to sorting in Python

A couple of weeks ago, a member of The Incomparable Slack complained that the list of offerings on the new Disney+ service wan’t convenient for scanning. Too many titles are clumped together in the Ds because they start with “Disney’s.” To waste some time when I should have been raking leaves, I played around with ways to sort titles that would get around that problem. Later that week, I was able to recoup some of that wasted time by using what I’d learned to sort a long list in a program I was writing for work.

When computer scientists discuss sorting, they usually focus on the efficiency of various algorithms. Here, I’m more interested in the conventions we use to decide which items should be sorted before others. In the US, for example, the convention is that people’s names should be sorted alphabetically in lastname firstname order, even when they are presented in firstname lastname order. When you are used to that convention, it’s jarring to see it reversed. I’ve noticed this happening more often as it’s become more common for lists to be sorted by programs than by people.

When I look for Kindle book deals at Amazon, for example, I often scan the lists of authors whose books are on sale. Amazon presents the author list sorted by first name.

Amazon authors sorted by first name

Ben Winters doesn’t belong in B; he should be in W with Donald Westlake. Obviously, Amazon knows the right way to do this but has decided it isn’t worth the effort.1

A simple way to avoid the “too many Disneys” problem is to extend the longstanding convention of moving definite and indefinite articles to the end of a title: “A Winter’s Tale” gets alphabetized as “Winter’s Tale A” and ends up in the W section instead of the A section. All we have to do is treat “Disney’s” the same way we treat “a,” “an,” and “the.”

As I was thinking about implementing this in Python, the apostrophe in “Disney’s” alerted me to another problem: non-ASCII characters. I wasn’t sure how Python treated them and whether that was how I wanted them treated. So I did some experimenting.

Sorting in Python is done by either the sort method (if you want to sort a list in place) or the sorted function (if you want to create a new list that’s sorted). By default, a list of strings will be sorted by comparing each character in turn, but you can change that by specifying a comparison function through the key parameter:

mylist.sort(key=myfunction)

The comparison function must be written to take a single instance of the kind of thing being sorted and return a value. The returned value can be a number, a string, a list, or anything else that Python already knows how to sort.

Here was my first script for sorting Disney titles:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import re
 4:  import sys
 5:  
 6:  articles = ['a', 'an', 'the', 'disney', 'disneys']
 7:  punct = re.compile(r'[^\w ]')
 8:  
 9:  titles = sys.stdin.readlines()
10:  titles = [ t.strip() for t in titles ]
11:  
12:  def titlekey(title):
13:          s = title.casefold()
14:          s = punct.sub('', s)
15:          w = s.split()
16:          while w[0] in articles:
17:                  w = w[1:] + w[:1]
18:          return w
19:  
20:  titles.sort(key=titlekey)
21:  print('\n'.join(titles))

It expects standard input to consist of a series of titles, one per line, and outputs a similar series of lines but with titles in alphabetical order. This input:

A Tiger Walks
The Løve Bug
Måry Poppîns
That Darn Cát!
One Hundred and One Dalmatians
Pollyanna
Kidnapped
Dumbo
The Sign of “Zörro”
The Prinçess and the Frog
The Parent Trap
Kim Poßible
Adventures in Babysitting
Boy Meets World
Disney’s The Kid
Disney’s The Christmas Carol
Disney’s A Christmas Carol
Disney’s Fairy Tale Weddings
James and the Giant Peach
Moana
Melody Time
Mulan
The Many Adventures of Winnie the Pooh

produces this output:

Adventures in Babysitting
Boy Meets World
Disney’s A Christmas Carol
Disney’s The Christmas Carol
Dumbo
Disney’s Fairy Tale Weddings
James and the Giant Peach
Disney’s The Kid
Kidnapped
Kim Poßible
The Løve Bug
The Many Adventures of Winnie the Pooh
Melody Time
Moana
Mulan
Måry Poppîns
One Hundred and One Dalmatians
The Parent Trap
Pollyanna
The Prinçess and the Frog
The Sign of “Zörro”
That Darn Cát!
A Tiger Walks

Note that I’ve changed a character or two in many titles to see how non-ASCII characters get sorted.

The bulk of the script’s logic is in the titlekey function, which gets passed as the key parameter in the sort call on Line 20. titlekey starts by applying the casefold method to the input, which the documentations describes as “similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string.” We don’t want uppercase letters sorting before lowercase, and I was hoping casefold would also handle non-ASCII characters gracefully.

Line 14 then gets rid of all the punctuation in the title. For our purposes, punctuation is defined on Line 7 as everything that isn’t a word character (letter, numeral, or underscore) or a space. I thought I could use the punctuation item defined in Python’s string module, but it doesn’t include curly quotes, em and en dashes, or other non-ASCII punctuation.

Line 15 splits the string into words, and Lines 16–17 loop through the words, moving articles (as defined in Line 6 to include both “disney” and “disneys” in addition to actual articles) to the end of the list. Line 18 returns the list of rearranged title words.

An interesting thing about titlekey is that it can deal with more than one article. As you can see from the sorted list, “Disney’s The Kid” was put in the K group. When passed to titlekey, it returned the list ['kid', 'disneys', 'the'].

But all is not well with titlekey. Note that “Måry Poppîns” got placed after “Mulan,” which doesn’t seem right to me. Clearly, Python thinks the non-ASCII å comes after the ASCII u, which is not how I think of it. I think characters with diacritical marks should sort like their unadorned cousins.2

So here’s a revised version of the script:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from unidecode import unidecode
 4:  import re
 5:  import sys
 6:  
 7:  articles = ['a', 'an', 'the', 'disney', 'disneys']
 8:  punct = re.compile(r'[^\w ]')
 9:  
10:  titles = sys.stdin.readlines()
11:  titles = [ t.strip() for t in titles ]
12:  
13:  def titlekey(title):
14:          s = unidecode(title).lower()
15:          s = punct.sub('', s)
16:          w = s.split()
17:          while w[0] in articles:
18:                  w = w[1:] + w[:1]
19:          return w
20:  
21:  titles.sort(key=titlekey)
22:  print('\n'.join(titles))

There are only two changes: I imported the unidecode module in Line 3 and used its unidecode function in Line 14 instead of casefold. What unidecode does is transliterate non-ASCII characters into their ASCII “equivalent.” It’s a Python port of a Perl module and shares its advantages and disadvantages. For accented characters it does a good job.

The new version of titlekey returns ['mary', 'poppins'] when given Måry Poppîns, so it sorts the list of titles the way I expect.

Of course, unidecode’s transliteration is not an unalloyed success. If we feed the new script

Måry Poppîns
Märy Poppîns II
Máry Poppîns II
Märy Poppîns
Måry Poppîns II
Máry Poppîns
Moana
Melody Time
Mulan
The Many Adventures of Winnie the Pooh

we get back

The Many Adventures of Winnie the Pooh
Måry Poppîns
Märy Poppîns
Máry Poppîns
Märy Poppîns II
Máry Poppîns II
Måry Poppîns II
Melody Time
Moana
Mulan

This is pretty good, but notice that because the accented as are all treated the same, they don’t get sorted among themselves. Note that in the “Mary Poppins” set, the order goes

å ä á

while in the “Mary Poppins II” set, the order goes

ä á å

This is because Python’s sort is stable—items that have the same value come out of the sort in the same order they went in. In the original list, “Måry Poppîns” came before “Märy Poppîns” but “Märy Poppîns II” came before “Måry Poppîns II,” so that’s the order they come out.

Because our older version of the script—the one that uses casefold—doesn’t replace the accented characters, it does what might be considered a better job with all the “Mary Poppins” titles. Here’s how it sorts the list:

The Many Adventures of Winnie the Pooh
Melody Time
Moana
Mulan
Máry Poppîns
Máry Poppîns II
Märy Poppîns
Märy Poppîns II
Måry Poppîns
Måry Poppîns II

Obviously, it still has the problem of putting them all after “Mulan,” but there’s a nice regularity to its “Mary Poppins” series.

What would be the best sorting? I’d say

The Many Adventures of Winnie the Pooh
Máry Poppîns
Märy Poppîns
Måry Poppîns
Máry Poppîns II
Märy Poppîns II
Måry Poppîns II
Melody Time
Moana
Mulan

This puts all the “Marys” before “Melody Time,” puts all the “IIs” after the originals, and puts the variously accented as in the same order in both the “Mary Poppins” and “Mary Poppins II” sections. Can we do this? Yes, by making two lists in titlekey, one that transliterates via unidecode and another that just does casefold. Then we sort according to a compound list of lists:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  from unidecode import unidecode
 4:  import re
 5:  import sys
 6:  
 7:  articles = ['a', 'an', 'the', 'disney', 'disneys']
 8:  suffixes = {'ii':2, 'iii':3, 'iv':4, 'v':5, 'vi':6, 'vii':7, 'viii':8, 'ix':9, 'x':10}
 9:  punct = re.compile(r'[^\w ]')
10:  
11:  titles = sys.stdin.readlines()
12:  titles = [ t.strip() for t in titles ]
13:  
14:  def titlekey(title):
15:          s = unidecode(title).lower()
16:          r = title.casefold()
17:          s = punct.sub('', s)
18:          r = punct.sub('', r)
19:          w = s.split()
20:          v = r.split()
21:          if w[-1] in suffixes:
22:                  number = f'{suffixes[w[-1]]:02d}'
23:                  w = w[:-1] + [number]
24:                  v = v[:-1]
25:          while w[0] in articles:
26:                  w = w[1:] + w[:1]
27:          while v[0] in articles:
28:                  v = v[1:] + v[:1]
29:          return [w, v]
30:  
31:  titles.sort(key=titlekey)
32:  print('\n'.join(titles))

Note also that the dictionary suffixes, which we use to better handle the possibility of Roman numerals at the end of a title. The idea is to convert them to Arabic so IX doesn’t come before V. I stopped at X, but that dictionary could easily be extended if necessary.3

The value returned by titlekey in Line 29 is a list of two lists. The first is basically what the unidecode version of the script gave us but with a two-digit Arabic number (see Lines 22–23) at the end if the title ended with a Roman numeral. The second is the list returned by the casefold version of the script, but with any Roman numeral stripped off.

By setting up the return value in this nested way, sort compares titles first by their unidecoded version and then by their casefolded version. That gives the ordering I like, with accented characters generally treated as unaccented but with different accents sorted consistently.

I’m certain you can come up with lists that won’t sort properly with this final version of the script. If you feel compelled to send them to me, make sure you include a script that can handle them.

I mentioned at the top that I used what I learned during my raking avoidance in a script for work. The work script had nothing to do with alphabetizing non-ASCII characters, but it did use the key parameter.

What I had was a list of strings that looked like “402-1,” “804-13,” and “1201-2.” Because of how they were entered, they were jumbled up, and I needed them sorted numerically by the first number, then the second number. Because there were no leading zeros, an alphabetical sort wouldn’t work. I did it by passing a lambda function as the key:

python:
mylist.sort(key=lambda x: [ int(y) for y in x.split('-') ])

The key function splits the string on the hyphen, converts each substring to an integer, and returns the list of integer pairs. Simple and easy to write, but something I would have spent more time on if I hadn’t been thinking about key just a few days earlier.


  1. Or they’ve done some horrible A/B comparison and decided that people stay on the site longer when lists are alphabetized the wrong way. 

  2. An argument certainly can be made for leaving “Måry Poppîns” where it is. An å is not the same as an a, and maybe it should be sorted after u. But because I’m writing this for my own purposes (to avoid raking leaves), I get to decide the most reasonable sorting order. By revising the script to get å to sort like a, I stay at my computer longer. 

  3. I suspect there’s a Roman numeral conversion module, too, which would be better than a dictionary. 


Prompt forever

A lot of my work on the iPad isn’t really on the iPad. It’s command line work that’s executed on a Mac or a Linux server while I use my iPad as a terminal via SSH. A feature recently added to Prompt has made that much easier.

The problem I’ve always had with using Prompt is that it would disconnect from the server if it was the background app for more than a few minutes. This was a “feature” of iOS that the folks at Panic couldn’t seem to get around. Because it’s common for me to jump between three or four apps while working, I couldn’t always keep Prompt as one of the active apps in Split View, and I’d often have to reconnect.

I tried making the reconnection as painless as possible by using tmux. This worked, but working in a tmux session didn’t allow me to scroll back freely to see the results of old commands. Using the mobile shell, Mosh, and the Blink app had the same scrolling problem. (I should make it clear that when using tmux or mosh I could use keyboard commands to scroll back a page at a time, but I didn’t want to work as if I were sitting at a text-only terminal in 1982. I like working at the command line, but I want to do so in a modern setting.)

In recent months, the terminal feature built in to Textastic has been my workhorse. Because I tend to keep Textastic active in one of the Split View panes as I take notes, edit a program, or write a report, its connection to the server seldom gets culled by the operating system. Unfortunately, its terminal emulation isn’t as full-featured as Prompt’s—it’s fine for simple tasks, but not great for a Jupyter console session.

Recently, though, I learned that a new feature of Prompt may give me everything I want. It’s called Connection Keeper, and when I first heard of it I had two misconceptions:

  1. I thought it was a temporary workaround to the terrible background process culling that iOS 13.2.1 was doing and wouldn’t be helpful once Apple fixed that in 13.2.2.
  2. I thought it was essentially built into Prompt, not a setting that has to be turned on.

I was set straight earlier this week by Athanasios Alexadrides and Anders Borum.

Here’s what Panic says in the latest release notes:

Prompt’s new Connection Keeper feature lets you audit exactly when and where you’ve connected to your servers. It also helps keep your connections alive while Prompt is in the background.

You can turn this feature on in “Settings > Connection Keeper”

While I’m sure there are plenty of people who want to track where and when they connected to the server, it’s the part tucked after the “also” that excites me. Since turning Connection Keeper on several days ago, every SSH session in Prompt has had an essentially permanent connection, no matter how long I’ve had Prompt in the background.

When Prompt is connected to a server and in the background, this flag appears in the statusbar:

Prompt background flag

It’s similar to the background flags that appear when you’re talking on the phone or getting directions in Maps. As with those flags, tap it and you’re taken back to Prompt.

As Panic says in the release notes, you turn Connection Keeper on in Prompt’s settings, but you might have trouble finding it. Like many gear/hamburger “menus” in iPadOS, Prompt’s Settings popup is limited in height. On my 12.9″ iPad, only the items on the left are shown when I tap the gear icon.

Prompt settings menu items

At first, I didn’t notice the line under the Keyboard item that indicates there are more items available. Scrolling up revealed all the other items shown on the right.

Turning Connection Keeper on is pretty much what you’d expect: flip the slider button to the right.

Connection Keeper

As with the release notes, the instructions here focus more on connection history than connection maintenance. I suspect Panic doesn’t want to oversell connection maintenance because it’s not entirely under their control; they know Apple could kill it with another point release.

But until that happens, I’m enjoying SSH connections that last as long as I want them to. One more step in the direction of making the iPad into a full-featured computer.


A little chart adjustment

One of the best things about today’s introduction of the 16″ MacBook Pro was that it inspired Marco Arment to write a new blog post—a rare treat. I am perhaps unusual in that I think Marco is a better writer than he is a talker. I like his talking; I just like his writing more.

Of course he commented on the new keyboard, and his comments were accompanied by this cute graph comparing the new keyboard with the old one and some others:

Marco's chart

It’s nicely done. Displaying the key spacing and travel as a stacked pair of charts was a good idea. Putting them together in a single chart with the spacing and travel columns next to one another would have been the easy choice, but it wouldn’t be nearly as clear. And the decision to show the key travel columns growing down—the direction of travel—was inspired. It does what good charts do: give you an instant understanding of the point being made while also providing a clear way to explore the data further.

But I have opinions about chart style, and I think a handful of small improvements could be made. Most of these could not be made within the charting software Marco was using (Numbers, I think); the charts would have to be imported into a drawing package and manipulated there. But the small extra effort would be worth it.

I don’t have the data Marco was working from, and I wanted some practice editing images in Pixelmator on my iPad, so I edited his PNG image instead of creating a new chart from scratch. Here’s his chart with my edits:

Edited version of Marco's chart

First, I made it seem more like a single chart by getting rid of the superfluous second set of categories and moving the lower chart up to make it more obvious that it and the upper chart are sharing the categories. The legends were Marco’s only bad idea, so I got rid of them. A chart with only one data set doesn’t need a legend, it needs a title. I confess my placement of the titles could be better.

I also changed two of the grid lines in the lower chart, thickening the one at 0 and thinning the one at 5 to make it clear which was the base from which the red columns were growing. And I deleted the minus signs in front of the ordinate labels. I’m sure Marco used negative values to trick the charting software into making his columns grow downward, but we think of key travel as a positive number, and that’s how it should be labeled.

One change I couldn’t make in Pixelmator, but is the change I first thought of when looking at Marco’s original, is to adjust the ticks and grid lines to whole millimeters. This is a pet peeve of mine: charting algorithms often make unnatural decisions about default tick spacing, setting the marks at places that no human would. In this case, the algorithm no doubt felt that four divisions was best, leaving us with a silly grid spacing of 1.25 mm. It is true that adding an extra grid line would make the chart a little more busy, but the advantage of having simple numbers as the labels outweighs the additional clutter. This is especially true given that one of the main points of the new keyboard is that it’s travel is 1.0 mm.

My apologies to Marco for this. His chart really is good. I just can’t help myself.


Accidents and estimates

I came very close to being in a car accident on Friday, which got me thinking about kinematics and estimation in engineering calculations.

I was stopped in a line of cars at a traffic light when something—probably the squeal of brakes, although I may have heard that after later—made me look up into my rear view mirror. A car going way too fast was coming up behind me. I leaned back in my seat, put my hands on my head, and closed my eyes, waiting for the impact.

Which never came.

After a couple of seconds, I opened my eyes and looked in the mirror. There was a car back there, slanted out toward the shoulder, but it was much further away than I expected. Then I noticed a car on the shoulder to my right and ahead of me. That was the car I had expected to hit me. The driver had managed to swerve to the right and avoid me.

That led to some conflicting feelings. I was pleased he was skillful enough to steer out of the accident but angry at his stupidity in needing to exercise that skill. Then the engineer in me took over. If he came to a stop ahead of me, how fast would he have hit me if he hadn’t veered to the right?

It’s a pretty simple calculation, the kind you learn in high school physics. There are two equations of kinematics we need:

[d = v_0 t - \frac{1}{2} \alpha g t^2]

and

[v_0 = \alpha g t]

These are covering the period of time from when his front bumper passed my rear bumper to when he came to rest. The distance traveled is [d], his speed at the beginning of this period is [v_0], the duration is [t], and the deceleration (assumed constant) is [\alpha g]. It’s common in situations like this to express the acceleration or deceleration as a fraction of the acceleration due to gravity; [\alpha] is a pure number.

We don’t really care about [t], so with a little algebra we can turn these into a single formula with only the variables of interest:

[v_0 = \sqrt{2 \alpha g d}]

Based on where the car ended up, I’d say [d] is about 25 feet. The deceleration factor, [\alpha] is a bit more dicey to estimate, but it’s likely to be somewhere around 0.6 to 0.8. And since we’re using feet for distance, we’ll use 32.2 [\mathrm{ft/s^2}] for [g]. That gives us a range of values from 31 to 36 [\mathrm{ft/s}] for [v_0]. Converting to more conventional units for car speeds, that puts him between 21 and 24 mph. That would have been a pretty good smack. Not only would my trunk have been smashed in, I likely would have had damage to my front bumper from being pushed into the car ahead of me.

This was a simple calculation, but it illustrates an interesting feature of estimation. Despite starting with a fairly wide range in our estimate of [\alpha] (0.8 is 33% higher than 0.6), we ended up with a much narrower range in the estimate of [v_0] (24 is only about 15% higher than 21). For this we have the square root to thank. It cuts the relative error in half.

Why? Let’s say we have this simple relationship:

[a = \sqrt{b}]

We can express our uncertainty in the value of [b] by saying our estimate of it is [b (1 + \epsilon)], where [\epsilon] is the relative error in the estimate. We can then say

[\sqrt{b(1 + \epsilon)} = \sqrt{b}\sqrt{1 + \epsilon}]

and using the Taylor series expansion of the second term about [\epsilon = 0], we get

[\sqrt{b} \left( 1 + \frac{1}{2}\epsilon + \mathrm{h.o.t} \right)]

If the absolute value of [\epsilon] is small, the higher order terms (h.o.t) won’t amount to much, and the relative error of [a] will be about half that of [b].

Lots of engineering estimates aren’t as forgiving as this one, so it’s important to know when your inputs have to be known precisely and when you can be a little loose with them.

Speaking of forgiving, I searched for rear end crash test results for my car to see how much damage it would have taken. I came up empty, but here’s a more recent model undergoing an impact at twice the speed.