Mail links and percentages

Yesterday, John Voorhees wrote a nice article at MacStories about creating links to specific email messages. His system is in the form of a Shortcut, but the real work is done by an AppleScript. The AppleScript is an extension of one John Gruber wrote 15 years ago.[1]

The script works by extracting the message ID of the currently selected mail message and assembling it into a URL of this form:

message://<message ID>

The message ID itself will look something like this:

So if you embed a Markdown link of the form

[Message subject](message://<>)

into any number of note-taking apps, you’ll get a nice link which will, when clicked on, open that message in Mail. John’s Shortcut has some other niceties, too, but I want to focus on the message:// URL.

The article prompted a very good question from Michael Tsai:

Haven’t found any messages where the Message-ID needed to be percent-escaped?

John’s answer was no, and Gruber chimed in that he hadn’t run into a problem in the 15 years he’s been using his script. I used to use Gruber’s script (or some variant of it) to link to emails in my work notes, and I hadn’t needed to percent-encode any of my message IDs, either. But was that a sufficient answer? I wanted to find out.

If you’ve been wrangling URLs for any length of time, you’ve run into percent-encoding. Certain characters (reserved characters) have a special meaning in a URL, and to use such a character without invoking its special meaning, you have to convert it to a percent symbol (%) followed by its hex character code. For example, a question mark is changed to %3F, an ampersand is changed to %26, and a percent symbol is changed to %25.
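As a quick check on those three examples, here is a minimal shell sketch. The function name `pct_encode_char` is made up for this illustration, and it assumes a single ASCII character as input. It relies on the POSIX `printf` rule that an argument beginning with a quote is converted to the numeric code of the character that follows.

```shell
# pct_encode_char (hypothetical helper): print the percent-encoded form
# of one ASCII character. The leading quote in the argument tells printf
# to use the character's numeric code, which %02X prints as two hex digits.
pct_encode_char() {
    printf '%%%02X\n' "'$1"
}

pct_encode_char '?'   # %3F
pct_encode_char '&'   # %26
pct_encode_char '%'   # %25
```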

Of course, these are the rules you use for http URLs. Are the same rules needed for message URLs? Given the experiences of the two Johns and me, it seemed unlikely, but I wanted to test it. The questions I needed to answer were:

  1. Are reserved characters used in message IDs?
  2. If so, do they need to be percent-encoded in a message URL?

I started answering the first question by collecting the message IDs from a bunch of emails using this AppleScript:

-- Collect the message IDs of every message in every mailbox of one account
set msgIDs to {}

tell application "Mail"
  tell account "Fastmail"
    set mboxes to every mailbox
    repeat with m in mboxes
      -- Skip empty mailboxes
      if (count of messages in m) is greater than 0 then
        set end of msgIDs to message id of every message of m
      end if
    end repeat
  end tell
end tell

-- Join the IDs into a single block of text, one per line
set text item delimiters to linefeed
get msgIDs as text

This returned a big chunk of text—about 1.4 MB, composed of over 24,000 message IDs, one per line. To figure out whether any reserved characters were used in this corpus, I stole some ideas from Doug McIlroy’s famous word-counting pipeline.

With the output of the above AppleScript on my clipboard, I ran this pipeline:

pbpaste | fold -w 1 | sort | uniq -c

The fold command is the key. It turned the corpus into a new text string with one character per line. The fake message ID up near the top of the post would become


All 24,000 message IDs in my corpus were converted likewise, creating about 1.4 million one-character lines. These lines were then sorted by the sort command and collapsed by the uniq command into 78 lines, one for each unique character. Because of the -c option to uniq, the characters were preceded by their counts. The results, after some reformatting to make them easier to read, were

   12 !      24201 @       1627 T       4680 k
    2 #      33657 A       1343 U      18814 l
 1333 $      38424 B       1100 V      35900 m
    2 %      31145 C       1398 W      27670 n
    6 &      28671 D       3494 X      39174 o
  722 +      29486 E       1726 Y      14475 p
56536 -      26060 F       1110 Z       1778 q
55152 .       1454 G          3 [      14647 r
  110 /       4295 H          3 ]      13117 s
73638 0       1043 I       2013 _      11391 t
68062 1       2676 J      41330 a      14885 u
58154 2       1392 K      15572 b       4940 v
50054 3       2011 L      39224 c       6854 w
65081 4      10691 M      29364 d       2334 x
51243 5       4299 N      43692 e       3051 y
49928 6       1592 O      23604 f       2145 z
49139 7       6646 P      17034 g          3 |
51868 8       1260 Q       4166 h         16 ~
52855 9       7500 R      32222 i      
  773 =       3826 S       1458 j      

As you can see, lots of reserved characters are included in the message IDs. Which led to the second question: Do they need to be percent-encoded?
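If you want to repeat the tally on your own mail and see only the punctuation, a variation on the pipeline might look like the sketch below. The function name `count_punct` and the two sample IDs are made up for illustration; the only change from the pipeline above is a grep that drops letters and digits before sorting.

```shell
# count_punct (hypothetical helper): read message IDs on stdin, one per
# line, and tally every character that is not an ASCII letter or digit.
count_punct() {
    fold -w 1 | LC_ALL=C grep -v '[A-Za-z0-9]' | LC_ALL=C sort | uniq -c
}

# Made-up IDs, for illustration only
printf '%s\n' 'abc%def@example.com' 'x/y+z@example.org' | count_punct
```

On a Mac, with the corpus on the clipboard, the input would come from `pbpaste | count_punct` instead of the printf line.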

I pasted the corpus into BBEdit, and searched for each reserved character in turn. I copied the line with the character in question and tried it out on the command line using open. Here’s an example with a slash:

open 'message://<drdrang/notes/pull/>'

The URL is in single quotes to prevent the shell from interpreting any of its special characters. Running this command brought up an old message from GitHub.

None of the reserved characters caused a problem except the percent sign itself. A command like

open 'message://<>'

would fail unless I converted it to

open 'message://<>'

(I should mention that when I first discovered this, I thought the URL wouldn’t work even after percent-encoding the percent symbol. I thought that because I either mistyped the %25 as %24 or read the ASCII code wrong from the table. However I made the mistake, thanks to Leon Cowle for the tweet that made me realize what I’d done wrong.)

My conclusion is that you probably should be on the lookout for percent signs in your message IDs and change them to %25 if they appear. They’re pretty rare—less than one in a thousand of my emails—but they could screw up your links.

If you have FastScripts (even the free edition), you can do the replacement in AppleScript very easily with this snippet of code:

tell application "FastScripts"
  set msgID to replace text msgID matching pattern "%" replacement pattern "%25"
end tell

where I’ve assumed the message ID is in the variable msgID.
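If you're building the link outside AppleScript, the same replacement can be sketched in shell with sed. The function name `message_url` and the sample ID are made up for this example, and I'm assuming the ID is passed in without its angle brackets and that the URL takes the message://<message ID> form shown earlier.

```shell
# message_url (hypothetical helper): build a message:// URL from a
# message ID, escaping only percent signs (% becomes %25).
# Assumes the ID is given without its surrounding angle brackets.
message_url() {
    printf 'message://<%s>\n' "$(printf '%s' "$1" | sed 's/%/%25/g')"
}

# Made-up ID, for illustration only
message_url 'abc%def@example.com'   # message://<abc%25def@example.com>
```

An ID with no percent signs passes through unchanged, so the function can be applied to every message without checking first.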

I suppose there could be other characters—characters that don’t appear in my corpus—that need percent-encoding, but I don’t think so. There’s a logic to the percent sign being the only character that needs encoding. All the other reserved characters can be encoded. For example, that GitHub message could have been opened with

open 'message://'

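where I encoded all the slashes and the angle brackets. Because Mail decodes percent-encoded sequences like these, it can't treat a percent sign as a literal character; it has to read it as the start of an encoded sequence. But the percent sign is the only character for which this is true.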
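To produce a fully encoded string like that without looking up each code by hand, something along these lines should work in any POSIX shell. This is only a sketch: the function name `pct_encode` and the sample ID are made up, it assumes ASCII input, and it encodes everything except letters, digits, and the four characters - . _ ~, which is more than Mail appears to need.

```shell
# pct_encode (hypothetical helper): percent-encode every character of an
# ASCII string except letters, digits, and - . _ ~
pct_encode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character on its own
        s=$rest
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# Made-up ID, for illustration only
pct_encode '<user/repo/pull/1@example.com>'
# %3Cuser%2Frepo%2Fpull%2F1%40example.com%3E
```

A percent sign in the input comes out as %25, so the one character that has to be encoded is covered along with the rest.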

  1. Ironically, most of the extensions have to do with turning the URL into a Markdown link.