Regexes, BBEdit, and Twitter screen names
December 22, 2011 at 8:41 PM by Dr. Drang
I found another bug in Dr. Twoot today. A pretty embarrassing one, given the context in which I found it.
Here’s what my Twitter timeline looked like when I opened Dr. Twoot this morning:
For some reason, the screen name for @bbedit_hints wasn’t turned into a link, only the @bbedit part was. It wasn’t too hard to find the reason. Here’s the section of code that finds screen names and turns them into links:
javascript:
57:  // Handle Twitter names, ignoring case.
58:  $.each(users, function(i, u) {
59:    iname = new RegExp('@' + u.screen_name, 'gi');
60:    link = '<a href="http://twitter.com/' + u.screen_name + '">' + '@' + u.screen_name + '</a>';
61:    body = body.replace(iname, link);
62:  }) // each
The users list comes from the user_mentions tweet entity. The anonymous function called by the jQuery each method searches the tweet for the mentioned user’s screen name and replaces it with a link to the user’s Twitter page. If the screen name was miscapitalized in the tweet, the replace changes it to the canonical form. For example, a tweet that says
I love @BBEdit!
will be turned into
I love <a href="http://twitter.com/bbedit">@bbedit</a>!
and will be a working link when displayed by Dr. Twoot.
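To see the canonicalization in isolation, here is a minimal sketch of the same replacement written without jQuery. The users array is a hypothetical stand-in for the user_mentions entity, and linkNames is a name made up for this illustration, not a function in Dr. Twoot.

```javascript
// Hypothetical stand-in for the user_mentions entity of a single tweet.
const users = [{ screen_name: 'bbedit' }];

// Same search-and-replace as the loop above, using Array.prototype.forEach
// instead of jQuery's each. linkNames is an illustrative name only.
function linkNames(body, users) {
  users.forEach(function (u) {
    // 'g' replaces every occurrence, 'i' ignores capitalization.
    const iname = new RegExp('@' + u.screen_name, 'gi');
    const link = '<a href="http://twitter.com/' + u.screen_name + '">' +
                 '@' + u.screen_name + '</a>';
    body = body.replace(iname, link);
  });
  return body;
}

const linked = linkNames('I love @BBEdit!', users);
console.log(linked);
// I love <a href="http://twitter.com/bbedit">@bbedit</a>!
```

The miscapitalized @BBEdit in the tweet comes out as @bbedit because the replacement text is built from the screen name in the entity, not from the matched text.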
This code has been working for months. The bug that was lying in wait until today only appears when one of the users mentioned in a tweet has a screen name that matches the beginning of another screen name in the tweet. When that happens, the shorter screen name can transform the initial portion of the longer screen name into a link and make it impossible for the longer name to be found.
In the tweet above, @bbedit_hints got turned into
<a href="http://twitter.com/bbedit">@bbedit</a>_hints
when the each method was looking for instances of @bbedit. When each was later looking for @bbedit_hints, it was nowhere to be found: the closing anchor tag before the underscore made the search fail.[1]
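The failure can be reproduced outside Dr. Twoot with a short sketch. The tweet text here is invented for the example, the users array is a hypothetical stand-in for the user_mentions entity with the shorter name listed first, and linkNames is an illustrative name rather than anything from the real code.

```javascript
// Hypothetical user_mentions entity: the shorter name comes first.
const users = [{ screen_name: 'bbedit' }, { screen_name: 'bbedit_hints' }];

// The original, buggy replacement: no word boundary after the screen name.
function linkNames(body, users) {
  users.forEach(function (u) {
    const iname = new RegExp('@' + u.screen_name, 'gi');
    const link = '<a href="http://twitter.com/' + u.screen_name + '">' +
                 '@' + u.screen_name + '</a>';
    body = body.replace(iname, link);
  });
  return body;
}

// Invented tweet text that mentions both accounts.
const result = linkNames('Tips from @bbedit_hints for @bbedit users', users);
console.log(result);
// Tips from <a href="http://twitter.com/bbedit">@bbedit</a>_hints for <a href="http://twitter.com/bbedit">@bbedit</a> users
```

After the first pass, the text contains @bbedit</a>_hints, so the second pass has no @bbedit_hints left to match and no link to that account is ever created.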
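The solution was pretty easy. Like many regular expression engines, JavaScript’s has a token, \b, that matches a word boundary. Because the underscore counts as a word character, there is no word boundary between @bbedit and _hints, so the shorter name can no longer match inside the longer one. Including the word boundary in the regex search pattern in Line 59 fixed the problem.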
javascript:
57:  // Handle Twitter names, ignoring case.
58:  $.each(users, function(i, u) {
59:    iname = new RegExp('@' + u.screen_name + '\\b', 'gi');
60:    link = '<a href="http://twitter.com/' + u.screen_name + '">' + '@' + u.screen_name + '</a>';
61:    body = body.replace(iname, link);
62:  }) // each
The extra backslash is one of those annoying, easy-to-miss string escaping things. The backslash is JavaScript’s escape character, so to get a literal backslash into a string you have to use \\.
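A quick sketch shows why the single backslash fails silently instead of raising an error: inside a JavaScript string literal, '\b' is the backspace character (U+0008), so the regex ends up looking for a backspace after the screen name. The variable names are made up for the illustration.

```javascript
// '\b' in a string literal is one character: backspace (U+0008).
const wrong = '@bbedit' + '\b';
// '\\b' is two characters, a backslash and a b, which RegExp reads as \b.
const right = '@bbedit' + '\\b';

console.log(wrong.length, right.length);  // 8 9

const wrongRe = new RegExp(wrong, 'i');
const rightRe = new RegExp(right, 'i');

console.log(wrongRe.test('I love @BBEdit!'));       // false: no backspace in the tweet
console.log(rightRe.test('I love @BBEdit!'));       // true: boundary between t and !
console.log(rightRe.test('Ask @bbedit_hints'));     // false: _ is a word character
```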
With the fix in place, the screen names were detected properly and they all linked to the correct Twitter pages.
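As a rough check of the fix, here is the corrected pattern applied to an invented tweet that mentions both accounts. As before, the users array is a hypothetical stand-in for the user_mentions entity and linkNamesFixed is a name chosen for this sketch only.

```javascript
// Hypothetical user_mentions entity, shorter name first (the order that
// triggered the bug).
const users = [{ screen_name: 'bbedit' }, { screen_name: 'bbedit_hints' }];

// Same loop as before, with a word boundary appended to the pattern.
function linkNamesFixed(body, users) {
  users.forEach(function (u) {
    const iname = new RegExp('@' + u.screen_name + '\\b', 'gi');
    const link = '<a href="http://twitter.com/' + u.screen_name + '">' +
                 '@' + u.screen_name + '</a>';
    body = body.replace(iname, link);
  });
  return body;
}

// Invented tweet text that mentions both accounts.
const fixed = linkNamesFixed('Tips from @bbedit_hints for @bbedit users', users);
console.log(fixed);
// Tips from <a href="http://twitter.com/bbedit_hints">@bbedit_hints</a> for <a href="http://twitter.com/bbedit">@bbedit</a> users
```

The first pass now skips @bbedit_hints, because the underscore after @bbedit is not a word boundary, and the second pass finds the longer name intact.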
The bitter irony of this is that the tweet that pointed out the bug was part of a discussion of the regular expression chapter of the BBEdit manual—the manual that was my introduction to the wonderful world of regular expressions back in 1995 or so. My mistakes are not a reflection on the quality of the manual.
[1] Order is important. If @bbedit_hints had come before @bbedit in the user_mentions entity, the bug would have remained hidden.