ASCIIfying

I’ve been adding more automation to my static blog publishing workflow.1 The scripts themselves are of no use to anyone else, but some bits and pieces may be of wider interest. For example, this morning I wrote a script using a library that converts Unicode strings to their nearest ASCII equivalent.

The script, written to be used as a Text Filter in BBEdit, automates the generation of header lines in the Markdown source code of a post. The header of this post, for example, looks like this:

Title: ASCIIfying
Keywords: python, programming
Date: 2014-10-19 22:10:00
Slug: asciifying


I write the Title and Keyword lines as I start the post, using a simple BBEdit Clipping. But before I publish, I need the other lines. The Date is easy to generate using the datetime library. That’s also the library I use to generate the year and month portions of the Link URL. The tricky thing is automating the creation of the Slug, which also shows up in the Link.

Oh, it’s very easy to make a slug when the title is as simple as this one, but suppose we started with this:

Title: Çingleton/Montréal isn't done
Keywords: test


Non-ASCII characters are allowed in URLs, but they can be troublesome, and I prefer to avoid them. Also, we can’t have the slash in there, and the apostrophe ought to go, too. Finally, I don’t want any spaces, because they cause nothing but trouble in the file system, and I hate seeing %20 in a URL.

The function I settled on is this:

python:
1:  def slugify(u):
2:    "Convert Unicode string into blog slug."
3:    u = re.sub(u'[–—/:;,.]', '-', u)  # replace separating punctuation
4:    a = unidecode(u).lower()          # best ASCII substitutions, lowercased
5:    a = re.sub(r'[^a-z0-9 -]', '', a) # delete any other characters
6:    a = a.replace(' ', '-')           # spaces to hyphens
7:    a = re.sub(r'-+', '-', a)         # condense repeated hyphens
8:    return a


All of the lines are straightforward and obvious except the unidecode call in Line 4. That is the one function exported by the unidecode library, and it does the substitutions that make slugify generate strings that are much more useful than anything I could write with the standard encode and decode methods. My script turns that two-line header above into

Title: Çingleton/Montréal isn't done
Keywords: test
Date: 2014-10-19 21:31:22
Slug: cingleton-montreal-isnt-done

The unidecode library is a Python port of a Perl module, and its documentation is sparse. If you want to know what it does and why it does it, go to Sean Burke’s writeup of his original Perl module, Text::Unidecode. It lays out his goals for the module, explains its limitations, and includes little gems like this:
If you ever need to ASCIIfy some text, Text::Unidecode or one of its ports (here’s one for Ruby) will come in handy.