Slugify (slight return)

Earlier this year, I had some trouble publishing one of my posts. I think it was this one, and the problem was caused by the parentheses in the title. The code I’d written long ago to turn a title into the slug used in the URL wasn’t as robust as I thought it was. At the time, I made a quick change by hand to get the post published and made a note to myself to fix the code. Today I did. Twice.

The word slug was apparently taken from the newspaper business and is defined this way:

A slug is a few words that describe a post or a page. Slugs are usually a URL friendly version of the post title.

The URLs to individual posts here look like this:

https://leancrew.com/all-this/2023/08/slugify-slight-return/

which is the domain, a subdirectory, the year and month, and then the slug, which is based on the title. It’s supposed to be lower case, with all the punctuation stripped and all word separators turned into hyphens. Some people prefer underscores, but I like dashes.
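Just to make that structure concrete, here’s a rough sketch of how a URL like that could be assembled from a slug and a publication date. The post_url function and its hard-coded pieces are purely illustrative, not my actual publishing code, and the day in the date is just a placeholder.

python:
1:  from datetime import date
2:  
3:  def post_url(slug, published):
4:    '''Assemble a post URL from a slug and a date (illustration only)'''
5:    return f'https://leancrew.com/all-this/{published.year}/{published.month:02d}/{slug}/'
6:  
7:  print(post_url('slugify-slight-return', date(2023, 8, 1)))

prints the URL above.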

I’ve had a slugify function in my blog publishing system for ages. In a long-ago post, I wrote about this early version of it:

python:
1:  def slugify(u):
2:    "Convert Unicode string into blog slug."
3:    u = re.sub(u'[–—/:;,.]', '-', u)  # replace separating punctuation
4:    a = unidecode(u).lower()          # best ASCII substitutions, lowercased
5:    a = re.sub(r'[^a-z0-9 -]', '', a) # delete any other characters
6:    a = a.replace(' ', '-')           # spaces to hyphens
7:    a = re.sub(r'-+', '-', a)         # condense repeated hyphens
8:    return a

This was written in Python 2. It had been updated to Python 3 and improved in the intervening years, but it was obviously still not bulletproof. Here’s the version I came up with this morning, including the necessary imports:

python:
 1:  import re
 2:  from unicodedata import normalize
 3:  
 4:  def slugify(text):
 5:    '''Make an ASCII slug of text'''
 6:    
 7:    # Make lower case and delete apostrophes from contractions
 8:    slug = re.sub(r"(\w)['’](\w)", r"\1\2", text.lower())
 9:    
10:    # Convert runs of non-word characters and underscores to single hyphens, stripping from ends
11:    slug = re.sub(r'[\W_]+', '-', slug).strip('-')
12:    
13:    # Replace a few special characters that normalize doesn't handle
14:    specials = {'æ':'ae', 'ß':'ss', 'ø':'o'}
15:    for s, r in specials.items():
16:        slug = slug.replace(s, r)
17:    
18:    # Normalize the non-ASCII text
19:    slug = normalize('NFKD', slug).encode('ascii', 'ignore').decode()
20:    
21:    # Return the transformed string
22:    return slug

This will turn

Parabolic mirrors made simple(r)

into

parabolic-mirrors-made-simple-r

which is what I want. A more complicated string, including non-ASCII characters,

Hél_lo—yøü don’t wånt “25–30%,” do you?

will be converted to

hel-lo-you-dont-want-25-30-do-you

which would also work well as a slug.
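A quick way to check both of those at once, assuming the slugify function above is in scope:

python:
1:  print(slugify('Parabolic mirrors made simple(r)'))
2:  # parabolic-mirrors-made-simple-r
3:  print(slugify('Hél_lo—yøü don’t wånt “25–30%,” do you?'))
4:  # hel-lo-you-dont-want-25-30-do-you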

Line 19, which uses the normalize function from the unicodedata module followed by encode('ascii', 'ignore'), is far from perfect or complete, but it converts most accented letters into reasonable ASCII. Line 19 ends with decode to turn what would otherwise be a bytes object into a string.
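Here’s a quick look at what that combination does with a handful of non-ASCII letters:

python:
1:  from unicodedata import normalize
2:  
3:  for c in 'éüñøæß':
4:    print(c, normalize('NFKD', c).encode('ascii', 'ignore').decode())

The first three decompose into an ASCII letter plus a combining accent, so they come through as e, u, and n. The last three have no decomposition, so the ASCII encoding throws them away entirely.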

You’ll note that Lines 14–16 handle the conversion of a few special characters: æ, ß, and ø. I learned by running tests that those are some of the letters the normalize/decode system doesn’t convert to reasonable ASCII. Even though I couldn’t imagine myself using any of these letters, or any of the myriad other letters that normalize/decode doesn’t handle, it bothered me that I was rewriting slugify yet again and still didn’t have a way of handling lots of non-ASCII characters.

I decided it was time to swallow my pride and look for a slugifying function written by someone who was willing to put in the time to do a complete job.

The answer was the aptly named python-slugify module by Val Neekman, which has its own slugify function with many optional parameters. I learned that the defaults work for me. This code

python:
1:  from slugify import slugify
2:  
3:  print(slugify("Hél_lo—yøü don’t wånt “25–30%,” do you, Mr. Encyclopædia?"))

returns

hel-lo-you-dont-want-25-30-do-you-mr-encyclopaedia

which is just what I want.
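If you don’t like the defaults, there are optional parameters for adjusting the output. For example, as I understand the separator parameter, this

python:
1:  from slugify import slugify
2:  
3:  print(slugify('Slugify (slight return)', separator='_'))

prints slugify_slight_return. There are also options like max_length and stopwords, but I haven’t had any reason to use them.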

A lot of this slugify’s power comes from its use of Tomaž Šolc’s unidecode module, which does the conversion to ASCII in a way that’s much more complete than the normalize/decode method.
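Here’s a quick taste of unidecode on its own, assuming it’s installed:

python:
1:  from unidecode import unidecode
2:  
3:  print(unidecode('Tomaž Šolc'))    # Tomaz Solc
4:  print(unidecode('Encyclopædia'))  # Encyclopaedia
5:  print(unidecode('søster'))        # soster

Note that it handles the æ and ø that tripped up the normalize/decode approach.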

So now my publishing code doesn’t have its own slugify function; it just imports Val Neekman’s and calls it. Kind of anticlimactic, but it works better.

One more nice thing about the slugify module. When you install it—which I did via conda install python-slugify because I use Anaconda to manage Python and its libraries—it comes with a command-line program also called slugify, which lets you test things out in the Terminal. You don’t even have to wrap the string you want to slugify in quotes:

slugify Hél_lo—yøü don’t wånt “25–30%,” do you, Mr. Encyclopædia?

returns

hel-lo-you-dont-want-25-30-do-you-mr-encyclopaedia

Note that if the string you’re converting includes characters that are special to the shell, you will have to wrap it in single quotes.

slugify '$PATH'

returns

path

but

slugify $PATH   

returns a very long string that you probably don’t want in your URL.