Slugify (slight return)
August 22, 2023 at 2:33 PM by Dr. Drang
Earlier this year, I had some trouble publishing one of my posts. I think it was this one, and the problem was caused by the parentheses in the title. The code I’d written long ago to turn a title into the slug used in the URL wasn’t as robust as I thought it was. At the time, I made a quick change by hand to get the post published and made a note to myself to fix the code. Today I did. Twice.
The word slug was apparently taken from the newspaper business and is defined this way:
A slug is a few words that describe a post or a page. Slugs are usually a URL friendly version of the post title.
The URLs to individual posts here look like this:
https://leancrew.com/all-this/2023/08/slugify-slight-return/
which is the domain, a subdirectory, the year and month, and then the slug, which is based on the title. It’s supposed to be lower case, with all the punctuation stripped and all word separators turned into hyphens. Some people prefer underscores, but I like dashes.
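As a rough illustration of how those pieces fit together, here is a minimal sketch that assembles a post URL from a slug and a publication date. The `post_url` function and the `BASE` constant are names made up for this example; they aren't taken from the publishing system.

```python
from datetime import date

# Domain plus subdirectory; hard-coded here purely for illustration
BASE = "https://leancrew.com/all-this"

def post_url(slug, published):
    """Join base, four-digit year, two-digit month, and slug into a URL."""
    return f"{BASE}/{published:%Y}/{published:%m}/{slug}/"

print(post_url("slugify-slight-return", date(2023, 8, 22)))
# https://leancrew.com/all-this/2023/08/slugify-slight-return/
```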
I’ve had a slugify function in my blog publishing system for ages. In a long-ago post, I wrote about this early version of it:
python:
def slugify(u):
    "Convert Unicode string into blog slug."
    u = re.sub(u'[–—/:;,.]', '-', u)   # replace separating punctuation
    a = unidecode(u).lower()           # best ASCII substitutions, lowercased
    a = re.sub(r'[^a-z0-9 -]', '', a)  # delete any other characters
    a = a.replace(' ', '-')            # spaces to hyphens
    a = re.sub(r'-+', '-', a)          # condense repeated hyphens
    return a
This was written in Python 2. It had been updated to Python 3 and improved in the intervening years, but it was obviously still not bulletproof. Here’s the version I came up with this morning, including the necessary imports:
python:
 1: import re
 2: from unicodedata import normalize
 3:
 4: def slugify(text):
 5:     '''Make an ASCII slug of text'''
 6:
 7:     # Make lower case and delete apostrophes from contractions
 8:     slug = re.sub(r"(\w)['’](\w)", r"\1\2", text.lower())
 9:
10:     # Convert runs of non-word characters and underscores to single hyphens, stripping from ends
11:     slug = re.sub(r'[\W_]+', '-', slug).strip('-')
12:
13:     # Replace a few special characters that normalize doesn't handle
14:     specials = {'æ':'ae', 'ß':'ss', 'ø':'o'}
15:     for s, r in specials.items():
16:         slug = slug.replace(s, r)
17:
18:     # Normalize the non-ASCII text
19:     slug = normalize('NFKD', slug).encode('ascii', 'ignore').decode()
20:
21:     # Return the transformed string
22:     return slug
This will turn
Parabolic mirrors made simple(r)
into
parabolic-mirrors-made-simple-r
which is what I want. A more complicated string, including non-ASCII characters,
Hél_lo—yøü don’t wånt “25–30%,” do you?
will be converted to
hel-lo-you-dont-want-25-30-do-you
which would also work well as a slug.
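To see what the two regular expression substitutions do on their own, here is a small sketch that applies just those steps to the second test string and prints the intermediate results. The names `step1` and `step2` are only for illustration. Because `\w` matches Unicode letters in Python 3, the accented characters survive this far and are left for the later steps.

```python
import re

text = "Hél_lo—yøü don’t wånt “25–30%,” do you?"

# Lowercase, then drop apostrophes that sit between two word characters
step1 = re.sub(r"(\w)['’](\w)", r"\1\2", text.lower())
print(step1)
# hél_lo—yøü dont wånt “25–30%,” do you?

# Collapse runs of non-word characters and underscores into single hyphens,
# then trim any hyphens left at the ends
step2 = re.sub(r'[\W_]+', '-', step1).strip('-')
print(step2)
# hél-lo-yøü-dont-wånt-25-30-do-you
```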
Line 19, which uses the normalize function from the unicodedata module followed by encode('ascii', 'ignore'), is far from perfect or complete, but it converts most accented letters into reasonable ASCII. Line 19 ends with decode to turn what would otherwise be a bytes object into a string.
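A short sketch of what happens to a single accented letter may make that chain clearer. The variable names here are just for the example; the calls are the same ones used in Line 19, plus `unicodedata.name` to label the pieces.

```python
from unicodedata import normalize, name

# NFKD splits a precomposed letter into a base letter plus a combining mark
decomposed = normalize('NFKD', '\u00e9')   # é
print([name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# Encoding to ASCII with 'ignore' silently drops the combining mark
as_bytes = decomposed.encode('ascii', 'ignore')
print(as_bytes)
# b'e'

# decode() turns the bytes object back into an ordinary string
print(as_bytes.decode())
# e
```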
You’ll note that Lines 14–16 handle the conversion of a few special characters: æ, ß, and ø. I learned by running tests that those are some of the letters the normalize/decode system doesn’t convert to reasonable ASCII. Even though I couldn’t imagine myself using any of these letters (or any of the myriad other letters that don’t get converted by normalize/decode), it bothered me that I was rewriting slugify yet again and still didn’t have a way of handling lots of non-ASCII characters.
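The kind of test that turns up these letters can be sketched in a few lines. The helper names `to_ascii` and `dropped_letters` are invented for this example; the idea is simply to flag any letter that the normalize/encode/decode chain turns into an empty string because it has no decomposition into an ASCII base letter.

```python
from unicodedata import normalize

def to_ascii(text):
    """Apply the same normalize/encode/decode chain used in slugify."""
    return normalize('NFKD', text).encode('ascii', 'ignore').decode()

def dropped_letters(text):
    """Return the letters in text that to_ascii() deletes outright."""
    return [c for c in text if c.isalpha() and to_ascii(c) == '']

print(dropped_letters('æßøéüå'))
# ['æ', 'ß', 'ø']
```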
I decided it was time to swallow my pride and look for a slugifying function written by someone who was willing to put in the time to do a complete job.
The answer was the aptly named python-slugify module by AvidCoderr, which has its own slugify function with many optional parameters. I learned that the defaults work for me. This code
python:
from slugify import slugify

print(slugify("Hél_lo—yøü don’t wånt “25–30%,” do you, Mr. Encyclopædia?"))
returns
hel-lo-you-dont-want-25-30-do-you-mr-encyclopaedia
which is just what I want.
A lot of this slugify’s power comes from its use of Tomaž Šolc’s unidecode module, which does the conversion to ASCII in a way that’s much more complete than the normalize/decode method.
So now my publishing code doesn’t have its own slugify function; it just imports AvidCoderr’s and calls it. Kind of anticlimactic, but it works better.
One more nice thing about the slugify module. When you install it (which I did via conda install python-slugify, because I use Anaconda to manage Python and its libraries), it comes with a command-line program also called slugify, which lets you test things out in the Terminal. You don’t even have to wrap the string you want to slugify in quotes:
slugify Hél_lo—yøü don’t wånt “25–30%,” do you, Mr. Encyclopædia?
returns
hel-lo-you-dont-want-25-30-do-you-mr-encyclopaedia
Note that if the string you’re converting includes characters that are special to the shell, you will have to wrap it in single quotes.
slugify '$PATH'
returns
path
but
slugify $PATH
returns a very long string that you probably don’t want in your URL.
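If the command is ever built inside a script rather than typed by hand, the standard library’s `shlex.quote` can add the single quotes for you. This is only a sketch of that idea; the `slugify_command` helper is a hypothetical name and isn’t part of python-slugify.

```python
import shlex

def slugify_command(title):
    """Build a shell command line that hands title to slugify unchanged."""
    return "slugify " + shlex.quote(title)

print(slugify_command("$PATH"))
# slugify '$PATH'

print(slugify_command("Slugify (slight return)"))
# slugify 'Slugify (slight return)'
```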