Slugify (slight return)
August 22, 2023 at 2:33 PM by Dr. Drang
Earlier this year, I had some trouble publishing one of my posts. I think it was this one, and the problem was caused by the parentheses in the title. The code I’d written long ago to turn a title into the slug used in the URL wasn’t as robust as I thought it was. At the time, I made a quick change by hand to get the post published and made a note to myself to fix the code. Today I did. Twice.
The word slug was apparently taken from the newspaper business and is defined this way:
A slug is a few words that describe a post or a page. Slugs are usually a URL friendly version of the post title.
The URLs to individual posts here look like this:
https://leancrew.com/all-this/2023/08/slugify-slight-return/
which is the domain, a subdirectory, the year and month, and then the slug, which is based on the title. It’s supposed to be lower case, with all the punctuation stripped and all word separators turned into hyphens. Some people prefer underscores, but I like dashes.
I’ve had a slugify function in my blog publishing system for ages. In a long-ago post, I wrote about this early version of it:
python:
 1: def slugify(u):
 2:     "Convert Unicode string into blog slug."
 3:     u = re.sub(u'[–—/:;,.]', '-', u)    # replace separating punctuation
 4:     a = unidecode(u).lower()            # best ASCII substitutions, lowercased
 5:     a = re.sub(r'[^a-z0-9 -]', '', a)   # delete any other characters
 6:     a = a.replace(' ', '-')             # spaces to hyphens
 7:     a = re.sub(r'-+', '-', a)           # condense repeated hyphens
 8:     return a
This was written in Python 2. It had been updated to Python 3 and improved in the intervening years, but it was obviously still not bulletproof. Here’s the version I came up with this morning, including the necessary imports:
python:
 1: import re
 2: from unicodedata import normalize
 3:
 4: def slugify(text):
 5:     '''Make an ASCII slug of text'''
 6:
 7:     # Make lower case and delete apostrophes from contractions
 8:     slug = re.sub(r"(\w)['’](\w)", r"\1\2", text.lower())
 9:
10:     # Convert runs of non-word characters to single hyphens, stripping them from the ends
11:     slug = re.sub(r'[\W_]+', '-', slug).strip('-')
12:
13:     # Replace a few special characters that normalize doesn’t handle
14:     specials = {'æ':'ae', 'ß':'ss', 'ø':'o'}
15:     for s, r in specials.items():
16:         slug = slug.replace(s, r)
17:
18:     # Normalize the non-ASCII text
19:     slug = normalize('NFKD', slug).encode('ascii', 'ignore').decode()
20:
21:     # Return the transformed string
22:     return slug
This will turn
Parabolic mirrors made simple(r)
into
parabolic-mirrors-made-simple-r
which is what I want. A more complicated string, including non-ASCII characters,
Hél_lo—yøü don’t wånt “25–30%,” do you?
will be converted to
hel-lo-you-dont-want-25-30-do-you
which would also work well as a slug.
Line 19, which uses the normalize function from the unicodedata module followed by encode('ascii', 'ignore'), is far from perfect or complete, but it converts most accented letters into reasonable ASCII. Line 19 ends with decode to turn what would otherwise be a bytes object into a string.
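Here’s a quick illustration of what that normalize/encode/decode chain does (a minimal example of my own, not part of the publishing code):
python:
from unicodedata import normalize

s = 'héllo wörld'
# NFKD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks
print(normalize('NFKD', s).encode('ascii', 'ignore').decode())  # hello world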
You’ll note that Lines 14–16 handle the conversion of a few special characters: æ, ß, and ø. I learned by running tests that those are some of the letters the normalize/decode system doesn’t convert to reasonable ASCII. Even though I couldn’t imagine myself using any of these letters—or any of the myriad other letters that don’t get converted by normalize/decode—it bothered me that I was rewriting slugify yet again and still didn’t have a way of handling lots of non-ASCII characters.
I decided it was time to swallow my pride and look for a slugifying function written by someone who was willing to put in the time to do a complete job.
The answer was the aptly named python-slugify module by Val Neekman, which has its own slugify function with many optional parameters. I learned that the defaults work for me. This code
python:
1: from slugify import slugify
2:
3: print(slugify("Hél_lo—yøü don’t wånt “25–30%,” do you, Mr. Encyclopædia?"))
returns
hel-lo-you-dont-want-25-30-do-you-mr-encyclopaedia
which is just what I want.
A lot of this slugify’s power comes from its use of Tomaž Šolc’s unidecode module, which does the conversion to ASCII in a way that’s much more complete than the normalize/decode method.
So now my publishing code doesn’t have its own slugify function; it just imports Val Neekman’s and calls it. Kind of anticlimactic, but it works better.
One more nice thing about the slugify module. When you install it—which I did via conda install python-slugify because I use Anaconda to manage Python and its libraries—it comes with a command-line program also called slugify, which lets you test things out in the Terminal. You don’t even have to wrap the string you want to slugify in quotes:
slugify Hél_lo—yøü don’t wånt “25–30%,” do you, Mr. Encyclopædia?
returns
hel-lo-you-dont-want-25-30-do-you-mr-encyclopaedia
Note that if the string you’re converting includes characters that are special to the shell, you will have to wrap it in single quotes.
slugify '$PATH'
returns
path
but
slugify $PATH
returns a very long string that you probably don’t want in your URL.
Reducing the size of PNGs with Keyboard Maestro, AppleScript, and ImageOptim
August 20, 2023 at 1:16 PM by Dr. Drang
For a long time, I’ve been using ImageOptim to reduce the size of PNG files I use here on the blog. The SnapClip and SnapSCP macros I use for taking most of my screenshots run ImageOptim automatically, but when I need to annotate or otherwise edit a screenshot, I have to run ImageOptim manually on the final version of the image. Until recently I’ve been doing this by selecting the file and control-clicking on it to open it in ImageOptim.
This is relatively quick, but I do have to make sure I hit ImageOptim in the long menu of apps—easy to do when I’m sitting up at a desk but less so when I’m lying on a bed or a couch. I decided to turn the operation into a Keyboard Maestro macro. I still have to start by selecting the file(s) I want to optimize, but I no longer have to aim at a menu item.
The macro is called Optimize PNG, and here’s a screenshot of it:
If you download it and import it into Keyboard Maestro as is, it will appear in the Finder group and will be active. You can run it when you have one or more PNG files selected in the Finder.
The macro has one step, which is this AppleScript:
applescript:
 1: -- Set text item delimiters for extracting the extension
 2: set text item delimiters to "."
 3:
 4: -- Set the path to the ImageOptim command line executable
 5: set io to "/Applications/ImageOptim.app/Contents/MacOS/ImageOptim"
 6:
 7: -- Run ImageOptim on each selected file whose extension is png or PNG
 8: tell application "Finder"
 9:     set imageFiles to selection
10:     repeat with imageFile in imageFiles
11:         set filePath to POSIX path of (imageFile as alias)
12:         set fileExtension to last text item of filePath
13:         if fileExtension is "png" or fileExtension is "PNG" then
14:             do shell script (io & " " & quoted form of filePath)
15:         end if
16:     end repeat
17: end tell
18:
19: do shell script "afplay /System/Library/Sounds/Glass.aiff"
Basically, the script loops through all the selected files and runs ImageOptim on them. There’s some logic in there that makes sure¹ that ImageOptim is run only on PNG files, and a conversion from an AppleScript file description to a Unix-style file path. The command that gets run on every PNG file is
/Applications/ImageOptim.app/Contents/MacOS/ImageOptim '/path/to/image file.png'
The file path is quoted (Line 14) to ensure that spaces are handled correctly.
When the optimizing is done, the Glass sound plays (Line 19) to let me know the files are ready.
The name of the macro is “Optimize PNG,” but I use ⌃⌥⌘I as the trigger because I think of it as opening ImageOptim, even though ImageOptim never shows itself except briefly in the Dock.
¹ It’s certainly not a foolproof way of “making sure.” All it does is get the file extension (Lines 2 and 12) and check whether it’s “png” or “PNG” (Line 13). That’s good enough for me, but if you’re the kind of person who saves files with misleading extensions (or no extension at all), you’ll have to come up with a better way of distinguishing PNG files. Also, you should rethink your life choices. ↩
Sleeping Beauty is in the eye of the beholder
August 19, 2023 at 1:44 PM by Dr. Drang
A couple of days ago, Numberphile posted another Tom Crawford video in which he presents an interesting problem and explains it in an unnecessarily complicated way. This time, it’s the Sleeping Beauty problem.
Here’s the problem as posed in the Wikipedia article:
Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice, during the experiment, Sleeping Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening. A fair coin will be tossed to determine which experimental procedure to undertake:
- If the coin comes up heads, Sleeping Beauty will be awakened and interviewed on Monday only.
- If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday.
In either case, she will be awakened on Wednesday without interview and the experiment ends.
Any time Sleeping Beauty is awakened and interviewed she will not be able to tell which day it is or whether she has been awakened before. During the interview Sleeping Beauty is asked: “What is your degree of belief¹ now for the proposition that the coin landed heads?”
I don’t understand why the problem is typically described with Sleeping Beauty being given a drug to put her to sleep. Surely it would be more appropriate for it to be a magic spell.
The first thing I don’t like about Tom’s presentation is how he poses the question asked of Sleeping Beauty: What is the probability that the coin was a head?
Asking about the probability instead of the degree of belief suggests an objectivity that shouldn’t be there. “What is the probability” connotes a sort of omniscience that doesn’t belong in the question. That’s certainly one of the reasons Brady thinks at one point that the answer should be ½—a fair coin was flipped, and its probability of landing heads isn’t affected by any of the other bits of the story.
But when the question is posed in terms of degree of belief, and we remember that it’s Sleeping Beauty’s degree of belief each time she is awakened, we start thinking about the problem differently. This is what leads to the longish section in the middle of the video in which Tom goes through various assumptions and conditional probabilities to get to the “thirder” answer. And this is the part that I think can be made shorter and clearer.
First, let’s think about what degree of belief is. It is an expression of the odds that would be given in a fair wager. In this case, we recast the problem as Sleeping Beauty being offered a bet—heads or tails—by the experimenter each time she’s awakened. We can start by considering which way she should bet if she’s offered 1:1 odds and then move on to determining what odds would be fair to both her and the experimenter.
Because it’s a fair coin, half the time it will land on heads and there will be one wager. The other half of the time it will land on tails and there will be two wagers. If Sleeping Beauty bets on tails, she will, on average, lose one bet half the time and win two bets half the time. If we say the bet is $10, her expected return from betting on tails is
\[\frac{1}{2} (-\$10) + \frac{1}{2} (2 \times \$10) = \$5\]The experimenter would have to be an idiot to make this bet with even odds. The fair way is for the person who bets on tails to put up $20 and the person who bets on heads to put up $10. That way the expected return for the tails-bettor is
\[\frac{1}{2} (-\$20) + \frac{1}{2} (2 \times \$10) = \$0\]and the expected return for the heads-bettor is the same:
\[\frac{1}{2} (\$20) + \frac{1}{2} (2 \times (-\$10)) = \$0\]The 2:1 odds make the bet fair.
Because 2:1 odds is the same as “two out of three,” Sleeping Beauty’s degree of belief in tails is ⅔. Conversely, her degree of belief in heads is ⅓.
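If you’d rather check that arithmetic than take it on faith, here’s a small simulation sketch (my own, not part of the original argument) that pays the tails-bettor at the 2:1 stakes described above; her average return per experiment should hover near zero:
python:
from random import choice

runs = 100_000
tails_net = 0   # net winnings of the tails-bettor over all experiments

for _ in range(runs):
    if choice(['Heads', 'Tails']) == 'Heads':
        tails_net -= 20        # one awakening: she loses her $20 stake once
    else:
        tails_net += 2 * 10    # two awakenings: she wins the $10 stake twice

print(f'Average per-experiment return for tails: {tails_net / runs:+.2f}')
At even (1:1) stakes, the same loop with a $10 stake for tails would show a return of roughly +$5 per experiment, matching the calculation above.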
Note that it’s the disparity in the number of wagers (or questions, if we go back to the original problem statement) that makes the degrees of belief differ from ½. If we change the problem slightly and say that there will be one question, regardless of the outcome of the coin toss (if it’s tails we could do another coin toss to decide whether the question is asked on Monday or Tuesday), then there will be no disparity in wagers and even odds would be fair. It’s possible that this misinterpretation of the problem—that the question is asked once per experiment rather than once per awakening—is what leads some people to think that Sleeping Beauty’s degree of belief should be ½.
Another way for the degree of belief to be ½ would be if the wager is made not in the middle of the experiment, but either before it on Sunday or after it on Wednesday. In both of these cases, 1:1 odds would be fair.
We can also run simulations of the problem to give us insight into the answer. Here’s a short Python program that simulates both the one-question-per-awakening problem and the one-question-per-experiment problem:
python:
 1: #!/usr/bin/env python3
 2:
 3: from collections import defaultdict
 4: from random import choice
 5:
 6: # Set up the problem
 7: sides = 'Heads Tails'.split()
 8: days = 'Monday Tuesday'.split()
 9: qdays = {'Heads': ['Monday'], 'Tails': days}
10:
11: # Initialize the question matrix
12: q = defaultdict(int)
13:
14: # Run 10,000 experiments assuming the question is asked every day
15: for f in range(10000):
16:     flip = choice(sides)
17:     for day in qdays[flip]:
18:         q[(flip, day)] += 1
19:
20: # Show the results
21: print('Question asked every awakening')
22: for s in sides:
23:     for d in days:
24:         print(f'{s} and {d}: {q[(s, d)]}')
25:
26: print()
27:
28: # Reinitialize the question matrix
29: q = defaultdict(int)
30:
31: # Run 10,000 experiments assuming the question is asked once per experiment
32: for f in range(10000):
33:     flip = choice(sides)
34:     day = choice(qdays[flip])
35:     q[(flip, day)] += 1
36:
37: # Show the results
38: print('Question asked once per experiment')
39: for s in sides:
40:     for d in days:
41:         print(f'{s} and {d}: {q[(s, d)]}')
In both cases, the q dictionary is being used to keep track of questions. The keys of q are tuples of the (initial) coin toss and the day, e.g., ('Tails', 'Monday'), and the values of q are the number of questions asked for each of those condition pairs. I’m using a defaultdict for q to avoid having to initialize it, and the choice function from the random module to simulate the coin flips.
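In case those two pieces are unfamiliar, here’s a tiny demonstration (mine, not from the script above) of how they behave:
python:
from collections import defaultdict
from random import choice

q = defaultdict(int)
q[('Tails', 'Monday')] += 1        # missing keys start at 0, so no initialization needed
print(q[('Heads', 'Tuesday')])     # 0
print(choice(['Heads', 'Tails']))  # a fair "coin flip"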
Because the program uses random numbers and doesn’t specify a seed, it will give slightly different answers every time it’s run. Here’s the answer from one run,
Question asked every awakening
Heads and Monday: 4969
Heads and Tuesday: 0
Tails and Monday: 5031
Tails and Tuesday: 5031
Question asked once per experiment
Heads and Monday: 4905
Heads and Tuesday: 0
Tails and Monday: 2572
Tails and Tuesday: 2523
which fits well with our previous answers.
Simulations like this can give you confidence in the solutions you’ve come up with by other means. If you haven’t come up with a solution by other means, a simulation can lead you to the correct line of reasoning. Of course, your simulation code has to match the setup of the problem, which is often the tricky bit.
As I was going through this problem, I couldn’t help but think about the Sleeping Beauty episode of Fractured Fairy Tales.
The depiction of Walt Disney as a con man is probably not as wildly obvious now as it was in the early 60s, but even if you don’t know that Daws Butler is recycling his Hokey Wolf/Sgt. Bilko voice or that Disneyland used to have lettered tickets for different attractions, you still get the point.
¹ The article actually uses credence instead of degree of belief, but I think the latter is easier to understand, especially for a character from the Middle Ages. ↩
Continued fractions in Python
August 16, 2023 at 10:40 AM by Dr. Drang
My last post ends with “one last thing” about continued fractions. That turned out to be a lie. After playing around a bit more, I decided I should have some functions that compute continued fractions in Python, so I looked around for continued fraction libraries. I found some, but none of them seemed quite right. So I decided to write my own.
Let’s review some notation and properties. A continued fraction is one in which the denominator contains a fraction, and the denominator of that fraction contains a fraction, and so on.
\[x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \ldots}}}\]This is considered the standard form for continued fractions, where the numerators are all ones. You can write out a continued fraction with other numbers as the numerators, but it can always be reduced to this form.
If \(x\) is a rational number, then the continued fraction has a finite number of terms and will end with a \(1/a_n\) term. If \(n=4\), for example, the fraction will look like this:
\[x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{a_4}}}}\]If \(x\) is irrational, the continued fraction has an infinite number of terms, although terms may repeat. Famously, the golden ratio goes on forever and all the terms are one:
\[\phi = 1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \ldots}}}\]A less explicit but far more compact way to display a continued fraction is to just show the \(a\) terms as a bracketed list:
\[x = [a_0; a_1, a_2, a_3, \ldots ]\]It’s common to use a semicolon to separate the \(a_0\) term from the others. Mathematica doesn’t do that because it’s more convenient to just use a list. As we saw in the last post, the first five terms of the continued fraction for \(\pi\) are
In[1]:= ContinuedFraction[Pi, 5]
Out[1]= {3, 7, 15, 1, 292}
where Mathematica uses braces to surround its lists. We’ll use this same idea in Python, where the lists are bracketed.
A segment of a continued fraction, \(s_k\), is a finite continued fraction consisting of the first \(k+1\) terms of \(x\):
\[s_k = [a_0; a_1, a_2, \ldots, a_k]\]A remainder, \(r_k\), is all the terms starting with the \(k^{th}\) and continuing on, whether the continued fraction is finite or infinite:
\[r_k = [a_k; a_{k+1}, a_{k+2}, \ldots ]\]So any continued fraction can be broken into a segment, \(s_{k-1}\), and a remainder, \(r_k\).
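For example, using the first few terms of the continued fraction of \(\pi\) shown above, we can split it after the second term into the segment \(s_1\) and the remainder \(r_2\):
\[s_1 = [3; 7] = \frac{22}{7}, \qquad r_2 = [15; 1, 292, \ldots]\]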
A convergent is the rational number corresponding to a segment. Convergents are what we use to get rational approximations of numbers. In the last post, we did this
In[2]:= Convergents[ContinuedFraction[Pi, 5]]
Out[2]= {3, 22/7, 333/106, 355/113, 103993/33102}
to see why \(22/7\) and \(355/113\) are good rational approximations of \(\pi\).
An interesting property of convergents is that those from even-indexed segments—\(s_0\), \(s_2\), and so on—bound \(x\) from below, and those from odd-indexed segments—\(s_1\), \(s_3\), and so on—bound \(x\) from above. (If \(x\) is rational, then there is a final convergent which is equal to \(x\), regardless of whether it’s even or odd.) The even convergents form an increasing sequence; the odd convergents form a decreasing sequence.
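A quick way to see this alternation is to compare the convergents of \(\pi\) from above with math.pi (a little check of my own, not from the original analysis):
python:
import math
from fractions import Fraction

convergents = [Fraction(3, 1), Fraction(22, 7), Fraction(333, 106),
               Fraction(355, 113), Fraction(103993, 33102)]

# Even-indexed convergents come out below pi, odd-indexed ones above it
for k, c in enumerate(convergents):
    side = 'below' if c < math.pi else 'above'
    print(f's_{k} = {c} bounds pi from {side}')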
OK, if you want more you can go to the Wikipedia page or get a copy of Khinchin’s book¹. Let’s move on to the code.
The function I wrote, continued, returns a tuple of
- the continued fraction of the argument as a list of integers; and
- the convergents of the argument as a list of Fractions.
Fractions are a Python type supplied by the fractions library.
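If you haven’t used them before, Fractions are exact rational numbers; a couple of lines in the interpreter show the idea (a minimal example, not taken from cfractions.py):
python:
from fractions import Fraction

x = Fraction(22, 7)
print(x + Fraction(1, 3))   # 73/21, computed exactly
print(float(x))             # convert to a float when needed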
continued is the only function in cfractions.py, a file I’ve saved in my site-packages directory. This makes it easy to import when I’m working in Jupyter:
In [1]: import math
In [2]: from cfractions import continued
In [3]: continued(math.pi)
Out[3]:
([3, 7, 15, 1, 292],
[Fraction(3, 1),
Fraction(22, 7),
Fraction(333, 106),
Fraction(355, 113),
Fraction(103993, 33102)])
Here’s the code:
python:
 1: from fractions import Fraction
 2: from math import isclose
 3:
 4: def continued(x, terms=20, rel_tol=1e-9, abs_tol=0.0):
 5:     'Return the continued fraction and convergents of the argument.'
 6:     # Initialize, using Khinchin's notation
 7:     a = []         # continued fraction terms
 8:     p = [0, 1]     # convergent numerator terms (-2 and -1 indices)
 9:     q = [1, 0]     # convergent denominator terms (-2 and -1 indices)
10:     s = []         # convergent terms
11:     remainder = x
12:
13:     # Collect the continued fraction and convergent terms
14:     for i in range(terms):
15:         # Compute the next terms
16:         whole, frac = divmod(remainder, 1)
17:         an = int(whole)
18:         pn = an*p[-1] + p[-2]
19:         qn = an*q[-1] + q[-2]
20:         sn = Fraction(pn, qn)
21:
22:         # Add terms to lists
23:         a.append(an)
24:         p.append(pn)
25:         q.append(qn)
26:         s.append(Fraction(sn))
27:
28:         # Convergence check
29:         if isclose(x, float(sn), rel_tol=rel_tol, abs_tol=abs_tol):
30:             break
31:
32:         # Get ready for next iteration
33:         remainder = 1/frac
34:
35:     # Return the tuple of the continued fraction and the convergents
36:     return(a, s)
The terms of the continued fraction are calculated using a form of Euclid’s algorithm for finding the greatest common divisor (GCD) of two numbers. The numerators and denominators of the convergents are calculated using the recurrence relations,
\[p_k = a_k p_{k-1} + p_{k-2}\] \[q_k = a_k q_{k-1} + q_{k-2}\]You may be wondering why I’m calculating both the continued fraction and the convergents in the same function instead of doing them separately as Mathematica does. Two reasons:
- First, I typically want both the continued fraction and its convergents, so there’s no point in forcing me to make two function calls in the usual case.
- Second, I want to be able to set specific convergence criteria, and to do that I need to know how close I am to the input argument as new terms of the continued fraction are calculated. That means I need to calculate the convergents along with the continued fraction terms.
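Incidentally, here’s a quick worked check of those recurrence relations, using the \(\pi\) terms from above and the starting values \(p_{-1} = 1\), \(p_{-2} = 0\), \(q_{-1} = 0\), \(q_{-2} = 1\) (the same values the code uses):
\[p_0 = 3 \cdot 1 + 0 = 3, \qquad q_0 = 3 \cdot 0 + 1 = 1\] \[p_1 = 7 \cdot 3 + 1 = 22, \qquad q_1 = 7 \cdot 1 + 0 = 7\] \[p_2 = 15 \cdot 22 + 3 = 333, \qquad q_2 = 15 \cdot 7 + 1 = 106\]which match the convergents \(3\), \(22/7\), and \(333/106\) listed earlier.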
The three optional parameters to the function, terms, rel_tol, and abs_tol, set the convergence criteria. terms is an upper bound on the number of continued fraction terms that will be calculated, no matter what the other tolerance values are. rel_tol and abs_tol are relative and absolute tolerance values that can stop the process before the terms limit is reached. Their names and default values are taken from the isclose function of the math library, which is used on Line 29. For example, we could set an absolute tolerance on our rational estimate of \(\pi\) this way:
In [4]: continued(math.pi, rel_tol=0, abs_tol=1e-12)
Out[4]:
([3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3],
[Fraction(3, 1),
Fraction(22, 7),
Fraction(333, 106),
Fraction(355, 113),
Fraction(103993, 33102),
Fraction(104348, 33215),
Fraction(208341, 66317),
Fraction(312689, 99532),
Fraction(833719, 265381),
Fraction(1146408, 364913),
Fraction(4272943, 1360120)])
We’ve hit our tolerance because
\[\pi - \frac{4272943}{1360120} \approx 4 \times 10^{-13}\]I like the idea of having control over the convergence criteria. Mathematica’s second argument to ContinuedFraction gives the equivalent of my terms parameter, but its precision control is, as far as I can tell, entirely internal—there’s no way for the user to set a tolerance.
On the other hand, a disadvantage of my function is that its precision is limited to that of Python floats, whereas Mathematica will give you as much precision as you ask for. I can’t, for example, ask for an abs_tol of \(1 \times 10^{-20}\) and expect to get a correct answer:
In [5]: continued(math.pi, rel_tol=0, abs_tol=1e-20)
Out[5]:
([3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 3],
[Fraction(3, 1),
Fraction(22, 7),
Fraction(333, 106),
Fraction(355, 113),
Fraction(103993, 33102),
Fraction(104348, 33215),
Fraction(208341, 66317),
Fraction(312689, 99532),
Fraction(833719, 265381),
Fraction(1146408, 364913),
Fraction(4272943, 1360120),
Fraction(5419351, 1725033),
Fraction(80143857, 25510582),
Fraction(245850922, 78256779)])
Mathematica will happily use as many digits as needed and do so correctly. It tells us that Python screwed up in the 14th term:
In[3]:= ContinuedFraction[Pi, 15]
Out[3]= {3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1}
I’m not especially concerned about this reduced precision, as I seldom want my continued fractions to go out that far. And when I do, I have Mathematica to fall back on.
Finally, a small advantage of doing continued fractions in Python instead of Mathematica is that Python uses zero-based indexing for lists, which is consistent with the standard notation given above. Mathematica uses one-based indexing, which usually works out nicely when dealing with vectors and matrices but not in this case.
Update 16 Aug 2023 10:53 PM
Shortly after this post was published Thomas J. Fan got in touch with me on Mastodon and told me that the SymPy library has continued fraction functions in the number theory sublibrary. Of course! I felt silly for not looking there.
He also included this code snippet, which I ran in Jupyter:
In [1]: from itertools import islice
In [2]: from sympy.core import pi
In [3]: from sympy.ntheory.continued_fraction import continued_fraction_iterator
In [4]: list(islice(continued_fraction_iterator(pi), 15))
Out[4]: [3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1]
So the function is not only prewritten, it’s more accurate than mine. Of course, the final line to get a list of terms is kind of convoluted, but it could easily be wrapped in a more compact function. I may give it a go.
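For what it’s worth, such a wrapper might look something like this sketch (my guess at what I’d write, assuming SymPy is installed and that you pass it SymPy’s symbolic numbers rather than floats):
python:
from itertools import islice
from sympy.ntheory.continued_fraction import continued_fraction_iterator

def cf(x, terms=15):
    'Return the first `terms` continued fraction terms of x as a list.'
    return list(islice(continued_fraction_iterator(x), terms))
Calling cf(pi, 15) with SymPy’s symbolic pi would then give the same fifteen terms as the islice expression above.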
Thanks, Thomas!
Update 18 Aug 2023 9:49 AM
I played around with SymPy’s continued fraction functions yesterday and have decided to stick with my function. As discussed above, the SymPy functions are more accurate than mine, but to get that accuracy you have to be working with the SymPy definitions of numbers like \(\pi\) and functions like sqrt. Since I’m usually working with the math package’s definitions, which are numeric rather than symbolic, I wouldn’t normally get the value out of the SymPy functions. Still, it’s good to know that they’re there.
¹ It’s a thin Dover paperback, so it’s pretty cheap. ↩