Wordle letters
January 8, 2022 at 4:47 PM by Dr. Drang
Like all internet hipsters, I started playing Wordle a few days before the New York Times article that introduced it to the great unwashed. I don’t think I’ll stick with it for very long—the universe of five-letter words seems like something that will wear thin soon—but I am interested in the strategy. So I did a little scripting.
Clearly, the idea is to identify as many letters in the target word as quickly as possible. Letter frequencies in English text famously follow the ETAOIN SHRDLU order, an ordering that was built into Linotype keyboards back in the days of hot metal type. But Wordle isn’t based on general English text; it’s based specifically on five-letter words. So we need the letter frequencies for that restricted set.
Mac and Linux computers carry on the Unix tradition of including a file, `/usr/share/dict/words`, that’s used for spell checking. It’s an alphabetical list with one word per line, which is very convenient for working out letter frequencies. But first, we’ll need to pull out just the five-letter words, leaving behind any proper nouns. That can be done with a simple Perl one-liner:[1]

    perl -nle 'print if /^[a-z]{5}$/' /usr/share/dict/words > words5.txt

The regular expression that is the backbone of this command matches only five-letter words with no capitals. After running this, we have a file, `words5.txt`, that contains just the words we need for Wordle. It has about 8,500 entries.
Now that we have a file with just five-letter words, we can compute the letter frequencies with this script:

     1:  #!/usr/bin/perl
     2:
     3:  while($word = <>){
     4:    chomp $word;
     5:    foreach (split //, $word){
     6:      $freq{$_}++;
     7:    }
     8:  }
     9:
    10:  foreach $letter (sort keys %freq){
    11:    print "$letter\t$freq{$letter}\n";
    12:  }

Lines 3–8 loop through the lines of the input file and build up a hash (or associative array) named `%freq` with letters as the keys and their counts as the values. Lines 10–12 then print out the hash in alphabetical order:
a 4467
b 1162
c 1546
d 1399
e 4255
f 661
g 1102
h 1323
i 2581
j 163
k 882
l 2368
m 1301
n 2214
o 2801
p 1293
q 84
r 3043
s 2383
t 2381
u 1881
v 466
w 685
x 189
y 1605
z 250
I could have included a sorting command to reorder the hash in frequency order, but it was easier to just copy the output, paste it into a spreadsheet, and do the reordering there. I also used the spreadsheet to sum the counts and present the frequencies as percentages.
Letter | Count | Frequency |
---|---|---|
a | 4467 | 10.5% |
e | 4255 | 10.0% |
r | 3043 | 7.2% |
o | 2801 | 6.6% |
i | 2581 | 6.1% |
s | 2383 | 5.6% |
t | 2381 | 5.6% |
l | 2368 | 5.6% |
n | 2214 | 5.2% |
u | 1881 | 4.4% |
y | 1605 | 3.8% |
c | 1546 | 3.6% |
d | 1399 | 3.3% |
h | 1323 | 3.1% |
m | 1301 | 3.1% |
p | 1293 | 3.0% |
b | 1162 | 2.7% |
g | 1102 | 2.6% |
k | 882 | 2.1% |
w | 685 | 1.6% |
f | 661 | 1.6% |
v | 466 | 1.1% |
z | 250 | 0.6% |
x | 189 | 0.4% |
j | 163 | 0.4% |
q | 84 | 0.2% |
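The sorting and percentage steps done in the spreadsheet could also be handled in Perl. What follows is a minimal sketch, not the script used to build the table above. The subroutine names `letter_counts` and `letter_table`, the alphabetical tie-breaking, and the three sample words are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every letter occurrence in a list of words.
# Returns a hash reference: letter => count.
sub letter_counts {
    my @words = @_;
    my %freq;
    for my $word (@words) {
        $freq{$_}++ for split //, $word;
    }
    return \%freq;
}

# Turn the counts into rows of [letter, count, percent], sorted by
# descending count, with ties broken alphabetically (an assumption).
sub letter_table {
    my ($freq) = @_;
    my $total = 0;
    $total += $_ for values %$freq;
    return map  { [ $_, $freq->{$_}, 100 * $freq->{$_} / $total ] }
           sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b }
           keys %$freq;
}

# Made-up sample words, used only when no file is named on the command line.
my @words = qw(apple eagle zebra);
if (@ARGV) {
    chomp(@words = <>);
}

# One tab-separated row per letter: letter, count, percentage.
printf "%s\t%d\t%.1f%%\n", @$_ for letter_table(letter_counts(@words));
```

Saved under a hypothetical name like `letterfreq.pl`, it would be run as `perl letterfreq.pl words5.txt`. The percentages are shares of all letters counted, matching the Frequency column above.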
Using `/usr/share/dict/words` as a source of words was convenient, as it was already on my computer, but I doubted it was the source of legal words in Wordle. Wouldn’t a word gamer use a Scrabble dictionary? A little searching led me to this page and this one. I copied the source code for each, opened them in BBEdit, and after a few search-and-replaces, had two new lists of five-letter words. They were nearly the same, differing by about 60 words out of 8,900. I merged the two and named the result `scrabble5.txt`.[2]
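The merge itself was done by hand in an editor, but the same step can be sketched in Perl. This is only an illustration under stated assumptions: `merge_lists` is a hypothetical helper, and the two short sample lists stand in for the downloaded word lists, which are not reproduced here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge two word lists. Returns two array references: the sorted union,
# and the sorted words that appear in only one of the two lists.
sub merge_lists {
    my ($first, $second) = @_;      # array references
    my %in_first  = map { $_ => 1 } @$first;
    my %in_second = map { $_ => 1 } @$second;
    my %union     = (%in_first, %in_second);
    my @only_one  = grep { !($in_first{$_} && $in_second{$_}) } keys %union;
    return ([sort keys %union], [sort @only_one]);
}

# Made-up sample lists standing in for the two downloaded files.
my @list_a = qw(aahed aalii abaca zebra);
my @list_b = qw(aalii abaca abaci zebra);

my ($merged, $differ) = merge_lists(\@list_a, \@list_b);
print "merged: @$merged\n";
print "in only one list: @$differ\n";
```

Building the union in a hash removes duplicates automatically, and the second list shows which words are in dispute between the two sources.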
Running the letter frequency script on this new file and doing the same sorting and percentage calculations as before gave me this list:
Letter | Count | Frequency |
---|---|---|
s | 4623 | 10.4% |
e | 4585 | 10.3% |
a | 3986 | 8.9% |
o | 2977 | 6.7% |
r | 2916 | 6.5% |
i | 2633 | 5.9% |
l | 2440 | 5.5% |
t | 2319 | 5.2% |
n | 2022 | 4.5% |
d | 1727 | 3.9% |
u | 1697 | 3.8% |
c | 1475 | 3.3% |
y | 1403 | 3.1% |
p | 1384 | 3.1% |
m | 1339 | 3.0% |
h | 1214 | 2.7% |
g | 1113 | 2.5% |
b | 1096 | 2.5% |
k | 949 | 2.1% |
f | 790 | 1.8% |
w | 690 | 1.5% |
v | 475 | 1.1% |
z | 249 | 0.6% |
x | 213 | 0.5% |
j | 186 | 0.4% |
q | 79 | 0.2% |
The leap of s from 5.6% to 10.4% suggests plurals play a big role in Scrabble dictionaries and not much of one in `/usr/share/dict/words`. I checked this by running

    perl -nle 'print if /s$/' words5.txt | wc -l

and

    perl -nle 'print if /s$/' scrabble5.txt | wc -l

to tell me how many words that end in s are in each of the two files. There were 357 such words in `words5.txt` and 2771 in `scrabble5.txt`. This told me two things:

- Spell checkers that use `/usr/share/dict/words` must use algorithmic methods to deal with plurals.
- A lot of legal five-letter Scrabble words are just pluralized four-letter words.
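The second conclusion could be checked more directly if a list of four-letter words were available. The sketch below assumes such a list exists (the post does not use one) and counts the five-letter words that are a four-letter word plus s. The helper name `stem_plus_s` and the sample words are made up. Note that the test also catches verb forms like "walks", so it is only a rough measure of plurals.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the five-letter words that are a four-letter word plus "s".
sub stem_plus_s {
    my ($four, $five) = @_;          # array references
    my %is_four = map { $_ => 1 } @$four;
    return grep { /^(.{4})s$/ && $is_four{$1} } @$five;
}

# Made-up samples; real input would come from two word-list files.
my @four = qw(word cake lion);
my @five = qw(words cakes lions glass chess stone);

my @hits = stem_plus_s(\@four, \@five);
printf "%d of %d five-letter words are a four-letter word plus s\n",
       scalar @hits, scalar @five;
```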
I did this on Monday and was pretty happy with it until I read that Times article on Wednesday, where it said that Josh Wardle, the creator of Wordle, had started with a list of 12,000 words but then
…narrowed down the list of Wordle words to about 2,500, which should last for a few years.
That would mean my frequencies are based on a much broader set of words than Wordle considers legal, which could throw off my calculated frequencies.
And yet…
Today I sacrificed my score by trying out some oddball Scrabble words to see if Wordle would accept them. It did. Here’s my game:
I don’t know about you, but if I were limiting myself to just 2,500 words, things like heuch and vrows wouldn’t make the cut. (I would definitely include rebus and tapir, which some dorks—I would also include dorks—have apparently complained about.) So I’m wondering if the Times got this part of the story mixed up somehow. (Update: Nope, see below.)
You might be wondering if counting the number of times each letter appears in the list of legal five-letter words is the right way to characterize the frequency of letters. Maybe we should be counting the number of words each letter appears in. This script, which uses the `uniq` function in the `List::Util` module to filter out repeated letters, does just that:

    #!/usr/bin/perl

    # uniq requires List::Util 1.45 or later.
    use List::Util qw(uniq);

    # Count each letter at most once per word.
    while($word = <>){
      chomp $word;
      foreach (uniq split //, $word){
        $freq{$_}++;
      }
    }

    # Print the counts in alphabetical order.
    foreach $letter (sort keys %freq){
      print "$letter\t$freq{$letter}\n";
    }

Using this way of counting gives us another letter frequency table:
Letter | Count | Frequency |
---|---|---|
s | 4106 | 46.1% |
e | 3993 | 44.8% |
a | 3615 | 40.5% |
r | 2751 | 30.9% |
o | 2626 | 29.5% |
i | 2509 | 28.1% |
l | 2231 | 25.0% |
t | 2137 | 24.0% |
n | 1912 | 21.4% |
u | 1655 | 18.6% |
d | 1615 | 18.1% |
c | 1403 | 15.7% |
y | 1371 | 15.4% |
p | 1301 | 14.6% |
m | 1267 | 14.2% |
h | 1185 | 13.3% |
g | 1050 | 11.8% |
b | 1023 | 11.5% |
k | 913 | 10.2% |
f | 707 | 7.9% |
w | 686 | 7.7% |
v | 465 | 5.2% |
z | 227 | 2.5% |
x | 212 | 2.4% |
j | 184 | 2.1% |
q | 79 | 0.9% |
In this table, the frequency column gives the percentage of words in which each letter appears. The ordering of the letters is basically the same as before, so I don’t think this way of counting will change your strategy.
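Rather than comparing the two tables by eye, the two orderings can be printed side by side. This is a minimal sketch, not part of the original analysis. `ranking` and `both_rankings` are hypothetical names, ties are broken alphabetically by assumption, and the sample words are chosen only because they contain repeated letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(uniq);    # uniq requires List::Util 1.45 or later

# Letters ordered by descending count, ties broken alphabetically.
sub ranking {
    my ($freq) = @_;
    return sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq;
}

# Rank letters two ways: by total occurrences, and by the number of
# words that contain the letter. Returns two array references.
sub both_rankings {
    my @words = @_;
    my (%occurrences, %containing);
    for my $word (@words) {
        my @letters = split //, $word;
        $occurrences{$_}++ for @letters;
        $containing{$_}++  for uniq @letters;
    }
    return ([ranking(\%occurrences)], [ranking(\%containing)]);
}

# Made-up sample words; real input would be the lines of scrabble5.txt.
my @words = qw(geese eerie sassy roars);
my ($by_count, $by_word) = both_rankings(@words);
print "by occurrences: @$by_count\n";
print "by words:       @$by_word\n";
```

On a tiny sample full of doubled letters the two orderings can differ noticeably. On a list of several thousand words the differences are much smaller, as the tables show.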
Update 1/8/2022 5:29 PM
People have gently tweeted me that the Wordle source code—which you can easily download, and I could have easily downloaded before writing this post—has two lists. One is words that might be answers (2,315), and the other is additional words that can be guessed (10,657). You can run my scripts on either of these lists (or their concatenation) to refine your strategies. If you’re a serious Wordle player, think carefully about whether doing so would spoil your fun.
Thanks to Tim Dierks and Antonio Bueno.
Update 1/9/2022 11:22 AM
Todd Wells devised a way to grade five-letter words according to how well they match a decent-sized corpus of five-letter words. His grade is based on both the number of letters that match and whether they’re in the right position. Both his Python code and his explanation of how it works are very clear, and you should go take a look.
I suppose I should have said this in the original post, but I’ll say it here: These methods of scoring letters and words are really just for your first, and maybe your second, guess (Todd makes this explicit by naming his script `wordle_starting_guess`). They are ways of increasing the probability of “hits” early in the game. After that, it’s a matter of vocabulary and logic to bring you home.
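One simple way to turn the letter counts into a word score is sketched below. To be clear, this is not Todd Wells's method, which also accounts for letter position. It is a simpler stand-in that scores each word by summing, over its distinct letters, the number of words containing that letter. The name `score_words` and the sample words are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(uniq sum);    # uniq requires List::Util 1.45 or later

# Score each word by summing, over its distinct letters, the number of
# words in the list that contain that letter. Returns word => score.
sub score_words {
    my @words = @_;
    my %containing;
    for my $word (@words) {
        $containing{$_}++ for uniq split //, $word;
    }
    my %score;
    for my $word (@words) {
        $score{$word} = sum map { $containing{$_} } uniq split //, $word;
    }
    return %score;
}

# Made-up sample words; real input would be a full five-letter word list.
my @words = qw(arose raise query jazzy eerie);
my %score = score_words(@words);

# Print the words from highest to lowest score.
for my $word (sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score) {
    print "$word\t$score{$word}\n";
}
```

Because repeated letters earn no extra credit, this scoring favors guesses with five different common letters, which suits a first guess.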
1. Please don’t tweet me shell commands that can do this with less typing. I know they exist, but this was the most efficient for me because I could type it out with virtually no thought at all.
2. You may remember I did a similar thing a couple of years ago to create lists of words with 6–9 letters to help me cheat at Countdown.