How many words have I written?
August 28, 2017 at 9:07 PM by Dr. Drang
I’m about an hour into the current episode of The Talk Show (“Prison Oreos” with Jason Snell), and as I was listening to John Gruber talk about the word count statistics for Daring Fireball—two million words and growing—I thought about how I could do the same thing here.
My task is simplified by the way I have the blog structured. All the source files are in Markdown format, have a .md extension, and sit in a set of nested folders: a folder for each year and a folder for each month within each year.
Let’s start by going to the top level source folder and figuring out how many posts I’ve written.
find . -iname '*.md' | wc -l
The find command prints out all the .md files in the underlying folders, one per line, and the wc command counts the lines. That yields 2,395 blog posts.
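A minimal sketch of that count on a throwaway tree, for anyone who wants to try it without a blog at hand. The folder layout mirrors the year/month structure described above, but the file names are made up for illustration. The pattern is quoted so the shell passes it to find unexpanded; an unquoted *.md would be expanded by the shell first if the current folder happened to contain a matching file.

```shell
# Build a small year/month tree in a temporary folder (file names are hypothetical).
demo=$(mktemp -d "${TMPDIR:-/tmp}/wc-demo.XXXXXX")
mkdir -p "$demo/2017/08" "$demo/2016/12"
touch "$demo/2017/08/bulk.md" "$demo/2017/08/apple-sales.md"
touch "$demo/2016/12/notes.MD"      # -iname ignores case, so this one counts too
touch "$demo/2017/08/image.png"     # not Markdown, so it is skipped

# Quoting '*.md' keeps the shell from expanding it; find matches it at every depth.
count=$(find "$demo" -type f -iname '*.md' | wc -l | tr -d ' ')
echo "$count"

rm -rf "$demo"
```

The tr -d ' ' at the end only strips the padding that some versions of wc put in front of the number.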
A simple word count can be done by altering the pipeline a bit.
find . -iname '*.md' | xargs wc -w
The xargs command gets the file names from find and feeds them as arguments to wc. This prints out the word counts for each file in turn and then gives the total.
562 ./2017/08/a-riveting-show.md
345 ./2017/08/apple-sales.md
569 ./2017/08/bulk.md
203 ./2017/08/familiar-tools.md
891 ./2017/08/my-jxa-problem.md
537 ./2017/08/return-to-textexpander.md
610 ./2017/08/subscriptions.md
1422476 total
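If only the grand total matters, one alternative is to concatenate the files and count once. This is a sketch of a variation on the pipeline above, not a replacement for it: it sidesteps two things that can trip up plain xargs, namely file names containing spaces and a file list long enough that xargs splits it across several wc runs, each printing its own total line. The sample files are invented.

```shell
# Two sample posts, one with a space in its name (both names are hypothetical).
demo=$(mktemp -d "${TMPDIR:-/tmp}/wc-demo.XXXXXX")
mkdir -p "$demo/2017/08"
printf 'one two three\n' > "$demo/2017/08/first post.md"
printf 'four five\n' > "$demo/2017/08/second.md"

# -exec cat {} + concatenates every matching file; wc -w then prints one number.
total=$(find "$demo" -type f -iname '*.md' -exec cat {} + | wc -w | tr -d ' ')
echo "$total"

rm -rf "$demo"
```

The trade-off is that the per-file counts are lost, so this suits the final tally rather than the exploratory listing.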
It must be those extra 600,000 words Gruber’s written that make his site more popular than mine.
But I really should refine my word count. In 2008 and 2009, I had the bad idea to automatically post a summary of my tweets from the previous day. There were, I’m sorry to say, 333 such posts, and although all the words in them were mine, including them is cheating. They all had titles like “Tweets for January 15, 2009,” so we can filter them out of what find returns by looking for the string tweets-for- in the file name.
find -E . -not -regex '.+tweets-for-.+' -iname '*.md' | xargs wc -w
The -regex '.+tweets-for-.+' part finds files with the telltale string, and the -not excludes them from the list. The -E tells find to use “extended” regular expressions, which is the kind I’m familiar with.
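The -E flag belongs to the BSD find that ships with macOS; GNU find does not accept it. A sketch of a variant that should work with either, assuming the telltale string always appears in the file name itself, swaps the regex for a glob with ! -name. The two sample files are made up.

```shell
# One tweet-summary post and one ordinary post (hypothetical names).
demo=$(mktemp -d "${TMPDIR:-/tmp}/wc-demo.XXXXXX")
mkdir -p "$demo/2009/01" "$demo/2017/08"
printf 'a b c\n' > "$demo/2009/01/tweets-for-january-15-2009.md"
printf 'd e\n' > "$demo/2017/08/bulk.md"

# ! -name tests the file name only, with a glob, so no regex dialect is involved.
kept=$(find "$demo" -type f -iname '*.md' ! -name '*tweets-for-*')
kept_count=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')
kept_name=$(basename "$kept")
echo "$kept_count $kept_name"

rm -rf "$demo"
```

Unlike -regex, which is matched against the whole path, -name looks only at the last component, so a folder that happened to contain the string would not cause its contents to be excluded.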
Excluding the tweet posts brings us down to 1,391,790 words. What else can we filter out?
Every source file includes header lines at the top with the title, date, keywords, and other metadata that shouldn’t be counted as writing. These lines are separated from the body of the post by at least one blank line, and we can use sed to get rid of them:
find -E . -not -regex '.+tweets-for-.+' -iname '*.md' | xargs -I {} sed '1,/^$/ d' {} | wc -w
Here, sed '1,/^$/ d' deletes the lines from the top of the file through the first blank line. To make sure sed acts on each individual file instead of the concatenation of all of them, we add the -I {} part to xargs and {} to the end of the sed command. This feeds the file names to sed one at a time and generates a long, long string of text that’s piped to wc -w to get the word count.
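To see why the per-file invocation matters, here is a small sketch with two invented posts. The Title: header is an assumption about what the metadata lines look like; only the blank line that ends the header matters to sed.

```shell
# Two tiny posts, each with a one-line header, a blank line, and a two-word body.
demo=$(mktemp -d "${TMPDIR:-/tmp}/wc-demo.XXXXXX")
printf 'Title: One\n\nalpha beta\n' > "$demo/one.md"
printf 'Title: Two\n\ngamma delta\n' > "$demo/two.md"

# Per file: sed starts fresh on each post, so both headers are removed.
per_file=$(find "$demo" -name '*.md' | xargs -I {} sed '1,/^$/ d' {} | wc -w | tr -d ' ')

# Concatenated: "line 1" exists only once, so the second header survives and is counted.
joined=$(cat "$demo"/*.md | sed '1,/^$/ d' | wc -w | tr -d ' ')

echo "$per_file $joined"

rm -rf "$demo"
```

The per-file count is the four body words; the concatenated count picks up the two words of the second header as well.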
Without headers, we’re down to 1,347,380 words. I think this is a legitimate word count, but maybe we should filter out the URLs of links. Because I use reference links, they’re pretty easy to find and delete with another sed invocation.
find -E . -not -regex '.+tweets-for-.+' -iname '*.md' | xargs -I {} sed '1,/^$/ d' {} | sed -E '/^\[.+\]: / d' | wc -w
The sed -E '/^\[.+\]: / d' command deletes all lines that start with bracketed text followed by a colon and space. This is the format for reference links. The -E flag tells sed to use extended regex syntax.
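A quick sketch of that filter on a single invented post. The example.com URL is a placeholder, not a real link from the blog.

```shell
# A post body with one reference link definition at the bottom.
post=$(mktemp "${TMPDIR:-/tmp}/post.XXXXXX")
cat > "$post" <<'EOF'
I was listening to [The Talk Show][1] today.

[1]: https://example.com/the-talk-show
EOF

before=$(wc -w < "$post" | tr -d ' ')

# Delete lines of the form "[label]: ..."; the [text][1] usage in the prose is untouched.
after=$(sed -E '/^\[.+\]: / d' "$post" | wc -w | tr -d ' ')

echo "$before $after"

rm -f "$post"
```

Only the definition line goes away, which removes two "words": the [1]: label and the URL. The link text in the sentence still counts as writing.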
Now we’re down to 1,283,149 words, and I suppose if we’re going to exclude link URLs we should exclude source code of scripts, too. No matter how long they took to write, the scripts (usually) weren’t written for the purposes of blogging; they were written to solve a problem and then pasted into the blog.
Source code blocks start with four or more spaces at the beginning of each line, so the filter for them is easy to add:
find -E . -not -regex '.+tweets-for-.+' -iname '*.md' | xargs -I {} sed '1,/^$/ d' {} | sed -E '/^\[.+\]: / d' | sed '/^    / d' | wc -w
Now we’re down to 1,147,413 words. Frankly, I thought deleting the source code would bring it down further than that.
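A sketch of the code-block filter on one invented post. It writes the four spaces as an explicit count, {4}, which needs the -E flag but is harder to mangle when the command is copied around than a literal run of spaces.

```shell
# Prose, then an indented code block, then more prose.
post=$(mktemp "${TMPDIR:-/tmp}/post.XXXXXX")
printf 'Some prose here.\n\n    find . -name "*.md"\n    wc -w\n\nMore prose.\n' > "$post"

# Delete every line that begins with at least four spaces.
prose=$(sed -E '/^ {4}/ d' "$post" | wc -w | tr -d ' ')
echo "$prose"

rm -f "$post"
```

Lines indented with a tab, or continuation lines of a list item indented by four spaces, are edge cases this pattern does not distinguish from code.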
This was a fun exercise and didn’t take too long. I’d never used -not or -regex with the find command before, so I’m a little smarter than I was at the beginning of the day. Similarly for the -I {} option to xargs. The sed commands were nothing new, but I probably should get in the habit of adding the -E option regardless of whether it’s needed. The syntax of “basic” regular expressions is something I don’t know and don’t ever want to learn.
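For reference, the three sed filters can be folded into a single invocation, with find running sed once per file through -exec. This is a sketch of a consolidated variant, not the pipeline used for the numbers above; it assumes the tweet posts can be recognized by file name alone, and the sample files and URL are invented.

```shell
demo=$(mktemp -d "${TMPDIR:-/tmp}/wc-demo.XXXXXX")
mkdir -p "$demo/2009/01" "$demo/2017/08"

# A tweet-summary post that should be skipped entirely.
printf 'Title: Tweets\n\nskip these words\n' > "$demo/2009/01/tweets-for-january-15-2009.md"

# An ordinary post with a header, prose, a code block, and a reference link.
cat > "$demo/2017/08/bulk.md" <<'EOF'
Title: Bulk
Date: 2017-08-01

Prose with a [link][1] here.

    echo "code is skipped"

[1]: https://example.com/
EOF

# One sed per file: drop the header, then reference links, then indented code.
words=$(find "$demo" -type f -iname '*.md' ! -name '*tweets-for-*' \
  -exec sed -E -e '1,/^$/ d' -e '/^\[.+\]: / d' -e '/^ {4}/ d' {} \; \
  | wc -w | tr -d ' ')
echo "$words"

rm -rf "$demo"
```

Using -exec ... \; keeps the one-file-at-a-time behavior that -I {} provided, so the header range still restarts at line 1 of every post.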