Improved RSS subscriber count script
September 27, 2012 at 10:12 PM by Dr. Drang
When I got my RSS subscriber count email this morning, I knew something was wrong because the count was about twice what it was the day before. I’ve found and fixed the bug that can overcount the subscribers through Google Reader, but there’s still one more bug that needs to be fixed.
The script I run to calculate the subscriber count is a slight variation on Marco Arment’s original. Surprisingly, the overcounting bug has nothing to do with my changes; it’s in Marco’s code.
Here’s the problem code.
bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort | uniq | cut -d' ' -f 1 | awk '{s+=$1} END {print s}'`
That’s a really long pipeline, so let’s look at each part individually:
fgrep "$LOG_FDATE" "$LOG_FILE"
This reads the site’s access log file ($LOG_FILE
is defined as the path to the log file earlier in the script) and returns only those lines associated with yesterday (again,$LOG_FDATE
is defined as yesterday’s date earlier in the script).fgrep " $RSS_URI"
returns only those lines accessing the site’s feed URL (yes,$RSS_URI
is defined earlier in the script).egrep -o '[0-9]+ subscribers; feed-id=[0-9]+'
returns only those lines that have both a subscriber count and afeed-id
definition. This is characteristic of hits from Google’s FeedFetcher for Reader. The-o
option tellsegrep
to return only the portion of the line that matches the regular expression, so we’re left with lines that look like this:2735 subscribers; feed-id=9141626367700991551
sort
sorts the lines alphabetically.uniq
eliminates duplicate lines that are adjacent to one another. Thissort | uniq
construct is common in shell scripts. Becauseuniq
only eliminates duplicates if they are adjacent, thesort
is needed to make them adjacent.cut -d' ' -f 1
returns just the subscriber count for each line, which is before the first space character.awk '{s+=$1} END {print s}'
adds up all the counts and returns the sum.
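To see how the back half of the pipeline fits together, here’s a quick run on some made-up matches (the second feed-id is invented for the example):

bash:
printf '%s\n' \
  '2735 subscribers; feed-id=9141626367700991551' \
  '120 subscribers; feed-id=8675309867530986753' \
  '2735 subscribers; feed-id=9141626367700991551' |
  sort | uniq | cut -d' ' -f 1 | awk '{s+=$1} END {print s}'
# prints 2855: the duplicate Google Reader line collapses to one,
# and the two distinct counts (2735 and 120) are summed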
The problem is in the uniq command. If your subscriber count changes during the course of the day—not an uncommon occurrence—you’ll get lines that look like this after the sort:
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
The uniq command will not treat these as duplicates because they aren’t. It’ll convert these lines into
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
You can see the problem. Because uniq keeps two lines associated with the same feed-id, we’re counting most of those subscriptions twice. That’s why my subscriber count this morning was nearly twice what it should have been.
What we need is a way to tell uniq to return just one line for each feed-id. Ideally, we’d like the line from the end of the day, because that’s the most up-to-date count.
Here’s my solution. Instead of a simple sort | uniq, I do this:
sort -t= -k2 -s | tac | uniq -f2
The -t= -k2 options tell sort to reorder the lines on the basis of what comes after the equals sign, which is the feed-id. The -s option ensures that the sort is stable, that is, that lines with the same feed-id appear in their original order after the sort.

The tac command then reverses all the lines, so for each feed-id the top line will be the one associated with the last inquiry of the day. This will be important after our next step.
The -f2 option tells uniq to ignore the first two “fields” of each line, where fields are separated by white space. In other words, it decides on the uniqueness of a line by looking at the feed-id=1234567890 part only. This will turn a section like
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
into
2737 subscribers; feed-id=9141626367700991551
which is just what we want.
You can see now why I reversed the lines after sorting. When uniq is used without options, the line it chooses to retain doesn’t matter because they’re all the same. But when we use the -f option, it’s important to know which of the “duplicate” lines (which are duplicates over only a portion of their length) is returned. As it happens, it’s the first “duplicate” that’s returned, so by reversing the stably sorted lines with tac in the previous step, uniq -f2 returns the line from the last hit of the day for each feed-id.
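If you want to check the whole replacement fragment, here’s a run on some made-up lines (the second feed-id is invented); the input is in chronological order, and only the last count of the day for each feed survives:

bash:
printf '%s\n' \
  '120 subscribers; feed-id=8675309867530986753' \
  '2735 subscribers; feed-id=9141626367700991551' \
  '2737 subscribers; feed-id=9141626367700991551' \
  '121 subscribers; feed-id=8675309867530986753' |
  sort -t= -k2 -s | tac | uniq -f2
# 2737 subscribers; feed-id=9141626367700991551
# 121 subscribers; feed-id=8675309867530986753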
The remainder of the pipeline is unchanged, so my new version of Marco’s script gets the Google Reader subscriber count like this:
bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s}'`
Update 9/28/12
When I first wrote this, I didn’t understand how the -s and -r options worked in sort. I thought the -r would reverse the lines after the stable sort, but that’s not what it does. To do what I wanted, I needed the line reversal as a separate command. tac (so named because it acts like cat in reverse) filled the bill.
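Here’s a small made-up illustration of the difference (the second feed-id is invented, and the lines are in chronological order). With GNU sort, -r reverses the key comparison rather than the final output order, so lines with equal keys keep their original order; tac reverses the whole output:

bash:
LINES='2735 subscribers; feed-id=9141626367700991551
120 subscribers; feed-id=8675309867530986753
2737 subscribers; feed-id=9141626367700991551'

echo "$LINES" | sort -t= -k2 -sr | uniq -f2
# 2735 subscribers; feed-id=9141626367700991551   (the stale morning count survives)
# 120 subscribers; feed-id=8675309867530986753

echo "$LINES" | sort -t= -k2 -s | tac | uniq -f2
# 2737 subscribers; feed-id=9141626367700991551   (the last count of the day survives)
# 120 subscribers; feed-id=8675309867530986753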
Also, as one of the commenters on Marco’s script pointed out, there’s no need to do the cut when awk can sum over the first field directly.
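Here’s a quick made-up check (the second feed-id is again invented):

bash:
printf '2735 subscribers; feed-id=9141626367700991551\n120 subscribers; feed-id=8675309867530986753\n' |
  awk '{s+=$1} END {print s}'
# prints 2855; awk splits on whitespace by default, so $1 is already the count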
There’s a similar bug in the pipeline for other aggregators that provide a subscriber count in their access log entries:
bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | sort | uniq | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`
As you can see, this pipeline does the same sort | uniq thing and will double-count subscribers to an aggregator if that aggregator’s subscriber figure changes during the course of the day. Unfortunately, I’m not sure how to fix this problem. Because the identifier that distinguishes one aggregator from another doesn’t necessarily come after the subscriber count in these log lines, I don’t know how to trick uniq into behaving the way I want.
For example, if I run just the part of the pipeline through uniq, I get these lines from the NewsGator aggregator:
"NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"
These shouldn’t be added together, but I can’t tell uniq to consider only the first field of each line—it doesn’t have an option for that. I’m pretty sure a Perl one-liner could do it, but my Perl is a little rusty at the moment. If you can whip one up, or if you have a better idea, I’d like to hear about it.
As a practical matter, aggregators that report subscribers but aren’t Google Reader make up such a small part of my total subscriber base that double counting them has little effect. Even if there’s no solution to this problem, it won’t make much difference.
Double-counting Google Reader subscribers, though, is a big deal. If you’re using Marco’s script, you should change that pipeline.
Update 9/28/12
Well, there is a solution, and Marco provided it (mostly) in the comments. The trick lies in using awk arrays and the post-increment (++) operator. Here’s the improved code:
bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | tac | awk -F\( '!x[$1]++' | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`
Instead of sort | uniq, this uses
tac | awk -F\( '!x[$1]++'
The tac command reverses the lines, which puts them in reverse chronological order.
The clever part—due to Marco—is the awk one-liner that returns just the first line of each aggregator. Note first that it’s all pattern: if the value of the pattern is true, the line is printed (print being awk’s default command); if it’s false, nothing is printed. So the script acts as a filter, printing only those lines for which the pattern evaluates to true.
Truth is determined through the value of an associative array, x. As awk reads through the lines of the file, items of x are created with keys that come from the first field, $1, and values that are incremented for every line with a matching first field. The first field is everything before the first open parenthesis, which—in my logs, anyway—corresponds to the name of the aggregator.
The trick is that the ++ incrementing operator acts after x[$1] is evaluated. The first time a new value of $1 is encountered, the value of x[$1] is zero. In the context of an awk pattern, this is false. The not operator, !, flips that to true, and the line is printed. For subsequent lines with that same $1, the value of x[$1] will be a positive number—a true value which the ! will flip to false. Thus, the subsequent lines with the same $1 aren’t printed.
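For illustration, suppose those NewsGator lines appear in the log in that order (oldest first); running them through just that fragment keeps only the most recent report:

bash:
printf '%s\n' \
  '"NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)"' \
  '"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"' \
  '"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"' |
  tac | awk -F\( '!x[$1]++'
# "NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"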
You’ll note that there’s no sort in this pipeline. Because we’re using an awk filter instead of uniq, we don’t have to have the lines grouped by aggregator before running the filter.
I won’t pretend I understood the awk one-liner as soon as I saw it. I seldom use awk and have never written a script that used arrays in the pattern. But once I started to catch on, I realized it was very much like some Perl programs I’ve seen that build associative arrays to count word occurrences.
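For comparison, here’s a rough sketch of that word-counting idiom in awk itself; it has nothing to do with the subscriber script, but the array trick is the same:

bash:
echo 'the quick brown fox jumps over the lazy dog' |
  awk '{for (i = 1; i <= NF; i++) count[$i]++} END {for (w in count) print count[w], w}'
# prints each word with its count; "the" comes out with a 2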
Update 9/28/12
I see that Marco has updated his script since I first posted. My changes to the pipeline differ a little from his, so I set up my own fork of his script.