Improved RSS subscriber count script

When I got my RSS subscriber count email this morning, I knew something was wrong because the count was about twice what it was the day before. I’ve found and fixed the bug that can overcount the subscribers through Google Reader, but there’s still one more bug that needs to be fixed.

The script I run to calculate the subscriber count is a slight variation on Marco Arment’s original. Surprisingly, the overcounting bug has nothing to do with my changes; it’s in Marco’s code.

Here’s the problem code.

bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort | uniq | cut -d' ' -f 1 | awk '{s+=$1} END {print s}'`

That’s a really long pipeline, so let’s look at each part individually:

- fgrep "$LOG_FDATE" "$LOG_FILE" pulls today’s entries out of the access log.
- fgrep " $RSS_URI" keeps only the requests for the feed.
- egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' extracts just the subscriber count and feed-id from each of those requests.
- sort | uniq reduces the output to one line for each unique count/feed-id combination.
- cut -d' ' -f 1 keeps just the count at the start of each line.
- awk '{s+=$1} END {print s}' sums the counts and prints the total.

The problem is in the uniq command. If your subscriber count changes during the course of the day—not an uncommon occurrence—you’ll get lines that look like this after the sort:

2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551

The uniq command will collapse each run of identical lines into a single line, but it won’t treat the 2735 lines and the 2737 lines as duplicates of one another, because they aren’t. It’ll convert these lines into

2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551

You can see the problem. Because uniq keeps two lines associated with the same feed-id, we’re counting most of those subscriptions twice. That’s why my subscriber count this morning was nearly twice what it should have been.
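
You can check the arithmetic by running the tail end of the pipeline on those two surviving lines:

printf '%s\n' '2735 subscribers; feed-id=9141626367700991551' '2737 subscribers; feed-id=9141626367700991551' | awk '{s+=$1} END {print s}'

This prints 5472, almost exactly double the true count of 2737.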

What we need is a way to tell uniq to return just one line for each feed-id. Ideally, we’d like the line from the end of the day, because that’s the most up-to-date count.

Here’s my solution. Instead of a simple sort | uniq, I do this:

sort -t= -k2 -s | tac | uniq -f2

The -t= -k2 options tell sort to reorder the lines on the basis of what comes after the equals sign, which is the feed-id. The -s option ensures that the sort is stable, that is, that lines with the same feed-id appear in their original order after the sort.

The tac command then reverses all the lines, so for each feed-id the top line will be the one associated with the last inquiry of the day. This will be important in our next step.

The -f2 option tells uniq to ignore the first two “fields” of each line, where fields are separated by white space. In other words, it decides on the uniqueness of a line by looking at the feed-id=1234567890 part only. This will turn a section like

2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551

into

2737 subscribers; feed-id=9141626367700991551

which is just what we want.

You can see now why I reversed the lines after sorting. When uniq is used without options, the line it chooses to retain doesn’t matter because they’re all the same. But when we use the -f option, it’s important to know which of the “duplicate” lines (which are duplicates over only a portion of their length) is returned. As it happens, it’s the first “duplicate” that’s returned, so by reversing the stable sort in the previous step, uniq -f2 returns the line from the last hit of the day for each feed-id.
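
If you want to check this behavior without waiting for a day’s worth of log entries, you can run the fragment on some fake input; the second feed-id here is made up to stand in for another feed:

printf '%s\n' '2735 subscribers; feed-id=9141626367700991551' '2737 subscribers; feed-id=9141626367700991551' '100 subscribers; feed-id=1111111111111111111' | sort -t= -k2 -s | tac | uniq -f2

The result is one line per feed-id, each carrying that feed’s last count:

2737 subscribers; feed-id=9141626367700991551
100 subscribers; feed-id=1111111111111111111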

The remainder of the pipeline is unchanged, so my new version of Marco’s script gets the Google Reader subscriber count like this:

bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s}'`

Update 9/28/12
When I first wrote this, I didn’t understand how the -s and -r options worked in sort. I thought the -r would reverse the lines after the stable sort, but that’s not what it does. To do what I wanted, I needed the line reversal as a separate command. tac (so named because it acts like cat in reverse) filled the bill.
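
A pair of toy lines with the same feed-id shows the difference. This command

printf '%s\n' '2735 subscribers; feed-id=9' '2737 subscribers; feed-id=9' | sort -t= -k2 -s -r

leaves the lines in their original order, because -r reverses the key comparisons and these keys are equal, whereas

printf '%s\n' '2735 subscribers; feed-id=9' '2737 subscribers; feed-id=9' | sort -t= -k2 -s | tac

puts the 2737 line on top, which is what the rest of the pipeline needs.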

Also, as one of the commenters on Marco’s script pointed out, there’s no need to do the cut when awk can sum over the first field directly.

There’s a similar bug in the pipeline for other aggregators that provide a subscriber count in their access log entries:

bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | sort | uniq | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`

As you can see, this pipeline does the same sort | uniq thing and will double count subscribers to an aggregator if that aggregator’s subscriber figure changes during the course of the day. Unfortunately, I’m not sure how to fix this problem. Because the identifier that distinguishes one aggregator from another doesn’t necessarily come after the subscriber count in these log lines, I don’t know how to trick uniq into behaving the way I want.

For example, if I run the pipeline just up through uniq, I get these lines from the NewsGator aggregator:

"NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"

These shouldn’t be added together, but I can’t tell uniq to consider only the first field of each line—it doesn’t have an option for that. I’m pretty sure a Perl one-liner could do it, but my Perl is a little rusty at the moment. If you can whip one up, or if you have a better idea, I’d like to hear about it.
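
(For what it’s worth, my rusty guess is that something like

tac | perl -ane 'print unless $seen{$F[0]}++'

would do it, with perl’s -a option splitting each line on whitespace and the hash keeping only the first line seen for each first field. But I haven’t tested it.)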

As a practical matter, aggregators that report subscribers but aren’t Google Reader make up such a small part of my total subscriber base that double counting them has little effect. Even if there’s no solution to this problem, it won’t make much difference.

Update 9/28/12
Well, there is a solution, and Marco provided it (mostly) in the comments. The trick lies in using awk arrays and the post-increment (++) operator. Here’s the improved code:

bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | tac | awk -F\( '!x[$1]++' | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`

Instead of sort | uniq, this uses

tac | awk -F\( '!x[$1]++'

The tac command reverses the lines, which puts them in reverse chronological order.

The clever part—due to Marco—is the awk one-liner that returns just the first line of each aggregator. Note first that it’s all pattern: if the value of the pattern is true, the line is printed (print being awk’s default command); if it’s false, nothing is printed. So the script acts as a filter, printing only those lines for which the pattern evaluates to true.

Truth is determined through the value of an associative array, x. As awk reads through the lines of the file, items of x are created with keys that come from the first field, $1, and values that are incremented for every line with a matching first field. The first field is everything before the first open parenthesis, which—in my logs, anyway—corresponds to the name of the aggregator.

The trick is that the ++ incrementing operator acts after x[$1] is evaluated. The first time a new value of $1 is encountered, the value of x[$1] is zero. In the context of an awk pattern, this is false. The not operator, !, flips that to true, and the line is printed. For subsequent lines with that same $1, the value of x[$1] will be a positive number—a true value which the ! will flip to false. Thus, the subsequent lines with the same $1 aren’t printed.

You’ll note that there’s no sort in this pipeline. Because we’re using an awk filter instead of uniq, we don’t have to have the lines grouped by aggregator before running the filter.
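
Here’s a quick test with made-up (but realistically shaped) user-agent strings from two aggregators, deliberately interleaved:

printf '%s\n' '"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"' '"Bloglines/3.1 (http://www.bloglines.com; 2 subscribers)"' '"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"' | awk -F\( '!x[$1]++'

It prints only the first line it sees for each aggregator, even though the NewsGator lines aren’t adjacent:

"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"
"Bloglines/3.1 (http://www.bloglines.com; 2 subscribers)"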

I won’t pretend I understood the awk one-liner as soon as I saw it. I seldom use awk and have never written a script that used arrays in the pattern. But once I started to catch on, I realized it was very much like some Perl programs I’ve seen that build associative arrays to count word occurrences.

Double-counting Google Reader subscribers, though, is a big deal. If you’re using Marco’s script, you should change that pipeline.

Update 9/28/12
I see that Marco has updated his script since I first posted. My changes to the pipeline differ a little from his, so I set up my own fork of his script.