Last post on RSS subscriber counting

I’ve made a few final (I think) changes to the RSS subscriber count script and figured I might as well post them here in case anyone is interested.

The script is based on Marco Arment’s original. The great bulk of the script is his; I made a couple of improvements to the counting pipelines, added the ability to query multiple feeds, and changed the output method from emailing the counts to appending them to a history file.

Here’s the script:

#!/bin/bash

# A modification of Marco Arment's script at
#
#   https://gist.github.com/3783146
#
# It's intended to be run once a day as a cron job. The main
# differences between it and Marco's script are:
#
#  1. It checks two feeds instead of just one.
#  2. It combines the non-Google Reader counts into a single number.
#  3. It doesn't write anything to stdout or send email.
#  4. It adds a line to a history file with the date and counts.
#

# Required variables. Edit these for your server.
FEED_LIST="/all-this/feed/ /all-this/feed/atom/"
LOG_FILE="/path/to/apache/access/log/file"
HISTORY_FILE="subscribers.txt"

# Date expression for yesterday
DATE="-1 day"

# Date format in Apache log
LOG_FDATE=`date -d "$DATE" '+%d/%b/%Y'`

# Human-readable date format (a leftover from the emailing version; unused here)
HUMAN_FDATE=`date -d "$DATE" '+%F'`

# Date format for history file.
HISTORY_FDATE=`date -d "$DATE" '+%Y-%m-%d'`

# Start the line with yesterday's date.
DAYLINE=$(printf "%s:  " $HISTORY_FDATE)

# Loop through the feeds, collecting subscriber counts and adding
# them to the line.
for RSS_URI in $FEED_LIST; do

  # Unique IPs requesting RSS, except those reporting "subscribers":
  IPSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI " | egrep -v '[0-9]+ subscribers' | cut -d' ' -f 1 | sort | uniq | wc -l`

  # Google Reader subscribers and other user-agents reporting "subscribers"
  # and using the "feed-id" parameter for uniqueness:
  GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI " | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s}'`

  # Other user-agents reporting "subscribers", for which we'll use the
  # entire user-agent string for uniqueness:
  OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI " | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | tac | awk -F\( '!x[$1]++' | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`

  # Add the non-Google Reader subscribers.
  NONGRSUBS=$(($IPSUBS + $OTHERSUBS))

  DAYLINE=$DAYLINE$(printf "%5d  " $GRSUBS; printf "%5d  " $NONGRSUBS)
done

# Append yesterday's info to the history file.
echo "$DAYLINE" >> "$HISTORY_FILE"
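The trickiest of the counting pipelines is the Google Reader one. The sort/tac/uniq combination keeps only the last-reported count for each feed-id, so an aggregator that polls several times a day isn't double-counted. Here's a sketch of how it works on some made-up user-agent fragments:

```shell
#!/bin/bash

# Hypothetical fragments extracted from the log: two reports from one
# aggregator (feed-id=111) and one from another (feed-id=222).
printf '%s\n' \
  '80 subscribers; feed-id=111' \
  '82 subscribers; feed-id=111' \
  '15 subscribers; feed-id=222' |
sort -t= -k2 -s |  # stable sort groups the lines by feed-id, keeping log order within each group
tac |              # reverse, so the latest report of each feed-id comes first
uniq -f2 |         # compare from the third field (feed-id=...), keeping one line per id
awk '{s+=$1} END {print s}'  # sum the surviving counts: 82 + 15 = 97
```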

The line that’s appended to the history file looks like this:

2012-10-13:   2783    521     27     21

After the date, the subscriber counts are in the order (Google Reader count, non-Google Reader count) for each feed in the list. The list of feed URLs is simply a string with the two feeds separated by a space. The “http://leancrew.com” prefix to the URLs isn’t included because it isn’t present in the Apache log file.
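Because the counts sit in fixed columns, the history file is easy to slice with standard tools. For example, summing the Google Reader and non-Google Reader counts for the first feed is a one-line awk job (the file name and sample data here are made up):

```shell
#!/bin/bash

# A couple of hypothetical history lines.
printf '%s\n' \
  '2012-10-12:   2770    515     25     20' \
  '2012-10-13:   2783    521     27     21' > subscribers.txt

# Date plus the combined count for the first feed
# (Google Reader in column 2, everything else in column 3).
awk '{print $1, $2 + $3}' subscribers.txt
# → 2012-10-12: 3285
# → 2012-10-13: 3304
```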

I decided to append the counts to a history file for two reasons:

  1. I was tired of getting a daily email with the counts.
  2. When I did want to look at the counts, I didn’t want just a snapshot of that day’s subscriber counts. I wanted to be able to see how they were changing.

The history file is kept on the server. If I want it on my local machine, a quick scp will copy it here.
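For completeness, the two supporting pieces might look like this: a crontab entry on the server to run the script nightly, and the scp command to pull the history file down. The script name, user, host, and paths are all assumptions:

```shell
# Server crontab entry (hypothetical path): run the counter a few
# minutes after midnight, once yesterday's log entries are complete.
# 5 0 * * * /home/user/bin/subscriber-count.sh

# On the local machine (hypothetical host and path): fetch the history file.
scp user@example.com:/home/user/subscribers.txt .
```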