Last post on RSS subscriber counting

I’ve made a few final (I think) changes to the RSS subscriber count script and figured I might as well post them here in case anyone is interested.

The script is based on Marco Arment’s original. The great bulk of the script is his; I made a couple of improvements to the counting pipelines, added the ability to query multiple feeds, and changed the output method from emailing the counts to appending them to a history file.

Here’s the script:

bash:
#!/bin/bash

# A modification of Marco Arment's script at
#
#   https://gist.github.com/3783146
#
# It's intended to be run once a day as a cron job. The main
# differences between it and Marco's script are:
#
#  1. It checks two feeds instead of just one.
#  2. It combines the non-Google Reader counts into a single number.
#  3. It doesn't write anything to stdout or send email.
#  4. It adds a line to a history file with the date and counts.
#

# Required variables. Edit these for your server.
FEED_LIST="/all-this/feed/ /all-this/feed/atom/"
LOG_FILE="/path/to/apache/access/log/file"
HISTORY_FILE="subscribers.txt"

# Date expression for yesterday (GNU date syntax)
DATE="-1 day"

# Date format in Apache log
LOG_FDATE=$(date -d "$DATE" '+%d/%b/%Y')

# Date format for display in emails (unused here, since no email is sent)
HUMAN_FDATE=$(date -d "$DATE" '+%F')

# Date format for history file.
HISTORY_FDATE=$(date -d "$DATE" '+%Y-%m-%d')

# Start the line with yesterday's date.
DAYLINE=$(printf "%s:  " "$HISTORY_FDATE")

# Loop through the feeds, collecting subscriber counts and adding
# them to the line. FEED_LIST is left unquoted so it splits on spaces.
for RSS_URI in $FEED_LIST; do

  # Unique IPs requesting RSS, except those reporting "subscribers":
  IPSUBS=$(grep -F "$LOG_FDATE" "$LOG_FILE" | grep -F " $RSS_URI " | grep -Ev '[0-9]+ subscribers' | cut -d' ' -f 1 | sort | uniq | wc -l)

  # Google Reader subscribers and other user-agents reporting "subscribers"
  # and using the "feed-id" parameter for uniqueness.
  # (s+0 makes awk print 0 instead of an empty string when nothing matches.)
  GRSUBS=$(grep -F "$LOG_FDATE" "$LOG_FILE" | grep -F " $RSS_URI " | grep -Eo '[0-9]+ subscribers; feed-id=[0-9]+' | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s+0}')

  # Other user-agents reporting "subscribers", for which we'll use the
  # entire user-agent string for uniqueness:
  OTHERSUBS=$(grep -F "$LOG_FDATE" "$LOG_FILE" | grep -F " $RSS_URI " | grep -Fv 'subscribers; feed-id=' | grep -E '[0-9]+ subscribers' | grep -Eo '"[^"]+"$' | tac | awk -F\( '!x[$1]++' | grep -Eo '[0-9]+ subscribers' | awk '{s+=$1} END {print s+0}')

  # Add the non-Google Reader subscribers.
  NONGRSUBS=$((IPSUBS + OTHERSUBS))

  DAYLINE=$DAYLINE$(printf "%5d  %5d  " "$GRSUBS" "$NONGRSUBS")
done

# Append yesterday's info to the history file.
echo "$DAYLINE" >> "$HISTORY_FILE"
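
The deduplication step in the Google Reader pipeline is the easiest part to misread, so here is a minimal sketch of it running on made-up data. The three sample lines and the names `SAMPLE`, `dedup_and_sum`, and `GR_TOTAL` are invented for illustration. The sample stands in for what the `feed-id` extraction stage would emit on a day when one feed-id reports twice.

```bash
# Made-up output of the extraction stage, in log order:
# feed-id 111 reports twice, feed-id 222 once.
SAMPLE='100 subscribers; feed-id=111
40 subscribers; feed-id=222
102 subscribers; feed-id=111'

# The same sort/tac/uniq/awk stages the script uses for GRSUBS.
dedup_and_sum() {
  sort -t= -k2 -s |   # group by feed-id; -s keeps log order within a group
  tac |               # reverse, so each feed-id's latest report comes first
  uniq -f2 |          # ignore the first two fields; keep one line per feed-id
  awk '{s+=$1} END {print s+0}'
}

GR_TOTAL=$(printf '%s\n' "$SAMPLE" | dedup_and_sum)
echo "$GR_TOTAL"
```

With this sample the total is 142 (102 + 40): the later report of 102 for feed-id 111 replaces the earlier 100 rather than being added to it.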

The line that’s appended to the history file looks like this:

2012-10-13:   2783    521     27     21

After the date, the subscriber counts are in the order (Google Reader count, non-Google Reader count) for each feed in the list. The list of feed URLs is simply a string with the two feeds separated by a space. The “http://leancrew.com” prefix to the URLs isn’t included because it isn’t present in the Apache log file.
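
Since the counts come in (Google Reader, non-Google Reader) pairs, a line in this format is easy to read back. Here's a rough sketch that adds each pair to give one total per feed. The function name `feed_totals` is made up for this example and isn't part of the script.

```bash
# Hypothetical helper: for each line of the history file, print the date
# followed by one total (Google Reader + non-Google Reader) per feed.
feed_totals() {
  awk '{
    date = $1
    sub(/:$/, "", date)              # drop the colon after the date
    printf "%s", date
    for (i = 2; i < NF; i += 2)      # counts come in (GR, non-GR) pairs
      printf "  %d", $i + $(i + 1)
    printf "\n"
  }' "$1"
}
```

Called as `feed_totals subscribers.txt`, it would turn the sample line above into `2012-10-13  3304  48`.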

I decided to append the counts to a history file for two reasons:

  1. I was tired of getting a daily email with the counts.
  2. When I did want to look at the counts, I didn’t want just a snapshot of that day’s subscriber counts. I wanted to be able to see how they were changing.
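
Seeing the change from day to day is then a small awk job. This is only a sketch: `gr_changes` is a hypothetical name, and it looks only at the first feed's Google Reader count (the second field on each line).

```bash
# Hypothetical helper: for every day after the first, print the date and
# the change in the first feed's Google Reader count since the prior line.
gr_changes() {
  awk 'NR > 1 { printf "%s %+d\n", $1, $2 - prev }
       { prev = $2 }' "$1"
}
```

Running `gr_changes subscribers.txt` prints one line per day, with a signed difference after the date.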

The history file is kept on the server. If I want it on my local machine, a quick scp will copy it here.
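
That copy might look something like the sketch below. The login, host, and remote path are placeholders, not the real ones, and `fetch_history` is just a name chosen for the example.

```bash
# Placeholder values: substitute your own login, host, and remote path.
REMOTE="user@example.com"
REMOTE_FILE="subscribers.txt"

# Copy the history file from the server into the current directory.
fetch_history() {
  scp "$REMOTE:$REMOTE_FILE" .
}
```

After that, `fetch_history` pulls down the latest copy whenever it's wanted.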