RSS subscriber counting in a sane language

Recently, both Gabe Weatherhead and Marcelo Somers have tried to use this shell script to get subscriber counts for their RSS feeds. And it failed for both of them. The reason it failed (I think) is because of how it divides subscriptions into different types, counts each type, and then adds the types together. On a new site, there may be no subscribers of one or more of these types; in that case, the count for that type comes up as an empty string (I think) rather than a zero. Adding an empty string to a number generates (I think) an error.

You may suspect from the foregoing that I don’t know my ass from a hole in the ground when it comes to shell scripting. And you’d be right. As I mentioned in this post, I typically avoid shell scripts because I find their branching, looping, and arithmetic commands thoroughly opaque. The only reason the RSS subscriber counting was done in a shell script is that Marco Arment had already written it—instead of writing my own from scratch, I just made a few tweaks to his. But the bugs uncovered by Gabe and Marcelo exposed the shallowness of my knowledge. If I ever needed to change the script for my own purposes, I’d be lucky to get it running again.

The need to change the script is not some hazy, far-off hypothetical. Google Reader is shutting down in two months, and one of the three subscriber types identified and counted by the script is based on the lines Google Reader leaves in the Apache access logfile. Whatever substitutes people turn to when Reader shuts down, I’ll need to modify the script to accommodate them.

So a couple of nights ago I rewrote the script from scratch in Python. It was surprisingly easy because even though Marco’s pipelines were incredibly long and subtle in their logic, his comments were clear. All I had to do was translate the comments into Python. (I did make a few small changes to the logic, but nothing worth mentioning. The counts are usually the same.)

Here’s the script:

Update 5/2/13
As originally posted, the script had a couple of extraneous print statements left over from a debugging phase. Sorry about that.

python:
 1:  #!/usr/bin/python
 2:  
 3:  from datetime import date
 4:  from datetime import timedelta
 5:  import re
 6:  
 7:  # Site-specific variables.
 8:  feeds = ["/all-this/feed/", "/all-this/feed/atom/"]
 9:  log = "/path/to/access/log/file"
10:  history = "subscribers.txt"
11:  
12:  # Date strings for yesterday.
13:  yesterday = date.today() - timedelta(days=1)
14:  logdate = yesterday.strftime("%d/%b/%Y")
15:  outdate = yesterday.strftime("%Y-%m-%d")
16:  
17:  # Read the log file into a list. Filter out everything except yesterday.
18:  with open(log) as f:
19:    lines = [ x for x in f.read().splitlines() if logdate in x ]
20:  
21:  
22:  # Compile regexes for finding lines associated with the three types
23:  # of hits:
24:  #
25:  # 1. Google-like hits in which the number of subscribers and a unique
26:  #    feed ID are provided. Use the feed ID as the dictionary key to avoid
27:  #    double counting.
28:  #
29:  # 2. Other hits in which the number of subscribers is reported, but there's
30:  #    no feed ID. Use the User-Agent string up to the subscriber count as
31:  #    the dictionary key to avoid double counting.
32:  #
33:  # 3. Hits with no subscriber counts. Use the IP number as the dictionary key
34:  #    to avoid double counting.
35:  
36:  googlish = re.compile(r'(\d+) subscribers;\s*feed-id=(\d+)', flags=re.I)
37:  noID = re.compile(r'"([^"]+);\s*(\d+) subscribers(?!;\s*feed-id=)', flags=re.I)
38:  iponly = re.compile(r'^([0-9.]+)(?!.*subscribers)', flags=re.I)
39:  
40:  
41:  # Gather subscriber counts for each feed.
42:  counts = {}
43:  for feed in feeds:
44:    getstr = "GET {0} ".format(feed)
45:    feedlines = [x for x in lines if getstr in x ]
46:    feedlines.reverse()         # now in reverse chronological order
47:  
48:    # Google-like.
49:    grsubs = {}
50:    for line in feedlines:
51:      gr = googlish.search(line)
52:      if gr:
53:        if gr.group(2) not in grsubs:         # skip if we've seen it already
54:          grsubs[gr.group(2)] = gr.group(1)
55:  
56:    # With subscribers but no feed ID.
57:    othersubs = {}
58:    for line in feedlines:
59:      other = noID.search(line)
60:      if other:
61:        if other.group(1) not in othersubs:   # skip if we've seen it already
62:          othersubs[other.group(1)] = other.group(2)
63:  
64:    # No subscriber count.
65:    ipsubs = {}
66:    for line in feedlines:
67:      ip = iponly.search(line)
68:      if ip:
69:        if ip.group(1) not in ipsubs:         # skip if we've seen it already
70:          ipsubs[ip.group(1)] = 1
71:  
72:    # Total the subscriber counts for Google and non-Google.
73:    grcounts = sum(map(int, grsubs.values()))
74:    othercounts = sum(map(int, othersubs.values()))
75:    ipcounts = sum(map(int, ipsubs.values()))
76:    counts[feed] = (grcounts, othercounts + ipcounts)
77:  
78:  
79:  # Print the values for the day into the history file.
80:  daystr = [outdate, ":"]
81:  for feed in feeds:
82:    daystr.append("  {0:5d}  {1:5d}".format(*counts[feed]))
83:  daystr.append("\n")
84:  
85:  with open(history, "a") as f:
86:    f.write("".join(daystr))

It’s certainly longer than Marco’s shell script, but Python tends to be more explicit. Also, this script doesn’t have any lines over 200 characters long.
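To see how keying a dictionary on the feed ID prevents double counting, here’s a small sketch of the Google-like case. The pattern is the same one the script uses; the log lines and numbers are invented for illustration:

```python
import re

# The same Google-style pattern used in the script.
googlish = re.compile(r'(\d+) subscribers;\s*feed-id=(\d+)', flags=re.I)

# Two hypothetical hits from the same aggregator on the same day;
# only the reported count differs.
lines = [
    '1.2.3.4 - - "GET /all-this/feed/ HTTP/1.0" 200 "Feedfetcher; 150 subscribers; feed-id=12345"',
    '1.2.3.4 - - "GET /all-this/feed/ HTTP/1.0" 200 "Feedfetcher; 152 subscribers; feed-id=12345"',
]

# Keyed on the feed ID, so the second hit from the same feed is skipped.
grsubs = {}
for line in lines:
    m = googlish.search(line)
    if m and m.group(2) not in grsubs:
        grsubs[m.group(2)] = m.group(1)

total = sum(map(int, grsubs.values()))   # 150, not 302
```

The same trick, with different keys, handles the other two hit types.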

Like my version of the shell script, this script appends the counts to a file that keeps track of the subscriber history, one line per day. The formatting you see in Lines 80–83 was set up to match the existing format of my history file.
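For a concrete sense of what gets appended, here’s the same formatting applied to made-up counts for a single feed (1342 Google-style subscribers, 87 others):

```python
# Hypothetical counts for one feed: (Google-style, everything else).
counts = {"/all-this/feed/": (1342, 87)}

daystr = ["2013-05-01", ":"]
for feed in ["/all-this/feed/"]:
    daystr.append("  {0:5d}  {1:5d}".format(*counts[feed]))
daystr.append("\n")

line = "".join(daystr)   # '2013-05-01:   1342     87\n'
```

The `{0:5d}` fields right-align each count in a five-character column, which keeps the history file's columns lined up from day to day.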

For this script to work, it must be put on your web server and run periodically through some utility like cron. Because it uses with and format, it requires Python 2.6 or later. My web host uses Python 2.4 (!) by default but fortunately has 2.6 available as an alternative.
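If you’re not sure which interpreter cron will pick up, a version guard along these lines (the message wording is mine) fails fast with a readable error. Note that it only helps if it runs before any with statements are compiled, so it belongs in a small wrapper script; under Python 2.4, the with statements in the counting script itself would raise a SyntaxError before any check executed:

```python
import sys

# Fail fast with a clear message if the interpreter is too old:
# str.format() and the bare with statement both arrived in Python 2.6.
if sys.version_info < (2, 6):
    sys.stderr.write("Python 2.6 or later required, found %d.%d\n"
                     % sys.version_info[:2])
    sys.exit(1)
```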

You can expect to see this script modified as we enter a brave new RSS world this summer.