Parsing my Apache logs

I stopped using Google Analytics a couple of months ago. It’s nice to have all that information at my fingertips, but I’m not sure what Google’s tracking code does, and I decided it was presumptuous of me to expose my readers to it without their permission. I’m not especially worried about Google tracking me, but that doesn’t mean you share my carefree attitude. Also, Google Analytics is overkill for an Internet backwater like ANIAT. I don’t sell ads, I don’t have “conversions,” and I don’t need to know how many of my visitors are from Moldova (40 this year before I turned GA off).

But I am still curious about which pages are being read and where the readers are coming from, so I wrote a little script to parse my site’s Apache log file and return the top five pages and referrers for a given day. Along the way, I learned more about Python’s collections library and the groupdict method for regular expression matches. My script owes a lot to this article by Jochen Voss, from which I shamelessly stole the logfile parsing pattern.

The script is called top5log and it’s run this way:

top5log 25 < apache.log

where the argument 25 represents how many days ago the day of interest is (the default is 1, i.e., yesterday), and apache.log is the name of the log file, which the script reads via stdin. The output looks like this:

Jul 03, 2013 pages
    907  /all-this/2013/06/feedle-dee-dee/
    813  /all-this/
    749  /all-this/2013/07/last-i-hope-thoughts-on-rss/
     74  /all-this/2007/03/improved-currency-conversion-card/
     74  /all-this/2009/08/camry-smart-key-battery-replacement/
   7134  total

Jul 03, 2013 referrers

The script is smart enough to ignore “pages” that are actually requests for things like the RSS feed, CSS files, JavaScript files, and image and sound files. Similarly, it filters out uninteresting referrers: links from Google and ANIAT itself.

Here’s the script:

  1:  #!/usr/bin/python
  3:  import re
  4:  import sys
  5:  from datetime import datetime, date, timedelta
  6:  from collections import Counter
  8:  # Define the day of interest in the Apache common log format.
  9:  try:
 10:    daysAgo = int(sys.argv[1])
 11:  except:
 12:    daysAgo = 1
 13:  theDay = date.today() - timedelta(daysAgo)
 14:  apacheDay = theDay.strftime('[%d/%b/%Y:')
 16:  # Regex for the Apache common log format.
 17:  parts = [
 18:      r'(?P<host>\S+)',                   # host %h
 19:      r'\S+',                             # ident %l (unused)
 20:      r'(?P<user>\S+)',                   # user %u
 21:      r'\[(?P<time>.+)\]',                # time %t
 22:      r'"(?P<request>.*)"',               # request "%r"
 23:      r'(?P<status>[0-9]+)',              # status %>s
 24:      r'(?P<size>\S+)',                   # size %b (careful, can be '-')
 25:      r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
 26:      r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
 27:  ]
 28:  pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')
 30:  # Regex for a feed request.
 31:  feed = re.compile(r'/all-this/(\d\d\d\d/\d\d/[^/]+/)?feed/(atom/)?')
 33:  # Change Apache log items into Python types.
 34:  def pythonized(d):
 35:    # Clean up the request.
 36:    d["request"] = d["request"].split()[1]
 38:    # Some dashes become None.
 39:    for k in ("user", "referrer", "agent"):
 40:      if d[k] == "-":
 41:        d[k] = None
 43:    # The size dash becomes 0.
 44:    if d["size"] == "-":
 45:      d["size"] = 0
 46:    else:
 47:      d["size"] = int(d["size"])
 49:    # Convert the timestamp into a datetime object. Accept the server's time zone.
 50:    time, zone = d["time"].split()
 51:    d["time"] = datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")
 53:    return d
 55:  # Is this hit a page?
 56:  def ispage(hit):
 57:    # Failures and redirects.
 58:    hit["status"] = int(hit["status"])
 59:    if hit["status"] < 200 or hit["status"] >= 300:
 60:      return False
 62:    # Feed requests.
 63:    if feed.search(hit["request"]):
 64:      return False
 66:    # Requests that aren't GET.
 67:    if hit["request"][0:3] != "GET":
 68:      return False
 70:    # Images, sounds, etc.
 71:    if hit["request"].split()[1][-1] != '/':
 72:      return False
 74:    # Must be a page.
 75:    return True  
 77:  # Regexes for internal and Google search referrers.
 78:  internal = re.compile(r'https?://(www\.)?leancrew\.com.*')
 79:  google = re.compile(r'https?://(www\.)?google\..*')
 81:  # Is the referrer interesting? Internal and Google referrers are not.
 82:  def goodref(hit):
 83:    if hit['referrer']:
 84:      return not (google.search(hit['referrer']) or
 85:                  internal.search(hit['referrer']))
 86:    else:
 87:      return False
 89:  # Initialize. 
 90:  pages = []
 92:  # Parse all the lines associated with the day of interest.
 93:  for line in sys.stdin:
 94:    if apacheDay in line:
 95:      m = pattern.match(line)
 96:      hit = m.groupdict()
 97:      if ispage(hit):
 98:        pages.append(pythonized(hit))
 99:      else:
100:        continue
102:  # Show the top five pages and the total.
103:  print '%s pages' % theDay.strftime("%b %d, %Y")
104:  pageViews = Counter(x['request'] for x in pages)
105:  top5 = pageViews.most_common(5)
106:  for p in top5:
107:    print "  %5d  %s" % p[::-1]
108:  print "  %5d  total" % len(pages)
110:  # Show the top five referrers.
111:  print '''
112:  %s referrers''' % theDay.strftime("%b %d, %Y")
113:  referrers = Counter(x['referrer'] for x in pages if goodref(x) )
114:  top5 = referrers.most_common(5)
115:  for r in top5:
116:    print "  %5d  %s" % r[::-1]

Lines 9-14 read the first argument to the script, figure out which day I’m interested in, and build the date string for that day as it appears in the Apache log. If there is no argument, or if I give it a non-numeric argument like “fred,” it uses yesterday. I don’t mess around with time zones; the server uses Eastern, I’m in Central, and the one-hour difference is too small to bother with.
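
As a quick check of what that prefix looks like, here’s an interactive session (I’m picking July 28, 2013, as “today” purely for illustration):

    >>> from datetime import date, timedelta
    >>> theDay = date(2013, 7, 28) - timedelta(25)
    >>> theDay.strftime('[%d/%b/%Y:')
    '[03/Jul/2013:'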

Lines 17-28 define and compile the regex for a line of the log file. Apart from changing one + into a *, I took these lines wholesale from Jochen Voss’s post. I’d never used the (?P<name>...) syntax before. It makes the substring captured by the parentheses accessible by name instead of just by position index. Combined with the groupdict method, which we’ll see in Line 96, it lets us create a dictionary of the captured substrings keyed to the names. That’s pretty cool and is the basis for the rest of the script.
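
If the syntax is new to you too, here’s a toy example (pattern and string invented, not taken from the script) showing how the captured pieces come back by name:

    >>> import re
    >>> pat = re.compile(r'(?P<method>\S+) (?P<path>\S+)')
    >>> m = pat.match('GET /all-this/ HTTP/1.1')
    >>> m.group('path')
    '/all-this/'
    >>> m.groupdict()['method']
    'GET'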

The pythonized function defined in Lines 34-53 takes a dictionary of strings captured from a line of the Apache log file and returns a dictionary of Python objects. Dashes in the user, referrer, and agent fields are turned into None, a dash in the size field is turned into zero, and the date string is parsed into a datetime object.
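
Here’s a sketch of pythonized acting on a hand-built dictionary; the host, timestamp, and other values are made up:

    hit = {'host': '', 'user': '-',
           'time': '03/Jul/2013:09:15:00 -0400',
           'request': 'GET /all-this/ HTTP/1.1',
           'status': '200', 'size': '-',
           'referrer': '-', 'agent': '-'}
    hit = pythonized(hit)
    # hit['request'] is now '/all-this/'
    # hit['user'], hit['referrer'], and hit['agent'] are now None
    # hit['size'] is now 0
    # hit['time'] is now datetime(2013, 7, 3, 9, 15)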

The ispage function defined in Lines 56-75 takes a dictionary associated with a line from the Apache log file and returns a Boolean: True if it looks like a hit on a page and False if it doesn’t. Redirects and failures are filtered out in Lines 58-60 by looking at the status field. Requests for my RSS feed are filtered out in Lines 63-64 by using the regex defined on Line 31. Page requests always use the GET method, so Lines 67-68 filter out all non-GETs; these are either my administrative interactions with the blog through the WordPress interface or hacking attempts by others. All the blog page URLs end with a slash, so requests that don’t are excluded by Lines 71-72. This filters out hits on image, sound, CSS, and JavaScript files.
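
To see the filters in action, here are some made-up hits run through ispage. The fake helper is hypothetical, just enough to build the fields ispage looks at:

    def fake(request, status='200'):
        # Hypothetical test helper, not part of the script.
        return {'request': request, 'status': status}

    ispage(fake('GET /all-this/2013/07/some-post/ HTTP/1.1'))  # True
    ispage(fake('GET /all-this/feed/ HTTP/1.1'))           # False: feed request
    ispage(fake('POST /all-this/wp-login.php HTTP/1.1'))   # False: not a GET
    ispage(fake('GET /all-this/css/styles.css HTTP/1.1'))  # False: no trailing slash
    ispage(fake('GET /all-this/ HTTP/1.1', '301'))         # False: redirect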

The goodref function defined in Lines 82-87 also takes a dictionary associated with a line from the Apache log file and returns a Boolean. It uses the regexes defined in Lines 78-79 to hunt for referrals from Google and the site itself. These referrers, and referrers of None, return False; all others return True.
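
A few invented referrers show the idea:

    goodref({'referrer': 'http://daringfireball.net/linked/2013/07/whatever'})
    # True: an outside link worth counting
    goodref({'referrer': ''})
    # False: a Google search
    goodref({'referrer': 'http://www.leancrew.com/all-this/'})
    # False: an internal link
    goodref({'referrer': None})
    # False: no referrer at all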

With everything defined up front, we move into the main body of the script. Lines 93-100 loop through all the lines of the log file. Line 94 checks if the line represents the day of interest; because there’s no else clause, other days are ignored.

Line 95 does the pattern match on a line of the log file, and Line 96¹ turns that match into a dictionary called hit with groupdict. If hit represents a page, it gets added to the pages list in Line 98.

Line 104 creates a Counter collection called pageViews from the pages list. This is basically a dictionary of page counts keyed on the URLs, but it reduces all the usual looping and incrementing into a single creation command. Line 105 then uses the Counter’s most_common method to pull out the five items with the highest counts. The results are printed in Lines 106-108; because most_common returns (page, count) tuples and I want the count printed first, Line 107 reverses each tuple with the [::-1] slice.
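
If you haven’t played with Counter before, a quick session with toy data shows how much busywork it eliminates:

    >>> from collections import Counter
    >>> views = Counter(['/a/', '/b/', '/a/', '/c/', '/a/', '/b/'])
    >>> views.most_common(2)
    [('/a/', 3), ('/b/', 2)]
    >>> ('/a/', 3)[::-1]
    (3, '/a/')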

Lines 113-114 do basically the same thing as Lines 104-105, but with referrer URLs instead of page URLs.

I have top5log sitting on the server with its executable bit set. Whenever I feel the need to check, I log in through ssh and run it. I suppose I should set up a cron job to run it every day and append the output to a file.
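
Something like this crontab entry would probably do it, though I haven’t set it up; the paths are placeholders:

    # Run just after midnight so the default "yesterday" covers a full day.
    5 0 * * * /path/to/top5log < /path/to/apache.log >> /path/to/top5.txt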

  1. Remember Line 96? This is a song about Line 96.