More Apache log parsing

At the end of this post about my Apache log parsing script, I wrote

I have top5log sitting on the server with its executable bit set. Whenever I feel the need to check, I log in through ssh and run it. I suppose I should set up a cron job to run it every day and append the output to a file.

Three months later, I finally got around to doing just that, but in the process I changed the script a bit and realized that the best way to run it was through a shell script.

The parsing script, which I’ve renamed tophits, now looks like this:

python:
#!/usr/bin/python3

import re
import sys
from datetime import datetime, date, timedelta
from collections import Counter
import getopt

# Get the number of pages to return. Default is 5.
count = 5
opts, args = getopt.getopt(sys.argv[1:], 'c:')
for o, a in opts:
  if o == '-c':
    count = int(a)

# Define the day of interest in the Apache common log format.
# Default is yesterday.
try:
  daysAgo = int(args[0])
except (IndexError, ValueError):
  daysAgo = 1
theDay = date.today() - timedelta(daysAgo)
apacheDay = theDay.strftime('[%d/%b/%Y:')

# Regex for the Apache common log format.
parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

# Regex for a feed request.
feed = re.compile(r'/all-this/(\d\d\d\d/\d\d/[^/]+/)?feed/(atom/)?')

# Change Apache log items into Python types.
def pythonized(d):
  # Clean up the request: keep just the path.
  d["request"] = d["request"].split()[1]

  # Some dashes become None.
  for k in ("user", "referrer", "agent"):
    if d[k] == "-":
      d[k] = None

  # The size dash becomes 0.
  if d["size"] == "-":
    d["size"] = 0
  else:
    d["size"] = int(d["size"])

  # Convert the timestamp into a datetime object. Accept the server's time zone.
  time, zone = d["time"].split()
  d["time"] = datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")

  return d

# Is this hit a page?
def ispage(hit):
  # Failures and redirects.
  hit["status"] = int(hit["status"])
  if hit["status"] < 200 or hit["status"] >= 300:
    return False

  # Feed requests.
  if feed.search(hit["request"]):
    return False

  # Requests that aren't GET.
  if hit["request"][0:3] != "GET":
    return False

  # Images, sounds, etc.
  if hit["request"].split()[1][-1] != '/':
    return False

  # Must be a page.
  return True

# Regexes for internal and Google search referrers.
internal = re.compile(r'https?://(www\.)?leancrew\.com.*')
google = re.compile(r'https?://(www\.)?google\..*')

# Is the referrer interesting? Internal and Google referrers are not.
# def goodref(hit):
#   if hit['referrer']:
#     return not (google.search(hit['referrer']) or
#                 internal.search(hit['referrer']))
#   else:
#     return False

# Initialize.
pages = []

# Parse all the lines associated with the day of interest.
for line in sys.stdin:
  if apacheDay in line:
    m = pattern.match(line)
    if m:
      hit = m.groupdict()
      if ispage(hit):
        pages.append(pythonized(hit))

# Show the top pages and the total.
print('%s pages' % theDay.strftime("%b %d, %Y"))
pageViews = Counter(x['request'] for x in pages)
top5 = pageViews.most_common(count)
for p in top5:
  print("  %5d  %s" % p[::-1])
print("  %5d  total" % len(pages))
print('')


# Show the top five referrers.
# print('''
# %s referrers''' % theDay.strftime("%b %d, %Y"))
# referrers = Counter(x['referrer'] for x in pages if goodref(x) )
# top5 = referrers.most_common(5)
# for r in top5:
#   print("  %5d  %s" % r[::-1])

The only significant differences between this and the original version are the new -c option, which sets the number of pages reported (the default is still five), and the commenting out of the sections that collected and printed the referrers.

These differences aren’t enough to justify a full re-explanation of the script. The important stuff is explained in the original post.

Under normal circumstances, to append the top ten pages from the previous day to a file, you’d run the script this way:

tophits -c 10 < path/to/apache.log >> toppages.txt

(When the day-count argument to tophits is omitted, as it is here, it defaults to 1, which represents “yesterday.”)
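So, for example, to get the top ten pages from two days ago instead, I’d run it this way, with path/to/apache.log again standing in for the real log file:

tophits -c 10 2 < path/to/apache.log >> toppages.txt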

This works fine most of the time and is how I first had my cron job set up. However, my web host rotates its clients’ Apache log files every month, compressing the current log file, saving the compressed version to another directory, and erasing the original to start collecting the next month’s worth of data. Because this process happens in the middle of a day, part of that day’s entries go off into the compressed file. Running tophits as above the day after a rotation will count only the pageviews after the rotation—the pageviews before the rotation are in the compressed file.

To solve this problem, we need to unzip the compressed file and run tophits on the concatenation of it with the running log file. For my system, that’ll look like this:

gunzip -c old-logs/leancrew.com-Oct-2013.gz | cat - logs/leancrew.com | tophits -c 10 >> toppages.txt

Because my web host uses gzip to compress the log files, gunzip is the command to uncompress them. The -c option to gunzip tells it to send the results to standard output instead of a file. My host uses domain-month-year.gz as its file naming format for compressed logs, so the example is uncompressing last month’s log file. This is then piped to cat, which concatenates the current log file, leancrew.com, to standard input, represented by -. The combined output is then piped to tophits.

Update 11/2/13
Reader Cameron Higby-Naquin suggested using the zcat command to combine the gunzip and cat parts into a single step. At first, I thought this wasn’t possible, because I thought zcat was simply a shorter way of doing gunzip -c, and so I’d still need the cat later on to tack on the uncompressed file. But, as Cameron pointed out, giving zcat the -f option forces it to pass through uncompressed files without change. So the pipeline above can be simplified to

zcat -f old-logs/leancrew.com-Oct-2013.gz logs/leancrew.com | tophits -c 10 >> toppages.txt

and it works perfectly. So, thanks to Cameron, I’ve edited the script below to take advantage of this simplification.
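If you’ve never used -f, you can see the pass-through behavior with a couple of files of your own, one compressed and one not (the file names here are just placeholders):

zcat -f notes.txt       # uncompressed file is passed through unchanged
zcat -f notes.txt.gz    # compressed file is expanded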

How do we know which way to invoke tophits? The log files get rotated at the end of each month, on either the last day or second-to-last day of the month. We can look in the old-logs directory to see if there’s a compressed log associated with yesterday’s month. If there is, run tophits the second way; if there isn’t, run it the first way.

Here’s the shell script that does it:

bash:
1:  #!/bin/bash
2:  
3:  gfile=`date --date=yesterday "+old-logs/leancrew.com-%b-%Y.gz"`
4:  if [ -f "$gfile" ]
5:    then zcat -f "$gfile" logs/leancrew.com | ./tophits -c 10 >> toppages.txt
6:    else ./tophits -c 10 < logs/leancrew.com >> toppages.txt
7:  fi

Line 3 uses date to construct the name of the compressed log file that may or may not exist. The --date=yesterday option is a feature of the GNU version of date, which is what my web host runs. It’s equivalent to -v-1d for the version of date you’ll find on your Mac.
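If you wanted to run this script on your Mac, Line 3 would be something like

gfile=`date -v-1d "+old-logs/leancrew.com-%b-%Y.gz"`

with the rest of the script unchanged.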

Line 4 tests for the existence of the compressed log file and runs the appropriate command depending on the result of the test.

There’s a bit of inefficiency in this script. If the log file is rotated on the second-to-last day of the month, the longer command will be run twice: once on the last day of the month and then again on the first day of the next month. In that case, the unzipping done on the first of the month is unnecessary because all the data for the previous day will be in the running log file. Squeezing out that inefficiency isn’t worth the time I’d put into it, so it’ll have to stay.

I have the shell script saved as log-yesterday and a cron job set to run it every morning. Now I have a toppages.txt file with entries for each day that look like this:

Oct 25, 2013 pages
   8123  /all-this/2013/10/mixed-applescript-signals/
   2557  /all-this/2013/10/tweetbot-3-and-sign-errors/
    919  /all-this/
    492  /all-this/2013/10/quick-switch-to-big-finder-icons/
    292  /all-this/2011/02/iphone-notes-app-comparison/
    156  /all-this/2013/10/iphone-miscellany/
    133  /all-this/2013/10/whose-in-applescript/
    126  /all-this/2013/10/the-small-improvement-in-iphone-battery-capacity/
    121  /all-this/2013/10/more-on-location-based-reminders/
    111  /all-this/2013/10/reminders-for-reminders/
  21358  total
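
The cron job itself is nothing fancy. Here’s a sketch of the crontab entry, with a made-up time and directory; the cd is there because log-yesterday refers to the log directories through relative paths:

30 6 * * * cd /home/drdrang/www && ./log-yesterday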

What will I do with this? I’m sure I’ll think of something. In the meantime, it was fun figuring out how to handle the rotated log problem.