Parsing my Apache logs
July 28, 2013 at 10:29 PM by Dr. Drang
I stopped using Google Analytics a couple of months ago. It’s nice to have all that information at my fingertips, but I’m not sure what Google’s tracking code does, and I decided it was presumptuous of me to expose my readers to it without their permission. I’m not especially worried about Google tracking me, but that doesn’t mean you share my carefree attitude. Also, Google Analytics is overkill for an Internet backwater like ANIAT. I don’t sell ads, I don’t have “conversions,” and I don’t need to know how many of my visitors are from Moldova (40 this year before I turned GA off).
But I am still curious about which pages are being read and where the readers are coming from, so I wrote a little script to parse my site’s Apache log file and return the top five pages and referrers for a given day. Along the way, I learned more about Python’s collections library and the groupdict method for regular expression matches. My script owes a lot to this article by Jochen Voss, from which I shamelessly stole the logfile parsing pattern.
The script is called top5log and it’s run this way:
top5log 25 < apache.log
where the argument 25 represents how many days ago the day of interest is (the default is 1, i.e., yesterday), and apache.log is the name of the log file, which the script reads via stdin. The output looks like this:
Jul 03, 2013 pages
907 /all-this/2013/06/feedle-dee-dee/
813 /all-this/
749 /all-this/2013/07/last-i-hope-thoughts-on-rss/
74 /all-this/2007/03/improved-currency-conversion-card/
74 /all-this/2009/08/camry-smart-key-battery-replacement/
7134 total
Jul 03, 2013 referrers
111 http://t.co/eWcQISolqP
45 http://twitterrific.com/referrer#iPhone
24 http://cloud.feedly.com/
15 http://www.marco.org/
11 http://twitterrific.com/referrer#iPad
The script is smart enough to ignore “pages” that are actually requests for things like the RSS feed, CSS files, JavaScript files, and image and sound files. Similarly, it filters out uninteresting referrers: links from Google and ANIAT itself.
Here’s the script:
python:
1: #!/usr/bin/python
2:
3: import re
4: import sys
5: from datetime import datetime, date, timedelta
6: from collections import Counter
7:
8: # Define the day of interest in the Apache common log format.
9: try:
10: daysAgo = int(sys.argv[1])
11: except:
12: daysAgo = 1
13: theDay = date.today() - timedelta(daysAgo)
14: apacheDay = theDay.strftime('[%d/%b/%Y:')
15:
16: # Regex for the Apache common log format.
17: parts = [
18: r'(?P<host>\S+)', # host %h
19: r'\S+', # ident %l (unused)
20: r'(?P<user>\S+)', # user %u
21: r'\[(?P<time>.+)\]', # time %t
22: r'"(?P<request>.*)"', # request "%r"
23: r'(?P<status>[0-9]+)', # status %>s
24: r'(?P<size>\S+)', # size %b (careful, can be '-')
25: r'"(?P<referrer>.*)"', # referrer "%{Referer}i"
26: r'"(?P<agent>.*)"', # user agent "%{User-agent}i"
27: ]
28: pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')
29:
30: # Regex for a feed request.
31: feed = re.compile(r'/all-this/(\d\d\d\d/\d\d/[^/]+/)?feed/(atom/)?')
32:
33: # Change Apache log items into Python types.
34: def pythonized(d):
35: # Clean up the request.
36: d["request"] = d["request"].split()[1]
37:
38: # Some dashes become None.
39: for k in ("user", "referrer", "agent"):
40: if d[k] == "-":
41: d[k] = None
42:
43: # The size dash becomes 0.
44: if d["size"] == "-":
45: d["size"] = 0
46: else:
47: d["size"] = int(d["size"])
48:
49: # Convert the timestamp into a datetime object. Accept the server's time zone.
50: time, zone = d["time"].split()
51: d["time"] = datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")
52:
53: return d
54:
55: # Is this hit a page?
56: def ispage(hit):
57: # Failures and redirects.
58: hit["status"] = int(hit["status"])
59: if hit["status"] < 200 or hit["status"] >= 300:
60: return False
61:
62: # Feed requests.
63: if feed.search(hit["request"]):
64: return False
65:
66: # Requests that aren't GET.
67: if hit["request"][0:3] != "GET":
68: return False
69:
70: # Images, sounds, etc.
71: if hit["request"].split()[1][-1] != '/':
72: return False
73:
74: # Must be a page.
75: return True
76:
77: # Regexes for internal and Google search referrers.
78: internal = re.compile(r'https?://(www\.)?leancrew\.com.*')
79: google = re.compile(r'https?://(www\.)?google\..*')
80:
81: # Is the referrer interesting? Internal and Google referrers are not.
82: def goodref(hit):
83: if hit['referrer']:
84: return not (google.search(hit['referrer']) or
85: internal.search(hit['referrer']))
86: else:
87: return False
88:
89: # Initialize.
90: pages = []
91:
92: # Parse all the lines associated with the day of interest.
93: for line in sys.stdin:
94: if apacheDay in line:
95: m = pattern.match(line)
96: hit = m.groupdict()
97: if ispage(hit):
98: pages.append(pythonized(hit))
99: else:
100: continue
101:
102: # Show the top five pages and the total.
103: print '%s pages' % theDay.strftime("%b %d, %Y")
104: pageViews = Counter(x['request'] for x in pages)
105: top5 = pageViews.most_common(5)
106: for p in top5:
107: print " %5d %s" % p[::-1]
108: print " %5d total" % len(pages)
109:
110: # Show the top five referrers.
111: print '''
112: %s referrers''' % theDay.strftime("%b %d, %Y")
113: referrers = Counter(x['referrer'] for x in pages if goodref(x) )
114: top5 = referrers.most_common(5)
115: for r in top5:
116: print " %5d %s" % r[::-1]
Lines 9-14 read the first argument to the script and figure out which day I’m interested in and the Apache log format for that day. If there is no argument, or if I give it a non-numeric argument like “fred,” it uses yesterday. I don’t mess around with time zones; the server uses Eastern, I’m in Central, and the one hour difference is too small to bother with.
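For example, run on July 28 with an argument of 25, Lines 13–14 work out to the prefix that matches the sample output above. A minimal sketch, with a hard-coded date standing in for date.today() so it’s reproducible:
python:
from datetime import date, timedelta

# Stand-in for date.today() so the example is repeatable.
theDay = date(2013, 7, 28) - timedelta(25)
print theDay.strftime('[%d/%b/%Y:')    # [03/Jul/2013: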
Lines 17-28 define and compile the regex for a line of the log file. Apart from changing one + into a *, I took these lines wholesale from Jochen Voss’s post. I’ve never used the (?P<name>...) syntax before. It makes the substring captured by the parentheses accessible by name instead of just by position index. When used with the groupdict method, which we’ll see in Line 96, we can create a dictionary of the captured substrings keyed to the names. That’s pretty cool and is the basis for the rest of the script.
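Here’s a toy example of named groups and groupdict, using a much simpler pattern than the script’s and a made-up log line:
python:
import re

# A cut-down pattern with three named groups; the log line is invented.
pat = re.compile(r'(?P<host>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)"')
m = pat.match('93.184.216.34 [03/Jul/2013:21:15:22 -0400] "GET /all-this/ HTTP/1.1"')
print m.groupdict()['host']       # 93.184.216.34
print m.groupdict()['request']    # GET /all-this/ HTTP/1.1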
The pythonized function defined in Lines 34-53 takes a dictionary of strings captured from a line of the Apache log file and returns a dictionary of Python objects. Dashes in the user, referrer, and agent fields are turned into None, a dash in the size field is turned into zero, and the date string is parsed into a datetime object.
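The timestamp handling in Lines 50-51 boils down to this sketch, using a sample Apache timestamp; the zone offset is split off and thrown away:
python:
from datetime import datetime

# The offset is kept only to be discarded.
time, zone = '03/Jul/2013:21:15:22 -0400'.split()
print datetime.strptime(time, '%d/%b/%Y:%H:%M:%S')    # 2013-07-03 21:15:22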
The ispage function defined in Lines 56-75 takes a dictionary associated with a line from the Apache log file and returns a Boolean: True if it looks like a hit on a page and False if it doesn’t. Redirects and failures are filtered out in Lines 58-60 by looking at the status field. Requests for my RSS feed are filtered out in Lines 63-64 by using the regex defined on Line 31. Page requests always use the GET method, so Lines 67-68 filter out all non-GETs; these are either my administrative interactions with the blog through the WordPress interface or hacking attempts by others. All the blog page URLs end with a slash, so requests that don’t are excluded by Lines 71-72. This filters out hits on image, sound, CSS, and JavaScript files.
The goodref function defined in Lines 82-87 also takes a dictionary associated with a line from the Apache log file and returns a Boolean. It uses the regexes defined in Lines 78-79 to hunt for referrals from Google and the site itself. These referrers, and referrers of None, will return False; all others will return True.
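A quick check of those filters on some representative referrers (the URLs here are just examples, not real log entries):
python:
import re

internal = re.compile(r'https?://(www\.)?leancrew\.com.*')
google = re.compile(r'https?://(www\.)?google\..*')
for ref in ('http://www.google.com/search?q=smart+key',
            'http://leancrew.com/all-this/',
            'http://www.marco.org/',
            None):
    print ref, bool(ref) and not (google.search(ref) or internal.search(ref))
# Only the marco.org referrer counts as interesting.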
With everything defined up front, we move into the main body of the script. Lines 93-100 loop through all the lines of the log file. Line 94 checks if the line represents the day of interest; because there’s no else clause, other days are ignored.
Line 95 does the pattern match on a line of the log file, and Line 96¹ turns that match into a dictionary called hit with groupdict. If hit represents a page, it gets added to the pages list in Line 98.
Line 104 creates a Counter collection called pageViews from the pages list. This is basically a dictionary of page counts keyed on the URLs, but it reduces all the usual looping and incrementing into a single creation command. Line 105 then uses the Counter’s most_common method to pull out the five items with the highest counts. The results are printed in Lines 106-108.
Lines 113-114 do basically the same thing as Lines 104-105, but with referrer URLs instead of page URLs.
I have top5log sitting on the server with its executable bit set. Whenever I feel the need to check, I log in through ssh and run it. I suppose I should set up a cron job to run it every day and append the output to a file.
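Something like this crontab entry would do it, with the paths obviously standing in for the real ones (untested):
5 0 * * * /path/to/top5log < /path/to/apache.log >> /path/to/top5.txt
Run just after midnight with no argument, the default of 1 day ago means each run reports on the day that just ended.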
¹ Remember Line 96? This is a song about Line 96. ↩