Local archive of WordPress posts

I wouldn’t be surprised to find myself jumping on the baked blog bandwagon. Everybody’s doing it. If I do, the first thing I’ll need is a collection of all my previous posts, downloaded to my local machine and saved in a directory structure that matches the URL structure WordPress has been presenting to the outside world. This evening, I did just that.

There were two main reasons it went so quickly and so smoothly:

  1. My posts have always been saved as Markdown on the server, so I didn’t have to go through the pain that Gabe went through when he converted from WordPress to Pelican last year. When ANIAT was served by Blosxom, the Markdown was saved in files on the server and was converted to HTML on the fly by a CGI script. When it was served by Movable Type, the Markdown was saved in a database on the server and converted into static HTML pages. Currently, the Markdown is saved in a database on the server and converted to HTML on the fly by WordPress’s PHP engine (and the first time a page is accessed, it’s cached as a static file in some compressed format).
  2. I’d already written a script, get-post, that got the saved content of a post from the server through the XML-RPC MetaWeblog API. All I needed to do was extend it to collect a set of posts and to save each one in the appropriate directory.

The semi-new script that does the collection and saving is called save-posts. It’s run this way:

save-posts 1 1000

This command downloads posts 1 through 1000 in turn and saves each one in a folder with a name like all-this/source/2013/01 in my Dropbox folder. Within each month folder, the individual Markdown source files are saved with names like geofencing-in-flickr.md, where the basename is the slug of the post as it was created and served by WordPress. This hierarchy and naming scheme will ensure that the individual posts created by the static blog will have the same URLs as their WP counterparts, so old links should continue to work.
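
To see how a post's link turns into a local path, here's the scheme in miniature, using the same regular expression save-posts uses. The domain is a stand-in for my real one:

import re, os

pathRE = re.compile(r'(/\d\d\d\d/\d\d/).+/$')
link = 'http://example.com/all-this/2013/01/geofencing-in-flickr/'
slug = 'geofencing-in-flickr'

partialPath = pathRE.search(link).group(1)    # '/2013/01/'
print os.environ['HOME'] + '/Dropbox/all-this/source' + partialPath + slug + '.md'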

Here’s the Python source of save-posts:

python:
#!/usr/bin/python

import xmlrpclib
import sys
import os
from datetime import datetime
import pytz
import re

# Blog parameters (url, user, pw) are stored in ~/.blogrc.
# One parameter per line, with name and value separated by colon-space.
p = {}
with open(os.environ['HOME'] + '/.blogrc') as bloginfo:
  for line in bloginfo:
    k, v = line.split(': ', 1)
    p[k] = v.strip()

# The header fields and their metaWeblog synonyms.
hFields = [ 'Title', 'Keywords', 'Date', 'Post',
            'Slug', 'Link', 'Status', 'Comments' ]
wpFields = [ 'title', 'mt_keywords', 'date_created_gmt', 'postid',
             'wp_slug', 'link', 'post_status', 'mt_allow_comments' ]
h2wp = dict(zip(hFields, wpFields))

# Regex for extracting the year/month part of the save path
# from the post's link.
pathRE = re.compile(r'(/\d\d\d\d/\d\d/).+/$')

# Get the range of post IDs from the command line.
try:
  startID = int(sys.argv[1])
  endID = int(sys.argv[2])
except (IndexError, ValueError):
  sys.exit('usage: save-posts startID endID')

# Time zones. WP is trustworthy only in UTC.
utc = pytz.utc
myTZ = pytz.timezone('US/Central')

# Connect.
blog = xmlrpclib.Server(p['url'])

# Save the posts in header/body format.
for i in range(startID, endID + 1):
  try:
    post = blog.metaWeblog.getPost(i, p['user'], p['pw'])
    header = ''
    for f in hFields:
      if f == 'Date':
        # Change the date from UTC to local and from DateTime to string.
        dt = datetime.strptime(post[h2wp[f]].value, "%Y%m%dT%H:%M:%S")
        dt = utc.localize(dt).astimezone(myTZ)
        header += "%s: %s\n" % (f, dt.strftime("%Y-%m-%d %H:%M:%S"))
      else:
        header += "%s: %s\n" % (f, post[h2wp[f]])

    contents = header.encode('utf8') + '\n'
    contents += post['description'].encode('utf8') + '\n'

    print post['link']
    partialPath = pathRE.search(post['link']).group(1)
    fullPath = os.environ['HOME'] + '/Dropbox/all-this/source' + partialPath
    try:
      os.makedirs(fullPath)
    except OSError:
      pass    # the directory already exists
    postFile = open(fullPath + post['wp_slug'] + '.md', 'w')
    postFile.write(contents)
    postFile.close()
  except:
    pass    # skip IDs that don't correspond to real posts

As I said, most of this is explained in the post that describes the get-post script. The new stuff is the loop through the range of post IDs and the last chunk of the loop body, where the year/month subdirectory is pulled out of the post's link and created locally if it doesn't already exist, and the Markdown file is written into it with the slug as its basename.
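
By the way, the ~/.blogrc file the script reads is nothing more than three name and value pairs, one per line. Something like this, with placeholders standing in for the real values:

url: http://example.com/all-this/xmlrpc.php
user: myusername
pw: mypassword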

As the files were saved, I noticed several posts with names like 666-revision and 999-autosave scrolling past. These are obviously weird partial posts that shouldn't be in the archive. After testing to make sure no real posts would be deleted, I cd'd into the source directory and issued these commands:

find ./ -name "*-revision*" -exec rm {} \;
find ./ -name "*-autosave*" -exec rm {} \;

That got rid of the oddballs.
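
If I ever rerun save-posts, it would be cleaner to keep the junk out in the first place. Something like this just after the getPost call inside the loop ought to work; the pattern is a guess based on the slug names I saw scroll by:

if re.search(r'-(revision|autosave)', post['wp_slug']):
  continue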

Mind you, I’m still not ready to switch to a baked blog. But I’m readier than I was yesterday.

Update 1/15/13
While I appreciate the suggestions (here and on Twitter) for a static blogging engine, I’ll probably write my own. The attraction to me of a baked blog is less the speed of web serving—I don’t get that much traffic, and caching has worked well for me—than the complete control over the site. When you use someone else’s system, you have to work with, or around, their decisions, and if I’m going to do that I’ll just stick with WordPress.

Do not, by the way, expect the switch, if it happens at all, to happen any time soon. Even guys as talented as Gabe, Brett, and Justin took a long time to switch their sites to a static system. And if I never switch, I’ll still consider this exercise to have been worth the modest time I put into it. Having a local archive of the blog will still be useful for searching and other offline work.