Automatically generated blog posts from Twitter

Today I started generating automated posts for the blog from my Twitter stream. The program that does it is a pretty short Python script that leverages a few libraries not included in the standard Python distribution. I’ll point out where you can get the libraries as I describe the script.

First, some motivation for the script. For a few months now, I’ve been including my recent Twitter posts in the sidebar over at the right. I followed Twitter’s directions and used the HTML/JavaScript method so I could style the tweets with CSS and match the look of the site. It’s easy, but it does tend to put the tweets in a ghetto. They’re out of sync with the regular posts and don’t get archived. Since many of my tweets would have been tiny blog posts in the pre-Twitter days, I thought they should be added like regular posts.

On the other hand, I didn’t want every tweet to have its own post. A daily collection of tweets seemed like the best compromise. My goal was to write a program that would run every day, gathering all the tweets of the previous day and putting them into a single post, timestamped and separated by blank lines. For example, here’s the first one.

I called the program that does it twitterpost, and it uses four nonstandard libraries:

Obviously, this script won’t work with a blogging engine other than WordPress, but I think it could be easily adapted. Since wordpresslib is a relatively thin wrapper around the standard Python xmlrpclib library, it wouldn’t surprise me to find that there are similar libraries customized for Movable Type, Blogger, etc.

Here’s the source code for twitterpost:

 1:  #!/usr/bin/python
 2:  
 3:  import twitter
 4:  from datetime import datetime, timedelta
 5:  import pytz
 6:  import wordpresslib
 7:  import sys
 8:  import re
 9:  
10:  # Parameters.
11:  tname = 'drdrang'                   # the Twitter username
12:  chrono = True                       # should tweets be in chronological order?
13:  replies = False                     # should Replies be included in the post?
14:  burl = 'http://www.leancrew.com/all-this/xmlrpc.php'    # the blog xmlrpc URL
15:  bid = 0                             # the blog id
16:  bname = 'drdrang'                   # the blog username
17:  bpword = 'qua5hi8aj6re4h'       # the blog password (no, that's not my real password)
18:  bcat = 'personal'                   # the blog category
19:  tz = pytz.timezone('US/Central')    # the blog timezone
20:  utc = pytz.timezone('UTC')
21:  
22:  # Get the starting and ending times for the range of tweets we want to collect.
23:  # Since this is supposed to collect "yesterday's" tweets, we need to go back
24:  # to 12:00 am of the previous day. For example, if today is Thursday, we want
25:  # to start at the midnight that divides Tuesday and Wednesday. All of this is
26:  # in local time.
27:  yesterday = datetime.now(tz) - timedelta(days=1)
28:  starttime = yesterday.replace(hour=0, minute=0, second=0, microsecond=0)
29:  endtime = starttime + timedelta(days=1)
30:  
31:  # Create a regular expression object for detecting URLs in the body of a tweet.
32:  # Adapted from
33:  # http://immike.net/blog/2007/04/06/5-regular-expressions-every-web-programmer-should-know/
34:  url = re.compile(r'''(https?://[-\w]+(\.\w[-\w]*)+(/[^.!,?;"'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;"'<>()\[\]\{\}\s\x7F-\xFF]+)*)?)''', re.I)
35:  
36:  ##### Twitter interaction #####
37:  
38:  # Get all the available tweets from the given user. By default, they're in
39:  # reverse chronological order.  The 'since' parameter of GetUserTimeline could
40:  # be used to limit the collection to recent tweets, but because tweets are kept
41:  # in UTC, and we want them filtered according to local time, we'd have to
42:  # filter them again anyway. So it doesn't seem worth the bother.
43:  api = twitter.Api()
44:  statuses = api.GetUserTimeline(user=tname)
45:  
46:  if chrono:
47:      statuses.reverse()
48:  
49:  # Collect every tweet and its timestamp in the desired time range into a list.
50:  # The Twitter API returns a tweet's posting time as a string like
51:  # "Sun Oct 19 20:14:40 +0000 2008". Convert that string into a timezone-aware
52:  # datetime object, then convert it to local time. Filter according to the
53:  # start and end times.
54:  tweets = []
55:  for s in statuses:
56:      posted_text = s.GetCreatedAt()
57:      posted_utc = datetime.strptime(posted_text, '%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=utc)
58:      posted_local = posted_utc.astimezone(tz)
59:      if (posted_local >= starttime) and (posted_local < endtime):
60:          timestamp = posted_local.strftime('%I:%M %p').lower()
61:          body = url.sub(r'<\1>', s.GetText())
62:          if body[0] == '#':
63:              body = ' ' + body
64:          if replies or body[0] != '@':
65:              if timestamp[0] == '0':
66:                  timestamp = timestamp[1:]
67:              tweet = '**%s**  \n%s\n' % (timestamp, body)
68:              tweets.append(tweet)
69:  
70:  # Obviously, we can quit if there were no tweets.
71:  if len(tweets) == 0:
72:      print 0
73:      sys.exit()
74:  
75:  # A line for the end directing readers to the post with this program.
76:  lastline = """
77:  
78:  *This post was generated automatically using the script described [here](http://www.leancrew.com/all-this/2008/10/automatically-generated-blog-posts-from-twitter/).*
79:  """
80:  
81:  # Uncomment the following 2 lines to see the output w/o making a blog post.
82:  # print '\n'.join(tweets) + lastline
83:  # sys.exit()
84:  
85:  ##### Blog interaction #####
86:  
87:  # Connect to the blog.
88:  blog = wordpresslib.WordPressClient(burl, bname, bpword)
89:  blog.selectBlog(bid)
90:  
91:  # Create the info we're going to post.
92:  post = wordpresslib.WordPressPost()
93:  post.title = 'Tweets for %s' % yesterday.strftime('%B %d, %Y')
94:  post.description = '\n'.join(tweets) + lastline
95:  post.categories = (blog.getCategoryIdFromName(bcat),)
96:  
97:  # And post it.
98:  newpost = blog.newPost(post, True)
99:  print newpost

Lines 10–20 are parameters that define the Twitter stream and the blog. Most of them are the usual username/password/URL stuff, but a couple deserve further comment. The Twitter API spits out the posts in reverse chronological order, just as they appear at twitter.com. But I wanted the day’s tweets to come out in the order I wrote them because they’re easier to read that way. The chrono parameter on Line 12 allows the program to print the tweets in either order. The replies parameter on Line 13 allows the user to include or exclude @ replies in the post. Since my @ replies tend to be more like instant messages than like micro blog posts, I chose to exclude them.
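As a rough illustration of what those two flags do, here is a minimal standalone sketch in Python 3. It is not part of twitterpost: the function name select_tweets and the sample timeline are made up for the example, and it works on plain strings rather than on the status objects the Twitter library returns.

```python
def select_tweets(texts, chrono=True, replies=False):
    """Order and filter tweet texts according to the chrono and replies flags.

    texts is assumed to arrive newest-first, the way the timeline does.
    """
    ordered = list(reversed(texts)) if chrono else list(texts)
    # Keep everything when replies is True; otherwise drop tweets starting with '@'.
    return [t for t in ordered if replies or not t.startswith('@')]


# Hypothetical sample timeline, newest first.
timeline = ['Third thought.', '@someone Sounds good.', 'First thought.']
print(select_tweets(timeline))
```

With the defaults, the @ reply is dropped and the remaining tweets come out oldest-first.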

The start and end times for the collection of tweets are defined in Lines 27–29, and the comment block above those lines describes what the lines do. If you wanted to define “a day” as going from, say, 2:00 am to the following 2:00 am instead of from midnight to midnight, you could change the hour=0 parameter in Line 28 to hour=2.
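The window calculation can be tried on its own. The sketch below is Python 3 and makes two assumptions that differ from the script: it takes "now" as an argument so the result is repeatable, and it uses a fixed UTC-5 offset from the standard library as a stand-in for the pytz US/Central zone (so it knows nothing about Daylight Saving Time). The names day_window and CDT are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for the blog's timezone (assumption: UTC-5, no DST rules).
CDT = timezone(timedelta(hours=-5), 'CDT')


def day_window(now, start_hour=0):
    """Return (start, end) of the one-day period that began yesterday at start_hour."""
    yesterday = now - timedelta(days=1)
    start = yesterday.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)


now = datetime(2008, 10, 23, 6, 30, tzinfo=CDT)  # a Thursday morning
start, end = day_window(now)
print(start.isoformat(), end.isoformat())
```

Calling day_window(now, start_hour=2) shifts both ends of the window to 2:00 am, which is the change described above.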

Line 36 starts the interaction with Twitter. The comments are pretty detailed, mainly because I wanted to be absolutely clear to myself how I was handling the conversion from Twitter’s UTC timestamps to my own US/Central timestamps. This is the section where the standard Python datetime library and the nonstandard pytz library got a workout.
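To see the timestamp conversion in isolation, here is a small Python 3 sketch that parses the sample string from the script's comments and shifts it to local time. It uses the same strptime format as Line 57, but swaps pytz for a fixed UTC-5 offset from the standard library, which is only an assumption that happens to match Central Daylight Time on that date. The helper name to_local is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset standing in for pytz's US/Central; it has no DST rules.
CDT = timezone(timedelta(hours=-5), 'CDT')


def to_local(created_at, local_tz):
    """Parse a Twitter-style UTC timestamp string and convert it to local_tz."""
    posted_utc = datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
    posted_utc = posted_utc.replace(tzinfo=timezone.utc)  # mark the naive value as UTC
    return posted_utc.astimezone(local_tz)


local = to_local('Sun Oct 19 20:14:40 +0000 2008', CDT)
print(local.isoformat())
```

The parsed value is 20:14:40 UTC, so the local result is 15:14:40 with a -05:00 offset.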

Line 85 starts the interaction with WordPress. The steps are basically:

  1. connect to the blog;
  2. make a post with the Twitter information; and
  3. post.

When the post is successful, the variable newpost gets its ID number, which I have the script print out for debugging purposes. I’m not sure what the value will be when the post fails, as I haven’t had a failure yet.
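For a feel of what travels over the wire, the Python 3 sketch below builds an XML-RPC request for a new post without sending it anywhere. Treat the details as assumptions: I haven't confirmed which XML-RPC method wordpresslib calls, so metaWeblog.newPost and its (blog id, username, password, content, publish) argument order are used here as a plausible stand-in, and build_new_post_request is a made-up helper. Only xmlrpc.client from the standard library (the Python 3 name for xmlrpclib) is used.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, body, categories):
    """Serialize an assumed metaWeblog.newPost call to XML without contacting a server."""
    content = {
        'title': title,
        'description': body,            # the post body
        'categories': list(categories),
    }
    params = (blog_id, username, password, content, True)  # True = publish immediately
    return xmlrpc.client.dumps(params, methodname='metaWeblog.newPost')


xml = build_new_post_request(
    0, 'user', 'not-a-real-password',
    'Tweets for October 20, 2008', 'Hello, world.', ['personal'])

# Decode the request again to check what it contains.
params, method = xmlrpc.client.loads(xml)
print(method, params[3]['title'])
```

Sending the request for real would go through xmlrpc.client.ServerProxy pointed at the blog's xmlrpc.php URL; that part is left out because it needs a live blog and real credentials.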

Since twitterpost doesn’t need any human interaction, it’s a great candidate for automated command schedulers like cron or launchd. I’m setting it up to run every day in the early morning.

This is a new program and has been running for only a day, so I wouldn’t be surprised to find bugs cropping up as it gets used. I’ll fix the source code above as I find them.

Update (10/21/08)
A few things I should have mentioned:

  1. I use Markdown to format my posts, which is why Line 67 is written the way it is. The double asterisks make the timestamp bold, and the two spaces after the timestamp create a line break. If my posts were in HTML, Line 67 would have to include all the necessary tags.
  2. There’s nothing magical about collecting a day’s worth of tweets; you could easily change it to a week’s worth by changing the days=1 parameters in Lines 27 and 29 to days=7.
  3. URLs in tweets show up as regular text and are not clickable. I’ll probably add some code after Line 56 to search for URLs in body and turn them into automatic links.
  4. The combination of the datetime and pytz libraries handles Daylight Saving Time in the conversion between local time and UTC. Right now, US/Central is 5 hours behind UTC; that should change to 6 hours early next month. Because the dates on which countries move to and from DST change from year to year, automatic time conversions are surprisingly hard to do. I’ll keep an eye on the times to see if it still works after we switch back to Standard Time.
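The Markdown formatting from the first item can be checked with a short Python 3 sketch. It reuses the format string from Line 67 and the leading-zero trimming from Lines 65–66; the function name format_tweet and the sample time are invented for the example, and the lowercase "pm" assumes an English or C locale for strftime.

```python
from datetime import datetime


def format_tweet(posted_local, body):
    """Return one tweet as Markdown: bold timestamp, hard line break, then the text."""
    timestamp = posted_local.strftime('%I:%M %p').lower()
    if timestamp[0] == '0':
        timestamp = timestamp[1:]   # "03:14 pm" becomes "3:14 pm"
    # Two spaces before the newline are Markdown's line break.
    return '**%s**  \n%s\n' % (timestamp, body)


entry = format_tweet(datetime(2008, 10, 19, 15, 14), 'Hello, world.')
print(repr(entry))
```

Printing the repr makes the two trailing spaces after the bold timestamp visible.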

Update (10/22/08)
I’ve added URL detection to the program and adjusted (I hope) all the line numbers referred to in the original post and in the previous update. The frightening regular expression on Line 34 was adapted from one given by Mike Malone, which was, in turn, adapted from one given by Jeffrey Friedl in his well-known regular expressions book. Line 61 puts angle brackets (<>) around any detected URL, which Markdown turns into a link.
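The substitution is easy to try outside the script. This Python 3 sketch copies the regular expression from Line 34 unchanged and applies the same replacement as Line 61; the wrapper name link_urls and the sample sentence are made up for the example.

```python
import re

# The pattern from Line 34 of the script, copied verbatim.
url = re.compile(r'''(https?://[-\w]+(\.\w[-\w]*)+(/[^.!,?;"'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;"'<>()\[\]\{\}\s\x7F-\xFF]+)*)?)''', re.I)


def link_urls(text):
    """Wrap each detected URL in angle brackets so Markdown turns it into a link."""
    return url.sub(r'<\1>', text)


sample = 'The script is at http://www.leancrew.com/all-this/ if you want it.'
print(link_urls(sample))
# The script is at <http://www.leancrew.com/all-this/> if you want it.
```

Punctuation that ends a sentence is left outside the brackets, so a URL followed by a period still links correctly.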

Update (10/26/08)
I’ve removed the tzinfo=tz parameter in the argument list to replace in Line 28. For some reason I haven’t figured out, it was screwing up the Daylight Saving Time setting for yesterday. As a result, tweets entered between 12:00 am and 1:00 am were being collected with those of the previous day. Now it seems to work correctly, although it remains to be seen if it will still work when we go back to Central Standard Time next weekend.

Update (11/5/08)
Lines 75–79 add a link to the end of the post that points to this page. Lines 62–63 stick a space before any tweet that starts with a hash (#) so Markdown doesn’t wrap the tweet in <h1></h1> tags. I think I’ve fixed all the line number references in the post to reflect the latest version of the program.
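The hash fix amounts to a one-line guard. Here is a minimal Python 3 sketch of it, with a hypothetical function name (guard_leading_hash) and sample tweets; it uses startswith rather than indexing body[0], which is a small substitution of mine that also tolerates an empty string.

```python
def guard_leading_hash(body):
    """Prefix a space when the tweet starts with '#', leaving other tweets alone."""
    return ' ' + body if body.startswith('#') else body


print(repr(guard_leading_hash('#python is fun')))
print(repr(guard_leading_hash('I like #python')))
```

Only the tweet that begins with the hash is changed; a hash later in the text is untouched.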
