December 11, 2013 at 12:26 AM by Dr. Drang
When I checked my Twitter feed Sunday morning, this was waiting for me.
Python script to parse HTML and gather your latest blog post! Could use advice! wrappedthoughts.com/home/2013/12/8… cc: @drdrang @viticci @ismh
— Dain Miller (@cdainmiller) Sun Dec 8 2013 1:42 AM CST
I don’t know anything about Editorial and I don’t know Dain Miller, but I figured it couldn’t hurt to look at his script, which is meant to return the URL of his blog’s most recent post. Last year I wrote a script that did essentially the same thing, so if his blog was running on WordPress or some other XML-RPC-based system, I could send him a link to my post. Using Python’s
xmlrpclib module would be a big improvement over the scrape-and-search technique Dain was using.
I did a View Source on his blog home page and learned that it was run by Squarespace, a company whose name you might recognize if you’ve listened to any computery podcast produced in the past couple of years. I don’t believe Squarespace uses XML-RPC, but it does have an API, which I learned through the clever trick of Googling “Squarespace API.” After looking at the return value for the
format=json-pretty request, I wrote this quick script,
python: 1: #!/usr/bin/python 2: 3: import json, urllib2 4: 5: r = urllib2.urlopen('http://wrappedthoughts.com/?format=json-pretty').read() 6: j = json.loads(r) 7: print j['items']['urlId']
put it in a Gist, and sent the link to Dain. It wasn’t everything he wanted, but I figured he could work out the rest. Which he did.
Honestly, though, I can’t say that I like what he came up with. It’s not that it has any errors, or that it’s inefficient or fragile. It’s just not to my taste, and at the risk of offending Dain, I’d like to talk about why.
Here’s the script:
python: 1: # Editorial Module 2: import workflow 3: 4: class Squarespace(): 5: 6: def __init__(self, blogBaseUrl): 7: import json, urllib2 8: 9: self.blogBaseUrl = blogBaseUrl 10: self.response = urllib2.urlopen(self.blogBaseUrl + '?format=json-pretty').read() 11: self.jsonResponse = json.loads(self.response) 12: 13: def gatherLatestBlogPostUrl(self): 14: path = self.jsonResponse['items']['urlId'] 15: url = self.blogBaseUrl + path 16: return url 17: 18: def gatherLatestBlogTitle(self): 19: title = self.jsonResponse['items']['title'] 20: return title 21: 22: # Interface: 23: if __name__ == "__main__": 24: squarespace = Squarespace('http://wrappedthoughts.com/') 25: 26: workflow.set_output(squarespace.gatherLatestBlogPostUrl()) 27: workflow.set_output(squarespace.gatherLatestBlogTitle())
Obviously, it’s a lot longer than my script, but that’s not why I don’t like it. I don’t like it because it follows a trend I see in lots of scripts: it uses object orientation for no particularly good reason.
I’m sure that part of my distaste is generational. There was no object-oriented programming when I took my first programming courses, and it’s never been a technique that I’ve made much use of. Don’t get me wrong—I love using well-built object-oriented modules, but I seldom feel the need to define classes in my own scripts.
I get the distinct sense, though, that young programmers believe that all serious programming must be object-oriented, that anything else is just hackwork. Which is unfortunate, because this mindset leads to colossal wastes of time.
In this case, for example, the Python
json module already does an excellent job of turning the output of the
format=json-pretty call into a dictionary. And a dictionary is a perfectly good object whose elements are easily addressable. Is
really that much more readable (to a programmer) than something like
baseURL + response['items']['urlId']
Isn’t it clear that this is getting the URL of the first item of the response? As long as you know that blogs display the most recent item first, you know that this is what you want.
As important, the dictionary provides a way to get at every piece of information Squarespace returns. How many posts can we get at?
When was the most recent item updated?
What time zone does the blog use?
These aren’t the prettiest looking inquiries, but they are easy to read and interpret.1 If you don’t like all the square brackets and quotation marks, you could turn the dictionary into a nested object using one of the techniques described on this Stack Overflow page. That way you could, for example, use something like
baseURL + resp_obj.items.urlId
to return the full URL of the most recent blog entry.
The syntax, though, isn’t what’s important. What’s important is that the Squarespace API is already providing a well-structured object in its JSON response and Python has a built-in data type that fits the structure perfectly. The script will be simpler and more flexible if it just uses what it’s been given. Creating a new class to access only a small portion of the response—and to use that small portion only once—doesn’t make the script faster, more robust, or easier to read.
I’m not arguing for a return to the pre-object days. In longer programs, and in reusable modules, object-oriented programming can bring great simplicity and clarity to a project. But there’s a time and a place for everything, and a one-off script like Dain’s is neither the time nor the place.
I assume that Dain and all the other programmers who write scripts this same way have been taught to do so. They haven’t been served well by their teachers. They’d be more productive—and we’d be getting the benefit of their talents—if they’d been taught how to design programs that fit the problem to be solved.
OK, it wasn’t obvious that the update time was being given in microseconds, but it wasn’t hard to figure out. ↩︎