September 25, 2011 at 10:52 PM by Dr. Drang
When I checked my email yesterday morning, the daily message with the status of my family’s library check-outs and holds was missing. On many days this is the only useful email I get, so I spent some time last night fixing it. It now uses the twill library to interact with the library website.
I’ve written about the script, called checkcards, that generates this daily email several times over the past couple of years. In a nutshell, it scrapes our local library’s web site to get the items checked out and on hold for every member of our family, then sends my wife and me an email with all the information presented compactly in a pair of tables.
The value of this email is that it lets us plan our trips to the library more efficiently. When an item comes due and needs to be returned, we can look at the message and make sure we return the other items that are nearly due in the same trip. We do the same thing when we go to pick up something on hold.
The problem with scraping a web site to gather information is that small changes to the page can screw up the scraping, and that’s exactly what happened yesterday. As far as I could tell, the web site had no visible differences from the day before, but the script suddenly failed, throwing an error message when it tried to process the login page.
Worse, I couldn’t diagnose the problem. I was using the mechanize library to fill in the fields on the login page, and the error messages indicated that the page had no <form> section. This was, of course, nonsense. I could clearly see the form when I opened the login page in my browser. I also downloaded the HTML of the page, and there was the <form> tag, big as day. But for some reason, possibly because the page’s HTML is convoluted and ill-formed, mechanize refused to recognize it.
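A saved copy of a page can also be checked for forms without involving mechanize at all. The following is a minimal sketch using the lenient parser in Python's standard library; the FormFinder class, the find_forms helper, and the sample markup are all invented for illustration and are not part of checkcards or the library's real login page.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect the attributes of every <form> start tag in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and tolerates sloppy markup.
        if tag == "form":
            self.forms.append(dict(attrs))


def find_forms(html):
    """Return a list of attribute dicts, one per <form> tag found."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.forms


# Hypothetical, deliberately ill-formed page (unclosed <td> and <p>).
sample = """
<html><body><table><tr><td>
<p>Please log in
<FORM name="login" action="/login" method="post">
<input name="card"><input name="pin" type="password">
</form>
</body></html>
"""

forms = find_forms(sample)
print(len(forms), forms[0].get("name"))
```

A check like this only confirms that the tag is present in the downloaded HTML; it says nothing about why a stricter parser might miss it.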
Or maybe I should say that I couldn’t get mechanize to recognize it, because although I’ve been using the library for over two years, I can’t say that I have a deep understanding of how it works. This is, I suspect, a common occurrence among programmers, especially amateur ones like me. I have a problem to solve, I go looking for a tool to help me solve it, I scan the documentation and the sample code to see if it’s a good fit, and then I decide which library to go with and start hacking.
Sometimes the library’s capabilities and my experience are a perfect match and the use of the library is crystal clear. Sometimes—and this was the case with mechanize—I’m in a light fog. Half of what I’m doing makes sense, but the other half is just pasted in from the sample code, and I don’t fully understand what it does or why it’s needed. If I go on to use the library in more scripts, I’ll usually figure out what everything does; but when it’s a one-time use, I never quite get it.
So I needed a new way to do the web interaction. I thought about moving to Todd Ditchendorf’s Fake, but that would have been a huge change, shifting from a command-line utility that runs via launchd to a GUI. And I had a bunch of code that still worked that I didn’t want to abandon.
I found twill, a command-line utility for interacting with web pages that also comes with a Python API. Twill has a simple, easy-to-understand set of commands, including
go <url>, for going to a page
follow <link name>, for following a link
formvalue <n> <field> <value>, for entering a value in the named field of the nth form
submit, for submitting the form you just filled
show, for showing the HTML of the current page
There are other commands, of course, but these are the ones I need.
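To make those commands concrete, here is a rough sketch that writes out a twill script for one card using only the five commands listed above. The URL, the link name, the form number, and the field names card and pin are placeholders invented for illustration, not the real ones from the library's site. Quoting arguments that contain spaces is an assumption about how the script should be written, so test it interactively before relying on it.

```python
def twill_script(login_url, card, pin, link_name):
    """Return the text of a twill script that logs in and shows one page."""

    def q(value):
        # Assumption: arguments containing whitespace need double quotes.
        return '"%s"' % value if " " in value else value

    lines = [
        "go %s" % q(login_url),
        # "card" and "pin" are hypothetical field names in the first form.
        "formvalue 1 card %s" % q(card),
        "formvalue 1 pin %s" % q(pin),
        "submit",
        "follow %s" % q(link_name),
        "show",
    ]
    return "\n".join(lines) + "\n"


script = twill_script(
    "https://library.example.org/login", "12345", "0000", "Items Out"
)
print(script)
```

Generating the script as text keeps the sketch free of any dependency on twill itself; the result is meant to be fed to twill's command-line tool or pasted into an interactive session.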
I had one worry about twill: it’s a wrapper around mechanize, the library I was having trouble with. Fortunately, twill is easy to install and easy to test from the command line. I was able to prove to myself through an interactive session that it would work—apparently twill knows more about mechanize than I do.
In pseudocode, checkcards looks like this:
    for each library card
        go to the login page and login*
        go to the items out page and get the page's HTML*
        extract the info for each item out and add to a list
        go to the holds page and get the page's HTML*
        extract the info for each hold and add to a list
        logout*
    generate an HTML skeleton for the message
    add the checked-out items to one table
    add the holds to another table
    send the HTML message to standard output
        (which is piped to sendmail by a shell script)
It’s only the starred parts that had to be switched out, which turned out to be really easy. I spent more time trying and failing to get mechanize to work than I did actually getting twill to work.
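The shape of that pseudocode can be sketched in Python with the starred steps isolated behind a single function, which is what makes swapping one web library for another cheap. This is not the actual checkcards code: every name here (table, build_message, fetch, and the fake stand-ins) is hypothetical, and the column headings and sample data are made up so the sketch runs without touching a real site.

```python
from html import escape


def table(title, headers, rows):
    """Build one HTML table with a heading row."""
    head = "".join("<th>%s</th>" % escape(h) for h in headers)
    body = "".join(
        "<tr>%s</tr>" % "".join("<td>%s</td>" % escape(str(c)) for c in row)
        for row in rows
    )
    return "<h2>%s</h2><table><tr>%s</tr>%s</table>" % (escape(title), head, body)


def build_message(cards, fetch, parse_out, parse_holds):
    """Gather items for every card, then build the HTML message.

    fetch(card) stands in for the starred steps (login, get both pages,
    logout) and must return an (items_out_html, holds_html) pair.
    """
    out, holds = [], []
    for card in cards:
        out_html, holds_html = fetch(card)
        out.extend(parse_out(card, out_html))
        holds.extend(parse_holds(card, holds_html))
    return "<html><body>%s%s</body></html>" % (
        table("Checked out", ["Due", "Card", "Title"], out),
        table("On hold", ["Status", "Card", "Title"], holds),
    )


# Stand-ins for the web interaction and the page scraping.
def fake_fetch(card):
    return ("2011-10-01|Dune", "Ready|Emma")


def fake_parse(card, html):
    first, title = html.split("|")
    return [(first, card, title)]


message = build_message(["Dad", "Mom"], fake_fetch, fake_parse, fake_parse)
print(message)  # a shell script could pipe this to sendmail
```

With this arrangement, only the function passed as fetch needs to know which library talks to the web site; the parsing and table-building code never changes.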
The code is in this GitHub repository. As I’ve said before, unless you live in Naperville, Illinois, checkcards isn’t going to be much help to you, at least not directly. But you may find that you can do something similar with your library’s web site. Some of my favorite scripts were inspired by work that others did, but which I couldn’t use in their original form.