Feed reader robustification

I had a bit of a shock this afternoon when I opened my RSS feed reader to see if anything was new.

Feed table of contents

Not much new, but a lot that’s old. Over 1400 posts from Kieran Healy, holder of the Krzyzewski Chair in Sociological R at the second-best basketball university in North Carolina and author of a much-anticipated forthcoming book on how to make good graphs.

What happened? I don’t know for sure, but something in Kieran’s site generation software decided to include every post he’s written in his blog’s RSS feed. It’s an impressive body of work, going back to 2002, but I didn’t have time during my lunch hour to read it all.

My homemade feed reader works like this. For every site I subscribe to, it

  1. reads the RSS (or JSON) feed;
  2. checks each article against a SQLite database of articles I’ve already read; and
  3. adds the article to a list if it’s unread.

After going through all the subscriptions, the script sorts the unread articles in alphabetical order and arranges them in a static HTML page on my server, adding a table of contents to the top of the page. The script runs via a cron job a few times an hour from 6:00 am until midnight.
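The check-against-the-database part of that workflow can be sketched in a few lines of Python. This is not the actual script, just a minimal illustration: the table name `read_articles`, the function names, and the idea that each article is already parsed into a dict with a `guid` key are all assumptions made for the example.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the database of read articles.
    The table layout is hypothetical: one row per GUID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS read_articles (guid TEXT PRIMARY KEY)"
    )
    return conn

def mark_read(conn, guid):
    """Record a GUID as read; marking it twice is harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO read_articles (guid) VALUES (?)", (guid,)
    )
    conn.commit()

def unread(conn, articles):
    """Return the articles whose GUIDs are not in the database.
    Each article is assumed to be a dict with a 'guid' key."""
    result = []
    for art in articles:
        row = conn.execute(
            "SELECT 1 FROM read_articles WHERE guid = ?", (art["guid"],)
        ).fetchone()
        if row is None:
            result.append(art)
    return result
```

Because the lookup is keyed on the GUID alone, an article the database has never seen under that GUID counts as unread, no matter how old it is.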

So many of Kieran’s posts appeared today because my database of read posts is relatively young and only the last dozen or so of his articles are in it. It was all the earlier ones that were on my feed reader page.

This is my fault, not Kieran’s. I knew perfectly well when I wrote my script that blogging software will sometimes regenerate its feed with all new GUIDs for each article. When this happens, it makes the articles look new to the feed reader. I’d seen this happen even back when I was using professionally written feed reading apps. What made this especially troublesome for my definitely-not-professionally-written feed reading system was that it’s not equipped with a “Mark all as read” button. Which gave me four choices:

  1. Do the programming to add a “Mark all as read” button, something I will almost never use.
  2. Go through and individually mark all 1400 old posts as read so they get entered into the database and don’t appear again. Fat chance.
  3. Figure out another way to add all these posts to the database.
  4. Change my feed reading script to just ignore articles that are more than a few days old, regardless of whether they’re in the database.

I chose #4 because it was the quickest to implement and should protect me against this kind of thing happening again. Kieran’s older posts disappeared from my feed reading page, and my blog reading went back to normal. Afterward, though, I realized that I could have implemented #3 in combination with #4, ignoring the older articles for the purposes of assembling the feed reading page but adding them to the database of read articles to give me added protection against seeing them pop up again.
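Here is a rough sketch of how combining #3 and #4 might look in Python. It is not the finished script. The three-day cutoff (standing in for "a few days"), the `read_articles` table, the `triage` function name, and the assumption that each article is a dict with a `guid` and a timezone-aware `published` datetime are all invented for the example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=3)  # assumed value for "a few days"

def triage(conn, articles, now=None):
    """Return articles fresh enough to show; quietly record stale,
    unseen ones in the database so they can't resurface later."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS read_articles (guid TEXT PRIMARY KEY)"
    )
    now = now or datetime.now(timezone.utc)
    cutoff = now - MAX_AGE
    fresh = []
    for art in articles:
        seen = conn.execute(
            "SELECT 1 FROM read_articles WHERE guid = ?", (art["guid"],)
        ).fetchone()
        if seen:
            continue  # already read; nothing to do
        if art["published"] < cutoff:
            # Choice #3: too old to display, so mark it read directly.
            conn.execute(
                "INSERT OR IGNORE INTO read_articles (guid) VALUES (?)",
                (art["guid"],),
            )
        else:
            # Choice #4: only recent, unread articles reach the page.
            fresh.append(art)
    conn.commit()
    return fresh
```

With this arrangement, a regenerated feed full of old posts would add rows to the database but put nothing on the page. Recent articles stay unread until they are marked as read in the usual way.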

I’ll try to get that working in the next day or two and then post the script in its final form. I doubt that many people really want to set up their own feed reading system, but you never know.