More RSS mess

When last we spoke, I was saying that although there have been some benefits to the more varied feed reading environment that grew after Google shut down Reader, we’ve also lost some valuable services. The one I want to talk about today is the near-instantaneous updating of cached feeds. This is primarily a loss for publishers, but readers have suffered, too.

When a feed-syncing service like Feedbin, Feedly, Feed Wrangler, etc. reads your RSS feed,[1] it takes note of the new articles that have appeared since the last time it looked and saves a copy of the new stuff on its server. When one of its users asks to see unread posts, the service doesn’t go back to your site for the feed; it sends the user what it copied. This is what distinguishes subscribing through a service from subscribing directly, which is what we all did in the old days. Before there were syncing services, an app like NetNewsWire would poll all your subscribed sites, looking for new articles. It would collect whatever was unread at that time and present it to you.

The old way became annoying as we started using more devices. If we read an article on Device A and then moved to Device B, the feed reader on B wouldn’t know about the history on A and would show that same article as unread again. We could mark it as read, but when we later switched to Device C, it would pop up as unread yet again. The advantage of using a service is that the service knows what you have and haven’t read and knows what to deliver no matter which device you’re using.

The disadvantage of using a service arises when articles get updated after publication. The service delivers whatever it copied at the time it polled the site, which, by the time we launch our feed reader, could be hours or days out of date. We as readers are at the mercy of our service and the schedule it uses to update its copies.

In fact, the service may never update its copy. Each item in an RSS feed has a globally unique identifier, a string that’s not shared with any other RSS item from that site or any other site. The guid is typically not changed when the associated article is updated because it’s still the same article. Each item also has a publication date, which is also typically not changed when the article is updated.[2] So the only way a service can be sure to have the most recent version of an item is to redownload it and change the copy on its server every time it polls the feed. This kind of defeats the purpose of making a cached copy, and some services don’t do it. They continue to serve up whatever version of the article was current when they first polled the feed.
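To make the problem concrete, here’s a minimal Python sketch of how a service could notice a changed item even though its guid and pubDate stay the same: fingerprint each item’s content on every poll and compare against the cached fingerprint. This is an illustration only, not how any particular service works; the function names, the sample feed, and the choice to hash the title and description are all my assumptions.

```python
import hashlib
import xml.etree.ElementTree as ET


def item_fingerprints(rss_xml):
    """Map each item's guid to a SHA-256 hash of its title and description."""
    root = ET.fromstring(rss_xml)
    prints = {}
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None:
            continue  # Without a guid there is nothing stable to key on.
        body = (item.findtext("title") or "") + "\n" + (item.findtext("description") or "")
        prints[guid] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return prints


def changed_guids(cached, fresh):
    """Return guids seen in both polls whose content hash differs."""
    return sorted(g for g in fresh if g in cached and cached[g] != fresh[g])


# Hypothetical feed: same guid and pubDate in both polls, different text.
FEED = """<rss version="2.0"><channel>
<item><guid>{guid}</guid><title>Test</title>
<pubDate>Mon, 10 Nov 2014 12:00:00 GMT</pubDate>
<description>{text}</description></item>
</channel></rss>"""

old = item_fingerprints(FEED.format(guid="post-1", text="Original text"))
new = item_fingerprints(FEED.format(guid="post-1", text="Corrected text"))
print(changed_guids(old, new))
```

The catch is the one described above: the service still has to download and parse the whole feed on every poll to compute the fresh fingerprints.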

How did Google handle updates when it was running Reader? Being Google, it solved the problem two ways: one through brute force and the other through cleverness.

The brute force solution was exactly what you’d expect of a company that indexes the entire internet. It rechecked every feed several times a day and updated all of its cached feed entries. Generally speaking, this meant that the articles it delivered through Reader were seldom more than an hour out of date. This is pretty good, but Google wanted more.

The clever solution was to develop a new protocol, PubSubHubbub, for handling feed updates. With this protocol, a PubSubHubbub server, called a hub, sat between publishers and Google Reader. Whenever an article was published or updated, the publisher could ping the hub to let it know there was something new. The hub, which Google controlled, would then tell Google Reader about the update, and Reader’s cache of the item would be changed right then and there. With PubSubHubbub, Google Reader was seldom more than a few seconds out of date.

The downside to PubSubHubbub was that it required the cooperation of the publisher. The publisher had to add an element to its feed and had to ping the hub whenever something changed. This wasn’t hard (for my money, Nathan Grigg provided the best explanation of what to do), but many publishers didn’t do it, which is why Google continued to use the brute force method as a backstop.
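For a sense of how small the publisher’s side is, here’s a rough Python sketch of the ping. It assumes the convention used by the PubSubHubbub hubs I know of: a form-encoded POST to the hub with `hub.mode=publish` and `hub.url` set to the feed’s address. The hub and feed URLs are placeholders, and the function names are mine; this is not Nathan’s code.

```python
import urllib.parse
import urllib.request


def build_publish_ping(hub_url, feed_url):
    """Build (but do not send) the POST that tells a hub a feed has changed."""
    data = urllib.parse.urlencode(
        {"hub.mode": "publish", "hub.url": feed_url}
    ).encode("ascii")
    return urllib.request.Request(
        hub_url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ping_hub(hub_url, feed_url):
    """Send the ping and return the hub's HTTP status code."""
    request = build_publish_ping(hub_url, feed_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Placeholder URLs; substitute your own hub and feed before calling ping_hub.
request = build_publish_ping("https://hub.example.com/", "https://example.com/feed.xml")
print(request.data.decode("ascii"))
```

The other half of the job, the element added to the feed, is a link with `rel="hub"` pointing at the hub’s URL, so subscribers know where to ask for updates.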

From a publisher’s point of view, Google Reader’s updating system was great, especially if you made the additions needed for PubSubHubbub. You knew that when you corrected a mistake in a blog post, it was the corrected version that people would see in their feed readers from that point on. This is, unfortunately, no longer the case. I’ve had people point out errors on Twitter that I had corrected a day earlier. Their syncing service was providing them with an old cached version that had never been updated.

It looks like Google’s hub is still operating, but I have no idea whether any of the feed syncing services use it. But because PubSubHubbub is an open protocol, anyone can provide hubs. The best known hub outside of Google is Superfeedr, and some of the syncing services do use it. Last week, I ran a small experiment to see which services update the articles they deliver and how quickly they respond to changes.

In hindsight, I would say that the experiment wasn’t very well run. Because it was going on during work hours, I had other things to do and couldn’t give it my fullest attention. Sometimes hours would go by between my checks on what the services were delivering, and I often forgot to clear my browser cache before checking. So my notes (which are in the post itself) on how quickly feeds were updated are not reliable. But I did learn some things:

  1. Feed Wrangler doesn’t use Superfeedr, and I see no evidence that it updates items after they’re cached. When I visit Feed Wrangler’s web site today, it’s still showing the original version of my test post, with no updates at all.
  2. When I started the test, Feedbin was not using Superfeedr to monitor my feed, but a day later it was. Despite that, I never saw any update to the post in Feedbin during the two days of the test. But if I check Feedbin now, it’ll show the fully updated post. To me, Feedbin is the most frustrating of the services. More people read ANIAT through Feedbin than any of the other services, but I have no idea what the best way is to ensure its users get the most recent versions of my posts. (Update 2014/11/19 11:59 am: Feedly is what most of my readers use, not Feedbin. Too many similar names.)
  3. Feedly and BazQux both use Superfeedr, and they were equally good at staying up to date. I can’t say how good that was because of my own experimental errors, but I do know that they update via both Superfeedr and brute force. I’ve configured both ReadKit and Reeder to use BazQux as my syncing service.
  4. NewsBlur is sui generis. It does not use Superfeedr, opting instead for a feed and site polling system of its own devising that seems to account for both the popularity of a site and how frequently it gets updated. You may recall from my last post that my server logs showed a NewsBlur hit on my DST post that included an absurdly high subscriber count of 15,879 in the user agent string. I found that through a grep search of the Apache log file that piped the lines through sed to filter out everything except the subscriber count. Looking at the full log entry,

        - - [10/Nov/2014:23:49:59 -0500] "GET /all-this/2013/03/why-i-like-dst/ HTTP/1.1" 200 6451 "-" "NewsBlur Page Fetcher - 15879 subscribers - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)"

    it’s clear that that’s the number of people who use NewsBlur to subscribe to Daring Fireball. Presumably, NewsBlur visited my DST post because John Gruber linked to it last week, and following links from popular blogs is part of NewsBlur’s updating system. I don’t pretend to understand why.

    Generally speaking, NewsBlur was behind Feedly and BazQux in updating its version of my test post.

  5. I didn’t test FeedHQ, mainly because I’d never heard of it, but it does use Superfeedr, so it ought to be fairly good at keeping up to date.
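The log search described in item 4 was done with grep and sed. As an illustration of the same idea in Python (a swapped-in technique, not the command I actually ran), here’s a sketch that pulls the requested path and the subscriber count out of a log line. The regular expression and function name are assumptions that fit the entry quoted above; other fetchers may format their user agent strings differently.

```python
import re

# Matches the request path and the "N subscribers" count in the user agent
# of an Apache combined-format log line.
SUBSCRIBERS = re.compile(
    r'"GET (\S+) [^"]*" \d+ \S+ "[^"]*" "[^"]*?(\d+) subscribers'
)


def subscriber_hit(log_line):
    """Return (path, subscriber_count) for a matching line, else None."""
    match = SUBSCRIBERS.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# The entry quoted in item 4, with the host field elided as it is there.
line = (
    '- - [10/Nov/2014:23:49:59 -0500] '
    '"GET /all-this/2013/03/why-i-like-dst/ HTTP/1.1" 200 6451 "-" '
    '"NewsBlur Page Fetcher - 15879 subscribers - (Mozilla/5.0 '
    '(Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 '
    '(KHTML, like Gecko) Version/5.1 Safari/534.48.3)"'
)
print(subscriber_hit(line))
```

Returning the path alongside the count is what makes an odd number like this one traceable: it shows which page the fetcher was after, not just how many subscribers it claimed.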

If I were to run this experiment again—which I’d only do if I were stuck in bed for a few days with the flu—I’d fix the mistakes I made this time, and I’d add a few more services, like FeedHQ, The Old Reader, and Inoreader. I might even try the self-hosted service, Stringer.

In calling the RSS situation a mess, I’m trying to be more descriptive than derogatory. Before Google shut down Reader, there was order imposed through a popular, centralized system. When that order was lost, what was left is a mess. We may prefer the mess because we don’t want a centralized system,[3] but we shouldn’t pretend that it’s not a mess. Some of the advantages of Reader vanished along with it. If more of the syncing services recognized the value Google provided to both readers and publishers, I’m sure they could offer features that replace what we’ve lost.

  1. At the risk of offending Atom partisans, I’m going to lump all feed formats together and call them RSS. I know this isn’t technically correct—and the mix of formats is another part of the mess—but it’s common parlance and it’s a convenient shorthand. 

  2. Atom feeds don’t have a pubDate element; they have an updated element, which should tell the service when an item has been changed. RSS 2 feeds have pubDate but not updated. I told you it was a mess. (Updated 2014/11/19 11:59 am: Atom has an optional published element for the initial publication date and time. Thanks to ttepasse for the tip.)

  3. When Dave Winer created RSS on the fourth day, he didn’t intend it to be centralized.