XSLT, of course

A couple of days ago, I stole Jason Snell’s idea of using Yahoo! Pipes to filter the 5by5 After Dark podcast feed to limit it to just the After Darks of shows I listen to. In the first comment on the post, Gabe (Macdrifter) asked whether it was a good idea to rely on Yahoo! for this kind of service.

For reasons I can’t fully explain, I have a little more faith in Yahoo! than almost every other sentient being on the planet. Even though I can’t think of a single thing of value the Yahooers get out of Pipes, I don’t think they’ll be shutting it down anytime soon. And even if they did, losing a slightly more convenient form of a podcast feed wouldn’t drive me to the depths of despair.

But the notion of creating and hosting my own filtered feed was appealing. I asked Twitter for examples of existing libraries or frameworks that I could use for that purpose. I got a few answers, but none that I’d feel comfortable using. Using someone’s half-finished implementation in a language I only half understand does not appeal.

Tonight, though, I realized that a standards-compliant solution was staring me in the face: XSLT. RSS is just XML, and the way to filter XML is through XSLT. I’m hardly an XSLT expert, but I’d used it several years ago when I modified Fletcher Penney’s MultiMarkdown, so I knew I could do it again, especially for a simple filter.

As a language, XSLT blows, but once you accept its absurd verbosity you can get on with life and do the job. And there are plenty of examples on the internet to crib from. I found this one to be close enough to what I was doing. I downloaded the After Dark feed and started experimenting. After several dead ends, I wound up with this:

 1:  <?xml version="1.0" encoding="utf-8" ?>
 2:  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 4:  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
 6:  <!-- First, get everything. -->
 7:  <xsl:template match="node() | @*">
 8:     <xsl:copy>
 9:         <xsl:apply-templates select="node() | @*"/>
10:     </xsl:copy>
11:  </xsl:template>
13:  <!-- Then restrict to just certain items. -->
14:  <xsl:template match="/rss/channel/item">
15:     <xsl:if test="title[contains(., 'Incomparable') or contains(., 'Talk Show') or contains(., 'Back to Work') or contains(., 'Hypercritical')]">
16:       <item>
17:         <xsl:apply-templates select="node()" />
18:       </item>
19:     </xsl:if>
20:  </xsl:template>
22:  </xsl:stylesheet>

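The pattern used here (start out with a stanza that grabs everything, then follow up with one that picks out the nodes you need) seems to be standard; I saw it in several examples, where the first stanza is usually called the identity transform. The more specific match on /rss/channel/item takes priority over the catch-all, so items go through the second stanza and everything else goes through the first. What makes it useful in this situation is that the first stanza gives you all the channel information, which you need to keep, without having to write any code specific to that structure.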

Let’s be more concrete. Here’s the skeleton of the After Dark RSS feed:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:atom="http://www.w3.org/2005/Atom/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title>After Dark</title>
    <pubDate>Wed, 16 May 2012 18:00:00 GMT</pubDate>
    <description>This is what happens after Dan and his co-hosts hit "STOP" and their official shows are over. Behind the scenes, casual, unedited, and uncensored. Hosted by Dan Benjamin.</description>
    <!-- many nodes with info about the podcast in general -->
    <item>
      <title>After Dark 157: After Amplified #7</title>
      <!-- more info about this episode -->
    </item>
    <item>
      <title>After Dark 156: After Back to Work #67</title>
      <!-- more info about this episode -->
    </item>
    <item>
      <title>After Dark 155: After Build and Analyze #77</title>
      <!-- more info about this episode -->
    </item>
    <!-- and so on, with more items -->
  </channel>
</rss>

My goal is to get everything in this feed except the <item>s for shows I don’t listen to. I don’t care about the structure of the nodes that provide the general podcast information; I just know that I want them in the output. The “get everything” stanza in Lines 6-11 provides that.

Similarly, the stanza in Lines 13-20 gets all the <item>s I want based entirely on what’s in their <title>s. I don’t need to know anything else about the internals of an <item>.
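If you want to watch the two stanzas work before pointing them at the real feed, here is a minimal sketch that runs a trimmed copy of the stylesheet against a made-up three-item feed. The file names (sample.rss, demo.xslt) and the single-show test are my own placeholders, and it assumes xsltproc is installed. It should print the channel title plus only the Back to Work item.

```shell
#!/bin/sh
# Minimal demo of the filter on a made-up three-item feed.
# Assumes xsltproc is installed; sample.rss and demo.xslt are placeholder names.

if ! command -v xsltproc >/dev/null 2>&1; then
    echo "xsltproc not found; skipping demo" >&2
    exit 0
fi

workdir=$(mktemp -d)

# A cut-down stand-in for the real feed (titles borrowed from the skeleton above).
cat > "$workdir/sample.rss" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>After Dark</title>
    <item><title>After Dark 157: After Amplified #7</title></item>
    <item><title>After Dark 156: After Back to Work #67</title></item>
    <item><title>After Dark 155: After Build and Analyze #77</title></item>
  </channel>
</rss>
EOF

# The same two-template stylesheet, trimmed to one show for brevity.
cat > "$workdir/demo.xslt" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
  <!-- Catch-all: copy every node and attribute through unchanged. -->
  <xsl:template match="node() | @*">
    <xsl:copy><xsl:apply-templates select="node() | @*"/></xsl:copy>
  </xsl:template>
  <!-- Override for items: emit one only when its title matches. -->
  <xsl:template match="/rss/channel/item">
    <xsl:if test="title[contains(., 'Back to Work')]">
      <item><xsl:apply-templates select="node()"/></item>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
EOF

# Run the transform, drop blank lines, and show the result.
filtered=$(xsltproc "$workdir/demo.xslt" "$workdir/sample.rss" | sed '/^ *$/d')
printf '%s\n' "$filtered"

rm -r "$workdir"
```

Swapping the contains() test for the four-show version in the listing above should give the full filter.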

With this filter, I can generate my own After Dark feed via

curl -s http://feeds.feedburner.com/5by5-afterdark \
 | xsltproc rssfilter.xslt - \
 | sed '/^ *$/d' > afterdark.rss

Aren’t pipelines fun? I’ve split it over several lines to make it easier to read. The command

  1. Downloads the feed.
  2. Filters it as described above.
  3. Deletes the blank lines.
  4. Saves it to a file.

Deleting the blank lines with sed wasn’t really necessary, but it neatened up the output feed.
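The blank lines most likely come from the whitespace that sat between the dropped <item>s, which the catch-all stanza copies through untouched. Here is a small sketch of what the sed step does to that kind of input; the function name strip_blank_lines is just a label I made up for the demo.

```shell
#!/bin/sh
# Hypothetical helper wrapping the sed expression from the pipeline.
strip_blank_lines() {
    # Delete lines that are empty or contain only spaces.
    sed '/^ *$/d'
}

# Sample input with one empty line and one spaces-only line,
# similar to the gaps left behind by filtered-out items.
printf '<channel>\n\n   \n<item/>\n</channel>\n' | strip_blank_lines
```

Note that the pattern matches spaces only, so a line containing a tab would survive.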

So what good is this? Well, you can set up a schedule to run the pipeline locally and upload the resulting file to your server, where all your devices can get to it at any time. Or, if you have shell access to a server with curl and xsltproc,[1] you can set up a cron process to run this pipeline periodically and then subscribe to the afterdark.rss file. Or you could turn the pipeline (sans the final file redirect) into a CGI script and subscribe to that.
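For the cron option, here is one way the pipeline might be wrapped so a failed download doesn't clobber the last good copy of the feed. This is a sketch, not something I've run in production: the function name update_feed, its argument order, and the paths in the comments are all hypothetical.

```shell
#!/bin/sh
# Sketch of a cron-friendly wrapper around the pipeline.
# update_feed and its argument order are hypothetical, not part of any tool.

update_feed() {
    feed_url=$1      # source feed, e.g. the FeedBurner URL
    stylesheet=$2    # path to rssfilter.xslt
    output=$3        # file your podcast client subscribes to

    # Temp file in the same directory as the output, so the final mv is a rename.
    tmp=$(mktemp "${output}.XXXXXX") || return 1

    # A pipeline's exit status is that of its last command (sed), so also
    # check that something was actually written before replacing the feed.
    if curl -fsS "$feed_url" | xsltproc "$stylesheet" - | sed '/^ *$/d' > "$tmp" \
        && [ -s "$tmp" ]; then
        chmod 644 "$tmp"        # mktemp creates the file as owner-only
        mv "$tmp" "$output"
    else
        rm -f "$tmp"            # leave the previous copy in place
        return 1
    fi
}

# Example call (paths are placeholders):
#   update_feed http://feeds.feedburner.com/5by5-afterdark rssfilter.xslt /var/www/afterdark.rss
# Example crontab line, hourly, assuming the script is saved and made executable:
#   0 * * * * /home/you/bin/update-afterdark.sh
```

Writing to a temporary file first means subscribers never see a half-written feed, and a network hiccup just leaves the old one where it was.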

Update 5/18/12
I forgot to mention the content-type line. If you want to use that pipeline as a CGI script, make it

#!/bin/sh
echo "Content-type: text/xml"
echo    # a blank line ends the CGI headers
xsltproc rssfilter.xslt http://feeds.feedburner.com/5by5-afterdark \
 | sed '/^ *$/d'

I guess there’s some controversy over the proper content-type for a feed, but that’s what I tried and it worked fine. Both iTunes and Instacast recognized the output as an RSS feed. Make sure the XSLT file is in the same directory as the CGI script.

You’ll notice also that I’ve removed the curl command, passing the feed URL directly to xsltproc. I didn’t know initially that xsltproc will accept a URL to the XML input. The man page doesn’t mention it.

Whatever the process, you’ll have a filtered feed that uses only standard tools and doesn’t rely on a free online service that could be yanked at any time.

  1. Or equivalent programs.