Filtering my RSS reading
August 28, 2022 at 2:41 PM by Dr. Drang
A couple of weeks ago, I decided to cut back on my RSS feed reading.1 Not by reducing the number of feeds I’m subscribed to, but by filtering articles to eliminate those that would just be a waste of my time. The change was inspired by a particularly stupid post by Erik Loomis at Lawyers, Guns & Money. I realized that in all the years I’ve been reading LGM, I’ve liked very few Loomis articles. I start out thinking “maybe this one will be different,” but it seldom is. I just needed to cut him out.
My feedreader is NetNewsWire, which has been working well for me since I started using it about a year ago. Although there’s been some talk of adding filtering to NNW, it hasn’t happened yet. So what I need to do is set up filtered feeds and subscribe to them.
In olden times, I might have used Yahoo Pipes to do the filtering. Today’s equivalents are Zapier and IFTTT. After a bit of reading, it seemed like the parts of Zapier I’d need would require a $20/month subscription. And while I feel certain IFTTT could do what I wanted, I’m not interested in learning to write IFTTT applets—if I’m going to write filtering code, I’d rather do it in a more general purpose way.
I could subscribe to Feedbin or a similar service and point NetNewsWire to my subscription. This would be the right choice if, in addition to filtering, I wanted to fold a bunch of other things Feedbin does—like email newsletters, for example—into my RSS reading, but I’m not interested in that. If I’m going to spend $5/month, I’ll get a lot more out of a low-end virtual machine at Linode or Digital Ocean, which could host both my RSS filtering and other cloud-related services I build. And since I already have such a subscription…
My approach is very Web 1.0. For each feed I want to filter, I create a CGI script on my server. The script reads the original feed, filters out the articles I don’t want, and returns the rest. The URL of that script is what I subscribe to in NetNewsWire.
So what should the script be? My first thought was to use Python. It has the feedparser library, which I’ve used before. It parses the feed—from almost any format—and builds a dictionary from it. At that point, it’s easy to filter the articles using standard dictionary methods. Unfortunately, the filtered dictionary then has to be converted back out into a feed, which feedparser can’t do. I got around this by printing out the filtered dictionary as a JSON Feed. Since Brent Simmons is the driving force behind both NetNewsWire and the JSON Feed standard, I knew NNW would be able to parse the output of my filtering script.
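That first version is easy to sketch. Here’s a minimal reconstruction of the idea; the feed URL, the field choices, and the author test are illustrative rather than the exact script I ran, and a real CGI version would also have to print a Content-Type header and blank line first:
python:
#!/usr/bin/env python3
# Sketch of the feedparser-to-JSON-Feed approach described above.
# The feed URL and the 'Loomis' test are illustrative.
import json
import feedparser

feed = feedparser.parse('https://www.lawyersgunsmoneyblog.com/feed')

# Keep every entry whose author doesn't contain 'Loomis'.
items = [
    {
        'id': e.get('id', e.get('link', '')),
        'url': e.get('link', ''),
        'title': e.get('title', ''),
        'content_html': e.get('summary', ''),
    }
    for e in feed.entries
    if 'Loomis' not in e.get('author', '')
]

# Emit the survivors as a JSON Feed, which NetNewsWire can parse.
print(json.dumps({
    'version': 'https://jsonfeed.org/version/1.1',
    'title': feed.feed.get('title', ''),
    'items': items,
}, indent=2))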
This worked fine, and I used it for a couple of days, but it felt wrong. RSS and Atom feeds are XML files, and XML is supposed to be filtered using XSLT. The thing is, I haven’t used XSLT in ages, and I didn’t much care for it then. It was invented back when clever people thought everything was going to be put in XML format, so they built a programming language in XML. I’m sure they thought this was great—just like Lisp programs being written as Lisp lists—but it wasn’t. I’m sure there are many reasons XML hasn’t turned out to be as revolutionary as was thought 20 years ago, but one of them has to be the shitty language used for XML transformations.
Still, all I wanted to do was search for certain text in a certain node and prevent those records from appearing in the output. Everything else would be passed through as-is. Sal Mangano’s XSLT Cookbook has an example of a simple pass-through XSLT file (also known as the identity transform), which I used as the basis for my script:2
xml:
1: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
2:
3: <xsl:template match="node() | @*">
4:   <xsl:copy>
5:     <xsl:apply-templates select="@* | node()"/>
6:   </xsl:copy>
7: </xsl:template>
8:
9: </xsl:stylesheet>
XSLT is a rule-based language. The rules define how the various elements of the incoming XML document are to be treated. In the pass-through example, the match in the template rule on Line 3 matches all the elements (node()) and all the attributes (@*). The copy command then copies whatever was matched, which was everything.
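A quick way to convince yourself the pass-through really does reproduce its input is to run a feed through it and compare. The file names here are just for illustration:
bash:
# Download a feed and run it through the identity transform.
curl -s https://www.lawyersgunsmoneyblog.com/feed > feed.xml
xsltproc identity.xslt feed.xml > copy.xml

# Apart from whitespace and the XML declaration, the two should match.
diff feed.xml copy.xml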
With the pass-through rule in place, the script can be expanded to add additional rules that are more specific matches to particular elements or attributes. The Lawyers, Guns & Money feed identifies the author of each post this way:
xml:
<item>
  [other tags]
  <dc:creator><![CDATA[Erik Loomis]]></dc:creator>
  [more tags]
</item>
So I needed to add the following to the pass-through script:
- An attribute to the <xsl:stylesheet> tag to add the dc namespace. I got this from the root tag of the LGM feed itself.
- A rule that matches “Loomis” in the <dc:creator> tag and does nothing with it.
Here’s what I came up with:
xml:
1: <xsl:stylesheet version="1.0"
2:   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
3:   xmlns:dc="http://purl.org/dc/elements/1.1/">
4:
5: <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
6:
7: <xsl:template match="node() | @*">
8:   <xsl:copy>
9:     <xsl:apply-templates select="node() | @*"/>
10:   </xsl:copy>
11: </xsl:template>
12:
13: <xsl:template match="item[contains(dc:creator, 'Loomis')]"/>
14: </xsl:stylesheet>
You can see the namespace addition in Line 3 and the new rule for the <dc:creator> element in Line 13. Because there’s no action within this rule, nothing is done when an <item> contains “Loomis” in its <dc:creator> tag. And by “nothing,” I really mean nothing: there’s no output associated with this rule, which means Loomis’s posts are omitted.
With this XSLT file in place, I just needed a shell script to download the original feed and process it through the filter.
bash:
1: #!/bin/bash
2:
3: echo "Content-Type: application/rss+xml"
4: echo
5:
6: curl -s https://www.lawyersgunsmoneyblog.com/feed \
7:   | xsltproc loomis-filter.xslt -
Lines 3–4 provide the header and blank separator line. Lines 6–7 contain the pipeline that downloads the LGM feed via curl and passes it to xsltproc for filtering with the above XSLT file. xsltproc is part of the GNOME XML/XSLT project. It’s not the most capable XSLT processor around (it’s limited to XSLT 1.0, which is missing a lot of nice features), but it’s perfectly fine for this simple application, and it’s quite fast.
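Before pointing NetNewsWire at it, you can run the same pipeline by hand to confirm the filter works. A quick check, assuming the XSLT file is in the current directory:
bash:
# Run the filter locally and count surviving Loomis posts; expect zero.
curl -s https://www.lawyersgunsmoneyblog.com/feed \
  | xsltproc loomis-filter.xslt - \
  | grep -c 'Erik Loomis'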
Assuming the CGI shell script is named filtered-lgm-feed and it’s on a server called mycheapserver.com, the URL I use for the subscription is
https://mycheapserver.com/cgi-bin/filtered-lgm-feed
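Getting there also means putting the script and the XSLT file where the web server looks for CGI programs and making the script executable. The paths below are assumptions about the server layout, not a recipe for every setup:
bash:
# Hypothetical deployment; adjust the cgi-bin path for your web server.
scp filtered-lgm-feed loomis-filter.xslt mycheapserver.com:/usr/lib/cgi-bin/
ssh mycheapserver.com 'chmod +x /usr/lib/cgi-bin/filtered-lgm-feed'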
Once I had this filtered feed working, I thought of other parts of my regular reading that could use some pruning. Here’s the filter I wrote for the Mac Power Users forum:
xml:
1: <xsl:stylesheet version="1.0"
2:   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
3:   xmlns:dc="http://purl.org/dc/elements/1.1/">
4:
5: <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
6:
7: <xsl:template match="node() | @*">
8:   <xsl:copy>
9:     <xsl:apply-templates select="node() | @*"/>
10:   </xsl:copy>
11: </xsl:template>
12:
13: <xsl:template match="item[contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'obsidian')]"/>
14: <xsl:template match="item[contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'devon')]"/>
15: </xsl:stylesheet>
I wish all of you who use Obsidian and DEVONthink the best, but I don’t want to read about them anymore.
The translate function in Lines 13 and 14 converts all uppercase letters to lowercase before passing the result on to the contains function. Unlike the previous filter, which expects “Loomis” to have consistent capitalization (it does), this one doesn’t trust the forum users to capitalize the trigger words in any standardized way. This is especially important for the various products from DEVONtechnologies, which get almost every possible permutation of capitalization: DevonThink, devonTHINK, DevonTHINK, etc.
Using translate is a verbose way of making the change, but unfortunately XSLT 1.0 doesn’t have a lower-case function. XSLT 2.0 does, but xsltproc doesn’t support XSLT 2.0. The Java-based XSLT processor, Saxon, does, and for a while I had an XSLT 2.0 version of the MPU filter running through Saxon. But it was way slower than using xsltproc, so I returned to the clumsier filter you see above.
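For what it’s worth, the XSLT 2.0 rule I ran through Saxon looked something like this sketch, with lower-case doing the work of translate:
xml:
<!-- XSLT 2.0 only: lower-case() isn't available to xsltproc. -->
<xsl:template match="item[contains(lower-case(title), 'obsidian')
                          or contains(lower-case(title), 'devon')]"/>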
The script that runs the filter and returns the Obsidian- and DEVON-less MPU posts looks pretty much like the LGM script:
bash:
#!/bin/bash
echo "Content-Type: application/rss+xml"
echo
curl -s https://talk.macpowerusers.com/latest.rss \
| xsltproc topic-filter.xslt -
Although this post is kind of long-winded, building the filters didn’t take much time. It’s easy to download an RSS feed and look through it to see which nodes and attributes to use for the filter. Now that I have a couple of examples to build on, I expect to be adding more filters soon.
1. Throughout this post, I’ll be using “RSS” as a catch-all term for any kind of feed, regardless of format. My apologies to all the Atom fans out there. ↩
2. No, I don’t own the XSLT Cookbook, but my local library provides its patrons with a subscription to O’Reilly’s ebooks and courseware. It’s a good service, and you should look into whether your library does the same. ↩