My feed reading system

As promised, or threatened, here’s my setup for RSS feed reading. It consists of a few scripts that run periodically throughout the day on a server that I control and that I can reach from any browser on any device. The idea is to have a system that fits the way I read and doesn’t rely on any particular service or company. If my current web host went out of business tomorrow, I could move this system to another and be back up and running in an hour or so, which is less time than it would take to research and decide on a new feed reading service.

The linchpin of the system is the getfeeds script:

  1  #!/usr/bin/env python
  2  # coding=utf8
  3  
  4  import feedparser as fp
  5  import time
  6  from datetime import datetime, timedelta
  7  import pytz
  8  from collections import defaultdict
  9  import sys
 10  import dateutil.parser as dp
 11  import urllib2
 12  import json
 13  import sqlite3
 14  import urllib
 15  
 16  def addItem(db, blog, id):
 17    add = 'insert into items (blog, id) values (?, ?)'
 18    db.execute(add, (blog, id))
 19    db.commit()
 20  
 21  jsonsubscriptions = [
 22    'http://leancrew.com/all-this/feed.json',
 23    'https://daringfireball.net/feeds/json',
 24    'https://sixcolors.com/feed.json',
 25    'https://www.robjwells.com/feed.json',
 26    'http://inessential.com/feed.json',
 27    'https://macstories.net/feed/json']
 28  
 29  xmlsubscriptions = [
 30    'http://feedpress.me/512pixels',
 31    'http://alicublog.blogspot.com/feeds/posts/default',
 32    'http://blog.ashleynh.me/feed',
 33    'http://www.betalogue.com/feed/',
 34    'http://bitsplitting.org/feed/',
 35    'https://kieranhealy.org/blog/index.xml',
 36    'http://blueplaid.net/news?format=rss',
 37    'http://brett.trpstra.net/brettterpstra',
 38    'http://feeds.feedburner.com/NerdGap',
 39    'http://www.libertypages.com/clarktech/?feed=rss2',
 40    'http://feeds.feedburner.com/CommonplaceCartography',
 41    'http://kk.org/cooltools/feed',
 42    'https://david-smith.org/atom.xml',
 43    'http://feeds.feedburner.com/drbunsenblog',
 44    'http://stratechery.com/feed/',
 45    'http://feeds.feedburner.com/IgnoreTheCode',
 46    'http://indiestack.com/feed/',
 47    'http://feeds.feedburner.com/theendeavour',
 48    'http://feed.katiefloyd.me/',
 49    'http://feeds.feedburner.com/KevinDrum',
 50    'http://www.kungfugrippe.com/rss',
 51    'http://www.caseyliss.com/rss',
 52    'http://www.macdrifter.com/feeds/all.atom.xml',
 53    'http://mackenab.com/feed',
 54    'http://macsparky.com/blog?format=rss',
 55    'http://www.marco.org/rss',
 56    'http://themindfulbit.com/feed.xml',
 57    'http://merrillmarkoe.com/feed',
 58    'http://mjtsai.com/blog/feed/',
 59    'http://feeds.feedburner.com/mygeekdaddy',
 60    'https://nathangrigg.com/feed/all.rss',
 61    'http://onethingwell.org/rss',
 62    'http://www.practicallyefficient.com/feed.xml',
 63    'http://www.red-sweater.com/blog/feed/',
 64    'http://blog.rtwilson.com/feed/',
 65    'http://feedpress.me/candlerblog',
 66    'http://inversesquare.wordpress.com/feed/',
 67    'http://joe-steel.com/feed',
 68    'http://feeds.veritrope.com/',
 69    'https://with.thegra.in/feed',
 70    'http://xkcd.com/atom.xml',
 71    'http://doingthatwrong.com/?format=rss']
 72  
 73  # Feedparser filters out certain tags and eliminates them from the
 74  # parsed version of a feed. This is particularly troublesome with
 75  # embedded videos. This can be fixed by changing how the filter
 76  # works. The following is based on these tips:
 77  #
 78  # http://rumproarious.com/2010/05/07/\
 79  #  universal-feed-parser-is-awesome-except-for-embedded-videos/
 80  #
 81  # http://stackoverflow.com/questions/30353531/\
 82  #  python-rss-feedparser-cant-parse-description-correctly
 83  #
 84  # There is some danger here, as the included elements may contain
 85  # malicious code.
 86  fp._HTMLSanitizer.acceptable_elements |= {'object', 'embed', 'iframe'}
 87  
 88  # Connect to the database of read posts.
 89  db = sqlite3.connect('/path/to/read-feeds.db')
 90  query = 'select * from items where blog=? and id=?'
 91  
 92  # Collect all unread posts and put them in a list of tuples. The items
 93  # in each tuple are when, blog, title, link, body, n, and author. 
 94  posts = []
 95  n = 0
 96  
 97  # We're not going to accept items that are more than 3 days old, even
 98  # if they aren't in the database of read items. These typically come up
 99  # when someone does a reset of some sort on their blog and regenerates
100  # a feed with old posts that aren't in the database or posts that are
101  # in the database but have different IDs.
102  utc = pytz.utc
103  homeTZ = pytz.timezone('US/Central')
104  daysago = datetime.utcnow() - timedelta(days=3)
105  daysago = utc.localize(daysago)
106  
107  # Start with the JSON feeds.
108  for s in jsonsubscriptions:
109    try:
110      feed = urllib2.urlopen(s).read()
111      jfeed = json.loads(feed)
112      blog = jfeed['title']
113      for i in jfeed['items']:
114        try:
115          id = i['id']
116        except KeyError:
117          id = i['url']
118      
119        # Add item only if it hasn't been read.
120        match = db.execute(query, (blog, id)).fetchone()
121        if not match:
122          try:
123            when = i['date_published']
124          except KeyError:
125            when = i['date_modified']
126          when = dp.parse(when)
127          when = utc.localize(when)
128          
129          try:
130            author = ' ({})'.format(i['author']['name'])
131          except KeyError:
132            author = ''
133          try:
134            title = i['title']
135          except KeyError:
136            title = blog
137          link = i['url']
138          body = i['content_html']
139          
140          # Include only posts that are less than 3 days old. Add older posts
141          # to the read database.
142          if when > daysago:
143            posts.append((when, blog, title, link, body, "{:04d}".format(n), author, id))
144            n += 1
145          else:
146            addItem(db, blog, id)
147    except:
148      pass
149      
150  # Add the RSS/Atom feeds.
151  for s in xmlsubscriptions:
152    try:
153      f = fp.parse(s)
154      try:
155        blog = f['feed']['title']
156      except KeyError:
157        blog = "---"
158      for e in f['entries']:
159        try:
160          id = e['id']
161          if id == '':
162            id = e['link']
163        except KeyError:
164          id = e['link']
165      
166        # Add item only if it hasn't been read.
167        match = db.execute(query, (blog, id)).fetchone()
168        if not match:
169      
170          try:
171            when = e['published_parsed']
172          except KeyError:
173            when = e['updated_parsed']
174          when =  datetime(*when[:6])
175          when = utc.localize(when)
176      
177          try:
178            title = e['title']
179          except KeyError:
180            title = blog
181          try:
182            author = " ({})".format(e['authors'][0]['name'])
183          except KeyError:
184            author = ""
185          try:
186            body = e['content'][0]['value']
187          except KeyError:
188            body = e['summary']
189          link = e['link']
190          
191          # Include only posts that are less than 3 days old. Add older posts
192          # to the read database.
193          if when > daysago:
194            posts.append((when, blog, title, link, body, "{:04d}".format(n), author, id))
195            n += 1
196          else:
197            addItem(db, blog, id)
198    except:
199      pass
200  
201  # Sort the posts in reverse chronological order.
202  posts.sort()
203  posts.reverse()
204  toclinks = defaultdict(list)
205  for p in posts:
206    toclinks[p[1]].append((p[2], p[5]))
207  
208  # Create an HTML list of the posts.
209  listTemplate = '''<li>
210    <p class="title" id="{5}"><a href="{3}">{2}</a></p>
211    <p class="info">{1}{6}<br />{0}</p>
212    <p>{4}</p>
213    <form action="/path/to/addreaditem.py" method="post" name="readform{5}" onsubmit="return markAsRead(this);">
214      <input type="hidden" name="blog" value="{8}" />
215      <input type="hidden" name="id" value="{9}" />
216      <input class="mark-button" type="submit" value="Mark as read" name="readbutton{5}"/>
217    </form>
218    <br />
219    <form action="/path/to/addpinboarditem.py" method="post" name="pbform{5}" onsubmit="return addToPinboard(this);">
220      <input type="hidden" name="url" value="{11}" />
221      <input type="hidden" name="title" value="{10}" />
222      <input class="pinboard-field" type="text" name="tags" size="30" /><br />
223      <input class="pinboard-button" type="submit" value="Pinboard" name="pbbutton{5}" />
224    </form>
225    </li>'''
226  litems = []
227  for p in posts:
228    q = [ x.encode('utf8') for x in p[1:] ]
229    timestamp = p[0].astimezone(homeTZ)
230    q.insert(0, timestamp.strftime('%b %d, %Y %I:%M %p'))
231    q += [urllib.quote_plus(q[1]),
232          urllib.quote_plus(q[7]),
233          urllib.quote_plus(q[2]),
234          urllib.quote_plus(q[3])]
235    litems.append(listTemplate.format(*q))
236  body = '\n<hr />\n'.join(litems)
237  
238  # Create a table of contents organized by blog.
239  tocTemplate = '''<li class="toctitle"><a href="#{1}">{0}</a></li>\n'''
240  toc = ''
241  blogs = toclinks.keys()
242  blogs.sort()
243  for b in blogs:
244    toc += '''<p class="tocblog">{0}</p>
245  <ul class="rss">
246    '''.format(b.encode('utf8'))
247    for p in toclinks[b]:
248      q = [ x.encode('utf8') for x in p ]
249      toc += tocTemplate.format(*q)
250    toc += '</ul>\n'
251  
252  # Print the HTML.
253  print '''<html>
254  <meta charset="UTF-8" />
255  <meta name="viewport" content="width=device-width" />
256  <head>
257  <style>
258  body {{
259    background-color: #555;
260    width: 750px;
261    margin-top: 0;
262    margin-left: auto;
263    margin-right: auto;
264    padding-top: 0;
265    font-family: Georgia, Serif;
266  }}
267  h1, h2, h3, h4, h5, h6 {{
268    font-family: Helvetica, Sans-serif;
269  }}
270  h1 {{
271    font-size: 110%;
272  }}
273  h2 {{
274    font-size: 105%;
275  }}
276  h3, h4, h5, h6 {{
277    font-size: 100%;
278  }}
279  .content {{
280    padding-top: 1em;
281    background-color: white;
282  }}
283  .rss {{
284    list-style-type: none;
285    margin: 0;
286    padding: .5em 1em 1em 1.5em;
287    background-color: white;
288  }}
289  .rss li {{
290    margin-left: -.5em;
291    line-height: 1.4;
292  }}
293  .rss li pre {{
294    overflow: auto;
295  }}
296  .rss li p {{
297    overflow-wrap: break-word;
298    word-wrap: break-word;
299    word-break: break-word;
300    -webkit-hyphens: auto;
301    hyphens: auto;
302  }}
303  .rss li figure {{
304    -webkit-margin-before: 0;
305    -webkit-margin-after: 0;
306    -webkit-margin-start: 0;
307    -webkit-margin-end: 0;
308  }}
309  .title {{
310    font-weight: bold;
311    font-family: Helvetica, Sans-serif;
312    font-size: 120%;
313    margin-bottom: .25em;
314  }}
315  .title a {{
316    text-decoration: none;
317    color: black;
318  }}
319  .info {{
320    font-size: 85%;
321    margin-top: 0;
322    margin-left: .5em;
323  }}
324  .tocblog {{
325    font-weight: bold;
326    font-family: Helvetica, Sans-serif;
327    font-size: 100%;
328    margin-top: .25em;
329    margin-bottom: 0;
330  }}
331  .toctitle {{
332    font-weight: medium;
333    font-family: Helvetica, Sans-serif;
334    font-size: 100%;
335    padding-left: .75em;
336    text-indent: -.75em;
337    margin-bottom: 0;
338  }}
339  .toctitle a {{
340    text-decoration: none;
341    color: black;
342  }}
343  .tocinfo {{
344    font-size: 75%;
345    margin-top: 0;
346    margin-left: .5em;
347  }}
348  img, embed, iframe, object {{
349    max-width: 700px;
350  }}
351  .mark-button {{
352    width: 15em;
353    border: none;
354    border-radius: 4px;
355    color: black;
356    background-color: #B3FFB2;
357    text-align: center;
358    padding: .25em 0 .25em 0;
359    font-weight: bold;
360    font-size: 1em;
361  }}
362  .pinboard-button {{
363    width: 7em;
364    border: none;
365    border-radius: 4px;
366    color: black;
367    background-color: #B3FFB2;
368    text-align: center;
369    padding: .25em 0 .25em 0;
370    font-weight: bold;
371    font-size: 1em;
372    margin-left: 11em;
373  }}
374  .pinboard-field {{
375    font-size: 1em;
376    font-family: Helvetica, Sans-serif;
377  }}
378  
379  @media only screen
380    and (max-width: 667px)
381    and (-webkit-device-pixel-ratio: 2)
382    and (orientation: portrait) {{
383    body {{
384      font-size: 200%;
385      width: 640px;
386      background-color: white;
387    }}
388    .rss li {{
389      line-height: normal;
390    }}
391    img, embed, iframe, object {{
392      max-width: 550px;
393    }}
394  }}
395  @media only screen
396    and (min-width: 668px)
397    and (-webkit-device-pixel-ratio: 2) {{
398    body {{
399      font-size: 150%;
400      width: 800px;
401      background-color: #555;
402    }}
403    .rss li {{
404      line-height: normal;
405    }}
406    img, embed, iframe, object {{
407      max-width: 700px;
408    }}
409  }}
410  </style>
411  
412  <script language=javascript type="text/javascript">
413  function markAsRead(theForm) {{
414    var mark = new XMLHttpRequest();
415    mark.open(theForm.method, theForm.action, true);
416    mark.send(new FormData(theForm));
417    mark.onreadystatechange = function() {{
418      if (mark.readyState == 4 && mark.status == 200) {{
419        var buttonName = theForm.name.replace("readform", "readbutton");
420        var theButton = document.getElementsByName(buttonName)[0];
421        theButton.value = "Marked!";
422        theButton.style.backgroundColor = "#FFB2B2";
423      }}
424    }}
425    return false;
426  }}
427  
428  function addToPinboard(theForm) {{
429    var mark = new XMLHttpRequest();
430    mark.open(theForm.method, theForm.action, true);
431    mark.send(new FormData(theForm));
432    mark.onreadystatechange = function() {{
433      if (mark.readyState == 4 && mark.status == 200) {{
434        var buttonName = theForm.name.replace("pbform", "pbbutton");
435        var theButton = document.getElementsByName(buttonName)[0];
436        theButton.value = "Saved!";
437        theButton.style.backgroundColor = "#FFB2B2";
438      }}
439    }}
440    return false;
441  }}
442  
443  </script>
444  
445  <title>Today’s RSS</title>
446  </head>
447  <body>
448  <div class="content">
449  <ul class="rss">
450  {}
451  </ul>
452  <hr />
453  <a name="start" />
454  <ul class="rss">
455  {}
456  </ul>
457  </div>
458  </body>
459  </html>
460  '''.format(toc, body)

For me, this is a very long script, but most of it is just the HTML template. What getfeeds does is go through my subscription list, gather all the articles from those feeds that I haven’t already read, and generate a static HTML file with the unread articles laid out in reverse chronological order. At the end of each article, it puts a button to mark the article as read and a form for adding a link to the article to my account at Pinboard.

Start by noticing that this is a Python 2 script, so Line 2 is a comment that tells Python that UTF-8 characters will be in the source code. We’ll also run into decode/encode invocations that wouldn’t be necessary if I’d written this in Python 3. I suppose I’ll translate it at some point.

Lines 16–19 are a function for adding an article to the database of read items. This is an SQLite database that’s also kept on the server. The database has a single table whose schema consists of just two fields: the blog name and the article GUID. Each article that I’ve marked as read gets entered as a new record in the database. The addItem function runs a simple SQL insertion command via Python’s sqlite3 library.
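To make the size of that database concrete, here is a minimal sketch in Python 3 that uses an in-memory database. The table and column names come from the SQL in addItem. The column types, the is_read helper, and the example.com values are assumptions made for illustration and may not match the real database.

```python
import sqlite3

def add_item(db, blog, item_id):
    """Record one article as read (same SQL as addItem in getfeeds)."""
    db.execute('insert into items (blog, id) values (?, ?)', (blog, item_id))
    db.commit()

def is_read(db, blog, item_id):
    """Return True if the (blog, id) pair is already in the table."""
    query = 'select * from items where blog=? and id=?'
    return db.execute(query, (blog, item_id)).fetchone() is not None

# Assumed schema: two text columns, no keys or indexes.
db = sqlite3.connect(':memory:')
db.execute('create table items (blog text, id text)')

add_item(db, 'Example Blog', 'https://example.com/post-1')
```

Because the lookup matches on both columns, two blogs that happen to reuse the same GUID are still tracked separately.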

Lines 21–27 and 29–71 define my subscriptions: two lists of feed URLs, one for JSON feeds and the other for traditional RSS/Atom feeds. A lot of these feeds have gone silent over the past year, but I remain subscribed to them in the hope that they’ll come back to life.

Line 86 sets a parameter in the feedparser library that relaxes some of the filtering that library does by default. There is some danger to this, but I’ve found that some blogs are essentially worthless if I don’t do this. The comments above Line 86 contain links to discussions of feedparser’s filtering.

Lines 89–90 connect to the database of read items (note the fake path to the database file) and create a query string that we’ll use later to determine whether an article is in the database.

Lines 94–95 initialize the list of posts that will ultimately be turned into the HTML page and the n variable that keeps track of the post count.

Lines 102–105 initialize a set of variables used to handle timezone information and the filtering of older articles that aren’t in the database of read items. As discussed in the comments above Line 102 and in my previous post, old articles that aren’t in the database can sometimes appear in a blog’s RSS feed when the blog gets updated.
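The three-day cutoff can be sketched in Python 3 with only the standard library. This is not the pytz-based code in getfeeds. The to_utc and is_recent helpers are hypothetical names, and treating a timestamp with no timezone as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

def to_utc(dt):
    """Return dt as an aware UTC datetime; naive values are assumed to be UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def is_recent(when, now, days=3):
    """True if `when` is less than `days` days before `now`."""
    return to_utc(when) > to_utc(now) - timedelta(days=days)

now = datetime(2018, 1, 10, 12, 0, tzinfo=timezone.utc)
central = timezone(timedelta(hours=-6))

fresh = is_recent(datetime(2018, 1, 9, 0, 0), now)                     # naive, one day old
stale = is_recent(datetime(2018, 1, 1, 0, 0), now)                     # naive, nine days old
offset = is_recent(datetime(2018, 1, 7, 11, 0, tzinfo=central), now)   # 17:00 UTC on Jan 7
```

Converting everything to UTC before comparing avoids the error Python raises when a naive datetime is compared with an aware one.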

Lines 108–148 assemble the unread articles from the JSON feeds. For each subscription, the feed is downloaded, converted into a dictionary, and run through to extract information on each article. Articles that are in the database of read items are ignored (Lines 120–121). Articles that aren’t in the database are appended to the posts list, unless they’re more than three days old, in which case they are added to the database of read items instead of to posts (Lines 142–146).

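Much of Lines 108–148 is devoted to error handling and the normalization of disparate input into a uniform output. Each item of the posts list is a tuple with eight elements: the article date, the blog name, the article title, the article URL, the article body, the zero-padded post count, the author, and the article GUID.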
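The fallback logic for a single JSON Feed item can be condensed into one function. What follows is a Python 3 sketch under stated assumptions, not the code in getfeeds: normalize_json_item is a hypothetical name, the date is left as the raw string from the feed rather than parsed, and dictionary lookups with defaults stand in for the try/except blocks.

```python
def normalize_json_item(item, blog, n):
    """Return (when, blog, title, link, body, n, author, id) for one feed item."""
    link = item['url']
    item_id = item.get('id', link)            # fall back to the URL as the ID
    if 'date_published' in item:
        when = item['date_published']
    else:
        when = item['date_modified']
    name = item.get('author', {}).get('name')
    author = ' ({})'.format(name) if name else ''
    title = item.get('title', blog)           # untitled posts get the blog name
    body = item['content_html']
    return (when, blog, title, link, body, '{:04d}'.format(n), author, item_id)

# A deliberately sparse item: no id, title, author, or date_published.
sparse = {
    'url': 'https://example.com/a',
    'content_html': '<p>hi</p>',
    'date_modified': '2018-01-05T10:00:00-06:00',
}
post = normalize_json_item(sparse, 'Example Blog', 3)
```

Every field that later code relies on is present in the result, whichever optional keys the feed supplied.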

Lines 151–199 do for RSS/Atom feeds what Lines 108–148 do for JSON feeds. The main difference is that the feedparser library is used to download and convert the feed into a dictionary.

Lines 202–203 sort the posts in reverse chronological order. This is made easy by my choice to put the article date as the first item in the tuple described above.
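Python compares tuples element by element, so a list of tuples that begin with a date sorts by date without a key function. A small Python 3 illustration with made-up posts (the blog names and titles are placeholders):

```python
from datetime import datetime

posts = [
    (datetime(2018, 1, 3, 9, 0), 'Blog B', 'Older post'),
    (datetime(2018, 1, 5, 14, 30), 'Blog A', 'Newest post'),
    (datetime(2018, 1, 4, 8, 15), 'Blog A', 'Middle post'),
]

# Newest first; for tuples with distinct dates this matches sort() then reverse().
posts.sort(reverse=True)
titles = [p[2] for p in posts]
```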

Lines 204–206 generate a dictionary of lists of tuples, toclinks, for the HTML page’s table of contents, which appears at the top of the page. A table of contents isn’t really necessary, but I like seeing an overview of what’s available before I start reading. The keys of the dictionary are the blog names, and each tuple in the list consists of the article’s title and its number, as given in the running post count, n. The number will be used to create internal links in the HTML page.
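The grouping step is a common use of defaultdict: the first time a blog name is seen, an empty list is created for it automatically. A Python 3 sketch with shortened placeholder tuples (the real ones have eight elements, and the values here are invented):

```python
from collections import defaultdict

# (date, blog, title, number) stand-ins for the full post tuples.
posts = [
    ('2018-01-05', 'Blog A', 'Newest post', '0000'),
    ('2018-01-04', 'Blog B', 'Middle post', '0001'),
    ('2018-01-03', 'Blog A', 'Older post', '0002'),
]

toclinks = defaultdict(list)
for when, blog, title, number in posts:
    toclinks[blog].append((title, number))

blogs = sorted(toclinks)   # alphabetical order for the table of contents
```

Since the posts are already sorted when this loop runs, the titles under each blog keep their reverse chronological order.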

From this point on, it’s all HTML templating. I suppose I could’ve used one of the myriad Python libraries for this, but I didn’t feel like doing the research to figure out which would be best for my needs. The ol’ format command works pretty well.
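One detail of templating with format is worth showing: literal braces have to be doubled, which is why every CSS rule and JavaScript block in the page template is written with {{ and }}. A stripped-down Python 3 sketch, with templates invented for illustration:

```python
# Doubled braces come out as single literal braces; {0} and {1} are filled in.
page_template = '''<style>
.title {{ font-weight: bold; }}
</style>
<ul>
{0}
</ul>'''

item_template = '<li id="{1}"><a href="#{1}">{0}</a></li>'

items = [
    item_template.format('First post', '0000'),
    item_template.format('Second post', '0001'),
]
page = page_template.format('\n'.join(items))
```

Numbered fields can be repeated, as {1} is here, which is how the article template reuses the post number in several attributes.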

Lines 209–225 define the template for each article. It starts with the title (which links to the original article), followed by the blog name, the author, and the date. The id attribute in the title provides the internal target for the link in the table of contents. After the post contents come two forms. The first has two hidden fields with the blog name and the article GUID and a visible button that marks the article as read. The second form has two hidden fields with the article URL and title, a visible text field for Pinboard tags, and a button to add a link to the original article to my Pinboard list. We’ll see later how these buttons work.

Lines 227–236 concatenate all of the posts, through their template, into one long stretch of HTML that will make up the bulk of the body of the page.
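Lines 231–234 run the blog name, GUID, title, and link through quote_plus before they go into the hidden form fields, and the CGI scripts reverse that with unquote_plus. Encoding keeps quotation marks and ampersands from breaking the value="..." attributes. Here is the round trip in Python 3, where both functions live in urllib.parse rather than urllib. The sample strings are invented.

```python
from urllib.parse import quote_plus, unquote_plus

blog = 'Daring Fireball & "Friends"'
item_id = 'https://example.com/posts/?p=42&lang=en'

# What would be written into the hidden fields.
encoded_blog = quote_plus(blog)
encoded_id = quote_plus(item_id)

# What the receiving script would recover.
decoded_blog = unquote_plus(encoded_blog)
decoded_id = unquote_plus(encoded_id)
```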

Line 239 defines a template for a table of contents entry (note the internal link), and Lines 240–250 then use that template to assemble the toclinks dictionary into the HTML for the table of contents.

The last piece, Lines 253–460, assembles and outputs the final, full HTML file. It’s as long as it is because I wanted a single, self-contained file with all the CSS and JavaScript in it. I’m sure this doesn’t comport with best practices, but I’ve noticed that best practices in web programming and design change more often than I have time to keep track of. Whenever I need to change something, I know it’ll be here in getfeeds.

The CSS is in Lines 257–410 and is set up to look decent (to me) on my computer, iPad, and iPhone. There’s a lot I don’t know about responsive web design, and I’m sure it shows here.

Lines 412–426 and Lines 428–441 define the markAsRead and addToPinboard JavaScript functions, which are activated by the buttons described above. These are basic AJAX functions that do not rely on any outside library. They’re based on what I read in David Flanagan’s JavaScript: The Definitive Guide and, I suspect, a Stack Overflow page or two that I forgot to preserve the links to. There’s a decent chance they don’t work in Internet Explorer, which I will worry about in the next life.

The markAsRead function triggers this addreaditem.py script on the server:

 1  #!/usr/bin/python
 2  # coding=utf8
 3  
 4  import sqlite3
 5  import cgi
 6  import sys
 7  import urllib
 8  import cgitb
 9  
10  def addItem(db, blog, id):
11    add = 'insert into items (blog, id) values (?, ?)'
12    db.execute(add, (blog, id))
13    db.commit()
14  
15  def markedItem(db, blog, id):
16    check = 'select * from items where blog=? and id=?'
17    return db.execute(check, (blog, id)).fetchone()
18  
19  # Connect to database of read items
20  db = sqlite3.connect('/path-to/read-feeds.db')
21  
22  # Get the item from the request and add it to the database
23  form = cgi.FieldStorage()
24  blog = urllib.unquote_plus(form.getvalue('blog')).decode('utf8')
25  id = urllib.unquote_plus(form.getvalue('id')).decode('utf8')
26  if markedItem(db, blog, id):
27    answer = 'Already marked'
28  else:
29    addItem(db, blog, id)
30    answer = 'OK'
31  
32  minimal='''Content-Type: text/html
33  
34  <html>
35  <head>
36    <title>Add Item</title>
37  <body>
38    <h1>{}</h1>
39  </body>
40  </html>'''.format(answer)
41  
42  print(minimal)

There’s not much to this script. It uses the same addItem function we saw before and a markedItem function that uses the same query we saw earlier to check if an item is in the database. Lines 23–30 get the input from the form that called it, check whether that item is already in the database, and add it if it isn’t. There’s some minimal HTML for output, but that’s of no importance. What matters is that if the script returns a success, the markAsRead function changes the color of the button from green to red and the text of the button from “Mark as read” to “Marked!”

Before: Before marking as read

After: After marking as read

The addToPinboard JavaScript function does essentially the same thing, except it triggers this addpinboarditem.py script on the server:

 1  #!/usr/bin/python
 2  # coding=utf8
 3  
 4  import cgi
 5  import pinboard
 6  import urllib
 7  
 8  # Pinboard token
 9  token = 'myPinboardName:myPinboardToken'
10  
11  # Get the page info from the request
12  form = cgi.FieldStorage()
13  url = urllib.unquote_plus(form.getvalue('url')).decode('utf8')
14  title = urllib.unquote_plus(form.getvalue('title')).decode('utf8')
15  tagstr = urllib.unquote_plus(form.getvalue('tags')).decode('utf8')
16  tags = tagstr.split()
17  
18  # Add the item to Pinboard
19  pb = pinboard.Pinboard(token)
20  result = pb.posts.add(url=url, description=title, tags=tags)
21  if result:
22    answer = "OK"
23  else:
24    answer = "Failed"
25  
26  minimal='''Content-Type: text/html
27  
28  <html>
29  <head>
30    <title>Add To Pinboard</title>
31  <body>
32    <h1>{}</h1>
33  </body>
34  </html>'''.format(answer)
35  
36  print(minimal)

This script uses the Pinboard API to add a link to the original article. Line 9 defines my Pinboard credentials. Lines 12–16 extract the article and tag information from the form. Lines 19–24 connect to Pinboard and add the item to my list. If the script returns a success, the addToPinboard function changes the color of the button from green to red and the text of the button from “Pinboard” to “Saved!”

Before: Before saving to Pinboard

After: After saving to Pinboard

The overall system is controlled by this short shell script, runrss.sh:

1  #!/bin/bash
2  
3  /path/to/getfeeds > /other/path/to/rsspage-tmp.html
4  cd /other/path/to
5  mv rsspage-tmp.html rsspage.html

Line 3 runs the getfeeds script, sending the HTML output to a temporary file. Line 4 then changes to the directory that contains the temporary file, and Line 5 renames it. The file I direct my browser to is rsspage.html. This seeming extra step with the temporary file is there because the getfeeds script takes several seconds to run, and if I sent its output directly to rsspage.html, that file would be in a weird state during that run time. I don’t want to browse the page when it isn’t finished.
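The same write-then-rename idea can be expressed in Python, for anyone who would rather fold it into the generating script. This is a sketch, not part of the system described here: publish is a hypothetical function, and the 0o644 permissions are an assumption about what a web server needs in order to read the file.

```python
import os
import tempfile

def publish(html, final_path):
    """Write html to a temp file beside final_path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(html)
        # mkstemp creates the file readable by its owner only; loosen that.
        os.chmod(tmp_path, 0o644)
        # The rename swaps the finished file in for the old one in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call such as publish(page, '/other/path/to/rsspage.html') would leave the old page in place until the new one is completely written.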

Finally, runrss.sh is executed periodically throughout the day by cron. The crontab entry is

*/20 0,6-23 * * * /path/to/runrss.sh

This runs the script every 20 minutes from 6:00 am through 11:40 pm, plus three more runs in the midnight hour (12:00, 12:20, and 12:40 am), every day.
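To check what the two time fields in that crontab entry expand to, the run times can be listed by hand. The Python 3 snippet below does not parse cron syntax. It hard-codes the meaning of */20 (minutes 0, 20, and 40) and 0,6-23 (hour 0 plus hours 6 through 23).

```python
minutes = list(range(0, 60, 20))        # */20   -> 0, 20, 40
hours = [0] + list(range(6, 24))        # 0,6-23 -> 0, 6, 7, ..., 23

runs = ['{:02d}:{:02d}'.format(h, m) for h in hours for m in minutes]
runs_per_day = len(runs)
```

That comes to 57 runs a day, with a quiet stretch between the 00:40 run and the 06:00 run.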

So that’s it. Three Python scripts, one of which is long but mostly HTML templating, a short shell script, and a crontab entry. Was it easier to do this than set up a Feedbin (or whatever) account? Of course not. But I won’t have to worry if I see that Feedbin’s owners have written a Medium post.