Date-without-time stamps
May 11, 2016 at 11:59 PM by Dr. Drang
A while ago, I lost Eddie Smith’s Practically Efficient blog from my homemade RSS aggregator. At first, I thought it was because his feed URL had changed when he switched from Squarespace to Jekyll, but today I learned it was because of a more subtle change: the datestamps on his feed’s new entries didn’t include the time, and my aggregator was misinterpreting that and filtering them out. I’ve rewritten the datestamp filtering portion of the aggregator to fix this problem, which not only brought back not only Eddie’s feed but also XKCD’s.
It’s not that I stopped reading Practically Efficient during these past several weeks. Eddie always tweets a link to his latest post, so I was keeping up via Twitter, like all the cool kids do nowadays. But I really wanted him back in my RSS aggregator, so today I decided to dig in and find out what was keeping him out.
It wasn’t, as I hinted above, because he’d changed his feed URL. I checked, and it’s the same as it’s been for years. So I used curl
to look at his feed directly:
curl http://www.practicallyefficient.com/feed.xml
His feed includes entries for the last ten posts. The entry for today’s post looks like this:
<item>
<title>Fear more</title>
<description><p>Focusing your time on a narrower set of priorities is the biggest gut check you can experience in business and life. It is a never-ending fight with your primal self. Feature bloat, mass marketing, and “do it all” are driven by fear, and the internet age enables that fear like no other medium devised by humanity.</p>
<p>Before you decide to spend time making <em>more</em> instead of making something <em>better</em>, ask yourself: what am I afraid of?</p>
</description>
<pubDate>Wed, 11 May 2016 00:00:00 +0000</pubDate>
<link>http://www.practicallyefficient.com/2016/05/11/fear-more.html</link>
<guid isPermaLink="true">http://www.practicallyefficient.com/2016/05/11/fear-more.html</guid>
</item>
The key to his disappearance from my aggregator is the <pubDate>
, the contents of which are
Wed, 11 May 2016 00:00:00 +0000
Did Eddie really post his article at midnight UTC? No, he posted it sometime on the 11th, and Jekyll (or maybe some other part of his publishing system) set the pubDate
to the correct date but somehow defaulted to midnight for the time portion. If you look through all ten entries in Eddie’s feed, you’ll see that all of them have a time entry of
00:00:00 +0000
This time stamp is why my aggregator was filtering out his posts. As I said in my initial article about the aggregator script, it runs periodically on my server to create an HTML file with “today’s” posts from the sites I subscribe to. To avoid missing articles published late in the evening, I define “today” as anytime after 10:00 pm US Central Time on the previous day. So whenever the script runs on May 11, it creates a page with every post published after 10:00 pm Central Time on May 10. Eddie’s post for May 11 has a timestamp of midnight UTC, which in Central Time is 7:00 pm of May 10, so the script saw it as an older post and filtered it out. And that’s what’s been happening to every one of Eddie’s posts for several weeks.
What to do? I could ask Eddie to change his publishing system to accommodate my aggregator, but that seems presumptuous. Better to rewrite the aggregator to handle this case, especially since his isn’t the only feed that defaults to midnight UTC timestamps. XKCD does the same thing, and I’m sure there are others.
So here’s the new aggregator script:
1 #!/usr/bin/env python
2 # coding=utf8
3
4 import feedparser as fp
5 import time
6 from datetime import datetime, timedelta
7 import pytz
8
9 subscriptions = [
10 'http://feedpress.me/512pixels',
11 'http://www.leancrew.com/all-this/feed/',
12 'http://ihnatko.com/feed/',
13 'http://blog.ashleynh.me/feed',
14 'http://www.betalogue.com/feed/',
15 'http://bitsplitting.org/feed/',
16 'http://feedpress.me/jxpx777',
17 'http://kieranhealy.org/blog/index.xml',
18 'http://blueplaid.net/news?format=rss',
19 'http://brett.trpstra.net/brettterpstra',
20 'http://feeds.feedburner.com/NerdGap',
21 'http://www.libertypages.com/clarktech/?feed=rss2',
22 'http://feeds.feedburner.com/CommonplaceCartography',
23 'http://kk.org/cooltools/feed',
24 'http://danstan.com/blog/imHotep/files/page0.xml',
25 'http://daringfireball.net/feeds/main',
26 'http://david-smith.org/atom.xml',
27 'http://feeds.feedburner.com/drbunsenblog',
28 'http://stratechery.com/feed/',
29 'http://www.gnuplotting.org/feed/',
30 'http://feeds.feedburner.com/jblanton',
31 'http://feeds.feedburner.com/IgnoreTheCode',
32 'http://indiestack.com/feed/',
33 'http://feedpress.me/inessential',
34 'http://feeds.feedburner.com/theendeavour',
35 'http://feed.katiefloyd.me/',
36 'http://feeds.feedburner.com/KevinDrum',
37 'http://www.kungfugrippe.com/rss',
38 'http://lancemannion.typepad.com/lance_mannion/rss.xml',
39 'http://www.caseyliss.com/rss',
40 'http://www.macdrifter.com/feeds/all.atom.xml',
41 'http://mackenab.com/feed',
42 'http://hints.macworld.com/backend/osxhints.rss',
43 'http://macsparky.com/blog?format=rss',
44 'http://www.macstories.net/feed/',
45 'http://www.marco.org/rss',
46 'http://merrillmarkoe.com/feed',
47 'http://mjtsai.com/blog/feed/',
48 'http://feeds.feedburner.com/mygeekdaddy',
49 'http://nathangrigg.net/feed.rss',
50 'http://onethingwell.org/rss',
51 'http://schmeiser.typepad.com/penny_wiseacre/rss.xml',
52 'http://www.practicallyefficient.com/feed.xml',
53 'http://robjwells.com/rss',
54 'http://www.red-sweater.com/blog/feed/',
55 'http://blog.rtwilson.com/feed/',
56 'http://feedpress.me/sixcolors',
57 'http://feedpress.me/candlerblog',
58 'http://inversesquare.wordpress.com/feed/',
59 'http://high90.com/feed',
60 'http://joe-steel.com/feed',
61 'http://feeds.veritrope.com/',
62 'http://xkcd.com/atom.xml',
63 'http://doingthatwrong.com/?format=rss']
64
65 # Date and time setup. I want only posts from today,
66 # where "today" starts at 10 PM of the previous day and
67 # lasts until 2 AM of the following day.
68 # Exception: if the entry's date is today with a timestamp
69 # of exactly midnight (00:00:00), include that, too, even
70 # if its timezone is UTC, as that probably represents a
71 # datestamp of today without a real timestamp.
72 utc = pytz.utc
73 homeTZ = pytz.timezone('US/Central')
74 mnToday = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0)
75 dt = datetime.now(homeTZ)
76 if dt.hour < 2:
77 dt -= timedelta(hours=48)
78 else:
79 dt -= timedelta(hours=24)
80 start = dt.replace(hour=22, minute=0, second=0, microsecond=0)
81 start = start.astimezone(utc)
82
83
84 # Collect all of today's posts and put them in a list of tuples.
85 posts = []
86 for s in subscriptions:
87 f = fp.parse(s)
88 try:
89 blog = f['feed']['title']
90 except KeyError:
91 continue
92 for e in f['entries']:
93 try:
94 when = e['published_parsed']
95 except KeyError:
96 when = e['updated_parsed']
97 when = datetime(*when[:6])
98 # This is the exception. Change it to midnight today, local time.
99 if when == mnToday:
100 when = homeTZ.localize(when).astimezone(utc)
101 else:
102 when = utc.localize(when)
103 if when > start:
104 title = e['title']
105 try:
106 body = e['content'][0]['value']
107 except KeyError:
108 body = e['summary']
109 link = e['link']
110 posts.append((when, blog, title, link, body))
111
112 # Sort the posts in reverse chronological order.
113 posts.sort()
114 posts.reverse()
115
116 # Turn them into an HTML list.
117 listTemplate = '''<li>
118 <p class="title"><a href="{3}">{2}</a></p>
119 <p class="info">{1}<br />{0}</p>
120 <p>{4}</p>\n</li>'''
121 litems = []
122 for p in posts:
123 q = [ x.encode('utf8') for x in p[1:] ]
124 timestamp = p[0].astimezone(homeTZ)
125 q.insert(0, timestamp.strftime('%b %d, %Y %I:%M %p'))
126 litems.append(listTemplate.format(*q))
127 ul = '\n<hr />\n'.join(litems)
128
129 # Print the HTMl.
130 print '''<html>
131 <meta charset="UTF-8" />
132 <meta name="viewport" content="width=device-width" />
133 <head>
134 <style>
135 body {{
136 background-color: #555;
137 width: 750px;
138 margin-top: 0;
139 margin-left: auto;
140 margin-right: auto;
141 padding-top: 0;
142 }}
143 h1, h2, h3, h4, h5, h6 {{
144 font-family: Helvetica, Sans-serif;
145 }}
146 h1 {{
147 font-size: 110%;
148 }}
149 h2 {{
150 font-size: 105%;
151 }}
152 h3, h4, h5, h6 {{
153 font-size: 100%;
154 }}
155 .rss {{
156 list-style-type: none;
157 margin: 0;
158 padding: .5em 1em 1em 1.5em;
159 background-color: white;
160 }}
161 .rss li {{
162 margin-left: -.5em;
163 line-height: 1.4;
164 }}
165 .rss li pre {{
166 overflow: auto;
167 }}
168 .rss li p {{
169 overflow-wrap: break-word;
170 word-wrap: break-word;
171 word-break: break-word;
172 -webkit-hyphens: auto;
173 hyphens: auto;
174 }}
175 .rss li figure {{
176 -webkit-margin-before: 0;
177 -webkit-margin-after: 0;
178 -webkit-margin-start: 0;
179 -webkit-margin-end: 0;
180 }}
181 .title {{
182 font-weight: bold;
183 font-family: Helvetica, Sans-serif;
184 font-size: 120%;
185 margin-bottom: .25em;
186 }}
187 .title a {{
188 text-decoration: none;
189 color: black;
190 }}
191 .info {{
192 font-size: 85%;
193 margin-top: 0;
194 margin-left: .5em;
195 }}
196 img {{
197 max-width: 700px;
198 }}
199 @media screen and (max-width:667px) {{
200 body {{
201 font-size: 200%;
202 width: 650px;
203 background-color: white;
204 }}
205 .rss li {{
206 line-height: normal;
207 }}
208 img {{
209 max-width: 600px;
210 }}
211 }}
212 </style>
213 <title>Today’s RSS</title>
214 <body>
215 <ul class="rss">
216 {}
217 </ul>
218 </body>
219 </html>
220 '''.format(ul)
The new stuff starts on Line 74, with the variable mnToday
being defined as a datetime
object set to midnight today with no assigned time zone. Then Lines 99–102 check to see if the publication date of an entry (the value of when
) matches mnToday
.
If it matches mnToday
, we assume the time portion has been set by default and should be considered somewhat fictional. We set it to midnight Central Time so it will be included in today’s feed list and then convert it to UTC, because that’s how all the other timestamps are handled. If the publication date of the entry doesn’t match mnToday
, it’s taken as a legitimate UTC time and not adjusted.
So now I have Eddie back, and as a bonus, my aggregator is more robust than it was before. I think I’ve said this before: I understand why Brent Simmons got out of the RSS parsing business.