Some file exploration with Unix tools
September 26, 2014 at 12:09 AM by Dr. Drang
In Sunday’s post, I mentioned I’ve been having trouble with intermittent Internal Server Errors here for the past couple of months. On Tuesday, Seth Brown sent me this tweet:
@drdrang You may be aware of this, but your site appears to be down.
— Seth Brown (@DrBunsen) Sep 23 2014 2:22 PM
He wasn’t kidding. I logged into the server’s control panel and saw that the virtual memory had been pegged for about half an hour. After a little digging, I found that one IP number, 144.76.32.147, was continually making requests. After checking with whois to make sure that number didn’t belong to something like Google—which I wouldn’t want to block—I added this line to my .htaccess file:
deny from 144.76.32.147
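For the record, the whois check itself is just a one-liner; I haven’t included its output here:

whois 144.76.32.147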
The virtual memory use backed off and the site was soon accessible again. Crisis averted, but it did give me more impetus to shift from WordPress to a static blog and possibly to a new web host.
Today I decided to investigate Tuesday’s problem a little more. First, I found that the top ten pages for the day were goofy.
Sep 23, 2014 pages
1033 /all-this/
469 /all-this/tag/programming/
460 /all-this/tag/blogging/
413 /all-this/tag/python/
324 /all-this/tag/mac/
312 /all-this/tag/wordpress/
294 /all-this/2014/09/end-of-summer-miscellany/
250 /all-this/2014/09/an-anachronistic-survey-of-twitter-apps/
232 /all-this/tag/twitter/
207 /all-this/tag/text-editing/
18139 total
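A tally like that can also be pulled straight from the raw Apache access log. Using the month’s log file I talk about downloading below (saved as leancrew.com), something along these lines would do it. It’s only a sketch—it assumes the combined log format, with the timestamp in field 4 and the request path in field 7, and it counts every request rather than just page views, so the numbers wouldn’t match the report above exactly:

awk '$4 ~ /23\/Sep\/2014/ {print $7}' leancrew.com | sort | uniq -c | sort -rn | head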
I don’t think a single “tag” page has ever been in a day’s top ten, but on Tuesday there were seven. It looked like the hits from 144.76.32.147 were the result of a web spider run amok. I downloaded my Apache log file for the month and pulled out all the entries for that IP number.
awk '/144\.76\.32\.147/ {print}' < leancrew.com > spider.txt
Before opening the file in BBEdit, I checked how many entries there were with
wc -l spider.txt
and learned that 144.76.32.147 was responsible for over 25,000 hits. Since 25,000-line files are just an appetizer for BBEdit, I opened up spider.txt and looked at the details.
I learned that all of the hits from 144.76.32.147 had come on Tuesday. They started before 9:00 that morning and kept going until a little before 2:30, when I blocked access. There were intermittent 404 responses throughout the day, but they turned into mostly 500s shortly before 2:00. That matched up pretty well with what I’d learned about the site hitting the virtual memory limit.
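Observations like these fall out of spider.txt with a little more of the same kind of awk. A sketch, again assuming the combined log format, with the timestamp in field 4 and the status code in field 9:

awk '{print $4}' spider.txt | cut -d: -f2 | sort | uniq -c
awk '{print $9}' spider.txt | sort | uniq -c | sort -rn

The first line tallies hits by hour; the second tallies the response codes.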
The user agent string for all of these hits was the same:
Mozilla/5.0 (X11; compatible; semantic-visions.com crawler; HTTPClient 3.1)
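Confirming that was another one-liner. In the combined log format, the user agent is the sixth double-quote-delimited field, so something like this lists each distinct user agent string in the file along with how many times it appears:

awk -F'"' '{print $6}' spider.txt | sort | uniq -c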
That confirmed that the hits were coming from a web spider, one apparently written by the folks at semantic-visions.com, whose home page describes a big web-mining operation.
Well, that certainly looks like a group that crawls the web, but maybe what I was seeing was someone else misusing Semantic Visions’ software. I ran
dig semantic-visions.com +short
to get their IP number. It was 144.76.32.142. Not exactly the number that had been hitting me, but only five away in the last octet. Certainly part of Semantic Visions’ block of addresses. So Semantic Visions was the culprit.
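A reverse lookup on the crawler’s address would have been another way to check the connection, though I haven’t shown its output here:

dig -x 144.76.32.147 +short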
Further down on their home page are these claims:
We run an unrivaled web-mining system that provides an unprecedented view of what is going on in the world, at any moment.
and
We run an unrivaled web-mining system around a powerful Ontology which is unparalleled in its flexibility, reach and sophistication.
They certainly did an unprecedented mining of my site, hitting every page dozens or hundreds of times over the course of five to six hours. The runaway growth of their process, though, put me more in the mind of oncology than ontology.
Knowing that Semantic Visions is likely to be using a block of IP numbers, I rejiggered my search through the Apache log file,
awk '/144\.76\.32\./ {print}' < leancrew.com > semantic.txt
and found that they started crawling my site again today, using IP numbers from 144.76.32.132 through 144.76.32.148. Although their spider seemed to be well behaved today, I’ve blocked that whole range of numbers anyway. Fuck them and the ontology they rode in on.
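The block itself is just a variation on the earlier .htaccess line. Assuming the same deny from syntax, a single CIDR range covering 144.76.32.128 through 144.76.32.159 (and therefore everything from .132 to .148) does the job:

deny from 144.76.32.128/27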
Blocking Semantic Visions was fun, but it isn’t a long-term solution. As I thought today about building a static blogging system, I figured it’d be worthwhile to have a list of all my posts in chronological order. I already have the Markdown input for each post saved locally in a directory structure that looks like this:
yyyy/mm/slug.md
(The slug is a transformation of the post’s title that WordPress uses to generate the URL.)
Unfortunately, this doesn’t allow me to figure out the order of the posts within each month. But that information is in the header of each file. Here’s an example header:
Title: Oh, Domino!
Keywords: news
Date: 2005-09-02 23:05:12
Post: 32
Slug: oh-domino
Link: http://leancrew.com/all-this/2005/09/oh-domino/
Status: publish
Comments: 0
The Date and Slug lines are what I need to make the ordered list. In Terminal, I cd’d to the directory that contains all the year folders and ran this pipeline,
cat */*/*.md | awk '/^Date:/ {printf "%s %s ", $2, $3; split($2, d, "-")}; /^Slug:/ {printf "%s/%s/%s.md\n", d[1], d[2], $2}' | sort > all-posts.txt
which creates a file called all-posts.txt that’s filled with lines that look like this:
2014-09-12 00:04:03 2014/09/terminal-velocity.md
2014-09-13 21:25:40 2014/09/sgml-nostalgia.md
2014-09-16 08:31:54 2014/09/pcalc-construction-set.md
2014-09-16 20:34:06 2014/09/scipy-and-image-analysis.md
2014-09-19 09:14:45 2014/09/an-unexpected-decision.md
2014-09-21 21:39:02 2014/09/end-of-summer-miscellany.md
2014-09-24 00:47:09 2014/09/engineering-language.md
2014-09-24 09:37:28 2014/09/shortcut-stupidity.md
The cat command concatenates all 2,000-plus articles, which is about eight megabytes. Old-time Unix users may blanch at this, but we’re not living in the ’80s anymore—eight megabytes is nothing. This chunk of text is then fed to awk, which does the following:
- Extracts the date and time from the Date line.
- Prints them to stdout.
- Splits the components of the date and stores them in an array called d.
- Extracts the filename (sans extension) from the Slug line.
- Prints the year, month, and filename (with extension) separated by slashes to stdout. This is the path to the Markdown source file.
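If the one-liner looks too dense, the awk program can be spelled out in a script of its own. This is the same logic reformatted, with post-list.awk as a made-up name for the file:

#!/usr/bin/awk -f
# Print "date time path" for each post header.
/^Date:/ {
    printf "%s %s ", $2, $3    # e.g. "2005-09-02 23:05:12 "
    split($2, d, "-")          # d[1] = year, d[2] = month, d[3] = day
}
/^Slug:/ {
    printf "%s/%s/%s.md\n", d[1], d[2], $2    # e.g. "2005/09/oh-domino.md"
}

It gets called the same way as before:

cat */*/*.md | awk -f post-list.awk | sort > all-posts.txt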
At this point I have a chunk of text with a line for every post, but the lines aren’t in chronological order yet. Because of the way the date and time are formatted, though, lexical order is the same as chronological order, so a simple pass through sort does what I need. Finally, the text is saved to a file.
As I get further down the road on building my system, it may turn out that I don’t need this file. But at least I didn’t expend much effort in making it.
Sunday’s post prompted a few suggestions for a new web host. After the downtime on Tuesday, Seth suggested Amazon’s S3. I spent a little time today looking into it, and it looks like a good choice. The upside is reliability and scale; the downside is unfamiliarity. I’m used to configuring Apache and will have to learn some new skills to tweak an S3 site. But I think it’ll be worth it. I can’t imagine Semantic Visions taking down S3.