A few tweet archive utilities

You’re probably sick of my posts about making a local tweet archive. So this one’s about querying that archive and finding the tweets that are in it. You can, of course, open your archive file up in a text editor and use its search tools, but often it’s more convenient to do these things from the command line.

Tweet count

Let’s start with a simple one. This shell script, which I call tweet-count, returns the number of tweets in the archive.


fgrep -- '- - - - -' ~/Dropbox/twitter/twitter.txt | wc -l | tr -d ' '

It’s just a pipeline using some classic Unix tools. fgrep is just like grep, only it treats the search string literally, not as a regular expression. If you recall from my earlier post, my tweet archive uses lines with five hyphens separated by spaces as the tweet separator. Because the search string starts with a hyphen, fgrep would otherwise try to read it as a command line switch, so I needed to put that double hyphen before the search string to tell fgrep that it was done processing switches.

The fgrep command returns all the separator lines, one for each tweet. That output gets piped to wc, which normally counts words, lines, and characters. The -l option tells wc to count only lines and return that number. This will be the number of tweets in the archive.

For reasons not entirely clear to me, wc indents its output. I didn’t want the tweet count to be indented, so I passed the output through one more command: tr. With the -d option, tr deletes characters in the input that match the given string. Here, I’m telling it to delete spaces.

Update 8/9/12
As Aristotle Pagaltzis pointed out in the comments, this version of tweet-count is unnecessarily complicated. There’s a -c switch for fgrep that returns the number of matches directly:

fgrep -c -- '- - - - -' ~/Dropbox/twitter/twitter.txt
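If you want to try both versions without touching the real archive, here is a small sketch that runs them against a throwaway file. The function names and the sample tweets are made up for illustration, and it assumes the separator layout described above. It spells the command `grep -F`, which is the same thing as fgrep.

```shell
#!/bin/sh
# Two ways to count tweets, wrapped in functions with made-up names.
# Each takes the archive path as its only argument.

# Pipeline version: print the separator lines, count them, strip wc's padding.
count_tweets_pipeline() {
  grep -F -- '- - - - -' "$1" | wc -l | tr -d ' '
}

# Short version: -c makes grep print the number of matching lines itself.
count_tweets() {
  grep -F -c -- '- - - - -' "$1"
}

# Throwaway archive with two invented tweets in the same layout.
sample=$(mktemp)
printf '%s\n' \
  'First made-up tweet.' 'August 1, 2012 at 9:00 AM' '- - - - -' '' \
  'Second made-up tweet.' 'August 2, 2012 at 9:30 AM' '- - - - -' '' \
  > "$sample"

count_tweets_pipeline "$sample"
count_tweets "$sample"
rm -f "$sample"
```

Both calls should print 2 for this sample, with no leading spaces in either case.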

Last tweet

The next command, last-tweet, returns the last tweet in the archive. I’m not sure I’ll be using this much in the future, but I used it a lot while I was debugging my tweet archiving script and making sure it was run periodically.


tail ~/Dropbox/twitter/twitter.txt | perl -n0777e '@a=split("- - - - -"); $a[-2] =~ s/^\s+//; print $a[-2]'

It starts by using the tail command to get the last ten lines of the archive, then passes that to a Perl one-liner that does the following:

  1. Slurps in everything tail sends it as a single string via the -n option and the -0777 option.
  2. Splits that string at the separator lines into a list of strings, @a. Because the archive has a separator line at the end, the last tweet will be the second to last item in the list, $a[-2].
  3. Strips the leading whitespace from the last tweet.
  4. Prints the cleaned-up last tweet.
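The same steps can be sketched without Perl. This is not the script I use, just an awk-based alternative under the assumption that every separator line is exactly five hyphens separated by spaces. The function name `last_tweet` and the sample tweets are invented. It reads the whole file instead of the last ten lines, so it doesn’t depend on the last tweet fitting inside what tail returns.

```shell
#!/bin/sh
# last_tweet is a hypothetical name; pass it the archive path.
last_tweet() {
  awk '
    # A separator line closes the tweet collected so far.
    $0 == "- - - - -" { last = buf; buf = ""; next }
    # Skip blank lines that come before a tweet starts.
    buf == "" && $0 == "" { next }
    # Any other line belongs to the current tweet.
    { buf = buf $0 "\n" }
    # After the last line, print the most recently closed tweet.
    END { printf "%s", last }
  ' "$1"
}

# Throwaway archive with two invented tweets.
sample=$(mktemp)
printf '%s\n' \
  'First made-up tweet.' 'August 1, 2012 at 9:00 AM' '- - - - -' '' \
  'Second made-up tweet.' 'August 2, 2012 at 9:30 AM' '- - - - -' '' \
  > "$sample"

last_tweet "$sample"
rm -f "$sample"
```

For the sample file, this prints the second tweet and its date line, without the separator.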

Running last-tweet now returns

Just got an email from Jason Quackenbush. Surprisingly, it’s neither spam nor Groucho Marx.
August 8, 2012 at 2:14 PM

Finding tweets

This script, called find-tweet, is the most useful. It’s a Perl script that runs through the archive, printing out the tweets that match the search string.

 1:  #!/usr/bin/perl
 2:  
 3:  $/ = "- - - - -\n";
 4:  
 5:  open TWEETS, '<', "$ENV{'HOME'}/Dropbox/twitter/twitter.txt" or die $!;
 6:  while (<TWEETS>) {
 7:    if (/$ARGV[0]/i) {
 8:      print $_;
 9:    }
10:  }

Line 3 sets Perl’s input record separator to the tweet separator string (with the trailing line break). This causes the while loop starting on Line 6 to read the file one tweet at a time. Line 7 tests whether the tweet contains the search string, and Line 8 prints it out (including the separator line) if it does. The i after the final slash in Line 7 specifies that the search should be case-insensitive, which is pretty much always what I want. It’s been so long since I was deeply into Perl regular expressions, I can’t remember whether an m or s would be helpful here, too. So far, I haven’t needed them.

Update 8/9/12
In the comments, Ricky showed me the three-argument version of open, which I’ve incorporated in the code above. I don’t remember running across it before; is it possible that version didn’t exist when I was learning Perl in the ’90s?

The search string passed to find-tweet will be treated as a regular expression, so care must be taken if periods, parentheses, or other special characters are part of the search. Because the shell also treats certain characters as special (semicolons and spaces, in particular), it’s probably best to wrap all but the simplest search strings in single quotes.
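If you’d rather not think about regular expression characters at all, a literal search is one way around them. The following is only a sketch of that idea, not part of find-tweet: the function name `find_tweet_literal`, the `NEEDLE` environment variable, and the sample tweets are all made up, and it assumes separator lines of exactly five hyphens separated by spaces. It uses awk’s index function, which looks for a plain substring, and lowercases both sides to keep the search case-insensitive.

```shell
#!/bin/sh
# find_tweet_literal is a hypothetical name.
# $1: search string, taken literally; $2: archive path.
find_tweet_literal() {
  # The string goes in through the environment so awk does not
  # process backslash escapes in it, as it would with -v.
  NEEDLE=$1 awk '
    BEGIN { needle = tolower(ENVIRON["NEEDLE"]) }
    # Collect every line, separator included, into the current tweet.
    { buf = buf $0 "\n" }
    # On a separator line, print the tweet if it contains the string.
    $0 == "- - - - -" {
      if (index(tolower(buf), needle) > 0) printf "%s", buf
      buf = ""
    }
  ' "$2"
}

# Throwaway archive with two invented tweets.
sample=$(mktemp)
printf '%s\n' \
  'Price is $5.00 (new).' 'August 1, 2012 at 9:00 AM' '- - - - -' '' \
  'Price is $5x00.' 'August 2, 2012 at 9:30 AM' '- - - - -' '' \
  > "$sample"

# Single quotes keep the shell away from the parentheses.
find_tweet_literal '(new)' "$sample"
rm -f "$sample"
```

Searching for '5.00' with this function matches only the first sample tweet, whereas a regular expression search would match both, because an unescaped period matches any character.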

Most of the time, though, I find that simple search strings are what I use. For example,

find-tweet pickup


Hit by a pickup while on my bike. Scraped up, bruised, and sore, but otherwise OK. Bike is OK, too. Wife, who saw it all, is a bit shaken.
April 25, 2009 at 9:09 AM
- - - - -

The pickup driver did stop and was ticketed. I rode home and am now at one of our endless Little League games.
April 25, 2009 at 10:39 AM
- - - - -

My one bit of satisfaction: the pickup that hit me was new, and I put a dent in its fender.
April 25, 2009 at 11:42 AM
- - - - -

You’ll note that the tweets themselves aren’t word-wrapped to make them easier to read. I thought about doing that but decided it wasn’t worth the effort. The Terminal wraps the results well enough.