Factories
August 27, 2015 at 11:39 PM by Dr. Drang
Since writing this post about using Pandas and Matplotlib to plot the progression of the National League’s wildcard race, I’ve been periodically updating the input text files and remaking the graph. Here’s what it looked like as of this morning:
As I mentioned in that earlier post, I get the data for each game by copying and pasting from the mobile version of the Baseball Reference site. The game results come in looking like this:
CHC 7 v. ATL 1, Aug 20th
CHC 5 v. ATL 3, Aug 21st
CHC 9 v. ATL 7, Aug 22nd
CHC 9 v. ATL 3, Aug 23rd
CHC 2 v. CLE 1, Aug 24th
CHC 8 @ SFG 5, Aug 25th
CHC 2 @ SFG 4, Aug 26th
and I need to transform them into this:
7 1 ATL H Aug-20
5 3 ATL H Aug-21
9 7 ATL H Aug-22
9 3 ATL H Aug-23
2 1 CLE H Aug-24
8 5 SFG A Aug-25
2 4 SFG A Aug-26
with tabs between each column. When I first did this transformation for that post, I just ran through a few manipulations using BBEdit’s find and replace tools, kind of making it up as I went along. But when I found myself in the habit of updating the graph a couple of times a week, I realized I needed something more automated. The quickest way to get what I wanted was to build a Text Factory.
Text Factories are a BBEdit feature that I don’t use as often as I should. They consist of a simple list of transformations (replacements, deletions, case changes, entabbing/detabbing, etc.) that are applied one after another to either the selected text (if there is any) or the document as a whole. Their genius lies in their simplicity: there’s nothing a Text Factory can do that a Perl, Python, or Ruby script couldn’t do, but for a simple series of transformations, Text Factories take less time to build and debug.
Here’s my Text Factory for transforming the Baseball Reference game data:
The first step is the trickiest. It’s a Grep (regular expression) that searches for
^[A-Z]{3}\s+(\d+)\s+([^ ]+)\s+([A-Z]{3})\s+(\d+),\s+([A-Za-z]+)\s(\d+)(st|nd|rd|th)$
and replaces it with
\1\t\4\t\3\t\2\t\5-\6
By itself, it turns lines like
CHC 2 v. CLE 1, Aug 24th
CHC 8 @ SFG 5, Aug 25th
into
2 1 CLE v. Aug-24
8 5 SFG @ Aug-25
The regex search pattern is probably longer than it needs to be because I’ve added some defensive features, like using \s+ as field separators even though every example I’ve seen separates the parts with just a single space character. I’ve had enough experience with regexes breaking because of an unexpected extra space to know that sacrificing brevity for robustness is a worthwhile tradeoff.
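For anyone who wants to try the pattern outside BBEdit, here is a minimal sketch of that first step using Python's re module. The names SEARCH, REPLACE, and grep_step are made up for illustration, and the second sample line has extra spaces added by hand to show what the \s+ separators buy; it is not real Baseball Reference output.

```python
import re

# The search pattern from the first step, compiled with MULTILINE so that
# ^ and $ anchor at each line of a multi-line block of game results.
SEARCH = re.compile(
    r"^[A-Z]{3}\s+(\d+)\s+([^ ]+)\s+([A-Z]{3})\s+(\d+),"
    r"\s+([A-Za-z]+)\s(\d+)(st|nd|rd|th)$",
    re.MULTILINE,
)

# Same replacement as in the post: score, opponent score, opponent,
# the "v." or "@" marker, and month-day, separated by tabs.
REPLACE = r"\1\t\4\t\3\t\2\t\5-\6"


def grep_step(text):
    """Apply only the first (Grep) step to a block of game lines."""
    return SEARCH.sub(REPLACE, text)


# The second line has doubled spaces (a made-up case) to exercise \s+.
sample = "CHC 2 v. CLE 1, Aug 24th\nCHC  8 @  SFG 5,  Aug 25th"
print(grep_step(sample))
```

Both lines come out tab-separated, still carrying the “v.” and “@” markers, which is the state the text is in before the remaining steps run.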
With the hardest part done, the following two steps change the “v.” into “H” and the “@” into “A.” That completes the transformation and makes the input files ready for the plotting script.
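The whole chain of three steps could be sketched in Python roughly as follows. This is an approximation, not an export of the Text Factory: the STEPS list and run_factory name are hypothetical, and the second and third patterns are my assumption about how the “v.” and “@” replacements might be written, anchored on the surrounding tabs so they can't touch anything else.

```python
import re

# Each entry is (search pattern, replacement), applied in order,
# much like the rows of a Text Factory.
STEPS = [
    # Step 1: the Grep pattern and replacement given in the post.
    (
        r"^[A-Z]{3}\s+(\d+)\s+([^ ]+)\s+([A-Z]{3})\s+(\d+),"
        r"\s+([A-Za-z]+)\s(\d+)(st|nd|rd|th)$",
        r"\1\t\4\t\3\t\2\t\5-\6",
    ),
    # Step 2 (assumed form): "v." between tabs marks a home game.
    (r"\tv\.\t", r"\tH\t"),
    # Step 3 (assumed form): "@" between tabs marks an away game.
    (r"\t@\t", r"\tA\t"),
]


def run_factory(text):
    """Run every step over the text, one after another."""
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


games = "CHC 7 v. ATL 1, Aug 20th\nCHC 8 @ SFG 5, Aug 25th"
print(run_factory(games))
```

Keeping the steps in a plain list mirrors the Text Factory idea: adding or reordering a transformation means editing one entry rather than restructuring the code.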
The downside of using a Text Factory is that it locks me to BBEdit. I can’t do the transformation from the command line or use it as part of some longer pipeline, as I could if I’d written it as a Python script. For this transformation, that loss of flexibility is no big deal, as I don’t expect to be making this plot much longer. Rewriting it as a regular script isn’t worth the effort.
What’s kept me interested the past few weeks has been the tremendous run of success the Cubs have had recently, but I know myself and the Cubs too well to expect that to continue.
- I was driving across Ohio on October 14, 2003, screaming at the radio in my rental car when the Steve Bartman incident occurred and the Cubs self-destructed, losing a game they had well in hand and fumbling away their chance to go to the World Series.
- I was in graduate school, watching every game on TV in 1984 when the Cubs blew a 2–0 series lead in a best-of-five playoff to the Padres, capping it off with Leon Durham’s error in Game 5.
- I was a young and impressionable fan in 1969, when the Cubs underwent the most celebrated collapse in baseball history, going from a 9½-game lead on the Mets in mid-August to 8 games behind them by the end of the season.
So I’m enjoying this August but waiting for the inevitable disgust to set in. When it does, the Text Factory will be shuttered and its regular expressions laid off.