Pandas and Cubs

A few days ago, while looking at the major league standings, I felt certain the Cubs have been at a standstill since May. For the Cubs, this is an improvement—they spend most seasons in decline—but it’s less than what was expected of them according to several preseason predictions. I decided to use Pandas and Matplotlib to see how (and if) their season was progressing.

I last wrote about Matplotlib a couple of weeks ago, right after Apple disclosed its quarterly financials. Mike Williams looked over what I’d done and made a suggestion:

@drdrang 👍 did you try pandas? makes typical times series plots like this much easier (incl moving average and human-friendly date labels)
mikew (@williamsmj_) Jul 22 2015 2:44 PM

He also whipped out a quick IPython notebook and posted it on Gist. I don’t agree that Mike’s work is equivalent to mine—part of what I did was deliberately more complex than necessary because I wanted to practice certain data transformations, and Mike avoided those—but I do agree that Pandas would eliminate some of the drudgery of data import.

Pandas, in case you haven’t heard of it, is Wes McKinney’s Python module for data analysis. If you’re familiar with the statistics language R, you’ll recognize certain parts of Pandas. I have McKinney’s book, and I’ve used Pandas a few times at work for analyzing test results, but it’s not something I’ve used enough to feel completely comfortable with. I decided the Cubs problem would be a good one to practice on.

Here’s the plot I ended up with. I expanded the problem to include the Pirates and the Giants, the teams the Cubs have been competing with recently for the National League wildcard spots. I suppose I should’ve included the Mets, too, but I’m still pissed at the Mets for 1969, so to hell with them. You can click/tap on the plot to see a bigger version.

Wildcard plot

What we see here is that the Cubs have progressed a little bit over the past few months, but only a little bit. The Giants have been a streaky team in both directions. And the Pirates have been consistently good since pulling themselves out of the hole they dug in May.

The data for this year’s games came from Baseball Reference. I found that the more limited information in the mobile version of the site was easier to use. The Cubs data, for example, looked like this:

CHC 0 v. STL 3, Apr 5th
CHC 2 v. STL 0, Apr 8th
CHC 1 @ COL 5, Apr 10th
CHC 9 @ COL 5, Apr 11th
CHC 6 @ COL 5, Apr 12th
CHC 7 v. CIN 6, Apr 13th
CHC 2 v. CIN 3, Apr 14th
CHC 5 v. CIN 0, Apr 15th
etc.

Each line represents one game. I copied it out of the web page and pasted it into a BBEdit document. After a bit of manipulation, it looked like this:

R     A     O     W     D
0     0     XXX   X     Apr-1
0     3     STL   A     Apr-5
2     0     STL   A     Apr-8
1     5     COL   H     Apr-10
9     5     COL   H     Apr-11
6     5     COL   H     Apr-12
7     6     CIN   A     Apr-13
2     3     CIN   A     Apr-14
5     0     CIN   A     Apr-15
etc.

In the real file, called cubs-2015.txt, the columns are separated by tabs, whereas here I’ve separated them by enough spaces to look like tabs. The first line is for the column labels: Runs for, runs Against, Opponent, Where played, and Date. The second line is a fake game I put there for reasons I’ll explain later.

The script that generates the plot is this:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import numpy as np
 4:  import pandas as pd
 5:  import matplotlib.pyplot as plt
 6:  from matplotlib.ticker import MultipleLocator
 7:  import sys
 8:  
 9:  cubs = pd.read_table('cubs-2015.txt')
10:  cubs['G'] = np.cumsum(np.sign(cubs['R'] - cubs['A']))
11:  pirates = pd.read_table('pirates-2015.txt')
12:  pirates['G'] = np.cumsum(np.sign(pirates['R'] - pirates['A']))
13:  giants = pd.read_table('giants-2015.txt')
14:  giants['G'] = np.cumsum(np.sign(giants['R'] - giants['A']))
15:  
16:  fig, ax = plt.subplots(figsize=(12, 4))
17:  fig.set_tight_layout({'pad': 1.75})
18:  ax.plot(pirates['G'], color='#F8C633', linestyle='-', linewidth=3, label='Pirates')
19:  ax.plot(giants['G'], color='#EB5234', linestyle='-', linewidth=3, label='Giants')
20:  ax.plot(cubs['G'], color='#0B2B85', linestyle='-', linewidth=3, label='Cubs')
21:  
22:  # Tweak the layout.
23:  plt.title('2015 Wildcard')
24:  plt.xlabel('Game number')
25:  plt.ylabel('Games above .500')
26:  ax.grid(linewidth=1, which='major', color='#dddddd', linestyle='-')
27:  ax.set_axisbelow(True)
28:  plt.xlim((0, 162))
29:  ax.xaxis.set_major_locator(MultipleLocator(10))
30:  ax.xaxis.set_minor_locator(MultipleLocator(2))
31:  plt.ylim((-5, 20))
32:  ax.yaxis.set_major_locator(MultipleLocator(5))
33:  ax.yaxis.set_minor_locator(MultipleLocator(1))
34:  plt.legend(loc=(.04, .52))
35:  
36:  
37:  plt.savefig('20150802-Wildcard plot.png', format='png', dpi=100)

My use of Pandas is in Lines 9–14, where the data files are read in and the “Games above .500” value is calculated. Pandas reads the data into what it calls a DataFrame (R users are now nodding). From McKinney’s book:

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index.

Our DataFrames, which are given the team names, are created by the read_table commands on Lines 9, 11, and 13. Because read_table is smart, it figures out that the first line of each of the data files contains the names of the columns. The columns can then be referenced by cubs['R'], cubs['A'], cubs['O'], and so on.

The column indices start at zero; therefore the “first” game is assigned an index of zero. Because I wanted the index to represent the game number, and because I wanted each team to start at .500 after “Game 0,” I inserted a fake game, a 0–0 tie against opponent XXX, at the top of each data file.

Lines 10, 12, and 14 create a new column for each DataFrame: the number of games above .500. For each game, the runs against is subtracted from the runs scored, and the np.sign function is applied to the result, giving a +1 for a win and a –1 for a loss. Then the np.cumsum function is then applied to the series of positive and negative ones, and that gives us the number of games above .500. This new column is added to the DataFrame and assigned the index G. G is what we’re going to plot.

Lines 16–20 set up the figure and plot the data series. For fun, I used Acorn’s eyedropper tool to get the teams’ colors from their official MLB home pages.

Lines 23–34 format the plot. As I’ve mentioned before, I’m picky about how my plots look, and I’ve never been satisfied with the default layout of any plotting package. Default layouts are fine during the exploratory phase—there’s no point in tweaking plots that aren’t going to seen by anyone else—but not for final copy. Even the supposedly good ones like Seaborn need extra work. So I’m putting together a little template I can insert into every plotting script to set the limits, the ticks, the labels, etc. I’m not sure yet whether I want the template to be a TextExpander snippet or a BBEdit clipping, but whatever it turns out to be, Lines 23–34 are going to be the basis for it.

After I wrote out Lines 23–34, I realized that almost every line in it is analogous to a part of the preparatory work I used to have to do when I made plots by hand on graph paper. This is not a coincidence. A good plot needs attention to how it’s scaled and labeled.

Speaking of which, you might be wondering why I set the horizontal axis to run out to the full 162 game length of a season, even though the teams have played just over 100 games so far. Large empty areas are generally considered a mistake, but here I wanted the plot to give a sense of how much of the season is left.

Overall, I was happy with Pandas. For each team, I needed just one line to read in the file and one line to create the series I wanted to plot. That’s pretty damned efficient. I suspect that as I get more comfortable with it, I’ll be able to cut back on the file preparation and let Pandas parse more complicated input files.