Baseball as an excuse for programming

I find it really hard to watch baseball nowadays because the game moves so slowly, but I do still like to look at statistics and standings. The standings in Yahoo! Sports include a figure that was uncommon when I was a kid: the teams’ run differential, the difference between their runs scored and runs given up.

Final standings

What jumped out at me this year was how small the Orioles’ run differential was for a playoff-bound team. They ended up 24 games above .500 but outscored their opponents by only 7 runs over the course of the year. Contrast that with Tampa Bay, who had the third-highest run differential in the majors but will be watching the playoffs on TV.

I decided to see how well a team’s run differential predicts its final record. I copied the standings to a text file and deleted most of the columns, ending up with a file that looked like this:

NYY 95  67  .586  +136
BAL 93  69  .574  +7
TAM 90  72  .556  +120
TOR 73  89  .451  -68
BOS 69  93  .426  -72
DET 88  74  .543  +56
CHS 85  77  .525  +72
KAN 72  90  .444  -70
CLE 68  94  .420  -178
MIN 66  96  .407  -131
OAK 94  68  .580  +99
TEX 93  69  .574  +101
LAA 89  73  .549  +68
SEA 75  87  .463  -32
WAS 98  64  .605  +137
ATL 94  68  .580  +100
PHI 81  81  .500  +4
NYM 74  88  .457  -59
MIA 69  93  .426  -115
CIN 97  65  .599  +81
STL 88  74  .543  +117
MIL 83  79  .512  +43
PIT 79  83  .488  -23
CHC 61  101 .377  -146
HOU 55  107 .340  -211
SFG 94  68  .580  +69
LAD 86  76  .531  +40
ARI 81  81  .500  +46
SDP 76  86  .469  -59
COL 64  98  .395  -132

Because every team had a 162-game season I was able to use wins as a proxy for a team’s record. Here’s the plot of wins vs. run differential for all the teams, along with the regression line.

Wins vs. run differential

As expected, the Orioles are way above the regression line, but so are the Reds and the Indians (this is cold comfort for Cleveland fans). The plot also shows how Tampa Bay’s wins are indeed fewer than expected for its run differential, but the Cardinals and Diamondbacks got even less out of their run differentials. St. Louis, at least, got into the playoffs.
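To put numbers on "above" and "below" the line, here is a minimal sketch that computes each team's residual, meaning actual wins minus the wins the fitted line predicts. It is not part of the plotting script: it does the least-squares fit by hand in plain Python, and the names `DATA` and `residuals` are mine, chosen for illustration. The wins and run differentials are copied from the file above.

```python
# Team, wins, and run differential, copied from the standings file above.
DATA = """
NYY 95 +136  BAL 93 +7    TAM 90 +120  TOR 73 -68   BOS 69 -72
DET 88 +56   CHS 85 +72   KAN 72 -70   CLE 68 -178  MIN 66 -131
OAK 94 +99   TEX 93 +101  LAA 89 +68   SEA 75 -32
WAS 98 +137  ATL 94 +100  PHI 81 +4    NYM 74 -59   MIA 69 -115
CIN 97 +81   STL 88 +117  MIL 83 +43   PIT 79 -23   CHC 61 -146
HOU 55 -211  SFG 94 +69   LAD 86 +40   ARI 81 +46   SDP 76 -59
COL 64 -132
"""

# The tokens come in groups of three: team, wins, run differential.
tokens = DATA.split()
teams = tokens[0::3]
wins = [int(w) for w in tokens[1::3]]
rdiff = [int(r) for r in tokens[2::3]]

n = len(teams)
mean_rd = sum(rdiff) / n
mean_w = sum(wins) / n

# Ordinary least squares for wins = m * rdiff + b.
sxy = sum((rd - mean_rd) * (w - mean_w) for rd, w in zip(rdiff, wins))
sxx = sum((rd - mean_rd) ** 2 for rd in rdiff)
m = sxy / sxx
b = mean_w - m * mean_rd

# Residual: actual wins minus the wins predicted by the line.
residuals = {t: w - (m * rd + b) for t, w, rd in zip(teams, wins, rdiff)}

# List teams from most over-performing to most under-performing.
for team in sorted(residuals, key=residuals.get, reverse=True):
    print(f"{team} {residuals[team]:+6.1f}")
```

Sorting by residual puts the teams that won more than their run differential suggests at the top of the list and the ones that won fewer at the bottom.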

This graph was, as much as anything, an excuse for me to practice using the NumPy and Matplotlib Python packages. Here’s the code that generated the plot:

python:
 1:  #!/usr/bin/python
 2:  
 3:  from scipy import stats
 4:  from pylab import *
 5:  
 6:  # Read in the data.
 7:  mlb = loadtxt('mlb.txt', dtype=[('team', 'S3'), ('w', 'i'), ('l', 'i'), 
 8:                                   ('pct', 'f'),  ('rdiff', 'i')])
 9:  
10:  # Plot the data with invisible points.
11:  scatter(mlb['rdiff'], mlb['w'], s=0)
12:  xlabel('Run differential')
13:  ylabel('Wins')
14:  
15:  # Put team names at the data points.
16:  for (t, w, rd) in zip(mlb['team'], mlb['w'], mlb['rdiff']):
17:    text(rd, w, t, size=9,
18:         horizontalalignment='center', verticalalignment='center')
19:  
20:  # Perform the linear regression
21:  m, b, r, p, stderr = stats.linregress(mlb['rdiff'], mlb['w'])
22:  
23:  # Get endpoints of regression line and plot it.
24:  rdMin = min(mlb['rdiff'])
25:  wMin = m*rdMin + b
26:  rdMax = max(mlb['rdiff'])
27:  wMax = m*rdMax + b
28:  plot([rdMin, rdMax], [wMin, wMax])
29:  
30:  show()

As you can see in Line 4, I’m using the PyLab package, which is supposed to provide a combination of NumPy, SciPy, and Matplotlib. And as you can see in Line 3, I was still forced to import the stats library from SciPy to do the regression analysis in Line 21. I’m afraid I’m not experienced enough with these libraries to understand why that’s the case.

Overall, I think the script is pretty easy to understand, with only a couple of things that may need additional explanation.

Because the input file has a mix of string, integer, and floating point values, I had to specify on Line 7 how the loadtxt function was to interpret the fields.
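As a rough illustration of what that field specification amounts to, here is a plain-Python sketch that converts one row at a time. It is not how loadtxt works internally; `FIELDS` and `parse_line` are hypothetical names of my own, and the field names simply mirror the ones in the dtype list on Lines 7-8.

```python
# Field names and converters, in column order. These mirror the
# (name, type) pairs in the dtype list; they are not NumPy's own API.
FIELDS = [("team", str), ("w", int), ("l", int), ("pct", float), ("rdiff", int)]

def parse_line(line):
    """Split one whitespace-delimited standings row and convert each field."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return {name: convert(value)
            for (name, convert), value in zip(FIELDS, values)}

# Two rows copied from the standings file.
sample = ["NYY 95  67  .586  +136", "HOU 55  107 .340  -211"]
rows = [parse_line(line) for line in sample]

for row in rows:
    print(row["team"], row["w"], row["l"], row["pct"], row["rdiff"])
```

Python's int and float accept the leading plus sign and the bare leading decimal point, so the rows can be converted exactly as they appear in the file.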

The s=0 parameter in the scatter call on Line 11 sets the size of the markers to zero, rendering them invisible. I did this so I could then plot the team names at their respective points in Lines 16-18. I had hopes that the scatter function could be instructed to set the markers to the team names directly, but I couldn’t find a way to do that. Consider this two-part solution a workaround.
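One possible alternative, sketched here under some assumptions: Matplotlib's marker argument accepts a string wrapped in dollar signs and draws it as mathtext, so a team name can serve as the marker itself. A scatter call takes a single marker, though, so this version makes one call per team. The marker size of 500 is a guess that would need tuning, and only three teams from the file are included to keep the sketch short.

```python
import matplotlib.pyplot as plt

# A few rows from the standings file: (team, wins, run differential).
rows = [("NYY", 95, 136), ("BAL", 93, 7), ("TAM", 90, 120)]

fig, ax = plt.subplots()
for team, w, rd in rows:
    # scatter takes one marker per call, so each team gets its own call.
    # "$...$" is rendered as mathtext; \mathrm keeps the letters upright.
    ax.scatter(rd, w, marker=rf"$\mathrm{{{team}}}$", s=500, color="black")

ax.set_xlabel("Run differential")
ax.set_ylabel("Wins")

# plt.show()  # uncomment to display the figure
```

Whether this is any tidier than the invisible-points-plus-text approach is debatable, since it still loops over the teams.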

I could have spent more time formatting the axes and doing other beautification, but the defaults seemed decent enough. At this point in the learning curve, I’m more interested in plotting than formatting.