September 22, 2016 at 7:15 PM by Dr. Drang
Although Jake Arrieta still has good looking statistics for the season, my sense is that he’s been a mediocre pitcher since the All Star break. Am I right about this? Because it’s easy to get game-by-game statistics for almost any aspect of baseball, I figured it wouldn’t be hard to answer this question. More important, I could use this as an opportunity to practice my Python, Matplotlib, and Pandas skills.
Modern, stats-loving baseball fans can probably come up with a dozen ways to measure the quality of a pitcher, but I wasn’t interested in cutting-edge sabermetric research, so I stuck with the pitching quality statistic of my youth, the earned run average. Remember, this is mostly an excuse to keep my Python skills sharp. Don’t write to tell me about your favorite pitching stat; I don’t care.
I started with a simple text file that looked like this:
    Date,IP,ER
    4/4,7.0,0
    4/10,7.0,3
    4/16,8.0,0
    4/21,9.0,0
    4/28,5.0,1
    etc.
Each line represents one game. The date is in month/day format, the earned runs are integers, and the innings pitched are in a hybrid decimal/ternary format common to baseball, with the full innings before the period and the partial innings after it. The full innings are regular decimal numbers, but the partials represent thirds of an inning. The number after the period can take on only the values 0, 1, or 2.
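For anyone who would rather do the conversion in code, here is a minimal sketch of turning that notation into ordinary decimal innings. The function name `ip_to_decimal` is hypothetical, and the sketch assumes the value arrives as a string such as "6.2" so the digit after the period can be read as a count of thirds rather than as tenths.

```python
def ip_to_decimal(ip: str) -> float:
    """Convert baseball innings-pitched notation (e.g. '6.2') to decimal innings."""
    # Split on the period: full innings before it, thirds of an inning after it.
    full, _, thirds_text = ip.partition(".")
    thirds = int(thirds_text) if thirds_text else 0
    if thirds not in (0, 1, 2):
        raise ValueError(f"partial innings must be 0, 1, or 2, got {ip!r}")
    return int(full) + thirds / 3


# Hypothetical values: a 7-inning start and one that ended after 6 2/3 innings.
print(ip_to_decimal("7.0"))
print(ip_to_decimal("6.2"))
```

Reading the field as text matters here: once "6.2" has been parsed as a float, the ".2" already means two tenths.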
Because I knew Pandas would look at the IP values and interpret them as regular decimal numbers, I decided to massage the input file first. Also, I figured it wouldn’t hurt to add the year to the game dates. After a couple of find-and-replaces in BBEdit, the file looked like this:
    Date,FIP,PIP,ER
    4/4/2016,7,0,0
    4/10/2016,7,0,3
    4/16/2016,8,0,0
    4/21/2016,9,0,0
    4/28/2016,5,0,1
    etc.
Now there are separate fields for the full innings and partial innings pitched. Time to fire up Jupyter and commence to analyzin’.
python:
    import pandas as pd
    from functools import partial
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (12,9)
This is mostly standard preamble stuff. The odd ones are the rcParams call, which makes the inline graphs bigger than the tiny Jupyter default, and the functools import, which will help us create ERAs over small portions of the season.
Next, we read in the data, tell Pandas how to interpret the dates, and create an innings pitched field in standard decimal form:
python:
    df = pd.read_csv('arrieta.txt')
    df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y')
    df['IP'] = df.FIP + df.PIP/3
Now we can calculate Jake’s ERA for each game.
python: df['GERA'] = df.ER/df.IP*9
To get his season ERA as the season develops, we need the cumulative numbers of innings pitched and earned runs given up. That turns out to be easy to do with the Pandas cumsum method:
python:
    df['CIP'] = df.IP.cumsum()
    df['CER'] = df.ER.cumsum()
    df['ERA'] = df.CER/df.CIP*9
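To check the arithmetic by hand, here is a small Pandas-free sketch that runs the same cumulative calculation on the first five starts from the sample file. It uses `itertools.accumulate` as a stand-in for cumsum; the list names are assumptions chosen to mirror the DataFrame columns.

```python
from itertools import accumulate

# First five starts from the sample file: innings pitched and earned runs.
ip = [7, 7, 8, 9, 5]
er = [0, 3, 0, 0, 1]

cip = list(accumulate(ip))  # running innings pitched, like df.IP.cumsum()
cer = list(accumulate(er))  # running earned runs, like df.ER.cumsum()

# Season ERA after each start: earned runs per nine innings to date.
era = [9 * runs / innings for runs, innings in zip(cer, cip)]

for innings, runs, value in zip(cip, cer, era):
    print(f"{innings:3d} {runs:2d} {value:.2f}")
```

After the fifth start that works out to 4 earned runs in 36 innings, an ERA of exactly 1.00.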
Let’s see how his ERA has developed:
python: plt.plot_date(df.Date, df.ERA, '-k', lw=2)
The rise in May, though steep, doesn’t indicate poor pitching. Arrieta just started the season out so well that even very good pitching looks bad in comparison. It’s the jump in late June and early July that looks really bad. And now that I think of it, the Cubs did have a terrible three-week stretch just before the All Star break. Arrieta’s performance was part of that, so his slide started earlier than I was thinking.
But even after the big jump, there’s been a slow climb. So even though the Cubs have played excellent ball since mid-July, Arrieta hasn’t. To get a better handle on how poorly he’s been pitching, we could plot the game-by-game ERAs, but that’s likely to be too jumpy to see any patterns. A way to smooth that out is to calculate moving averages.
But what’s the appropriate number of games to average over? Two is certainly too few, but three might work. Or maybe four. I decided to try both three and four. To do this, I defined a function that operates on a row of the DataFrame to create a running average of ERA over the last n games:
python:
    def rera(games, row):
        if row.name+1 < games:
            ip = df.IP[:row.name+1].sum()
            er = df.ER[:row.name+1].sum()
        else:
            ip = df.IP[row.name+1-games:row.name+1].sum()
            er = df.ER[row.name+1-games:row.name+1].sum()
        return er/ip*9
The if part handles the early portion of the season when there aren’t yet n games to average over. Looking at the code now, I realize this could have been simplified by eliminating the if/else and just using
python: max(0, row.name+1-games)
as the lower bound of the slice. Oh, well. I’ll leave my crude code here as a reminder to think more clearly next time.
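For the record, a minimal sketch of that simplification looks something like the following. To keep it self-contained it works on plain lists instead of the DataFrame, so the function name `running_era` and its arguments are assumptions, not the code used for the plots in this post.

```python
def running_era(ip, er, games):
    """ERA over the last `games` starts, using fewer starts early in the season."""
    result = []
    for i in range(len(ip)):
        # Clamp the lower bound at the first start instead of branching on i.
        start = max(0, i + 1 - games)
        innings = sum(ip[start:i + 1])
        runs = sum(er[start:i + 1])
        result.append(9 * runs / innings)
    return result


# First five starts from the sample file.
ip = [7, 7, 8, 9, 5]
er = [0, 3, 0, 0, 1]
print(running_era(ip, er, 3))
```

The first two entries average over one and two starts, and from the third start on the window is always three games wide.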
With this general function defined, we can use the partial function imported from functools to quickly define running average functions for three and four games and add those fields to the DataFrame.
python:
    era4 = partial(rera, 4)
    era3 = partial(rera, 3)
    df['ERA4'] = df.apply(era4, axis=1)
    df['ERA3'] = df.apply(era3, axis=1)
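If partial is unfamiliar, this toy sketch shows what it does: it fixes the first argument of a function and hands back a new function that takes the rest. The `show_args` function is a hypothetical stand-in for rera that just reports what it was called with.

```python
from functools import partial


def show_args(games, row):
    # Stand-in for rera: return the arguments so the binding is visible.
    return (games, row)


# Bind games=3; the resulting function takes only the row.
era3 = partial(show_args, 3)

print(era3("row for game 5"))
print(era3("row for game 5") == show_args(3, "row for game 5"))
```

That one-argument shape is what apply needs here, since it calls the function once per row with the row as the argument.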
Now we can plot everything:
python:
    plt.plot_date(df.Date, df.ERA3, '-b', lw=2)
    plt.plot_date(df.Date, df.ERA4, '-r', lw=2)
    plt.plot_date(df.Date, df.GERA, '.k', ms=10)
    plt.plot_date(df.Date, df.ERA, '--k', lw=2)
    plt.show()
As you can see, I didn’t bother to make this look nice. I just wanted to dump it all out and take a look at it. I don’t see much practical difference between the three-game average (blue) and the four-game average (red). Arrieta did have a good stretch there in late July and early August (which I hadn’t noticed), but he’s not been pitching well since then. It’s not unlike his early years with Baltimore. I’ll be curious to see where he’s put in the playoff rotation.
As is usually the case with my baseball posts, I’ve learned more about programming tools than I have about the game. I’ve used partial many times in the past, and I always feel like a real wizard when I do. But I’ve never used cumsum before, and I’m really impressed with it. Not only does it perform a useful function, it’s implemented in a way that couldn’t be easier to use for common cases like the ones here. I hope I don’t forget it.
Update Sep 24, 2016 9:33 PM
Of course I take full credit for Arrieta’s performance the day after this was posted. Jake obviously reads this blog, especially for the Python-related posts, and realized he needed to step up his game after seeing the clear evidence of how he’s gone downhill recently.