Cleaning and graphing baseball data

If any of you followed the link in yesterday’s post to my old writeup from a decade ago, you noticed that the scripts I wrote back then used the csv and cPickle libraries to read, filter, and store the baseball game data. Now I find it more natural to use pandas for all of that, so I wrote all new code instead of trying to edit the old stuff. This post goes through the steps.

Baseball game durations

First, I downloaded all the game data from Retrosheet. Back in 2012, I had to download each year’s data individually; yesterday, I was able to get every year in a single zip file called gl1871_2022.zip. Unzipping this gave me a text file for each year, with names like gl1871.txt. Also, there were files for playoff, World Series, and All-Star games.

I wanted to look at only regular season games from the “modern era.” The modern era could have lots of definitions, but I chose to start with 1920, the year Babe Ruth began playing with the Yankees. So I deleted all the files except gl1920.txt through gl2022.txt.

These are CSV files without a header line. Each line represents a game, and there are 161 fields in each line; Retrosheet has an index that describes the fields. The files have DOS/Windows (CRLF) linebreaks, so I started off by converting them to Unix (LF) linebreaks with

dos2unix gl*.txt

You can find the dos2unix utility in many places on the internet. I installed it via Homebrew.

Strictly speaking, this conversion wasn’t necessary, as none of my subsequent manipulations required Unix linebreaks, but I did it anyway because it’s generally safer to have Unix-style files when you’re going to work in a Unix environment.
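If dos2unix isn’t handy, the same conversion can be done in a few lines of Python. This is just a sketch of the idea, not the command I ran:

python:
from pathlib import Path

# Rough equivalent of dos2unix for the game logs: read each file as
# bytes and replace CRLF line endings with LF
for path in Path('.').glob('gl*.txt'):
    path.write_bytes(path.read_bytes().replace(b'\r\n', b'\n'))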

I concatenated all files together:

cat gl*.txt > gl.csv

This gave me a 184 MB CSV file with no header line. There are command-line tools for manipulating and filtering such files, but I decided to use pandas because that’s what I’m most familiar with. I opened a Jupyter console session and ran these commands:

python:
import pandas as pd
df = pd.read_csv('gl.csv', header=None, parse_dates=[0], usecols=[0, 3, 4, 6, 7, 9, 10, 18])
df.to_csv('games.csv', index=False)

The second line imported just the fields I wanted into the dataframe. In order, they were the date (0), visiting team (3), visiting team league (4), home team (6), home team league (7), visiting team score (9), home team score (10), and game duration in minutes (18).1 If you compare these numbers to the Retrosheet index, you’ll see that mine are one less than the index’s. That’s because pandas starts its numbering at zero instead of one.
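A quick look at the dataframe in the same session is an easy way to confirm the columns came through as expected:

python:
# Sanity checks on the trimmed dataframe
print(df.shape)    # one row per game, eight columns
print(df.dtypes)   # the date column should show up as datetime64
print(df.head())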

With this done, I had a new CSV file called games.csv that was a lot smaller, only 6.1 MB. This file had a header line of

0,3,4,6,7,9,10,18

which isn’t very useful. I opened it in BBEdit and edited that line to be

Date,VTeam,VLeague,HTeam,HLeague,VScore,HScore,Time
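In hindsight, I could have skipped the hand edit by naming the columns in pandas before writing out games.csv. Something along these lines should produce the same file with the header already in place:

python:
import pandas as pd

# Same import as before, but with named columns, so games.csv
# gets a useful header line from the start
df = pd.read_csv('gl.csv', header=None, parse_dates=[0],
                 usecols=[0, 3, 4, 6, 7, 9, 10, 18])
df.columns = ['Date', 'VTeam', 'VLeague', 'HTeam', 'HLeague',
              'VScore', 'HScore', 'Time']
df.to_csv('games.csv', index=False)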

To get the statistics I wanted to plot, I opened a new Jupyter console session and ran these commands:

python:
1:  import pandas as pd
2:  from scipy.stats import scoreatpercentile
3:  df = pd.read_csv('games.csv', parse_dates=[0])
4:  df['Year'] = df.Date.dt.year
5:  for y in df.Year.unique():
6:    p25 = scoreatpercentile(df.Time[df.Year==y], 25)
7:    p50 = scoreatpercentile(df.Time[df.Year==y], 50)
8:    p75 = scoreatpercentile(df.Time[df.Year==y], 75)
9:    print(f'{y},{p25:.2f},{p50:.2f},{p75:.2f}')

The dt.year property gets the year from the date. The scoreatpercentile function returns the value in the given list at the given percentile. So the loop in Lines 5–9 goes through each year, determines the 25th, 50th, and 75th percentile values for all the game durations of that year, and prints out the result in this form:

1920,99.00,109.00,120.00
1921,100.00,111.00,125.00
1922,100.00,110.50,124.00
1923,102.00,112.00,125.00
1924,101.00,112.00,125.00
1925,102.00,114.00,127.00
<etc>

I copied the output, saved it to a file called gametimes.csv, and added this header line:

Year,Q1,Median,Q3

This is the data I wanted to plot.
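As an aside, pandas can compute these quartiles directly with groupby and quantile, no loop or scipy needed. Something like this should produce essentially the same table:

python:
import pandas as pd

# Per-year quartiles of game duration, computed with groupby
df = pd.read_csv('games.csv', parse_dates=[0])
df['Year'] = df.Date.dt.year
q = df.groupby('Year').Time.quantile([.25, .5, .75]).unstack()
q.columns = ['Q1', 'Median', 'Q3']
q.to_csv('gametimes.csv')    # the Year index becomes the first column

Both methods default to linear interpolation between values, so the numbers should come out the same.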

I didn’t trust myself to write the plotting code interactively, so I created this gametime-plot.py file:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import pandas as pd
 4:  import matplotlib.pyplot as plt
 5:  
 6:  # Import game time data
 7:  df = pd.read_csv('gametimes.csv')
 8:  
 9:  # Create the plot with a given size in inches
10:  fig, ax = plt.subplots(figsize=(6, 4))
11:  
12:  # Add the interquartile range and the median
13:  plt.fill_between(df.Year, df.Q1, df.Q3, alpha=.25, linewidth=0, color='#0066ff')
14:  ax.plot(df.Year, df.Median, '-', color='black', lw=2)
15:  
16:  # Gridlines and ticks
17:  ax.grid(linewidth=.5, axis='x', which='major', color='#bbbbbb', linestyle='-')
18:  ax.grid(linewidth=.5, axis='y', which='major', color='#bbbbbb', linestyle='-')
19:  ax.tick_params(which='both', width=.5)
20:  
21:  # Title and axis labels
22:  plt.title('Baseball game durations')
23:  plt.xlabel('Year')
24:  plt.ylabel('Minutes per game')
25:  
26:  # Save as PNG
27:  plt.savefig('20230227-Baseball game durations.png', format='png', dpi=200)
28:  

I think this is pretty straightforward. Normally, my plotting scripts include commands for setting the tickmark placement, but in this case matplotlib’s defaults were just fine. The main work is done in Lines 13–14, where the fill_between shades in the interquartile range and plot runs a thick black line along the median.
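For the record, tick placement is typically just a few locator calls; something along these lines (spacings chosen purely for illustration) would do it:

python:
from matplotlib.ticker import MultipleLocator

# Hypothetical spacings: major ticks every 20 years and 30 minutes,
# minor ticks every 5 years and 10 minutes
ax.xaxis.set_major_locator(MultipleLocator(20))
ax.xaxis.set_minor_locator(MultipleLocator(5))
ax.yaxis.set_major_locator(MultipleLocator(30))
ax.yaxis.set_minor_locator(MultipleLocator(10))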

I could’ve combined the code that calculates the percentiles with the code that does the plotting into a single script, but for a one-off like this, I usually prefer to take it one step at a time so I can check the intermediate results. Even if I had combined all the code here into a single script, it would have been simpler and easier to read than what I did back in 2012. That’s all due to pandas.


  1. Most of these fields were unnecessary for what I was plotting, but I kept them in case I wanted to do some other analyses.