Cleaning and graphing baseball data

If any of you followed the link in yesterday’s post to my old writeup from a decade ago, you noticed that the scripts I wrote back then used the csv and cPickle libraries to read, filter, and store the baseball game data. Now I find it more natural to use pandas for all of that, so I wrote all new code instead of trying to edit the old stuff. This post goes through the steps.

Baseball game durations

First, I downloaded all the game data from Retrosheet. Back in 2012, I had to download each year’s data individually; yesterday, I was able to get every year in a single zip file. Unzipping this gave me a text file for each year, with names like gl1871.txt. There were also separate files for playoff, World Series, and All-Star games.

I wanted to look at only regular season games from the “modern era.” The modern era could have lots of definitions, but I chose to start with 1920, the year Babe Ruth began playing with the Yankees. So I deleted all the files except gl1920.txt through gl2022.txt.

These are CSV files without a header line. Each line represents a game, and there are 161 fields in each line; Retrosheet has an index that describes the fields. The files have DOS/Windows (CRLF) linebreaks, so I started off by converting them to Unix (LF) linebreaks with

dos2unix gl*.txt

You can find the dos2unix utility in many places on the internet. I installed it via Homebrew.

Strictly speaking, this conversion wasn’t necessary, as none of my subsequent manipulations required Unix linebreaks, but I did it anyway because it’s generally safer to have Unix-style files when you’re going to work in a Unix environment.

I concatenated all files together:

cat gl*.txt > gl.csv

This gave me a 184 MB CSV file with no header line. There are command-line tools for manipulating and filtering such files, but I decided to use pandas because that’s what I’m most familiar with. I opened a Jupyter console session and ran these commands:

import pandas as pd
df = pd.read_csv('gl.csv', header=None, parse_dates=[0], usecols=[0, 3, 4, 6, 7, 9, 10, 18])
df.to_csv('games.csv', index=False)

The second line imported just the fields I wanted into the dataframe. In order, they were the date (0), visiting team (3), visiting team league (4), home team (6), home team league (7), visiting team score (9), home team score (10), and game duration in minutes (18).1 If you compare these numbers to the Retrosheet index, you’ll see that my numbers are one less than the index’s. That’s because pandas starts its numbering at zero instead of one.
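The zero-based numbers also become the column labels when you read with header=None, which is what ends up in the header of the file written by to_csv. Here’s a minimal sketch with an invented row (the field values are made up, not real Retrosheet data):

```python
import io
import pandas as pd

# A headerless stand-in for a Retrosheet game line (values invented)
csv_text = "20220407,0,BOS,AL,1,NYA,AL,1,0,1,3,x\n"
df = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[0, 3, 5])

# With header=None, the column labels are the zero-based positions
print(list(df.columns))  # [0, 3, 5]
```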

With this done, I had a new CSV file called games.csv that was a lot smaller, only 6.1 MB. This file had a header line of

0,3,4,6,7,9,10,18

which isn’t very useful. I opened it in BBEdit and edited that line to be

Date,Visitor,VLeague,Home,HLeague,VScore,HScore,Time
To get the statistics I wanted to plot, I opened a new Jupyter console session and ran these commands:

1:  import pandas as pd
2:  from scipy.stats import scoreatpercentile
3:  df = pd.read_csv('games.csv', parse_dates=[0])
4:  df['Year'] = df.Date.dt.year
5:  for y in df.Year.unique():
6:    p25 = scoreatpercentile(df.Time[df.Year==y], 25)
7:    p50 = scoreatpercentile(df.Time[df.Year==y], 50)
8:    p75 = scoreatpercentile(df.Time[df.Year==y], 75)
9:    print(f'{y},{p25:.2f},{p50:.2f},{p75:.2f}')

The dt.year property gets the year from the date. The scoreatpercentile function returns the value in the given list at the given percentile. So the loop in Lines 5–9 goes through each year, determines the 25th, 50th, and 75th percentile values for all the game durations of that year, and prints the results as comma-separated lines: the year followed by the three percentiles, each with two decimal places.

I copied the output, saved it to a file called gametimes.csv, and added this header line:

Year,Q1,Median,Q3

This is the data I wanted to plot.
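For what it’s worth, the loop above could be collapsed into a single pandas expression with groupby and quantile. This is just a sketch with a few made-up game times, not the code I ran:

```python
import pandas as pd

# A handful of invented game durations to stand in for the real data
df = pd.DataFrame({'Year': [1920, 1920, 1920, 1921, 1921, 1921],
                   'Time': [100, 110, 120, 105, 115, 125]})

# One row per year, one column per percentile
q = df.groupby('Year').Time.quantile([0.25, 0.5, 0.75]).unstack()
print(q)
```

The unstack turns the percentile level of the index into columns, so `q.loc[1920, 0.5]` is the 1920 median.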

I didn’t trust myself to write the plotting code interactively, so I created this file:

 1:  #!/usr/bin/env python3
 2:  
 3:  import pandas as pd
 4:  import matplotlib.pyplot as plt
 5:  
 6:  # Import game time data
 7:  df = pd.read_csv('gametimes.csv')
 8:  
 9:  # Create the plot with a given size in inches
10:  fig, ax = plt.subplots(figsize=(6, 4))
11:  
12:  # Add the interquartile range and the median
13:  plt.fill_between(df.Year, df.Q1, df.Q3, alpha=.25, linewidth=0, color='#0066ff')
14:  ax.plot(df.Year, df.Median, '-', color='black', lw=2)
15:  
16:  # Gridlines and ticks
17:  ax.grid(linewidth=.5, axis='x', which='major', color='#bbbbbb', linestyle='-')
18:  ax.grid(linewidth=.5, axis='y', which='major', color='#bbbbbb', linestyle='-')
19:  ax.tick_params(which='both', width=.5)
20:  
21:  # Title and axis labels
22:  plt.title('Baseball game durations')
23:  plt.xlabel('Year')
24:  plt.ylabel('Minutes per game')
25:  
26:  # Save as PNG
27:  plt.savefig('20230227-Baseball game durations.png', format='png', dpi=200)

I think this is pretty straightforward. Normally, my plotting scripts include commands for setting the tickmark placement, but in this case matplotlib’s defaults were just fine. The main work is done in Lines 13–14, where the fill_between shades in the interquartile range and plot runs a thick black line along the median.

I could’ve combined the code that calculates the percentiles with the code that does the plotting into a single script, but for a one-off like this, I usually prefer to take it one step at a time so I can check the intermediate results. Even if I had combined all the code here into a single script, it would have been simpler and easier to read than what I did back in 2012. That’s all due to pandas.

  1. Most of these fields were unnecessary for what I was plotting, but I kept them in case I wanted to do some other analyses. 

Revisiting baseball game durations

This morning, I learned from Paul Kafasis that major league baseball will be going to a pitch clock, something they tried out in the minor leagues last year. The idea, of course, is to speed up our interminable national pastime.

The story reminded me of a post I wrote about a decade ago in which I plotted the generally upward trend of baseball game durations over about 90 years. I grabbed a new set of data from Retrosheet and made this updated plot:

Baseball game durations

The last decade has not been kind to fans’ tushies. The black line is the median game length, and the blue zone is the interquartile range, which runs from the 25th percentile to the 75th percentile. Games have gotten 10–12 minutes longer in the past 10 years.

Paul quotes a Washington Post article about how the pitch clock

has reduced the average [minor league] game time from 3 hours 4 minutes in 2021 to 2:36 in 2022

My first thought was How can minor league games possibly last as long as major league games? Isn’t that a violation of the Geneva Convention? But then I remembered that minor league teams put on a lot of non-baseball show between innings. Myron Noodleman (RIP) ate up a lot of time.

I wish baseball luck. A half-hour drop would bring the game back down to the running time it had when I was a fan. I think that’s quite optimistic, but we’ll revisit this graph at the end of the season and see how it went.


I left something out of yesterday’s post. Another mathematical aspect of Underscore David Smith’s unitSquareIntersectionPoint function is this set of lines at the top of the function:

var normalizedDegree = angle.degrees
while normalizedDegree > 360.0 {
    normalizedDegree -= 360.0
}
while normalizedDegree < 0.0 {
    normalizedDegree += 360.0
}

There are lots of ways to normalize an angle. Here, UDS wants to take the input angle and turn it into the equivalent angle between 0° and 360°. If angle is already in that range, neither of the while loops is entered and normalizedDegree remains equal to the input angle.

You may recognize this as an example of modular arithmetic extended to include floating point numbers. SSteve did, and UDS included their code that makes that connection more explicit:

var normalizedDegree = angle.degrees % 360;
normalizedDegree = normalizedDegree < 0 ? normalizedDegree + 360 : normalizedDegree;

Unfortunately, SSteve wasn’t able to use Swift’s % operator alone, because it returns a value with a sign that matches the sign of the dividend. Thus the second line with the ternary operator was needed to ensure a positive result.

Sign ambiguity can be a nasty problem in modular arithmetic because different programming languages treat modular arithmetic differently. If I had to normalize the input angle in my xy function to get it between 0° and 360°—which I don’t because the trig functions handle angles outside that range correctly—I’d be able to use

normalizedAngle = angle % 360

in Python without further adjustment. Python uses a floored division definition of %, so it always returns a value with the same sign as the divisor—in this case, +360.

When I read Underscore’s first post, I didn’t know whether Swift even had a modulo operator or how it worked. I went to the Wikipedia page looking for an answer and found three:

  1. The % operator, which uses truncated division.
  2. The truncatingRemainder(dividingBy:) method, which also uses truncated division.
  3. The remainder(dividingBy:) method, which uses rounded division.

With truncated division, the result takes the sign of the dividend, so negative angles stay negative; with rounded division, the result can be negative even when both arguments are positive. So none of these can be used without an adjustment.

Modular arithmetic doesn’t come up often in my work, but when it does, I get nervous about the sign of the result. I’ve programmed in too many languages over the years to remember what type of division is used for the modulo operator in the language I’m currently writing in. Even the language I use the most, Python, has different ways of doing modulo arithmetic. Many different ways: the % operator and divmod use floored division; math.fmod uses truncated division; math.remainder uses rounded division; and NumPy offers both mod (floored) and fmod (truncated).

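Here’s a quick demonstration of the disagreement, using an angle of −30°, which should normalize to 330°:

```python
import math

angle = -30

# Floored division: sign follows the divisor, so this normalizes correctly
print(angle % 360)                 # 330

# Truncated division: sign follows the dividend
print(math.fmod(angle, 360))       # -30.0

# Rounded division: remainder of smallest magnitude
print(math.remainder(angle, 360))  # -30.0
```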
This is the sort of thing that makes you think Underscore was right in writing his own modular arithmetic code. It may take up more space, but at least the way it works is obvious.

Underscore David Smith and linearization for large angles

Underscore David Smith wrote a post earlier this week that I had trouble with. Part of the trouble was that the code he wrote was in Swift, but that was a minor problem, as UDS’s code was clear enough to understand even though I don’t know Swift. The bigger problem was how he managed to transform an angle into a pair of xy coordinates without using any trigonometric functions. Once I figured it out, I saw that it was an interesting example of linearization without using the typical small angle approximation. So I decided to write this post.

Unsurprisingly, other people had the same trouble with UDS’s code, and this morning he published an updated post with their “corrections” to his transformation function. In some sense, these truly are corrections in that they provide a more accurate transformation. But UDS’s function is pretty accurate, probably accurate enough for his purposes, so they’re not necessarily correcting a mistake. Still, they presented some interesting ways of performing the transformation, and I’ll comment on them at the end of this post.

Let’s summarize Underscore’s problem. He wants to use Swift’s LinearGradient fill style for an arbitrary angle. Unfortunately, LinearGradient won’t accept an angle as its argument. It will, however, take a UnitPoint argument. A UnitPoint is a pair of coordinates on the perimeter of the unit square shown below,

Underscore's unit box

where I’ve also shown the angle that UDS wants to use as his input.

Getting \(x\) and \(y\) from \(\theta\) is a fairly straightforward trig problem, although there’s some annoying bookkeeping necessary to deal with the corners. With the hope of avoiding that bookkeeping, I decided to look into conformal mapping from a disc onto a square. I didn’t have my copy of Churchill’s book handy, so I did some Googling and ran across this nice writeup by Chamberlain Fong. Fong shows several ways of mapping discs onto rectangles. Most important, he shows that conformal mapping (which can be very nasty mathematically) is unnecessary for our problem—we can use what he calls the Simple Stretching transformation.

Disc-box transformation from Fong

As you can see, the angles in the box are the same as those in the disc, which we’ll need to do what UDS wants. The transformation formulas Fong gives for this mapping (on p. 3 of his paper) work over the entire disc and square. We can specialize them for just the perimeter:

\[\xi = \mathrm{sgn}(u), \quad \eta = \mathrm{sgn}(u) \frac{v}{u} \qquad \mathrm{for}\, u^2 \ge v^2\] \[\xi = \mathrm{sgn}(v)\frac{u}{v}, \quad \eta = \mathrm{sgn}(v) \qquad \mathrm{for}\, u^2 \lt v^2\]

Here, \(\mathrm{sgn}\) is the signum or sign function—it returns \(+1\) if the argument is positive, \(-1\) if the argument is negative, and zero if the argument is zero. The coordinate systems are shown below:

Origin-centered circle and square

So what we have now is a series of transformations. First, we go from \(\theta\) to \((u, v)\):

\[u = \cos \theta\] \[v = \sin \theta\]

Then we go from \((u, v)\) to \((\xi, \eta)\) using the formulas above. Finally, we go from \((\xi, \eta)\) to \((x, y)\):

\[x = \frac{1 + \xi}{2}\] \[y = \frac{1 - \eta}{2}\]

This last one is, I think, pretty obvious. We flip the vertical coordinate, then squeeze the 2×2 box into 1×1 and move its center from the origin to \((1/2, 1/2)\).

You might be looking at this and thinking the bookkeeping I wanted to avoid would be easier than cascading three coordinate transformations like this. I guess it depends on what you’re used to. To me, putting one transformation after another after another is easier to remember than keeping track of which side of the box you’re on.
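As a worked example, take \(\theta = 30°\). Then

\[u = \cos 30° = 0.866, \qquad v = \sin 30° = 0.5\]

Since \(u^2 \ge v^2\), the first pair of formulas applies:

\[\xi = \mathrm{sgn}(u) = 1, \qquad \eta = \mathrm{sgn}(u)\,\frac{v}{u} = 0.577\]

and the final step gives

\[x = \frac{1 + 1}{2} = 1, \qquad y = \frac{1 - 0.577}{2} \approx 0.211\]

a point on the right edge of the unit square, a bit above the midpoint (remember that \(y\) is measured downward from the top).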

Before turning this into code, there’s a simplification we can make to the transformation from \((u, v)\) to \((\xi, \eta)\). Note that

\[\mathrm{sgn}(u) = \frac{u}{|u|}\]

So the transformation equations can be changed to

\[\xi = \frac{u}{|u|}, \quad \eta = \frac{v}{|u|} \qquad \mathrm{for}\, |u| \ge |v|\] \[\xi = \frac{u}{|v|}, \quad \eta = \frac{v}{|v|} \qquad \mathrm{for}\, |u| \lt |v|\]

where we’ve also noted that the ordering of the squares is the same as the ordering of the absolute values.

In Python (because I don’t know Swift), the three-part transformation from \(\theta\) to \((x, y)\) is

def xy(a):
    # Coords on circle of unit radius with angle a (in degrees)
    u = cosd(a)
    v = sind(a)

    # Convert to coords on square from (-1, -1) to (1, 1)
    if abs(u) >= abs(v):
        xi = u/abs(u)
        eta = v/abs(u)
    else:
        xi = u/abs(v)
        eta = v/abs(v)

    # Convert to coords on square from (0, 0) to (1, 1) with flipped
    # vertical axis
    return (1 + xi)/2, (1 - eta)/2

This is not necessarily the fastest implementation, or the shortest, but it’s pretty simple, and you can see each transformation in turn. Note that if you think of the \((\xi, \eta)\) box as being divided into 90° sectors like this,

Box with green and yellow triangular sectors

the green areas correspond to \(|u| \ge |v|\) and the yellow areas correspond to \(|u| \lt |v|\), which is how we split the if conditional.
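If you want to spot-check the function at the corners and edge midpoints, here’s a self-contained version. The cosd and sind helpers aren’t standard Python; I’m assuming they’re just degree-taking wrappers around math.cos and math.sin:

```python
from math import sin, cos, radians, isclose

# Degree-based trig helpers (assumed wrappers, not standard library)
def cosd(a):
    return cos(radians(a))

def sind(a):
    return sin(radians(a))

def xy(a):
    # Coords on circle of unit radius with angle a (in degrees)
    u = cosd(a)
    v = sind(a)

    # Convert to coords on square from (-1, -1) to (1, 1)
    if abs(u) >= abs(v):
        xi, eta = u/abs(u), v/abs(u)
    else:
        xi, eta = u/abs(v), v/abs(v)

    # Convert to coords on square from (0, 0) to (1, 1) with flipped
    # vertical axis
    return (1 + xi)/2, (1 - eta)/2

print(xy(0))    # (1.0, 0.5), the middle of the right edge
print(xy(45))   # top right corner, within floating point fuzz
print(xy(90))   # middle of the top edge, within floating point fuzz
```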

So how does this compare with what Underscore does? Here’s his code:

func unitSquareIntersectionPoint(_ angle:Angle) -> UnitPoint {
    var normalizedDegree = angle.degrees
    while normalizedDegree > 360.0 {
        normalizedDegree -= 360.0
    }
    while normalizedDegree < 0.0 {
        normalizedDegree += 360.0
    }
    if normalizedDegree < 45.0 || normalizedDegree >= 315 {
        //Right Edge, x = 1.0
        var degreeToConsider = normalizedDegree
        if degreeToConsider < 45.0 {
            degreeToConsider = normalizedDegree + 360.0
            //angle now between 315 & 405
        }
        let degreeProportion = (degreeToConsider - 315.0) / 90.0
        return UnitPoint(x: 1.0, y: 1.0 - degreeProportion)
    } else if normalizedDegree < 135.0 {
        //Top Edge, y = 0.0
        let degreeProportion = (normalizedDegree - 45.0) / 90.0
        return UnitPoint(x: 1.0 - degreeProportion, y: 0.0)
    } else if normalizedDegree < 225.0 {
        //left Edge, x = 0.0
        let degreeProportion = (normalizedDegree - 135) / 90.0
        return UnitPoint(x: 0.0, y: degreeProportion)
    } else if normalizedDegree < 315.0 {
        //Bottom Edge, y = 1.0
        let degreeProportion = (normalizedDegree - 225) / 90.0
        return UnitPoint(x: degreeProportion, y: 1.0)
    }
    return .zero
}

(Sorry, I don’t have syntax highlighting set up here for Swift.)

As you can see, he splits his calculations into the four colored sectors—now on the unit square—but he has to treat all four sectors separately.

Unit box with triangular sectors

The main difference, though, is that he doesn’t use trig functions. For each side of the square, he sets one of the coordinates to 0 or 1, as appropriate, and computes the other from the proportion

\[\frac{\phi}{90}\]

where \(\phi\) is the angle from the closest corner clockwise from \(\theta\). In other words, to get the y coordinate along the right edge, \(\phi\) is measured from the bottom right corner; to get the x coordinate along the top edge, \(\phi\) is measured from the top right corner; and so on.
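To convince myself I understood his bookkeeping, here’s a straight Python port of Underscore’s function (the snake_case name is mine, and it returns a plain tuple instead of a UnitPoint):

```python
def unit_square_intersection_point(degrees):
    # Python port of Underscore's Swift function
    d = degrees
    while d > 360.0:
        d -= 360.0
    while d < 0.0:
        d += 360.0
    if d < 45.0 or d >= 315.0:
        # Right edge, x = 1.0
        if d < 45.0:
            d += 360.0   # angle now between 315 and 405
        p = (d - 315.0) / 90.0
        return (1.0, 1.0 - p)
    elif d < 135.0:
        # Top edge, y = 0.0
        p = (d - 45.0) / 90.0
        return (1.0 - p, 0.0)
    elif d < 225.0:
        # Left edge, x = 0.0
        p = (d - 135.0) / 90.0
        return (0.0, p)
    else:
        # Bottom edge, y = 1.0
        p = (d - 225.0) / 90.0
        return (p, 1.0)

print(unit_square_intersection_point(0))    # (1.0, 0.5)
print(unit_square_intersection_point(90))   # (0.5, 0.0)
```

Note that the corner and midside values come out exactly, with no floating point fuzz, because no trig is involved.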

In my xy function, the key calculations are

\[\frac{u}{|u|}, \quad \frac{v}{|v|}, \quad \frac{u}{|v|}, \quad \frac{v}{|u|}\]

which give us \(\pm 1\) for the first two and either \(\pm \cot \theta\) or \(\pm \tan \theta\) for the other two. My \(\pm 1\) correspond to UDS’s 0 and 1.1 And since

\[\cot \theta = \tan (90° - \theta)\]

and tangent repeats every 180°, my other calculations always return a result equivalent to the tangent of an angle between -45° and +45°.

The upshot is that UDS’s function is approximating the tangent like this,

Tangent vs Underscore graph

where the blue line is the tangent and the orange line is UDS’s approximation. As you can see, they’re fairly close to each other.

You’ll notice I’ve plotted \(\theta\) in radians. That’s because I also wanted to show the usual small angle approximation of tangent,

\[\tan \theta \approx \theta\]

which I’ve plotted as the dotted black line. The small angle approximation works only when the angle is given in radians.

As expected, the small angle linear approximation gets crappy as you get further from 0°. By tilting his approximating line, UDS gets perfect alignment at the corners and midsides of the box (the most important places) and decent alignment elsewhere.

How decent? Let’s plot the angle associated with UDS’s output and compare it to the input angle:

Underscore input-output angles

The symmetry (actually antisymmetry) means we can restrict ourselves to 0° to 45°. The dashed line is what the plot would look like if Underscore’s function wasn’t doing an approximation.

Here’s a plot of the error:

Underscore angular error graph

It’s not a parabola, nor is it symmetric, but it’s close. The maximum error is 4.07° when the input angle is 23.52°.
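Those numbers are easy to check with a quick scan, assuming the error is measured as the angle recovered from UDS’s linear stand-in for the tangent (via the arctangent) minus the input angle:

```python
import math

def angle_error(theta):
    # On the right edge, UDS's linear stand-in for tan(theta) is theta/45;
    # recover the angle it implies and compare with the input
    recovered = math.degrees(math.atan(theta / 45))
    return recovered - theta

# Scan 0-45 degrees in steps of 0.01
thetas = [t / 100 for t in range(4501)]
errors = [angle_error(t) for t in thetas]
max_error = max(errors)
worst_theta = thetas[errors.index(max_error)]
print(round(worst_theta, 2), round(max_error, 2))  # 23.52 4.07
```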

Is a maximum error of 4° acceptable? I suspect it is, especially because of where it occurs. People tend to be better at detecting angular errors near the horizontal, vertical, and 45° than they are at other angles.2 UDS’s function is good where it needs to be and less good where it doesn’t.

On the other hand, it could well be argued that this is not an especially time-critical calculation, and you might as well do the trig calculations3 instead of estimating through proportions. A lot depends on what sort of calculations you’re most comfortable with.

Which brings us to Underscore’s update and the “corrections” therein. Most of them include code that looks a lot like mine, which isn’t surprising, since they’re doing the same sort of trigonometry I am. Math is math. But there are some interesting variations.

DrewFitz does two things I like:

  1. They start by changing the sign of the angle, which has the effect of flipping the vertical coordinate without flipping the horizontal coordinate because sine is antisymmetric and cosine is symmetric.
  2. They determine the maximum of \(|u|\) and \(|v|\) and use that as the denominator. This eliminates the if/else clause (it’s still under the hood of the max function, but you don’t see it).

Sometimes being clever like this gets you in trouble, but I think the comments in the code explain what’s going on quite well.
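My reading of that second trick, sketched in Python (since I don’t know Swift) and leaving out the sign flip to keep it minimal—this is my paraphrase, not DrewFitz’s actual code:

```python
import math

def xy_max(a):
    # Sketch of the max-denominator idea: one denominator, no if/else
    u = math.cos(math.radians(a))
    v = math.sin(math.radians(a))

    # Dividing both coords by the larger magnitude pushes the point
    # out to the perimeter of the 2x2 square
    m = max(abs(u), abs(v))
    xi, eta = u / m, v / m

    # Shrink, shift, and flip the vertical axis as before
    return (1 + xi) / 2, (1 - eta) / 2

print(xy_max(0))   # (1.0, 0.5)
```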

robb’s solution is very close to mine and even closer to the equations in Fong’s paper. Like DrewFitz, robb tweaks the angle upon input to handle the flipping of the vertical axis, but in this code the angle is increased by 90° and the sine and cosine are reversed. That tweak was a little harder for me to understand, but it works.

Like Underscore, I find Rob Mayoff’s solution hard to understand. Mayoff uses a combination of max, min and copysign (which takes the place of signum) to deal with the sector choice. It works, but I would have a hard time explaining why without going through numerical examples. Also, I think they’re using an opposite sign convention for the angle—UDS’s increases counterclockwise while Rob Mayoff’s seems to increase clockwise—but that’s a minor issue.

Of course, I think my solution is the “best” (followed closely—maybe edged out?—by DrewFitz’s) but that’s because it fits my way of thinking. As we learn more, our ways of thinking change and “best” changes, too. That’s why you’re constantly told to comment your code.

  1. Remember, at this stage my calculations are done on the 2×2 square. My \(\pm 1\) get squeezed and shifted to 0 and 1 in the last line of the function. 

  2. This is related to the idea of “banking to 45°” described by Cleveland in his The Elements of Graphing Data. 

  3. Are trig calculations still expensive compared to regular arithmetic? I was taught that they are, but that was long ago.