Cleaning and graphing baseball data
February 28, 2023 at 12:16 PM by Dr. Drang
If any of you followed the link in yesterday’s post to my old writeup from a decade ago, you noticed that the scripts I wrote back then used the csv and cPickle libraries to read, filter, and store the baseball game data. Now I find it more natural to use pandas for all of that, so I wrote all new code instead of trying to edit the old stuff. This post goes through the steps.
First, I downloaded all the game data from Retrosheet. Back in 2012, I had to download each year’s data individually; yesterday, I was able to get every year in a single zip file called gl1871_2022.zip. Unzipping this gave me a text file for each year, with names like gl1871.txt. Also, there were files for playoff, World Series, and All-Star games.
I wanted to look at only regular season games from the “modern era.” The modern era could have lots of definitions, but I chose to start with 1920, the year Babe Ruth began playing with the Yankees. So I deleted all the files except gl1920.txt through gl2022.txt.
These are CSV files without a header line. Each line represents a game, and there are 161 fields in each line; Retrosheet has an index that describes the fields. The files have DOS/Windows (CRLF) linebreaks, so I started off by converting them to Unix (LF) linebreaks with
dos2unix gl*.txt
You can find the dos2unix utility in many places on the internet. I installed it via Homebrew.
Strictly speaking, this conversion wasn’t necessary, as none of my subsequent manipulations required Unix linebreaks, but I did it anyway because it’s safer in general to have Unix-style files when you’re going to work in a Unix environment.
I concatenated all files together:
cat gl*.txt > gl.csv
This gave me a 184 MB CSV file with no header line. There are command-line tools for manipulating and filtering such files, but I decided to use pandas because that’s what I’m most familiar with. I opened a Jupyter console session and ran these commands:
python:
import pandas as pd
df = pd.read_csv('gl.csv', header=None, parse_dates=[0], usecols=[0, 3, 4, 6, 7, 9, 10, 18])
df.to_csv('games.csv', index=False)
The second line imported just the fields I wanted into the dataframe. In order, they were the date (0), visiting team (3), visiting team league (4), home team (6), home team league (7), visiting team score (9), home team score (10), and game duration in minutes (18).^{1} If you compare these numbers to the Retrosheet index, you’ll see that my numbers are one less than the index’s. That’s because pandas starts its numbering at zero instead of one.
With this done, I had a new CSV file called games.csv that was a lot smaller, only 6.1 MB. This file had a header line of
0,3,4,6,7,9,10,18
which isn’t very useful. I opened it in BBEdit and edited that line to be
Date,VTeam,VLeague,HTeam,HLeague,VScore,HScore,Time
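For a one-off, the hand edit is fine, but pandas can also label the columns at read time with the names argument, which would skip the BBEdit step entirely. A sketch of that alternative, with a couple of made-up game lines standing in for the real gl.csv:

```python
import io
import pandas as pd

# Two fake game lines (only the eight fields of interest, not all 161),
# standing in for the real gl.csv.
fake = io.StringIO(
    '"19200414","BOS","AL","NYA","AL",3,2,107\n'
    '"19200415","BOS","AL","NYA","AL",0,5,98\n'
)

# names= labels the columns as they're read, so there's no
# numeric header line to fix afterward.
names = ['Date', 'VTeam', 'VLeague', 'HTeam', 'HLeague',
         'VScore', 'HScore', 'Time']
df = pd.read_csv(fake, header=None, names=names, parse_dates=[0])

print(df.columns.tolist())
```

With the real file, the read_csv call would also need the usecols argument shown above to pick out the eight fields from the 161.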
To get the statistics I wanted to plot, I opened a new Jupyter console session and ran these commands:
python:
1: import pandas as pd
2: from scipy.stats import scoreatpercentile
3: df = pd.read_csv('games.csv', parse_dates=[0])
4: df['Year'] = df.Date.dt.year
5: for y in df.Year.unique():
6:     p25 = scoreatpercentile(df.Time[df.Year==y], 25)
7:     p50 = scoreatpercentile(df.Time[df.Year==y], 50)
8:     p75 = scoreatpercentile(df.Time[df.Year==y], 75)
9:     print(f'{y},{p25:.2f},{p50:.2f},{p75:.2f}')
The dt.year property gets the year from the date. The scoreatpercentile function returns the value in the given list at the given percentile. So the loop in Lines 5–9 goes through each year, determines the 25th, 50th, and 75th percentile values for all the game durations of that year, and prints out the result in this form:
1920,99.00,109.00,120.00
1921,100.00,111.00,125.00
1922,100.00,110.50,124.00
1923,102.00,112.00,125.00
1924,101.00,112.00,125.00
1925,102.00,114.00,127.00
<etc>
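The explicit loop matches how I think, but the same kind of table can come straight out of pandas with groupby and quantile. A sketch, with made-up durations standing in for games.csv:

```python
import pandas as pd

# Made-up stand-in for games.csv: a few game times across two years.
df = pd.DataFrame({
    'Year': [1920]*4 + [1921]*4,
    'Time': [95, 105, 115, 125, 100, 110, 120, 140],
})

# One row per year, columns for the 25th, 50th, and 75th percentiles.
q = df.groupby('Year')['Time'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['Q1', 'Median', 'Q3']
print(q)
```

This also skips the copy-and-paste step, since q.to_csv would write the header and year column directly.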
I copied the output, saved it to a file called gametimes.csv, and added this header line:
Year,Q1,Median,Q3
This is the data I wanted to plot.
I didn’t trust myself to write the plotting code interactively, so I created this gametimeplot.py file:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: import matplotlib.pyplot as plt
5:
6: # Import game time data
7: df = pd.read_csv('gametimes.csv')
8:
9: # Create the plot with a given size in inches
10: fig, ax = plt.subplots(figsize=(6, 4))
11:
12: # Add the interquartile range and the median
13: plt.fill_between(df.Year, df.Q1, df.Q3, alpha=.25, linewidth=0, color='#0066ff')
14: ax.plot(df.Year, df.Median, '-', color='black', lw=2)
15:
16: # Gridlines and ticks
17: ax.grid(linewidth=.5, axis='x', which='major', color='#bbbbbb', linestyle='-')
18: ax.grid(linewidth=.5, axis='y', which='major', color='#bbbbbb', linestyle='-')
19: ax.tick_params(which='both', width=.5)
20:
21: # Title and axis labels
22: plt.title('Baseball game durations')
23: plt.xlabel('Year')
24: plt.ylabel('Minutes per game')
25:
26: # Save as PNG
27: plt.savefig('20230227-Baseball game durations.png', format='png', dpi=200)
28:
I think this is pretty straightforward. Normally, my plotting scripts include commands for setting the tickmark placement, but in this case matplotlib’s defaults were just fine. The main work is done in Lines 13–14, where fill_between shades in the interquartile range and plot runs a thick black line along the median.
I could’ve combined the code that calculates the percentiles with the code that does the plotting into a single script, but for a one-off like this, I usually prefer to take it one step at a time so I can check the intermediate results. Even if I had combined all the code here into a single script, it would have been simpler and easier to read than what I did back in 2012. That’s all due to pandas.

Most of these fields were unnecessary for what I was plotting, but I kept them in case I wanted to do some other analyses. ↩
Revisiting baseball game durations
February 27, 2023 at 9:08 PM by Dr. Drang
This morning, I learned from Paul Kafasis that major league baseball will be going to a pitch clock, something they tried out in the minor leagues last year. The idea, of course, is to speed up our interminable national pastime.
The story reminded me of a post I wrote about a decade ago in which I plotted the generally upward trend of baseball game durations over about 90 years. I grabbed a new set of data from Retrosheet and made this updated plot:
The last decade has not been kind to fans’ tushies. The black line is the median game length, and the blue zone is the interquartile range, which runs from the 25th percentile to the 75th percentile. Games have gotten 10–12 minutes longer in the past 10 years.
Paul quotes a Washington Post article about how the pitch clock
has reduced the average [minor league] game time from 3 hours 4 minutes in 2021 to 2:36 in 2022
My first thought was How can minor league games possibly last as long as major league games? Isn’t that a violation of the Geneva Convention? But then I remembered that minor league teams put on a lot of non-baseball show between innings. Myron Noodleman (RIP) ate up a lot of time.
I wish baseball luck. A half-hour drop would bring the game back down to the running time it had when I was a fan. I think that’s quite optimistic, but we’ll revisit this graph at the end of the season and see how it went.
Modulo
February 26, 2023 at 10:26 AM by Dr. Drang
I left something out of yesterday’s post. Another mathematical aspect of Underscore David Smith’s unitSquareIntersectionPoint function is this set of lines at the top of the function:
var normalizedDegree = angle.degrees
while normalizedDegree > 360.0 {
    normalizedDegree -= 360.0
}
while normalizedDegree < 0.0 {
    normalizedDegree += 360.0
}
There are lots of ways to normalize an angle. Here, UDS wants to take the input angle and turn it into the equivalent angle between 0° and 360°. If angle is already in that range, neither of the while loops is entered and normalizedDegree remains equal to the input angle.
You may recognize this as an example of modular arithmetic extended to include floating point numbers. SSteve did, and UDS included their code that makes that connection more explicit:
var normalizedDegree = angle.degrees % 360;
normalizedDegree = normalizedDegree < 0 ? normalizedDegree + 360 : normalizedDegree;
Unfortunately, SSteve wasn’t able to use Swift’s % operator alone, because it returns a value with a sign that matches the sign of the dividend. Thus the second line with the ternary operator was needed to ensure a positive result.
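You can mimic that truncated remainder in Python with math.fmod and apply the same fix-up; a sketch (normalize is my name for it, not from either post):

```python
import math

def normalize(deg):
    """Map any angle in degrees onto [0, 360)."""
    r = math.fmod(deg, 360.0)          # truncated remainder: sign follows deg
    return r + 360.0 if r < 0 else r   # same fix-up as the ternary operator

print(normalize(-90.0))   # fmod gives -90.0, so the fix-up yields 270.0
```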
Sign ambiguity can be a nasty problem in modular arithmetic because different programming languages treat modular arithmetic differently. If I had to normalize the input angle in my xy function to get it between 0° and 360°—which I don’t, because the trig functions handle angles outside that range correctly—I’d be able to use
python:
normalizedAngle = angle % 360
in Python without further adjustment. Python uses a floored division definition of %, so it always returns a value with the same sign as the divisor—in this case, +360.
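A quick check of that behavior:

```python
# Python's % follows the divisor's sign, so negative angles
# come out already normalized to [0, 360).
print(-90 % 360)     # 270
print(450 % 360)     # 90
print(-450.0 % 360)  # 270.0 -- works for floats, too
```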
When I read Underscore’s first post, I didn’t know whether Swift even had a modulo operator or how it worked. I went to the Wikipedia page looking for an answer and found three:
- % uses truncated division, which, as we’ve seen, returns an answer with the same sign as the dividend. Unfortunately, it’s limited to integers, so it probably shouldn’t be used in UDS’s function (unless there’s some typecasting going on in SSteve’s code that I don’t understand).
- truncatingRemainder(dividingBy:), like %, uses truncated division, but it can be used with floating point numbers. This is probably what should be used in SSteve’s code snippet.
- remainder(dividingBy:) uses rounded division, where the sign depends on how close the dividend is to an integer multiple of the divisor. The sign can be either positive or negative, which I hate.
So none of these can be used without an adjustment.
Modular arithmetic doesn’t come up often in my work, but when it does, I get nervous about the sign of the result. I’ve programmed in too many languages over the years to remember what type of division is used for the modulo operator in the language I’m currently writing in. Even the language I use the most, Python, has different ways of doing modulo arithmetic. Many different ways:
- The modulo operator, %, uses floored division.
- The divmod function uses floored division.
- The fmod function in the math library uses truncated division.
- The remainder function in the math library uses rounded division.
- The mod function in the NumPy library uses floored division.
- The fmod function in the NumPy library uses truncated division.
- The remainder function in the NumPy library uses floored division.
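The disagreement shows up with a single pair of numbers (sticking to the standard library here):

```python
import math

x, m = -7.0, 3.0

print(x % m)                 # floored: sign follows the divisor, gives 2.0
print(math.fmod(x, m))       # truncated: sign follows the dividend, gives -1.0
print(math.remainder(x, m))  # rounded: nearest multiple of 3 to -7 is -6, gives -1.0
print(divmod(x, m)[1])       # same floored result as %
```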
This is the sort of thing that makes you think Underscore was right in writing his own modular arithmetic code. It may take up more space, but at least the way it works is obvious.
Underscore David Smith and linearization for large angles
February 25, 2023 at 1:46 PM by Dr. Drang
Underscore David Smith wrote a post earlier this week that I had trouble with. Part of the trouble was that the code he wrote was in Swift, but that was a minor problem, as UDS’s code was clear enough to understand even though I don’t know Swift. The bigger problem was how he managed to transform an angle into a pair of xy coordinates without using any trigonometric functions. Once I figured it out, I saw that it was an interesting example of linearization without using the typical small angle approximation. So I decided to write this post.
Unsurprisingly, other people had the same trouble with UDS’s code, and this morning he published an updated post with their “corrections” to his transformation function. In some sense, these truly are corrections in that they provide a more accurate transformation. But UDS’s function is pretty accurate, probably accurate enough for his purposes, so they’re not necessarily correcting a mistake. Still, they presented some interesting ways of performing the transformation, and I’ll comment on them at the end of this post.
Let’s summarize Underscore’s problem. He wants to use Swift’s LinearGradient fill style for an arbitrary angle. Unfortunately, LinearGradient won’t accept an angle as its argument. It will, however, take a UnitPoint argument. A UnitPoint is a pair of coordinates on the perimeter of the unit square shown below, where I’ve also shown the angle that UDS wants to use as his input.
Getting \(x\) and \(y\) from \(\theta\) is a fairly straightforward trig problem, although there’s some annoying bookkeeping necessary to deal with the corners. With the hope of avoiding that bookkeeping, I decided to look into conformal mapping from a disc onto a square. I didn’t have my copy of Churchill’s book handy, so I did some Googling and ran across this nice writeup by Chamberlain Fong. Fong shows several ways of mapping discs onto rectangles. Most important, he shows that conformal mapping (which can be very nasty mathematically) is unnecessary for our problem—we can use what he calls the Simple Stretching transformation.
As you can see, the angles in the box are the same as those in the disc, which we’ll need to do what UDS wants. The transformation formulas Fong gives for this mapping (on p. 3 of his paper) work over the entire disc and square. We can specialize them for just the perimeter:
\[\xi = \mathrm{sgn}(u), \quad \eta = \mathrm{sgn}(u) \frac{v}{u} \qquad \mathrm{for}\, u^2 \ge v^2\] \[\xi = \mathrm{sgn}(v)\frac{u}{v}, \quad \eta = \mathrm{sgn}(v) \qquad \mathrm{for}\, u^2 \lt v^2\]Here, \(\mathrm{sgn}\) is the signum or sign function—it returns \(+1\) if the argument is positive, \(-1\) if the argument is negative, and zero if the argument is zero. The coordinate systems are shown below:
So what we have now is a series of transformations. First, we go from \(\theta\) to \((u, v)\):
\[u = \cos \theta\] \[v = \sin \theta\]Then we go from \((u, v)\) to \((\xi, \eta)\) using the formulas above. Finally, we go from \((\xi, \eta)\) to \((x, y)\):
\[x = \frac{1 + \xi}{2}\] \[y = \frac{1 - \eta}{2}\]This last one is, I think, pretty obvious. We flip the vertical coordinate, then squeeze the 2×2 box into 1×1 and move its center from the origin to \((1/2, 1/2)\).
You might be looking at this and thinking the bookkeeping I wanted to avoid would be easier than cascading three coordinate transformations like this. I guess it depends on what you’re used to. To me, putting one transformation after another after another is easier to remember than keeping track of which side of the box you’re on.
Before turning this into code, there’s a simplification we can make to the transformation from \((u, v)\) to \((\xi, \eta)\). Note that
\[\mathrm{sgn}(u) = \frac{u}{|u|}\]So the transformation equations can be changed to
\[\xi = \frac{u}{|u|}, \quad \eta = \frac{v}{|u|} \qquad \mathrm{for}\, |u| \ge |v|\] \[\xi = \frac{u}{|v|}, \quad \eta = \frac{v}{|v|} \qquad \mathrm{for}\, |u| \lt |v|\]where we’ve also noted that the ordering of the squares is the same as the ordering of the absolute values.
In Python (because I don’t know Swift), the three-part transformation from \(\theta\) to \((x, y)\) is
python:
 1: def xy(a):
 2:     # Coords on circle of unit radius with angle a (in degrees)
 3:     u = cosd(a)
 4:     v = sind(a)
 5:
 6:     # Convert to coords on square from (-1, -1) to (1, 1)
 7:     if abs(u) >= abs(v):
 8:         xi = u/abs(u)
 9:         eta = v/abs(u)
10:     else:
11:         xi = u/abs(v)
12:         eta = v/abs(v)
13:
14:     # Convert to coords on square from (0, 0) to (1, 1) with flipped
15:     # vertical axis
16:     return (1 + xi)/2, (1 - eta)/2
This is not necessarily the fastest implementation, or the shortest, but it’s pretty simple, and you can see each transformation in turn. Note that if you think of the \((\xi, \eta)\) box as being divided into 90° sectors like this,
the green areas correspond to \(|u| \ge |v|\) and the yellow areas correspond to \(|u| \lt |v|\), which is how we split the if conditional.
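The xy function assumes degree-based trig helpers, cosd and sind. With minimal definitions for them (my own, since they’re not shown above), the whole chain is self-contained and can be checked at the corners and midsides, where the answers are known exactly:

```python
import math

# Degree-based trig helpers assumed by xy
def cosd(a):
    return math.cos(math.radians(a))

def sind(a):
    return math.sin(math.radians(a))

def xy(a):
    # Coords on circle of unit radius with angle a (in degrees)
    u = cosd(a)
    v = sind(a)

    # Convert to coords on square from (-1, -1) to (1, 1)
    if abs(u) >= abs(v):
        xi, eta = u/abs(u), v/abs(u)
    else:
        xi, eta = u/abs(v), v/abs(v)

    # Convert to coords on square from (0, 0) to (1, 1) with
    # flipped vertical axis
    return (1 + xi)/2, (1 - eta)/2

# Corners and midsides land where they should (up to floating point)
for a in (0, 45, 90, 180):
    x, y = xy(a)
    print(a, round(x, 6), round(y, 6))
```

0° hits the right-edge midpoint (1, 0.5), 45° the top-right corner (1, 0), 90° the top midpoint (0.5, 0), and 180° the left-edge midpoint (0, 0.5).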
So how does this compare with what Underscore does? Here’s his code:
func unitSquareIntersectionPoint(_ angle: Angle) -> UnitPoint {
    var normalizedDegree = angle.degrees
    while normalizedDegree > 360.0 {
        normalizedDegree -= 360.0
    }
    while normalizedDegree < 0.0 {
        normalizedDegree += 360.0
    }
    if normalizedDegree < 45.0 || normalizedDegree >= 315 {
        //Right Edge, x = 1.0
        var degreeToConsider = normalizedDegree
        if degreeToConsider < 45.0 {
            degreeToConsider = normalizedDegree + 360.0
            //angle now between 315 & 405
        }
        let degreeProportion = (degreeToConsider - 315.0) / 90.0
        return UnitPoint(x: 1.0, y: 1.0 - degreeProportion)
    } else if normalizedDegree < 135.0 {
        //Top Edge, y = 0.0
        let degreeProportion = (normalizedDegree - 45.0) / 90.0
        return UnitPoint(x: 1.0 - degreeProportion, y: 0.0)
    } else if normalizedDegree < 225.0 {
        //left Edge, x = 0.0
        let degreeProportion = (normalizedDegree - 135) / 90.0
        return UnitPoint(x: 0.0, y: degreeProportion)
    } else if normalizedDegree < 315.0 {
        //Bottom Edge, y = 1.0
        let degreeProportion = (normalizedDegree - 225) / 90.0
        return UnitPoint(x: degreeProportion, y: 1.0)
    }
    return .zero
}
(Sorry, I don’t have syntax highlighting set up here for Swift.)
As you can see, he splits his calculations into the four colored sectors—now on the unit square—but he has to treat all four sectors separately.
The main difference, though, is that he doesn’t use trig functions. For each side of the square, he sets one of the coordinates to 0 or 1, as appropriate, and sets the other as
\[\frac{|\phi - 90|}{90}\]where \(\phi\) is the angle from the closest corner clockwise from \(\theta\). In other words, to get the y coordinate along the right edge, \(\phi\) is measured from the bottom right corner; to get the x coordinate along the top edge, \(\phi\) is measured from the top right corner; and so on.
In my xy function, the key calculations are the four quotients u/abs(u), v/abs(v), u/abs(v), and v/abs(u), which give us \(\pm 1\) for the first two and either \(\pm \cot \theta\) or \(\pm \tan \theta\) for the other two. My \(\pm 1\) correspond to UDS’s 0 and 1.^{1} And since
\[\cot \theta = \tan (90° - \theta)\]and tangent repeats every 180°, my other calculations always return a result equivalent to the tangent of an angle between −45° and +45°.
The upshot is that UDS’s function is approximating the tangent like this,
where the blue line is the tangent and the orange line is UDS’s approximation. As you can see, they’re fairly close to each other.
You’ll notice I’ve plotted \(\theta\) in radians. That’s because I also wanted to show the usual small angle approximation of tangent,
\[\tan \theta \approx \theta\]which I’ve plotted as the dotted black line. The small angle approximation works only when the angle is given in radians.
As expected, the small angle linear approximation gets crappy as you get further from 0°. By tilting his approximating line, UDS gets perfect alignment at the corners and midsides of the box (the most important places) and decent alignment elsewhere.
How decent? Let’s plot the angle associated with UDS’s output and compare it to the input angle:
The symmetry (actually antisymmetry) means we can restrict ourselves to 0° to 45°. The dashed line is what the plot would look like if Underscore’s function wasn’t doing an approximation.
Here’s a plot of the error:
It’s not a parabola, nor is it symmetric, but it’s close. The maximum error is 4.07° when the input angle is 23.52°.
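Those figures are easy to check numerically. On the 2×2 box, UDS’s straight line through the corner and midside is θ/45 standing in for tan θ, so the angle his output corresponds to is arctan(θ/45). A sketch that scans for the worst case:

```python
import math

# On the right edge of the 2x2 box, the exact coordinate is tan(theta)
# and the tilted linear stand-in is theta/45 (exact at 0 and 45 degrees).
# The angular error is the recovered angle atan(theta/45) minus theta.
worst_err, worst_angle = 0.0, 0.0
for i in range(45001):
    theta = i / 1000                        # 0 to 45 in 0.001-degree steps
    recovered = math.degrees(math.atan(theta / 45))
    err = abs(recovered - theta)
    if err > worst_err:
        worst_err, worst_angle = err, theta

print(f'{worst_err:.2f} deg at {worst_angle:.2f} deg')
```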
Is a maximum error of 4° acceptable? I suspect it is, especially because of where it occurs. People tend to be better at detecting angular errors near the horizontal, vertical, and 45° than they are at other angles.^{2} UDS’s function is good where it needs to be and less good where it doesn’t.
On the other hand, it could well be argued that this is not an especially time-critical calculation, and you might as well do the trig calculations^{3} instead of estimating through proportions. A lot depends on what sort of calculations you’re most comfortable with.
Which brings us to Underscore’s update and the “corrections” therein. Most of them include code that looks a lot like mine, which isn’t surprising, since they’re doing the same sort of trigonometry I am. Math is math. But there are some interesting variations.
DrewFitz does two things I like:
- They start by changing the sign of the angle, which has the effect of flipping the vertical coordinate without flipping the horizontal coordinate, because sine is antisymmetric and cosine is symmetric.
- They determine the maximum of \(|u|\) and \(|v|\) and use that as the denominator. This eliminates the if/else clause (it’s still under the hood of the max function, but you don’t see it).
Sometimes being clever like this gets you in trouble, but I think the comments in the code explain what’s going on quite well.
robb’s solution is very close to mine and even closer to the equations in Fong’s paper. Like DrewFitz, robb tweaks the angle upon input to handle the flipping of the vertical axis, but in this code the angle is increased by 90° and the sine and cosine are reversed. That tweak was a little harder for me to understand, but it works.
Like Underscore, I find Rob Mayoff’s solution hard to understand. Mayoff uses a combination of max, min, and copysign (which takes the place of signum) to deal with the sector choice. It works, but I would have a hard time explaining why without going through numerical examples. Also, I think they’re using an opposite sign convention for the angle—UDS’s increases counterclockwise while Rob Mayoff’s seems to increase clockwise—but that’s a minor issue.
Of course, I think my solution is the “best” (followed closely—maybe edged out?—by DrewFitz’s) but that’s because it fits my way of thinking. As we learn more, our ways of thinking change and “best” changes, too. That’s why you’re constantly told to comment your code.

Remember, at this stage my calculations are done on the 2×2 square. My \(\pm 1\) get squeezed and shifted to 0 and 1 in the last line of the function. ↩

This is related to the idea of “banking to 45°” described by Cleveland in his The Elements of Graphing Data. ↩

Are trig calculations still expensive compared to regular arithmetic? I was taught that they are, but that was long ago. ↩