Dot plots in Mathematica
October 2, 2025 at 5:52 PM by Dr. Drang
I ran across Mathematica’s NumberLinePlot
function recently and wondered if I could use it to make dot plots. The answer was yes, but not directly.
By “dot plots,” I mean the things that look sort of like histogram but with stacked dots instead of boxes to represent each count. Here’s an example from a nice paper on dot plots by Leland Wilkinson (the Grammar of Graphics guy):
The paper is referenced in the Wikipedia article on dot plots, and rightly so. Here’s another in a slightly different style from Statistics for Experimenters by Box, Hunter, and Hunter (who call them dot diagrams):
(The vertical line and arrow are add-ons to the dot plot pertinent to the particular problem BH&H were discussing.)
If you’re a William Cleveland fan, you should be aware that the kind of dot plots I’ll be talking about here are not what he calls dot plots. His dot plots are basically bar charts with dots where the ends of the bars would be. Like this example from The Elements of Graphing Data:
With that introduction out of the way, let’s talk about Wilkinson-style dot plots and NumberLinePlot
.
In a nutshell, NumberLinePlot
takes a list of numbers and produces a graph in which the numbers are plotted as dots along a horizontal axis—very much like the number lines I learned about in elementary school. I wondered what it would do if some of the numbers were repeated, so I defined this list of 30 integers with a bunch of repeats,
a = {7, 2, 4, 8, 5, 7, 4, 2, 1, 3, 7, 5, 7, 8, 7,
2, 5, 6, 7, 6, 1, 5, 2, 1, 6, 5, 8, 3, 8, 3}
and ran NumberLinePlot[a]
. Here’s the output:
Unfortunately, it plots the repeated numbers on top of one another so you can’t see how many repeats there are. I looked through the documentation to see if I could pass an option to NumberLinePlot
to get it to stack the repeated dots vertically, but no go.
There is, however, a way to get NumberLinePlot
to stack dots. You have to pass it a nested list set up with sublists having no repeated elements. For example, if we rewrite our original flat list, a
, this way,
b = {{1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 3, 4, 5, 6, 7, 8},
{1, 2, 3, 5, 6, 7, 8}, {2, 5, 7, 8}, {5, 7}, {7}}
and run NumberLinePlot[b]
, we get this graph:
As you can see, each sublist is plotted as its own row. That’s why each row has a different color: Mathematica sees this as a combination of six number line plots and gives each its own color by default. But don’t worry about the colors or the size and spacing of the dots; those can all be changed by passing options to NumberLinePlot
. What you should worry about is how to turn flat list a
into nested list b
. Doing it by hand is both tedious and prone to error.
As it happens, someone has already worked that out and combined it with NumberLinePlot
to make an external function, DotPlot
, that makes the kind of graph we want. Running ResourceFunction["DotPlot"][a]
produces
Again, we can add options to DotPlot
to give it the style Wilkinson suggests. Calling it this way,
ResourceFunction["DotPlot"][a, Ticks -> {Range[8]},
AspectRatio -> .2, PlotRange -> {.5, 8.5},
PlotStyle -> Directive[Black, PointSize[.028]]]
gives us
After setting the aspect ratio to 1:5, I had to fiddle around with the PointSize
to get the dots to sit on top of one another. It didn’t take long.
This is great, but I wanted to know what DotPlot
was doing behind the scenes. So I downloaded its source notebook and poked around. The code wasn’t especially complicated, but there was one very clever piece. Let’s go through the steps.
First, we Sort
the original flat list and Split
it into a list of lists in which each sublist consists of a single repeated value.
nest = Split[Sort[a]]
yields
{{1, 1, 1}, {2, 2, 2, 2}, {3, 3, 3}, {4, 4}, {5, 5, 5, 5, 5},
{6, 6, 6}, {7, 7, 7, 7, 7, 7}, {8, 8, 8, 8}}
Compare this with the nested list b
we used above and you can see that they’re sort of opposites. If we think of these nested lists as matrices, the rows of b
are the columns of nest
, and vice versa. That suggests a Transpose
of nest
would get us to b
, but the problem is that neither nest
nor b
are full matrices. They both have rows of different lengths.
The solution to this is a three-step process:
- Pad the rows of
nest
with dummy values so they’re all the same length. - Transpose that padded list of lists.
- Delete the dummy values from the transpose.
The code for the first step is
maxlen = Max[Map[Length, nest]];
padInnerLists[l_] := PadRight[l, maxlen, "x"]
padded = Map[padInnerLists, nest]
This sets maxlen
to 6 and pads all of nest
’s rows out to that length by adding x
strings. The result is this value for padded
:
{{1, 1, 1, "x", "x", "x"}, {2, 2, 2, 2, "x", "x"},
{3, 3, 3, "x", "x", "x"}, {4, 4, "x", "x", "x", "x"},
{5, 5, 5, 5, 5, "x"}, {6, 6, 6, "x", "x", "x"},
{7, 7, 7, 7, 7, 7}, {8, 8, 8, 8, "x", "x"}}
I should mention here that Map
works the same way mapping functions work in other languages: it applies function to each element of a list and returns the list of results.
Now we can transpose it with
paddedT = Transpose[padded]
which yields
{{1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 3, 4, 5, 6, 7, 8},
{1, 2, 3, "x", 5, 6, 7, 8}, {"x", 2, "x", "x", 5, "x", 7, 8},
{"x", "x", "x", "x", 5, "x", 7, "x"},
{"x", "x", "x", "x", "x", "x", 7, "x"}}
The last step is to use DeleteCases
to get rid of the dummy values.
nestT = DeleteCases[paddedT, "x", All]
yields
{{1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 3, 4, 5, 6, 7, 8},
{1, 2, 3, 5, 6, 7, 8}, {2, 5, 7, 8}, {5, 7}, {7}}
which is exactly what we want. This is what DotPlot
does to prepare the original flat list for passing to NumberLinePlot
.1
Now that I know how DotPlot
works, I feel comfortable using it.
You may be wondering why I don’t just run
Histogram[a, {1}, Ticks -> {Range[8]}]
to get a histogram that presents basically the same information.
First, where’s the fun in that? Second, dot plots appeal to my sense of history. They have a kind of hand-drawn look that reminds me of using tally marks to count items in the days before we all used computers.
-
DotPlot
uses anonymous functions and more compact notation than I’ve used here. But the effect is the same, and I wanted to avoid weird symbols like#
,&
, and/@
. ↩
Baseball durations after the pitch clock
September 30, 2025 at 8:20 PM by Dr. Drang
A couple of years ago, after Major League Baseball announced they’d start using a pitch clock, I said I would write a post about the change in the duration of regular season games. I didn’t. My recent post about Apple’s Sports app and its misunderstanding of baseball standings reminded me of my broken promise. I decided to write a new game duration post after the 2025 season ended.
Unfortunately, Retrosheet, which is where I get my data, hasn’t published its 2025 game logs yet. I decided to go ahead with the post anyway, mainly because putting it off would probably lead to my forgetting again. So here’s the graph of regular season durations through the 2024 season. When the 2025 data comes in, I’ll update it (I hope).
The black line is the median duration, and the light blue zone behind it is the interquartile range. The middle 50% of games lie within that range.
Clearly, there was a huge dropoff in 2023, so I’d say the pitch clock was a big success. The further reduction in 2024 was within the historical year-to-year duration change—I wouldn’t attribute any significance to it.
Games are now roughly as long as they were in the early ’80s, so the powers at MLB have cut about four decades of fat from the game. And they did it without reducing the number of commercials, because they’d never do that.
I made the graph more or less the same way I made the last one, although I combined some of the steps into a single script. First, I downloaded and unzipped all the yearly game logs since 1920 from Retrosheet. They have names like gl1920.txt
. Then I converted the line endings from DOS-style to Unix-style via
dos2unix gl*.txt
I got the dos2unix
command from Homebrew.
I concatenated all the data into one big (189 MB) file using
cat gl*.txt > gl.csv
Although the files from Retrosheet have a .txt
extension, they’re actually CSV files (albeit without a header line). That’s why I gave the resulting file a .csv
extension.
I then ran the following script, which uses Python and Pandas to make a dataframe with just the columns I want from the big CSV file and calculate the quartile values for game durations on a year-by-year basis.
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: from scipy.stats import scoreatpercentile
5: import sys
6:
7: # Make a simplified dataframe of all games with just the columns I want
8: cols = [0, 3, 4, 6, 7, 9, 10, 18]
9: colnames = 'Date VTeam VLeague HTeam HLeague VScore HScore Time'.split()
10: df = pd.read_csv('gl.csv', usecols=cols, names=colnames, parse_dates=[0])
11:
12: # Add a column for the year
13: df['Year'] = df.Date.dt.year
14:
15: # Use the dataframe created above to make a new dataframe
16: # with the game duration quartiles for each year
17: cols = 'Year Q1 Median Q3'.split()
18: dfq = pd.DataFrame(columns=cols)
19: for y in df.Year.unique():
20: p25 = scoreatpercentile(df.Time[df.Year==y], 25)
21: p50 = scoreatpercentile(df.Time[df.Year==y], 50)
22: p75 = scoreatpercentile(df.Time[df.Year==y], 75)
23: dfq.loc[y] = [y, p25, p50, p75]
24:
25: # Write a CSV file for the yearly duration quartiles
26: dfq.Year = dfq.Year.astype('int32')
27: dfq.to_csv('gametimes.csv', index=False)
This code is very similar to what I used a couple of years ago. You can read that post if you want an explanation of any of the details.
Now I had a new CSV file, called gametimes.csv
, that looked like this
Year,Q1,Median,Q3
1920,99.0,109.5,120.0
1921,100.0,111.0,125.0
1922,100.0,110.5,124.0
1923,102.0,112.0,125.0
1924,101.0,112.0,125.0
1925,102.0,114.0,127.0
[and so on]
The graph was made by running this script, which reads the data from gametimes.csv
and uses Matplotlib to output the PNG image shown above:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: import matplotlib.pyplot as plt
5: from datetime import date
6:
7: # Import game time data
8: df = pd.read_csv('gametimes.csv')
9:
10: # Create the plot with a given size in inches
11: fig, ax = plt.subplots(figsize=(6, 4))
12:
13: # Add the interquartile range and the median
14: plt.fill_between(df.Year, df.Q1, df.Q3, alpha=.25, linewidth=0, color='#0066ff')
15: ax.plot(df.Year, df.Median, '-', color='black', lw=2)
16:
17: # Gridlines and ticks
18: ax.grid(linewidth=.5, axis='x', which='major', color='#bbbbbb', linestyle='-')
19: ax.grid(linewidth=.5, axis='y', which='major', color='#bbbbbb', linestyle='-')
20: ax.tick_params(which='both', width=.5)
21:
22: # Title and axis labels
23: plt.title('Baseball game durations')
24: plt.xlabel('Year')
25: plt.ylabel('Minutes per game')
26:
27: # Save as PNG
28: day = date.today()
29: dstring = f'{day.year:4d}{day.month:02d}{day.day:02d}'
30: plt.savefig(f'{dstring}-Baseball game durations.png', format='png', dpi=200)
I don’t know when the folks at Retrosheet will add the 2025 data. Maybe after the World Series. I’ll check back then and update the post if there’s a new year of game logs.
Sparking joy
September 28, 2025 at 3:55 PM by Dr. Drang
I watched this Numberphile video a few days ago and learned not only about Harshad numbers, which I’d never heard of before, but also some new things about defining functions in Mathematica.
Harshad numbers1 are defined as integers that are divisible by the sum of their digits. For example,
so 42 is a Harshad number. On the other hand,
so 76 is not. You can use any base to determine whether a number is Harshad or not, but I’m going to stick to base 10 here.
After watching the video, I started playing around with Harshad numbers in Mathematica. Because Mathematica has both a DigitSum
function and a Divisible
function, it’s fairly easy to define a Boolean function that determines whether a number is Harshad or not:
myHarshadQ[n_] := Divisible[n, DigitSum[n]]
Mathematica functions that return Boolean values often end with Q (Divisible
being an obvious exception), so that’s why I put a Q at the end of myHarshadQ
. A couple of things to note:
- Both
Divisible
andDigitSum
accept and return lists. Therefore, we can givemyHarshadQ
a list of numbers and it will return a list of Boolean values. - Because
DigitSum
accepts only integer arguments,myHarshadQ
throws an error if you feed it noninteger arguments.
To get all the Harshad numbers up through 100, we can use the Select
and Range
functions like this:
Select[Range[100], myHarshadQ]
which returns the list
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12,
18, 20, 21, 24, 27, 30, 36, 40, 42, 45, 48,
50, 54, 60, 63, 70, 72, 80, 81, 84, 90, 100}
You can check this against Sequence A005349 in the On-Line Encyclopedia of Integer Sequences.
There are a couple of ways to count how many Harshad numbers are in a given range. Either
Length[Select[Range[100], myHarshadQ]]
which is a straightforward extension of what we did above, or the less obvious
Count[myHarshadQ[Range[100]], True]
which works by making a list of 100 Boolean values and counting how many are True
. The Count
function is like Length
, but it returns only the number of elements in a list that match a pattern. Both methods return the correct answer of 33, and they take about the same amount of time (as determined through the Timing
function) to run.
While Mathematica doesn’t have a built-in function for determining whether a number is Harshad, it does have an external function, HarshadNumberQ
, that does. This can be used in conjunction with ResourceFunction
. The call
ResourceFunction["HarshadNumberQ"][42]
returns True
. Similarly,
Select[Range[100], ResourceFunction["HarshadNumberQ"]]
returns the list of 33 numbers given above.
The resource function is considerably faster than mine. Running
Timing[Count[ResourceFunction["HarshadNumberQ"][Range[10000]], True]]
returns {0.027855, 1538}
, while running
Timing[Count[myHarshadQ[Range[10000]], True]]
returns {0.06739, 1538}
. The runtime, which is the first item in the list, changes slightly from one trial to the next, but myHarshadQ
always takes nearly three times as long.
To figure out why, I downloaded the source notebook for HarshadNumberQ
and took a look. Here’s the source code for the function:
SetAttributes[HarshadNumberQ, Listable];
iHarshadNumberQ[n_, b_] := Divisible[n, Total[IntegerDigits[n, Abs[b]]]]
HarshadNumberQ[n_Integer, b_Integer] :=
With[{res = iHarshadNumberQ[Abs[n], b]}, res /; BooleanQ[res]]
HarshadNumberQ[n_Integer] := HarshadNumberQ[n, 10];
HarshadNumberQ[_, Repeated[_, {0, 1}]] := False
I won’t pretend to understand everything that’s going on here, but I have figured out a few things:
First, HarshadNumberQ
is obviously defined to handle any number base, not just base-10. And it can handle a list of bases, too, so if you want to check on how many bases in which a number is Harshad, this is the function you want.
Second, HarshadNumberQ
uses an argument pattern I’ve never seen before: n_Integer
. This restricts it to accepting only integer inputs. It returns False
when given noninteger arguments; it doesn’t just throw an error the way myHarshadQ
does.
Third, HarshadNumberQ
uses Total[IntegerDigits[n]]
instead of DigitSum[n]
. This, I believe, is at least part of the reason it runs faster than myHarshadQ
. For example,
Timing[Total[IntegerDigits[99999^10]]]
returns {0.000084, 216}
, while
Timing[DigitSum[99999^10]]
returns {0.000146, 216}
.
Finally, the SetAttributes
function is needed because Total
would otherwise have trouble handling the output of IntegerDigits
when the first argument is a list. Let’s look at some examples:
IntegerDigits[123456]
returns {1, 2, 3, 4, 5, 6}
, a simple list that Total
knows how to sum to 21. But
IntegerDigits[{12, 345, 6789}]
returns a lists of lists, {{1, 2}, {3, 4, 5}, {6, 7, 8, 9}}
, which Total
has trouble with. Calling
Total[IntegerDigits[{12, 345, 6789}]]
returns an error, saying that lists of unequal length cannot be added. We can get around this error by giving Total
a second argument. Calling it as
Total[IntegerDigits[{12, 345, 6789}], {2}]
tells Total
to add the second-level lists, returning {3, 12, 30}
, which is just what we want. The problem is that this second argument to leads to an error when the first argument is a scalar instead of a list.
This, as best I can tell, is where the
SetAttributes[HarshadNumberQ, Listable];
line comes in. By telling Mathematica that HarshadNumberQ
is Listable
, we can define the function as if Total
were always being used on scalars, and the system will thread over lists just like we want.
Since I’m just writing myHarshadQ
for my own entertainment and not for others to use, I can take some of what I learned above and rewrite it to run faster. Like this:
SetAttributes[myHarshadQ, Listable];
myHarshadQ[n_] := Divisible[n, Total[IntegerDigits[n]]]
With this new definition, myHarshadQ
runs considerably faster. Calling
Timing[Count[myHarshadQ[Range[10000]], True]]
returns {0.013527, 1538}
. This is roughly twice as fast as HarshadNumberQ
, probably because the definition doesn’t deal with other bases or handle errors gracefully.
Now I can look at Harshad numbers over a broad range without having to wait long for results. To see how they’re distributed over the first million numbers, I ran
Histogram[Select[Range[999999], HarshadNumberQ], {25000}, ImageSize -> Large]
and got this histogram (with a bin width of 25,000):
The stepdown pattern repeats every 100,000 numbers, moving down each time. Let’s stretch out the range to three million and see what happens.
Histogram[Select[Range[2999999], HarshadNumberQ], {25000}, ImageSize -> Large]
The pattern repeats at the larger scale, too.
If you want to learn some other Harshad facts, you should watch this extra footage at Numberphile. There doesn’t seem to be anything directly practical you can do with Harshad numbers, but you can use them to exercise your mathematical muscles. Or learn new things about the Wolfram Language.
-
You might think Harshad numbers are named after their discoverer or, in accordance with Stigler’s Law, their popularizer, but no. See either the video or the Wikipedia article for the origin of the name. ↩
Clinching
September 27, 2025 at 11:32 AM by Dr. Drang
Although I complained last year of the Sports app’s deficiencies, I’ve been giving it another try. I do like having the score of a favorite team continually updated in the Dynamic Island, but just this morning I learned of another problem.
With the baseball season winding down, I looked in at the wild card standings. Here’s the American League:
The legend (just out of sight below the bottom of the screenshot) says that the little green arrows in the left margin mean “Clinched Playoffs” and the red bars mean “Out of Playoff Contention.” Unlike most sports apps, there’s no designation for clinching the division, an important omission.
Worse than the omission, though, is the outright error. The Mariners have definitely clinched a spot in the playoffs. They’re five games ahead of their nearest rival in the AL West, the Astros, and there are only two games left in the season. How did the Sports app screw this up? I suspect it comes down to the omission mentioned above. The Mariners are in the playoffs because they’ve clinched their division; if Sports doesn’t account for divisional clinching, it may not be able to tell that the Mariners are guaranteed a spot in the playoffs.
(And more than just a spot. Because they’ll have at least the second best record among the AL division winners, the Mariners will get a first-round bye.)
We see the same problem in the National League:
Like the Mariners, the Dodgers are definitely in the playoffs but haven’t been awarded the green arrow. What’s worse about this omission is that the Padres—the team the Dodgers beat out for the NL West—have been marked as a playoff team.
So I’m happy to keep Sports around for its Dynamic Island support, but I still won’t be using it as my main sports app. I’ll continue bouncing between the apps from CBS and ESPN. Neither is great, but they’re both better than Apple’s.
Update 27 Sep 2025 4:16 PM
A few people have suggested that I wasn’t seeing green arrows for the Mariners and Dodgers because I was looking at the Wild Card tabs instead of the Divisional tabs. As it happens, I still have a screenshot of the AL Divisional tab taken at the same time as the two screenshots above. No arrow for the Mariners.
I didn’t take a screenshot of the NL Divisional tab, but I looked at it, and there was no arrow for the Dodgers at the time. Since I took these screenshots, all the tabs have been updated and all the teams that have clinched a playoff spot have arrows. I don’t know what caused the update, but it’s kind of late—the Mariners clinched their division days ago.