Ticks tricks
November 4, 2024 at 3:46 PM by Dr. Drang
As I often do after a post with a graph, I’m going to follow up by showing how I made it. In this case, the plotting was done with Matplotlib, the Python graphing library with which I’m most familiar.
Matplotlib makes decent graphs by default. If you’re just doing some exploratory plotting for your own purposes, the defaults can be just fine. But when you want your graphs to look better than “just fine,” you need to dig into Matplotlib’s many (many) functions to customize the look.
Here’s yesterday’s graph:
And here’s the code that produced it:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: import matplotlib.pyplot as plt
5: from datetime import datetime
6: from matplotlib.ticker import MultipleLocator
7: from matplotlib.dates import DateFormatter, YearLocator
8:
9: # Import data
10: df = pd.read_csv('unemployment.csv')
11: x = pd.to_datetime(df.Date, format="%b %Y")
12: y = df.Rate
13:
14: # Create the plot with a given size in inches
15: fig, ax = plt.subplots(figsize=(5,5))
16:
17: # Add a line
18: ax.plot(x, y, '-', color='black', lw=2)
19:
20: # Set the limits
21: plt.xlim(xmin=datetime(2014,1,1), xmax=datetime(2025,1,1))
22: plt.ylim(ymin=2, ymax=8)
23:
24: # Set the major and minor ticks and add a grid
25: ax.xaxis.set_major_locator(YearLocator(1))
26: ax.xaxis.set_major_formatter(DateFormatter(' ’%y'))
27: plt.setp(ax.get_xticklabels()[-1], visible=False)
28: ax.yaxis.set_major_locator(MultipleLocator(2))
29: ax.yaxis.set_major_formatter('{x:.0f}%')
30: ax.yaxis.set_minor_locator(MultipleLocator(1))
31: ax.grid(linewidth=.5, axis='y', which='both', color='#dddddd', linestyle='-')
32:
33: # Title and axis labels
34: plt.title('Civilian Unemployment')
35:
36: # Annotations
37: plt.text(datetime(2020, 4, 1), 2.45, "Peak of 14.8% in April 2020", ha='center')
38: plt.arrow(datetime(2020, 4, 1), 2.75, 0, .45, head_width=50, head_length=.25, lw=.75, fc='black', zorder=100)
39:
40: # Make the border and tick marks 0.5 points wide
41: [ i.set_linewidth(0.5) for i in ax.spines.values() ]
42: ax.tick_params(which='both', width=.5)
43:
44: # Save as PNG
45: plt.savefig('20241103-Improved unemployment graph.png', format='png', dpi=200)
Lines 10–12 get the unemployment data from a file, unemployment.csv, and put the values into the variables x and y for later use. As I said in yesterday’s post, the data came from the US Bureau of Labor Statistics. The Bureau’s page is dominated by a graph of civilian unemployment that goes back about 20 years, but there’s also a link you can use to bring up a table of data, which looks like this:
To match the graph in the Chicago Tribune, I selected the data from January 2014 through October 2024, copied it, pasted it into a new Numbers spreadsheet, and exported that as unemployment.csv. Here are the first several lines of that file:
Date,Rate
Jan 2014,6.6
Feb 2014,6.7
Mar 2014,6.7
Apr 2014,6.2
May 2014,6.3
June 2014,6.1
July 2014,6.2
Aug 2014,6.1
Sept 2014,5.9
Oct 2014,5.7
Nov 2014,5.8
Dec 2014,5.6
This is not quite in the form I need. Although most of the months are given with the standard three-letter abbreviations, June, July, and September are not. With the file open in BBEdit, I did this Find and Replace operation:
The regular expression in the Find section is
^(\w\w\w)\w
If you select that line you’ll see there’s a space at the end of it, which is important. The Replace regex is
\1
and there’s a space character at the end of it, too. There may be more clever regexes that will do the job, but as this is a one-off filter, I used what came to mind first. Now the file looks like this
Date,Rate
Jan 2014,6.6
Feb 2014,6.7
Mar 2014,6.7
Apr 2014,6.2
May 2014,6.3
Jun 2014,6.1
Jul 2014,6.2
Aug 2014,6.1
Sep 2014,5.9
Oct 2014,5.7
Nov 2014,5.8
Dec 2014,5.6
and the dates are in a standard format for parsing.
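The same cleanup could be scripted instead of done in BBEdit. Here is a minimal sketch using Python’s re module with the Find and Replace patterns given above. The sample text is just a handful of the lines shown earlier, and the variable names are illustrative.

```python
import re

# A few lines in the format copied from the BLS table
raw = """Date,Rate
May 2014,6.3
June 2014,6.1
July 2014,6.2
Aug 2014,6.1
Sept 2014,5.9
"""

# Find: three word characters, a fourth word character, then a space.
# Replace: the first three characters and the space.
# re.MULTILINE makes ^ match at the start of every line.
fixed = re.sub(r"^(\w\w\w)\w ", r"\1 ", raw, flags=re.MULTILINE)
print(fixed)
```

Lines that already have a three-letter month are untouched because their fourth character is the space, which \w doesn’t match.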
The table is imported into a Pandas data frame in Line 10, and the Date field is converted to datetime objects in Line 11. The %b %Y formatting string is strftime code for “three-letter month abbreviation followed by a space and the four-digit year.” Now you see why I needed to lop the fourth character off of June, July, and Sept.
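To see why the trimming matters, here is a small sketch using datetime.strptime from the standard library, which takes the same formatting codes. It is not what the script does (that’s pd.to_datetime), and it assumes an English locale for the month names.

```python
from datetime import datetime

# %b expects the three-letter abbreviation, so this parses cleanly
parsed = datetime.strptime("Jun 2014", "%b %Y")
print(parsed)   # the day defaults to the 1st of the month

# The untrimmed spelling doesn't match %b followed by a space
try:
    datetime.strptime("June 2014", "%b %Y")
    untrimmed_ok = True
except ValueError:
    untrimmed_ok = False
print(untrimmed_ok)
```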
(There are, by the way, ways to do the date parsing within the read_csv function itself. I seldom do the parsing that way, because I find the parameters to read_csv too complex. I prefer to do the importing in one step and the date parsing in another.)
(To compound parentheticals, I suppose I didn’t need to use the Pandas library just to import a CSV file; there are other ways to do that. But I use Pandas a lot and I’m comfortable with it. My efficiency is more important than library minimalism.)
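For the record, one of those other ways is the csv module in the standard library. This is only a sketch of the alternative, not a change to the script. It reads from an in-memory string holding a few rows of the file so it can run on its own, and it assumes an English locale for the month abbreviations.

```python
import csv
import io
from datetime import datetime

# Stand-in for open('unemployment.csv'), using rows shown above
sample = io.StringIO("Date,Rate\nJan 2014,6.6\nFeb 2014,6.7\nMar 2014,6.7\n")

x, y = [], []
for row in csv.DictReader(sample):
    # Same %b %Y format used with pd.to_datetime in the script
    x.append(datetime.strptime(row["Date"], "%b %Y"))
    y.append(float(row["Rate"]))

print(x[0], y[0])
```

The result is a pair of plain lists rather than Pandas series, which Matplotlib will also plot.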
Line 15 creates a 5″×5″ plot, and Line 18 plots the x and y data on it. The data points will be connected by lines ('-') that are black (color='black') and two points thick (lw=2). From now on, all of the commands will be customizing the appearance of the graph.
Lines 21–22 set the limits of the plot. The limits of the x-axis have to be specified as datetime objects because that’s what the x data is. The ymax parameter in Line 22 is what cuts off the pandemic spike, limiting the graph to unemployment values of 8% and below.
Lines 25–27 set the tick marks for the x-axis. Matplotlib lets you have both major and minor tick marks, but we’re using only major tick marks on the x-axis. They are set one year apart in Line 25. Line 26 establishes how the tick labels will appear. The DateFormatter function takes an argument that uses the same strftime formatting codes we saw earlier. To get an apostrophe followed by the last two digits of the year, the code is ’%y.
But what about all those space characters in the DateFormatter argument? That’s the result of a certain eccentricity of mine. Each tick mark on the x-axis is set at January 1 of that year, or more specifically, at midnight on the morning of January 1.[1] And tick mark labels are centered on the tick. But the year associated with a given tick mark is actually the interval between that tick and the next, so I think the label should go in that interval. What all those spaces do in Line 26 is shift the label over to (about) the midpoint between the ticks.
I seem to be basically alone in this way of formatting date-related labels. Everyone else seems to think it’s just fine to put the label for a year right under the tick associated with January 1 of that year. But I say it’s wrong, and I’m sticking to my guns.
While I may be eccentric, I’m not crazy. If the data set includes just one value for each year, I’m happy to put the year labels right under the tick marks. It’s only when there are multiple values per year that I insist on offset labels.
Line 27 deals with a consequence of my eccentricity. Because the maximum x value is January 1, 2025 (that’s how I get a tick mark at the end of the x-axis), there will be a label for that tick mark, and it will be placed off the end of the axis. Kind of embarrassing. Line 27 eliminates the embarrassment by using the setp function to make the last ([-1]) x-axis tick label invisible.
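Padding with spaces isn’t the only way to get a label into the middle of the year. Another approach is to leave the major ticks unlabeled and hang the labels on minor ticks placed mid-year, with the minor tick marks themselves hidden. The following is a sketch of that alternative, not what the script above does. The July 1 placement and the two dummy data points are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window needed
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib.dates import DateFormatter, MonthLocator, YearLocator
from matplotlib.ticker import NullFormatter

fig, ax = plt.subplots(figsize=(5, 5))

# Dummy data, just to give the axes something to draw
ax.plot([datetime(2014, 1, 1), datetime(2025, 1, 1)], [4, 4], color='black')
ax.set_xlim(datetime(2014, 1, 1), datetime(2025, 1, 1))

# Major ticks on January 1 of each year, with no labels
ax.xaxis.set_major_locator(YearLocator(1))
ax.xaxis.set_major_formatter(NullFormatter())

# Minor ticks on July 1 of each year carry the labels...
ax.xaxis.set_minor_locator(MonthLocator(bymonth=7))
ax.xaxis.set_minor_formatter(DateFormatter('’%y'))

# ...but the minor tick marks themselves are hidden
ax.tick_params(axis='x', which='minor', length=0)

fig.canvas.draw()
minor_labels = [t.get_text() for t in ax.xaxis.get_minorticklabels()]
major_labels = [t.get_text() for t in ax.xaxis.get_majorticklabels()]
```

With the labels on mid-year minor ticks, no label lands at the January 1, 2025 end of the axis, so nothing would need hiding.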
Lines 28–30 handle the ticks and labels for the y-axis. Here, the major tick marks are set every 2% (Line 28), and the minor tick marks are set every 1% (Line 30). By default, the labels on the y-axis would be presented without percent signs, so Line 29 takes care of that by invoking the set_major_formatter function with an argument of '{x:.0f}%'. This code comes from Python’s format string syntax. You may be wondering why it has an x when we’re dealing with the y-axis. In this case, the x is a local placeholder and has nothing to do with the axis whose labels we’re formatting. Just an unfortunate choice of variable name by the Matplotlib developers.
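The placeholder can be seen at work without Matplotlib at all. In this sketch, Python’s str.format method fills in the x field by hand, which is roughly what Matplotlib does with each tick value when it is handed a format string. The list of tick values is just the major ticks from the graph.

```python
# The same string Line 29 passes to set_major_formatter
label_format = '{x:.0f}%'

# Fill in the x field by hand for the major tick values on the y-axis
labels = [label_format.format(x=value) for value in (2.0, 4.0, 6.0, 8.0)]
print(labels)
```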
The last item in this section of the code is Line 31, which tells Matplotlib to put a thin gray horizontal grid aligned with the major and minor tick marks. I have a tendency to use more grid lines than most people. Part of this comes from my belief that they help viewers keep their eyes aligned as they compare values. Another part undoubtedly comes from my using actual graph paper back when I was plotting data in college (long enough ago that students didn’t have computers). I do try to keep the grid unobtrusive.
The annotations that deal with the pandemic spike I cut off are defined in Lines 37–38. The text is centered at April 2020, as is the vertical arrow. Note that the x-coordinates of the positions have to be given as datetime values because that’s how the x-axis is measured. The large value for the zorder parameter in Line 38 ensures that the arrow is drawn above the grid lines.
I have to say I find the head_width and head_length values in Line 38 troublesome. They seem to be given in the same units as the x- and y-axes, respectively, and that’s fine for a vertical arrow. But what should you do if the arrow is drawn at an angle? Discretion being the better part of valor, I didn’t do any experimenting to see how that would work. My ignorance may bite me in the future.
As with the grid lines, I try to make the border and tick marks as unobtrusive as possible by making them thin. That’s what Lines 41–42 do. They’re not gray because the border and ticks shouldn’t be as unobtrusive as the grid.
I guess I lied above when I said that all the code after Line 18 was customization. Line 45 saves the graph as a PNG file. The 200 dpi parameter combines with the 5″×5″ plot size in Line 15 to make the PNG 1000 pixels square.
[1] There are ways to set the tick mark elsewhere in the year, but this is the default, and I like it there.
Rescaling a graph
November 3, 2024 at 10:29 AM by Dr. Drang
The business pages of the Chicago Tribune usually have a little graphic showing a plot of some economic data. There’s no story to go along with the graph; it’s meant to stand on its own. In yesterday’s edition, it was a graph of unemployment over the past decade or so:
Is this a good graph? As with all graphs, that depends on what’s being communicated. If the idea is to show how high unemployment got during the pandemic, it’s a good graph. But because it was published shortly after the monthly data for October came out—you can see the callout of 4.1% for the last data point—I’d say it’s really meant to show the current unemployment and put it in the context of recent history. For that purpose, it’s OK, but the graph’s vertical scale reduces its value. The large range of the y-axis interferes with our ability to see variations during the non-pandemic years.
People often use a log scale when they need to plot over a large range of data, but that wouldn’t be appropriate here. A general audience may have difficulty interpreting a log scale. More important, the range here isn’t due to long-term growth or decline, it’s coming entirely from that spike in 2020.
My suggestion would be to cut the y-axis off at about 8%, put labeled ticks every 2%, and add a note showing where the peak unemployment occurred. Something like this:
I didn’t bother matching the font style of the Tribune’s graph, but I did make sure to use an apostrophe on all the years along the x-axis. Like the Tribune, I got the data from the US Bureau of Labor Statistics.
Is this a huge improvement? No, but it brings the unemployment of the past few years up off the bottom of the graph and makes all the data (except the spike) easier to interpret.
Apple Health trends
November 1, 2024 at 2:30 PM by Dr. Drang
There are some odd things about the way Apple tracks your progress in the Health app. These oddities aren’t new with the recent OS updates. They’ve been a part of the app for as long as I can remember; I’ve just finally gotten around to talking about them.
The oddities are centered around what Apple calls trends. Here, for example, are my last six months of step counts:
If you asked me about the trend of this data, I’d say it’s going up roughly linearly with some scatter. I’d probably try to estimate the slope, which would have the unusual units of “average daily steps per week.” Better than estimation would be to do a linear least squares fit, which could be plotted on top of the bar chart:
I did this in Mathematica Wolfram, but you could do it in any number of apps. Most people would probably use a spreadsheet.
But in Health, Apple shows my step trend this way:
It doesn’t try to fit the best straight line to the data; it fits a step function. Its idea of a trend is discrete rather than continuous: You used to average this many steps per day, now you average this many steps per day.
Apple would probably say this is just as valid a definition of trend as mine, but it sure doesn’t feel that way, especially when the data look like this. There’s no sudden jump in the data anywhere in the period covered by the graph. Whatever ups and downs we see are just scatter.
You may have noticed something funny about the daily average of 17,488 shown in red: it’s higher than all but one of the five weeks it covers. How can that be the daily average of those five weeks? I’m not sure, but I suspect the answer has something to do with how Health defines a week, which runs from Sunday through Saturday. Since I took this screenshot this morning (Friday, November 1), the current week is missing the better part of two days of steps. My guess is that Apple is doing two inconsistent estimations:
- The plotted value for the average of the current week, 15,488 steps per day, is based on my total steps for the week so far divided by 6.
- The trend value, 17,488 steps per day, is based on a projection of what my average daily step count will be at the end of Saturday.
Inconsistency doesn’t bother Apple. You’ve probably noticed that the Fitness app, which is closely allied with Health, defines weeks as going from Monday through Sunday. Whatever.
Getting back to trends, I was surprised to see that the Health app considered my cardio recovery data[1] to have no trend. What do you think?
A variation from 11,000 to 17,000 steps is a trend, but a variation from 20 bpm to 30 bpm is not. Got it. And I’m sure Apple has a very sound reason for plotting the data in this graph as dots while the steps data are plotted as bars. As sound as the reason Fitness and Health use different definitions of a week.
[1] That’s the drop in your heart rate in the first minute after exercise.
Houses of cards and triangular numbers
October 26, 2024 at 2:52 PM by Dr. Drang
November’s issue of Scientific American has a fun little puzzle about houses of cards. I solved it one way and SciAm solved it another, so I thought it worth a quick post.
The houses of cards are arranged in triangular shapes, like this:
This is an end view. SciAm has a nice 3D drawing of the houses, which you may want to look at. If you have an Apple News subscription, SciAm is one of the magazines that comes with it.
I’m showing the tilted “wall” cards in blue and the flat “floor” cards in red to help us with the calculations. The number under each house is the total number of cards needed to build that house. The question is this: Can you build a house like this with 100 cards, and if so, how many stories will it be?
Let’s start by figuring out how many cards of each type there are. Here’s a table for the houses shown above:
Stories | Wall cards | Floor cards | Total cards |
---|---|---|---|
1 | 2 | 0 | 2 |
2 | 6 | 1 | 7 |
3 | 12 | 3 | 15 |
4 | 20 | 6 | 26 |
The number of wall cards goes in this sequence:
2
2 + 4
2 + 4 + 6
2 + 4 + 6 + 8
which we can rewrite as
2 (1)
2 (1 + 2)
2 (1 + 2 + 3)
2 (1 + 2 + 3 + 4)
The sums in parentheses form a familiar sequence. They are the triangular numbers: 1, 3, 6, 10, … I always think of these as Gauss’s numbers because of the (perhaps apocryphal) story of how, as a schoolboy, he figured out how to quickly get the sum of the first hundred natural numbers without actually doing the summation. The sum of the first $n$ natural numbers is the triangular number
$$T_n = \frac{n(n+1)}{2}$$
So the number of wall cards for an $n$-story house is twice the associated triangular number:
$$2T_n = n(n+1)$$
The number of floor cards is also based on triangular numbers, but we have to account for the lack of floor cards on the bottom story. The number of floor cards for an $n$-story house is the $(n-1)$th triangular number:
$$T_{n-1} = \frac{(n-1)n}{2}$$
I suppose it shouldn’t be a surprise to learn that triangular numbers play a role in building triangular houses.
Adding the floor and wall cards together for an $n$-story house, we get
$$n(n+1) + \frac{(n-1)n}{2} = \frac{3n^2 + n}{2}$$
You can confirm this by checking it against the table above.
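That check is easy to automate. Here is a short sketch that builds the wall and floor counts from triangular numbers and compares them, along with the closed form (3n² + n)/2 for the total, against the rows of the table. The function names are illustrative.

```python
def triangular(n):
    """Sum of the first n natural numbers."""
    return n * (n + 1) // 2

def wall_cards(n):
    # Twice the nth triangular number
    return 2 * triangular(n)

def floor_cards(n):
    # The bottom story has no floor, so use the (n-1)th triangular number
    return triangular(n - 1)

def total_cards(n):
    # Closed form for wall_cards(n) + floor_cards(n)
    return (3 * n * n + n) // 2

# (stories, wall, floor, total) rows from the table above
table = [(1, 2, 0, 2), (2, 6, 1, 7), (3, 12, 3, 15), (4, 20, 6, 26)]
for stories, wall, floor, total in table:
    assert wall_cards(stories) == wall
    assert floor_cards(stories) == floor
    assert total_cards(stories) == wall + floor == total
```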
Now that we have the formula, we can answer the question. We solve
$$\frac{3n^2 + n}{2} = 100$$
for $n$. This is a 2nd-degree equation, which we can put in standard form,
$$3n^2 + n - 200 = 0$$
and use the quadratic formula to solve:
$$n = \frac{-1 \pm \sqrt{1^2 - 4(3)(-200)}}{2(3)} = \frac{-1 \pm 49}{6}$$
The positive solution (which is the one we care about) is $n = 8$. Because this is an integer, the full answer is Yes, we can build a house like this with 100 cards. It will have 8 stories.
(Since we all have computers, another obvious solution is to make a spreadsheet with one column of stories and another with the formula. When the number in the stories column is 8, the number in the cards column will be 100.)
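Both routes to the answer fit in a few lines of Python. This sketch applies the quadratic formula to 3n² + n − 200 = 0 using integer arithmetic, then does the spreadsheet-style scan. The variable names and the choice of ten rows are illustrative.

```python
from math import isqrt

target = 100

# Quadratic formula for 3n^2 + n - 2*target = 0:
# the discriminant is 1 + 24*target, and n = (-1 + sqrt(discriminant)) / 6
discriminant = 1 + 24 * target
root = isqrt(discriminant)
is_buildable = root * root == discriminant and (root - 1) % 6 == 0
stories = (root - 1) // 6 if is_buildable else None

# The spreadsheet approach: one row per story count, scan for the target
rows = [(n, (3 * n * n + n) // 2) for n in range(1, 11)]
matches = [n for n, cards in rows if cards == target]

print(stories, matches)
```

Changing target to a number of cards that doesn’t correspond to a complete house leaves stories as None and matches empty.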
When I opened the link to see SciAm’s solution, I was surprised to see that they didn’t do it my way. Instead, they used the table of stories and cards to take differences, like this:
The 1st differences come from subtracting sequential elements in the Cards column:
$$7 - 2 = 5, \qquad 15 - 7 = 8, \qquad 26 - 15 = 11$$
Similarly, the 2nd differences come from subtracting sequential elements in the 1st differences column:
$$8 - 5 = 3, \qquad 11 - 8 = 3$$
(The SciAm article includes a 5-story house, which has 40 cards, so it also includes another set of differences. I didn’t bother with the fifth house because I figured 4 stories was enough to get the idea across.)
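The difference columns are simple to generate. Here is a minimal sketch, with an illustrative helper name, that takes the Cards column for houses of 1 to 4 stories and produces the 1st and 2nd differences.

```python
def differences(values):
    """Subtract each element from the one after it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

cards = [2, 7, 15, 26]          # Cards column for 1 to 4 stories
first = differences(cards)
second = differences(first)
print(first, second)
```

Appending 40 for the 5-story house adds one more entry to each difference column, and the 2nd differences stay constant.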
Since the 2nd differences are constant, the formula for the number of cards must be of the 2nd degree. Differences are like derivatives, so if the 2nd differences are constant, so must be the 2nd derivative of the underlying equation. Therefore, the highest powered term in the formula for the number of cards must be a multiple of $n^2$.
In general, a 2nd degree formula will be of the form
$$an^2 + bn + c$$
and we can figure out the three coefficients by plugging in three values for $n$ and solving the resulting three simultaneous equations:
$$\begin{aligned} a + b + c &= 2 \\ 4a + 2b + c &= 7 \\ 9a + 3b + c &= 15 \end{aligned}$$
The solution is
$$a = \frac{3}{2}, \quad b = \frac{1}{2}, \quad c = 0$$
which, of course, matches the solution we got using triangular numbers.
While it was very easy for me to write “The solution is,” solving three simultaneous equations by hand is actually a pain in the ass, so I don’t much care for SciAm’s solution. Too much effort.
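For reference, here is one way the elimination can go, assuming the three equations come from the 1-, 2-, and 3-story houses (card counts 2, 7, and 15) and writing $a$, $b$, and $c$ for the coefficients of $n^2$, $n$, and the constant term. Subtracting neighboring equations removes $c$, and subtracting again removes $b$:

```latex
\begin{aligned}
(4a + 2b + c) - (a + b + c) = 7 - 2 \quad &\Rightarrow \quad 3a + b = 5 \\
(9a + 3b + c) - (4a + 2b + c) = 15 - 7 \quad &\Rightarrow \quad 5a + b = 8 \\
(5a + b) - (3a + b) = 8 - 5 \quad &\Rightarrow \quad 2a = 3, \quad a = \tfrac{3}{2} \\
b = 5 - 3a = \tfrac{1}{2}, \qquad c &= 2 - a - b = 0
\end{aligned}
```

The right-hand sides that appear along the way (5, 8, and then 3) are the same numbers that sit in the 1st and 2nd difference columns.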
But I’ve never spent much time doing difference calculations, and looking at the difference table, I wonder whether there’s an easier way to get the coefficients from it. I’m sure it’s no coincidence that the 2nd derivative of
$$\frac{3}{2}n^2 + \frac{1}{2}n$$
is 3, the same as the 2nd differences in the table. So SciAm could have figured out that the value of $a$ is $3/2$ straight from that. And I suspect there are similarly clever ways to work out the values of $b$ and $c$ directly from the difference table without going through the labor of simultaneous equations. Maybe I should spend some time learning more about difference tables.
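One shortcut of that kind is Newton’s forward difference formula, which builds the polynomial from the first entry of each column in the difference table. What follows is a sketch for the quadratic case only. The names f1, d1, and d2 are illustrative and stand for the first entries of the Cards, 1st difference, and 2nd difference columns (2, 5, and 3 here).

```python
from fractions import Fraction

def quadratic_from_differences(f1, d1, d2):
    """Coefficients (a, b, c) of a*n**2 + b*n + c, expanded from
    f(n) = f1 + d1*(n - 1) + d2*(n - 1)*(n - 2)/2."""
    a = Fraction(d2, 2)
    b = d1 - 3 * a
    c = f1 - d1 + d2
    return a, b, c

a, b, c = quadratic_from_differences(2, 5, 3)
print(a, b, c)

# Check against the table, plus the 8-story answer
for n, cards in [(1, 2), (2, 7), (3, 15), (4, 26), (8, 100)]:
    assert a * n * n + b * n + c == cards
```

Fractions keep the arithmetic exact, so the coefficients come out as 3/2 and 1/2 rather than floating point approximations.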