Semi-automated plotting
November 6, 2024 at 11:18 PM by Dr. Drang
The Matplotlib code in the last post was initially generated with a Typinator abbreviation that I tweaked to make the final script. After writing the post, I decided it would be nice to have a second, similar abbreviation. This post shows you both.
Typinator is like its more well-known competitor, TextExpander, except it hasn’t shifted its focus to “the enterprise,” doesn’t require a subscription, and doesn’t advertise on podcasts. Although the abbreviations discussed here are built for Typinator, they could be easily adapted to TextExpander, TypeIt4Me, or Keyboard Maestro. (Keyboard Maestro isn’t, strictly speaking, an abbreviation utility, but it can be used as one. In fact, the first abbreviation we’ll see started life as a Keyboard Maestro macro.)
For several years I’ve had an abbreviation[1] called ;plot, which, when invoked, brings up a little window that looks like this:
After you enter the title and axis labels, it inserts the following text:
python:
1: #!/usr/bin/env python3
2:
3: import matplotlib.pyplot as plt
4: from matplotlib.ticker import MultipleLocator, AutoMinorLocator
5:
6: # Create the plot with a given size in inches
7: fig, ax = plt.subplots(figsize=(6, 4))
8:
9: # Add a line
10: ax.plot(x, y, '-', color='blue', lw=2, label='Item one')
11:
12: # Set the limits
13: # plt.xlim(xmin=0, xmax=100)
14: # plt.ylim(ymin=0, ymax=50)
15:
16: # Set the major and minor ticks and add a grid
17: # ax.xaxis.set_major_locator(MultipleLocator(20))
18: # ax.xaxis.set_minor_locator(AutoMinorLocator(2))
19: # ax.yaxis.set_major_locator(MultipleLocator(10))
20: # ax.yaxis.set_minor_locator(AutoMinorLocator(5))
21: # ax.grid(linewidth=.5, axis='x', which='major', color='#dddddd', linestyle='-')
22: # ax.grid(linewidth=.5, axis='y', which='major', color='#dddddd', linestyle='-')
23:
24: # Title and axis labels
25: plt.title('Sample')
26: plt.xlabel('Date')
27: plt.ylabel('Value')
28:
29: # Make the border and tick marks 0.5 points wide
30: [ i.set_linewidth(0.5) for i in ax.spines.values() ]
31: ax.tick_params(which='both', width=.5)
32:
33: # Add the legend
34: # ax.legend()
35:
36: # Save as PDF
37: plt.savefig('20241106-Sample.pdf', format='pdf')
This is not quite a full program for producing a graph—the x and y data aren’t defined—but it’s a good start. You’ll note that the items we entered in the input fields show up in Lines 25–27 and 37.
Much of the code is commented out. As I said in the previous post, Matplotlib has decent defaults for many graph features, but I usually like to customize some of them. The commented code is there to remind me of the functions that are needed for some common customizations. I uncomment the lines for the things I want to customize and adjust the function parameters as needed.
The abbreviation is defined with this as the text that replaces ;plot:
1: #!/usr/bin/env python3
2:
3: import matplotlib.pyplot as plt
4: from matplotlib.ticker import MultipleLocator, AutoMinorLocator
5:
6: # Create the plot with a given size in inches
7: fig, ax = plt.subplots(figsize=(6, 4))
8:
9: # Add a line
10: ax.plot(x, y, '-', color='blue', lw=2, label='Item one')
11:
12: # Set the limits
13: # plt.xlim(xmin=0, xmax=100)
14: # plt.ylim(ymin=0, ymax=50)
15:
16: # Set the major and minor ticks and add a grid
17: # ax.xaxis.set_major_locator(MultipleLocator(20))
18: # ax.xaxis.set_minor_locator(AutoMinorLocator(2))
19: # ax.yaxis.set_major_locator(MultipleLocator(10))
20: # ax.yaxis.set_minor_locator(AutoMinorLocator(5))
21: # ax.grid(linewidth=.5, axis='x', which='major', color='#dddddd', linestyle='-')
22: # ax.grid(linewidth=.5, axis='y', which='major', color='#dddddd', linestyle='-')
23:
24: # Title and axis labels
25: plt.title('{{?Plot title}}')
26: plt.xlabel('{{?X label}}')
27: plt.ylabel('{{?Y label}}')
28:
29: # Make the border and tick marks 0.5 points wide
30: [ i.set_linewidth(0.5) for i in ax.spines.values() ]
31: ax.tick_params(which='both', width=.5)
32:
33: # Add the legend
34: # ax.legend()
35:
36: # Save as PDF
37: plt.savefig('{YYYY}{MM}{DD}-{{?Plot title}}.pdf', format='pdf')
Virtually all the text is expanded verbatim. The exceptions are the plot title and axis labels in Lines 25–27 and the filename in Line 37. As you can see, Typinator uses special codes with curly braces to insert the non-verbatim material. For example, {{?Plot title}} in Lines 25 and 37 tells Typinator to ask for input and inserts the text provided in the Plot title field. Similarly, {YYYY}{MM}{DD} in Line 37 inserts the four-digit year, the two-digit month, and the two-digit day. The last two will have leading zeros, if necessary.
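To make the substitution concrete, here is a minimal Python sketch of what the expansion of Line 37 amounts to. It is only an imitation: the expand function and its arguments are hypothetical names made up for illustration, and Typinator itself surely works differently under the hood.

```python
from datetime import date

# Hypothetical stand-in for Typinator's expansion; not Typinator's actual code.
def expand(template, answers, today):
    """Fill in the date markers and input fields in a template string."""
    text = template
    # Date markers: four-digit year, zero-padded two-digit month and day
    text = text.replace('{YYYY}', f'{today.year:04d}')
    text = text.replace('{MM}', f'{today.month:02d}')
    text = text.replace('{DD}', f'{today.day:02d}')
    # Input fields: every {{?Name}} is replaced by the text entered for Name
    for name, value in answers.items():
        text = text.replace('{{?' + name + '}}', value)
    return text

template = "plt.savefig('{YYYY}{MM}{DD}-{{?Plot title}}.pdf', format='pdf')"
line = expand(template, {'Plot title': 'Sample'}, date(2024, 11, 6))
print(line)
```

With a title of Sample and a date of November 6, 2024, this reproduces the savefig line shown in the expanded script.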
You don’t have to remember the special curly brace syntax. When you’re defining an abbreviation, Typinator lets you choose the special marker code from a popup menu.
As you can see, there are submenus for the various date and time formats. When you choose an input field, you get this dialog to define the input and what kind of data it can hold:
Note that there’s a way to define a default value. That can be helpful in filling out the input fields quickly. What’s also helpful is that if a field doesn’t have a default value, Typinator remembers what you entered the last time the abbreviation was used and opens the input window with those values in the appropriate fields.
Although ;plot was helpful in writing the code for the last post, I realized when I was done with all the customizations that it wasn’t especially helpful with the x-axis. Matplotlib has different functions for customizing a date and time axis, and none of those functions are in ;plot.
I thought about adding commented-out lines with date and time functions to ;plot, but decided it would be cleaner to have a completely different abbreviation, ;dateplot, for date and time series plotting. Here’s its definition:
1: #!/usr/bin/env python3
2:
3: from datetime import datetime
4: import matplotlib.pyplot as plt
5: from matplotlib.ticker import MultipleLocator, AutoMinorLocator
6: from matplotlib.dates import DateFormatter, YearLocator, MonthLocator
7:
8: # Create the plot with a given size in inches
9: fig, ax = plt.subplots(figsize=(6, 4))
10:
11: # Add a line
12: ax.plot(x, y, '-', color='blue', lw=2, label='Item one')
13:
14: # Set the limits
15: # plt.xlim(xmin=datetime(2010,1,1), xmax=datetime(2016,1,1))
16: # plt.ylim(ymin=0, ymax=50)
17:
18: # Set the major and minor ticks and add a grid
19: # ax.xaxis.set_major_locator(YearLocator(1))
20: # ax.xaxis.set_minor_locator(MonthLocator())
21: # ax.xaxis.set_major_formatter(DateFormatter('%-m/%Y'))
22: # ax.yaxis.set_major_locator(MultipleLocator(10))
23: # ax.yaxis.set_minor_locator(AutoMinorLocator(5))
24: # ax.grid(linewidth=.5, axis='x', which='major', color='#dddddd', linestyle='-')
25: # ax.grid(linewidth=.5, axis='y', which='major', color='#dddddd', linestyle='-')
26:
27: # Title and axis labels
28: plt.title('{{?Plot title}}')
29: plt.xlabel('{{?X label}}')
30: plt.ylabel('{{?Y label}}')
31:
32: # Make the border and tick marks 0.5 points wide
33: [ i.set_linewidth(0.5) for i in ax.spines.values() ]
34: ax.tick_params(which='both', width=.5)
35:
36: # Add the legend
37: # ax.legend()
38:
39: # Save as PDF
40: plt.savefig('{YYYY}{MM}{DD}-{{?Plot title}}.pdf', format='pdf')
It imports a few more libraries near the top, uses datetime data to define the x-axis limits on Line 15, and has date-related functions in the code on Lines 19–21 for customizing the x-axis ticks and tick labels. Otherwise, it’s the same as ;plot.
With this new abbreviation, I should be able to build time series graphs more quickly than before.
[1] You may know them as “snippets,” because that’s the term TextExpander uses. Typinator uses “abbreviations,” so that’s what I’ll call them.
Ticks tricks
November 4, 2024 at 3:46 PM by Dr. Drang
As I often do after a post with a graph, I’m going to follow up by showing how I made it. In this case, the plotting was done with Matplotlib, the Python graphing library with which I’m most familiar.
Matplotlib makes decent graphs by default. If you’re just doing some exploratory plotting for your own purposes, the defaults can be just fine. But when you want your graphs to look better than “just fine,” you need to dig into Matplotlib’s many (many) functions to customize the look.
Here’s yesterday’s graph:
And here’s the code that produced it:
python:
1: #!/usr/bin/env python3
2:
3: import pandas as pd
4: import matplotlib.pyplot as plt
5: from datetime import datetime
6: from matplotlib.ticker import MultipleLocator
7: from matplotlib.dates import DateFormatter, YearLocator
8:
9: # Import data
10: df = pd.read_csv('unemployment.csv')
11: x = pd.to_datetime(df.Date, format="%b %Y")
12: y = df.Rate
13:
14: # Create the plot with a given size in inches
15: fig, ax = plt.subplots(figsize=(5,5))
16:
17: # Add a line
18: ax.plot(x, y, '-', color='black', lw=2)
19:
20: # Set the limits
21: plt.xlim(xmin=datetime(2014,1,1), xmax=datetime(2025,1,1))
22: plt.ylim(ymin=2, ymax=8)
23:
24: # Set the major and minor ticks and add a grid
25: ax.xaxis.set_major_locator(YearLocator(1))
26: ax.xaxis.set_major_formatter(DateFormatter(' ’%y'))
27: plt.setp(ax.get_xticklabels()[-1], visible=False)
28: ax.yaxis.set_major_locator(MultipleLocator(2))
29: ax.yaxis.set_major_formatter('{x:.0f}%')
30: ax.yaxis.set_minor_locator(MultipleLocator(1))
31: ax.grid(linewidth=.5, axis='y', which='both', color='#dddddd', linestyle='-')
32:
33: # Title and axis labels
34: plt.title('Civilian Unemployment')
35:
36: # Annotations
37: plt.text(datetime(2020, 4, 1), 2.45, "Peak of 14.8% in April 2020", ha='center')
38: plt.arrow(datetime(2020, 4, 1), 2.75, 0, .45, head_width=50, head_length=.25, lw=.75, fc='black', zorder=100)
39:
40: # Make the border and tick marks 0.5 points wide
41: [ i.set_linewidth(0.5) for i in ax.spines.values() ]
42: ax.tick_params(which='both', width=.5)
43:
44: # Save as PNG
45: plt.savefig('20241103-Improved unemployment graph.png', format='png', dpi=200)
Lines 10–12 get the unemployment data from a file, unemployment.csv, and put the values into the variables x and y for later use. As I said in yesterday’s post, the data came from the US Bureau of Labor Statistics. The Bureau’s page is dominated by a graph of civilian unemployment that goes back about 20 years, but there’s also a link you can use to bring up a table of data, which looks like this:
To match the graph in the Chicago Tribune, I selected the data from January 2014 through October 2024, copied it, and pasted it into a new Numbers spreadsheet. The data ended up in a CSV file, unemployment.csv. Here are the first several lines of that file:
Date,Rate
Jan 2014,6.6
Feb 2014,6.7
Mar 2014,6.7
Apr 2014,6.2
May 2014,6.3
June 2014,6.1
July 2014,6.2
Aug 2014,6.1
Sept 2014,5.9
Oct 2014,5.7
Nov 2014,5.8
Dec 2014,5.6
This is not quite in the form I need. Although most of the months are given with the standard three-letter abbreviations, June, July, and September are not. With the file open in BBEdit, I did this Find and Replace operation:
The regular expression in the Find section is
^(\w\w\w)\w
If you select that line you’ll see there’s a space at the end of it, which is important. The Replace regex is
\1
and there’s a space character at the end of it, too. There may be more clever regexes that will do the job, but as this is a one-off filter, I used what came to mind first. Now the file looks like this
Date,Rate
Jan 2014,6.6
Feb 2014,6.7
Mar 2014,6.7
Apr 2014,6.2
May 2014,6.3
Jun 2014,6.1
Jul 2014,6.2
Aug 2014,6.1
Sep 2014,5.9
Oct 2014,5.7
Nov 2014,5.8
Dec 2014,5.6
and the dates are in a standard format for parsing.
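The same cleanup could be sketched in Python with the re module instead of BBEdit. This is just an illustration on a few sample rows, not what was actually done. One assumption to note: Python needs the re.MULTILINE flag for ^ to match at the start of every line.

```python
import re

# A few rows in the form they were pasted, with the nonstandard month names
raw = """Date,Rate
May 2014,6.3
June 2014,6.1
July 2014,6.2
Aug 2014,6.1
Sept 2014,5.9
"""

# Same pattern and replacement as the Find and Replace above; both end
# with a space. The header line is untouched because "Date" is followed
# by a comma, not a space.
cleaned = re.sub(r'^(\w\w\w)\w ', r'\1 ', raw, flags=re.MULTILINE)
print(cleaned)
```

Three-letter months like May and Aug don't match because their fourth character is the space itself, so only June, July, and Sept are shortened.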
The table is imported into a Pandas data frame in Line 10, and the Date field is converted to datetime objects in Line 11. The %b %Y formatting string is strftime code for “three-letter month abbreviation followed by a space and the four-digit year.” Now you see why I needed to lop the fourth character off of June, July, and Sept.
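A quick way to see the format string at work is the standard library's datetime.strptime, used here as a stand-in for the Pandas call in Line 11 (the two aren't guaranteed to behave identically). The sketch assumes an English locale, since %b depends on the locale's month abbreviations.

```python
from datetime import datetime

fmt = '%b %Y'

# A three-letter abbreviation parses; the day defaults to the 1st
parsed = datetime.strptime('Jun 2014', fmt)
print(parsed)

# The longer spellings from the original table are rejected
failures = []
for text in ('June 2014', 'Sept 2014'):
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        failures.append(text)
print(failures)
```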
(There are, by the way, ways to do the date parsing within the read_csv function itself. I seldom do the parsing that way, because I find the parameters to read_csv too complex. I prefer to do the importing in one step and the date parsing in another.)
(To compound parentheticals, I suppose I didn’t need to use the Pandas library just to import a CSV file; there are other ways to do that. But I use Pandas a lot and I’m comfortable with it. My efficiency is more important than library minimalism.)
Line 15 creates a 5″×5″ plot, and Line 18 plots the x and y data on it. The data points will be connected by lines ('-') that are black (color='black') and two points thick (lw=2). From now on, all of the commands will be customizing the appearance of the graph.
Lines 21–22 set the limits of the plot. The limits of the x-axis have to be specified as datetime objects because that’s what the x data is. The ymax parameter in Line 22 is what cuts off the pandemic spike, limiting the graph to unemployment values of 8% and below.
Lines 25–27 set the tick marks for the x-axis. Matplotlib lets you have both major and minor tick marks, but we’re using only major tick marks on the x-axis. They are set one year apart in Line 25. Line 26 establishes how the tick labels will appear. The DateFormatter function takes an argument that uses the same strftime formatting codes we saw earlier. To get an apostrophe followed by the last two digits of the year, the code is ’%y.
But what about all those space characters in the DateFormatter argument? That’s the result of a certain eccentricity of mine. Each tick mark on the x-axis is set at January 1 of that year (more specifically, at midnight on the morning of January 1).[1] And tick mark labels are centered on the tick. But the year associated with a given tick mark is actually the interval between that tick and the next, so I think the label should go in that interval. What all those spaces do in Line 26 is shift the label over to (about) the midpoint between the ticks.
I seem to be basically alone in this way of formatting date-related labels. Everyone else seems to think it’s just fine to put the label for a year right under the tick associated with January 1 of that year. But I say it’s wrong, and I’m sticking to my guns.
While I may be eccentric, I’m not crazy. If the data set includes just one value for each year, I’m happy to put the year labels right under the tick marks. It’s only when there are multiple values per year that I insist on offset labels.
Line 27 deals with a consequence of my eccentricity. Because the maximum x value is January 1, 2025 (that’s how I get a tick mark at the end of the x-axis), there will be a label for that tick mark, and it will be placed off the end of the axis. Kind of embarrassing. Line 27 eliminates the embarrassment by using the setp function to make the last ([-1]) x-axis tick label invisible.
Lines 28–30 handle the ticks and labels for the y-axis. Here, the major tick marks are set every 2% (Line 28), and the minor tick marks are set every 1% (Line 30). By default, the labels on the y-axis would be presented without percent signs, so Line 29 takes care of that by invoking the set_major_formatter function with an argument of '{x:.0f}%'. This code comes from Python’s format string syntax. You may be wondering why it has an x when we’re dealing with the y-axis. In this case, the x is a local placeholder and has nothing to do with the axis whose labels we’re formatting. Just an unfortunate choice of variable name by the Matplotlib developers.
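The format string can be tried out on its own with plain str.format, without any plotting. The sketch below assumes that Matplotlib fills the string by passing the tick value as x and the tick position as pos, which is my understanding of how a string formatter is handled. The tick values are simply the major ticks from Line 28.

```python
fmt = '{x:.0f}%'

# Tick values for major ticks every 2% between the limits of 2 and 8
ticks = [2.0, 4.0, 6.0, 8.0]

# x is only the name of the format field; pos is supplied but unused here
labels = [fmt.format(x=value, pos=i) for i, value in enumerate(ticks)]
print(labels)  # ['2%', '4%', '6%', '8%']
```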
The last item in this section of the code is Line 31, which tells Matplotlib to put a thin gray horizontal grid aligned with the major and minor tick marks. I have a tendency to use more grid lines than most people. Part of this comes from my belief that they help viewers keep their eyes aligned as they compare values. Another part undoubtedly comes from my using actual graph paper back when I was plotting data in college (long enough ago that students didn’t have computers). I do try to keep the grid unobtrusive.
The annotations that deal with the pandemic spike I cut off are defined in Lines 37–38. The text is centered at April 2020, as is the vertical arrow. Note that the x-coordinates of the positions have to be given as datetime values because that’s how the x-axis is measured. The large value for the zorder parameter in Line 38 ensures that the arrow is drawn above the grid lines.
I have to say I find the head_width and head_length values in Line 38 troublesome. They seem to be given in the same units as the x- and y-axes, respectively, and that’s fine for a vertical arrow. But what should you do if the arrow is drawn at an angle? Discretion being the better part of valor, I didn’t do any experimenting to see how that would work. My ignorance may bite me in the future.
As with the grid lines, I try to make the border and tick marks as unobtrusive as possible by making them thin. That’s what Lines 41–42 do. They’re not gray because the border and ticks shouldn’t be as unobtrusive as the grid.
I guess I lied above when I said that all the code after Line 18 was customization. Line 45 saves the graph as a PNG file. The 200 dpi parameter combines with the 5″×5″ plot size in Line 15 to make the PNG 1000 pixels square.
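The pixel arithmetic is just inches times dots per inch. Here is a tiny sketch, assuming the figure is saved at its full size with no cropping:

```python
width_in, height_in = 5, 5   # figsize from Line 15, in inches
dpi = 200                    # dpi from Line 45

# Pixels = inches x dots per inch
width_px, height_px = width_in * dpi, height_in * dpi
print(width_px, height_px)   # 1000 1000
```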
[1] There are ways to set the tick mark elsewhere in the year, but this is the default, and I like it there.
Rescaling a graph
November 3, 2024 at 10:29 AM by Dr. Drang
The business pages of the Chicago Tribune usually have a little graphic showing a plot of some economic data. There’s no story to go along with the graph; it’s meant to stand on its own. In yesterday’s edition, it was a graph of unemployment over the past decade or so:
Is this a good graph? As with all graphs, that depends on what’s being communicated. If the idea is to show how high unemployment got during the pandemic, it’s a good graph. But because it was published shortly after the monthly data for October came out—you can see the callout of 4.1% for the last data point—I’d say it’s really meant to show the current unemployment and put it in the context of recent history. For that purpose, it’s OK, but the graph’s vertical scale reduces its value. The large range of the y-axis interferes with our ability to see variations during the non-pandemic years.
People often use a log scale when they need to plot over a large range of data, but that wouldn’t be appropriate here. A general audience may have difficulty interpreting a log scale. More important, the range here isn’t due to long-term growth or decline, it’s coming entirely from that spike in 2020.
My suggestion would be to cut the y-axis off at about 8%, put labeled ticks every 2%, and add a note showing where the peak unemployment occurred. Something like this:
I didn’t bother matching the font style of the Tribune’s graph, but I did make sure to use an apostrophe on all the years along the x-axis. Like the Tribune, I got the data from the US Bureau of Labor Statistics.
Is this a huge improvement? No, but it brings the unemployment of the past few years up off the bottom of the graph and makes all the data (except the spike) easier to interpret.
Apple Health trends
November 1, 2024 at 2:30 PM by Dr. Drang
There are some odd things about the way Apple tracks your progress in the Health app. These oddities aren’t new with the recent OS updates. They’ve been a part of the app for as long as I can remember; I’ve just finally gotten around to talking about them.
The oddities are centered around what Apple calls trends. Here, for example, are my last six months of step counts:
If you asked me about the trend of this data, I’d say it’s going up roughly linearly with some scatter. I’d probably try to estimate the slope, which would have the unusual units of “average daily steps per week.” Better than estimation would be to do a linear least squares fit, which could be plotted on top of the bar chart:
I did this in Mathematica Wolfram, but you could do it in any number of apps. Most people would probably use a spreadsheet.
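For anyone who'd rather see the arithmetic than open an app, here is a minimal least squares sketch in plain Python. The step counts are made-up numbers of roughly the right size, not my Health data, and linear_fit is a hypothetical helper written for this example.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical weekly averages of daily steps (illustrative values only)
weeks = list(range(8))
steps = [11000, 11900, 12600, 13800, 14300, 15500, 16100, 17200]

slope, intercept = linear_fit(weeks, steps)
print(f'{slope:.0f} average daily steps per week, starting near {intercept:.0f}')
```

The slope comes out in those unusual units of average daily steps per week.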
But in Health, Apple shows my step trend this way:
It doesn’t try to fit the best straight line to the data; it fits a step function. Its idea of a trend is discrete rather than continuous: You used to average this many steps per day, now you average this many steps per day.
Apple would probably say this is just as valid a definition of trend as mine, but it sure doesn’t feel that way, especially when the data look like this. There’s no sudden jump in the data anywhere in the period covered by the graph. Whatever ups and downs we see are just scatter.
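A step-function "trend" of this kind can be imitated by splitting the data into an earlier stretch and a recent stretch and averaging each. This is only a guess at the general idea. How Health actually picks the split and computes the averages isn't documented here, and both the step_trend helper and the numbers are hypothetical.

```python
def step_trend(values, recent):
    """Return (earlier mean, recent mean), treating the last `recent` values as recent."""
    earlier, latest = values[:-recent], values[-recent:]
    return sum(earlier) / len(earlier), sum(latest) / len(latest)

# Hypothetical weekly averages of daily steps (illustrative values only)
steps = [11000, 11900, 12600, 13800, 14300, 15500, 16100, 17200]

before, now = step_trend(steps, recent=3)
print(f'used to average {before:.0f}, now average {now:.0f}')
```

Data that rise steadily, like these, still get reported as two flat levels, which is the complaint.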
You may have noticed something funny about the daily average of 17,488 shown in red: it’s higher than all but one of the five weeks it covers. How can that be the daily average of those five weeks? I’m not sure, but I suspect the answer has something to do with how Health defines a week, which runs from Sunday through Saturday. Since I took this screenshot this morning (Friday, November 1), the current week is missing the better part of two days of steps. My guess is that Apple is doing two inconsistent estimations:
- The plotted value for the average of the current week, 15,488 steps per day, is based on my total steps for the week so far divided by 6.
- The trend value, 17,488 steps per day, is based on a projection of what my average daily step count will be at the end of Saturday.
Inconsistency doesn’t bother Apple. You’ve probably noticed that the Fitness app, which is closely allied with Health, defines weeks as going from Monday through Sunday. Whatever.
Getting back to trends, I was surprised to see that the Health app considered my cardio recovery data[1] to have no trend. What do you think?
A variation from 11,000 to 17,000 steps is a trend, but a variation from 20 bpm to 30 bpm is not. Got it. And I’m sure Apple has a very sound reason for plotting the data in this graph as dots while the steps data are plotted as bars. As sound as the reason Fitness and Health use different definitions of a week.
[1] That’s the drop in your heart rate in the first minute after exercise.