Exponential growth and log scales

Exponential growth can be represented this way:

$$y = A e^{Cx}$$

where $A$ and $C$ are constants. The base of the exponential function doesn’t matter. We could just as easily express this growth using base-10 logs,

$$y = A \cdot 10^{Dx}$$

where

$$D = C \log_{10} e$$

Any base can be used to represent exponential growth; it’s just a matter of scaling.
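A quick numerical check of that scaling, as a minimal sketch. The values of `A`, `C`, and the sample `x` values are arbitrary assumptions chosen only for illustration; they don't come from any data in this post.

```python
from math import e, exp, log10, isclose

# Arbitrary illustrative constants (assumptions, not data from this post)
A = 3.0
C = 0.7

# Convert the natural-base rate C to the equivalent base-10 rate D
D = C * log10(e)

# Both forms should give the same y at every x
for x in [0.0, 1.0, 2.5, 10.0]:
    y_base_e = A * exp(C * x)
    y_base_10 = A * 10 ** (D * x)
    assert isclose(y_base_e, y_base_10, rel_tol=1e-9)
```

The same conversion works for any other base `b`: multiply `C` by the base-`b` log of `e`.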

Data that are thought to display exponential growth are commonly plotted on “semilog” graphs, in which one of the scales is logarithmic and the other is linear. To see why, let’s take the base-10 logarithm of both sides of the second equation.

$$\log_{10} y = \log_{10} \left( A \cdot 10^{Dx} \right) = \log_{10} A + D x$$

This means that if we plot $\log_{10} y$ vs. $x$, we’ll get a straight line with slope $D$ and intercept $\log_{10} A$. This is the same as plotting $y$ on a base-10 log scale and $x$ on a linear scale.
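To see the straight line numerically, here is a small sketch that takes the base-10 log of a made-up exponential and checks the slope and intercept. The constants `A` and `D` are arbitrary assumptions for illustration.

```python
from math import log10, isclose

# Arbitrary illustrative constants (assumptions)
A = 5.0
D = 0.3

# Evenly spaced x values and the log of the corresponding y values
xs = [0, 1, 2, 3, 4]
log_ys = [log10(A * 10 ** (D * x)) for x in xs]

# The value at x = 0 is the intercept; because the x values are one unit
# apart, each step between successive logs is the slope
intercept = log_ys[0]
slopes = [b - a for a, b in zip(log_ys, log_ys[1:])]

print(intercept, slopes)
```

Every entry in `slopes` comes out equal to `D` (to within floating-point error), and `intercept` equals the base-10 log of `A`.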

I chose to give this example using 10 as the base because that’s the default for logarithmic plotting in spreadsheets and other plotting tools. If you check the “logarithmic” scale box in your spreadsheet’s graphing controls, you’ll get a base-10 scale on that axis with equally spaced major ticks at powers of 10: 1, 10, 100, 1000, and so on. And if you’re using Numbers, that’s all you’ll get; it has no way of setting the base or the positioning of the ticks. Excel does give you that kind of control, but not in the iOS version.

This use of base-10 is, I think, a holdover from the days when we plotted data on graph paper. Semilog and log-log graph paper were always laid out with their logarithmic axes gridded using base-10 logs. This was convenient because it would have been expensive to make paper for a wide range of bases, and as we’ve seen, all bases are effectively equivalent—it’s just a matter of scaling.

As is often the case when plotting with software, sticking with the defaults can be good for exploratory work, but not so good for presentation. With a little work, we can use whatever base we think is helpful to tell the story of the data, and we don’t have to restrict our major ticks and gridlines to powers of the base.

For example, in data that shows exponential growth, people are often interested in the “doubling time,” the interval between successive doublings of whatever it is that’s being measured. To make the doubling time easy to see on the graph, we can use a base-2 log scale.

Here’s a semilog chart of exponential growth that’s of particular interest right now: the cumulative number of confirmed COVID-19 cases in the United States. The data came from this Google Doc, which is being maintained by a team of people interested in gathering and providing data from all the states, a job you might expect the CDC to be doing. My spot checks over the last few days suggest that it runs slightly below the better-known Johns Hopkins data set. Still, I think it’s worth using because it maintains a history, which I’ve never been able to find in the Johns Hopkins data.

I put the data, last updated at 4:00 PM Eastern today, in a CSV file:

Date,Count
2020-03-04,118
2020-03-05,176
2020-03-06,223
2020-03-07,341
2020-03-08,417
2020-03-09,584
2020-03-10,778
2020-03-11,1053
2020-03-12,1315
2020-03-13,1922
2020-03-14,2450
2020-03-15,3173
2020-03-16,4019

Here’s the semilog plot I made, using a base-2 log scale for the vertical axis.

[Figure: US COVID-19 growth, semilog plot with a base-2 vertical axis]

The raw counts are plotted as orange circles and the dashed line is the best linear fit. As you can see, the linear fit is very good, so the growth has been consistently exponential over the past couple of weeks. Because the major horizontal gridlines are one doubling apart from one another, it’s fairly easy to see that the doubling time is between 2 and 3 days.

The more accurate value given in the text comment came from the linear fit. As we’ll see later, the line was calculated using linear regression, which returned, among other things, the line’s slope. Because the vertical axis is in a base-2 log scale and the horizontal axis is in days, the slope has the units “doublings per day.” We therefore want the inverse of the slope, which is “days per doubling.” I rounded that value to one decimal place and added it to the graph.
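The slope-to-doubling-time step can be sketched on its own, using only the standard library. This is not the script behind the graph: it swaps in the textbook least-squares formulas in place of a library regression call, and the variable names are assumptions made for this illustration. The counts are the ones in the CSV above.

```python
from math import log2

# Case counts from the CSV above, one per day starting 2020-03-04
counts = [118, 176, 223, 341, 417, 584, 778,
          1053, 1315, 1922, 2450, 3173, 4019]
days = list(range(len(counts)))
log2_counts = [log2(c) for c in counts]

# Ordinary least-squares fit of log2(count) against day number
n = len(days)
mean_x = sum(days) / n
mean_y = sum(log2_counts) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, log2_counts))
sxx = sum((x - mean_x) ** 2 for x in days)
slope = sxy / sxx                    # doublings per day
intercept = mean_y - slope * mean_x  # log2 of the fitted count on day 0

# Invert the slope to get days per doubling
doubling_time = 1 / slope
print(f'Count doubles every {doubling_time:.1f} days')
```

Because the fit is done on base-2 logs, no extra conversion factor is needed. With a base-10 fit, the slope would have to be divided by the base-10 log of 2 before inverting.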

The code I wrote to produce the graph is messier than it could have been, mainly because I wanted a script that would adjust the limits of the graph according to the input. Here it is:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import pandas as pd
 4:  import numpy as np
 5:  from scipy.stats import linregress
 6:  from math import log2, ceil, sqrt
 7:  import matplotlib.pyplot as plt
 8:  from datetime import timedelta, date
 9:  
10:  # Import the data
11:  df = pd.read_csv('covid.csv', parse_dates=['Date'])
12:  
13:  # Extend the table with base-2 log of the case count and the day number
14:  df['Log2Count'] = np.log2(df.Count)
15:  firstDate = df.Date[0]
16:  df['Days'] = (df.Date - firstDate).dt.days
17:  
18:  # Linear regression of Log2Count on Days
19:  lr = linregress(df.Days, df.Log2Count)
20:  df['Fit'] = 2**(lr.intercept + lr.slope*df.Days)
21:  
22:  # Doubling time
23:  doubling = 1/lr.slope
24:  dText = f'Count doubles\nevery {doubling:.1f} days'
25:  
26:  # Plot the data and the fit
27:  fig, ax = plt.subplots(figsize=(9, 6))
28:  plt.yscale('log', base=2)
29:  ax.plot(df.Days, df.Count, 'o', color='#d95f02', lw=1)
30:  ax.plot(df.Days, df.Fit, '--', color='#7570b3', lw=2)
31:  
32:  # Ticks and grid
33:  yStart = 100
34:  coordMax = log2(df.Count.max()/yStart)
35:  expMax = ceil(coordMax)
36:  yAdd = .4142*yStart*2**expMax if expMax - coordMax < .1 else 0
37:  plt.ylim(bottom=yStart, top=yStart*2**expMax + yAdd)
38:  majors = np.array([ yStart*2**i for i in range(expMax+1) ])
39:  ax.set_yticks(majors)
40:  ax.set_yticklabels(majors)
41:  
42:  dateTickFreq = 2
43:  dMax = df.Days.max()
44:  xAdd = 2 if df.Days.max() % dateTickFreq else 1
45:  plt.xlim(left=-1, right=dMax + 1)
46:  ax.set_xticks([ x for x in range(0, dMax+xAdd, dateTickFreq) ])
47:  dates = [ (firstDate.date() + timedelta(days=x)).strftime('%b %-d') for x in range(0, dMax+xAdd, dateTickFreq) ]
48:  ax.set_xticklabels(dates)
49:  
50:  ax.grid(linewidth=.5, which='major', color='#dddddd', linestyle='-')
51:  
52:  # Title and labels
53:  title = 'US COVID-19 growth'
54:  plt.title(title)
55:  plt.ylabel('Confirmed case count')
56:  
57:  # Add doubling text
58:  plt.text(1, yStart*2**(expMax-1.65), dText, fontsize=11)
59:  
60:  # Use the last date in the file name
61:  prefix = df.Date.max().date().strftime('%Y%m%d')
62:  plt.savefig(f'{prefix}-{title}.svg', format='svg')

I’m using Pandas to collect and manipulate the data. It’s not necessary to use such a powerful tool for simple work like this, but I’m used to Pandas and its power makes the manipulations easier.

After importing the CSV file into a data frame (table) in Line 11, Lines 14–16 add two new columns to the data frame, one that’s the base-2 log of the count values and the other that’s the time, in days, since the beginning of the data set. Linear regression is then performed on those columns in Line 19 and the fitted data (after converting back from $\log_2$) is stored in another new column of the data frame.

When Line 20 is complete, the data frame looks like this:

         Date  Count  Log2Count  Days          Fit                                                                                                                        
0  2020-03-04    118   6.882643     0   129.909060                                                                                                                        
1  2020-03-05    176   7.459432     1   174.100532                                                                                                                        
2  2020-03-06    223   7.800900     2   233.324722                                                                                                                        
3  2020-03-07    341   8.413628     3   312.695345                                                                                                                        
4  2020-03-08    417   8.703904     4   419.065660                                                                                                                        
5  2020-03-09    584   9.189825     5   561.620217                                                                                                                        
6  2020-03-10    778   9.603626     6   752.667894                                                                                                                        
7  2020-03-11   1053  10.040290     7  1008.704712                                                                                                                        
8  2020-03-12   1315  10.360847     8  1351.838180                                                                                                                        
9  2020-03-13   1922  10.908393     9  1811.696171                                                                                                                        
10 2020-03-14   2450  11.258566    10  2427.985143                                                                                                                        
11 2020-03-15   3173  11.631632    11  3253.918592                                                                                                                        
12 2020-03-16   4019  11.972621    12  4360.811776

I know SciPy has other ways to fit curves to data and extract the parameters, but I’m comfortable with linregress and decided it would be easier to do a little extra work with the back and forth $\log_2$ conversions than to start fresh with another fitting function.

Lines 23–24 take the results of the linear regression and put the inverse of the slope into a short text description that we’ll use later.

Now comes the plotting. Lines 27–30 plot the raw data and the fit after setting the y-axis to a base-2 logarithmic scale. Then Lines 33–40 adjust the presentation of the y-axis and Lines 42–48 adjust the presentation of the x-axis. These are the sections of the code that are longer than they might otherwise be because I’m trying to create a system that will work for future data.

The lower bound of the y-axis is always going to be 100 because our first count is 118. Typically, the upper bound will be some whole number of doublings of 100 (200, 400, 800, etc.), enough to encompass the maximum count. But if the maximum count is just slightly below a whole number of doublings, we’ll want to extend the upper end of the y-axis range so the data point’s circle doesn’t get cut off by the top of the plot area. That’s the purpose of the expMax and yAdd variables, which are defined in Lines 34–36 and put to use in Lines 37–40 to define the upper bound of the y-axis and the locations of the tick marks.
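Here is the upper-bound rule pulled out into a standalone function, as a sketch. The function name `y_upper` and its `headroom` parameter are hypothetical names introduced for this illustration; the arithmetic follows Lines 34–37 of the script. Note that .4142 is roughly the square root of 2 minus 1, so the extension amounts to about half a doubling of extra room.

```python
from math import ceil, log2

def y_upper(max_count, y_start=100, headroom=0.1):
    """Upper y-axis limit for a base-2 scale that starts at y_start.

    Hypothetical helper mirroring the expMax/yAdd logic in the script.
    """
    coord_max = log2(max_count / y_start)  # doublings above y_start
    exp_max = ceil(coord_max)              # next whole doubling
    top = y_start * 2 ** exp_max
    # If the largest count sits within `headroom` doublings of the top
    # gridline, extend the axis by about half a doubling
    if exp_max - coord_max < headroom:
        top += 0.4142 * top
    return top

print(y_upper(4019))  # data through March 16: no extension needed
print(y_upper(3173))  # data through March 15: just under 3200, so extended
```

With 4019 as the maximum, the limit stays at 6400, the sixth doubling of 100. With 3173, which is just under the fifth doubling at 3200, the limit is pushed up to about 4525.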

There’s a similar xAdd variable defined and used in Lines 42–48. Basically, I want the upper end of the x-axis to be one day after the last day of data. If that upper bound happens to be at a whole number of date ticks, I want the end of the x-axis to have a tick and a label. Otherwise, it should remain unticked and unlabeled.
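The x-axis rule can be sketched the same way. The function `x_ticks` is a hypothetical helper written for this illustration, following the logic of Lines 42–48; it builds the day labels from `d.day` so it doesn't rely on a platform-specific strftime code.

```python
from datetime import date, timedelta

def x_ticks(first_date, d_max, tick_freq=2):
    """Tick positions, tick labels, and upper x-limit (hypothetical helper).

    d_max is the day number of the last data point, counting from 0.
    """
    # Reach one day further when the last data day falls between ticks,
    # so a tick landing on the end of the axis is included
    x_add = 2 if d_max % tick_freq else 1
    ticks = list(range(0, d_max + x_add, tick_freq))
    labels = []
    for t in ticks:
        d = first_date + timedelta(days=t)
        labels.append(f'{d:%b} {d.day}')
    upper = d_max + 1  # axis ends one day after the last data point
    return ticks, labels, upper

start = date(2020, 3, 4)
print(x_ticks(start, 12))  # data through March 16
print(x_ticks(start, 11))  # data through March 15
```

With data through day 12, the axis ends at day 13 and the last tick is at day 12, so the end of the axis is unticked. With data through day 11, the axis ends at day 12, which is a tick position, so it gets a tick and a label one day past the data.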

An example is probably better than words. You’ve seen in the figure above how the axes are set when the maximum count is comfortably under the next doubling above it. But if we plot the data through yesterday, we see how nice it is to have a little extra space above the last doubling.

[Figure: US COVID-19 growth, plotted with data through the previous day]

This graph also shows the value of having the x-axis labels go beyond the data when the end of the axis coincides with what would be the next tick mark.

The rest of the code is just adding labels and writing out the file. Note that I don’t bother to label the x-axis, as it’s obvious what it is. The one interesting thing is in Line 58, where the y-coordinate of the explanatory text is calculated to put it between horizontal grid lines.

If I continue to update and present this graph, the dates might start crowding each other. I can change the dateTickFreq variable in Line 42 to keep the labels from crashing into each other and the rest of the graph will adjust accordingly.

Writing a script like this certainly isn’t the fastest way to produce a graph, but it is the best way to produce a series of graphs that are consistent. I’m hoping the next few weeks will slow down the upward march of the orange dots and make it a short series.