Plots and predictions

Returning to the Minute Physics video that we’ve discussed here and here, I want to discuss my biggest concern: its title. The title is “How To Tell If We’re Beating COVID-19,” and I think that way oversells the predictive ability of plots like these.

I don’t come to this with clean hands. In my first post about graphing the COVID-19 data, I overlaid a linear regression fit on the semilog plot. It was meant to show how exponential growth works and how you can extract certain information from it, but the “Count doubles every 2.4 days” text I stuck on it is an invitation to the reader to extend that line up and off to the right.

US COVID-19 growth
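For the curious, here is a minimal sketch of how a doubling time falls out of a straight-line fit on a semilog plot. It is not the code behind the graph above. The function name `doubling_time` and the sample series are made up for illustration, and the series is synthetic, built to double every 2.4 days rather than taken from real case counts.

```python
import math

def doubling_time(days, counts):
    """Least-squares fit of log2(count) against day; returns days per doubling."""
    logs = [math.log2(c) for c in counts]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx  # doublings per day
    return 1 / slope

# Synthetic data (an assumption for illustration, not real case counts).
days = list(range(10))
counts = [100 * 2 ** (d / 2.4) for d in days]
print(round(doubling_time(days, counts), 2))
```

The slope of the fitted line is in doublings per day, so its reciprocal is the number of days per doubling. Nothing in the fit says the slope will hold tomorrow.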

In some ways, I suppose, it’s impossible to prevent a viewer from extrapolating. Even if I’d added the boilerplate language used by brokerages—past performance is no guarantee of future results—you and I both would have looked at the graph and done some mental math to guess at where we’d be in a week or so. And there’s nothing terrible about doing that as long as we don’t take our predictions as fundamental truths.

But Henry Reich, the Minute Physics guy, goes further than that. In telling us that we can use his graphs to see when we’ve beaten COVID-19, he’s saying that past performance is a guarantee of future results. But even his example of how to see the onset of winning, the data from South Korea, shows that graphs can lead you astray.

Here’s the log-log plot of weekly new cases vs. cumulative cases that Reich favors. It uses data from South Korea as collected by the Johns Hopkins Center for Systems Science and Engineering.

South Korea log-log

We can see where South Korea “beat” COVID-19 in the upper right. When the cumulative cases reach about 5,000, the plot starts to curl over, eventually dropping from four to five thousand new cases per week to its current level of several hundred new cases per week in that splotch of dots at the end of the graph.
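The points on a plot like this can be derived from nothing more than the cumulative series. The following is a rough sketch under my own assumptions: the function name `weekly_new_vs_cumulative` is hypothetical, the count before the first reported day is taken to be zero, and the sample numbers are invented rather than pulled from the Johns Hopkins data.

```python
def weekly_new_vs_cumulative(cumulative, window=7):
    """Return (cumulative, new cases in the trailing `window` days) pairs.

    Assumes the count before the first entry was zero, so the first
    point appears once `window` days of data exist.
    """
    points = []
    for i in range(window - 1, len(cumulative)):
        before = cumulative[i - window] if i >= window else 0
        new = cumulative[i] - before
        # Zero values cannot be shown on log axes, so skip them.
        if new > 0 and cumulative[i] > 0:
            points.append((cumulative[i], new))
    return points

# Invented daily cumulative counts, for illustration only.
sample = [1, 1, 2, 2, 3, 4, 4, 4, 11, 15]
for total, new in weekly_new_vs_cumulative(sample):
    print(total, new)
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis, both on log scales, gives the kind of graph shown above.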

But what if we’d been using this graph to track South Korea back in February? At that point, the only data we’d have is what’s in the lower left corner of the graph. Would we not have used the same logic to say that COVID-19 had been beaten when new cases dropped off the exponential rise line from a peak of about 20 per week to about 3 per week? And yet it was at that point in late February that there was a sudden jump from 30 cumulative cases to 100 and exponential growth began again.

You might argue that this is unfair, that the early data for new cases was probably unreliable and that South Korea’s extensive testing program hadn’t yet been established. And you may very well be right, but that’s my point. The graph itself can lead you astray. You need additional information outside of the graph to be able to use the graph to make predictions. That extra information may be a physical law—as it often is in the physical sciences—or it may be a deep understanding of how the data were collected and their relative reliability. But whatever it is, you need something extra.

Interestingly, you cannot make the South Korea graph shown above by using Aatish Bhatia’s Covid Trends page. Even though the Johns Hopkins CSSE data for South Korea goes back to January 22, when there was only one case,¹ Bhatia’s page won’t plot anything before February 20, which is the day the cumulative cases jumped from 30 to 100. This is Bhatia doing some filtering of the data before plotting. There’s nothing nefarious about it, but he’s obviously using some extra information (perhaps something as simple as a rule not to use data before a certain cumulative threshold has been passed) to reduce the likelihood of the graph being misleading.
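A threshold rule of that kind is easy to sketch. To be clear, this is a guess at the sort of filter that could produce the behavior described, not Bhatia’s actual code. The function name `from_threshold`, the threshold of 100, and the sample rows are all assumptions.

```python
def from_threshold(series, threshold):
    """Drop every (date, cumulative) row before the first one at or above threshold."""
    for i, (_, count) in enumerate(series):
        if count >= threshold:
            return series[i:]
    return []  # the threshold was never reached

# Invented rows for illustration; not the real South Korea data.
rows = [("d1", 1), ("d2", 28), ("d3", 30), ("d4", 100), ("d5", 200)]
print(from_threshold(rows, threshold=100))
```

The choice of threshold is itself a piece of outside information: it encodes a belief about when the counts became trustworthy enough to plot.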

In my daily update of US COVID-19 graphs, I’ve stopped using a linear regression line and switched to locally weighted regression, the kind of smoothing pioneered by William Cleveland. Here’s an example, a plot of daily data:

US COVID daily data
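For readers unfamiliar with locally weighted regression, here is a stripped-down sketch of the idea: at each point, fit a straight line to the nearby data, weighting close neighbors more heavily with a tricube function. It leaves out the robustness iterations of Cleveland’s full method and is not the code that drew the plot above. The function name `lowess`, the `frac` parameter, and the sample data are assumptions.

```python
def lowess(xs, ys, frac=0.5):
    """Simplified locally weighted linear regression (no robustness passes).

    Assumes the x values are distinct. `frac` is the fraction of points
    used to set the bandwidth of each local fit.
    """
    n = len(xs)
    k = max(3, int(round(frac * n)))  # neighbors that set the bandwidth
    k = min(k, n)
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))  # local line evaluated at x0
    return smoothed

# Invented daily values, for illustration only.
xs = list(range(12))
ys = [5, 9, 8, 14, 13, 20, 18, 17, 21, 19, 16, 18]
print([round(v, 1) for v in lowess(xs, ys, frac=0.5)])
```

Because each fitted value depends only on its neighbors, the smoothed curve can bend wherever the data bend. A single regression line cannot do that.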

I hope the wiggles in the blue smoothing lines are a strong enough hint that we are not on some inevitable trajectory (for good or bad) as if we were a satellite governed by Newtonian physics. As the saying goes, predictions are hard, especially of the future.

  1. Which means you could start plotting a week’s worth of new cases on January 28.