As easy as falling off…

The other day I was thinking about how to present a set of data, and I started fiddling around with histograms. Histograms are easy to make, but if they’re going to be representative of the data, the size of the bins has to be reasonable. Bins that are too narrow make the data set look less regular than it is; bins that are too wide throw away information and are biased away from the underlying density function at the edges of the bins.

There’s a simple rule, called Sturges’s rule, for choosing a reasonable bin width. Despite its simplicity, I can never remember it. Fortunately, I can remember which of my books has it in an easily found location. It’s Elmer Lewis’s Introduction to Reliability Engineering, which gives the width as

\[w = \frac{x_{max} - x_{min}}{1 + 3.3\; \log_{10} n}\]

where \(n\) is the number of samples, \(x_{min}\) is the smallest sample, and \(x_{max}\) is the largest sample.
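In code, Elmer’s version of the rule is a one-liner. Here’s a minimal sketch (the function name is mine, not anything from the book):

```python
import math

def sturges_width(samples):
    """Bin width from Sturges's rule, as given in Lewis:
    w = (x_max - x_min) / (1 + 3.3 * log10(n))."""
    n = len(samples)
    return (max(samples) - min(samples)) / (1 + 3.3 * math.log10(n))

# 101 samples spanning a range of 100 gives a width of about 13.1,
# i.e., a bit under 8 bins
print(sturges_width(list(range(101))))
```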

I’ve used this formula unquestioningly dozens of times, but for some reason this time I wanted to see how it was derived. OK, not “some reason”; I was looking for an opportunity to procrastinate. So I Googled “Sturges histogram” and started clicking likely links.

The first thing I noticed was that none of the sources gave the formula the same way Elmer did. They gave it as

\[w = \frac{x_{max} - x_{min}}{1 + \log_2 n}\]

which looks a lot more like something derived from first principles. I’d always looked at the \(\log_{10}\) and the 3.3 in Elmer’s version of the formula and assumed that it was some empirically derived best fit to a series of numerical experiments. But a nice, simple \(\log_2\) could only come from a theoretical derivation. Elmer had obviously converted the \(\log_2 n\) to \(3.3\;\log_{10} n\) to make the formula easier to use for engineers, who would feel more comfortable with base-10 logs and probably wouldn’t have a base-2 log function on their calculators.
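A quick numerical check shows how close the engineer-friendly version is to the first-principles one; the 3.3 is just \(1/\log_{10} 2 \approx 3.32\) rounded to two figures:

```python
import math

n = 1000
# The exact conversion factor from log10 to log2
print(1 / math.log10(2))       # about 3.3219

# The two denominators agree to within about 0.07 at n = 1000
print(math.log2(n))            # about 9.966
print(3.3 * math.log10(n))     # exactly 9.9
```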

Here’s the embarrassing part: as I saw this, I realized that I’d forgotten how to convert logarithms from one base to another. Oh, I knew that the conversion factor to go from base-\(a\) to base-\(b\) was either \(\log_a b\) or \(\log_b a\) (or maybe the reciprocal of one of those), but I didn’t feel confident I knew which. It drove me crazy.

So I sat down with a pencil and paper and figured it out, like I had to do in junior high. I started with my best guess

\[\log_a n \stackrel{?}{=} \log_a b \; \log_b n\]

and raised \(a\) to both sides

\[a^{\log_a n} \stackrel{?}{=} a^{\log_a b \; \log_b n}\]

The left side was easy to simplify with the definition of a logarithm. The right needed to be rearranged using the power-of-a-power rule, \(a^{xy} = \left(a^x\right)^y\).

\[n \stackrel{?}{=} \left( a^{\log_a b} \right)^{\log_b n}\]

Now apply the logarithm definition to the right side once,

\[n \stackrel{?}{=} b^{\log_b n}\]

and then again to give the equality I was hoping for,

\[n = n\]

which meant that my initial guess was correct.
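The identity is also easy to spot-check numerically. A quick sketch, with the bases and argument chosen arbitrarily:

```python
import math

# Check that log_a(n) = log_a(b) * log_b(n) for some arbitrary values
a, b, n = 7.0, 2.0, 1234.0

lhs = math.log(n, a)                  # log base a of n
rhs = math.log(b, a) * math.log(n, b) # conversion factor times log base b of n

print(lhs, rhs)
```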

Even though this was a proof I first learned about 40 years ago, I was happy to see that I could still do it without cheating, without looking it up somewhere. The Van Camp’s people were right: simple pleasures are the best.

I need to add two notes before closing this post:

  1. This is my first post since switching to MathJax’s new CDN. The changeover seems to have gone smoothly.
  2. I learned how to make the \(\stackrel{?}{=}\) symbol from this Stack Overflow page. The LaTeX code is `\stackrel{?}{=}`.