In defense of floating point
June 28, 2025 at 3:06 PM by Dr. Drang
I’ve noticed that many programmers have a phobia about floating point numbers. They see something like this (a Python interactive session, but a similar thing could be done in many languages),
python:
>>> sum = 0.0
>>> for i in range(10):
...     sum += 0.1
...
>>> sum
0.9999999999999999
and decide never to trust floating point numbers again. Web pages with titles like “Why Are Floating Point Numbers Inaccurate?” and “What is a floating point number, and why do they suck” help promote the mistrust.1 I fear this post published yesterday by John D. Cook will do the same.
The gist of Cook’s article, which is perfectly correct, is that the overwhelming majority of 32-bit integers cannot be represented exactly by a 32-bit float. And an even greater majority of 64-bit integers cannot be represented exactly by a 64-bit float.
If your response to the previous paragraph is “Well, duh!” you’re my kind of people. The mantissa of a 32-bit float is only 24 bits wide (one of the bits is implicit), so of course you can only represent a small percentage of the 32-bit integers. After accounting for the sign bit, you have a 7-bit deficit.
But here’s the thing: a 32-bit float can represent exactly every integer from -16,777,216 to 16,777,216 (-2²⁴ to 2²⁴). Here’s a quick demonstration in an interactive Python session:
python:
>>> import numpy as np
>>> n = 2**24
>>> ai = np.linspace(-n, n, 2*n+1, dtype=np.int32)
>>> af = np.linspace(-n, n, 2*n+1, dtype=np.float32)
>>> np.array_equal(af.astype(np.int32), ai)
True
As Cook explains, there are actually many more integers that can be represented exactly by a float32, but there are gaps between them. The run from -16,777,216 to 16,777,216 has no gaps.
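You can probe where the gaps begin by looking just past 2²⁴ (a quick check, assuming NumPy is installed):

```python
import numpy as np

n = 2**24
print(np.float32(n) == n)          # True: 16,777,216 is exact
print(np.float32(n + 1) == n)      # True: 16,777,217 rounds down to 16,777,216
print(np.float32(n + 2) == n + 2)  # True: 16,777,218 is exact
```

Odd integers just above 2²⁴ need 25 significant bits, one more than the mantissa holds, so they get rounded to an even neighbor.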
That’s a big range, possibly bigger than you need. And you’re more likely to be using double precision floats than single precision. For float64s, the mantissa is 53 bits (again, one bit is implicit), so they can exactly represent every integer from -9,007,199,254,740,992 to 9,007,199,254,740,992. Yes, as Cook says, that’s a very small percentage of 64-bit integers, but it’s still damned big.
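The same kind of gap appears at 2⁵³ for doubles, and since Python floats are 64-bit, you can check it with no NumPy at all:

```python
n = 2**53
print(float(n) == n)          # True: 9,007,199,254,740,992 is exact
print(float(n + 1) == n)      # True: n + 1 rounds back down to n
print(float(n + 2) == n + 2)  # True: n + 2 is exact
```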
JavaScript programmers understand the practical implications of this. By default, JavaScript stores numbers internally as 64-bit floats, so you’ll run into problems if you need an integer greater than 9 quadrillion. That’s why JavaScript has the Number.isSafeInteger function and the BigInt type.
I guess the main point is to understand the data types you’re using. You wouldn’t use an 8-bit integer to handle values in the thousands, but it’s fine if the values stay under one hundred. The same rules apply to floating point numbers. You just have to know how they work.
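Knowing how they work mostly comes down to not comparing floats for exact equality. Here’s a sketch of the sum from the top of the post, checked with the standard library’s math.isclose and its default relative tolerance:

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1

print(total)                     # 0.9999999999999999
print(total == 1.0)              # False: exact comparison fails
print(math.isclose(total, 1.0))  # True: comparison within tolerance
```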
1. The author of the second piece apparently doesn’t trust question marks, either. ↩