In defense of floating point

I’ve noticed that many programmers have a phobia about floating point numbers. They see something like this (a Python interactive session, but a similar thing could be done in many languages),

python:
>>> sum = 0.0
>>> for i in range(10):
...     sum += 0.1
...
>>> sum
0.9999999999999999

and decide never to trust floating point numbers again. Web pages with titles like “Why Are Floating Point Numbers Inaccurate?” and “What is a floating point number, and why do they suck” help promote the mistrust.1 I fear this post published yesterday by John D. Cook will do the same.

The gist of Cook’s article, which is perfectly correct, is that the overwhelming majority of 32-bit integers cannot be represented exactly by a 32-bit float. And an even greater majority of 64-bit integers cannot be represented exactly by a 64-bit float.

If your response to the previous paragraph is “Well, duh!” you’re my kind of people. The mantissa of a 32-bit float is only 24 bits wide (one of the bits is implicit), so of course you can only represent a small percentage of the 32-bit integers. After accounting for the sign bit, you have a 7-bit deficit.

But here’s the thing: a 32-bit float can represent exactly every integer from -16,777,216 to 16,777,216 (224 to 224). Here’s a quick demonstration in an interactive Python session:

python:
>>> import numpy as np
>>> n = 2**24
>>> ai = np.linspace(-n, n, 2*n+1, dtype=np.int32)
>>> af = np.linspace(-n, n, 2*n+1, dtype=np.float32)
>>> np.array_equal(af.astype(np.int32), ai)
True

As Cook explains, there are actually many more integers that can be represented exactly by a float32, but there are gaps between them. The run from -16,777,216 to 16,777,216 has no gaps.

That’s a big range, possibly bigger than you need. And you’re more likely to be using double precision floats than single precision. For float64s, the mantissa is 53 bits (again, one bit is implicit), so they can exactly represent every integer from -9,007,199,254,740,992 to 9,007,199,254,740,992. Yes, as Cook says, that’s a very small percentage of 64-bit integers, but it’s still damned big.

JavaScript programmers understand the practical implications of this. By default, JavaScript stores numbers internally as 64-bit floats, so you’ll run into problems if you need an integer greater than 9 quadrillion. That’s why JavaScript has the isSafeInteger function and the BigInt type.

I guess the main point is understand the data types you’re using. You wouldn’t use an 8-bit integer to handle values in the thousands, but it’s fine if the values stay under one hundred. The same rules apply to floating point. You just have to know how they work.


  1. The author of the second piece apparently doesn’t trust question marks, either. 


Library search via web page

A few months ago, I wrote about three short Python scripts I could run from the command line to search for books in my technical library. The information on the books and authors was kept in an SQLite3 database, and the SQL queries built into the scripts were originally written with the help of ChatGPT and then improved with comments from readers. I said at the time that my goal was to move the database to a server and access it through a web page, the idea being that I could check on what books I have when I’m away from my computer. (In a used book store, for example.) I made the move last weekend and figured I’d write up a quick overview.

There’s just a single web page. Here’s what it looks like on my iPhone (a 16 Pro running Safari) before and after a search:

Library search on iPhone

It’s set up to search by title, author, or both, depending on which fields I fill in.

Because I’m not a web programmer and I wanted to reuse as much of the previously written Python code as possible, the page is generated by an old-fashioned CGI script. This one:

python:
  1:  #!/path/to/python3
  2:  
  3:  import os
  4:  from urllib.parse import parse_qs
  5:  import sqlite3
  6:  
  7:  # Get form data from QUERY_STRING
  8:  def parse_query_string():
  9:      query = os.environ.get("QUERY_STRING", "")
 10:      params = parse_qs(query)
 11:      return {k: v[0] for k, v in params.items() if v}
 12:  
 13:  tString = ''
 14:  aString = ''
 15:  params = parse_query_string()
 16:  tString = params.get('title', '').strip()
 17:  aString = params.get('author', '').strip()
 18:  
 19:  # Set up query strings
 20:  qTitle = '''SELECT b.loc, b.title, GROUP_CONCAT(a.name, '; ')
 21:  FROM book_author ba
 22:  JOIN author a ON a.id = ba.author_id
 23:  JOIN book b ON b.id = ba.book_id
 24:  WHERE b.title LIKE ? GROUP BY b.id ORDER BY b.loc;'''
 25:  
 26:  qAuthor = '''SELECT b.loc, b.title, GROUP_CONCAT(a.name, '; ')
 27:  FROM book_author ba
 28:  JOIN author a ON a.id = ba.author_id
 29:  JOIN book b ON b.id = ba.book_id
 30:  WHERE b.id IN (
 31:    SELECT ba.book_id
 32:    FROM book_author ba
 33:    JOIN author a ON a.id = ba.author_id
 34:    WHERE a.name LIKE ?
 35:  )
 36:  GROUP BY b.id ORDER BY b.loc;'''
 37:  
 38:  qTitleAuthor = '''SELECT b.loc, b.title, GROUP_CONCAT(a.name, '; ')
 39:  FROM book_author ba
 40:  JOIN author a ON a.id = ba.author_id
 41:  JOIN book b ON b.id = ba.book_id
 42:  WHERE b.title LIKE ?
 43:  AND b.id IN (
 44:    SELECT ba.book_id
 45:    FROM book_author ba
 46:    JOIN author a ON a.id = ba.author_id
 47:    WHERE a.name LIKE ?
 48:  )
 49:  GROUP BY b.id ORDER BY b.loc;'''
 50:  
 51:  # Query the database
 52:  def search(t, a):
 53:      con = sqlite3.connect('library.db')
 54:      cur = con.cursor()
 55:      if (t != '') and (a == ''):
 56:        sql = qTitle
 57:        vals = [f'%{t}%']
 58:      elif (t == '') and (a != ''):
 59:        sql = qAuthor
 60:        vals = [f'%{a}%']
 61:      elif (t != '') and (a != ''):
 62:        sql = qTitleAuthor
 63:        vals = [f'%{t}%', f'%{a}%']
 64:      else:
 65:        results = []
 66:        con.close()
 67:        return results
 68:      cur.execute(sql, vals)
 69:      results = cur.fetchall()
 70:      con.close()
 71:      return results
 72:  
 73:  results = search(tString, aString)
 74:  if results:
 75:    rcount = len(results)
 76:  else:
 77:    rcount = 0
 78:  
 79:  if rcount == 1:
 80:    rheader = '1 Result'
 81:  else:
 82:    rheader = f'{rcount} Results'
 83:  
 84:  # Output HTML
 85:  print(f'''Content-Type: text/html
 86:  
 87:  <html>
 88:    <head>
 89:      <title>Library Search</title>
 90:      <link rel="stylesheet" href="search.css">
 91:    </head>
 92:    <body>
 93:      <h1>Search Library</h1>
 94:      <form method="get" action="search.py">
 95:        <p class="desc">Enter strings for title, author, or both</p>
 96:        <label for="title">Title:</label>
 97:        <input type="text" name="title" value="{tString}">
 98:        <label for="author">Author:</label>
 99:        <input type="text" name="author" value="{aString}">
100:        <input type="submit" value="Submit">
101:      </form>
102:      <div class="results">
103:      <h2>{rheader}</h2>''')
104:  
105:  if results:
106:    print('<ul>')
107:    for row in results:
108:        print(f'<li>{"<br/>".join(row)}</li>')
109:    print('</ul>')
110:  
111:  print('''    </div>
112:    </body>
113:  </html>''')

Once upon a time, a script like this would’ve imported the cgi module to handle the form data. But that module was deprecated in Python 3.11 and removed in 3.13, so the parse_query_string function in Lines 8–11 was built “by hand” by pulling in data from the $QUERY_STRING environment variable.

(Unfortunately, the word query will be doing double-duty in this post. In the parse_query_string function, it means the query portion of a URL—the part after the question mark. In the rest of the script, it’ll mean an SQL query. The potential confusion is unavoidable; query is the standard term for both situations.)

The title and author strings, tString and aString, are set in Lines 13–17. The default values are empty strings; they get set to the corresponding field values when the script is run via the Submit button.

The next portion of the script, Lines 20–49, define the three SQL queries we’ll need: one for searching on a title only, one for searching on an author only, and one for searching on both. These strings were taken from the earlier Python scripts.

Lines 52–71 define the search function, which runs the appropriate SQL query according to which strings were provided in the input fields. Lines 73–82 perform the search and set up the header for the Results section of the page.

The rest of the script outputs the page’s HTML. The part that’s most variable is given in Lines 105–109, which spit out the search results in a unordered list.

The CSS file that styles the page is this:

css:
  1:  /* Base styles */
  2:  body {
  3:    font-family: Helvetica, sans-serif;
  4:    margin: 1em;
  5:    padding: 0;
  6:    line-height: 1.6;
  7:    background-color: #f9f9f9;
  8:    color: #333;
  9:  }
 10:  
 11:  /* Form container */
 12:  form {
 13:    max-width: 40em;
 14:    margin: 0 auto 2em;
 15:    background-color: #fff;
 16:    padding: 1em;
 17:    border-radius: 8px;
 18:    box-shadow: 0 0 8px rgba(0, 0, 0, 0.1);
 19:  }
 20:  
 21:  /* Form elements */
 22:  form p {
 23:    padding-top: 0em;
 24:    margin-top: 0em;
 25:    font-size: 1.25em;
 26:  }
 27:  
 28:  form label {
 29:    display: block;
 30:    margin-bottom: 0.5em;
 31:    font-weight: bold;
 32:  }
 33:  
 34:  form input[type="text"] {
 35:    width: 100%;
 36:    max-width: 100%;
 37:    padding: 0.5em;
 38:    font-size: 1em;
 39:    margin-bottom: 1em;
 40:    box-sizing: border-box;
 41:  }
 42:  
 43:  form input[type="submit"], form button {
 44:    padding: 0.6em 1.2em;
 45:    font-size: 1em;
 46:    cursor: pointer;
 47:    background-color: #007bff;
 48:    border: none;
 49:    border-radius: 4px;
 50:    color: #fff;
 51:  }
 52:  
 53:  form input[type="submit"]:hover, form button:hover {
 54:    background-color: #0056b3;
 55:  }
 56:  
 57:  /* Results list */
 58:  .results {
 59:    max-width: 40em;
 60:    margin: 0 auto;
 61:    padding: 1em;
 62:    background-color: #fff;
 63:    border-radius: 8px;
 64:    box-shadow: 0 0 8px rgba(0, 0, 0, 0.1);
 65:  }
 66:  
 67:  .results h2 {
 68:    margin-top: 0em;
 69:    padding-top: 0em;
 70:  }
 71:  
 72:  .results ul {
 73:    font-size: 1.1em;
 74:    line-height: 1.25em;
 75:    padding-left: 1.2em;
 76:  }
 77:  
 78:  .results li {
 79:    margin-bottom: 0.75em;
 80:  }
 81:  
 82:  /* Responsive tweaks */
 83:  @media (pointer: coarse) {
 84:    html {
 85:      font-size: 30px;
 86:    }
 87:    body {
 88:      margin: 0.5em; 
 89:    }
 90:  
 91:    form, .results {
 92:      padding: 0.8em;
 93:    }
 94:    
 95:    form input[type="text"], form button {
 96:      width: 100%;
 97:      font-size: 1.25em;
 98:      margin-top: 0.5em;
 99:    }
100:    form input[type="submit"], form button {
101:      width: 100%;
102:      margin-top: 0.5em;
103:    }
104:  }

About the only clever thing in this is the use of pointer: coarse to make the page easier to read on an iPhone. I started out trying a responsive design with different width values in the @media query (oops! there’s a third use of that word), but I wasn’t happy with the results. I didn’t want the text to get bigger just because the width of the browser window on my Mac got narrow. And I didn’t want to rewrite this when I got a new (and different resolution) phone. Then I saw the pointer feature and decided that was the way to go. Yes, it might make the text larger than I like on my iPad, but I don’t see myself using this on my iPad.

The Python script, the CSS file, and the database file are all kept in the same directory, a directory that’s password-protected through few Apache settings. Whenever I get a new book, I’ll add it to the local version of the database file and then upload that to replace the one on the server.


NBA Finals and Pandas

The NBA season ended last night as the Thunder beat the Pacers 103–91 in the seventh game of the Finals. I read this morning that this was the 20th Finals to go seven games in the 79-year history of the NBA, and I wondered what the distribution of game counts was.

I found all the Finals results on this Wikipedia page in a table that starts out this way:

NBA Finals table excerpt from Wikipedia

I figured I’d use what I learned about how Pandas can read HTML tables to extract the information I wanted. So I started an interactive Python session that went like this:

python:
>>> import pandas as pd
>>> dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_NBA_champions')
>>> df = pd.DataFrame({'Result': dfs[2]['Result']})
>>> df[['West', 'East']] = df.Result.str.split('–', expand=True).astype(int)
>>> df
     Result  West  East
0     1–4     1     4
1     4–2     4     2
2     4–2     4     2
3     4–2     4     2
4     4–3     4     3
..    ...   ...   ...
74    2–4     2     4
75    4–2     4     2
76    4–1     4     1
77    1–4     1     4
78    4–3     4     3

[79 rows x 3 columns]
>>> df['Games'] = df.West + df.East
>>> for n in range(4, 8):
...     print(f'{n}  {len(df[df.Games == n])}')
...     
4  9
5  20
6  30
7  20

OK, you’re right, this is an edited version of the session. The command that created the West and East fields took a few tries to get right. Also, I’m pretty sure there’s a better way to put the Results column from the original table into its own dataframe, but creating an intermediate dictionary was the first thing that came to mind. Overall, I was pleased that I didn’t need to do much thrashing about; I’ve used Pandas long enough now to get most things right or nearly right on the first try.

The upshot is this:

Games Count
4 9
5 20
6 30
7 20

I expected the 4–0 series to be the least common. As a Bulls fan, I suppose I should have guessed that 4–2 series are the most common—Michael Jordan was responsible for five of them.

If next year’s Finals is a sweep, I will definitely do this again. What a neat and tidy table that will be.


Graphing without empty spaces

I’ve mentioned here before that I was on a dietary program to keep my Type II diabetes under control. The program and its associated app also kept track of my blood pressure with a Bluetooth-connected cuff that I used once a week. I left the program at the beginning of the year (insurance wouldn’t cover it anymore), but I’ve continued the diet and tracking my blood pressure. I don’t know if it’s possible to connect to the cuff’s Bluetooth signal, and even if it is, I don’t have the programming chops to do it. But I’ve kept taking my blood pressure once a week and entering it into the Health app on my phone.

Unfortunately, the Health app’s way of plotting blood pressure is kind of crappy. Here’s what the past six months looks like:

Blood pressure graph from Health

Lots of wasted space in there, and because the range is so broad, I can’t see at a glance where I stand. Since seeing at a glance is sort of the whole point of plotting data, I would say this graph is basically useless.

Plotting the data using a tighter range would be better, as it would give me a better chance to figure out the various values to within a couple of mm Hg. Here’s an example:

Blood pressure single plot

This still seems like it could be improved. We’ve traded a bunch of wasted space above the systolic readings for a bunch of wasted space between the systolic and diastolic values. In some situations, it’s good to see how a gap between two sets of values compares to the variation within a set, but I don’t see much use in it for this type of data.

This data could use a scale break. In effect, this means making two plots but arranging them into a single figure to take advantage of their common parts. Here’s one way to do it:

Blood pressure subplots

The scales of the two parts are the same, so the larger variation in systolic values is properly represented. We’ve just cut out the empty space between the systolic and diastolic and pushed them together.

William Cleveland, author of The Elements of Graphing Data, is not a fan of scale breaking:

Use scale breaking only when necessary. If a break cannot be avoided, use a full scale break. Do not connect numerical values on two sides of a break. Taking logs can cure the need for a break.

In this case, taking logs would make the cure worse than the disease, as it would make reading the values harder. If you’re wondering about the difference between full and partial breaks, a partial break is when an axis is broken by a wavy or zigzag line, like this:

Partial scale break

Cleveland thinks partial breaks are too easy for readers to overlook. A full break is what we’ve done, so he may forgive it.

Despite Cleveland’s admonitions, scale breaks can be effective and attractive. Here’s an example from Modern Timber Engineering by Scofield and O’Brien:

Scofield connector graph

This gives the capacities of a type of wood connector under a variety of conditions. The graph is broken into three groups according to wood species, and what makes scale breaks useful here is that the plots would overlap and be impossible to read without the breaks. The authors could have make this three separate graphs, of course, but putting them together into a single figure emphasizes the interrelatedness of the data. And the curving gaps between the sections look really cool.

I’m not saying Apple should try curvy scale breaks in the Health app, but it wouldn’t take much to make the blood pressure graphs a lot more useful.