Semiautomated LaTeX tables

As a followup to my last post about automating the creation of Markdown tables, here’s a simple pair of functions that’ve been very helpful in making LaTeX tables quickly.

A few years ago, I wrote about how much I hated the syntax of LaTeX tables and how I was shifting to building tables as graphics files that I could insert using the \includegraphics{} command. In the early stages, I was doing this by hand, as outlined in that post, but after I developed a sense of the kind of spacing I liked, I translated that into a set of Python functions that used the ReportLab module to create decent-looking tables as PDF files. A Python solution made the most sense, as the data from which I made the tables typically came out of data analysis done in Python.

The functions I wrote worked well enough, but the overall system was more fussy than it should have been, and I realized I’m better at programming the manipulation of text than the manipulation of graphics. So I started thinking about ways to make LaTeX tables directly from Python data.

Here’s an artificial example that isn’t too far away from the kinds of tables I need to make. Let’s say we want a short table of (base 10) logarithms. We can generate the data—a list of values and a parallel list of their logs—like this:

python:
from math import log10

x = [ (10+i)/10 for i in range(90) ]
lx = [ f'{log10(y):.4f}' for y in x ]

And what I want is a table that looks like this:

Log table

I use the booktabs package for making tables. I like its clean look, and I like its nice \addlinespace command for adding a little extra space between certain rows to make reading long tables easier. The code that produced the table above is

\setlength{\tabcolsep}{.125in}
\begin{table}[htbp]
\begin{center}
\begin{tabular}{
    @{\hspace*{5pt}}
    cc@{\hspace{.75in}}cc@{\hspace{.75in}}cc
    @{\hspace*{5pt}}
}
\toprule
$x$ & $\log x$ & $x$ & $\log x$ & $x$ & $\log x$ \\
\midrule
1.0 & 0.0000 & 4.0 & 0.6021 & 7.0 & 0.8451 \\
1.1 & 0.0414 & 4.1 & 0.6128 & 7.1 & 0.8513 \\
1.2 & 0.0792 & 4.2 & 0.6232 & 7.2 & 0.8573 \\
1.3 & 0.1139 & 4.3 & 0.6335 & 7.3 & 0.8633 \\
1.4 & 0.1461 & 4.4 & 0.6435 & 7.4 & 0.8692 \\
\addlinespace
1.5 & 0.1761 & 4.5 & 0.6532 & 7.5 & 0.8751 \\
    .
    .
    .
3.7 & 0.5682 & 6.7 & 0.8261 & 9.7 & 0.9868 \\
3.8 & 0.5798 & 6.8 & 0.8325 & 9.8 & 0.9912 \\
3.9 & 0.5911 & 6.9 & 0.8388 & 9.9 & 0.9956 \\
\bottomrule
\end{tabular}
\caption{Partial logarithm table}
\label{table}
\end{center}
\end{table}

The boilerplate at the top and bottom is produced by a Keyboard Maestro macro. Except for the column alignment line and caption—which need to be set for each table—it never changes. The troublesome parts are the header and, especially, the body. For those, I use two functions, theader and tbody, to build the LaTeX code without having to touch the ampersand or backslash keys. To make the table in the same script as the data, I adjust the script to this:

python:
from math import log10
from latex_tables import tbody, theader

x = [ (10+i)/10 for i in range(90) ]
lx = [ f'{log10(y):.4f}' for y in x ]
headers = ['$x$', '$\log x$']*3
print(theader(headers))
print(tbody(x[:30], lx[:30], x[30:60], lx[30:60], x[60:], lx[60:], group=5))

The output is

$x$ & $\log x$ & $x$ & $\log x$ & $x$ & $\log x$ \\
\midrule
1.0 & 0.0000 & 4.0 & 0.6021 & 7.0 & 0.8451 \\
1.1 & 0.0414 & 4.1 & 0.6128 & 7.1 & 0.8513 \\
1.2 & 0.0792 & 4.2 & 0.6232 & 7.2 & 0.8573 \\
1.3 & 0.1139 & 4.3 & 0.6335 & 7.3 & 0.8633 \\
1.4 & 0.1461 & 4.4 & 0.6435 & 7.4 & 0.8692 \\
\addlinespace
1.5 & 0.1761 & 4.5 & 0.6532 & 7.5 & 0.8751 \\
    .
    .
    .
3.7 & 0.5682 & 6.7 & 0.8261 & 9.7 & 0.9868 \\
3.8 & 0.5798 & 6.8 & 0.8325 & 9.8 & 0.9912 \\
3.9 & 0.5911 & 6.9 & 0.8388 & 9.9 & 0.9956 \\

which, as you can see, is exactly what goes between the top and bottom boilerplate.

The functions are defined in the file latex_tables.py, which is saved in my site-packages directory. Here’s the code:

python:
 1:  def tbody(*cols, group=0):
 2:    "Given the columns, return the body of a LaTeX table."
 3:  
 4:    # Add blanks at the ends of short columns
 5:    lens = [ len(c) for c in cols ]
 6:    nrows = max(lens)
 7:    cols = [ c + [' ']*(nrows - len(c)) for c in cols ]
 8:  
 9:    # Assemble the rows of the table
10:    rows = []
11:    for i in range(nrows):
12:      if (group > 0) and (i > 0) and (i % group == 0):
13:        rows.append(r'\addlinespace')
14:      row = [ str(c[i]) for c in cols ]
15:      rows.append(' & '.join(row) + r' \\')
16:  
17:    return '\n'.join(rows)
18:  
19:  def theader(hcols):
20:    "Given a list of column headers, return the header lines of a LaTeX table."
21:  
22:    # Figure out the maximum number of lines in the header
23:    hcols = [ x.splitlines() for x in hcols ]
24:    maxlines = max(len(x) for x in hcols)
25:    hcols = [ [' ']*(maxlines - len(x)) + x for x in hcols ]
26:  
27:    # Assemble the rows of the header
28:    rows = []
29:    for i in range(maxlines):
30:      row = [ str(c[i]) for c in hcols ]
31:      rows.append(' & '.join(row) + r' \\')
32:  
33:    return '\n'.join(rows) + '\n\\midrule'

Like the example log table, most of the tables I make come from lists of data that are meant to go into the columns of the table. So the main feature of tbody is assembling the rows from those lists. The other significant feature is placing the \addlinespace command according to the group parameter.

The theader function is even simpler. The only interesting thing about it is how it handles multiline headers. If it’s called like this,

print(theader(['One line', 'Two\nlines', 'And\nthree\nlines']))

the output will be

    &   & And \\
    & Two & three \\
One line & lines & lines \\
\midrule

which will produce a table that has a header that looks like this:

Multiline header

The header lines are bottom-aligned instead of top-aligned. The main purpose of theader is to automate that alignment.

Trickier table arrangements still have me hauling out my copy of Kopka & Daly, but theader and tbody do the bulk of the work even in those cases.


Under the table

In my last post, I mentioned an AppleScript I wrote some time ago that turns a table in Numbers into a MultiMarkdown1 table. A couple of days later, jacobio asked a question in the Mac Power Users forum, that led me to posting the script there. After looking it over—and seeing jacobio’s independently developed script—I made a few small changes to the script and decided to talk about it here.

The premise for the script is that I’m writing a blog post in BBEdit, and I want to include a table. Writing a Markdown table isn’t especially hard, but doing so involves lots of typing of formatting characters, the pipes and colons that tell the Markdown processor how to make the table but aren’t part of the table’s content. It’s much easier to write a table in a spreadsheet or some other dedicated table-writing tool and then invoke a script that converts it to Markdown. And if the table I want to insert is already in a spreadsheet, using a conversion script is even faster.

(Let me pause here and mention TableFlip, which Rosemary Orchard talked about in that same thread. I haven’t used it, but it sounds great, especially if most of the tables you deal with have to be written from scratch. It’s killer feature is how it ties the Markdown table in your document to the spreadsheet-like form in TableFlip itself. Changes you make to the table in TableFlip are conveyed automatically to the Markdown document. The downside to using TableFlip is that you don’t have a version of the table in a spreadsheet, which is often very useful.)

I’ve created a BBEdit package of scripts and text filters that help me write blog posts. The Table from Numbers script is in that package and has a few lines that are package-specific. But most of it could be used as a standalone conversion script. Here’s the code:

applescript:
 1:  -- Get the path to the Normalize Table text filter in the package.
 2:  tell application "Finder" to set cPath to container of container of container of (path to me) as text
 3:  set normalize to quoted form of (POSIX path of cPath & "Text Filters/06)Blogging/02)Normalize Table.py")
 4:  
 5:  -- Construct a MultiMarkdown table from the top Numbers table
 6:  tell application "Numbers"
 7:    tell table 1 of sheet 1 of document 1
 8:      set tbl to ""
 9:      set h to header row count
10:      set c to column count
11:      set r to row count
12:      
13:      -- header lines
14:      repeat with i from 1 to h
15:        set tbl to tbl & "|"
16:        repeat with j from 1 to c
17:          set val to formatted value of cell j of row i
18:          if val = missing value then
19:            set val to ""
20:          end if
21:          set tbl to tbl & " " & val & " |"
22:        end repeat
23:        set tbl to tbl & linefeed
24:      end repeat
25:      
26:      -- formatting line
27:      set tbl to tbl & "|"
28:      repeat with j from 1 to c
29:        set a to alignment of cell j of row h
30:        if a = right then
31:          set tbl to tbl & "---:|"
32:        else if a = center then
33:          set tbl to tbl & ":--:|"
34:        else
35:          set tbl to tbl & ":---|"
36:        end if
37:      end repeat
38:      set tbl to tbl & linefeed
39:      
40:      -- body lines
41:      repeat with i from h + 1 to r
42:        set tbl to tbl & "|"
43:        repeat with j from 1 to c
44:          set val to formatted value of cell j of row i
45:          if val = missing value then
46:            set val to ""
47:          end if
48:          set tbl to tbl & " " & val & " |"
49:        end repeat
50:        set tbl to tbl & linefeed
51:      end repeat
52:      
53:    end tell -- table
54:  end tell -- Numbers
55:  
56:  -- Normalize the table
57:  set the clipboard to tbl
58:  set tbl to do shell script ("pbpaste | " & normalize)
59:  
60:  tell application "BBEdit" to set selection to tbl

Lines 1–3 are some of the package-specific stuff I talked about before. Let’s skip over that for now and get to the meat of the script, which starts on Line 6 with a long tell block that controls Numbers.

The script assumes that the table of interest is the first table of the first sheet of the frontmost Numbers document. Line 7 starts a tell block that speaks to that table. Line 8 initializes the tbl variable, which is where we’re going to put all the text of the Markdown table. Lines 9–11 get the sizes of the header and the table as a whole.

A word about headers is in order. Some Markdown implementations allow for only one header line, but others—including MultiMarkdown—allow for any number of header lines. This script accommodates the more general case. In Numbers, the header lines (rows) are set in the Table section of the Format side panel. There’s a pop-up button that lets you set the number of header rows, and this is typically reflected in the number of shaded rows at the top of the table.

Header rows

Lines 14–24 use nested loops to create the header lines of the Markdown table. The outer loop, which starts on Line 14, goes through the header rows of the table. The inner loop, which starts on Line 16, goes through the columns of each row. As is often the case, AppleScript’s idiosyncrasies are more difficult to deal with than the logic of the process. To make our Markdown table show what the Numbers table shows, we need to ask for the formatted value of each cell. Also, instead of asking for column j of row i in Line 17, we have to ask for cell j of row i. And finally, if a cell is empty, AppleScript will set val to the literal text “missing value” in Line 17, so we need Lines 18–20 to correct that to an empty string before appending val and the appropriate space and pipe (|) characters to tbl in Line 21.

Lines 27–38 use the alignment of the cells in the last header row (row h) to create the formatting line in the Markdown table. I think the logic here is pretty easy to understand. If the alignment of the cell is right, we add a short string of hyphens with a colon at the right end; if the alignment is left, we add a short string of hyphens with a colon at each end; for any other alignment, we treat the column as left-aligned and add a short string of hyphens with a colon at the left end. “Any other alignment” includes auto align; this is the default and is presented as left-aligned for text, which is what header cells usually contain.

Lines 41–51 loop through the body of the table. Apart from the limits on the repeat command in Line 41, this is the same code as in Lines 14–24. I probably should factor this out into a function.

When the code exits the tell block on Line 54, tbl contains a valid but ugly Markdown table. Something like this:

|  | County | Yes | No | Margin | CMargin |
|:---|:---|---:|---:|---:|---:|
| 1 | Kern | 126,999 | 78,477 | 48,522 | 48,522 |
| 2 | Placer | 114,643 | 85,302 | 29,341 | 77,863 |
| 3 | Shasta | 49,141 | 21,655 | 27,486 | 105,349 |
| 4 | Tulare | 64,372 | 41,009 | 23,363 | 128,712 |
| 5 | El Dorado | 58,393 | 39,907 | 18,486 | 147,198 |
| 6 | Stanislaus | 82,911 | 69,247 | 13,664 | 160,862 |
| 7 | Tehama | 15,958 | 6,186 | 9,772 | 170,634 |
| 8 | Madera | 25,638 | 16,233 | 9,405 | 180,039 |
| 9 | Sutter | 20,458 | 11,593 | 8,865 | 188,904 |
| 10 | Kings | 19,710 | 11,242 | 8,468 | 197,372 |

This can be made nicer looking by running it through my “Normalize Table” text filter, which is also part of my Blogging package and which I’ve written about before. Lines 2–3 figure out the path to the text filter, and Lines 57–58 apply it to the contents of tbl, putting the nicely-formatted result, e.g.,

|    | County     |     Yes |     No | Margin | CMargin |
|---:|:-----------|--------:|-------:|-------:|--------:|
|  1 | Kern       | 126,999 | 78,477 | 48,522 |  48,522 |
|  2 | Placer     | 114,643 | 85,302 | 29,341 |  77,863 |
|  3 | Shasta     |  49,141 | 21,655 | 27,486 | 105,349 |
|  4 | Tulare     |  64,372 | 41,009 | 23,363 | 128,712 |
|  5 | El Dorado  |  58,393 | 39,907 | 18,486 | 147,198 |
|  6 | Stanislaus |  82,911 | 69,247 | 13,664 | 160,862 |
|  7 | Tehama     |  15,958 |  6,186 |  9,772 | 170,634 |
|  8 | Madera     |  25,638 | 16,233 |  9,405 | 180,039 |
|  9 | Sutter     |  20,458 | 11,593 |  8,865 | 188,904 |
| 10 | Kings      |  19,710 | 11,242 |  8,468 | 197,372 |

back into tbl. Note that the code uses the clipboard and pbpaste to send the text through the “Normalize Table” filter. I could avoid the clipboard by setting up a shell command that uses a here-document, but that code is messy and easy to screw up (I speak from experience). Because I use Keyboard Maestro’s clipboard history manager, I can always get back to what was on the clipboard before I ran this script.

Finally, Line 60 puts the text of tbl into my current BBEdit document. If there’s a selection, it replaces the selection; if not, it inserts the text at the cursor.

If you want to adapt this script for use with any text editor, delete Lines 60 and 1–3 and change Line 58 to

set tbl to do shell script "pbpaste | /path/to/normalize | pbcopy"

For this last part to work, you’ll need to have a normalization script and replace /path/to/normalize with the path to it. When you get it working, you’ll have a script that will put the formatted Markdown table on your clipboard, ready to be pasted anywhere.

This script is no speed demon. AppleScript is very deliberate as it walks through the table, so there’s always a pause between choosing the Table from Numbers command and seeing the table appear in BBEdit. If you adapt it to put the table on your clipboard, you might want to add a command at the end to play a sound when the clipboard is ready for pasting.


  1. From now on, I’m just going to call these Markdown tables. Gruber’s Markdown doesn’t include tables, but pretty much every Markdown implementation has them—either built in or as an option. I bet most Markdown users don’t even know that tables are an extension. 


Data cleaning without boredom

The worst part of data analysis is organizing the data into an analyzable form. You want to get on with the real work of analysis, but first you have to collect and massage the data into a shape that your analysis tools can deal with. Often, this takes more time than the analysis itself. Even when it doesn’t take more time, it feels like it does because you have the sense that nothing important is getting done. I’m an analyst, dammit, I shouldn’t be stuck making sure that tabs and commas are in the right place.

But an unorganized pile of numbers is useless, and wishing it was well-ordered won’t make it so. When I have to organize disparate sets of data, I treat the process as an independent intellectual challenge, something I can be proud of in its own right. This keeps me interested and makes the time go faster. For me, the key is to figure out a way to do as little organization “by hand” as I can.

For the recall election post I wrote on Monday, the biggest problem was collecting the voting results from the California Secretary of State’s office. As far as I could tell, the SoS has published no single table of votes organized by county. Instead, there are 58 individual web pages that look like this:

Calaveras results page

Certainly, we can copy the Yes and No votes from this page and paste them into a spreadsheet or text file, but that process would have to be done 58 times. There’s no way I can do that without making several mistakes as I try to speed through it. So the challenge is in automating the process to avoid boredom-induced blunders.

First, I saw that the Calaveras County page shown above has this URL:

https://electionresults.sos.ca.gov/returns/governor-recall/county/calaveras

It’s a longish prefix followed by the lower-cased name of the county. A quick check on counties with multi-word names—of which California has a lot—showed that those URLs use dashes to separated the names, like

https://electionresults.sos.ca.gov/returns/governor-recall/county/santa-barbara

So if I have a list of the county names, I can use that list to construct the URL for each county in turn. A Google search turned up this page at Ballotpedia. I copied the list and pasted it into a text document named county-names.txt. There was one extra entry in the Ballotpedia list, “Los Angeles County not covered,” which I took to be all of Los Angeles County outside of the city. I deleted that line from the file.

Now that I can automate the generation of URLs, I need to be able to extract the Yes and No votes from an individual county’s page. Scanning through the HTML source of the Calaveras page, I found the code for the vote table:

<table class="propTblCounty stateCountyResultsTbl">
  <thead>
    <tr class="crsTblHdrTop">
      <th colspan="3" rowspan="2">Question</th>
      <th colspan="4">County</th>
      <th colspan="4">State</th>
    </tr>
    <tr class="crsTblHdrTop">
      <th>Yes<br />Votes</th>
      <th>%</th>
      <th>No<br />Votes</th>
      <th>%</th>
      <th>Yes<br />Votes</th>
      <th>%</th>
      <th>No<br />Votes</th>
      <th>%</th>
    </tr>
  </thead>
  <tbody>
    <tr class="evenRow resultsTblBorder">
      <td class="propYes" style="font-size:1.2em">Yes</td>
      <td class="bold"></td>
      <td class="bold">Shall Gavin Newsom</span> Be Recalled (Removed) From the Office of Governor?</td>
      <td class="votesProp">15,133</td>
      <td class="percProp">64.5%</td>
      <td class="votesProp">8,320</td>
      <td class="percProp">35.5%</td>
      <td class="votesProp">4,874,057</td>
      <td class="percProp">38.1%</td>
      <td class="votesProp">7,927,249</td>
      <td class="percProp">61.9%</td>
    </tr>
  </tbody>
</table>

A search through the full HTML source told me that the stateCountyResultsTbl class uniquely identifies the table of interest. And within the body of that table, the first two data cells with a votesProp class will yield the Yes and No votes for the county.

There are lots of ways to extract data from HTML. Because I’m most comfortable programming in Python, I chose to use it and the Beautiful Soup module, which I’ve used often, to get the Yes and No votes.

In the time it took me to drink a couple of cups of tea, I had the following script, collect-votes.py, written.

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import requests
 4:  from bs4 import BeautifulSoup
 5:  
 6:  urlPrefix = 'https://electionresults.sos.ca.gov/returns/governor-recall/county/'
 7:  
 8:  print('County,Yes,No')
 9:  
10:  # Loop through the counties, collecting their Yes and No votes
11:  with open('county-names.txt') as f:
12:    for line in f:
13:      # Construct the URL and parse the HTML from it.
14:      name = line.rstrip()
15:      url = urlPrefix + name.lower().replace(' ', '-')
16:      r = requests.get(url)
17:      soup = BeautifulSoup(r.text, 'html.parser')
18:  
19:      # The table we want is the first (and only)
20:      # of class stateCountyResultsTbl
21:      voteTable = soup.find('table', {"class": "stateCountyResultsTbl"})
22:  
23:      # The Yes and No votes are the first two cells of class votesProp
24:      # in the body of the table
25:      votes = voteTable.tbody.find_all('td', {'class': 'votesProp'})[:2]
26:  
27:      # Strip out any thousand separators
28:      votes = [ x.text.replace(',', '') for x in votes ]
29:      outLine = ','.join([name] + votes)
30:      print(outLine)

This script goes through the lines of county-names.txt and for each county

  1. Generates the URL (Line 15).
  2. Gets the HTML source using the Requests module (Line 16).
  3. Finds the vote table (Line 21).
  4. Finds the Yes and No votes (Line 25).
  5. Strips the thousands separators from the votes (Line 28).
  6. Prints a CSV-style line for the county (Lines 29-30).

A lot of Python programmers would say I should have used the csv module to write out the data. If output were more complicated, I would have done so, but this seemed simple enough to do with just the print function.

(You might think I stripped out the thousands separators simply so I could avoid having commas within the fields of my CSV output. No, that was just a nice side benefit. I stripped the thousands separators to make the numbers in the CSV file easy to parse. Although lots of systems can handle numbers with commas, every system can handle numbers without commas. This is in keeping with Postel’s Law: Be conservative in what you do, be liberal in what you accept from others.)

Running the script via

python collect-votes.py > alphabetical-votes.csv

gave me a CSV data file that looked like this:

County,Yes,No
Alameda,108081,465901
Alpine,225,354
Amador,12895,6957
Butte,43129,36128
Calaveras,15133,8320
Colusa,3977,1996
Contra Costa,130058,324747
Del Norte,5243,3505
El Dorado,58393,39907
Fresno,132691,125710
    etc.

Clean data makes the analysis simple. I could have opened the CSV file in a spreadsheet and done all the analysis there, but I prefer using Pandas, even when it’s overkill, because it keeps my Pandas skills fresh. The script that generated the ordered data table in my earlier post was this:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import pandas as pd
 4:  
 5:  df = pd.read_csv('alphabetical-votes.csv')
 6:  df['Margin'] = df.Yes - df.No
 7:  df.sort_values('Margin', ascending=False, inplace=True, ignore_index=True)
 8:  df['CMargin'] = df.Margin.cumsum()
 9:  df.index += 1
10:  df.to_csv('ordered-votes.csv')

The only tricky thing here is Line 9. Pandas, like Python itself, uses zero-based indexing for the rows of a data frame. I wanted one-based indexing for the table in my post because most people like lists to start with one. The output ordered-votes.csv file looks like this:

,County,Yes,No,Margin,CMargin
1,Kern,126999,78477,48522,48522
2,Placer,114643,85302,29341,77863
3,Shasta,49141,21655,27486,105349
4,Tulare,64372,41009,23363,128712
5,El Dorado,58393,39907,18486,147198
6,Stanislaus,82911,69247,13664,160862
7,Tehama,15958,6186,9772,170634
8,Madera,25638,16233,9405,180039
9,Sutter,20458,11593,8865,188904
10,Kings,19710,11242,8468,197372
    etc.

Long ago, I wrote an AppleScript that converts a table in Numbers to (Multi)Markdown format and pastes it into BBEdit. So I opened ordered-votes.csv in Numbers, turned on the thousands separators in the numeric columns, and ran the conversion script to get this:

|  | County | Yes | No | Margin | CMargin |
|---:|:---|---:|---:|---:|---:|
| 1 | Kern | 126,999 | 78,477 | 48,522 | 48,522 |
| 2 | Placer | 114,643 | 85,302 | 29,341 | 77,863 |
| 3 | Shasta | 49,141 | 21,655 | 27,486 | 105,349 |
| 4 | Tulare | 64,372 | 41,009 | 23,363 | 128,712 |
| 5 | El Dorado | 58,393 | 39,907 | 18,486 | 147,198 |
| 6 | Stanislaus | 82,911 | 69,247 | 13,664 | 160,862 |
| 7 | Tehama | 15,958 | 6,186 | 9,772 | 170,634 |
| 8 | Madera | 25,638 | 16,233 | 9,405 | 180,039 |
| 9 | Sutter | 20,458 | 11,593 | 8,865 | 188,904 |
| 10 | Kings | 19,710 | 11,242 | 8,468 | 197,372 |
    etc.

Another script normalized the table to get all the cells aligned.

|    | County     |     Yes |     No | Margin | CMargin |
|---:|:-----------|--------:|-------:|-------:|--------:|
|  1 | Kern       | 126,999 | 78,477 | 48,522 |  48,522 |
|  2 | Placer     | 114,643 | 85,302 | 29,341 |  77,863 |
|  3 | Shasta     |  49,141 | 21,655 | 27,486 | 105,349 |
|  4 | Tulare     |  64,372 | 41,009 | 23,363 | 128,712 |
|  5 | El Dorado  |  58,393 | 39,907 | 18,486 | 147,198 |
|  6 | Stanislaus |  82,911 | 69,247 | 13,664 | 160,862 |
|  7 | Tehama     |  15,958 |  6,186 |  9,772 | 170,634 |
|  8 | Madera     |  25,638 | 16,233 |  9,405 | 180,039 |
|  9 | Sutter     |  20,458 | 11,593 |  8,865 | 188,904 |
| 10 | Kings      |  19,710 | 11,242 |  8,468 | 197,372 |
    etc.

This isn’t strictly necessary—the HTML generated from the Markdown will be same regardless of the alignment—but it looks nicer as I’m writing the post. By the way, since my earlier post, the numbers have changed a bit. You can now take a subset of 46 counties and get Yes votes to outnumber the No votes:

County Yes No Margin CMargin
1 Kern 126,999 78,477 48,522 48,522
2 Placer 114,643 85,302 29,341 77,863
3 Shasta 49,141 21,655 27,486 105,349
4 Tulare 64,372 41,009 23,363 128,712
5 El Dorado 58,393 39,907 18,486 147,198
6 Stanislaus 82,911 69,247 13,664 160,862
7 Tehama 15,958 6,186 9,772 170,634
8 Madera 25,638 16,233 9,405 180,039
9 Sutter 20,458 11,593 8,865 188,904
10 Kings 19,710 11,242 8,468 197,372
11 Yuba 15,291 7,961 7,330 204,702
12 Riverside 362,958 355,630 7,328 212,030
13 Butte 43,129 36,128 7,001 219,031
14 Fresno 132,691 125,710 6,981 226,012
15 Lassen 8,538 1,604 6,934 232,946
16 Calaveras 15,133 8,320 6,813 239,759
17 Tuolumne 15,832 9,850 5,982 245,741
18 Amador 12,895 6,957 5,938 251,679
19 Siskiyou 11,282 6,951 4,331 256,010
20 Glenn 6,317 2,479 3,838 259,848
21 Plumas 5,837 3,407 2,430 262,278
22 Merced 30,210 27,867 2,343 264,621
23 Mariposa 5,378 3,376 2,002 266,623
24 Colusa 3,977 1,996 1,981 268,604
25 Modoc 2,508 706 1,802 270,406
26 Del Norte 5,243 3,505 1,738 272,144
27 Inyo 4,128 3,496 632 272,776
28 Trinity 2,699 2,106 593 273,369
29 Sierra 1,064 616 448 273,817
30 Alpine 225 354 -129 273,688
31 Mono 2,245 2,805 -560 273,128
32 Lake 3,728 5,605 -1,877 271,251
33 San Bernardino 285,650 288,291 -2,641 268,610
34 San Benito 9,181 12,595 -3,414 265,196
35 Nevada 25,426 29,851 -4,425 260,771
36 Imperial 12,193 18,210 -6,017 254,754
37 San Luis Obispo 61,148 68,365 -7,217 247,537
38 Mendocino 11,870 21,852 -9,982 237,555
39 San Joaquin 94,877 105,405 -10,528 227,027
40 Humboldt 18,179 32,585 -14,406 212,621
41 Napa 18,681 38,948 -20,267 192,354
42 Yolo 24,769 52,444 -27,675 164,679
43 Santa Barbara 57,355 92,905 -35,550 129,129
44 Orange 547,685 586,457 -38,772 90,357
45 Solano 58,372 97,935 -39,563 50,794
46 Monterey 38,169 80,664 -42,495 8,299
47 Ventura 136,610 182,470 -45,860 -37,561
48 Santa Cruz 25,454 90,874 -65,420 -102,981
49 Marin 24,273 108,599 -84,326 -187,307
50 Sonoma 57,396 160,550 -103,154 -290,461
51 Sacramento 220,498 329,952 -109,454 -399,915
52 San Mateo 64,329 227,174 -162,845 -562,760
53 San Diego 504,915 674,417 -169,502 -732,262
54 Contra Costa 130,058 324,747 -194,689 -926,951
55 San Francisco 47,193 292,744 -245,551 -1,172,502
56 Santa Clara 166,930 468,680 -301,750 -1,474,252
57 Alameda 108,081 465,901 -357,820 -1,832,072
58 Los Angeles 855,234 2,076,354 -1,221,120 -3,053,192

The overall change in votes makes Newsom’s winning margin only about 3500 votes lower than it was last Monday, but the way the pluses and minuses were distributed increased our worthless count of counties by one.

This, by the way, demonstrates another advantage of automating the data collection and organization. In addition to keeping me interested, it also makes it easy to rerun the analysis if the data change. When the vote-counting is finalized later in the month, I can check the numbers again by just running a couple of scripts.


Recall this analysis

We’re now several years into the trend of “data journalism,” and I’m still not sure whether I like it. On the one hand, it’s nice to see reporters writing about numbers; on the other hand, numbers typically aren’t their strong suit. Too often we get articles whose sole purpose seems to be to show off some colorful graphs.

Last Thursday, the San Francisco Chronicle published this article, entitled “If California was made up of just these 35 counties, Gavin Newsom would have been recalled.” That a similarly-titled article could be written about almost any election—”Some people voted for, others against”—didn’t prevent the Chronicle from going with this dog-bites-man story.

The article starts with this long bar chart of counties:

SF Chronicle bar chart

The counties are ordered according to their percent of Yes votes (in favor of recall), and the bars represent the cumulative percentage of Yes and No votes as we include the votes of each county in turn. If you think that’s a contrived and tortuous way to present the data, you’re not alone. The reporter, Nami Sumida, spent five paragraphs explaining it, and I’ll bet those paragraphs were written at the insistence of confused and math-averse editors.

The flip from cumulative Yes to cumulative No comes when we add the Ventura County vote. If you put the effort into counting down the list, Lassen through San Diego is 35 counties, hence the title of the piece.

If you’d like to see the numerical results for each county, the article gives them to you in a series of tables that look like this:

SF Chronicle table

The colors in this table made me see red. Or maybe green. Using the same color to represent contradictory data is just inexcusably sloppy work. Maybe if the editors hadn’t been so confused by the construction of the bar chart, they would have noticed that Sumida flipped the meaning of red and green in the middle of the article.

The article also includes this stylized map of California, with each of the counties represented by a hexagon in its approximate geographic location and the titular 35 counties tinted.1

SF Chronicle map

This use of equal-sized hexagons in place of actual geographic shapes has become a popular way to make colored maps without over-representing large areas. In this case, of course, it has the effect of under-representing populous areas, making it seem as if each county counts as much as every other county in the recall vote. This will be a surprise to both the 10,000,000 residents of Los Angeles County and the 1,200 residents of Alpine County.

You might look at the map and think that 35 counties had a majority of Yes votes. I wouldn’t blame you if you did. That’s normally how people use color when showing election results on a map. But six of these counties—San Diego, Mono, San Joaquin, Nevada, Orange, and San Bernardino—had a majority of No votes. It’s just that if you add their votes to the 29 counties that really did have a majority of Yes votes, you still end up with a Yes majority.

Are there other ways to take a subset of California’s 58 counties and get a majority of votes in favor of recall? God, yes. Even if we restrict ourselves to subsets of 35, there are \(8.8\times10^{15}\) (that’s 8.8 quadrillion) ways to combine 58 things, taking them 35 at a time. Only a small fraction of these combinations will yield a majority of Yes votes, but even a small fraction of 8.8 quadrillion is a big number. Taking the election results (as of October 1) from the California Secretary of State’s office and using the combinations function from Python’s itertools module, I wrote a short bit of brute force code and in a few minutes found over ten million combinations that did the trick.2 And you don’t have to use a subset of exactly 35 counties.

This is perhaps the weirdest thing about the article. It’s clearly intended to slice up California in a way to show that Newsom is unpopular across some large chunk of the state. What other purpose is served by including counties that voted to retain him? So why not go for the gusto? Why not work out the largest number of counties which, when added together, would have arrived at a majority for recall?

It isn’t hard to do. Instead of ordering the counties by their percentage of Yes votes, order them by their margin of Yes votes. Like this:

County Yes No Margin CMargin
1 Kern 117,584 73,781 43,803 43,803
2 Shasta 49,141 21,655 27,486 71,289
3 Placer 103,192 80,088 23,104 94,393
4 Tulare 63,680 40,640 23,040 117,433
5 El Dorado 58,062 39,743 18,319 135,752
6 Stanislaus 81,456 68,085 13,371 149,123
7 Tehama 15,958 6,186 9,772 158,895
8 Madera 25,638 16,233 9,405 168,300
9 Sutter 20,216 11,484 8,732 177,032
10 Kings 19,710 11,242 8,468 185,500
11 Riverside 362,958 355,630 7,328 192,828
12 Fresno 130,580 123,433 7,147 199,975
13 Yuba 14,839 7,784 7,055 207,030
14 Butte 42,703 35,707 6,996 214,026
15 Lassen 8,532 1,600 6,932 220,958
16 Calaveras 14,559 8,018 6,541 227,499
17 Tuolumne 15,832 9,850 5,982 233,481
18 Amador 12,895 6,957 5,938 239,419
19 Siskiyou 11,282 6,951 4,331 243,750
20 Glenn 6,317 2,479 3,838 247,588
21 Merced 29,926 27,517 2,409 249,997
22 Plumas 5,106 3,008 2,098 252,095
23 Mariposa 5,375 3,376 1,999 254,094
24 Colusa 3,977 1,996 1,981 256,075
25 Modoc 2,508 706 1,802 257,877
26 Del Norte 5,137 3,454 1,683 259,560
27 Inyo 4,128 3,496 632 260,192
28 Trinity 2,699 2,106 593 260,785
29 Sierra 1,064 616 448 261,233
30 Alpine 218 340 -122 261,111
31 Mono 2,186 2,719 -533 260,578
32 Lake 3,728 5,605 -1,877 258,701
33 San Bernardino 282,659 285,596 -2,937 255,764
34 San Benito 9,181 12,595 -3,414 252,350
35 Nevada 25,273 29,702 -4,429 247,921
36 Imperial 12,193 18,210 -6,017 241,904
37 Mendocino 11,870 21,852 -9,982 231,922
38 San Luis Obispo 47,882 59,364 -11,482 220,440
39 Humboldt 15,442 28,810 -13,368 207,072
40 San Joaquin 79,097 94,109 -15,012 192,060
41 Napa 17,747 37,134 -19,387 172,673
42 Yolo 24,273 51,405 -27,132 145,541
43 Santa Barbara 57,355 92,905 -35,550 109,991
44 Orange 547,685 586,457 -38,772 71,219
45 Solano 58,277 97,839 -39,562 31,657
46 Monterey 38,169 80,664 -42,495 -10,838
47 Ventura 136,389 182,158 -45,769 -56,607
48 Santa Cruz 24,188 86,669 -62,481 -119,088
49 Marin 22,701 105,508 -82,807 -201,895
50 Sonoma 45,443 142,866 -97,423 -299,318
51 Sacramento 218,432 327,482 -109,050 -408,368
52 San Mateo 64,250 226,891 -162,641 -571,009
53 San Diego 502,226 671,379 -169,153 -740,162
54 Contra Costa 128,259 321,242 -192,983 -933,145
55 San Francisco 47,053 292,180 -245,127 -1,178,272
56 Santa Clara 166,827 468,486 -301,659 -1,479,931
57 Alameda 108,081 465,901 -357,820 -1,837,751
58 Los Angeles 853,398 2,072,346 -1,218,948 -3,056,699

Each county’s CMargin column is the cumulative margin from the top of the list. By ordering them this way, the Yes margin gets nibbled away as slowly as possible as you move down the list. As you can see, if the Chronicle had ordered the counties this way, they could’ve written an article with the headline “If California was made up of just these 45 counties, Gavin Newsom would have been recalled.”

And we still wouldn’t be any smarter.


  1. And yes, they’re tinted green, another example of changing the meaning of colors mid-article. 

  2. How long would it take to check all 8.8 quadrillion combinations by brute force? I don’t know, but it’s likely to be hundreds of years, and I didn’t feel like waiting that long.