The Electoral College again, this time with aggregation

After the last post, I had a conversation with ondaiwai on Mastodon about the to_markdown and agg functions in Pandas. After seeing his example code here and here, I rewrote my script, and I think it’s much better.

In a nutshell, the agg function aggregates the data pulled together by groupby and creates a new dataframe that can be formatted as a MultiMarkdown table via to_markdown. I had used agg several years ago but forgot about it. And I’d never seen to_markdown before. Let’s see how they work to make a nicely formatted Electoral College table.

This time, instead of showing you the script in pieces, I’ll show the whole thing at once, and we’ll consider each section in turn

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import pandas as pd
 4:  
 5:  # Import the raw data
 6:  df = pd.read_csv('states.csv')
 7:  
 8:  # Generate new fields for percentages
 9:  popSum = df.Population.sum()
10:  ecSum = df.Electors.sum()
11:  df['PopPct'] = df.Population/popSum
12:  df['ECPct'] = df.Electors/ecSum
13:  
14:  # Collect states with same numbers of electors
15:  basicAgg = {'State': 'count', 'Population': 'sum', 'PopPct': 'sum',\
16:              'Electors': 'sum', 'ECPct': 'sum'}
17:  dfgBasic = df.groupby(by='Electors').agg(basicAgg)
18:  
19:  # Print out the summary table
20:  print(dfgBasic)
21:  
22:  print()
23:  print()
24:  
25:  # Collect as above but for summary table in blog post
26:  tableAgg = {'Abbrev': lambda s: ', '.join(s), 'PopPct': 'sum', 'ECPct': 'sum'}
27:  dfgTable = df.groupby(by='Electors').agg(tableAgg)
28:  
29:  # Print as Markdown table
30:  print(dfgTable.to_markdown(floatfmt='.2%',\
31:          headers=['Electors', 'States', 'Pop Pct', 'EC Pct'],\
32:          colalign=['center', 'center', 'right', 'right'] ))

Lines 1–12 are the same as before; they import the state population and Electoral College information into a dataframe, df, and then add a couple of columns for percentages of the totals. After Line 12, the first five rows of df are

State Abbrev Population Electors PopPct ECPct
Alabama AL 5108468 9 0.015253 0.016729
Alaska AK 733406 3 0.002190 0.005576
Arizona AZ 7431344 11 0.022189 0.020446
Arkansas AR 3067732 6 0.009160 0.011152
California CA 38965193 54 0.116344 0.100372

The code in Lines 15–20 groups the data according to the number of Electoral College votes per state and prints out a summary table intended for my use. “Intended for my use” means the output is clear but not the sort of thing I’d want to present to anyone else. “Quick and dirty” would be another way to describe the output, which is this:

          State  Population    PopPct  Electors     ECPct
Electors                                                 
3             7     5379033  0.016061        21  0.039033
4             7    10196485  0.030445        28  0.052045
5             2     4092750  0.012220        10  0.018587
6             6    18766882  0.056035        36  0.066914
7             2     7671000  0.022904        14  0.026022
8             3    13333261  0.039811        24  0.044610
9             2    10482023  0.031298        18  0.033457
10            5    29902889  0.089285        50  0.092937
11            4    28421431  0.084862        44  0.081784
12            1     7812880  0.023328        12  0.022305
13            1     8715698  0.026024        13  0.024164
14            1     9290841  0.027741        14  0.026022
15            1    10037261  0.029970        15  0.027881
16            2    21864718  0.065284        32  0.059480
17            1    11785935  0.035191        17  0.031599
19            2    25511372  0.076173        38  0.070632
28            1    19571216  0.058436        28  0.052045
30            1    22610726  0.067512        30  0.055762
40            1    30503301  0.091078        40  0.074349
54            1    38965193  0.116344        54  0.100372

The agg function has many options for passing arguments, one of which is a dictionary in which the column names are the keys and the aggregation functions to be applied to those columns are the values. We create that dictionary, basicAgg, in Lines 15–16, where the State column is counted and the other columns are summed. Notice that the values of basicAgg are strings with the names of the functions to be applied.1

Line 17 then groups the dataframe by the Electors column and runs the aggregation functions specified in basicAgg on the appropriate columns. The output is a new dataframe, dfgBasic, which is then printed out in Line 20. Nothing fancy in the print function because this is meant to be quick and dirty.

After a couple of print()s on Lines 22–23 to give us some whitespace, the rest of the code creates a Markdown table that looks like this:

|  Electors  |           States           |   Pop Pct |   EC Pct |
|:----------:|:--------------------------:|----------:|---------:|
|     3      | AK, DE, DC, ND, SD, VT, WY |     1.61% |    3.90% |
|     4      | HI, ID, ME, MT, NH, RI, WV |     3.04% |    5.20% |
|     5      |           NE, NM           |     1.22% |    1.86% |
|     6      |   AR, IA, KS, MS, NV, UT   |     5.60% |    6.69% |
|     7      |           CT, OK           |     2.29% |    2.60% |
|     8      |         KY, LA, OR         |     3.98% |    4.46% |
|     9      |           AL, SC           |     3.13% |    3.35% |
|     10     |     CO, MD, MN, MO, WI     |     8.93% |    9.29% |
|     11     |       AZ, IN, MA, TN       |     8.49% |    8.18% |
|     12     |             WA             |     2.33% |    2.23% |
|     13     |             VA             |     2.60% |    2.42% |
|     14     |             NJ             |     2.77% |    2.60% |
|     15     |             MI             |     3.00% |    2.79% |
|     16     |           GA, NC           |     6.53% |    5.95% |
|     17     |             OH             |     3.52% |    3.16% |
|     19     |           IL, PA           |     7.62% |    7.06% |
|     28     |             NY             |     5.84% |    5.20% |
|     30     |             FL             |     6.75% |    5.58% |
|     40     |             TX             |     9.11% |    7.43% |
|     54     |             CA             |    11.63% |   10.04% |

which presents just the data I want to show and formats it nicely for pasting into the blog post.

Line 26 creates a new aggregation dictionary, tableAgg. It’s similar to basicAgg except for the function applied to the Abbrev column. Because there’s no built-in function (as far as I know) for turning a column into a comma-separated string of its contents, I made one. Because this function is very short and I’m using it only once, I added it to tableAgg as a lambda function:

python:
lambda s: ', '.join(s)

This illustrates the overall structure of an aggregation function. It takes a Pandas series (the column) as input and produces a scalar value as output. Because a series acts like a list, it can be passed directly to the join function, producing the comma-separated string I wanted in the table.

The last few lines print out the table in Markdown format, using the to_markdown function to generate the header line, the format line, and the pipe characters between the columns. While I didn’t know about to_markdown until ondaiwai told me about it, it does its work via the tabulate module, which I have used (badly) in the past. The keyword arguments given to to_markdown in Lines 30–32 are passed on to tabulate to control the formatting:

  1. floatfmt='.2%' sets the formatting of all the floating point numbers, which in this case are the PopPct and ECPct columns. Note that in the quick-and-dirty table, these columns are printed as decimal values with more digits than necessary.
  2. headers=['Electors', 'States', 'Pop Pct', 'EC Pct'] sets the header names to something nicer than the default, which would be the names in df.
  3. colalign=['center', 'center', 'right', 'right'] sets the alignment of the four columns. Without this, the Electors column would be right-aligned (the default for numbers) and the States column would be left-aligned (the default for strings).

Why do I think this code is better than what I showed in the last post? Mainly because this code operates on dataframes and series all at once instead of looping through items as my previous code did. This is how Pandas is meant to be used. Also, the functions being applied to the columns are more obvious because of how the basicAgg and tableAgg dictionaries are built. In my previous code, these functions are inside braces in f-strings, where they’re harder to see. Similarly, the formatting of the Markdown table is easier to see when it’s given as arguments to a function instead of buried in print commands.

My thanks again to ondaiwai for introducing me to to_markdown and getting me to think in a more Pandas-like way. He rewrote and extended the code snippets he posted on Mastodon and made a gist of it. After seeing his code, I felt pretty good about my script; it’s different from his, but it shows that I understood what he was saying in the tweets.

Update 7 Sep 2024 12:33 PM
Kieran Healy followed up with a post on doing this summary analysis in R, but there’s more to his post than a simple language translation. He does a better job than I did at explaining why we should use higher-level abstractions in our analysis code. While I just kind of waved my hands and said “this is how Pandas is meant to be used,” Kieran explains that this is a general principle, not just a feature of Pandas. He also does a good job of relating the Pandas agg function to the split-apply-combine strategy, which has been formalized in Pandas, R, and Julia (and, I assume, other data analysis systems).

Using abstractions like this in programming is analogous to how we advance in mathematics. If you’re doing a problem that requires matrix multiplication, you don’t write out every step of row-column multiplication and addition, you just say

C=AB

This is how you get to think about the purpose of the operation, not the nitty-gritty of how it’s done.


  1. While you can, in theory, use the functions themselves as the values (e.g., sum instead of 'sum'), I found that doesn’t necessarily work for all aggregation functions. The count function, for example, seems to work only if it’s given as a string. I’m sure there’s a good reason for this, but I haven’t figured it out yet.