The Electoral College again, this time with aggregation
August 31, 2024 at 11:27 PM by Dr. Drang
After the last post, I had a conversation with ondaiwai on Mastodon about the to_markdown
and agg
functions in Pandas. After seeing his example code here and here, I rewrote my script, and I think it’s much better.
In a nutshell, the agg
function aggregates the data pulled together by groupby
and creates a new dataframe that can be formatted as a MultiMarkdown table via to_markdown
. I had used agg
several years ago but forgot about it. And I’d never seen to_markdown
before. Let’s see how they work to make a nicely formatted Electoral College table.
This time, instead of showing you the script in pieces, I’ll show the whole thing at once, and we’ll consider each section in turn
python:
1: #!/usr/bin/env python
2:
3: import pandas as pd
4:
5: # Import the raw data
6: df = pd.read_csv('states.csv')
7:
8: # Generate new fields for percentages
9: popSum = df.Population.sum()
10: ecSum = df.Electors.sum()
11: df['PopPct'] = df.Population/popSum
12: df['ECPct'] = df.Electors/ecSum
13:
14: # Collect states with same numbers of electors
15: basicAgg = {'State': 'count', 'Population': 'sum', 'PopPct': 'sum',\
16: 'Electors': 'sum', 'ECPct': 'sum'}
17: dfgBasic = df.groupby(by='Electors').agg(basicAgg)
18:
19: # Print out the summary table
20: print(dfgBasic)
21:
22: print()
23: print()
24:
25: # Collect as above but for summary table in blog post
26: tableAgg = {'Abbrev': lambda s: ', '.join(s), 'PopPct': 'sum', 'ECPct': 'sum'}
27: dfgTable = df.groupby(by='Electors').agg(tableAgg)
28:
29: # Print as Markdown table
30: print(dfgTable.to_markdown(floatfmt='.2%',\
31: headers=['Electors', 'States', 'Pop Pct', 'EC Pct'],\
32: colalign=['center', 'center', 'right', 'right'] ))
Lines 1–12 are the same as before; they import the state population and Electoral College information into a dataframe, df
, and then add a couple of columns for percentages of the totals. After Line 12, the first five rows of df
are
State | Abbrev | Population | Electors | PopPct | ECPct |
---|---|---|---|---|---|
Alabama | AL | 5108468 | 9 | 0.015253 | 0.016729 |
Alaska | AK | 733406 | 3 | 0.002190 | 0.005576 |
Arizona | AZ | 7431344 | 11 | 0.022189 | 0.020446 |
Arkansas | AR | 3067732 | 6 | 0.009160 | 0.011152 |
California | CA | 38965193 | 54 | 0.116344 | 0.100372 |
The code in Lines 15–20 groups the data according to the number of Electoral College votes per state and prints out a summary table intended for my use. “Intended for my use” means the output is clear but not the sort of thing I’d want to present to anyone else. “Quick and dirty” would be another way to describe the output, which is this:
State Population PopPct Electors ECPct
Electors
3 7 5379033 0.016061 21 0.039033
4 7 10196485 0.030445 28 0.052045
5 2 4092750 0.012220 10 0.018587
6 6 18766882 0.056035 36 0.066914
7 2 7671000 0.022904 14 0.026022
8 3 13333261 0.039811 24 0.044610
9 2 10482023 0.031298 18 0.033457
10 5 29902889 0.089285 50 0.092937
11 4 28421431 0.084862 44 0.081784
12 1 7812880 0.023328 12 0.022305
13 1 8715698 0.026024 13 0.024164
14 1 9290841 0.027741 14 0.026022
15 1 10037261 0.029970 15 0.027881
16 2 21864718 0.065284 32 0.059480
17 1 11785935 0.035191 17 0.031599
19 2 25511372 0.076173 38 0.070632
28 1 19571216 0.058436 28 0.052045
30 1 22610726 0.067512 30 0.055762
40 1 30503301 0.091078 40 0.074349
54 1 38965193 0.116344 54 0.100372
The agg
function has many options for passing arguments, one of which is a dictionary in which the column names are the keys and the aggregation functions to be applied to those columns are the values. We create that dictionary, basicAgg
, in Lines 15–16, where the State
column is counted and the other columns are summed. Notice that the values of basicAgg
are strings with the names of the functions to be applied.1
Line 17 then groups the dataframe by the Electors
column and runs the aggregation functions specified in basicAgg
on the appropriate columns. The output is a new dataframe, dfgBasic
, which is then printed out in Line 20. Nothing fancy in the print
function because this is meant to be quick and dirty.
After a couple of print()
s on Lines 22–23 to give us some whitespace, the rest of the code creates a Markdown table that looks like this:
| Electors | States | Pop Pct | EC Pct |
|:----------:|:--------------------------:|----------:|---------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
which presents just the data I want to show and formats it nicely for pasting into the blog post.
Line 26 creates a new aggregation dictionary, tableAgg
. It’s similar to basicAgg
except for the function applied to the Abbrev
column. Because there’s no built-in function (as far as I know) for turning a column into a comma-separated string of its contents, I made one. Because this function is very short and I’m using it only once, I added it to tableAgg
as a lambda function:
python:
lambda s: ', '.join(s)
This illustrates the overall structure of an aggregation function. It takes a Pandas series (the column) as input and produces a scalar value as output. Because a series acts like a list, it can be passed directly to the join
function, producing the comma-separated string I wanted in the table.
The last few lines print out the table in Markdown format, using the to_markdown
function to generate the header line, the format line, and the pipe characters between the columns. While I didn’t know about to_markdown
until ondaiwai told me about it, it does its work via the tabulate
module, which I have used (badly) in the past. The keyword arguments given to to_markdown
in Lines 30–32 are passed on to tabulate
to control the formatting:
floatfmt='.2%'
sets the formatting of all the floating point numbers, which in this case are thePopPct
andECPct
columns. Note that in the quick-and-dirty table, these columns are printed as decimal values with more digits than necessary.headers=['Electors', 'States', 'Pop Pct', 'EC Pct']
sets the header names to something nicer than the default, which would be the names indf
.colalign=['center', 'center', 'right', 'right']
sets the alignment of the four columns. Without this, the Electors column would be right-aligned (the default for numbers) and the States column would be left-aligned (the default for strings).
Why do I think this code is better than what I showed in the last post? Mainly because this code operates on dataframes and series all at once instead of looping through items as my previous code did. This is how Pandas is meant to be used. Also, the functions being applied to the columns are more obvious because of how the basicAgg
and tableAgg
dictionaries are built. In my previous code, these functions are inside braces in f-strings, where they’re harder to see. Similarly, the formatting of the Markdown table is easier to see when it’s given as arguments to a function instead of buried in print
commands.
My thanks again to ondaiwai for introducing me to to_markdown
and getting me to think in a more Pandas-like way. He rewrote and extended the code snippets he posted on Mastodon and made a gist of it. After seeing his code, I felt pretty good about my script; it’s different from his, but it shows that I understood what he was saying in the tweets.
Update 7 Sep 2024 12:33 PM
Kieran Healy followed up with a post on doing this summary analysis in R, but there’s more to his post than a simple language translation. He does a better job than I did at explaining why we should use higher-level abstractions in our analysis code. While I just kind of waved my hands and said “this is how Pandas is meant to be used,” Kieran explains that this is a general principle, not just a feature of Pandas. He also does a good job of relating the Pandas agg
function to the split-apply-combine strategy, which has been formalized in Pandas, R, and Julia (and, I assume, other data analysis systems).
Using abstractions like this in programming is analogous to how we advance in mathematics. If you’re doing a problem that requires matrix multiplication, you don’t write out every step of row-column multiplication and addition, you just say
This is how you get to think about the purpose of the operation, not the nitty-gritty of how it’s done.
-
While you can, in theory, use the functions themselves as the values (e.g.,
sum
instead of'sum'
), I found that doesn’t necessarily work for all aggregation functions. Thecount
function, for example, seems to work only if it’s given as a string. I’m sure there’s a good reason for this, but I haven’t figured it out yet. ↩