Two things in ninety-nine minutes
September 10, 2024 at 10:09 PM by Dr. Drang
One of the best things about not being an Apple blogger is that I don’t feel compelled to comment on everything Apple does. I was reminded of this when I read Stephen Hackett’s roundup of yesterday’s event. It was my favorite post of the day, and he came up with a great angle on the presentation, but I really felt his struggle to get it done and off his to-do list.
It turns out, though, that I actually do have something to say about yesterday’s announcements—two of the announcements, anyway. I’m not sure it was really worth 99 minutes1 of my remaining life to watch it in real time, but that’s an issue I’ll consider before the next event.
The first item of importance to me has to do with the two new versions of the iPhone Pro. I’m currently using an iPhone 13 Pro, so I’ve long planned to upgrade this fall. Even though, as lots of people have pointed out, the differences between the regular and Pro versions are perhaps smaller this year than usual, I still want the Pro because I take lots of photos outside in which the subject is pretty far away and needs to be brought in by as much optical zoom as the camera can muster. Piyush Pratik’s valiant effort (starting about 57 minutes into the presentation) to make the non-Pro’s camera seem like it had enough zoom fell flat with me.
Which gets to the concern I had before the event. Last year, the 5X camera came on the Pro Max only, and I was wondering what I’d do if that held true this year, as well. Luckily, the 5X camera came down to the Pro (about an hour and twenty minutes [whew!] into the presentation) and I’m not faced with the prospect of buying a phone that’s bigger and more expensive than I would otherwise need.
The other item of interest was the addition of mapping to paddling workouts (about 18 minutes into the presentation) in the Workout and Activity apps. This was mentioned during WWDC, and I was glad to see that it’ll be in next week’s updates to watchOS and iOS. No waiting until “later this fall/year.”
I complained about the lack of mapping for paddling in Workout/Activity back in May. I mentioned in that post that I was going to try Strava. I did and took an instant dislike to it.2 After that, I used WorkOutDoors and Paddle Logger. They both work well—I’m using Paddle Logger now—but have one serious deficiency: they chew through my watch’s battery like nobody’s business.
For example, a couple of days ago I went for a longish kayak trip on the Hennepin Canal in western Illinois. I was out on the water for about 2½ hours, and my watch’s battery went from 100% (I had charged it fully in the car during the drive to the canal) down to somewhere near 35%. I say “somewhere near” because I forgot to check it when I stopped paddling. It was at 35% about an hour later when all I had done in the meantime was drive.
My hope is that Apple’s history of battery consciousness will hold and that the paddling workout will be as parsimonious as the walking and biking workouts. I will gladly give up the extra data Paddle Logger provides if I don’t have to worry about charging my watch before every kayaking workout.
OK, there was one other part of the presentation that caught my attention: the extension of the AirPods Pro into a kinda sorta hearing aid (starting at about the 33 minute mark). Now, the AirPods Pro already have Conversation Boost, but I guess this new thing goes beyond that. I know I have hearing loss at high frequencies—more than normal for my age—but I’ve never had to boost the volume on my TV or my car stereo to a point that makes people with me uncomfortable. Still, the idea of having music adjusted to make up for my hearing deficiencies is pretty intriguing. We’ll have to wait to learn what “this fall” means for this feature.
Kilometers and the golden ratio
September 3, 2024 at 8:38 AM by Dr. Drang
Regular readers know I enjoy reading John D. Cook’s blog and often comment on it here. But I was a little creeped out by it a couple of days ago. It started off with something I’ve been thinking about a lot over the past several weeks, and it was as if he’d been reading my mind.
The post is about how the conversion factor between miles and kilometers, 1.609, is close to the golden ratio, φ ≈ 1.618. To convert kilometers to miles, you can make a good estimate by multiplying by 1/φ, which means that you can convert in the other direction by multiplying by φ.
You may think multiplying by an irrational number is a pain in the ass, and you’d be right. Cook gets around this by approximating φ by the ratio of consecutive Fibonacci numbers, so you can, for example, convert from miles to kilometers by multiplying by 21 and dividing by 13. Similarly, you can use consecutive Lucas numbers in the same fashion, multiplying, say, by 29 and dividing by 18.
The problem with these calculations is that I’m not John D. Cook, and I can’t do multiplications and divisions like this in my head without losing digits along the way. So my conversion method is much cruder: to go from miles to kilometers, I multiply by 16; and to go in the opposite direction, I multiply by 6. In both cases, I finish by shifting the decimal point to divide by 10. If I need more precision than this, I pull out my phone, launch PCalc, and use its built-in conversions.
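If you’re curious how much accuracy each shortcut gives up, here’s a quick sketch (mine, not Cook’s) that compares them against the exact factor of 1.609344 km per mile:

python:
# Compare shortcut conversion factors against the exact
# miles-to-kilometers factor (1 mile = 1.609344 km)
EXACT = 1.609344

approximations = {
    'golden ratio':      (1 + 5**0.5)/2,   # 1.6180
    'Fibonacci (21/13)': 21/13,            # 1.6154
    'Lucas (29/18)':     29/18,            # 1.6111
    'crude (16/10)':     1.6,
}

for name, value in approximations.items():
    error = (value - EXACT)/EXACT
    print(f'{name:18}  {value:.4f}  ({error:+.2%})')

Even the crude factor is off by less than 0.6%, which is plenty close for judging the length of a walk.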
By the way, the main reason I’ve been converting between miles and kilometers lately is that I recently switched the units I use in the Activity/Fitness app1 from miles to kilometers. I now often find myself in the middle of one of my walking routes wondering how long it is in kilometers. I know the lengths of all my routes in miles but haven’t memorized their lengths in kilometers yet. It was while doing conversions like this that I noticed that the conversion factor was close to φ and started doing a lot of multiplication by 16.
If you’re wondering why I bothered switching units, it’s because I enter 5k races a few times a year and I like my time to be under 45 minutes. To keep myself in shape for this, every couple of weeks I push myself to do my first 5k in that time. It’s much easier to look at my watch and know that my pace should be about 9:00 per kilometer than 14:29 per mile. Also, it’s easier to know when I’m done—marking my time at 3.11 miles is as much a pain in the ass as multiplying by φ.
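In case you want to check the pace arithmetic, here’s a quick sketch:

python:
# Pace needed to finish 5 km in 45:00, per kilometer and per mile
KM_PER_MILE = 1.609344
total_minutes = 45

for distance in (5, 5/KM_PER_MILE):        # kilometers, then miles
    pace = total_minutes/distance          # minutes per unit distance
    minutes, seconds = divmod(round(pace*60), 60)
    print(f'{minutes}:{seconds:02d}')      # 9:00, then 14:29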
-
Does it really have to have different names on the watch and the phone? I don’t get confused when I’m using them because the icons are the same, but I never know which name to use when talking about them. ↩
The Electoral College again, this time with aggregation
August 31, 2024 at 11:27 PM by Dr. Drang
After the last post, I had a conversation with ondaiwai on Mastodon about the to_markdown and agg functions in Pandas. After seeing his example code here and here, I rewrote my script, and I think it’s much better.
In a nutshell, the agg function aggregates the data pulled together by groupby and creates a new dataframe that can be formatted as a MultiMarkdown table via to_markdown. I had used agg several years ago but forgot about it. And I’d never seen to_markdown before. Let’s see how they work to make a nicely formatted Electoral College table.
This time, instead of showing you the script in pieces, I’ll show the whole thing at once, and we’ll consider each section in turn:
python:
1: #!/usr/bin/env python
2:
3: import pandas as pd
4:
5: # Import the raw data
6: df = pd.read_csv('states.csv')
7:
8: # Generate new fields for percentages
9: popSum = df.Population.sum()
10: ecSum = df.Electors.sum()
11: df['PopPct'] = df.Population/popSum
12: df['ECPct'] = df.Electors/ecSum
13:
14: # Collect states with same numbers of electors
15: basicAgg = {'State': 'count', 'Population': 'sum', 'PopPct': 'sum',\
16:             'Electors': 'sum', 'ECPct': 'sum'}
17: dfgBasic = df.groupby(by='Electors').agg(basicAgg)
18:
19: # Print out the summary table
20: print(dfgBasic)
21:
22: print()
23: print()
24:
25: # Collect as above but for summary table in blog post
26: tableAgg = {'Abbrev': lambda s: ', '.join(s), 'PopPct': 'sum', 'ECPct': 'sum'}
27: dfgTable = df.groupby(by='Electors').agg(tableAgg)
28:
29: # Print as Markdown table
30: print(dfgTable.to_markdown(floatfmt='.2%',\
31:         headers=['Electors', 'States', 'Pop Pct', 'EC Pct'],\
32:         colalign=['center', 'center', 'right', 'right']))
Lines 1–12 are the same as before; they import the state population and Electoral College information into a dataframe, df, and then add a couple of columns for percentages of the totals. After Line 12, the first five rows of df are
| State | Abbrev | Population | Electors | PopPct | ECPct |
|---|---|---|---|---|---|
| Alabama | AL | 5108468 | 9 | 0.015253 | 0.016729 |
| Alaska | AK | 733406 | 3 | 0.002190 | 0.005576 |
| Arizona | AZ | 7431344 | 11 | 0.022189 | 0.020446 |
| Arkansas | AR | 3067732 | 6 | 0.009160 | 0.011152 |
| California | CA | 38965193 | 54 | 0.116344 | 0.100372 |
The code in Lines 15–20 groups the data according to the number of Electoral College votes per state and prints out a summary table intended for my use. “Intended for my use” means the output is clear but not the sort of thing I’d want to present to anyone else. “Quick and dirty” would be another way to describe the output, which is this:
State Population PopPct Electors ECPct
Electors
3 7 5379033 0.016061 21 0.039033
4 7 10196485 0.030445 28 0.052045
5 2 4092750 0.012220 10 0.018587
6 6 18766882 0.056035 36 0.066914
7 2 7671000 0.022904 14 0.026022
8 3 13333261 0.039811 24 0.044610
9 2 10482023 0.031298 18 0.033457
10 5 29902889 0.089285 50 0.092937
11 4 28421431 0.084862 44 0.081784
12 1 7812880 0.023328 12 0.022305
13 1 8715698 0.026024 13 0.024164
14 1 9290841 0.027741 14 0.026022
15 1 10037261 0.029970 15 0.027881
16 2 21864718 0.065284 32 0.059480
17 1 11785935 0.035191 17 0.031599
19 2 25511372 0.076173 38 0.070632
28 1 19571216 0.058436 28 0.052045
30 1 22610726 0.067512 30 0.055762
40 1 30503301 0.091078 40 0.074349
54 1 38965193 0.116344 54 0.100372
The agg function has many options for passing arguments, one of which is a dictionary in which the column names are the keys and the aggregation functions to be applied to those columns are the values. We create that dictionary, basicAgg, in Lines 15–16, where the State column is counted and the other columns are summed. Notice that the values of basicAgg are strings with the names of the functions to be applied.1
Line 17 then groups the dataframe by the Electors column and runs the aggregation functions specified in basicAgg on the appropriate columns. The output is a new dataframe, dfgBasic, which is then printed out in Line 20. Nothing fancy in the print function because this is meant to be quick and dirty.
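If you haven’t seen agg used with a dictionary before, here’s a minimal sketch of the idea on a made-up three-state dataframe (the names and populations are fake):

python:
import pandas as pd

# Made-up data: two states with 3 electors, one with 5
toy = pd.DataFrame({'Electors':   [3, 3, 5],
                    'State':      ['A', 'B', 'C'],
                    'Population': [700_000, 580_000, 2_000_000]})

# One aggregation function per column, keyed by column name
print(toy.groupby(by='Electors').agg({'State': 'count', 'Population': 'sum'}))

This prints a two-row dataframe: the 3-elector group has a State count of 2 and a Population sum of 1,280,000; the 5-elector group has a count of 1 and a sum of 2,000,000.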
After a couple of print()s on Lines 22–23 to give us some whitespace, the rest of the code creates a Markdown table that looks like this:
| Electors | States | Pop Pct | EC Pct |
|:----------:|:--------------------------:|----------:|---------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
which presents just the data I want to show and formats it nicely for pasting into the blog post.
Line 26 creates a new aggregation dictionary, tableAgg. It’s similar to basicAgg except for the function applied to the Abbrev column. Because there’s no built-in function (as far as I know) for turning a column into a comma-separated string of its contents, I made one. Because this function is very short and I’m using it only once, I added it to tableAgg as a lambda function:
python:
lambda s: ', '.join(s)
This illustrates the overall structure of an aggregation function. It takes a Pandas series (the column) as input and produces a scalar value as output. Because a series acts like a list, it can be passed directly to the join function, producing the comma-separated string I wanted in the table.
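If lambdas aren’t your thing, the same aggregator can be written as a named function; this sketch (the name abbrev_list is mine) is just another spelling of what’s in Line 26:

python:
def abbrev_list(s):
    # An aggregation function: takes a series, returns a scalar (here, a string)
    return ', '.join(s)

tableAgg = {'Abbrev': abbrev_list, 'PopPct': 'sum', 'ECPct': 'sum'}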
The last few lines print out the table in Markdown format, using the to_markdown function to generate the header line, the format line, and the pipe characters between the columns. While I didn’t know about to_markdown until ondaiwai told me about it, it does its work via the tabulate module, which I have used (badly) in the past. The keyword arguments given to to_markdown in Lines 30–32 are passed on to tabulate to control the formatting:
- floatfmt='.2%' sets the formatting of all the floating point numbers, which in this case are the PopPct and ECPct columns. Note that in the quick-and-dirty table, these columns are printed as decimal values with more digits than necessary.
- headers=['Electors', 'States', 'Pop Pct', 'EC Pct'] sets the header names to something nicer than the default, which would be the names in df.
- colalign=['center', 'center', 'right', 'right'] sets the alignment of the four columns. Without this, the Electors column would be right-aligned (the default for numbers) and the States column would be left-aligned (the default for strings).
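To see that pass-through in isolation, here’s a minimal sketch on a made-up two-row dataframe (the toy values are lifted from the table above):

python:
import pandas as pd

# Two rows' worth of data from the table above, with Electors as the index
toy = pd.DataFrame({'States': ['NE, NM', 'CT, OK'],
                    'PopPct': [0.012220, 0.022904]},
                   index=pd.Index([5, 7], name='Electors'))

# Keyword arguments that to_markdown doesn't use itself are handed to tabulate
print(toy.to_markdown(floatfmt='.2%',
                      headers=['Electors', 'States', 'Pop Pct'],
                      colalign=['center', 'center', 'right']))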
Why do I think this code is better than what I showed in the last post? Mainly because this code operates on dataframes and series all at once instead of looping through items as my previous code did. This is how Pandas is meant to be used. Also, the functions being applied to the columns are more obvious because of how the basicAgg and tableAgg dictionaries are built. In my previous code, these functions are inside braces in f-strings, where they’re harder to see. Similarly, the formatting of the Markdown table is easier to see when it’s given as arguments to a function instead of buried in print commands.
My thanks again to ondaiwai for introducing me to to_markdown and getting me to think in a more Pandas-like way. He rewrote and extended the code snippets he posted on Mastodon and made a gist of it. After seeing his code, I felt pretty good about my script; it’s different from his, but it shows that I understood what he was saying in the tweets.
Update 7 Sep 2024 12:33 PM
Kieran Healy followed up with a post on doing this summary analysis in R, but there’s more to his post than a simple language translation. He does a better job than I did at explaining why we should use higher-level abstractions in our analysis code. While I just kind of waved my hands and said “this is how Pandas is meant to be used,” Kieran explains that this is a general principle, not just a feature of Pandas. He also does a good job of relating the Pandas agg function to the split-apply-combine strategy, which has been formalized in Pandas, R, and Julia (and, I assume, other data analysis systems).
Using abstractions like this in programming is analogous to how we advance in mathematics. If you’re doing a problem that requires matrix multiplication, you don’t write out every step of row-column multiplication and addition, you just say C = AB. This is how you get to think about the purpose of the operation, not the nitty-gritty of how it’s done.
-
While you can, in theory, use the functions themselves as the values (e.g., sum instead of 'sum'), I found that doesn’t necessarily work for all aggregation functions. The count function, for example, seems to work only if it’s given as a string. I’m sure there’s a good reason for this, but I haven’t figured it out yet. ↩
Pandas and the Electoral College
August 29, 2024 at 12:15 PM by Dr. Drang
A couple of weeks ago, I used the Pandas groupby function in some data analysis for work, so when I started writing my previous post on the Electoral College, groupby came immediately to mind when I realized I wanted to add this table to the post:
| Electors | States | Pop Pct | EC Pct |
|---|---|---|---|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
I got the population and elector information from the Census Bureau and the National Archives, respectively, and put them in a CSV file named states.csv (which you can download). The header and first ten rows are
| State | Abbrev | Population | Electors |
|---|---|---|---|
| Alabama | AL | 5108468 | 9 |
| Alaska | AK | 733406 | 3 |
| Arizona | AZ | 7431344 | 11 |
| Arkansas | AR | 3067732 | 6 |
| California | CA | 38965193 | 54 |
| Colorado | CO | 5877610 | 10 |
| Connecticut | CT | 3617176 | 7 |
| Delaware | DE | 1031890 | 3 |
| District of Columbia | DC | 678972 | 3 |
| Florida | FL | 22610726 | 30 |
(Because the District of Columbia has a spot in the Electoral College, I’m lumping it in with the 50 states and referring to all of them as “states” in the remainder of this post. That makes it easier for both you and me.)
Here’s the initial code I used to summarize the data:
python:
1: #!/usr/bin/env python
2:
3: import pandas as pd
4:
5: # Import the raw data
6: df = pd.read_csv('states.csv')
7:
8: # Generate new fields for percentages
9: popSum = df.Population.sum()
10: ecSum = df.Electors.sum()
11: df['PopPct'] = df.Population/popSum
12: df['ECPct'] = df.Electors/ecSum
13:
14: # Collect states with same numbers of electors
15: dfg = df.groupby(by='Electors')
16:
17: # Print out the summary table
18: print('State EC States Population Pop Pct Electors EC Pct')
19: for k, v in dfg:
20: print(f' {k:2d} {v.State.count():2d} {v.Population.sum():11,d}\
21: {v.PopPct.sum():6.2%} {v.Electors.sum():3d} {v.ECPct.sum():6.2%}')
The variable df contains the dataframe of all the states. It’s read in from the CSV in Line 6, and then extended in Lines 9–12. After Line 12, the first five rows of df are
| State | Abbrev | Population | Electors | PopPct | ECPct |
|---|---|---|---|---|---|
| Alabama | AL | 5108468 | 9 | 0.015253 | 0.016729 |
| Alaska | AK | 733406 | 3 | 0.002190 | 0.005576 |
| Arizona | AZ | 7431344 | 11 | 0.022189 | 0.020446 |
| Arkansas | AR | 3067732 | 6 | 0.009160 | 0.011152 |
| California | CA | 38965193 | 54 | 0.116344 | 0.100372 |
The summarizing is done first by using the groupby function in Line 15, which groups the states according to the number of electors they have. The resulting variable, dfg, is of type
pandas.core.groupby.generic.DataFrameGroupBy
which works like a dictionary, where the keys are the numbers of electors per state and the values are dataframes of the subset of states with the number of electors given by that key. Lines 19–21 loop through the dictionary and print out summary information for each subset of states. The output is this:
State EC States Population Pop Pct Electors EC Pct
3 7 5,379,033 1.61% 21 3.90%
4 7 10,196,485 3.04% 28 5.20%
5 2 4,092,750 1.22% 10 1.86%
6 6 18,766,882 5.60% 36 6.69%
7 2 7,671,000 2.29% 14 2.60%
8 3 13,333,261 3.98% 24 4.46%
9 2 10,482,023 3.13% 18 3.35%
10 5 29,902,889 8.93% 50 9.29%
11 4 28,421,431 8.49% 44 8.18%
12 1 7,812,880 2.33% 12 2.23%
13 1 8,715,698 2.60% 13 2.42%
14 1 9,290,841 2.77% 14 2.60%
15 1 10,037,261 3.00% 15 2.79%
16 2 21,864,718 6.53% 32 5.95%
17 1 11,785,935 3.52% 17 3.16%
19 2 25,511,372 7.62% 38 7.06%
28 1 19,571,216 5.84% 28 5.20%
30 1 22,610,726 6.75% 30 5.58%
40 1 30,503,301 9.11% 40 7.43%
54 1 38,965,193 11.63% 54 10.04%
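If the dictionary analogy isn’t clear, here’s a tiny sketch of what the loop sees, using a three-row stand-in for the full dataframe (AK and WY have 3 electors, NE has 5):

python:
import pandas as pd

toy = pd.DataFrame({'Abbrev':   ['AK', 'WY', 'NE'],
                    'Electors': [3, 3, 5]})

# Iterating yields a key (the Electors value) and the matching sub-dataframe
for k, v in toy.groupby(by='Electors'):
    print(k, list(v.Abbrev))
# prints: 3 ['AK', 'WY']
#         5 ['NE']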
That quick-and-dirty output is what I was looking at when I wrote the last few paragraphs of the post, but I didn’t want to present the data to you in this way—I wanted a nice table with only the necessary columns. I added a couple of print() statements to the code above to add a little whitespace and then this code to create a MultiMarkdown table.
python:
25: # Print out the Markdown summary table
26: print('| Electors | States | Pop Pct | EC Pct |')
27: print('|:--:|:--:|--:|--:|')
28: for k, v in dfg:
29: print(f'| {k:2d} | {", ".join(v.Abbrev)} | {v.PopPct.sum():6.2%} \
30: | {v.ECPct.sum():6.2%} |')
This is very much like the previous output code. Apart from adding the pipe characters and the formatting line, there’s the
python:
{", ".join(v.Abbrev)}
piece in the f-string of Line 29. The join part concatenates all the state abbreviations, putting a comma-space between them. I decided a list of states with the given number of electors would be better than just a count of how many such states there are.
The output from this additional code was
| Electors | States | Pop Pct | EC Pct |
|:--:|:--:|--:|--:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
which I then ran through my filter in BBEdit to produce the nicer looking table:
| Electors | States | Pop Pct | EC Pct |
|:--------:|:--------------------------:|--------:|-------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
This is the Markdown I added to the post.
Update 1 Sep 2024 7:11 AM
There’s a better version of the script here.