NASA fact sheets—missing and found
November 25, 2025 at 9:50 PM by Dr. Drang
While I was putting together my recent post on Maya calendar calculations, I tried to get some information on the Moon from a NASA website. Specifically, a page with the prosaic name Moon Fact Sheet at the URL
https://nssdc.gsfc.nasa.gov/planetary/factsheet/moonfact.html
The gsfc part of the domain refers to the Goddard Space Flight Center, and the nssdc part refers to the NASA Space Science Data Coordinated Archive (formerly the National Space Science Data Center, which explains the missing a). It’s one of several similar pages with nicely summarized information on planets and other bodies in the solar system.
Did I say it is one of several such pages? I meant it was one of several such pages. They’re all gone and apparently have been for a few months now. Why? Well, my first thought was some sort of DOGE-inspired budget cut that saved the US taxpayer the untold billions of dollars necessary to host maybe as much as a half-megabyte of static HTML.
Or maybe NASA itself was to blame. Maybe it decided to replace the dry but wonderfully useful tables of numbers with this page and similar ones, which have lots of nice images but very few numbers. Or maybe the various Fact Sheets have been moved somewhere neither Google nor Kagi can find them.
Whatever the answer, I was sad for the loss. But only momentarily. Because the Internet Archive still had the Moon Fact Sheet and all the other Planetary Fact Sheets. Hooray!

I’ve bookmarked the planetary sheet, as it has links to all the others. For what it’s worth, the last archived versions seem to date from August of this year.
As an early Christmas present to myself, and to the rest of the internet, I made a donation.
Web logs with Pandas
November 24, 2025 at 9:31 AM by Dr. Drang
I don’t care much about monitoring the traffic here, but analyzing web logs seemed like a good way to practice using Pandas. I used some old data analysis friends, like groupby, and made some new ones, like the vectorized str functions. I also reacquainted myself with rsync, which I used to run quite often in my Linux days.
Let’s start by laying out the problem. My web server is running Apache 2, and it generates access log files in the combined format. I wanted my script, called top-pages, to download and parse the necessary log files from the server and report on the top pages accessed either yesterday or today. Today, running top-pages with no options returns this:
                           Sunday, November 23, 2025
                                   Page path   Visits
 1                        2015/01/apple-leverage     651
 2   2025/11/some-maya-calendar-calculations     315
 3               2025/11/casting-about-again     222
 4              2025/11/charting-malpractice     185
 5                 2025/11/variance-of-a-sum     179
 6   2025/11/the-wait-for-new-siri-continues     178
 7                   2025/11/snow-and-memory     123
 8          2025/11/the-problem-with-dollars     114
 9                  2025/10/other-wavy-paths     109
10          2025/10/vectors-and-weathervanes     105
                                   All pages  10,050
These are the ten individual blog posts with the most hits yesterday and the total number of hits on all individual posts. As you can see, top-pages doesn’t report on hits to the home page, my RSS feeds, image files, or anything other than blog posts.
To see the options top-pages can handle, here’s the output from running top-pages -h:
usage: top-pages [-h] [-t | -y] [-n [N]]

Table of the top pages from ANIAT weblogs.

options:
  -h, --help  show this help message and exit
  -t          today's top pages
  -y          yesterday's top pages (default)
  -n [N]      number of pages in table (default: 10)
So I can change the length of the table with the -n option, and I can have it show today’s top pages (so far) with the -t option. The -t and -y options are mutually exclusive. If I try to run top-pages -t -y, it will return
usage: top-pages [-h] [-t | -y] [-n [N]]
top-pages: error: argument -y: not allowed with argument -t
Because top-pages runs on my MacBook Pro, it has to download the necessary log files from the server. This is not efficient, but right now I’m more interested in writing top-pages than in running it. If I suddenly get the urge to run it on a regular basis, I’ll move it to the server and set up a cron job there to run it every day and email me the results.
The access log files are in the /var/log/apache2 directory on the server. At present, that directory contains these files:
access.log
access.log.1
access.log.2.gz
access.log.3.gz
[more access logs]
error.log.1
error.log.2.gz
error.log.3.gz
[more error logs]
access.log is the current access log file. It’s being continually updated as the site gets visited. Every morning at 6:25 am (in the server’s time zone), the current access log is archived into access.log.1 and a new version of access.log is started. At the same time, what used to be access.log.1 gets gzip’d into access.log.2.gz, and the older gzip’d access logs get their numbers incremented.
The upshot of this is that the only access logs top-pages needs to determine the hits from yesterday and today are access.log, access.log.1, and access.log.2.gz. We’ll see this when we get into the source code.
Which I guess it’s time to do:
python:
  1:  #!/usr/bin/env python3
  2:
  3:  import pandas as pd
  4:  from datetime import datetime, timedelta
  5:  from zoneinfo import ZoneInfo
  6:  import argparse
  7:  import subprocess
  8:  import os.path
  9:
 10:  # Functions for later use.
 11:  def day_params(whichDay):
 12:      '''Return the title, start, and end parameters for whichDay.
 13:
 14:      The whichDay argument can be either "today" or "yesterday."
 15:      Any value other than "today" is treated as "yesterday."
 16:      '''
 17:
 18:      if whichDay == 'today':
 19:          theDay = datetime.now(tz=ZoneInfo('America/Chicago'))
 20:      else:
 21:          theDay = datetime.now(tz=ZoneInfo('America/Chicago')) - timedelta(days=1)
 22:      dTitle = f'{theDay:%A, %B %-e, %Y}'
 23:      dStart = theDay.replace(hour=0, minute=0, second=0, microsecond=0)
 24:      dEnd = theDay.replace(hour=23, minute=59, second=59, microsecond=999_999)
 25:      return dTitle, dStart, dEnd
 26:
 27:  def date_slug(request, width):
 28:      '''Return a series of yyyy/mm/slug strings from the request series.
 29:
 30:      The strings are width characters long, truncated or left-padded,
 31:      as necessary.
 32:      '''
 33:
 34:      # The page URL is between two space characters.
 35:      dslug = request.str.replace(r'^[^ ]+ ([^ ]+) .+', r'\1', regex=True)
 36:      # Strip everything but the yyyy/mm/slug.
 37:      dslug = dslug.str.replace(r'(https?://(www\.)?leancrew\.com)?/all-this/',
 38:                                r'', regex=True)
 39:      dslug = dslug.str.replace(r'(^\d\d\d\d/\d\d/[^/]+).*$', r'\1', regex=True)
 40:      # Shorten if necessary.
 41:      long = dslug[dslug.str.len() > width]
 42:      dslug[long.index] = dslug[long.index].str.slice_replace(start=width-3,
 43:                                                              repl='...')
 44:      return dslug.str.pad(width)
 45:
 46:  def read_apache_access(f):
 47:      '''Return a dataframe from an Apache access log file.
 48:
 49:      The log is in combined format and the returned dataframe has columns for
 50:      Request, Status, and Datetime. The Datetime is determined via the Timestamp
 51:      and Zone columns, which are not part of the returned dataframe.
 52:      '''
 53:
 54:      # CSV reading code adapted from https://stackoverflow.com/questions/58584444
 55:      headers = 'Timestamp Zone Request Status'.split()
 56:      df = pd.read_csv(f, sep=' ', escapechar='\\', quotechar='"',
 57:                       usecols=[3, 4, 5, 6], names=headers)
 58:
 59:      # Turn the Timestamp and Zone into a single Datetime column.
 60:      df['Datetime'] = pd.to_datetime(df.Timestamp.str.slice(start=1) +
 61:                                      df.Zone.str.slice(stop=-1),
 62:                                      format='%d/%b/%Y:%H:%M:%S%z',
 63:                                      utc=True)
 64:
 65:      # Don't include the Timestamp and Zone columns in the returned dataframe.
 66:      return df.drop(labels=['Timestamp', 'Zone'], axis='columns')
 67:
 68:
 69:  ########## Main program ##########
 70:
 71:  # Handle the arguments.
 72:  desc = 'Table of the top pages from ANIAT weblogs.'
 73:  parser = argparse.ArgumentParser(description=desc)
 74:  group = parser.add_mutually_exclusive_group()
 75:  group.add_argument('-t',
 76:                     help="today's top pages",
 77:                     default=False,
 78:                     action='store_true')
 79:  group.add_argument('-y',
 80:                     help="yesterday's top pages (default)",
 81:                     default=True,
 82:                     action='store_true')
 83:  parser.add_argument('-n',
 84:                      help="number of pages in table (default: 10)",
 85:                      nargs='?',
 86:                      default=10,
 87:                      const=10,
 88:                      type=int)
 89:  args = parser.parse_args()
 90:
 91:  # Set the printing parameters.
 92:  iwidth = len(str(args.n))                  # index
 93:  pwidth = 40                                # page path
 94:  vwidth = 7                                 # visits
 95:  fwidth = iwidth + 2 + pwidth + 1 + vwidth  # full table
 96:  twidth = iwidth + 2 + pwidth               # "All pages" position
 97:
 98:  # Set the day's title, start, and end parameters.
 99:  if args.t:
100:      dTitle, dStart, dEnd = day_params('today')
101:  else:
102:      dTitle, dStart, dEnd = day_params('yesterday')
103:
104:  # Set the local log file parameters.
105:  logdir = os.path.join(os.environ['HOME'],
106:      'Library/Mobile Documents/com~apple~CloudDocs/personal/weblogs/')
107:  af = [os.path.join(logdir, 'access.log.2'),
108:        os.path.join(logdir, 'access.log.1'),
109:        os.path.join(logdir, 'access.log')]
110:
111:  # Update the log files.
112:  rsync = f"rsync -az -e 'ssh -p 1234'\
113:   --include=access.log.2.gz\
114:   --include=access.log.1\
115:   --include=access.log\
116:   --exclude=*\
117:   user@server.com:/var/log/apache2/ '{logdir}'"
118:  subprocess.run(rsync, shell=True)
119:
120:  # Gunzip the oldest log file if necessary. Keep the gzipped file.
121:  gzfile = af[0] + '.gz'
122:  if (not os.path.exists(af[0])) or\
123:     (os.path.getmtime(af[0]) < os.path.getmtime(gzfile)):
124:      subprocess.run(f"gunzip -fk '{gzfile}'", shell=True)
125:
126:  # Read the log files into a dataframe.
127:  df = pd.concat((read_apache_access(f) for f in af), ignore_index=True)
128:
129:  # Limit to one day's successes using my timezone.
130:  df = df[(df.Datetime >= dStart) &
131:          (df.Datetime <= dEnd) &
132:          (df.Request.str.slice(stop=4) == 'GET ') &
133:          (df.Status == 200)]
134:
135:  # Limit to just the requests for individual pages.
136:  df = df[df.Request.str.contains(r'all-this/\d\d\d\d/\d\d', regex=True)]
137:
138:  # Get the yyyy/mm/slug from Requests and add it as a new column.
139:  dslug = date_slug(df.Request, pwidth)
140:  df.insert(0, 'Page path', dslug)
141:
142:  # Group by Page path, count the entries, and sort in descending order.
143:  # All of the columns have the same count, so I chose to include just the
144:  # Status column and renamed it Visits for presentation.
145:  # Limit it to just the first args.n rows.
146:  top = df.groupby('Page path')\
147:          .count()\
148:          .sort_values('Status', ascending=False)\
149:          ['Status'].reset_index(name='Visits')[:args.n]
150:
151:  # Make the index column one-based, not zero-based, and left-padded.
152:  top.index = [f'{i+1:{iwidth}}' for i in range(args.n)]
153:
154:  # Print out the date, the top pages and the count of all pages.
155:  print(f'{dTitle:>{fwidth}s}')
156:  print(top.to_string(formatters=[lambda p: f'{p:s}',
157:                                  lambda v: f'{v:{vwidth},d}']))
158:  print(f'{'All pages':>{twidth}} {len(df):{vwidth},d}')
This is distinctly longer than most of the scripts I post here, so the explanation will be longer, too. Apologies in advance. I’m going to start the description with the main body of the program—we’ll deal with the functions as they get called.
The first section, Lines 72–89, uses the argparse library to define and process the command-line arguments. The -t and -y options are defined as members of a mutually exclusive group so they give us the behavior shown above. The -n option is separate, with a default value of 10. No one who watched as much David Letterman as I did would set the default to anything other than a Top Ten list.
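If you haven’t used mutually exclusive groups before, here’s a stripped-down, self-contained sketch of the same arrangement. Only the option names come from top-pages; everything else is pared down to the essentials:

```python
import argparse

# Minimal sketch of the top-pages argument setup. The -t/-y pair lives
# in a mutually exclusive group; -n takes an optional integer value.
parser = argparse.ArgumentParser(description='Table of the top pages.')
group = parser.add_mutually_exclusive_group()
group.add_argument('-t', action='store_true', help="today's top pages")
group.add_argument('-y', action='store_true', help="yesterday's top pages")
parser.add_argument('-n', nargs='?', type=int, default=10, const=10,
                    help='number of pages in table (default: 10)')

args = parser.parse_args(['-t', '-n', '5'])
print(args.t, args.n)   # True 5
```

Passing both -t and -y makes parse_args print a usage message and exit with an error, which is where the “not allowed with argument” message shown above comes from.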
The next few sections set global parameters that the balance of the program uses.
The printing parameters (Lines 92–96) are the widths and positions of the various columns in the table. For reasons I don’t quite understand, Pandas likes to put two spaces after the index column but just one space between the others. That’s why you see the 2 and the 1 in the fwidth and twidth definitions.
The date and time parameters (Lines 99–102) control the title of the table and the datetime limits used to filter the entries that get counted. This section calls the day_params function that’s defined in Lines 11–25. The key feature of day_params is that it bases its definitions of “today” and “yesterday” on my home timezone here in the Chicago area. This makes the dStart and dEnd values timezone-aware. As we’ll see later, we can compare them with the datetime values from the access logs, which are set to UTC, and get correct results.
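Here’s a small demonstration of why the timezone-aware comparison works. The zone is the one the script uses; the timestamps are made up for the example:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Build a Chicago-local day, the way day_params does.
chicago = ZoneInfo('America/Chicago')
day = datetime(2025, 11, 23, tzinfo=chicago)
dStart = day.replace(hour=0, minute=0, second=0, microsecond=0)
dEnd = day.replace(hour=23, minute=59, second=59, microsecond=999_999)

# A log entry recorded at 12:56:36 UTC is 6:56:36 am CST, so it falls
# inside the day's bounds even though the zones differ. Python compares
# aware datetimes by their instant in time, not their wall-clock digits.
hit = datetime(2025, 11, 23, 12, 56, 36, tzinfo=timezone.utc)
print(dStart <= hit <= dEnd)   # True
```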
The local log file parameters (Lines 105–109) set the directory and full file paths to the access log files after they’ve been downloaded to my MacBook Pro. As you probably know, your iCloud Drive folder is this directory:
~/Library/Mobile Documents/com~apple~CloudDocs
I keep the access log files in a personal/weblogs subdirectory in iCloud Drive.
With the parameters defined, it’s time to start analyzin’. The first thing we need to do is make sure the local copies of the log files are up to date. That’s done through the longish rsync command that’s defined on Lines 112–117 and executed on Line 118. The port, user, and server you see in these lines are, of course, fictional, but the rest of the command is exactly as I use it. Basically, it downloads the access.log, access.log.1, and access.log.2.gz files if they are older on the server than they are on my Mac.
The next section, Lines 121–124, uncompresses the access.log.2.gz if necessary. The decision is made based on the presence of an (uncompressed) access.log.2 file and its modification date. The gunzip command is run if either
- there is no uncompressed file, or
- the uncompressed file is older than the compressed file.
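If you’d rather not shell out to gunzip, the same keep-the-archive logic can be written with Python’s gzip and shutil modules. This is an alternative sketch, not what top-pages actually does:

```python
import gzip
import os.path
import shutil

def ensure_gunzipped(gzfile):
    '''Decompress gzfile to its .gz-less name if the uncompressed copy
    is missing or older. Keeps the .gz file, like gunzip -fk.'''
    plain = gzfile[:-3]   # strip the '.gz' suffix
    if (not os.path.exists(plain)) or \
       (os.path.getmtime(plain) < os.path.getmtime(gzfile)):
        with gzip.open(gzfile, 'rb') as fin, open(plain, 'wb') as fout:
            shutil.copyfileobj(fin, fout)
    return plain
```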
The sharp-eyed among you will note that I could save some downloading here by checking whether top-pages is being asked to return a table for yesterday or today. I decided to ignore this optimization because I wanted to focus on Pandas. And if this script ever gets moved to the server, none of this rsync stuff will be necessary.
With these preliminaries over, it’s time to move into the Pandas stuff. The first step is to read the log files into a dataframe. This is done on Line 127, which uses the read_apache_access function defined on Lines 46–66. To help understand read_apache_access, we should take a look at an example log file entry:
34.96.45.20 - - [23/Nov/2025:06:56:36 -0600] "GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1" 200 8111 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
Sorry about the very long line, but that’s how they look in the access log file.
The fields are separated by space characters. If a space character is part of a field, the field is surrounded by double quotation marks. Except, unfortunately, for the date/time field. For some reason, that field is surrounded by square brackets instead of double quotes, which makes the parsing in read_apache_access a little tricky.
The solution is in the read_csv call in Lines 56–57. By setting the sep, escapechar, and quotechar parameters as shown, we get all the fields the way we want except the date/time, which gets parsed as two fields, one with the timestamp (and a leading bracket) and the other with the timezone offset from UTC (and a trailing bracket). For the example above, the Timestamp field is
[23/Nov/2025:06:56:36
and the Zone field is
-0600]
Lines 60–63 then combine these two fields—after stripping the brackets—with the to_datetime function and add a new Datetime field to the dataframe. The utc=True parameter in the call to to_datetime does exactly what you think: it makes the resulting datetime value timezone-aware and set to UTC. read_apache_access then returns the dataframe with three fields: Request, Status, and Datetime. In the example, Request is
GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1
Status is 200, and Datetime (in UTC) is 12:56:36 on 2025-11-23.
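You can watch this parsing happen without any log files by feeding a single line to the same read_csv call through a StringIO. The line below is the example entry above with the user-agent shortened:

```python
import io

import pandas as pd

# One combined-format line (user-agent abbreviated for clarity).
line = ('34.96.45.20 - - [23/Nov/2025:06:56:36 -0600] '
        '"GET /all-this/2025/11/some-maya-calendar-calculations/ HTTP/1.1" '
        '200 8111 "-" "Mozilla/5.0"')
headers = 'Timestamp Zone Request Status'.split()
df = pd.read_csv(io.StringIO(line), sep=' ', escapechar='\\',
                 quotechar='"', usecols=[3, 4, 5, 6], names=headers)

print(df.Timestamp[0])   # [23/Nov/2025:06:56:36
print(df.Zone[0])        # -0600]

# Strip the brackets and combine, as read_apache_access does.
dt = pd.to_datetime(df.Timestamp.str.slice(start=1) +
                    df.Zone.str.slice(stop=-1),
                    format='%d/%b/%Y:%H:%M:%S%z', utc=True)
print(dt[0])             # 2025-11-23 12:56:36+00:00
```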
Line 127 uses a generator expression to loop through the three access log files, creating a dataframe for each and putting them together with concat.
The next section, Lines 130–133, filters the dataframe to include only successful GETs within the specified day. Line 136 further filters the dataframe to just individual posts. It does this by searching for the all-this/yyyy/mm pattern that all the posts at ANIAT have.
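The boolean-mask filtering looks like this on a toy dataframe with made-up requests (the date bounds are omitted to keep the example short):

```python
import pandas as pd

# Toy data: one good page hit, one feed hit, one failed POST.
df = pd.DataFrame({
    'Request': ['GET /all-this/2025/11/foo/ HTTP/1.1',
                'GET /all-this/feed/ HTTP/1.1',
                'POST /all-this/2025/11/bar/ HTTP/1.1'],
    'Status': [200, 200, 404],
})

# Successful GETs only, then just the individual-post URLs.
ok = df[(df.Request.str.slice(stop=4) == 'GET ') & (df.Status == 200)]
posts = ok[ok.Request.str.contains(r'all-this/\d\d\d\d/\d\d', regex=True)]
print(len(posts))   # 1
```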
The next section, Lines 139–140, adds a new field, Page path, by normalizing the page URLs in the Request field. It does this through a call to the date_slug function, which is defined on Lines 27–44. date_slug goes through a series of regex replacements to strip out all the inessential parts, leaving just the yyyy/mm/slug part of the URL. It also pads or shortens the string to width characters.
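The regex chain in date_slug is easier to follow one replacement at a time. Here it is applied to a single made-up request:

```python
import pandas as pd

req = pd.Series(['GET /all-this/2025/11/casting-about-again/ HTTP/1.1'])

# 1. Keep the URL, which sits between the two space characters.
url = req.str.replace(r'^[^ ]+ ([^ ]+) .+', r'\1', regex=True)
# 2. Strip the site prefix, leaving yyyy/mm/slug plus any trailing junk.
slug = url.str.replace(r'(https?://(www\.)?leancrew\.com)?/all-this/',
                       '', regex=True)
# 3. Drop everything after the slug.
slug = slug.str.replace(r'(^\d\d\d\d/\d\d/[^/]+).*$', r'\1', regex=True)

print(slug[0])                    # 2025/11/casting-about-again
print(len(slug.str.pad(40)[0]))   # 40  (pad() left-pads by default)
```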
We’re getting near the end now. The next section, Lines 146–149, is a groupby call that counts the entries, sorts them by count, and puts the top ten (or whatever you gave to the -n option) in a new dataframe called top. It also calls reset_index to renumber the rows of top. Without that call, the indices of top would be the row numbers of the original df dataframe.
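Here’s the same groupby/count/sort/reset_index chain run on a tiny toy dataframe, so you can see the effect of each step:

```python
import pandas as pd

df = pd.DataFrame({'Page path': ['a', 'b', 'a', 'c', 'a', 'b'],
                   'Status': [200] * 6})

# count() tallies rows per group; ['Status'] picks one of the (identical)
# count columns; reset_index turns the group labels back into a column.
top = (df.groupby('Page path')
         .count()
         .sort_values('Status', ascending=False)
         ['Status'].reset_index(name='Visits')[:2])

print(top.to_dict('records'))
# [{'Page path': 'a', 'Visits': 3}, {'Page path': 'b', 'Visits': 2}]
```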
Because I wasn’t a computer science major, I don’t think a list of the top ten items should be numbered from zero to nine. Line 152 adds 1 to all the index numbers and right-aligns them.
Finally, Lines 155–158 print out the table that we saw back near the top of the post. The columns are formatted using f-string rules.
Although there was a lot of non-Pandas stuff in this script, I think it was a good exercise. The filtering and groupby parts matched up with a lot of the Pandas work I’ve done in the past, but the vectorized regex calls certainly didn’t. Most of my previous analysis has been with dataframes filled with either numbers or boolean values, so this was a nice stretch for me. It wouldn’t be hard to rejigger top-pages to count hits on my RSS feed or list the top referring sites. If I ever decide to care about such things.
Some Maya calendar calculations
November 21, 2025 at 10:15 PM by Dr. Drang
Longtime readers of this blog know I’m a sucker for calendrical calculations. So I couldn’t stop myself from doing some after reading this article by Jennifer Ouellette at Ars Technica.
The article covers a recent paper on the Dresden Codex, one of the few surviving documents from the Maya civilization. The paragraph that caught my eye was this one:
[The paper’s authors] concluded that the codex’s eclipse tables evolved from a more general table of successive lunar months. The length of a 405-month lunar cycle (11,960 days) aligned much better with a 260-day calendar (46 × 260 = 11,960) than with solar or lunar eclipse cycles. This suggests that the Maya daykeepers figured out that 405 new moons almost always came out equivalent to 46 260-day periods, knowledge the Maya used to accurately predict the dates of full and new moons over 405 successive lunar dates.
Many calendrical calculations involve cycles that are close integer approximations of orbital phenomena whose periods are definitely not integers. The Metonic cycle is a good example: there are almost exactly 235 synodic months in 19 years, so the dates of new moons (and all the other phases) this year match the dates of new moons back in 2006. “Match” has to be given some slack here; they match within 12 hours or so.
Anyway, the Maya used a few calendars, one of them being the 260-day divinatory calendar mentioned in the quote above. The integral approximation in this case is that 46 of these 260-day cycles match pretty well with 405 synodic months.
How well? The average synodic month is 29.53059 days long (see the Wikipedia article), so 405 of them add up to 11,959.9 days, which is a pretty good match. It’s off by just a couple of hours over a cycle of nearly 33 years.
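The arithmetic for both cycles mentioned above fits in a few lines. The synodic month value is the Wikipedia mean; the residuals come out in days:

```python
SYNODIC = 29.53059   # mean synodic month, days

# Metonic cycle: 235 lunations vs. 19 tropical years.
metonic = 235 * SYNODIC - 19 * 365.2422   # about 0.09 days (~2 hours)
# Maya cycle: 405 lunations vs. 46 rounds of the 260-day calendar.
maya = 46 * 260 - 405 * SYNODIC           # about 0.11 days (~2.7 hours)

print(f'{405 * SYNODIC:,.1f}')    # 11,959.9
print(f'{maya * 24:.1f} hours')   # 2.7 hours
```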
I wanted to see how well this works for specific cycles, not just on average. So I used Mathematica’s MoonPhaseDate function to calculate the dates of new moons over about five centuries, from 1502 through 2025. This gave me 6,480 (405 × 16) synodic months, and I could work out the lengths of all the 405-month cycles within that period. I’ll show all the Mathematica code at the end of the post.
First, the shortest synodic month in this period was 29.2719 days and the longest was 29.8326 days, a range of about 13½ hours. The mean was 29.5306 days, which matched the value from Wikipedia and its sources.
I then calculated the lengths of all the 405-month periods in this range: from the first new moon in the list to the 406th, from the second to the 407th, and so on. The results were a minimum of 11,959.5 days, a maximum of 11,960.3 days, and a mean of 11,959.9 days (consistent with the value calculated above).
These are good results but not perfect, and the Maya knew that. They made adjustments to their calendar tables, based on observations, and thereby maintained the accuracy they wanted.
Now that I’ve done this, I feel like going through the paper to look for more fun calculations.
The calculations summarized above were done in two Mathematica notebooks. The first used the MoonPhaseDate command repeatedly to build a list of all the new moons from 1502 through 2025 and save them to a file. Here’s how it works:
Because MoonPhaseDate returns the next new moon after the given date, the For loop builds the list by getting the day after the last new moon, calculating the next new moon after that, and appending it to the end of the list. It’s a command that takes little time to write but a lot of time to execute—over a minute on my MacBook Pro with an M4 Pro chip. That’s why the last command in the notebook saves the newmoons list to a file. The notebook that does all the manipulations of the list could run quickly by just reading that list in from the saved file.
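For readers without Mathematica, the loop structure can be sketched in Python. The next_new_moon function here is a toy stand-in for MoonPhaseDate that just steps by the mean synodic month from a known new moon (2000-01-06, about 18:14 UTC), so its dates won’t match a real ephemeris:

```python
from datetime import datetime, timedelta

MEAN_SYNODIC = 29.53059   # days

def next_new_moon(t, epoch=datetime(2000, 1, 6, 18, 14)):
    '''Toy stand-in for MoonPhaseDate: the first mean-model new moon
    strictly after t, measured from a known new-moon epoch.'''
    elapsed = (t - epoch).total_seconds() / 86400
    n = int(elapsed // MEAN_SYNODIC) + 1
    return epoch + timedelta(days=n * MEAN_SYNODIC)

# Build the list the way the notebook does: step a day past the last
# new moon and ask for the next one after that.
newmoons = [next_new_moon(datetime(2024, 1, 1))]
while newmoons[-1] < datetime(2025, 1, 1):
    newmoons.append(next_new_moon(newmoons[-1] + timedelta(days=1)))
print(len(newmoons))
```

With a real list in hand, the 405-month cycle lengths are then just the differences newmoons[i+405] − newmoons[i] over the whole list.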
I don’t know any of the details of how MoonPhaseDate works or how accurate it is, but I assume it’s less accurate for dates further away from today. That’s why the period over which I had it calculate the new moons was well after the peak of the Maya civilization.
The second notebook starts by reading in the newmoons.wl file—a plain text file with a single Mathematica command that’s nearly 2 MB in size—and then goes on to calculate the lengths of each synodic month and each 405-month cycle. The statistics come from the Min, Max, and Mean functions.
Charting malpractice
November 20, 2025 at 5:43 PM by Dr. Drang
I expect Fox News to make bad charts as part of its predisposition to promote right-wing views. But today I saw an incredibly bad chart that, as far as I can tell, doesn’t promote its usual awful agenda. It’s terrible for politically neutral reasons.
I saw it first in today’s post from Paul Krugman. He got it from this article. With that attribution out of the way, here it is:

As you can see, it’s made with an interactive tool called Flourish, but this is not Flourish’s fault. It’s the fault of the writer of the piece, Dana Blanton, or some dataviz minion at Fox. It’s a poor craftsman who blames his tools, and this chart was made by a very poor craftsman.
The chart uses colors in the purple range, running from hex #522B67 (imperial) to hex #E9D7F8 (pale lavender). It’s often a good idea to use colors in the same family but with varying levels of darkness or intensity to represent increasing (or decreasing) levels, but that’s not what Fox did. The categories in each bar cover different levels of price change, but the ordering of the colors doesn’t match the ordering of the price changes:
| Price change | Color |
|---|---|
| Increased a lot | |
| Increased a little | |
| Decreased a little | |
| Decreased a lot | |
| The same | |
“The same” should be in the middle, not at the end. It has no business being slightly less intense than “Decreased a lot.” And “Decreased a little” shouldn’t be more intense than “Decreased a lot.” You shouldn’t have to keep looking back at the legend to see what’s what.
I suspect the order of the price changes used in the chart is the same order used in the poll question. Whoever made the chart just fed that same order to Flourish, and Flourish dutifully responded with a chart whose categories are in the order given to it. That’s fine for Flourish; software is supposed to respond robotically. People are not.
With increases and decreases, this is a perfect occasion for a diverging color scheme. I don’t know what color schemes are native to Flourish, but ColorBrewer has a nice five-category diverging scheme that has purples on one side of neutral and greens on the other. It’s also color-blind safe:
| Price change | Color |
|---|---|
| Increased a lot | |
| Increased a little | |
| The same | |
| Decreased a little | |
| Decreased a lot | |
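For reference, the 5-class PRGn palette maps onto the categories like this. The hex values are the ones I believe colorbrewer2.org lists for 5-class PRGn, but verify them there before relying on them:

```python
# ColorBrewer's 5-class PRGn diverging palette (purple to green),
# ordered to match the table above. Hex values believed to be from
# colorbrewer2.org's PRGn scheme; double-check before using.
prgn5 = {
    'Increased a lot':    '#7b3294',   # dark purple
    'Increased a little': '#c2a5cf',   # light purple
    'The same':           '#f7f7f7',   # near-white neutral
    'Decreased a little': '#a6dba0',   # light green
    'Decreased a lot':    '#008837',   # dark green
}
```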
The off-white in the center of this scheme would be troublesome in the chart as it is because Fox has little wisps of white separating the colors in each bar. The wisps should be removed, though, because the point of a percentage bar chart like this is for the sum of the lengths of the parts to equal the length of the whole. The wisps mess that up. I would’ve mentioned this problem earlier, but it pales in comparison (ha!) to the outright malpractice of the ordering.
Update 21 Nov 2025 7:55 AM
Nathan Grigg pointed out an oddity I failed to mention: the five bars are of different lengths, even though you’d think they’re all supposed to represent 100%. This could be due to entering rounded numbers into Flourish, or it could be due to a “don’t know” category in the poll that wasn’t included in the chart. I confess I don’t have the stomach to dig through the story to see if I can find the reason.