MathML with Pandoc
May 3, 2025 at 11:25 AM by Dr. Drang
Since switching from MathJax to MathML to render equations here on ANIAT, I’ve tried several approaches to generate the MathML. There are many utilities and libraries that claim to do the conversion, but I’ve found all of them to be limited in one way or another. For a while, I was even writing MathML directly, albeit with the help of some Typinator abbreviations, because I couldn’t trust the converters to generate the correct characters or even understand some LaTeX commands I use regularly. Recently, I began using what I should have started out with: Pandoc.
It’s not that I wasn’t aware of Pandoc. It’s famous in the Markdown/HTML/LaTeX world, and I probably first heard of it shortly after its release. But I’ve always thought of it as a document converter, not an equation converter. I was wrong. It’s very easy to use with a single equation.
pandoc --mathml <<<'$$T = \frac{1}{2} v_x^2$$'
produces
<p>
<math display="block" xmlns="http://www.w3.org/1998/Math/MathML">
<semantics>
<mrow>
<mi>T</mi>
<mo>=</mo>
<mfrac><mn>1</mn><mn>2</mn></mfrac>
<msubsup><mi>v</mi><mi>x</mi><mn>2</mn></msubsup>
</mrow>
<annotation encoding="application/x-tex">
T = \frac{1}{2} v_x^2
</annotation>
</semantics>
</math>
</p>
where I’ve added linebreaks and indentation to the output to make it easier to read. Because it’s delimited by double dollar signs, the equation is rendered in block mode, like this:
$$T = \frac{1}{2} v_x^2$$
Single dollar signs would generate MathML with a display="inline" attribute.
(If you look at the source code for this page, you’ll see that I usually delete some of the code Pandoc generates—we’ll get to that later.)
All the converters handled simple equations like the one above well, but more complicated stuff can be troublesome. One of the problems other converters have is dealing with multiline equations, something Pandoc handles with ease. For example, this piecewise function definition,
$$ f(x) = \left\{ \begin{array} {rcl}
-1 & : & x<0 \\
0 & : & x=0 \\
+1 & : & x>0
\end{array} \right. $$
is rendered exactly as expected:
Well, perhaps not exactly as expected. If you’re reading this in Chrome (and, presumably, other Chrome-based browsers), all the cells in the array are aligned left, which puts the zero in the wrong spot, not vertically aligned with the ones. But that’s Chrome’s fault, not Pandoc’s.
Since Pandoc understands the \begin{array} command, it can do matrices, too:
So far, I’ve found only one small bug in Pandoc’s conversion from LaTeX to MathML. Here’s a simple formula that includes both a summation and a limit:
$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$
This is what it should look like, a screenshot of the equation as rendered by LaTeX itself:
But here’s how it comes out after passing the equation to Pandoc:
The summation is fine, but the limit is formatted incorrectly. The $n \to \infty$ part should be under the $\lim$, not off to the side like a subscript. That subscript-like formatting is what you’d use for an inline equation, not a block equation.
Let’s see what happened. Here’s the MathML produced by Pandoc:
1 <math display="block" xmlns="http://www.w3.org/1998/Math/MathML">
2 <semantics>
3 <mrow>
4 <msup><mi>e</mi><mi>x</mi></msup>
5 <mo>=</mo>
6 <munderover>
7 <mo>∑</mo>
8 <mrow><mi>n</mi><mo>=</mo><mn>0</mn></mrow>
9 <mo accent="false">∞</mo>
10 </munderover>
11 <mfrac>
12 <msup><mi>x</mi><mi>n</mi></msup>
13 <mrow><mi>n</mi><mi>!</mi></mrow>
14 </mfrac>
15 <mo>=</mo>
16 <munder>
17 <mi>lim</mi><mo></mo>
18 <mrow><mi>n</mi><mo>→</mo><mi>∞</mi></mrow>
19 </munder>
20 <msup>
21 <mrow>
22 <mo form="prefix" stretchy="true">(</mo>
23 <mn>1</mn><mo>+</mo>
24 <mfrac><mi>x</mi><mi>n</mi></mfrac>
25 <mo form="postfix" stretchy="true">)</mo>
26 </mrow>
27 <mi>n</mi>
28 </msup>
29 </mrow>
30 <annotation encoding="application/x-tex">
31 e^x
32 = \sum_{n=0}^\infty \frac{x^n}{n!}
33 = \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n
34 </annotation>
35 </semantics>
36 </math>
The problem with the rendering of the limit is in Line 17. There’s an empty <mo></mo> element after the <mi>lim</mi> element. That’s what’s messing up the formatting. If we remove that empty element, the limit gets formatted the way it should:
Obviously, I’m not going to try to fix Pandoc; I have no idea how to program in Haskell. I’ll send a note to John MacFarlane (can he really still be the sole developer?) about the rendering bug, but in the meantime I’ll just remember to delete the empty <mo></mo> whenever I need a limit.
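Until the bug is fixed upstream, that deletion could be scripted. Here is a minimal sketch using only Python's standard library. The function name strip_empty_mo is made up for illustration, and the pattern rests on two assumptions that are not from the post: that an empty <mo> is never wanted anywhere in the output, and that the element might hold only whitespace or the invisible function-application character (U+2061) rather than being truly empty.

```python
import re

# Matches an <mo> element that contains nothing visible: no content at all,
# whitespace, or the invisible function-application character (U+2061),
# written either literally or as a numeric character reference.
# Treating U+2061 as "empty" is an assumption, not something the post states.
EMPTY_MO = re.compile(r"<mo>(?:\s|\u2061|&#8289;|&#x2061;)*</mo>")


def strip_empty_mo(mathml: str) -> str:
    """Return the MathML string with visually empty <mo> elements removed."""
    return EMPTY_MO.sub("", mathml)


sample = (
    "<munder><mi>lim</mi><mo></mo>"
    "<mrow><mi>n</mi><mo>→</mo><mi>∞</mi></mrow></munder>"
)
print(strip_empty_mo(sample))
```

Operators with real content, such as <mo>=</mo>, are left alone because the pattern only matches when nothing visible sits between the tags.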
I think I’ve mentioned in the past that one of my favorite features of Markdown is that it allows you to mix HTML with regular Markdown text; it passes the HTML through unchanged. I’m using that here to add MathML equations to my blog posts. I write the equation in LaTeX, select it, and run a Keyboard Maestro macro that replaces the LaTeX with its MathML equivalent. Because I’m still messing around with the macro (and may change it to an automation that BBEdit runs directly) I won’t post it here, but I do want to include the Python script that runs Pandoc to do the conversion and then cleans up Pandoc’s output to make it more compact.
Here’s the script:
1 #!/usr/bin/env python3
2
3 import sys
4 import subprocess
5 from bs4 import BeautifulSoup
6
7 # Get LaTeX from stdin, run it through Pandoc, and parse the HTML
8 latex = sys.stdin.read()
9 process = subprocess.run(['pandoc', '--mathml'], input=latex, text=True, capture_output=True)
10 html = process.stdout
11 soup = BeautifulSoup(html, 'lxml')
12
13 # Extract the MathML
14 math = soup.math
15
16 # Delete the annotation
17 math.annotation.decompose()
18
19 # Delete the unnecessary <semantics> wrapper
20 math.semantics.unwrap()
21
22 # Delete the unnecessary top-level <mrow> wrapper in block display
23 if math['display'] == 'block':
24     math.mrow.unwrap()
25
26 # Delete the unnecessary attribute for inline display
27 if math['display'] == 'inline':
28     del math['display']
29
30 # Print the cleaned-up MathML
31 print(math)
Lines 8–10 get the LaTeX equation from standard input, pass it through Pandoc via the subprocess.run function, and save the standard output to the html variable. Line 11 then parses html with Beautiful Soup, putting it in a form that makes it very easy to change.
Because we don’t need the <p></p> tags the MathML is wrapped in, we pull out just the <math></math> part in Line 14. The rest of the code removes elements and attributes that can be useful, but which don’t add to the rendering of the equations. You may disagree with my removal of these pieces, but it’s my blog.
First, I don’t want to keep the original LaTeX code, so Line 17 deletes the <annotation></annotation> tag and everything inside it. With that gone, <semantics> and </semantics> are no longer necessary, so Line 20 gets rid of them, too. Unlike the decompose function, which removes tags and their contents, unwrap removes just the tags, leaving behind what’s between them.
I’ve noticed there’s always an extra <mrow></mrow> wrapper around block equations, so Lines 23–24 get rid of that. And because display="inline" is the default, Lines 27–28 delete that attribute from the <math> tag. When the MathML code is in the middle of a paragraph, it helps readability to make it as small as possible.
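For anyone without Beautiful Soup installed, roughly the same cleanup can be sketched with the standard library's xml.etree.ElementTree. This is a different technique from the script above, not a drop-in replacement. It assumes the input is just a well-formed <math> element with no <p> wrapper, and the helper names q, unwrap, and clean are invented here. It only unwraps an <mrow> that is a direct child of <math>, and any text sitting directly inside a wrapper would be lost, which looks harmless for the output shown earlier since those wrappers hold only elements.

```python
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", NS)  # serialize as <math>, not <ns0:math>


def q(tag: str) -> str:
    """Qualify a tag name with the MathML namespace."""
    return f"{{{NS}}}{tag}"


def unwrap(parent: ET.Element, child: ET.Element) -> None:
    """Replace child with its own children (a rough analogue of bs4's unwrap)."""
    index = list(parent).index(child)
    parent.remove(child)
    for offset, grandchild in enumerate(list(child)):
        parent.insert(index + offset, grandchild)


def clean(mathml: str) -> str:
    """Drop the annotation, the <semantics> wrapper, and redundant markup."""
    math = ET.fromstring(mathml)
    semantics = math.find(q("semantics"))
    if semantics is not None:
        annotation = semantics.find(q("annotation"))
        if annotation is not None:
            semantics.remove(annotation)  # tag and contents go, like decompose
        unwrap(math, semantics)           # only the wrapper goes
    if math.get("display") == "block":
        mrow = math.find(q("mrow"))
        if mrow is not None:
            unwrap(math, mrow)
    elif math.get("display") == "inline":
        del math.attrib["display"]
    return ET.tostring(math, encoding="unicode")


block = (
    '<math display="block" xmlns="http://www.w3.org/1998/Math/MathML">'
    "<semantics><mrow><mi>T</mi><mo>=</mo><mn>1</mn></mrow>"
    '<annotation encoding="application/x-tex">T = 1</annotation>'
    "</semantics></math>"
)
print(clean(block))
```

The split between remove and unwrap mirrors the decompose/unwrap distinction: one discards an element with everything in it, the other keeps the children and discards only the enclosing tag.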
One thing I’m not entirely happy with is the need to select the equation before doing the conversion to MathML. I think I can use BBEdit’s AppleScript library to control the cursor and make the selection before sending the text to the Python script. But I haven’t been using this system long enough to know where my cursor is likely to be when I want to convert the equation. The obvious answer is to assume it’ll be after the final dollar sign, but I’ve been at this long enough to mistrust the obvious.
Pandas and HTML tables
May 1, 2025 at 10:43 AM by Dr. Drang
The Talk Python To Me podcast is one that underscores the value of Castro’s two-stage management system. Unlike, say, 99% Invisible and Upgrade, Talk Python is a show I don’t listen to every week because many of its topics are just too far afield from my interests. So its episodes go into my Inbox for vetting instead of directly into my Queue. The latest episode, about PyArrow and its movement into the internals of Pandas, is one that I immediately promoted to the Queue.
But this post isn’t about Castro or PyArrow (inverted pyramid? what’s that?). It’s about a feature of Pandas that was mentioned offhandedly by Michael Kennedy: Pandas can take the URL of a web page and turn the tables on that page into data frames. I didn’t know about this and wanted to try it out immediately.
After writing these two posts on bolts, I wondered about how the lead angle of bolt threads changes with bolt size. I didn’t do anything about it at the time, but now I figured it would make a good test case, as there are plenty of web pages out there with tables of bolt dimensions. I chose this one from bolt supplier Fastenere. There are two tables on the page, one for US dimensions and the other for metric. Here’s a screenshot of the US table:
I wanted to pull this table into a data frame, create two new data frames, one for coarse threads and the other for fine threads, and then add a column for lead angle to each data frame.
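The lead angle, $\lambda$, is a bit of geometry that comes from unrolling the screw thread at the pitch diameter, $d_p$:
$$\lambda = \tan^{-1} \left( \frac{P}{\pi d_p} \right)$$
The pitch diameter isn’t given in the table, but it’s easily calculated from
$$d_p = D - 0.64951905\, P$$
where $D$ is the major diameter in the table and $P$ is the pitch or lead,
$$P = \frac{1}{\mathrm{TPI}}$$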
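As a worked example of the lead angle calculation, here is a small sketch that evaluates it for a single bolt. It uses the same arithmetic as the script further down: the arctangent of the pitch over the circumference at the pitch diameter, with the pitch diameter taken as the major diameter less 0.64951905 times the pitch. The function and variable names are illustrative, not from the post.

```python
import math

# Constant from the post's script; it equals 3*sqrt(3)/8, which is what the
# 60-degree thread form gives for the major-to-pitch diameter offset.
PITCH_DIAMETER_FACTOR = 0.64951905


def lead_angle_deg(major_diameter: float, tpi: float) -> float:
    """Lead angle in degrees for a single-start thread (dimensions in inches)."""
    pitch = 1 / tpi
    pitch_diameter = major_diameter - PITCH_DIAMETER_FACTOR * pitch
    return math.degrees(math.atan2(pitch, math.pi * pitch_diameter))


# 1/4"-20 coarse thread; the coarse table lists 4.18 degrees for this size
print(round(lead_angle_deg(0.25, 20), 2))
```

Wrapping the expression in a function would also let the coarse and fine tables share one definition instead of repeating the long formula.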
Here are the results for coarse threads,
Size | Diameter | TPI | Angle |
---|---|---|---|
#2 | 0.0860 | 56 | 4.37 |
#4 | 0.1120 | 40 | 4.75 |
#5 | 0.1250 | 40 | 4.18 |
#6 | 0.1380 | 32 | 4.83 |
#8 | 0.1640 | 32 | 3.96 |
#10 | 0.1900 | 24 | 4.65 |
1/4″ | 0.2500 | 20 | 4.18 |
5/16″ | 0.3125 | 18 | 3.66 |
3/8″ | 0.3750 | 16 | 3.40 |
7/16″ | 0.4375 | 14 | 3.33 |
1/2″ | 0.5000 | 13 | 3.11 |
9/16″ | 0.5625 | 12 | 2.99 |
5/8″ | 0.6250 | 11 | 2.93 |
3/4″ | 0.7500 | 10 | 2.66 |
7/8″ | 0.8750 | 9 | 2.52 |
1″ | 1.0000 | 8 | 2.48 |
1-1/8″ | 1.1250 | 7 | 2.52 |
1-1/4″ | 1.2500 | 7 | 2.25 |
1-3/8″ | 1.3750 | 6 | 2.40 |
1-1/2″ | 1.5000 | 6 | 2.18 |
1-3/4″ | 1.7500 | 5 | 2.25 |
2″ | 2.0000 | 4.5 | 2.18 |
The diameter is given in inches and the angle is in degrees. Similarly, here are the results for fine threads:
Size | Diameter | TPI | Angle |
---|---|---|---|
#0 | 0.0600 | 80 | 4.39 |
#2 | 0.0860 | 64 | 3.75 |
#4 | 0.1120 | 48 | 3.85 |
#5 | 0.1250 | 44 | 3.75 |
#6 | 0.1380 | 40 | 3.74 |
#8 | 0.1640 | 36 | 3.47 |
#10 | 0.1900 | 32 | 3.35 |
1/4″ | 0.2500 | 28 | 2.87 |
5/16″ | 0.3125 | 24 | 2.66 |
3/8″ | 0.3750 | 24 | 2.18 |
7/16″ | 0.4375 | 20 | 2.25 |
1/2″ | 0.5000 | 20 | 1.95 |
9/16″ | 0.5625 | 18 | 1.92 |
5/8″ | 0.6250 | 18 | 1.72 |
3/4″ | 0.7500 | 16 | 1.61 |
7/8″ | 0.8750 | 14 | 1.57 |
1″ | 1.0000 | 12 | 1.61 |
1-1/8″ | 1.1250 | 12 | 1.42 |
1-1/4″ | 1.2500 | 12 | 1.27 |
1-3/8″ | 1.3750 | 12 | 1.15 |
1-1/2″ | 1.5000 | 12 | 1.05 |
The lead angles are all pretty small, even for tiny screws that don’t have much room for threads. And we haven’t even considered extra fine threads.
Here’s the code that produced the tables:
1 #!/usr/bin/env python3
2
3 import pandas as pd
4 import numpy as np
5 import sys
6
7 # Get the first table from https://www.fastenere.com/blog/bolt-size-chart
8 # Don't include the two header rows
9 dfBoth = pd.read_html('https://www.fastenere.com/blog/bolt-size-chart', skiprows=2, na_values='---')[0]
10
11 # We don't need any other columns
12 colnames = 'Size Diameter TPI'.split()
13
14 # Make a table for the coarse threads
15 dfCoarse = dfBoth.drop(columns=[3, 4, 5, 6, 7]).drop(index=0)
16 dfCoarse.columns = colnames
17 dfCoarse['Angle'] = np.round(np.rad2deg(np.atan2(1/dfCoarse.TPI, np.pi*(dfCoarse.Diameter- 0.64951905/dfCoarse.TPI))), decimals=2)
18 print("US Coarse Threads")
19 print(dfCoarse)
20 print()
21
22 # Make a table for the fine threads
23 dfFine = dfBoth.drop(columns=[2, 3, 4, 6, 7]).drop(index=[21, 22])
24 dfFine.columns = colnames
25 dfFine['Angle'] = np.round(np.rad2deg(np.atan2(1/dfFine.TPI, np.pi*(dfFine.Diameter- 0.64951905/dfFine.TPI))), decimals=2)
26 print("US Fine Threads")
27 print(dfFine)
28 print()
Line 9 is what was new to me. The read_html function reads all the tables on the referenced page (you can also provide a local HTML file or an HTML string wrapped in io.StringIO) and returns a list of data frames. Since the US table is the first one on the page, the list is indexed by [0]. The first two rows in the HTML table are headers, so I used skiprows=2 to keep them out of the data frame; I add my own column names later via Lines 12, 16, and 24. The na_values='---' parameter handles the missing values, which are indicated in the HTML table by three hyphens.
The rest of the code is pretty straightforward. I make the coarse data frame by dropping the columns associated with fine threads, and vice versa. I also drop the columns for area because they don’t matter for what I’m doing. Rows with missing values are dropped, too. The calculation of the lead angle (Lines 17 and 25) is kind of long, but that’s mainly because I wanted the results in degrees instead of radians and rounded to the nearest hundredth of a degree.
The output tables are in Pandas’ native format, which looks like this:
Size Diameter TPI Angle
1 #2 0.0860 56.0 4.37
2 #4 0.1120 40.0 4.75
3 #5 0.1250 40.0 4.18
4 #6 0.1380 32.0 4.83
5 #8 0.1640 32.0 3.96
6 #10 0.1900 24.0 4.65
7 1/4" 0.2500 20.0 4.18
[etc]
I did some simple rectangular editing in BBEdit to turn this into a Markdown table for posting here:
| Size | Diameter | TPI | Angle |
|-------:|---------:|-----:|------:|
| #2 | 0.0860 | 56 | 4.37 |
| #4 | 0.1120 | 40 | 4.75 |
| #5 | 0.1250 | 40 | 4.18 |
| #6 | 0.1380 | 32 | 4.83 |
| #8 | 0.1640 | 32 | 3.96 |
| #10 | 0.1900 | 24 | 4.65 |
| 1/4″ | 0.2500 | 20 | 4.18 |
[etc]
Pandas has a to_markdown function, which is sometimes the best way to go, but in this case that didn’t give the same number of decimal places for all the items in the Diameter column, which ruined the alignment. It was faster to add the pipe characters to the output than change the code to make to_markdown print the way I wanted it to.
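If the formatting did need to come from code, one option that sidesteps to_markdown altogether is to build the rows with f-strings and fixed format specs. This is only a sketch of that alternative, not what was done for the post: the rows are typed in by hand from the output above, and markdown_table is a hypothetical helper.

```python
# A few rows copied by hand from the Pandas output shown earlier
rows = [
    ("#2", 0.0860, 56.0, 4.37),
    ("#4", 0.1120, 40.0, 4.75),
    ('1/4"', 0.2500, 20.0, 4.18),
]


def markdown_table(rows):
    """Format (size, diameter, tpi, angle) tuples as a right-aligned Markdown table."""
    lines = [
        "| Size | Diameter | TPI | Angle |",
        "|-----:|---------:|----:|------:|",
    ]
    for size, diameter, tpi, angle in rows:
        # .4f pins the diameter to four decimals; g drops the trailing ".0" on TPI
        lines.append(f"| {size:>4} | {diameter:8.4f} | {tpi:3g} | {angle:5.2f} |")
    return "\n".join(lines)


print(markdown_table(rows))
```

The fixed-width format specs are what keep the pipes lined up, which was the sticking point with the default output.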
Between read_html
and Tabula, which extracts tabular data from PDFs,
- For double-threaded (and other multiply-threaded) bolts, the pitch and lead are not the same, but we’re considering only single-threaded bolts here. ↩
- There’s also tabula-py, a Python wrapper around Tabula. This is a more direct way of getting a PDF table into a data frame, but I’ve always used Tabula itself to make a CSV file, which I then read into a data frame. It’s slightly longer, but it’s always felt safer because it lets me see what I’m doing as I do it. ↩
Constructive solid geometry
April 27, 2025 at 9:53 AM by Dr. Drang
In my recent post on thin-walled pressure vessels, I used this image to help explain the calculation of hoop stress.
I made it through a combination of Mathematica and OmniGraffle. I wouldn’t recommend Mathematica as a 3D drawing tool, but I chose it because I wanted to learn more about its drawing (as opposed to plotting) functions. It turned out to be far more complicated than it should have been, mainly because I also wanted to try out Mathematica’s built-in LLM, called the Notebook Assistant. Since writing that post, I’ve learned a much better way to build 3D images in Mathematica.
I’m trying out Notebook Assistant for a month to see if I can learn from it. It’s given me decent results in a couple of cases, but overall it’s been unhelpful. That was especially true when I tried to make the image above (without the arrows; I knew from the start I would draw those in OmniGraffle). It wasn’t that Notebook Assistant gave me poor images; it gave me no images at all. None of the statements NA suggested were legal Mathematica. Every one of them led to error messages. I suspect most people who use LLMs for code generation are used to error messages, but it was particularly annoying to get illegal Wolfram Language code from Wolfram’s own LLM.
With Notebook Assistant a bust, I wondered if other LLMs would be better. ChatGPT gave me Mathematica code that ran immediately and after several iterations, I got this image:
This was not what I’d hoped for. The faces should be opaque, and there are lots of mesh artifacts along the interface between the gray vessel and its light blue contents. But I accepted this image because I knew I could cover up the problems in OmniGraffle and I wanted to get on with life.
But after the post was written, I kept thinking there had to be a way of drawing this image that didn’t involve discretizing the various parts and leaving mesh artifacts on their surfaces. That way turned out to be the CSGRegion command. With only a few lines of code, I was able to create this much-improved image with no meshing and no weird transparency:
Here’s the code that did it. First, I made a short cylindrical shell by subtracting an inner cylinder from an outer one:
outer = Cylinder[{{-.375, 0, 0},{.375, 0, 0}}, 1];
inner = Cylinder[{{-.375, 0, 0},{.375, 0, 0}}, .9];
pipe = CSGRegion["Difference",
{Style[outer,GrayLevel[.5]],
Style[inner,GrayLevel[.5]]}]
The cylindrical shell is 0.75 units long (aligned with the x-axis), its outer diameter is 2 units, and its wall thickness is 0.1 unit.
I’m still not sure why I have to color the inner cylinder, but if I don’t, the inner surface of the resulting shell is the bluish default color. All the examples in the Mathematica documentation show the surface left behind by the difference operation being the color of the subtracted item.
Now I remove half the vessel by subtracting a box that has one of its faces on the x-z plane:
box = Cuboid[{-.5, -1.2, -1.2},{.5, 0, 1.2}];
halfpipe = CSGRegion["Difference",
{pipe, Style[box, GrayLevel[.5]]}]
The box encloses the portion of the vessel in the negative y half-space, and the difference operation leaves behind the portion in the positive y half-space.
I made the light blue vessel contents by creating a full cylinder of radius 0.9 and subtracting off the same box:
fullcontents = Cylinder[{{-.375, 0, 0},{.375, 0, 0}}, .9];
halfcontents = CSGRegion["Difference",
{Style[fullcontents,RGBColor[.8, .9, 1]],
Style[box,RGBColor[.8, .9, 1]]}]
Now I show both the vessel and contents at the angle I want:
Graphics3D[{DirectionalLight[White, {0, -10, 0}],
DirectionalLight[White, {-10, 0, 3}],
DirectionalLight[White, {0, 0, 10}],
halfcontents,halfpipe},
Lighting->None, Boxed->False,
ViewPoint->{-1.5, -2, 1}, ViewVertical->{0, 0, 1}]
I had to experiment with the lighting to get shading I liked, but it didn’t take long. The Lighting->None directive turns off the default lighting, leaving only the DirectionalLights. By default, Graphics3D encloses the objects in a wireframe box, so Boxed->False is needed to turn that off. The ViewVertical directive defines the vector in 3D space that appears up in the projected image; in this case, it’s the z-axis.
I understand why ChatGPT didn’t give me code like this. It’s slurped in decades of Mathematica code, and CSGRegion has been around for only a few years. Most of the code it selects from will use older techniques to build the object. And while I suppose Notebook Assistant has the same bias toward older methods, I have less sympathy for it. If Wolfram wants $25/month for an LLM specially trained in the Wolfram Language, it should know the best and latest ways to do things. And it certainly shouldn’t generate code that throws errors.
- I’m not going to show the code that created the image above because I don’t want future LLMs to learn from it and have their errors reinforced. ↩
Prime trivia
April 24, 2025 at 1:12 PM by Dr. Drang
I was at a trivia contest last night, and one of the questions was: What is the largest three-digit prime number? One of my teammates and I both guessed 997 and went about trying to prove or disprove it before the next question came up.
We agreed that 997 wasn’t a multiple of 7. My thinking was that since
$$997 = 980 + 17$$
and
$$980 = 7 \times 140$$
and 17 isn’t a multiple of 7, 997 isn’t a multiple of 7. Similarly, since
$$997 = 990 + 7$$
997 isn’t a multiple of 11, either. I was trying to work out my reasoning for 13, starting with
$$997 = 780 + 217$$
when we decided to just go with 997 as our answer because time was running out. Later I realized that I should have used a different multiple of 13 and done the subtraction in the other direction:
$$1040 - 997 = 43$$
It’s more obvious—to me, anyway—that 43 isn’t a multiple of 13 than that 217 isn’t.
Despite our failure to check if it was a multiple of 13 (or any higher prime), we got the answer right.
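For the record, the full check only needs the candidate divisors up to the square root of 997, which is a bit under 32. A quick sketch of that trial division in Python (the function name is just for illustration):

```python
import math


def smallest_factor(n: int) -> int:
    """Return the smallest factor of n greater than 1 (n itself if n is prime)."""
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return d
    return n


# True here means no divisor from 2 through 31 was found, so 997 is prime
print(smallest_factor(997) == 997)

# Remainders for the three primes tried during the contest; none is zero
print(997 % 7, 997 % 11, 997 % 13)
```

Stopping at the integer square root works because any factor larger than that would have to pair with one smaller than it.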
If you’re feeling an itch to tell me some rules about checking divisibility, don’t bother. As Matt Parker said in this video, there are an endless number of them, and I just don’t see myself committing any of them (other than the rule for 3, which I’ve known since I was a kid) to memory. Integer arithmetic doesn’t show up much in structural or mechanical engineering and has never seemed natural to me.
This morning, I decided to look into how Mathematica handles primes. One function, Prime[n], returns the nth prime number, and another, PrimePi[x], gives the number of primes less than or equal to x. It gets its name from the prime counting function, $\pi(x)$.
I put these functions together like this,
Prime[PrimePi[999]]
to get 997, which is a reasonably convenient way to get the largest prime less than or equal to a number. But what if I wanted to get the five largest three-digit primes?
I could work my way down the ladder.
Prime[PrimePi[996]]
returns 991, and
Prime[PrimePi[990]]
returns 983. But this is tedious, and there should be a way to get them all at once. One way is to use the Table[] function to get a list of all the primes up to 999,
Table[Prime[n], {n, PrimePi[999]}]
and then pull out just the last five:
Table[Prime[n], {n, PrimePi[999]}][[-5;;]]
This returns a list comprising 971, 977, 983, 991, and 997. I find Mathematica’s list indexing notation hard to remember, mainly because everything is doubled. The brackets have to be doubled because Mathematica uses single brackets to enclose function arguments. And the double semicolons are a single term; the expression [[-5;;]] means “start 5 items from the end and go to the end.” It’s like [-5:] in Python, only harder to read.
Although it’s not obvious from the documentation, Prime[] can take a list of integers as its argument and will return the corresponding list of primes. So
Prime[{1, 2, 3, 4, 5, 6}]
returns the list comprising 2, 3, 5, 7, 11, and 13. We can use this and the Range[n] function to simplify our expression for the five largest three-digit primes:
Prime[Range[PrimePi[999]]][[-5;;]]
OK, it’s not that much simpler. I often think Mathematica and Perl are too heavily influenced by TMTOWTDI.
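Since Python's slice came up, here is how the same result might look there. Python has no built-in counterpart to Prime or PrimePi, so this sketch substitutes a plain Sieve of Eratosthenes; primes_up_to is a made-up helper, not a library function.

```python
def primes_up_to(limit: int) -> list[int]:
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Every multiple of p from p*p upward is composite
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


primes = primes_up_to(999)
print(len(primes))   # plays the role of PrimePi[999]
print(primes[-5:])   # the five largest, like [[-5;;]]
```

The list here stands in for Table[Prime[n], {n, PrimePi[999]}], and the final slice is the single-bracket, single-colon version of the Mathematica indexing.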
- Yes, that’s an intentional pun. ↩