MathML with Pandoc

Since switching from MathJax to MathML to render equations here on ANIAT, I’ve tried several approaches to generate the MathML. There are many utilities and libraries that claim to do the conversion, but I’ve found all of them to be limited in one way or another. For a while, I was even writing MathML directly, albeit with the help of some Typinator abbreviations, because I couldn’t trust the converters to generate the correct characters or even understand some LaTeX commands I use regularly. Recently, I began using what I should have started out with: Pandoc.

It’s not that I wasn’t aware of Pandoc. It’s famous in the Markdown/HTML/LaTeX world, and I probably first heard of it shortly after its release. But I’ve always thought of it as a document converter, not an equation converter. I was wrong. It’s very easy to use with a single equation.

pandoc --mathml <<<'$$T = \frac{1}{2} v_x^2$$'

produces

<p>
  <math display="block" xmlns="http://www.w3.org/1998/Math/MathML">
    <semantics>
      <mrow>
        <mi>T</mi>
        <mo>=</mo>
        <mfrac><mn>1</mn><mn>2</mn></mfrac>
        <msubsup><mi>v</mi><mi>x</mi><mn>2</mn></msubsup>
      </mrow>
      <annotation encoding="application/x-tex">
        T = \frac{1}{2} v_x^2
      </annotation>
    </semantics>
  </math>
</p>

where I’ve added linebreaks and indentation to the output to make it easier to read. Because it’s delimited by double dollar signs, the equation is rendered in block mode, like this:

$$ T = \frac{1}{2} v_x^2 $$

Single dollar signs would generate MathML with a display="inline" attribute.

(If you look at the source code for this page, you’ll see that I usually delete some of the code Pandoc generates—we’ll get to that later.)

All the converters handled simple equations, like $v_0^2$, well, but more complicated stuff can be troublesome. One of the problems other converters have is dealing with multiline equations, something Pandoc handles with ease. For example, this piecewise function definition,

$$ f(x) = \left\{ \begin{array} {rcl} 
-1 & : & x<0 \\
0 & : & x=0 \\
+1 & : & x>0
\end{array} \right. $$

is rendered exactly as expected:

$$ f(x) = \left\{ \begin{array}{rcl}
-1 & : & x<0 \\
0 & : & x=0 \\
+1 & : & x>0
\end{array} \right. $$

Well, perhaps not exactly as expected. If you’re reading this in Chrome (and, presumably, other Chrome-based browsers), all the cells in the array are aligned left, which puts the zero in the wrong spot, not vertically aligned with the ones. But that’s Chrome’s fault, not Pandoc’s.

Since Pandoc understands the \begin{array} command, it can do matrices, too:

$$ k = \frac{EA}{L} \left[ \begin{array}{rr}
1 & -1 \\
-1 & 1
\end{array} \right] $$

So far, I’ve found only one small bug in Pandoc’s conversion from LaTeX to MathML. Here’s a simple formula that includes both a summation and a limit:

$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$

This is what it should look like, a screenshot of the equation as rendered by LaTeX itself:

Screenshot of exponential expansion and limit

But here’s how it comes out after passing the equation to Pandoc:

$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim\nolimits_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$

The summation is fine, but the limit is formatted incorrectly. The $n\to\infty$ part should be under the lim, not off to the side like a subscript. That subscript-like formatting is what you’d use for an inline equation, not a block equation.

Let’s see what happened. Here’s the MathML produced by Pandoc:

xml:
 1:  <math display="block" xmlns="http://www.w3.org/1998/Math/MathML">
 2:    <semantics>
 3:      <mrow>
 4:        <msup><mi>e</mi><mi>x</mi></msup>
 5:        <mo>=</mo>
 6:        <munderover>
 7:          <mo>∑</mo>
 8:          <mrow><mi>n</mi><mo>=</mo><mn>0</mn></mrow>
 9:          <mo accent="false">∞</mo>
10:        </munderover>
11:        <mfrac>
12:          <msup><mi>x</mi><mi>n</mi></msup>
13:          <mrow><mi>n</mi><mi>!</mi></mrow>
14:        </mfrac>
15:        <mo>=</mo>
16:        <munder>
17:          <mi>lim</mi><mo>⁡</mo>
18:          <mrow><mi>n</mi><mo>→</mo><mi>∞</mi></mrow>
19:        </munder>
20:        <msup>
21:          <mrow>
22:            <mo form="prefix" stretchy="true">(</mo>
23:              <mn>1</mn><mo>+</mo>
24:              <mfrac><mi>x</mi><mi>n</mi></mfrac>
25:            <mo form="postfix" stretchy="true">)</mo>
26:          </mrow>
27:          <mi>n</mi>
28:        </msup>
29:      </mrow>
30:      <annotation encoding="application/x-tex">
31:        e^x
32:        = \sum_{n=0}^\infty \frac{x^n}{n!}
33:        = \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n 
34:      </annotation>
35:    </semantics>
36:  </math>

The problem with the rendering of the limit is in Line 17. There’s an apparently empty <mo>⁡</mo> element after the <mi>lim</mi> element, which gives the <munder> three children when it’s supposed to have exactly two: the base and the underscript. That’s what’s messing up the formatting. If we remove that extra element, the limit gets formatted the way it should:

$$ e^x
= \sum_{n=0}^\infty \frac{x^n}{n!}
= \lim_{n\to\infty} \left( 1+ \frac{x}{n} \right)^n $$

Obviously, I’m not going to try to fix Pandoc; I have no idea how to program in Haskell. I’ll send a note to John MacFarlane (can he really still be the sole developer?) about the rendering bug, but in the meantime I’ll just remember to delete the empty <mo>⁡</mo> whenever I need a limit.
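Remembering to delete something by hand is the sort of chore that could probably be automated. Here’s a minimal sketch, not something I’ve folded into my own workflow. It assumes the stray element is always an <mo> sitting between the base and the underscript of a three-child <munder>, and that it’s either truly empty or holds only the invisible function-application character (U+2061). The name fix_limits is made up for this example, and it uses the standard library’s ElementTree so it stands on its own.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Serialize MathML as the default namespace (no ns0: prefixes on output)
ET.register_namespace("", MATHML_NS)

# Assumption: the "empty" <mo> may hold this invisible character
FUNCTION_APPLICATION = "\u2061"


def fix_limits(mathml):
    """Drop a stray <mo> between the base and underscript of a <munder>.

    Hypothetical helper: only touches <munder> elements that have three
    children whose middle child is an empty (or invisible) <mo>.
    """
    root = ET.fromstring(mathml)
    # Copy the iterator into a list so removing children is safe
    for munder in list(root.iter(f"{{{MATHML_NS}}}munder")):
        children = list(munder)
        if len(children) != 3:
            continue
        extra = children[1]
        text = (extra.text or "").strip()
        if extra.tag == f"{{{MATHML_NS}}}mo" and text in ("", FUNCTION_APPLICATION):
            munder.remove(extra)
    return ET.tostring(root, encoding="unicode")
```

A well-formed two-child <munder> passes through with its children unchanged, so running this on every equation shouldn’t do any harm to the ones that don’t need it.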

I think I’ve mentioned in the past that one of my favorite features of Markdown is that it allows you to mix HTML with regular Markdown text; it passes the HTML through unchanged. I’m using that here to add MathML equations to my blog posts. I write the equation in LaTeX, select it, and run a Keyboard Maestro macro that replaces the LaTeX with its MathML equivalent. Because I’m still messing around with the macro (and may change it to an automation that BBEdit runs directly) I won’t post it here, but I do want to include the Python script that runs Pandoc to do the conversion and then cleans up Pandoc’s output to make it more compact.

Here’s the script:

python:
 1:  #!/usr/bin/env python3
 2:  
 3:  import sys
 4:  import subprocess
 5:  from bs4 import BeautifulSoup
 6:  
 7:  # Get LaTeX from stdin, run it through Pandoc, and parse the HTML
 8:  latex = sys.stdin.read()
 9:  process = subprocess.run(['pandoc', '--mathml'], input=latex, text=True, capture_output=True)
10:  html = process.stdout
11:  soup = BeautifulSoup(html, 'lxml')
12:  
13:  # Extract the MathML
14:  math = soup.math
15:  
16:  # Delete the annotation
17:  math.annotation.decompose()
18:  
19:  # Delete the unnecessary <semantics> wrapper
20:  math.semantics.unwrap()
21:  
22:  # Delete the unnecessary top-level <mrow> wrapper in block display
23:  if math['display'] == 'block':
24:    math.mrow.unwrap()
25:  
26:  # Delete the unnecessary attribute for inline display
27:  if math['display'] == 'inline':
28:    del math['display']
29:  
30:  # Print the cleaned-up MathML
31:  print(math)

Lines 8–10 get the LaTeX equation from standard input, pass it through Pandoc via the subprocess.run function, and save the standard output to the html variable. Line 11 then parses html with Beautiful Soup, putting it in a form that makes it very easy to change.
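The script takes it on faith that Pandoc ran successfully. If you’d rather not, here’s a hedged sketch of one way to check. The function name convert and its command parameter are made up for illustration; the only Pandoc option it assumes is the --mathml flag already used above. It only catches a nonzero exit status, nothing subtler.

```python
import subprocess


def convert(latex, command=("pandoc", "--mathml")):
    """Pipe LaTeX through an external converter and return its stdout.

    `command` is a hypothetical parameter for this sketch; it defaults
    to the same Pandoc invocation the script uses.
    """
    process = subprocess.run(list(command), input=latex,
                             text=True, capture_output=True)
    if process.returncode != 0:
        # Pass along the converter's own complaint rather than
        # handing empty output to the parser
        raise RuntimeError(process.stderr.strip() or "conversion failed")
    return process.stdout
```

Making the command a parameter also means the function can be tried out with a stand-in program on a machine that doesn’t have Pandoc installed.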

Because we don’t need the <p></p> tags the MathML is wrapped in, we pull out just the <math></math> part in Line 14. The rest of the code removes elements and attributes that can be useful, but which don’t add to the rendering of the equations. You may disagree with my removal of these pieces, but it’s my blog.

First, I don’t want to keep the original LaTeX code, so Line 17 deletes the <annotation></annotation> tag and everything inside it. With that gone, <semantics> and </semantics> are no longer necessary, so Line 20 gets rid of them, too. Unlike the decompose function, which removes tags and their contents, unwrap removes just the tags, leaving behind what’s between them.
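The difference between the two is easier to see on a stripped-down example. This is just an illustration with a hand-written snippet rather than real Pandoc output, and it uses Python’s built-in html.parser instead of lxml so there’s one less thing to install.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for Pandoc's output, trimmed to the essentials
snippet = ('<math><semantics><mrow><mi>T</mi></mrow>'
           '<annotation encoding="application/x-tex">T</annotation>'
           '</semantics></math>')
math = BeautifulSoup(snippet, 'html.parser').math

# decompose: the tag and everything inside it disappear
math.annotation.decompose()
after_decompose = str(math)
print(after_decompose)
# <math><semantics><mrow><mi>T</mi></mrow></semantics></math>

# unwrap: only the tag disappears; its children move up a level
math.semantics.unwrap()
after_unwrap = str(math)
print(after_unwrap)
# <math><mrow><mi>T</mi></mrow></math>
```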

I’ve noticed there’s always an extra <mrow></mrow> wrapper around block equations, so Lines 23–24 get rid of that. And because display="inline" is the default, Lines 27–28 delete that attribute from the <math> tag. When the MathML code is in the middle of a paragraph, it helps readability to make it as small as possible.

Screenshot of BBEdit with inline equation highlighted

One thing I’m not entirely happy with is the need to select the equation before doing the conversion to MathML. I think I can use BBEdit’s AppleScript library to control the cursor and make the selection before sending the text to the Python script. But I haven’t been using this system long enough to know where my cursor is likely to be when I want to convert the equation. The obvious answer is to assume it’ll be after the final dollar sign, but I’ve been at this long enough to mistrust the obvious.
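For what it’s worth, the obvious answer is easy enough to sketch. This is only the text-searching logic, written in Python rather than AppleScript, and it leans entirely on the assumption I just said I don’t trust: that the cursor sits immediately after the closing dollar sign. The function name equation_span and its arguments are made up for the example; nothing here talks to BBEdit.

```python
def equation_span(text, cursor):
    """Return (start, end) offsets of the $-delimited equation that ends
    exactly at `cursor`, or None if the cursor isn't right after one.

    Hypothetical helper: `cursor` is a character offset into `text`.
    """
    before = text[:cursor]
    # Check the block delimiter first so "$$" isn't mistaken for "$"
    for delim in ("$$", "$"):
        if before.endswith(delim):
            # Search backward for the opening delimiter, skipping the closing one
            start = before.rfind(delim, 0, cursor - len(delim))
            return None if start == -1 else (start, cursor)
    return None
```

Slicing the text with the returned offsets gives the equation, delimiters included, which is what the conversion script expects on standard input.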