New syntax highlighting for Markdown

In this post from a couple of weeks ago, I introduced a new feature to the blog: syntax highlighting for included source code. I still like the idea, but I’ve decided I don’t like the implementation, so I’m making a change. The change should be almost invisible to readers, but it’s a significant improvement for me.

Deficiencies in the original highlighter

My original syntax highlighting solution consisted of a callout to Pygments, the Python-based syntax highlighter, from within PHP Markdown Extra Math (PHPMEM), my fork of Michel Fortin’s Markdown processor. Because Pygments can add line numbers if asked, I would not put line numbers in my Markdown source for an article. Here’s an excerpt from the source of last week’s post about audiobook track names in iTunes:

So I wrote the script in Python using the [appscript library][2]. Here it is:

    ::: python linenos
    #!/usr/bin/python

    import appscript

    selected = appscript.app('iTunes').selection.get()

    for t in selected:
      oldname = t.name.get()
      newname = oldname.replace('Chapter', 'HP7 - Ch', 1)
      t.name.set(newname)

I selected all the *Deathly Hallows* tracks and ran the script to apply the changes.

As you can see, the code for the script is indented 4 spaces in the usual Markdown way. The first line, ::: python linenos, is a directive to Pygments to treat the code as Python and to include line numbers in the output. The directive itself is not included in the output.

At first this seemed fine, but after using it for a while I decided it just wasn’t in keeping with the “Markdown Way.” The driving philosophy of Markdown is that the source should be just as readable as the formatted HTML output: Markdown’s formatting commands should blend in with the text itself, using longstanding (or at least natural) plain text conventions. What I had come up with violated this philosophy in two ways:

It’s the second violation that bothered me the most. When I post source code with line numbers, it’s usually because I want to refer to those line numbers later on. When the Markdown source doesn’t have line numbers, later references to line numbers make no sense until the Markdown has been formatted. And separate from the philosophy, there’s a practical problem: If I want to discuss, say, the arguments to the replace method in Line 9, I have to count to know that it is Line 9. This isn’t much of a burden in a short snippet, but it is if the included code is dozens of lines long.[2]

Before I had any syntax highlighting, I would always include line numbers in the Markdown source if I planned to talk about particular lines later in the post. (A simple JavaScript utility styled the line numbers to make them less obtrusive in the output.) It was a pleasant way to write and I wanted to go back to it.

Using the new highlighter

For the new highlighter, I’ve reduced the noise in the language declaration line, and I’m including line numbers when appropriate.

In the Markdown source (what I see in my text editor), the script for renaming iTunes tracks looks like this:

    python:
     1:  #!/usr/bin/python
     2:  
     3:  import appscript
     4:  
     5:  selected = appscript.app('iTunes').selection.get()
     6:  
     7:  for t in selected:
     8:    oldname = t.name.get()
     9:    newname = oldname.replace('Chapter', 'HP7 - Ch', 1)
    10:    t.name.set(newname)

Here’s what it turns into:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import appscript
 4:  
 5:  selected = appscript.app('iTunes').selection.get()
 6:  
 7:  for t in selected:
 8:    oldname = t.name.get()
 9:    newname = oldname.replace('Chapter', 'HP7 - Ch', 1)
10:    t.name.set(newname)

If I keep the language line, but don’t include the line numbers in the source, the code is highlighted, but the output has no line numbers and no button to bring up a numberless version.

python:
#!/usr/bin/python

import appscript

selected = appscript.app('iTunes').selection.get()

for t in selected:
  oldname = t.name.get()
  newname = oldname.replace('Chapter', 'HP7 - Ch', 1)
  t.name.set(newname)

If I leave off the language line at the top, the code appears with no syntax highlighting. This is a backward compatibility thing; older posts that didn’t have a language line will continue to appear as they always have.

Here’s what it looks like with line numbers but no language line.

 1:  #!/usr/bin/python
 2:  
 3:  import appscript
 4:  
 5:  selected = appscript.app('iTunes').selection.get()
 6:  
 7:  for t in selected:
 8:    oldname = t.name.get()
 9:    newname = oldname.replace('Chapter', 'HP7 - Ch', 1)
10:    t.name.set(newname)

And here’s what it looks like with neither line numbers nor language line.

#!/usr/bin/python

import appscript

selected = appscript.app('iTunes').selection.get()

for t in selected:
  oldname = t.name.get()
  newname = oldname.replace('Chapter', 'HP7 - Ch', 1)
  t.name.set(newname)

Which, but for the background color, is just how it looks in the source.

(By the way, if you’re reading this in Internet Explorer, the output will always look like my Markdown source—no syntax highlighting and no diminution of the line numbers. When I first wrote the line number styling code back in 2007, problems with the JavaScript replace method in IE gave me so much trouble I eventually added a few lines to exclude any styling if IE is the browser. That exclusionary code is still present, although I may try to change that if a reader who uses Explorer asks nicely and is willing to run tests for me.)

The line numbers don’t have to start at 1. If I want to discuss a function in the middle of a program, I can include something like this in the Markdown source:

    javascript:
    10:  function addLineNumbers(lineArray, start) {
    11:    var currentLine = start;
    12:    var maxLine = start + lineArray.length - 1;
    13:    var lnWidth = maxLine.toString().length;
    14:    var numberedLineArray = [];
    15:    for (var i=0; i<lineArray.length; i++) {
    16:      numberedLineArray.push('<span class="ln">' + padLeft(currentLine.toString(), lnWidth) + '  </span>' + lineArray[i]);
    17:      currentLine++;
    18:    }
    19:    return numberedLineArray;
    20:  }

It will appear with line numbers starting at 10:

javascript:
10:  function addLineNumbers(lineArray, start) {
11:    var currentLine = start;
12:    var maxLine = start + lineArray.length - 1;
13:    var lnWidth = maxLine.toString().length;
14:    var numberedLineArray = [];
15:    for (var i=0; i<lineArray.length; i++) {
16:      numberedLineArray.push('<span class="ln">' + padLeft(currentLine.toString(), lnWidth) + '  </span>' + lineArray[i]);
17:      currentLine++;
18:    }
19:    return numberedLineArray;
20:  }

Here’s something that may seem odd to you: if the line numbers in the source are jumbled up or repeated or have gaps,

    javascript:
    10:  function addLineNumbers(lineArray, start) {
    15:    var currentLine = start;
    23:    var maxLine = start + lineArray.length - 1;
    18:    var lnWidth = maxLine.toString().length;
    14:    var numberedLineArray = [];
    11:    for (var i=0; i<lineArray.length; i++) {
    11:      numberedLineArray.push('<span class="ln">' + padLeft(currentLine.toString(), lnWidth) + '  </span>' + lineArray[i]);
    37:      currentLine++;
    38:    }
    39:    return numberedLineArray;
    40:  }

the line numbers in the output will be sequential, starting with the first numbered line.

javascript:
10:  function addLineNumbers(lineArray, start) {
11:    var currentLine = start;
12:    var maxLine = start + lineArray.length - 1;
13:    var lnWidth = maxLine.toString().length;
14:    var numberedLineArray = [];
15:    for (var i=0; i<lineArray.length; i++) {
16:      numberedLineArray.push('<span class="ln">' + padLeft(currentLine.toString(), lnWidth) + '  </span>' + lineArray[i]);
17:      currentLine++;
18:    }
19:    return numberedLineArray;
20:  }

This is consistent[3] with the way Markdown handles ordered lists, where

1. Apples
2. Oranges
2. Bananas
5. Grapes

turns into

  1. Apples
  2. Oranges
  3. Bananas
  4. Grapes
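
The renumbering rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code the blog actually runs: `renumber` is a hypothetical name, it works on plain text rather than on the page's `<pre><code>` elements, and it assumes every line carries the `number:` prefix followed by two spaces.

```javascript
// Hypothetical helper (not part of the blog's script): take source lines that
// each begin with "<digits>:" plus two spaces and renumber them sequentially,
// starting from whatever number the first line carries.
function renumber(source) {
  var prefix = /^( *)(\d+):(  )/;            // optional padding, digits, colon, two spaces
  var lines = source.split("\n");
  var first = lines[0].match(prefix);
  if (!first) return source;                 // no line numbers: leave the text alone
  var start = parseInt(first[2], 10);
  var width = String(start + lines.length - 1).length;
  return lines.map(function (line, i) {
    var n = String(start + i);
    while (n.length < width) n = " " + n;    // right-align so the colons line up
    return n + ":  " + line.replace(prefix, "");
  }).join("\n");
}

var jumbled = [
  "10:  function f() {",
  "15:    var a = 1;",
  "23:    return a;",
  "11:  }"
].join("\n");

console.log(renumber(jumbled));
```

Only the first number matters; the jumbled 15, 23, and 11 are discarded and the lines come out numbered 10 through 13. That is the same trade-off Markdown makes for ordered lists, except that the starting value is kept.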

How it’s implemented

The syntax highlighting is now done on the client side, through an extension of my line number styler to call a function from the Highlight.js library. Highlight.js is not as full-featured as Pygments, but it has what I need.

Here’s the JavaScript I now use to style the line numbers and do syntax highlighting:

javascript:
 1:  function padLeft(string, width) {
 2:    var padded = string;
 3:    var needed = width - string.length;
 4:    for (var i=0; i<needed; i++) {
 5:      padded = " " + padded;
 6:    }
 7:    return padded;
 8:  }
 9:      
10:  function addLineNumbers(lineArray, start) {
11:    var currentLine = start;
12:    var maxLine = start + lineArray.length - 1;
13:    var lnWidth = maxLine.toString().length;
14:    var numberedLineArray = [];
15:    for (var i=0; i<lineArray.length; i++) {
16:      numberedLineArray.push('<span class="ln">' + padLeft(currentLine.toString(), lnWidth) + '  </span>' + lineArray[i]);
17:      currentLine++;
18:    }
19:    return numberedLineArray;
20:  }
21:  
22:  function styleCode() {
23:    // IE wouldn't work with early versions of this function, so I stopped trying
24:    // to get it to work. Now that the function's been rewritten, I may need to
25:    // revisit this decision.
26:    var isIE = navigator.appName.indexOf('Microsoft') != -1;
27:    if (isIE) return;
28:    
29:    // Go through each of the <pre><code> blocks.
30:    $('pre code').each( function(i, elem) {
31:      var oldContent = elem.innerHTML;
32:      var newContent = [];
33:      
34:      // Get the language, if it's given, and remove it.
35:      var lang = oldContent.match(/^(bash|cmake|cpp|css|diff|xml|html|ini|java|javascript|lisp|lua|perl|php|python|ruby|scala|sql|tex):\n/);
36:      if (lang) {
37:        lang = lang[1];
38:        oldContent = oldContent.split("\n").slice(1).join("\n");
39:      }
40:      
41:      // Get the starting line number, if it's given, and remove the line numbers.
42:      var line = oldContent.match(/^( *)(\d+):(  )/);
43:      if (line) {
44:        line = parseInt(line[2], 10);
45:        oldContent = oldContent.replace(/^( *)(\d+):(  )/mg, "");
46:      }
47:      
48:      // Remove trailing empty lines, if any.
49:      oldContent = oldContent.replace(/\n+$/, "");
50:      
51:      // Put the unnumbered code back into the element.
52:      elem.innerHTML = oldContent;
53:      
54:      // Highlight the code if the language is given.
55:      if (lang) {
56:        $(this).addClass("language-" + lang);
57:        hljs.highlightBlock(elem);
58:      }
59:      
60:      // Put the line numbers back in if they were removed.
61:      if (line) {
62:        var newContent = elem.innerHTML.split("\n");
63:        newContent = addLineNumbers(newContent, line);
64:        newContent.push("");
65:        newContent.push('<button onclick="showPlain(this.parentNode)">Without line numbers</button>');
66:        elem.innerHTML = newContent.join("\n");
67:      }
68:      
69:    })
70:  }
71:  
72:  function showPlain(code) {
73:    var oldCode = code.cloneNode(true);
74:    for (var i=oldCode.childNodes.length-1; i>=0; i--){
75:      var node = oldCode.childNodes[i];
76:      if ((node.nodeName == 'SPAN' && node.className == 'ln') || node.nodeName == 'BUTTON'){
77:        oldCode.removeChild(node);
78:      }
79:    }
80:    var w = window.open("", "", "width=800,height=500,resizable=yes,scrollbars=yes");
81:    var d = w.document;
82:    d.open();
83:    d.write("<html><head><title>Code</title></head><body><pre><code>", oldCode.innerHTML, "</code></pre></body></html>");
84:    d.close();
85:  }

The styleCode function is the workhorse. As you can see on Line 30, it uses jQuery’s CSS-based selection methods to find and iterate through all the <pre><code> blocks in the document. Lines 34-39 pluck out the language, and Lines 41-46 remove the line numbers, keeping track of the starting line number.

Lines 54-58 add a class with the name of the chosen language to the <code> element, and call the highlightBlock function from the Highlight.js library. The now-highlighted code has the line numbers reinserted and the “Without line numbers” button added in Lines 60-67.
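
The parsing half of styleCode can be tried outside the browser with a string-only sketch like the one below. It is a simplified illustration, not a replacement for the function above: `parseBlock` is a hypothetical name, the language list is shortened, and there is no jQuery, DOM, or Highlight.js involved.

```javascript
// Hypothetical string-only version of the parsing steps in styleCode.
// The language list is deliberately shortened for illustration.
function parseBlock(text) {
  var lang = null;
  var start = null;

  // A first line such as "python:" names the language and is removed.
  var m = text.match(/^(bash|css|html|javascript|python|ruby):\n/);
  if (m) {
    lang = m[1];
    text = text.slice(m[0].length);
  }

  // A numbered first line supplies the starting line number;
  // after that, the numeric prefix is stripped from every line.
  m = text.match(/^( *)(\d+):(  )/);
  if (m) {
    start = parseInt(m[2], 10);
    text = text.replace(/^( *)(\d+):(  )/mg, "");
  }

  // Drop trailing empty lines, as the styler does.
  return { lang: lang, start: start, code: text.replace(/\n+$/, "") };
}

var block = parseBlock("python:\n 8:  import appscript\n 9:  \n10:  print('hi')\n");
console.log(block.lang, block.start);
console.log(block.code);
```

A block with no language line and no numbers comes back with `lang` and `start` both `null` and its text unchanged apart from trailing newlines, which corresponds to the backward-compatible case of older posts.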

Finally…

This has been an awfully long post; congratulations if you made it this far. Just a few more notes:

  1. The changes to PHPMEM that called Pygments were all kept in the “pygments” branch of the repository, a branch that I don’t intend to develop any further. If you want to use PHPMEM to handle math equations via MathJax, you can use the “master” branch without worry—there’s no syntax highlighting code in it.
  2. The switch from Pygments to Highlight.js has meant a new set of CSS styles for the highlighted code. I’ll probably be fiddling with them for a while as I try to find a set of colors and weights I like.
  3. When I cut and paste code into an article, I number the lines by calling this Python script from within TextMate.
  4. This change should cause the problem with unaligned line numbers on the iPhone to disappear.

Finally, I’ve been working on this change for a few days, testing and debugging locally by using the “highlighter” branch in the repository of my blog-preview utility. I’m much happier with this than I was with the Pygments solution and expect it to be a permanent feature of the blog.


  1. They also call out “My programmer was too lazy to figure out a nice way to indicate the language, so he used a big ugly signal that bears no relationship to normal punctuation.” ↩︎

  2. Yes, I can also have the code in its own window and use the line numbers in TextMate’s gutter, but that requires a window switch I shouldn’t have to make when I’m discussing something in the very file I’m editing. ↩︎

  3. Well, it’s not entirely consistent. Markdown always starts an ordered list at 1, whereas the code handler will always start at the first line number given. I think this is a reasonable compromise. ↩︎