Cleaning up my Markdown table cleanup script

Way back in 2008, I wrote a script that made it easier to have nicely formatted tables in a Markdown[1] document. The idea was to take a hastily written table like this,

| Column 1 | Column 2 | Column 3 |
|--|:--:|--:|
| first | second | third |
| column | column | column |
| left | center | right |

and turn it into this,

| Column 1 | Column 2 | Column 3 |
|:---------|:--------:|---------:|
| first    |  second  |    third |
| column   |  column  |   column |
| left     |  center  |    right |

which is much easier to read. Note that both result in the same generated HTML and will look the same in the output document:

Column 1   Column 2   Column 3
first       second       third
column      column      column
left        center       right

The script was all about making the Markdown source look better. It’s also easier to see if you have mistakes when the tables look like tables. I called the script Normalize Table.

The original script was incorporated into a TextMate command, which was the editor I used at the time. Later, when I switched to BBEdit, I took the script and saved it as a BBEdit Text Filter.

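One of the problems with that script was that it didn’t handle Unicode characters correctly. The script read its input as a byte string, and in UTF-8 every non-ASCII character takes more than one byte, so cell widths were counted in bytes rather than characters. That messed up the vertical alignment of the pipes. A table like this,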

| Cølumn 1 | Côlumn 2 | Colümn 3 |
|--|:--:|--:|
| fírst | sèçond | thîrd |
| column | column | column |
| left | center | right |

would end up like this,

| Cølumn 1 | Côlumn 2 | Colümn 3 |
|:----------|:---------:|----------:|
| fírst    |  sèçond |    thîrd |
| column    |   column  |    column |
| left      |   center  |     right |
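To see where the stray columns come from, here is a minimal sketch in Python 3. It is not the original script: the helper names `pad_bytes` and `pad_chars` are made up for illustration, and the byte-counting behavior of a Python 2 string is imitated by encoding to UTF-8 first.

```python
# Cells from the first column of the table above.
cells = ["Cølumn 1", "fírst", "column"]

def pad_bytes(cell, width):
    """Left-justify by UTF-8 byte count, imitating a Python 2 byte string."""
    raw = cell.encode("utf-8")
    return raw.ljust(width).decode("utf-8")

def pad_chars(cell, width):
    """Left-justify by character count, which is what the eye sees."""
    return cell.ljust(width)

# Column width measured in bytes versus characters.
byte_width = max(len(c.encode("utf-8")) for c in cells)
char_width = max(len(c) for c in cells)

for c in cells:
    # Bars drift when padding by bytes, line up when padding by characters.
    print("|" + pad_bytes(c, byte_width) + "|   |" + pad_chars(c, char_width) + "|")
```

Each accented letter here costs two bytes but only one character, so a cell with accents gets fewer padding spaces than it needs and its closing pipe lands to the left of the others.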

Earlier this month, Nils Schulte am Hülse sent me a fix for Unicode, which I thanked him for and incorporated in my BBEdit Text Filter. Now tables with Unicode characters get properly formatted:

| Cølumn 1 | Côlumn 2 | Colümn 3 |
|:---------|:--------:|---------:|
| fírst    |  sèçond  |    thîrd |
| column   |  column  |   column |
| left     |  center  |    right |

Nils also pointed out that my script assumed the separator line (the one with the hyphens between the column headings and the body of the table) couldn’t include spaces. This is incorrect for both MultiMarkdown and PHP Markdown Extra. Nils included a simple fix for this, too, so now tables like this

| Cølumn 1 | Côlumn 2 | Colümn 3 |
| -- | :--: | --: |
| fírst | sèçond | thîrd |
| column | column | column |
| left | center | right |

won’t generate errors.
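The separator test is a set comparison: a line counts as the separator if every character in it belongs to a small allowed set. A rough sketch in Python 3 follows. `NEW_ALLOWED` matches the string on Line 28 of the listing further down, but `OLD_ALLOWED` is only a guess at what the earlier version used, and `find_separator` is a hypothetical helper rather than part of the script.

```python
OLD_ALLOWED = "|:.-"    # assumption: the earlier set, with no space in it
NEW_ALLOWED = "|:.- "   # the set on Line 28 of the current script

def find_separator(lines, allowed):
    """Return the index of the first non-empty line made up only of
    characters in `allowed`, or None if there is no such line."""
    for i, line in enumerate(lines):
        if line and set(line).issubset(allowed):
            return i
    return None

spaced = ["| A | B |", "| -- | :--: |", "| x | y |"]
compact = ["|A|B|", "|--|:--:|", "|x|y|"]

print(find_separator(compact, OLD_ALLOWED))  # compact separator is found
print(find_separator(spaced, OLD_ALLOWED))   # spaced separator is missed
print(find_separator(spaced, NEW_ALLOWED))   # adding the space fixes it
```

Returning `None` when nothing matches is a choice made for this sketch. The script itself simply assumes a separator line exists.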

As I said, I thanked Nils for his improvements, but something was nagging at me. The separator issue was new to me—I had never written a table with spaces in the separator line and hadn’t even considered whether it was legal—but I thought I’d fixed the Unicode problem. And yet, when I applied the Normalize Table text filter in BBEdit, it didn’t work right until I incorporated Nils’s changes.

So I started to write a post to explain the changes and went searching through the vast and dusty ANIAT archives for links to the old table formatting entries. Which is when I found this one, in which I showed a solution to the Unicode problem as sent to me by reader Christoph Kepper. Apparently, I

  1. didn’t copy the right Normalize Table script when I switched from TextMate to BBEdit;
  2. forgot that I actually had a script that handled Unicode correctly; and
  3. didn’t realize any of this when I got Nils’s email.

I like to think that if it were my actual job to remember what I write I’d be better at it, but after episodes like this I’m not so sure. My grandfather used to tell me it’s hell to get old but there’s no good alternative.

Anyway, there is one silver lining. Nils’s Unicode solution is slightly shorter than Christoph’s in that it affects only one line of my original script. Here’s the current version:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import sys
 4:  
 5:  def just(string, type, n):
 6:      "Justify a string to length n according to type."
 7:      
 8:      if type == '::':
 9:          return string.center(n)
10:      elif type == '-:':
11:          return string.rjust(n)
12:      elif type == ':-':
13:          return string.ljust(n)
14:      else:
15:          return string
16:  
17:  
18:  def normtable(text):
19:      "Aligns the vertical bars in a text table."
20:      
21:      # Start by turning the text into a list of lines.
22:      lines = text.splitlines()
23:      rows = len(lines)
24:      
25:      # Figure out the cell formatting.
26:      # First, find the separator line.
27:      for i in range(rows):
28:          if set(lines[i]).issubset('|:.- '):
29:              formatline = lines[i]
30:              formatrow = i
31:              break
32:      
33:      # Delete the separator line from the content.
34:      del lines[formatrow]
35:      
36:      # Determine how each column is to be justified.
37:      formatline = formatline.strip(' ')
38:      if formatline[0] == '|': formatline = formatline[1:]
39:      if formatline[-1] == '|': formatline = formatline[:-1]
40:      fstrings = formatline.split('|')
41:      justify = []
42:      for cell in fstrings:
43:          ends = cell[0] + cell[-1]
44:          if ends == '::':
45:              justify.append('::')
46:          elif ends == '-:':
47:              justify.append('-:')
48:          else:
49:              justify.append(':-')
50:      
51:      # Assume the number of columns in the separator line is the number
52:      # for the entire table.
53:      columns = len(justify)
54:      
55:      # Extract the content into a matrix.
56:      content = []
57:      for line in lines:
58:          line = line.strip(' ')
59:          if line[0] == '|': line = line[1:]
60:          if line[-1] == '|': line = line[:-1]
61:          cells = line.split('|')
62:          # Put exactly one space at each end as "bumpers."
63:          linecontent = [ ' ' + x.strip() + ' ' for x in cells ]
64:          content.append(linecontent)
65:      
66:      # Append cells to rows that don't have enough.
67:      rows = len(content)
68:      for i in range(rows):
69:          while len(content[i]) < columns:
70:              content[i].append('')
71:      
72:      # Get the width of the content in each column. The minimum width will
73:      # be 2, because that's the shortest length of a formatting string and
74:      # because that matches an empty column with "bumper" spaces.
75:      widths = [2] * columns
76:      for row in content:
77:          for i in range(columns):
78:              widths[i] = max(len(row[i]), widths[i])
79:      
 80:      # Pad each cell to its column width and rejoin the cells with bars.
81:      formatted = []
82:      for row in content:
83:          formatted.append('|' + '|'.join([ just(s, t, n) for (s, t, n) in zip(row, justify, widths) ]) + '|')
84:      
85:      # Recreate the format line with the appropriate column widths.
86:      formatline = '|' + '|'.join([ s[0] + '-'*(n-2) + s[-1] for (s, n) in zip(justify, widths) ]) + '|'
87:      
88:      # Insert the formatline back into the table.
89:      formatted.insert(formatrow, formatline)
90:      
91:      # Return the formatted table.
92:      return '\n'.join(formatted)
93:  
94:          
95:  # Read the input, process, and print.
96:  unformatted = unicode(sys.stdin.read(), "utf-8")
97:  print normtable(unformatted)

Nils’s fix for the separator problem is in Line 28. His fix for the Unicode problem is in Line 96.
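The listing is Python 2, where `unicode(..., "utf-8")` turns the bytes from standard input into a character string. For anyone on Python 3, the same idea might look something like the sketch below. This is an assumption about how a port could go, not something from Nils or from my script, and `read_text` and `write_text` are made-up names.

```python
import io
import sys

def read_text(stream):
    """Read everything from a binary stream and decode it as UTF-8."""
    return stream.read().decode("utf-8")

def write_text(stream, text):
    """Encode text as UTF-8 and write it, with a trailing newline,
    to a binary stream."""
    stream.write((text + "\n").encode("utf-8"))

# In a filter, the binary streams would be the standard ones:
#     unformatted = read_text(sys.stdin.buffer)
#     write_text(sys.stdout.buffer, normtable(unformatted))

# A quick check with in-memory streams instead of stdin and stdout.
source = io.BytesIO("| fírst | sèçond |".encode("utf-8"))
sink = io.BytesIO()
write_text(sink, read_text(source))
print(sink.getvalue().decode("utf-8"), end="")
```

Going through `sys.stdin.buffer` and `sys.stdout.buffer` keeps the encoding explicit instead of depending on whatever locale the editor hands to the filter.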


  1. Strictly speaking, there is no table format in Markdown. Tables like you see in this post are available only in Markdown variants like MultiMarkdown or PHP Markdown Extra. Because I suspect that these variants, taken in aggregate, are more popular than Gruber’s One True Markdown, I use the name Markdown to refer to all of them.