Duh, BOM

September 15, 2010 at 7:15 AM by Dr. Drang

One day last week, I decided to make a plot of my weight, which I’ve been tracking daily on my iPhone since the beginning of the year. It turned out to be much harder than necessary because the note-taking app Elements fiddled with the file format without my knowledge.

Let’s begin at the beginning. Back in January, I started using Simplenote to record my weight every morning. When I switched to Elements a few weeks ago, I copied the data from the Simplenote website into a file called Weight.txt and saved it in my ~/Dropbox/Elements folder.

The file looks like this:

01/01/10   195.0
01/02/10   196.0
01/03/10   195.0
01/04/10   196.0
01/05/10   196.5
      <etc>

An mm/dd/yy date, followed by three spaces and my weight to the nearest half pound.

Until last week, I hadn’t plotted my weight. A plot wasn’t going to teach me anything I didn’t already know, I kept telling myself, but the temptation of over 200 data points sitting in that file was too much for me to resist forever. I fired up Gnuplot, entered a few commands,

set xdata time
set timefmt '%m/%d/%y'
plot 'Weight.txt' using 1:2 with lp

and got back an error message telling me I had an “illegal month.”

I dug into the documentation for timefmt to see if I’d made a mistake with that setting. I changed the separator from three spaces to a tab character. I turned on the Show Invisibles setting in TextMate. I did some searching on the internet. No resolution to the problem.

Finally, I did a hex dump of the Weight.txt file and saw this:

0000000: efbb bf30 312f 3031 2f31 3020 2020 3139  ...01/01/10   19
0000010: 352e 300a 3031 2f30 322f 3130 2020 2031  5.0.01/02/10   1
0000020: 3936 2e30 0a30 312f 3033 2f31 3020 2020  96.0.01/03/10   
0000030: 3139 352e 300a 3031 2f30 342f 3130 2020  195.0.01/04/10  
0000040: 2031 3936 2e30 0a30 312f 3035 2f31 3020   196.0.01/05/10 
0000050: 2020 3139 362e 350a 3031 2f30 362f 3130    196.5.01/06/10

There were three bytes at the beginning of the file, before the first data point. In hex, the bytes were EF, BB, and BF, the byte order mark for a UTF-8 file. I got rid of them and had no more trouble plotting the data.
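For a quicker check than reading a hex dump, a few lines of Python can compare the first three bytes of a file against the UTF-8 BOM. This is only a sketch; the function name `has_utf8_bom` is made up for illustration, and `codecs.BOM_UTF8` is the standard library's constant for the bytes EF BB BF.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    with open(path, "rb") as f:          # binary mode: no decoding happens
        return f.read(3) == codecs.BOM_UTF8
```

Calling `has_utf8_bom("Weight.txt")` on the file dumped above would return `True`.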

What’s a byte order mark? It’s a long story that has to do with the endianness of different computer systems.

Computers typically deal with data in multibyte chunks, known as words. Different processors use different byte orders in their words; some put the most significant byte (the one that represents the high parts of the number) in the highest memory address of the word, while some put it in the lowest memory address of the word. These are known as “little-endian” and “big-endian” systems, respectively, the names taken from the warring religions in Gulliver’s Travels.
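The two byte orders are easy to see in Python, whose integers can be serialized either way. The value below is arbitrary and chosen only so that each byte is recognizable.

```python
value = 0x0A0B0C0D  # a 32-bit word; 0x0A is the most significant byte

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a
```

The same four bytes appear in both results, just in opposite order, which is why a reader has to know which convention the writer used.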

Because Unicode encodings such as UTF-16 and UTF-32 store each character in units of two or four bytes, a file created on a big-endian system would have its bytes in the wrong order to be read on a little-endian system and vice versa. The byte order mark signifies which endianness the file uses; an opposite-endian computer can read the mark and flip the bytes to get the proper interpretation.
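As an illustration, here is a small Python sketch of how a BOM resolves the ambiguity in UTF-16. It relies on the standard `codecs` constants and on the fact that Python's generic `"utf-16"` decoder uses a leading BOM to choose the byte order; the one-letter sample text is arbitrary.

```python
import codecs

text = "A"  # U+0041

be = text.encode("utf-16-be")  # big-endian bytes, no BOM
le = text.encode("utf-16-le")  # little-endian bytes, no BOM
print(be.hex(), le.hex())      # 0041 4100

# With a BOM in front, the generic decoder works out the order by itself.
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
print(from_be == from_le == "A")  # True
```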

But why did the file have a BOM in the first place? TextMate doesn’t save UTF-8 files with a BOM (unless the BOM has already been placed there by another editor). My suspicion was that Elements was adding the BOM, but I didn’t have time to run tests—I’d already wasted enough time trying to figure out why the data wouldn’t plot.

This week, though, I confirmed my suspicion about Elements. If I put a text file that I knew was BOMless into ~/Dropbox/Elements and then edited it in Elements on my phone, a BOM would appear at the beginning of the file as soon as Dropbox synced. Delete the BOM on my computer and resave, edit the file again in Elements, and the BOM would reappear. No question that Elements was the culprit.

I sent a support email to Second Gear Software, complaining about this rude behavior. I can understand—though not approve of—putting a BOM in files first created in Elements, but it’s not cricket to go around sticking a BOM in an existing file that doesn’t have one. Second Gear replied that they’re investigating the issue for a future release.

Frankly, for most of the files I use Elements with, it doesn’t matter whether there’s a BOM or not; the files are human-readable either way. It’s only when a file is going to be read and processed by a program that the BOM matters. But for those files, it’s essential that there be no BOM.
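The failure is not specific to Gnuplot. The Python sketch below parses the first line of a BOM-prefixed weight file. The helper `parse_line` is hypothetical, written only for this example. Decoding as plain `"utf-8"` leaves the BOM in the text as an invisible U+FEFF character, so the date no longer matches its format; the standard `"utf-8-sig"` codec drops a leading BOM if one is present.

```python
from datetime import datetime

raw = b"\xef\xbb\xbf01/01/10   195.0\n"  # first line of a file with a BOM


def parse_line(line):
    """Split a 'mm/dd/yy   weight' line into a date and a float."""
    date_str, weight_str = line.split()
    return datetime.strptime(date_str, "%m/%d/%y").date(), float(weight_str)


# Plain UTF-8 decoding keeps the BOM, and the date parse fails.
try:
    parse_line(raw.decode("utf-8"))
    plain_ok = True
except ValueError:
    plain_ok = False

# "utf-8-sig" strips the BOM, so the same line parses cleanly.
day, weight = parse_line(raw.decode("utf-8-sig"))
print(plain_ok, day, weight)  # False 2010-01-01 195.0
```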

Here’s an extremely short Perl program for deleting the BOM from a file that has one. I call it debom.

#!/usr/bin/perl -i.bak -p
s/^\xef\xbb\xbf//;

Much of the work is done by the switches in the shebang line. The -i.bak switch tells Perl to edit the file in place and save a copy of the original with a .bak extension. The -p switch tells it to read every line of the input file, process it according to the body of the script, and then output the processed line. The one-line body simply deletes the three leading bytes of the BOM if it’s present.

If you want to live dangerously, you can delete the .bak from the -i switch, which will edit the file in place without making a backup of the original.
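For comparison, here is a rough Python sketch of the same idea. It is not a line-for-line translation of the Perl script: it looks only at the very start of the file, and it writes the backup only when it actually finds a BOM. The function name and the `backup_suffix` parameter are invented for this example.

```python
import codecs
import shutil


def debom(path, backup_suffix=".bak"):
    """Strip a leading UTF-8 BOM from `path`, keeping a backup copy.

    Returns True if a BOM was removed, False if the file had none.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False                           # nothing to do, file untouched
    shutil.copy2(path, path + backup_suffix)   # save the original first
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])   # everything after EF BB BF
    return True
```

Reading the whole file into memory is fine for small text files like this one; a very large file would call for a streaming approach.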

I mentioned above that the purpose of a byte order mark is to signal the byte order of the multibyte characters in Unicode files. But UTF-8 is unique among the various Unicode representations in that it doesn’t always use multibyte characters. ASCII characters, for example, are still encoded as a single byte. UTF-8 was designed this way so it could subsume the large corpus of existing ASCII files with no alteration. And when a character does need more than one byte, the encoding itself fixes the order of those bytes, so they come out the same on every machine.
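Both properties can be checked in a couple of lines of Python. The sample strings are arbitrary.

```python
ascii_text = "195.0"
# Pure ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True

accented = "\u00e9"  # e with an acute accent, U+00E9
# A non-ASCII character becomes a fixed two-byte sequence.
print(accented.encode("utf-8").hex())  # c3a9
```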

So is there really a need for a BOM in a UTF-8 file? No. In fact, the Unicode standard says

For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. Hence, there is no issue of big- versus little-endian byte order for data represented in UTF-8.

and

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
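The conversion case the standard mentions can be reproduced in Python. In this sketch, a UTF-16 byte string with a BOM is decoded with an explicit byte order, which keeps the BOM in the text as the character U+FEFF; re-encoding that text as UTF-8 turns it into the three bytes EF BB BF. The sample text is arbitrary.

```python
import codecs

utf16_data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# An explicit-endian decoder does not strip the BOM; it stays as U+FEFF.
text = utf16_data.decode("utf-16-le")

utf8_data = text.encode("utf-8")
print(utf8_data)  # b'\xef\xbb\xbfhi'
```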

It seems kind of stupid to add a byte order mark to a file for which there cannot be any byte order confusion, so you will not be surprised to learn that the practice of adding BOMs to UTF-8 files is being driven by Microsoft. In its online library, you’ll find this gem:

Always prefix a Unicode plain text file with a byte order mark, which informs an application receiving the file that the file is byte-ordered.

Even when the file is not byte-ordered. Ugh.

My guess is that Second Gear is adding the BOM so Elements doesn’t get crosswise with Windows text editors—Windows compatibility is usually the path of least resistance. But if you’re a Mac or Linux user, Elements is changing your files without your approval, and that’s something that should be resisted strongly.

And now it’s all this

I just said what I said and it was wrong
Or was taken wrong
