The voice in my head said "Unicode"
September 12, 2005 at 11:20 PM by Dr. Drang
Last Friday’s post about Attaché magazine contained three acute e (é) characters, which got me thinking about encoding and whether I’m doing things right. The Right Thing, of course, is to use Unicode (specifically UTF-8) so that every character I could ever need is available and I’m not so US- and Western Euro-centric.
I couldn’t do the Right Thing when I was using Linux, because my fave rave text editor (NEdit) didn’t do UTF-8. So I stuck with ISO Latin-1, and apart from a nagging feeling that the world was passing me by, everything was hunky dory.
When I first moved to the Mac, I thought “Now’s the time to go UTF-8.” But when I wrote my first Perl program on the Mac and tried to run it from the command line, I got a withering “cannot execute binary file” error message from the shell. So back to Latin-1 and the nagging feeling.
After the Attaché post, done in Latin-1, I decided to have a go at Unicode again. I learned that Unicode is the official encoding of the Web and that Latin-1 holdouts like me were to be looked down on with a combination of pity and derision. I also learned that UTF-8 files often have a byte order mark (BOM) at the beginning that identifies the file as UTF-8. The BOM gets in the way of the shell, which is looking for the shebang (#!) as the very first two bytes of the file. The BOM, however, doesn’t have to be present, and my new fave rave text editor (BBEdit) has a way of choosing to encode as UTF-8 without the BOM. (I wasn’t using BBEdit when I first changed over to the Mac, and my early attempts at UTF-8 that didn’t pass muster with the shell must have been BOMed.) So I set my default encoding for new files in BBEdit to UTF-8 sans BOM, and changed the encoding of the Attaché post and the surrounding blog template code to the same.
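For the curious, here is a small Python sketch of the BOM problem described above. It is only an illustration, not what BBEdit does internally: the helper names has_utf8_bom and strip_utf8_bom are made up for this example, and the one-line Perl script is a stand-in. It shows that a BOMed file starts with the three bytes EF BB BF, so the #! is no longer at the very front.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# A stand-in script (hypothetical content) saved two ways.
script = '#!/usr/bin/perl\nprint "Attaché\\n";\n'
with_bom = script.encode("utf-8-sig")   # "utf-8-sig" prepends the BOM
without_bom = script.encode("utf-8")    # plain UTF-8, no BOM

# With a BOM, the first two bytes are not "#!".
print(with_bom[:3].hex())       # the BOM bytes
print(with_bom[:2] == b"#!")    # False: the shebang is pushed back
print(without_bom[:2] == b"#!") # True: the shebang is where it belongs

# Stripping the BOM recovers the plain UTF-8 file.
print(strip_utf8_bom(with_bom) == without_bom)
```

Stripping is safe here because the BOM carries no text of its own; everything after it is ordinary UTF-8.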
That switch wrecked the blog, causing funny characters to show up wherever é was supposed to be. Ah, I thought, Safari is probably set to use Latin-1 by default. Wrong: Safari’s preferences were set to UTF-8. But the files were UTF-8! WTF-8? After a bit of nosing around in Google and a visit to http://web-sniffer.net, I learned that the server’s HTTP Content-Type header was declaring the charset as Latin-1, which forced the browser to interpret the pages that way no matter what its preferences said. A little more exploration of the blog templating code (blosxom, by the way) and I’m now Unicode all the way.
And feeling pretty righteous about it.
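If you want to see the funny characters for yourself, this Python sketch reproduces the mismatch: UTF-8 bytes read as if they were Latin-1. The header value is an assumed example of what a server might send, not a copy of what mine actually sent, and charset_from_content_type is a name invented for this illustration. It leans on the standard library's email.message.Message to pull the charset out of the header.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, or None if absent

# The page as saved on disk: UTF-8 bytes.
page_bytes = "Attaché".encode("utf-8")

# Hypothetical header declaring the wrong charset.
header = "text/html; charset=ISO-8859-1"
declared = charset_from_content_type(header)

# The browser trusts the header, so the two bytes of é become two characters.
garbled = page_bytes.decode(declared)
print(garbled)

# With a header that matches the file, the text comes through intact.
fixed_header = "text/html; charset=UTF-8"
restored = page_bytes.decode(charset_from_content_type(fixed_header))
print(restored)
```

The é is stored as the two bytes C3 A9 in UTF-8, and Latin-1 maps each of those bytes to its own character, which is why one accented letter turns into two strange ones.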