Good characters

I’m pretty happy with the way GMail filters spam. It captures about 85% of the spam I get, so I don’t waste bandwidth downloading junk. And the filter built into Apple’s Mail gets almost all the rest.

(As an aside, GMail makes handling email on two computers a breeze. I run Mail on my desktop machine at work, and it downloads from GMail and keeps local copies of important messages. My laptop, on the other hand, uses GMail directly and exclusively; I know that any work-related messages I send from my laptop will be picked up by the desktop machine the next time it checks mail. With this arrangement I have my important messages in two locations (actually three, because I back up my desktop every night), but I don’t have to worry about syncing mail between my two computers.)

But GMail could be improved with one simple addition: the ability to filter based on the character set used in the message. I cannot read anything written in Asian or Cyrillic characters, and no one I know would send me such a message, so it must be spam. Back in my Linux days, I used the procmail filter given in the Bogofilter FAQ to eliminate Asian spam before my spam filter even saw it. (Unfortunately, this clever filtering was all done locally, so there were no bandwidth savings.)

The character set used in an email message is supposed to be declared in the header; this is how an email client knows which font to use when displaying the message. If the Content-Type header is “text/plain” or “text/html,” a “charset=something” subheader will tell us when the message uses Asian characters (the something is usually “gb2312” or “big5”) or Cyrillic (where the something is typically “koi8-r” or “windows-1251”). For multipart messages, the header for each part will have this sort of information. The Google folks are generally considered the smartest working on the web today; they should be able to whip up a character set filter in no time.
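
To show how little is involved, here is a rough sketch of such a check in Python, using the standard library's email module. It is only an illustration of the idea, not anything GMail actually does: the function name uses_unwanted_charset and the UNWANTED_CHARSETS list are my own inventions, and the list holds just the four character sets mentioned above.

```python
from email import message_from_string

# Hypothetical block list: only the charsets named in the text above.
UNWANTED_CHARSETS = {"gb2312", "big5", "koi8-r", "windows-1251"}


def uses_unwanted_charset(raw_message, unwanted=UNWANTED_CHARSETS):
    """Return True if any part of the message declares an unwanted charset."""
    msg = message_from_string(raw_message)
    # walk() visits the message itself and then every nested part,
    # so single-part and multipart messages are handled the same way.
    for part in msg.walk():
        # get_content_charset() reads the charset parameter of the
        # Content-Type header and returns it lowercased, or None if absent.
        charset = part.get_content_charset()
        if charset in unwanted:
            return True
    return False
```

Feeding it the raw text of a message (headers and body together) gives a simple yes or no that a filter rule could act on. A message that declares no charset at all, or that lies about it, slips through, so this would complement a regular spam filter rather than replace it.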

I’ll bet if they did, their users would love GMail even more. I sure would.