Hi. We're running SpamAssassin 3.3.1, and following advice I've seen in the archives of this list and of spamassassin-dev (e.g., http://osdir.com/ml/spamassassin-dev/2009-07/msg00156.html), I am *not* using normalize_charset. Unfortunately, that makes filtering text in binary (multibyte) encodings almost impossible. Even when I can come up with a word I want to match, word and character boundaries don't fall at fixed byte offsets, so writing rules byte-by-byte means I'd need several alternative match strings, and I couldn't reliably match the first or last character of the phrase I'm after. For, say, Chinese, where words tend to be one or two characters long, that's a big problem. And that's on top of the alternative patterns needed to cover the non-Unicode encodings, of course.
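(To make that concrete, here's roughly the kind of rule I mean. The rule name and phrase are invented for illustration; the bytes are just the UTF-8 encoding of the two-character phrase 发票 ("fapiao", invoice):

  body     LOCAL_ZH_FAPIAO  /\xe5\x8f\x91\xe7\xa5\xa8/
  describe LOCAL_ZH_FAPIAO  Common Chinese spam phrase, matched as raw UTF-8 bytes
  score    LOCAL_ZH_FAPIAO  0.5

A GB2312 copy of the same phrase would need a second, entirely different byte sequence in the alternation, and neither pattern can safely be extended to the characters on either side of the phrase.)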

Anyway, my question is: is that advice still valid, either for 3.3.1 (the version packaged for Debian Squeeze) or for the latest stable release? And if it is, what do people tend to do to write rules for East Asian character sets (or, for that matter, for Western text deliberately encoded in binary charsets to make it harder to filter)? The traffic on the bug report quoted in the message above is kind of ambiguous.
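(For reference, the option at issue is just the single line one would set in local.cf, which we currently leave off:

  normalize_charset 1

What I'm trying to find out is whether turning it on is safe these days.)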

(I will note that ok_languages and ok_locales are pretty useless here, at least for site-wide use, since we have users with correspondence in pretty much any language we've ever seen spam in.)
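(Concretely, the only site-wide setting that's safe for us is effectively the default:

  ok_languages all
  ok_locales   all

which of course defeats the purpose of those checks.)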

Jay

--
Jay Sekora
Linux system administrator and postmaster,
The Infrastructure Group
MIT Computer Science and Artificial Intelligence Laboratory
