Hi. We're running SpamAssassin 3.3.1, and following advice I've seen in the archives of this list and of spamassassin-dev (e.g., http://osdir.com/ml/spamassassin-dev/2009-07/msg00156.html), I am *not* using normalize_charset. Unfortunately, that makes filtering text in binary (multibyte) encodings almost impossible. Even when I can come up with a word I want to match, word and character boundaries don't fall at fixed byte offsets, so writing rules byte-by-byte means I'd need several alternative match strings, and I couldn't reliably match the first or last character of the phrase I'm after. For, say, Chinese, where words tend to be one or two characters long, that's a big problem. And that's on top of the alternative patterns needed to cover the non-Unicode encodings, of course.
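(To make that concrete, here's roughly the kind of rule I mean. The rule name and phrase are invented for illustration; the bytes are just the UTF-8 encoding of the two-character phrase 发票 ("fapiao", invoice):

  body     LOCAL_ZH_FAPIAO  /\xe5\x8f\x91\xe7\xa5\xa8/
  describe LOCAL_ZH_FAPIAO  Common Chinese spam phrase, matched as raw UTF-8 bytes
  score    LOCAL_ZH_FAPIAO  0.5

A GB2312 copy of the same phrase would need a second, entirely different byte sequence in the alternation, and neither pattern can safely be extended to the characters on either side of the phrase.)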

Anyway, my question is: is that advice still valid, either for 3.3.1 (the version packaged for Debian Squeeze) or for the latest stable release? And if it is, what do people tend to do to write rules for East Asian character sets (or, for that matter, for Western text deliberately encoded in binary charsets to make it harder to filter)? The traffic on the bug report quoted in the message above is kind of ambiguous.
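(For reference, the option at issue is just the single line one would set in local.cf, which we currently leave off:

  normalize_charset 1

What I'm trying to find out is whether turning it on is safe these days.)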

(I will note that ok_languages and ok_locales are pretty useless here, at least for site-wide use, since we have users with correspondence in pretty much any language we've ever seen spam in.)
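(Concretely, the only site-wide setting that's safe for us is effectively the default:

  ok_languages all
  ok_locales   all

which of course defeats the purpose of those checks.)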

Jay

--
Jay Sekora
Linux system administrator and postmaster,
The Infrastructure Group
MIT Computer Science and Artificial Intelligence Laboratory
