On Tue, 24 Jul 2012 08:36:53 -0400
David F. Skoll wrote:

> On Tue, 24 Jul 2012 09:41:19 +0200
> Simon Loewenthal <si...@klunky.co.uk> wrote:
>
> > I have Bayes correctly scoring BAYES_99 on Dutch and French
> > straight out of the box.  No problems.

Dutch, French, etc. are very similar to English, with most characters
being compatible with ASCII.

> It does work, but with a caveat: SpamAssassin does not normalize the
> character set.  So if you train it on Chinese in the GB2312 character
> set, that will do nothing for you if you receive UTF-8 Chinese spam.
> Furthermore, if some random character set A and another random
> character set B share byte sequences, your Bayes training may confuse
> them.
>
> Also, I don't believe SpamAssassin has any type of logic for
> recognizing word boundaries in ideographic character sets vs.
> alphabetic ones.

There's also a problem with non-Roman alphabets represented with
multibyte characters, whereby the maximum token length (15) is hit on
relatively short words. There is some attempt to work around this by
converting such tokens into byte pairs.

> Bayes is pretty robust, so it "works" in the face of a lot of noise,
> but SA's implementation still leaves quite a bit to be desired.

In most spam aimed at English speakers, spammers avoid leaving any
useful tokens in the text, and Bayes still works with headers and
mark-up.
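
To make the character-set and token-length points concrete, here is a
rough Python sketch. It is a toy, not SpamAssassin's tokenizer: the
helper names (byte_tokens, too_long), the sample words, and the
assumption that the limit of 15 is counted in bytes are all mine, for
illustration only.

```python
# Toy illustration only; this is NOT SpamAssassin's tokenizer.
# Assumption: the token length limit of 15 is counted in bytes.
MAX_TOKEN_LEN = 15


def byte_tokens(text, charset):
    """Encode text in the given charset and split on ASCII whitespace,
    the way a tokenizer that never normalizes the charset would see it."""
    return set(text.encode(charset).split())


def too_long(word, charset="utf-8"):
    """True if the encoded word exceeds the assumed byte limit."""
    return len(word.encode(charset)) > MAX_TOKEN_LEN


# The same two Chinese words, as they would arrive in two charsets.
chinese = "免费 发票"
gb = byte_tokens(chinese, "gb2312")
utf = byte_tokens(chinese, "utf-8")

# No byte-level token is shared, so training on one encoding
# teaches nothing about the other.
shared = gb & utf

# A 10-letter Russian word takes 20 bytes in UTF-8 and exceeds the
# limit, while the 11-letter English equivalent fits easily.
russian_hits_limit = too_long("информация")
english_hits_limit = too_long("information")

if __name__ == "__main__":
    print("shared tokens:", shared)
    print("Russian word over limit:", russian_hits_limit)
    print("English word over limit:", english_hits_limit)
```

Decoding every message to a single internal representation before
tokenizing would make the two token sets identical, which is the
normalization step described above as missing.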