Joshua, C.S. Chen wrote:
> Hello folks,
> My users speak Chinese. I found that spamassassin seems not working well
> about chinese chset (utf8 or big5) on the bayes issue. Many normal mails
> (almost) get BAYES_99 score although the real spam also get BAYES_99. It
> looks like foreign language like Chinese is very easy to be high bayes
> scored.
> I have setup ok_locales all but it doesn't help the false-positive problem.
>
> And another question: just wonder what if I do sa-learn --dump? Am I
> supposed to see the phrase that SA has learned?

In SA 2.6x or older, yes. In SA 3.0.0 or higher, no.

First, "phrases" isn't quite accurate. Bayes stores tokens, and most of the tokens are simply words, not phrases.

In SA 3.0.0 or higher the text tokens themselves are not stored; only the SHA1 hash of each one is. A hash cannot easily be reversed to figure out what the text token was, but it's easy to compute the hash of another token and compare the two. Thus it's impossible for dump to display the text tokens: it doesn't know what they are.

The main reason to do this in SA 3.x is performance. All the SHA hashes are the same size. No more variable-length string compares, just straight fixed-width binary compares. Ditto for record reads.

A side effect is increased security: nobody can look at your bayes DB and make assumptions about what your email conversations talk about.

If you want to see the text tokens that match bayes for a particular message, you can do this by feeding the message to spamassassin in bayes debug mode:

    spamassassin -D bayes=255 <message.txt

That should let you know which tokens in the message are matching bayes, and what probability each gets (from 0.0000 to 1.0000, which represents 0% to 100%).

Word of advice: if you see a LOT of innocuous words matching in the range of 0.90-1.0, you can worry. But do not worry about every single word that seems "wrong". A typical message will match a dozen or more tokens.

All that said, how do you fix it? Feed your problem messages to sa-learn --ham. If it's really bad, wipe your bayes DB and start over.

> some key phrases, words
> in the spam mails? If so, can I see some chinese phrases?
>

I've never tried, but the above should work for Chinese text, provided your local terminal supports it.
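P.S. To make the hashed-token point concrete, here is a rough Python sketch of the idea. It is not SpamAssassin's code and says nothing about its actual on-disk format; the `token_hash` and `lookup` helpers, the in-memory `db` dict, and the counts in it are all made up for illustration. It only shows why a store keyed by SHA1 digests can answer "have I seen this token?" without being able to list the tokens, and that a Chinese token (UTF-8 encoded here, by assumption) is handled the same way as an ASCII one.

```python
import hashlib


def token_hash(token: str) -> bytes:
    """Return the SHA-1 digest of a token, always 20 bytes long.

    Encoding the token as UTF-8 is an assumption made for this sketch.
    """
    return hashlib.sha1(token.encode("utf-8")).digest()


# Hypothetical stand-in for a Bayes store: digest -> (spam_count, ham_count).
# The counts are invented; only the digests are kept, never the token text.
db = {
    token_hash("viagra"): (42, 0),
    token_hash("會議"): (1, 17),
}


def lookup(token: str):
    """Hash a candidate token and look the digest up; None if never learned."""
    return db.get(token_hash(token))


if __name__ == "__main__":
    for candidate in ("viagra", "會議", "hello"):
        # Every key has the same width, whatever the length of the token.
        print(candidate, len(token_hash(candidate)), lookup(candidate))
```

Going the other way, from a digest back to the word, means guessing candidate tokens and hashing each one, which is why a dump has no text to show you.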