Re: BAYES_99 makes lots of false-positive

Matt Kettler Thu, 13 Jul 2006 07:14:17 -0700

Joshua, C.S. Chen wrote:
> Hello folks,
> My users speak Chinese. I found that spamassassin seems not working well
> about chinese chset (utf8 or big5) on the bayes issue. Many normal mails
> (almost) get BAYES_99 score although the real spam also get BAYES_99. It
> looks like foreign language like Chinese is very easy to be high bayes
> scored.
> I have setup ok_locales all but it doesn't help the false-positive problem.
>
> And another question: just wonder what if I do sa-learn --dump? Am I
> supposed to see the phrase that SA has learned? 
In SA 2.6x or older, yes. In SA 3.0.0 or higher, no.


First, "phrases" isn't quite accurate. Bayes stores tokens, and most of
the tokens are simply words, not phrases.

In SA 3.0.0 or higher the text tokens themselves are not stored; only
a SHA1 hash of each one is. A hash cannot easily be reversed to recover
the original text token, but it's easy to compute the hash of another
token and compare the two. Thus, dump cannot display the text tokens,
because it doesn't know what they are.

The main reason to do this in SA 3.x is performance. All the SHA hashes
are the same size, so there are no more variable-length string
compares, just straight fixed-width binary compares. Ditto for record
reads. A side effect is increased security: nobody can look at your
bayes DB and make assumptions about what your email conversations talk
about.
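
To illustrate the idea of a hashed token store, here is a minimal Python
sketch. It is not SpamAssassin's code, and SA's actual on-disk format is
not reproduced here. The names token_key, db, and lookup, and the counts
in the toy table, are made up for the example. It only shows that every
key has the same width whatever the token's length or language, and that
a lookup works by hashing the candidate token and comparing hashes.

```python
import hashlib

def token_key(token: str) -> bytes:
    """Return a fixed-width key for a token.

    Assumption for this sketch: the key is the full 20-byte SHA-1 digest
    of the UTF-8 encoded token. SpamAssassin's real storage may differ.
    """
    return hashlib.sha1(token.encode("utf-8")).digest()

# Hypothetical toy database: hashed token -> (spam_count, ham_count).
# Only the hashes are kept; the original words are not stored anywhere.
db = {
    token_key("viagra"): (42, 0),
    token_key("meeting"): (1, 37),
}

def lookup(token: str):
    """Hash the candidate token and compare it against the stored keys."""
    return db.get(token_key(token))

if __name__ == "__main__":
    # Every key has the same width, so comparisons are fixed-size.
    print(len(token_key("a")), len(token_key("a much longer token")))
    print(lookup("meeting"))  # known token: counts are found
    print(lookup("lunch"))    # unknown token: nothing stored
```

Listing the keys of db only yields opaque bytes, which is why a dump of
such a store cannot show readable words.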

If you want to see the text tokens that match bayes for a particular
message, you can do this by feeding a message to spamassassin in bayes
debug mode:

spamassassin -D bayes=255 <message.txt

That should let you know which tokens in the message are matching bayes,
and what probability each gets (from 0.0000 to 1.0000, which represents
0% to 100%).

Word of advice: if you see a LOT of innocuous words matching in the
range of 0.90-1.0, you have reason to worry. But do not worry about
every single word that seems "wrong". A typical message will match a
dozen or more tokens.

All that said, how do you fix it? Feed the legitimate messages that were
wrongly tagged to "sa-learn --ham". If it's really bad, wipe your bayes
DB and start over.



> some key phrases, words
> in the spam mails? If so, can I see some chinese phrases?
>   
I've never tried, but the above should work for Chinese text, provided
your local terminal supports it.
