Re: BAYES_99 makes lots of false-positive

Joshua, C.S. Chen Thu, 13 Jul 2006 23:45:58 -0700

Matt Kettler wrote:

In sa 2.6x or older, yes.. in sa 3.0.0 or higher, no.

First, phrases isn't quite accurate.. bayes stores tokens, and most of
the tokens are simply words, not phrases.


In SA 3.0.0 or higher the text tokens themselves are not stored, only
the SHA1 hash of them is stored. This cannot be easily reversed to
figure out what the text token was, but it's easy to figure out the hash
of another token and compare the two. Thus, it's impossible for dump to
display the text tokens; it doesn't know what they are.

The main reason to do this in SA 3.x is performance. All the SHA hashes
are the same size. No more variable-length string compares, just
straight fixed-width binary compares. Ditto for record reads. A side
effect is increased security.. nobody can look at your bayes DB and make
assumptions about what your email conversations talk about.
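
A minimal sketch of the hash-and-compare idea described above, in Python.
It assumes the full 20-byte SHA-1 digest and UTF-8 encoding; the names
token_key, stored_keys and is_stored are made up for illustration, and
SpamAssassin's real tokenizer and on-disk format may well differ.

```python
import hashlib

def token_key(token: str) -> bytes:
    """Hash a token to a fixed-width key; the text itself is never stored."""
    return hashlib.sha1(token.encode("utf-8")).digest()

# Hypothetical in-memory stand-in for the token database: digests only.
stored_keys = {token_key(t) for t in ("test", "tokens", "垃圾郵件")}

def is_stored(candidate: str) -> bool:
    """Check a candidate by hashing it and comparing digests."""
    return token_key(candidate) in stored_keys

print(is_stored("tokens"))             # a token that was stored
print(is_stored("cheers"))             # a token that was not
print({len(k) for k in stored_keys})   # every key has the same width
```

The store can answer "is this token known?" for any candidate, but nothing
in it allows listing the original words, whatever their length or script.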

Thanks Matt, for the details.

If you want to see the text tokens that match bayes for a particular
message, you can do this by feeding a message to spamassassin in bayes
debug mode..

spamassassin -D bayes=255 < message.txt

some key phrases, words
in the spam mails? If so, can I see some Chinese phrases?

I've never tried, but the above should work for Chinese text, provided
your local terminal supports it.

That should let you know which tokens in the message are matching bayes,
and what probability each gets (from 0.0000 to 1.0000, which represents
0% to 100%).

Word of advice: if you see a LOT of innocuous words matching in the
range of 0.90-1.0 you can worry. But do not worry about every single
word that seems "wrong". A typical message will match a dozen or more
tokens.
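
As a rough illustration of that rule of thumb, the Python sketch below
picks out tokens whose probability falls in the 0.90-1.0 range. The
function name, the dict layout and the sample values are all assumptions
made up for the example; they are not taken from a real debug run, and
extracting the pairs from the debug output is left out.

```python
def suspicious_tokens(token_probs, threshold=0.90):
    """Return (token, probability) pairs at or above threshold, highest first."""
    hits = [(tok, p) for tok, p in token_probs.items() if p >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Illustrative values only, not output from SpamAssassin.
sample = {"test": 0.12, "cheers": 0.95, "tokens": 0.47, "details": 0.99}

for tok, p in suspicious_tokens(sample):
    # Show both the raw probability and the percentage it represents.
    print(f"{tok}: {p:.4f} ({p:.0%})")
```

A handful of hits is nothing to act on; it is a long list of ordinary
words here that suggests the database was trained badly.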

All that said, how do you fix it? Feed your problem messages to sa-learn
--ham. If it's really bad, wipe your bayes DB and start over.

It sounds great to be able to see which tokens match those in the bayes db.
I tried a test message with -D bayes=255 like

$ spamassassin -D bayes=255 < /tmp/message
>From [EMAIL PROTECTED] Fri Jul 14 10:32:01 2006
Return-Path: <[EMAIL PROTECTED]>
X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on asiaa.sinica.edu.tw
X-Spam-Level:
X-Spam-Status: No, score=-102.2 required=6.0 tests=ALL_TRUSTED,AWL,
        FROM_IAA_LOCAL_SITE1,USER_IN_WHITELIST autolearn=no version=3.1.0
Received: from [140.109.177.202] (genesis.asiaa.sinica.edu.tw [140.109.177.202])
        by asiaa.sinica.edu.tw (8.13.1/8.13.1) with ESMTP id k6E2VqVw011774
        for <[EMAIL PROTECTED]>; Fri, 14 Jul 2006 10:31:52 +0800
Message-ID: <[EMAIL PROTECTED]>
Date: Fri, 14 Jul 2006 10:31:52 +0800
From: "Joshua, C.S. Chen" <[EMAIL PROTECTED]>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.13) Gecko/20060418 Red Hat/1.7.13-1.4.1
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: =?Big5?B?rEyswA==?= <[EMAIL PROTECTED]>
Subject: test for spamassassin -D bayes=255
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: by amavisd-new
X-Keywords:
X-UID: 9719
Status: O
Content-Length: 88
Lines: 4

This is a test. How I want to see the tokens' details that bayes thinks.

Cheers
Joshua

It just showed the original message, not the tokens and probabilities. Am I missing something here?

Thanks very much

Cheers
Joshua
