Re: [SAtalk] Bayes and whitelisting

Carlo Wood Tue, 02 Sep 2003 07:13:58 -0700

On Tue, Sep 02, 2003 at 01:02:47PM +1200, Simon Byrnand wrote:
> Well don't forget that the auto_learn_spam threshold is 15 in 2.55 and 12 
> in 2.60, and its very rare that a spam pasted into the *body* of a message 
> will be autolearnt, simply because most high scoring spams get most of 
> their high score from headers tests, which fail to match when the message 
> is pasted into the body of a new message.


I set my auto-learn threshold at 4.0, not 15.

I do that because I want to learn as much spam as possible,
*especially* the spam that comes close to escaping detection.

My true ham never scores higher than 2.0 (98% of it scores 0 or lower),
so using 4.0 is very safe for me.
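
To make the threshold argument concrete, here is a minimal sketch in Python (not SpamAssassin code) of a score-only auto-learn decision. The function name, the `ham_threshold` value, and the use of `>=`/`<=` are assumptions for illustration; only the 4.0 and 15 spam thresholds come from this thread.

```python
def autolearn_decision(score, spam_threshold=4.0, ham_threshold=0.0):
    """Return 'spam', 'ham', or None (do not learn) for a message score.

    Assumption: the decision depends on the total score only.
    ham_threshold=0.0 is a hypothetical value, not taken from the thread.
    """
    if score >= spam_threshold:
        return "spam"
    if score <= ham_threshold:
        return "ham"
    # Scores between the two thresholds are left alone.
    return None


# A borderline spam scoring 4.5 is learned with a threshold of 4.0,
# but would be skipped with a threshold of 15.
low = autolearn_decision(4.5, spam_threshold=4.0)
high = autolearn_decision(4.5, spam_threshold=15.0)
```

With ham that never scores above 2.0, nothing in the 2.0 to 4.0 gap is learned as spam, which is the safety margin described above.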

Anyway, there should be a way for me to make sure
that certain mails are never auto-learned, regardless of
their score.

[...]
> (which I've never seen happen in my time on the list) I think you'll find 
> bayesian classifiers don't work the way you probably think they do. All 
> they do is tokenize a message based on word boundries, and build word 
> statistics based on that. They do *not* learn specific messages or phrases 
> as spam or ham, they simply count the statistical prevalence of words.

Yes, so if this message is marked as spam, ALL of it is marked as spam.
That means the filter will think that 'boundries' (see your quoted text
above) is a word that has appeared in at least one spam post.
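
A minimal sketch of that effect, assuming a toy word-count learner in Python (the tokenizer regex and the `learn` helper are hypothetical, not how SpamAssassin actually tokenizes): learning one message as spam bumps the spam count of every token in it, including words that only appeared in a quoted reply.

```python
import re
from collections import Counter

spam_counts = Counter()  # token -> number of spam messages containing it
ham_counts = Counter()   # token -> number of ham messages containing it


def tokenize(text):
    """Split on word boundaries; each token counts once per message."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def learn(text, is_spam):
    """Add every token of the message to the spam or ham counts."""
    target = spam_counts if is_spam else ham_counts
    target.update(tokenize(text))


# A list reply that quotes harmless text, wrongly auto-learned as spam:
learn("they tokenize a message based on word boundries", is_spam=True)
```

After this call `spam_counts["boundries"]` is 1 while `ham_counts["boundries"]` is still 0, so the word now leans toward spam even though no spammer ever used it.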

> >We really need a way to stop certain (white listed) mail to be auto-learned
> >at all as spam or ham - ever.
> 
> Maybe, maybe not. I think perhaps you're looking for a problem where there 
> isn't one. Do you have any examples of this in your own collection of 
> received emails from this list ? Do you have even one that says 
> "autolearn=spam" for a message from this list that was just someone pasting 
> a copy of a spam in a new message ? If you do, you might have a case, 
> otherwise...

I certainly do - spammers will eventually learn how to get their
mail headers right.  What we need to aim for is detecting spam by looking
at the body.  Before I started using SA I had hand-crafted (procmail)
filters that caught almost all spam based on the body text;  I switched
to SA because my filters took about 1 minute to process a 100 kB post,
which was starting to annoy me (it froze my PC).

So I moved the filtering to my firewall, which had some CPU cycles to
spare, and started to use SA with the idea of eventually adding my rules
to it in local.cf.

At this moment my problem is simply that I can write simple rules
that detect whether or not I want a mail to be auto-learned by the Bayesian
engine, but there is no support for acting on that.

I think I can work around this by abusing the existing tflags, but
it would be better if a new flag were added for this particular purpose.
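
What such a flag could look like, as a minimal Python sketch rather than actual SpamAssassin behaviour: the `block_autolearn` attribute, the `Rule` class, and the rule names are all hypothetical and only illustrate the idea of a rule vetoing auto-learning regardless of score.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    score: float
    block_autolearn: bool = False  # hypothetical flag, not an existing tflag


def should_autolearn(hit_rules, spam_threshold=4.0):
    """Return 'spam' or None; any hit rule carrying the flag vetoes learning."""
    if any(rule.block_autolearn for rule in hit_rules):
        return None
    total = sum(rule.score for rule in hit_rules)
    return "spam" if total >= spam_threshold else None


# A list post quoting a spam body: body rules push it over 4.0, but the
# (hypothetical) list-detection rule keeps it out of the Bayes database.
hits = [
    Rule("BODY_SPAM_PHRASE", 3.0),
    Rule("BODY_SPAM_URL", 2.0),
    Rule("FROM_TRUSTED_LIST", 0.0, block_autolearn=True),
]
decision = should_autolearn(hits)
```

The veto rule scores 0.0 here, so it leaves the spam/ham verdict untouched and only affects whether the message is learned.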

-- 
Carlo Wood <[EMAIL PROTECTED]>


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
