Re: [SAtalk] Bayes and whitelisting

Simon Byrnand Tue, 02 Sep 2003 02:00:56 +0000

At 02:24 2/09/2003 +0200, Carlo Wood wrote:

On Tue, Sep 02, 2003 at 09:56:54AM +1200, Simon Byrnand wrote:
> On the other hand, there is nothing to stop the message being autolearnt if
> its score before the whitelisting value is added, so for example if a spam
> would normally score 20 and be autolearnt, and you for some reason
> whitelisted the spammer, their final score would be -80 and the message
> would not be tagged as spam, but it WOULD be learnt as spam. (Because it is
> spam)

:(

That is not what I want though.
Take for example this mailinglist, this very mail, it is full of words
like "whitelist", "SpamAssassin", "autolearnt", "score", "man pages"
etc.

Yes, so what ? :)

  If you included a SPAM as example (quite possible on this list,
and the reason why I whitelist it) then I still don't want it to be
autolearnt: that would mean that the mentioned words get tagged as
spammy, and they are not.

Well, don't forget that the auto_learn_spam threshold is 15 in 2.55 and 12 in 2.60, and it's very rare that a spam pasted into the *body* of a message will be autolearnt, simply because most high-scoring spams get most of their score from header tests, which fail to match when the message is pasted into the body of a new message.

I don't whitelist this mailing list, and I know at least one of the developers (Justin) doesn't either, and I don't have problems. I very rarely see a message from this list with a sample spam in it that scores above my threshold of 5, and I've *never* seen one score high enough to be autolearnt. (> 15 not counting bayes)
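
To make the scoring arithmetic in this thread concrete, here is a minimal sketch of the two separate decisions being discussed: tagging uses the final score (whitelist included), while autolearning uses the score without the whitelist adjustment. This is only an illustration of the behaviour described above, not SpamAssassin's actual code. The function name, the rule names other than USER_IN_WHITELIST, and the use of ">=" for the comparisons are assumptions.

```python
# Hypothetical sketch of the behaviour described in this thread.
# Not SpamAssassin's implementation; names and structure are made up.

AUTOLEARN_SPAM_THRESHOLD = 15.0  # per the thread: 15 in 2.55, 12 in 2.60
REQUIRED_SCORE = 5.0             # the tagging threshold mentioned in the thread


def classify(rule_hits):
    """rule_hits is a list of (rule_name, score, counts_for_autolearn)."""
    # The final score includes every rule that hit, whitelist included.
    final_score = sum(score for _, score, _ in rule_hits)
    # The autolearn score skips rules flagged as not counting (the whitelist).
    learn_score = sum(score for _, score, counts in rule_hits if counts)
    return {
        "final_score": final_score,
        "learn_score": learn_score,
        "tagged_as_spam": final_score >= REQUIRED_SCORE,
        "autolearn_as_spam": learn_score >= AUTOLEARN_SPAM_THRESHOLD,
    }


hits = [
    ("ORDINARY_TESTS", 20.0, True),        # made-up aggregate of normal rules
    ("USER_IN_WHITELIST", -100.0, False),  # whitelist entry, ignored for autolearn
]
result = classify(hits)
print(result)
```

With the numbers from the example earlier in the thread (a spam scoring 20, then whitelisted at -100), the message ends up at -80 and is not tagged, yet its autolearn score of 20 is still over the threshold.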

Learning mails that *discuss* spam as being spam will make the Bayesian
classifier less accurate.

Are you quite sure about that ? Let's assume for a moment that a spam reposted to the list in the body of a message managed to score over 15 (which I've never seen happen in my time on the list). I think you'll find Bayesian classifiers don't work the way you probably think they do. All they do is tokenize a message based on word boundaries, and build word statistics based on that. They do *not* learn specific messages or phrases as spam or ham, they simply count the statistical prevalence of words.
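
As a rough illustration of the "count the prevalence of words" idea, here is a small sketch of a token counter. It uses a simplified per-token ratio and is not the algorithm SpamAssassin's Bayes code uses. The class name, the tokenizer regex, and the sample messages and counts are all made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split on word boundaries; a deliberately crude tokenizer."""
    return re.findall(r"\w+", text.lower())


class TokenStats:
    """Counts how many spam and ham messages each token appeared in."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spamminess(self, token):
        """Simplified ratio: 0.0 is hammy, 1.0 is spammy, 0.5 is unknown."""
        s = self.spam[token] / self.nspam if self.nspam else 0.0
        h = self.ham[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return 0.5
        return s / (s + h)


stats = TokenStats()
for _ in range(50):  # ordinary list traffic, learnt as ham
    stats.learn("whitelist score autolearnt spamassassin", is_spam=False)
for _ in range(9):   # ordinary spam
    stats.learn("buy cheap pills now", is_spam=True)
# one list message quoting a spam, (mis)learnt as spam
stats.learn("whitelist score buy cheap pills now", is_spam=True)

print(stats.spamminess("whitelist"), stats.spamminess("pills"))
```

In this toy data set a single mislearnt list message leaves "whitelist" strongly on the ham side, because the token still shows up in every ham message and in only one spam out of ten. Whether that holds for a real database depends on how much ham from the list has been learnt.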

We really need a way to stop certain (whitelisted) mail from being
auto-learned at all as spam or ham - ever.

Maybe, maybe not. I think perhaps you're looking for a problem where there isn't one. Do you have any examples of this in your own collection of received emails from this list ? Do you have even one that says "autolearn=spam" for a message from this list that was just someone pasting a copy of a spam in a new message ? If you do, you might have a case, otherwise...
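
One way to check is to look for the autolearn result in the headers of saved list mail. The sketch below pulls that value out of a single raw message. It assumes the result is recorded as "autolearn=..." in an X-Spam-Status header; the function name and the sample message (addresses, rule name, scores) are invented for illustration.

```python
import email
import re


def autolearn_verdict(raw_message):
    """Return the autolearn value (e.g. 'spam', 'ham', 'no') or None."""
    msg = email.message_from_string(raw_message)
    # Assumption: the autolearn result lives in the X-Spam-Status header.
    status = msg.get("X-Spam-Status", "")
    match = re.search(r"autolearn=(\w+)", status)
    return match.group(1) if match else None


# Made-up sample message for demonstration only.
sample = (
    "From: someone@example.org\n"
    "Subject: Re: [SAtalk] sample spam inside\n"
    "X-Spam-Status: No, hits=2.1 required=5.0 tests=SOME_RULE autolearn=no\n"
    "\n"
    "Here is the spam I received...\n"
)

print(autolearn_verdict(sample))
```

Running this over a folder of list mail and counting the messages that return "spam" would show whether the problem actually occurs in practice.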

> There was a "bug" in older versions where I think the man pages were
> installed to a *different* path than older versions of SpamAssassin, so if
> you started with 2.4x and then updated later to 2.5x (I think that was it)
> then your man command could be looking at the old version.
>
> Have a search in the various man paths on your system and see if you can
> find old versions of the man pages and delete them...

Thanks!  That was indeed the case!
I now removed the old man pages.

Good :)

> >Your webpage http://www.spamassassin.org/doc/Mail_SpamAssassin_Conf.html
> >says
>
> It's not "our" web page, both Kai and myself are end users, just like you :)

Then where should I post when I want to convince the maintainers of SA
to never autolearn a whitelisted mail? :)

Well, the developers read this list so they're no doubt following this message thread...

Mails on a list like this can easily score as spam, but - in the same
message - might contain a lot of valuable words/tokens and that does
screw up the Bayesian database when learning it as spam.

Perhaps a 'never_autolearn_whitelist_to' should be added?

Maybe in theory... although personally I think it's not needed in practice...

Perhaps the developers will have a comment on this.

Regards,
Simon

