At 14:20 2/09/2003 +0200, Carlo Wood wrote:
On Tue, Sep 02, 2003 at 01:02:47PM +1200, Simon Byrnand wrote:
> Well don't forget that the auto_learn_spam threshold is 15 in 2.55 and 12
> in 2.60, and it's very rare that a spam pasted into the *body* of a message
> will be autolearnt, simply because most high-scoring spams get most of
> their high score from header tests, which fail to match when the message
> is pasted into the body of a new message.

I set my auto-learn threshold at 4.0, not 15.

I do that because I want to learn as many spams as possible,
*especially* learn those that are close to failure (to be detected).

Oh dear.


That's just *asking* for trouble. There is a very good reason that 15 (or 12) is the default for autolearning. The whole point of the autolearning thresholds is that you need to *guarantee* (to 99.9% or so) that you are NOT learning ham as spam or vice versa.

By setting your autolearn threshold so low, you're probably contaminating your bayes database with incorrectly learnt messages.

Also, you're probably not aware that there is a 4-point safety margin around your required_hits for exactly this reason: if your required_hits is set to 5, then autolearn will not learn any spam scoring below 9, regardless of what you set the threshold to. This is to try to stop you shooting yourself in the foot, which you're trying to do :)
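
To make the arithmetic concrete, here is a minimal Python sketch of that margin as I've described it. It is not SpamAssassin's actual code: the names effective_spam_threshold, would_autolearn_spam and SAFETY_MARGIN are made up for illustration, and the 4-point figure is simply the one quoted above.

```python
# Illustrative only: models the safety margin as described in this message,
# not SpamAssassin's real implementation. All names here are hypothetical.
SAFETY_MARGIN = 4.0  # assumed margin above required_hits


def effective_spam_threshold(required_hits, configured_threshold):
    """Lowest score at which a message would be autolearnt as spam."""
    # The configured threshold cannot pull the cutoff below
    # required_hits + SAFETY_MARGIN.
    return max(configured_threshold, required_hits + SAFETY_MARGIN)


def would_autolearn_spam(score, required_hits, configured_threshold):
    """True if a message with this score would be autolearnt as spam."""
    return score >= effective_spam_threshold(required_hits, configured_threshold)
```

With required_hits at 5 and the threshold set to 4.0, the effective cutoff in this sketch is 9, so a spam scoring 6 or 7 would still not be autolearnt.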


My true ham never scores higher than 2.0 (98% score 0 or lower),
so - using 4.0 is very safe for me.

Anyway, there should be a possibility for me to make sure
that certain mails are not auto-learned at all - independent of
their score.

How about starting by not setting ridiculously low autolearn thresholds that are *below* the default cutoff of 5, let alone lower than the default of 15 :) You reduce a default value to a ridiculous setting, and then complain that it's learning things it shouldn't? Hmm...


[...]
> (which I've never seen happen in my time on the list) I think you'll find
> bayesian classifiers don't work the way you probably think they do. All
> they do is tokenize a message based on word boundries, and build word
> statistics based on that. They do *not* learn specific messages or phrases
> as spam or ham, they simply count the statistical prevalence of words.

Yes, so if this message is marked as spam, ALL of it is marked as spam.
That means that it will think that 'boundries' (see quote of you) is a
word that at least once appeared in a spam post.

Which is why, by default, nothing scoring less than 15 is autolearnt as spam. Better not to learn something at all than to learn something that is uncertain.
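
As a rough illustration of the 'boundries' point, here is a toy Python sketch of per-token counting. It is a simplification under my own assumptions, not how SpamAssassin's Bayes code is written: the tokenize function, the TokenCounts class and the spam_ratio measure are all invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text on word boundaries and return the distinct lowercase tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class TokenCounts:
    """Toy store counting how many learnt spam/ham messages contain each token."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, text, is_spam):
        # Every token in the message is counted, not just the "spammy" ones.
        target = self.spam if is_spam else self.ham
        for token in tokenize(text):
            target[token] += 1

    def spam_ratio(self, token):
        """Share of learnt messages containing the token that were spam, or None if unseen."""
        total = self.spam[token] + self.ham[token]
        if total == 0:
            return None
        return self.spam[token] / total
```

In this sketch, learning a single list post as spam gives every word in it, typos included, a spam count of one. That is exactly the kind of contamination a high autolearn threshold is meant to keep out.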


> >We really need a way to stop certain (white listed) mail to be auto-learned
> >at all as spam or ham - ever.
>
> Maybe, maybe not. I think perhaps you're looking for a problem where there
> isn't one. Do you have any examples of this in your own collection of
> received emails from this list ? Do you have even one that says
> "autolearn=spam" for a message from this list that was just someone pasting
> a copy of a spam in a new message ? If you do, you might have a case,
> otherwise...


I certainly do - spammers will eventually learn how to get their
mail headers right.  What we need to aim for is detecting spam by looking
at the body.  Before I started to use SA I had hand-crafted (procmail)
filters that filtered almost all spam based on the body text;  I started
to use SA because my filters took about 1 minute to process a 100 kb post
and that was starting to annoy me (it froze my PC).

I don't see that happening any time soon. The whole way that spammers obscure their origin is by forging headers. If they stop forging headers then they're giving themselves away.


The bulk of most spam scores come from the headers, not the body.

Regards,
Simon



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
