Re: how do train SpamAssassin

Loren Wilton Sat, 02 Jul 2005 03:36:16 -0700

> Do i really need to train SA..so what is the purpose of auto learn bayes?


Not if you aren't planning on using Bayes.

> Do I also need to install Razor?
Not if you aren't using Razor.

> why why please make it simple and easy

Ok, simple and easy:

1.    SA is a tool for fighting email spam

2.    SA does this by applying various rules to the email.

3.    The various rules look for different spam signs, and assign different
scores if the rules match things found in the mail.

4.    If enough rules match things in the mail, the total score is greater
than the spam threshold (by default, 5) and SA will mark the message as
spam.

5.    The more different rules you have checking various things, the more
likely you are to be able to catch different kinds of spam.  If you have
fewer rules, you are less likely to detect that a spam is spam.

6.    Bayes and Razor are two different kinds of rules that can be used to
detect spam.  If you have them turned on, you have a better chance of
catching spam.  If you don't have them turned on, there is a better chance
the spam will make it into your inbox.

7.    If you are using SA to eliminate spam, then you probably want it to do
the best job it can.  This would seem to imply that you want to have as many
rules as possible to catch spam.  Since Bayes and Razor are rules to catch
spam, you might want to use them.

8.    BUT --- Bayes and Razor are not "simple" rules.  You have to do
something yourself to make them work.  You might find this too hard, or too
much work, or have other problems with doing it.  For instance, you might
have to pay someone to be able to use Razor, depending on how you are using
it.  So Bayes and Razor (and some other rules) are optional.  You don't have
to use them if you don't want to.

9.    Why do you have to train Bayes --- Bayes is a very special kind of
rule.  It matches words it finds in email to words it has found before in
ham mails and spam mails.  If it finds a lot of words that match spam mails
in the current email, it guesses that it is spam.  If it finds a lot of
words that match ham mails, it guesses that the current email is ham.

Bayes does not "know" which words appear in YOUR ham and YOUR spam.  You
have to give it a handful of mail and say "these are ham" and another
handful and say "these are spam".  Bayes can then go in and decide which
words show up in YOUR spam and in YOUR ham.  Then it can render judgement on
new emails.
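
If it helps to see the idea, here is a toy sketch in Python.  This is NOT
how SA's Bayes code actually works (the real thing tokenizes mail and
combines probabilities much more carefully); it only shows the "count words
seen in ham, count words seen in spam, compare" idea.  The function names
and the sample messages are made up for illustration.

```python
from collections import Counter

def train(messages):
    """Count, for each word, how many training messages contain it."""
    counts = Counter()
    for text in messages:
        # set() so a word repeated inside one message is counted once
        counts.update(set(text.lower().split()))
    return counts

def guess(text, ham_counts, spam_counts):
    """Compare how strongly the words match past ham vs. past spam."""
    words = set(text.lower().split())
    spam_hits = sum(spam_counts[w] for w in words)  # unseen words count as 0
    ham_hits = sum(ham_counts[w] for w in words)
    if spam_hits > ham_hits:
        return "spam"
    if ham_hits > spam_hits:
        return "ham"
    return "unsure"

# "Training": a person has sorted these into ham and spam by hand.
ham_counts = train(["lunch meeting moved to noon",
                    "patch for the build attached"])
spam_counts = train(["cheap pills online now",
                     "cheap watches buy now"])

print(guess("buy cheap pills", ham_counts, spam_counts))
print(guess("meeting about the build", ham_counts, spam_counts))
```

Note that a message made of words it has never seen comes back "unsure",
which is exactly the state Bayes is in before you train it.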

There are two ways to "train" Bayes - manual learning and auto-learning.

With Manual Learning, someone with a brain, you for instance, looks at the
emails and says "these are spam" and "these others are ham".  There is no
guesswork involved -- YOU have decided what is spam and what is ham.  You
then tell Bayes this, and it begins to know about your mail patterns.

With auto-learning, Bayes does not really KNOW which mails are REALLY spam
and ham.  Instead, it relies on the other rules.  Which are pretty good, but
aren't as good as a real person looking at the mail.  The mail is given an
initial score from the other rules.  Then this score is compared to the
"bayes auto-learn" thresholds.  If the mail scores LESS than the "ham"
threshold, SA gives it to Bayes to learn as ham.  Likewise if it scores more
than the "spam" threshold, SA gives it to Bayes to learn as spam.  Then SA
adds the Bayes score to the mail and reports it as the final score.
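
The decision just described can be sketched in a few lines of Python.
Treat this as an illustration under assumptions, not SA's code: the rule
names are hypothetical, the two learn thresholds are example numbers
(check your own configuration), and real SA applies some extra conditions
before learning that this sketch ignores.

```python
# Example numbers only; your configuration may differ.
HAM_LEARN_BELOW = 0.1    # assumed "ham" auto-learn threshold
SPAM_LEARN_ABOVE = 12.0  # assumed "spam" auto-learn threshold

def score_without_bayes(hits, rule_scores):
    """Initial score: sum of the scores of the non-Bayes rules that hit."""
    return sum(rule_scores[name] for name in hits)

def autolearn_decision(score):
    """Decide what, if anything, to feed to Bayes for this message."""
    if score < HAM_LEARN_BELOW:
        return "learn as ham"
    if score > SPAM_LEARN_ABOVE:
        return "learn as spam"
    return "do not learn"  # the middle band is left alone

# Hypothetical rules and scores, invented for this example.
rule_scores = {
    "HYPOTHETICAL_PILLS": 4.5,
    "HYPOTHETICAL_FORGED_DATE": 3.0,
    "HYPOTHETICAL_ALL_CAPS": 5.5,
}

for hits in ([],
             ["HYPOTHETICAL_PILLS"],
             ["HYPOTHETICAL_PILLS", "HYPOTHETICAL_FORGED_DATE",
              "HYPOTHETICAL_ALL_CAPS"]):
    score = score_without_bayes(hits, rule_scores)
    print(hits, score, autolearn_decision(score))
```

The first case is the dangerous one: a spam that happens to hit no rules
scores 0, falls under the ham threshold, and gets learned as ham.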

Most people that have problems with Bayes rely on the auto-learning.  And
for one reason or another, this doesn't work as well as they want it to, and
pretty soon Bayes starts thinking ham is spam and spam is ham, and screwing
up the score.

Now, can auto-learning work?  Yes.  *IF* you first train Bayes manually to
get it on the right track, and then lower the Bayes ham auto-learn threshold
(bayes_auto_learn_threshold_nonspam) a little bit, so that it is less likely
to learn spam as ham.

You can also spend a few hours feeding bayes, leave auto-learn off, and then
forget about it, and Bayes will work.  Probably very well.  Every month or
two you might feed Bayes a few more low-scoring spam or high-scoring ham to
keep it up with changing patterns.

Or if that is too much work, just don't use Bayes at all.

10.    Razor is an optional test because you might have to pay for it,
depending on your situation.  There is no requirement that you use it -- 
that is why it is optional.  Just turn it off and forget about it if you
don't want it.

Simple enough?

        Loren
