At 10:08 AM 11/9/2004, Ronan wrote:
that's quite a comprehensive answer there matt - most appreciated... :D

one more though. sa-learn ham. Is this to explicitly mark what should not be learnt as spam? so should you feed it the rest of your mailbox?

sa-learn --ham is used to teach SA what non-spam looks like. This has nothing to do with adjusting what should or should not be learned in the future. It is a direct, integral, and REQUIRED part of Bayes training.


Bayes makes a judgement about how probable it is that a given mail is spam. To do this, it needs to know what common words/phrases/tokens in spam look like, and what they look like in nonspam.

When you train, sa-learn breaks a message into "tokens" (mostly words from the body, though various headers get encoded as tokens too). It then stores these in a database and tracks how many times each token was seen in spam and how many times in nonspam. Based on those spam/ham counts, SA can calculate the probability that a given token appears in a spam email (as a percentage from 0% to 100%).
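To make the counting idea concrete, here is a rough sketch in Python. It is not SA's actual code: the tokenizer, the TokenDB class, and the probability formula are all simplified assumptions for illustration (real SA also encodes headers and applies extra corrections).

```python
import re
from collections import Counter

def tokenize(message):
    # Hypothetical tokenizer: lowercase word-like strings from the body only.
    # A set is used so each token counts once per message.
    return set(re.findall(r"[a-z0-9$'-]+", message.lower()))

class TokenDB:
    """Toy token database (illustrative only, not SA's storage format)."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.nspam = 0                # total spam messages learned
        self.nham = 0                 # total ham messages learned

    def learn(self, message, is_spam):
        tokens = tokenize(message)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def token_probability(self, token):
        # Fraction of spam / ham messages that contained the token.
        s = self.spam_counts[token] / self.nspam if self.nspam else 0.0
        h = self.ham_counts[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return None  # token never seen, so no opinion
        return s / (s + h)

db = TokenDB()
db.learn("buy cheap pills now", is_spam=True)
db.learn("meeting notes for now", is_spam=False)
print(db.token_probability("pills"))  # seen only in spam
print(db.token_probability("now"))    # seen equally in both
```

Note that a token seen only in spam gets a probability of 100%, which is why the ham side of the counts matters so much.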

When new mail comes in, Bayes looks for token matches against its existing learning. It then arrives at a probability of spam for the whole message by combining the probabilities of the tokens it matched.
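As a sketch of the combining step, the snippet below uses a plain naive-Bayes style product of per-token probabilities. That is an assumption chosen for simplicity; SA's own combining method is more refined, so treat this as an illustration of the idea rather than what SA computes. The function name combine and the clamping constant are made up for this example.

```python
import math

def combine(token_probs):
    """Combine per-token spam probabilities into one message probability.

    Naive-Bayes style: P = prod(p) / (prod(p) + prod(1 - p)),
    computed in log space so long messages do not underflow.
    """
    eps = 1e-6  # clamp so a single 0% or 100% token cannot decide everything
    log_spam = 0.0
    log_ham = 0.0
    for p in token_probs:
        p = min(max(p, eps), 1.0 - eps)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    # Guard against overflow in exp() for very one-sided evidence.
    if diff > 700:
        return 0.0
    if diff < -700:
        return 1.0
    return 1.0 / (1.0 + math.exp(diff))

print(combine([0.9, 0.9]))  # two spammy tokens reinforce each other
print(combine([0.9, 0.1]))  # one spammy and one hammy token roughly cancel out
print(combine([]))          # no matched tokens: no opinion, 0.5
```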

It's all a simple statistical word-frequency thing...

Without ham training, Bayes would think that everything is spam. (Fortunately, SA will flatly refuse to use Bayes until at least 200 hams and 200 spams have been trained.)
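The "refuse until trained" check amounts to a simple gate on the two message counts. A minimal sketch, assuming the default of 200 for each (I believe these correspond to SA's bayes_min_spam_num and bayes_min_ham_num settings; the function name bayes_ready is hypothetical):

```python
def bayes_ready(nspam, nham, min_spam=200, min_ham=200):
    # Bayes results are only used once BOTH corpora are large enough.
    return nspam >= min_spam and nham >= min_ham

print(bayes_ready(nspam=5000, nham=0))   # False: lots of spam, no ham at all
print(bayes_ready(nspam=250, nham=199))  # False: one ham short
print(bayes_ready(nspam=200, nham=200))  # True
```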




