Robert Menschel wrote:
Hello Matias,

Friday, February 11, 2005, 5:32:10 AM, you wrote:

MLB> A couple of weeks ago I started storing the messages flagged as
MLB> spam by SA. Currently I have about 20400 messages stored. I'm
MLB> planning to sa-learn them, but now I have another question ;)


MLB> The sa-learn man page says that for a good training of the
MLB> Bayesian filter, you need to train it with equal amounts of spam
MLB> and ham, or more ham if possible. So if I sa-learn the spam
MLB> folder, the spam tokens are going to grow a lot compared to ham
MLB> tokens.

IMO, if you manually train ONLY spam into the system, then yes, you
may end up with Bayes problems. Emphasis: may. It might work just
fine.

You don't need to worry about training Bayes with equal amounts of
spam and ham -- my ratio has varied from 10:1 to 15:1 spam:ham, with
no problem.

But it's important to feed ham into the system as well. I would
hesitate to exceed a 100:1 ratio, unless your actual spam load exceeds
100:1.

How much non-spam are you able to capture, verify, and sa-learn?

Last week I enabled a flag in milter-spamc to redirect the non-spam mail to a separate email address.
I'm going to use a few hundred messages to maintain the ratio that you suggest. I only need to verify the messages one by one so that I don't train the Bayesian filter the wrong way :-P
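
As a rough check of the numbers in this thread, here is a minimal sketch of the arithmetic, not anything SpamAssassin itself provides. The helper name ham_needed is hypothetical; the inputs are the 20400 stored spam messages and the 10:1, 15:1, and 100:1 spam:ham ratios mentioned above.

```python
def ham_needed(spam_count: int, spam_per_ham: int) -> int:
    """Smallest ham count that keeps spam:ham at or below spam_per_ham:1.

    Hypothetical helper for illustration only; it is not part of sa-learn.
    """
    if spam_count < 0 or spam_per_ham <= 0:
        raise ValueError("spam_count must be >= 0 and spam_per_ham must be > 0")
    # Integer ceiling division, so a partial message rounds up to a whole one.
    return (spam_count + spam_per_ham - 1) // spam_per_ham


SPAM_STORED = 20400  # spam messages stored so far, per the thread

for ratio in (10, 15, 100):
    print(f"{ratio}:1 spam:ham -> at least {ham_needed(SPAM_STORED, ratio)} ham messages")
```

Under these assumptions, a few hundred verified ham messages sits near the 100:1 limit; getting closer to 10:1 or 15:1 would take on the order of one to two thousand.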


Thanks Bob :)

BR,
Matías.
