On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe" <[EMAIL PROTECTED]> wrote:
>So it looks like I have to reset my Bayes and re-train it. I want to do
>it properly this time. I will be making sure I personally review every
>message that our users put into the spam folder first, to make sure they
>haven't put spam into the wrong folder. However, I have a couple of
>questions:
>
>1) Am I better off to feed it a few emails a day, or wait until I get a
>few hundred, then feed them all to sa-learn at once? Is there really a
>difference?
>2) How many spams should I feed it? I've heard in some places that 200
>is OK, I've heard elsewhere that 10000 or more are needed.
>3) Just how 'balanced' should its diet be? Should I use the same
>quantity of ham as spam, or can I get away with less ham than spam?
>
>
>Regards,
> Leigh
>
>Leigh Sharpe
>Network Systems Engineer
>Pacific Wireless
>Ph +61 3 9584 8966
>Mob 0408 009 502
>email [EMAIL PROTECTED]
>web www.pacificwireless.com.au
>

The recommended minimum corpus is 200 spam and 200 ham; after that, add
messages as they are received. My initial corpus was around 500 of each,
and my Bayes database has remained stable for several years.

The numbers should be about equal, though in my experience they don't
have to be exact. If you train on 200 ham and 2000 spam, though, you
will skew the Bayes scoring.

Here, FPs and FNs are trained in as they are reported. I don't use the
auto-train feature; I've personally found it to be problematic.

HTH

Nigel
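
For anyone wanting to sanity-check a corpus before the first training run, here is a minimal sketch of the numbers discussed above. It is only an illustration, not part of the original reply: the helper names (count_messages, corpus_report, sa_learn_commands) are hypothetical, it assumes one message per file in each folder, and it only builds the sa-learn command lines rather than running them.

```python
from pathlib import Path

# Minimum per class, taken from the 200 spam / 200 ham figure in the reply.
MIN_PER_CLASS = 200


def count_messages(folder):
    """Count regular files in a folder (assumes one message per file)."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file())


def corpus_report(n_spam, n_ham, minimum=MIN_PER_CLASS):
    """Report whether each class meets the minimum, plus the spam:ham ratio."""
    return {
        "enough_spam": n_spam >= minimum,
        "enough_ham": n_ham >= minimum,
        # A ratio far from 1.0 suggests a lopsided corpus (e.g. 2000:200 = 10.0).
        "spam_per_ham": n_spam / n_ham if n_ham else float("inf"),
    }


def sa_learn_commands(spam_dir, ham_dir):
    """Build (but do not run) the sa-learn calls for an initial training pass."""
    return [
        ["sa-learn", "--spam", str(spam_dir)],
        ["sa-learn", "--ham", str(ham_dir)],
    ]
```

The report deliberately returns the raw ratio instead of a pass/fail verdict, since the reply only says the counts should be "about equal" and gives no hard threshold. The command lists could be handed to subprocess.run once the hand-reviewed folders look right.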