Simon Byrnand writes:

>At 14:16 17/07/03 -0400, Barry McLarnon wrote:
>>On Jul 16, 2003 09:34 pm, Simon Byrnand wrote:
>> > Anybody have any suggestions why almost all the ham I manually
>> > train won't budge below BAYES_30?
>>
>>I think you should suggest to your correspondents that they become
>>more literate. :-) I just took a look at the ham in my inbox... of
>>160 messages, 104 had BAYES_01, 27 had BAYES_10, 19 had BAYES_20,
>>10 had BAYES_30, and none had higher. Hard to say why your mileage
>>is varying so much, but maybe you can run Bayesian analysis on
>>individual ham messages and see which tokens are scoring relatively
>>high.
>
>My hunch is that auto-learning waters down the effectiveness of manual
>training. Our Bayes database is now up to nearly 60,000 spam and 60,000
>ham, and I suspect the token counts for common words are quite large,
>so training on individual messages has a correspondingly small effect
>compared to if I only had, say, 2,000 spam and 2,000 ham.
>
>Anyone agree with this theory?
Yeah, that sounds likely -- for the common words. However, if a word is
not very common, e.g. it appeared in ham 1 time and in spam 4 times, and
you learn it as ham, you'll still see a big effect; the probability will
move closer to .5. Try it out using "sa-learn --dump | grep token" to
view the token's line before and after.

That said, at that kind of volume there really isn't much need to keep
auto-learning if you don't want to; you could turn it off and just train
on what's reported as FPs/FNs, and that should work OK.

--j.
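To make the "moves closer to .5" point concrete, here is a rough sketch
in Python. It uses a simplified Graham-style per-token estimate, not
SpamAssassin's actual Bayes internals, and the token hit counts below
are hypothetical; only the 60,000/60,000 corpus sizes are taken from
Simon's numbers above.

    # Simplified Graham-style estimate of P(spam | token). SpamAssassin's
    # real Bayes code combines token probabilities differently, so treat
    # this as illustrative only; the token hit counts are made up.
    def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
        s = spam_hits / n_spam  # fraction of spam containing the token
        h = ham_hits / n_ham    # fraction of ham containing the token
        return s / (s + h)

    N_SPAM = N_HAM = 60000  # corpus sizes from the thread

    # Rare token (4 spam hits, 1 ham hit): one learned ham moves it a lot.
    print(token_spam_prob(4, 1, N_SPAM, N_HAM))      # 0.80
    print(token_spam_prob(4, 2, N_SPAM, N_HAM + 1))  # ~0.67, toward .5

    # Common token (9000 spam hits, 3000 ham hits): one more ham is noise.
    print(token_spam_prob(9000, 3000, N_SPAM, N_HAM))      # 0.75
    print(token_spam_prob(9000, 3001, N_SPAM, N_HAM + 1))  # ~0.7499

Under these assumptions a single manually-trained message barely dents a
heavily-seen token, which matches Simon's theory for common words, while
rare tokens still swing noticeably.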