>My hunch is that auto-learning waters down the effectiveness of manual
>training. Our Bayes database is now up to nearly 60,000 spam and 60,000
>ham, and I suspect that the token numbers for common words are quite large,
>therefore training on individual messages has a correspondingly small
>effect compared to if I only had, say, 2,000 spam and 2,000 ham.
>
>Anyone agree with this theory?
Yeah, that sounds likely -- for the common words. However, if a word is not very common -- say it has appeared in ham 1 time and in spam 4 times -- and you learn a message containing it as ham, you'll still get a big effect: the token's probability will move noticeably closer to 0.5. Try it out with sa-learn --dump | grep <token> to view the token's line before and after.
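To put numbers on it, here's a rough back-of-the-envelope sketch in Python (a simplified Graham-style ratio with made-up token counts; not SpamAssassin's exact math, which also applies a prior and chi-squared combining):

    # Simplified per-token spam probability, P = (s/Ns) / (s/Ns + h/Nh).
    # Illustrative only -- not SpamAssassin's actual estimator.

    def token_prob(spam_hits, ham_hits, n_spam, n_ham):
        ps = spam_hits / n_spam   # token frequency in the spam corpus
        ph = ham_hits / n_ham     # token frequency in the ham corpus
        return ps / (ps + ph)

    N_SPAM = N_HAM = 60000  # corpus sizes from the post above

    # Common token, seen in 20,000 spam and 20,000 ham; then learn one
    # ham message containing it:
    print(token_prob(20000, 20000, N_SPAM, N_HAM))      # 0.5
    print(token_prob(20000, 20001, N_SPAM, N_HAM + 1))  # ~0.49999, barely moves

    # Rare token, seen in 4 spam and 1 ham; then learn one ham:
    print(token_prob(4, 1, N_SPAM, N_HAM))              # 0.8
    print(token_prob(4, 2, N_SPAM, N_HAM + 1))          # ~0.6667, a big shift

Same single ham message, but the rare token moves about 0.13 toward neutral while the common one moves by roughly 0.00001 -- which is exactly the dilution effect described above.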
Ok, might try that.
However, I'd say at that kind of volume there really isn't much need to keep auto-learning; you could turn it off and just train on what's reported as FPs/FNs, and that should work OK.
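For reference, switching it off is a one-line change in local.cf (the option is bayes_auto_learn in SpamAssassin 3.x; if I remember right, the 2.5x series spells it auto_learn):

    bayes_auto_learn 0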
What I was starting to think about is this: could there be a different weighting for auto-learning as compared to manual learning?
E.g., the ability to give manual learning a much stronger weight.
I'm not sure how you'd go about that exactly, but maybe a --weight option could be added to sa-learn that multiplies the added count of each token by a factor of up to, say, 10. That would give manually learnt ham/spam comparatively more effect than auto-learnt stuff (which would always have a factor of 1).
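In rough terms, the idea would be something like this (a hypothetical Python sketch of the proposed behaviour -- no such --weight flag exists in sa-learn today, and this isn't its real internals):

    # Hypothetical: each token from a manually learned message bumps its
    # count by `weight` instead of 1; auto-learning keeps weight = 1.

    def learn(db, tokens, as_spam, weight=1):
        # db maps token -> [spam_count, ham_count]
        idx = 0 if as_spam else 1
        for tok in tokens:
            db.setdefault(tok, [0, 0])[idx] += weight

    db = {}
    learn(db, ["cheap", "pills"], as_spam=True)                 # auto-learnt spam
    learn(db, ["meeting", "agenda"], as_spam=False, weight=10)  # manually learnt ham
    print(db)  # {'cheap': [1, 0], 'pills': [1, 0], 'meeting': [0, 10], 'agenda': [0, 10]}

One wrinkle: the nspam/nham message totals would presumably need the same scaling applied, otherwise the weighted tokens would look more frequent than the corpus size justifies.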
Sound plausible?
Regards,
Simon