The sa-learn man page says that for good training of the Bayesian filter, you need to train it with equal amounts of spam and ham, or more ham if possible. So if I run sa-learn on my spam folder, the spam tokens are going to grow a lot compared to the ham tokens. Here are my training efforts:
[EMAIL PROTECTED] root]# sa-learn --dump magic | head -4
0.000          0          3          0  non-token data: bayes db version
0.000          0       1932          0  non-token data: nspam
0.000          0       1973          0  non-token data: nham
0.000          0     170590          0  non-token data: ntokens
[EMAIL PROTECTED] root]#
Would this growing imbalance toward spam data have adverse effects on how the Bayes filter classifies spam and ham messages?
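For reference, this is roughly how I train and check the balance; the folder paths below are just placeholders for a maildir-style layout, so adjust them to your own setup:

  sa-learn --ham  /home/user/Maildir/cur           # placeholder path: known-good mail
  sa-learn --spam /home/user/Maildir/.Spam/cur     # placeholder path: confirmed spam
  sa-learn --dump magic | grep -E 'nspam|nham'     # check the current spam:ham ratio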
The man page is describing an ideal situation. In practice you can be pretty wildly off and Bayes will still work reasonably well.
Training on a lot more spam than ham makes Bayes more likely to misclassify a nonspam email as spam, but my database is VERY off balance and I've had no problems with it at all. The token differences between spam and nonspam here are just too great; even a massive imbalance isn't causing false positives.
Look at my stats:
0.000          0          2          0  non-token data: bayes db version
0.000          0     565896          0  non-token data: nspam
0.000          0      24693          0  non-token data: nham
0.000          0     180900          0  non-token data: ntokens
My spam training outnumbers my ham training by about 22:1, which is pretty far from the ideal 1:1 (or slightly more ham). I have more FN problems than FP problems with my Bayes DB, but I rarely have trouble with either.
Also, I trust Dan and the SADevs to know Bayes better than I do, and their testing found 1:1 to work best. However, from a mathematics perspective I'm still not sure why that is; my original impression was that the ratio should match your real-world spam:ham ratio. Sometime when I have spare time I intend to test this myself so I can understand it better.
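For what it's worth, here is the per-token arithmetic as I understand it, a simplified sketch in the spirit of the Graham/Robinson estimate rather than the exact SpamAssassin code. With s_w and h_w being the number of spam and ham messages a token appeared in:

  p(w) = (s_w / nspam) / (s_w / nspam + h_w / nham)

Because each count is divided by its own message total, a token seen in 220 of 22,000 spams and 10 of 1,000 hams still comes out at 0.01 / (0.01 + 0.01) = 0.5, so the raw nspam:nham imbalance is largely normalized away. What a lopsided corpus really hurts is how well sampled the rarer class's tokens are, which may be why the developers' 1:1 recommendation still wins in testing.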