On Tue, 13 Feb 2018 21:02:46 +0000 Horváth Szabolcs wrote:
> One more question: is there a recommended ham to spam ratio? 1:1?

No, this is a myth. Bayes computes token probabilities from a token's frequencies in spam and ham, that is, its counts relative to the number of messages trained in each class, so the overall ratio scales out. If you have 2000 ham and 200 spam, the problem is too few spam, not a bad ratio.

Theoretically there is a case for making new training match the ratio that's already in the database, because then a new token gets a token probability that reflects its frequencies in recent mail. But I wouldn't worry about that: it's hard to stick to, and the effect is probably minor.
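
To make the scaling argument concrete, here is a rough sketch in Python. It assumes the simplest frequency-based form of the token probability and is not SpamAssassin's actual code, which applies further adjustments on top of this. The function name and all the counts are made up for illustration.

```python
def token_probability(spam_hits, ham_hits, nspam, nham):
    """Simplified token probability from per-class frequencies.

    Illustrative only: real Bayes implementations add smoothing
    for rare tokens, which is left out here.
    """
    spam_freq = spam_hits / nspam  # fraction of trained spam containing the token
    ham_freq = ham_hits / nham     # fraction of trained ham containing the token
    return spam_freq / (spam_freq + ham_freq)

# A token seen in 10% of spam and 1% of ham, under two training ratios.
balanced = token_probability(100, 10, nspam=1000, nham=1000)  # 1:1
skewed = token_probability(20, 20, nspam=200, nham=2000)      # 10:1 ham to spam

# A new token that is equally common in recent spam and ham (10% of each),
# learned into a database that already holds 2000 ham and 200 spam.
# New training at 1:1 (100 spam + 100 ham) against the 10:1 database:
mismatched = token_probability(10, 10, nspam=200 + 100, nham=2000 + 100)
# New training at the database's own 10:1 ratio (100 spam + 1000 ham):
matched = token_probability(10, 100, nspam=200 + 100, nham=2000 + 1000)

print(balanced, skewed)
print(mismatched, matched)
```

The first pair comes out the same, which is the point about the ratio scaling out. In the second pair the mismatched case gives about 0.875 for a token that is really neutral in recent mail, while the matched case gives about 0.5. That is the theoretical effect mentioned above, and it only applies to tokens with no history in the database.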