Kim Christensen wrote:
> Hey list,
>
> I've recently started training our bayesian filter with spam/ham from my
> personal mailbox, to prepare for live usage on our customer accounts.
>
> % sa-learn --dump magic
> ...
> 0.000          0        340          0  non-token data: nspam
> 0.000          0        475          0  non-token data: nham
> 0.000          0      53404          0  non-token data: ntokens
> ...
>
> So far so good, and spamd is actually using the bayesian db when
> examining incoming mails. However, I find that a few of the legit ham
> (not a majority) mails get unusually high bayesian points, while some
> of the real spam (which gets scored as spam by sa) often get bayesian
> points < 1.
>
> Now, I'm sure I haven't trained the database with wrong messages. Is it
> a good idea to continue feeding sa-learn with example spam and ham until
> it reaches a few thousand messages, before relying on the results?
>
> I would think my current amount is sufficient, but I guess something's
> wrong with this picture :-)

If you want to see which tokens are throwing Bayes off, try running a
mis-categorized message through spamassassin -D bayes. This turns on
Bayes debugging and prints every Bayes-matching token in the message
(in text form) along with its individual probability.
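For example, assuming you've saved a mis-filed message as suspect.eml
(the filename is just a placeholder), something along these lines should
pull out the relevant lines:

  % spamassassin -D bayes < suspect.eml 2>&1 | grep -i bayes

The debug output goes to stderr, hence the redirect. Look for the lines
that list individual tokens with their probabilities: tokens near 1.0
are pulling the message toward spam, tokens near 0.0 toward ham.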
It's completely normal for a message to have a few tokens on "the wrong
side", so don't over-worry about testing every message this way; that
can lead to the mistake of micro-managing your Bayes database. It can be
useful, though, for figuring out what Bayes is thinking when you get odd
results.