Kim Christensen wrote:
> Hey list,
>
> I've recently started training our bayesian filter with spam/ham from my
> personal mailbox, to prepare for live usage on our customer accounts.
>
> % sa-learn --dump magic
> ...
> 0.000          0        340          0  non-token data: nspam
> 0.000          0        475          0  non-token data: nham
> 0.000          0      53404          0  non-token data: ntokens
> ...
>
> So far so good, and spamd is actually using the bayesian db when
> examining incoming mails. However, I find that a few of the legitimate
> ham mails (not a majority) get unusually high Bayes scores, while some
> of the real spam (which SA does score as spam) often gets a Bayes
> score < 1.
>
> Now, I'm sure I haven't trained the database with the wrong messages. Is
> it a good idea to continue feeding sa-learn with example spam and ham
> until it reaches a few thousand messages, before relying on the results?
>
> I would think my current amount is sufficient, but I guess something's
> wrong with this picture :-)
>
If you want to see which tokens are throwing Bayes off, try running a
mis-categorized message through spamassassin -D bayes. This will turn on
Bayes debugging and print all the Bayes-matching tokens in the message
(in text form) along with their individual probabilities.
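
For example, something along these lines should do it (the filenames are
just placeholders; the debug output goes to stderr, so redirect it
somewhere you can grep through):

% spamassassin -D bayes < miscategorized.eml > /dev/null 2> bayes-debug.txt
% grep -i bayes bayes-debug.txt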

It's completely normal for a message to have a few tokens on "the wrong
side", so don't over-worry about testing every message this way; that can
lead to the mistake of micro-managing your Bayes database. It is useful,
though, for figuring out what Bayes is thinking when you get odd results.
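
If you do decide to keep feeding it more corpus, the basic incantation is
just the following (the directory paths are only examples; use --mbox
instead if your corpora are stored as mbox files):

% sa-learn --spam ~/corpus/spam/
% sa-learn --ham ~/corpus/ham/
% sa-learn --dump magic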

