At 16:58 17/07/03 -0700, Justin Mason wrote:


>My hunch is that auto-learning waters down the effectiveness of manual
>training. Our Bayes database is now up to nearly 60,000 spam and 60,000
>ham, and I suspect that the token numbers for common words are quite large,
>therefore training on individual messages has a correspondingly small
>effect compared to if I only had say 2,000 spam and 2,000 ham.
>
>Anyone agree with this theory ?

Yeah, that sounds likely for the common words.  However, if a word is
not very common, e.g. it appeared in ham 1 time and spam 4 times, and you
learn it as ham, you'll still have a big effect; the probability will move
closer to .5.  Try it out using sa-learn --dump | grep token to view the
token's line before and after.
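
To put rough numbers on that, here is a small Python sketch. It assumes a simplified per-token probability (the token's spam rate divided by the sum of its spam and ham rates) and is not SpamAssassin's actual computation, which applies further adjustments. The function names and the counts for the "common" token are made up for illustration; only the 60,000/60,000 totals and the 1-ham/4-spam token come from the discussion above.

```python
def token_prob(spam_hits, ham_hits, nspam, nham):
    """Simplified per-token spam probability from relative frequencies."""
    spam_rate = spam_hits / nspam
    ham_rate = ham_hits / nham
    return spam_rate / (spam_rate + ham_rate)


def shift_after_ham(spam_hits, ham_hits, nspam, nham):
    """Return (before, after) probabilities for learning one more ham
    message that contains the token."""
    before = token_prob(spam_hits, ham_hits, nspam, nham)
    after = token_prob(spam_hits, ham_hits + 1, nspam, nham + 1)
    return before, after


NSPAM = NHAM = 60_000  # database size mentioned in the thread

# Rare token: seen in 4 spam and 1 ham.
rare = shift_after_ham(4, 1, NSPAM, NHAM)

# Common token: hypothetical counts chosen to give the same starting probability.
common = shift_after_ham(40_000, 10_000, NSPAM, NHAM)

print("rare token:   %.4f -> %.4f" % rare)
print("common token: %.4f -> %.4f" % common)
```

Under these assumptions both tokens start at the same probability, but one extra ham message moves the rare token noticeably towards .5 while the common token barely changes.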

Ok, might try that.


However, I'd say at that kind of volume, there really isn't much need to
keep auto-learning if you don't want to; you could also turn it off
and just train on what's reported as FPs/FNs and that should work OK.

What I was starting to wonder is whether there could be a different weighting for auto-learning compared to manual learning?


E.g., the ability to give manual learning a much stronger weight.

I'm not sure how you'd go about that exactly, but maybe a --weight option could be added to sa-learn which multiplies the count added for each token by a factor of up to, say, 10. That would give manually learnt ham/spam comparatively more effect than auto-learnt messages (which would always have a factor of 1).
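
A rough Python sketch of the idea, purely to illustrate the counting. Nothing here exists in sa-learn: the ToyBayesStore class, its learn() method and the weight parameter are all hypothetical, and the real token database is not a pair of in-memory counters.

```python
from collections import Counter


class ToyBayesStore:
    """Hypothetical stand-in for a Bayes token database."""

    def __init__(self):
        self.spam = Counter()  # token -> weighted spam count
        self.ham = Counter()   # token -> weighted ham count

    def learn(self, tokens, is_spam, weight=1):
        """Add each distinct token `weight` times (1 = auto-learn)."""
        if not 1 <= weight <= 10:
            raise ValueError("weight must be between 1 and 10")
        target = self.spam if is_spam else self.ham
        for tok in set(tokens):
            target[tok] += weight


store = ToyBayesStore()

# Auto-learnt spam: always a factor of 1.
store.learn(["offer", "pills"], is_spam=True)

# Manually learnt ham with the proposed stronger weight.
store.learn(["offer", "meeting"], is_spam=False, weight=10)

print(store.spam["offer"], store.ham["offer"])
```

The sketch leaves out the total spam/ham message counts; whether those should be scaled by the same factor is one of the details that would need working out.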

Sound plausible?

Regards,
Simon


