Re: [SAtalk] Trouble training bayes ?

Justin Mason Thu, 17 Jul 2003 17:41:21 -0700

Simon Byrnand writes:
>At 14:16 17/07/03 -0400, Barry McLarnon wrote:
>>On Jul 16, 2003 09:34 pm, Simon Byrnand wrote:
>> > Anybody have any suggestions why almost all the ham I manually
>> > train won't budge below BAYES_30 ?
>>
>>I think you should suggest to your correspondents that they become
>>more literate. :-)  I just took a look at the ham in my inbox... of
>>160 messages, 104 had BAYES_01, 27 had BAYES_10, 19 had BAYES_20,
>>10 had BAYES_30, and none had higher.  Hard to say why your mileage
>>is varying so much, but maybe you can run Bayesian analysis on
>>individual ham messages and see which tokens are scoring relatively
>>high.
>
>My hunch is that auto-learning waters down the effectiveness of manual 
>training. Our Bayes database is now up to nearly 60,000 spam and 60,000 
>ham, and I suspect that the token numbers for common words are quite large, 
>therefore training on individual messages has a correspondingly small 
>effect compared to if I only had say 2,000 spam and 2,000 ham.
>
>Anyone agree with this theory ?


Yeah, that sounds likely -- for the common words.  However, if a word is
not very common, e.g. it appeared in ham 1 time and spam 4 times, and you
learn it as ham, you'll still have a big effect; the probability will move
closer to .5.  Try it out using "sa-learn --dump | grep token" (with the
word you care about in place of "token") to view its line before and
after learning.
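
To put rough numbers on the rare-versus-common point, here is a minimal
sketch.  It assumes a plain relative-frequency estimate of the per-token
probability, which is a simplification and not necessarily what
SpamAssassin computes internally.  The function names and the "common
word" counts (3000 spam, 1000 ham) are made up for illustration; the
rare-word counts and the 60,000/60,000 corpus sizes come from the thread.

```python
def token_prob(spam_hits, ham_hits, nspam, nham):
    """Naive per-token spam probability from relative frequencies.

    Assumption: a simplified estimate, not SpamAssassin's exact formula.
    """
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    return spam_freq / (spam_freq + ham_freq)


def shift_after_one_ham(spam_hits, ham_hits, nspam, nham):
    """Return (before, after) probabilities when one more ham message
    containing the token is learned."""
    before = token_prob(spam_hits, ham_hits, nspam, nham)
    after = token_prob(spam_hits, ham_hits + 1, nspam, nham + 1)
    return before, after


# Rare token from the thread: seen in 4 spam and 1 ham.
rare = shift_after_one_ham(4, 1, 60000, 60000)

# Hypothetical common token: seen in 3000 spam and 1000 ham.
common = shift_after_one_ham(3000, 1000, 60000, 60000)

print("rare:   %.4f -> %.4f" % rare)
print("common: %.4f -> %.4f" % common)
```

Under this simplified model the rare token drops from about 0.80 to
about 0.67 after a single ham message, while the common token moves by
well under 0.001.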

However, I'd say at that kind of volume, there really isn't much need to
keep auto-learning if you don't want to; you could turn it off and just
train on what's reported as FPs/FNs (false positives and false
negatives), and that should work OK.

--j.


