I'm sure this is a silly question, but it's one I haven't been able to find documentation on, and it has kept me from using the Bayes stuff in SA. Does it matter whether the header info is included in the e-mail when it's submitted to sa-learn? From what has been said in this thread, it sounds like only the text of the message really matters. The reason I ask is that I wasn't sure how to get my ham and spam out of Outlook and into sa-learn. Any thoughts?
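For what it's worth, here's roughly what I had in mind, assuming I can first get the mail out of Outlook into plain mbox files (say by copying the folders to an IMAP account and pulling them down from there). The filenames below are just placeholders, and this is only a sketch, not something I've tested:

#!/usr/bin/env python
# Rough sketch: feed every message in an mbox file to sa-learn, headers and
# all, one sa-learn invocation per message via stdin.
import mailbox
import subprocess

def learn_mbox(path, kind):
    """Teach sa-learn every message in the mbox at 'path' as ham or spam."""
    assert kind in ("ham", "spam")
    for msg in mailbox.mbox(path):
        raw = msg.as_bytes()  # full message, headers included
        subprocess.run(["sa-learn", "--" + kind], input=raw, check=True)

learn_mbox("outlook-ham.mbox", "ham")    # placeholder filename
learn_mbox("outlook-spam.mbox", "spam")  # placeholder filename

If that's on the right track, I gather sa-learn can also take a whole mbox in one go with its --mbox switch, which would make the per-message loop unnecessary; I just wasn't sure whether the headers Outlook leaves in would matter.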
J.

Jeffrey J Funk
President/CEO
Badger Internet, Inc.
[EMAIL PROTECTED]
608.661.4240

-----Original Message-----
From: Fox Flanders [mailto:[EMAIL PROTECTED]
Sent: Friday, July 18, 2003 6:46 AM
To: Barry McLarnon; Simon Byrnand
Cc: [EMAIL PROTECTED]
Subject: Re: [SAtalk] Trouble training bayes ?

Yes. I have been using Bayes since about the day Paul Graham published his
algorithm. I have always hand-picked the messages I knew were spam (from
trollboxes) or ham. I found that filter, which I still maintain, so much
more effective than SA's autolearn that I disabled SA's Bayes filter. I
suspect autolearn would work well if you started with a sizable, modern,
and accurate corpus and then autolearned from there.

Fox

----- Original Message -----
From: "Simon Byrnand" <[EMAIL PROTECTED]>
To: "Barry McLarnon" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, July 17, 2003 7:12 PM
Subject: Re: [SAtalk] Trouble training bayes ?

> At 14:16 17/07/03 -0400, Barry McLarnon wrote:
> >On Jul 16, 2003 09:34 pm, Simon Byrnand wrote:
> > > Anybody have any suggestions why almost all the ham I manually
> > > train won't budge below BAYES_30?
> >
> >I think you should suggest to your correspondents that they become
> >more literate. :-) I just took a look at the ham in my inbox... of
> >160 messages, 104 had BAYES_01, 27 had BAYES_10, 19 had BAYES_20,
> >10 had BAYES_30, and none had higher. Hard to say why your mileage
> >is varying so much, but maybe you can run Bayesian analysis on
> >individual ham messages and see which tokens are scoring relatively
> >high.
>
> My hunch is that auto-learning waters down the effectiveness of manual
> training. Our Bayes database is now up to nearly 60,000 spam and 60,000
> ham, and I suspect the token counts for common words are now quite large,
> so training on individual messages has a correspondingly small effect
> compared to if I only had, say, 2,000 spam and 2,000 ham.
>
> Anyone agree with this theory?
>
> Regards,
> Simon
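PS: on Simon's dilution theory in the quoted thread above, a toy back-of-the-envelope calculation suggests the effect he's describing. This is nothing like SpamAssassin's actual combining math, and the token counts are invented, but it shows how one extra training message moves a token less as the corpus grows:

# Toy Graham-style token probability: how far one extra ham sighting of a
# token moves it, small corpus vs. large. All counts here are made up.
def token_spam_prob(spam_hits, ham_hits, nspam, nham):
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    return spam_freq / (spam_freq + ham_freq)

for nspam, nham, spam_hits, ham_hits in [(2000, 2000, 50, 50),
                                         (60000, 60000, 1500, 1500)]:
    before = token_spam_prob(spam_hits, ham_hits, nspam, nham)
    after = token_spam_prob(spam_hits, ham_hits + 1, nspam, nham + 1)
    print("%d/%d corpus: %.4f -> %.4f" % (nspam, nham, before, after))

With the token's hit counts scaled to the corpus size, one extra ham sighting shifts the toy probability by about 0.005 in the 2,000-message case but only about 0.0002 at 60,000, which seems to fit Simon's hunch that individual training messages get drowned out as the database grows.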