I've been running a Bayesian filter at my company since about the time Paul Graham published his paper. Too little ham results in false positives, where messages that are not spam are incorrectly marked as spam. There is a point of diminishing returns in adding ham, but I haven't found it yet. I do know you can get reasonable mileage out of small corpora of just several thousand messages. It seems the more ham and spam I have, the more accurate my database. Adding fresh messages is necessary to keep up with the changing times.
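To make the ham point concrete, here is a minimal sketch of a per-token spam probability in the style of the formula from Paul Graham's paper. It is not SpamAssassin's own scoring, and the function name, parameters, and the example counts are made up for illustration. Counts are divided by the size of each corpus, so a 90/10 spam-to-ham split is normalized away; what hurts is a ham corpus too small to have seen a legitimate word at all.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham, min_occurrences=5):
    """Graham-style spam probability for one token (illustrative sketch).

    spam_count / ham_count: how often the token was seen in each corpus.
    n_spam / n_ham: number of messages in each corpus.
    Returns None when the token is too rare to be trusted.
    """
    g = 2 * ham_count          # ham counts are doubled to bias against false positives
    b = spam_count
    if g + b < min_occurrences:
        return None            # not enough evidence for this token
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))   # never fully certain either way


# Hypothetical token seen in 50 of 1000 spam, with only 20 ham on file,
# none of which happen to contain it: it looks like pure spam.
small_ham = token_spam_probability(50, 0, n_spam=1000, n_ham=20)

# Same token once 2000 ham are on file and 200 of them contain it.
large_ham = token_spam_probability(50, 200, n_spam=1000, n_ham=2000)

print(small_ham, large_ham)
```

With the small ham corpus the token is pinned at the 0.99 ceiling; with the larger one it drops to roughly 0.2, which is the kind of correction that only more ham can supply.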
I have archived 75,000 spam (100% pure) and 75,000 nonspam (99.9% pure) messages, which today I am recompiling into a Bayesian database (MySQL based). In this rendition I am using three-word tokens in addition to one- and two-word tokens, and I hope to run statistics on how that affects accuracy. The master database is about four gigs with all tokens and 600 megs with only the tokens counted five or more times. I would like to offer it for download as an SA Bayes database if I can get it compressed to under 100 megs. I hope to work on that later this week. We will find out how fast and durable the SA Bayes database message adder is. I have had to rewrite mine several times to give it enterprise features. It still takes about twenty-four hours for my 1.13 gigahertz server to crunch the 150,000 messages.

Fox

----- Original Message -----
From: "Pierre Beck" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, July 17, 2003 9:17 AM
Subject: [SAtalk] Bayes balance problem

> Hi,
>
> I'm afraid my Bayes filter will learn too much spam. I'm getting 90%
> Spam, so I can't give an equal amount of ham, even if I sa-learn every
> single ham I have. I'll be facing problems, right?
>
> -------------------------------------------------------
> This SF.net email is sponsored by: VM Ware
> With VMware you can run multiple operating systems on a single machine.
> WITHOUT REBOOTING! Mix Linux / Windows / Novell virtual machines at the
> same time. Free trial click here: http://www.vmware.com/wl/offer/345/0
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
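P.S. For anyone curious what one-, two-, and three-word tokens plus the "counted five or more times" cut look like in practice, here is a rough sketch. It is not my production code and not how SA tokenizes; the whitespace splitting, the function names, and the sample messages are assumptions made purely for illustration.

```python
from collections import Counter


def ngram_tokens(text, max_n=3):
    """Return all 1..max_n word tokens from a message body (naive whitespace split)."""
    words = text.lower().split()
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def count_tokens(messages, max_n=3):
    """Count how many times each token occurs across a corpus of messages."""
    counts = Counter()
    for message in messages:
        counts.update(ngram_tokens(message, max_n))
    return counts


def prune(counts, min_count=5):
    """Keep only tokens seen at least min_count times, to shrink the database."""
    return Counter({tok: c for tok, c in counts.items() if c >= min_count})


# Hypothetical toy corpus: one phrase repeated five times, one seen once.
corpus = ["free money now"] * 5 + ["hello there"]
all_counts = count_tokens(corpus)
kept = prune(all_counts, min_count=5)
print(len(all_counts), "tokens before pruning,", len(kept), "after")
```

Each extra word of token length adds nearly as many tokens per message as the message has words, and most of the longer ones are seen only once, which is why the count threshold shrinks the database so sharply.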