I've been running a Bayesian filter at my company since about the time Paul
Graham published his paper.  A lack of ham will result in some false
positives, where messages that are not spam are incorrectly marked as spam.
There is a point of diminishing returns to adding ham, but I haven't found it
yet.  I do know you can get reasonable mileage out of small corpora
consisting of just several thousand messages.  It seems the more ham and spam
I have, the more accurate my database.  Adding fresh messages is necessary to
keep up with the changing times.
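
To illustrate why thin ham coverage hurts, here is a minimal sketch of the
per-token estimate along the lines of Paul Graham's "A Plan for Spam" (ham
counts doubled, result clamped to 0.01..0.99).  This is not necessarily what
SpamAssassin or my own database computes; the function name and the counts in
the usage note are made up for illustration.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Graham-style estimate that a message containing this token is spam.

    spam_count / ham_count: how often the token was seen in each corpus.
    n_spam / n_ham: number of messages in each corpus.
    Returns None if the token was never seen at all.
    """
    good = 2 * ham_count  # Graham doubles ham counts to bias against false positives
    bad = spam_count
    bad_ratio = min(1.0, bad / n_spam)
    good_ratio = min(1.0, good / n_ham)
    if bad_ratio + good_ratio == 0:
        return None
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, p))


# Hypothetical token seen 40 times in 9,000 spam messages.
# Small ham corpus that happens never to contain it:
thin_ham = token_spam_probability(40, 0, 9000, 1000)
# Larger ham corpus in which it turns up 200 times:
rich_ham = token_spam_probability(40, 200, 9000, 10000)
```

With the thin ham corpus the token is pinned at the 0.99 ceiling; with the
larger one it drops to about 0.1, because the filter has finally seen the
word in legitimate mail.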

I have archived 75,000 spam (100% pure) and 75,000 nonspam (99.9% pure)
messages, which today I am re-compiling into a Bayesian database
(MySQL-based).  In this rendition I am using three-word tokens in addition
to one- and two-word tokens, and I hope to run statistics on how this
affects accuracy.  The master database is about four gigs with all tokens,
and 600 megs with only the tokens counted five or more times.  I would like
to offer it for download as an SA Bayes database, if I can get it compressed
to under 100 megs.
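
For anyone curious what one-, two- and three-word tokens plus a "five or
more" cutoff look like in practice, here is a rough sketch.  It is not my
actual code: it splits on whitespace only, keeps counts in memory rather than
in MySQL, and the function names are invented for the example.

```python
from collections import Counter


def word_ngrams(text, max_n=3):
    """Yield every run of 1..max_n consecutive words as a single token."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def count_tokens(messages, max_n=3):
    """Count token occurrences across an iterable of message bodies."""
    counts = Counter()
    for body in messages:
        counts.update(word_ngrams(body, max_n))
    return counts


def prune(counts, min_count=5):
    """Keep only tokens seen at least min_count times."""
    return {tok: c for tok, c in counts.items() if c >= min_count}


# Tiny made-up corpus: five copies of one message and one stray message.
corpus = ["free money"] * 5 + ["free lunch"]
kept = prune(count_tokens(corpus))
```

Here `kept` holds "free", "money" and "free money"; "lunch" and "free lunch"
were only seen once and are dropped.  Most multi-word tokens in a real corpus
are rare like that, which is why the cutoff shrinks the database so much.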

I hope to work on that later this week.  We will find out how fast and
durable the SA Bayes database message adder is.  I have had to rewrite mine
several times to give it enterprise features.  It still takes about
twenty-four hours for my 1.13 gigahertz server to crunch the 150,000
messages.

Fox


----- Original Message -----
From: "Pierre Beck" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, July 17, 2003 9:17 AM
Subject: [SAtalk] Bayes balance problem


> Hi,
>
> I'm afraid my Bayes filter will learn too much spam. I'm getting 90%
> Spam, so I can't give an equal amount of ham, even if I sa-learn every
> single ham I have. I'll be facing problems, right?
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: VM Ware
> With VMware you can run multiple operating systems on a single machine.
> WITHOUT REBOOTING! Mix Linux / Windows / Novell virtual machines at the
> same time. Free trial click here: http://www.vmware.com/wl/offer/345/0
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk



