Re: Training Bayes properly

jdow Thu, 29 Jun 2006 20:10:16 -0700

From: "Rick Macdougall" <[EMAIL PROTECTED]>

Nigel Frankcom wrote:

On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe"

So it looks like I have to reset my Bayes and re-train it. I want to do
it properly this time. I will be making sure I personally review every
message that our users put into the spam folder first, to make sure they
haven't put spam into the wrong folder. However, I have a couple of
questions:

1) Am I better off to feed it a few emails a day, or wait until I get a
few hundred, then feed them all to sa-learn at once? Is there really a
difference?
2) How many spams should I feed it? I've heard in some places that 200
is OK, I've heard elsewhere that 10000 or more are needed.
3) Just how 'balanced' should its diet be? Should I use the same
quantity of ham as spam, or can I get away with less ham than spam?


The recommended minimum corpus is 200 spam and 200 ham; after that, add
messages as they are received. My initial corpus was around 500 of each and
my Bayes has remained stable for several years. The numbers should be
about equal, though in my experience they don't have to be exact.
If you train 200 ham and 2000 spam, though, you will skew the Bayes
scoring.
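
As a rough illustration of the sizing advice above, here is a minimal Python sketch that flags an initial corpus that is too small or too lopsided. The 200-message minimum comes from the paragraph above; the function name check_corpus and the max_ratio cut-off are hypothetical, chosen only for illustration, and are not SpamAssassin settings.

```python
def check_corpus(nspam, nham, minimum=200, max_ratio=2.0):
    """Return a list of warnings about an initial Bayes training corpus.

    minimum=200 follows the advice in the thread. max_ratio is a
    hypothetical cut-off picked for this sketch, not a SpamAssassin option.
    """
    warnings = []
    if nspam < minimum:
        warnings.append(f"only {nspam} spam, want at least {minimum}")
    if nham < minimum:
        warnings.append(f"only {nham} ham, want at least {minimum}")
    if min(nspam, nham) > 0:
        # Compare the larger pile to the smaller one, whichever way round.
        ratio = max(nspam, nham) / min(nspam, nham)
        if ratio > max_ratio:
            warnings.append(f"spam/ham counts differ by {ratio:.1f}x")
    return warnings


if __name__ == "__main__":
    for nspam, nham in [(500, 500), (2000, 200), (100, 100)]:
        print(nspam, nham, check_corpus(nspam, nham) or "looks OK")
```

With these assumed thresholds, 500 of each passes, while 2000 spam against 200 ham is flagged as a tenfold imbalance.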

Here, as FPs or FNs are reported, they are trained in accordingly.

I don't use the auto-train feature; I've personally found it to be
problematic.

Hi,

I use auto-train plus feed all my personal spam to Bayes (I get 100-400 spams a day in my personal email account because I've had the same address since 1995 and I get postmaster, dns, hostmaster, abuse etc).


After 6 months I'm at

0.000          0    1258041          0  non-token data: nspam
0.000          0     996687          0  non-token data: nham
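
For anyone who wants to track these counters over time, here is a minimal Python sketch that pulls nspam and nham out of lines shaped like the two above. It assumes the value sits in the third whitespace-separated column, as it does in those lines; the function name parse_magic is made up for this example.

```python
def parse_magic(text):
    """Extract 'non-token data' counters from dump lines like those above.

    Assumes each line looks like:
        0.000  0  <value>  0  non-token data: <name>
    with the value in the third column.
    """
    counts = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, _, name = line.partition("non-token data:")
        parts = fields.split()
        if len(parts) < 3:
            continue  # not in the expected shape, skip it
        try:
            counts[name.strip()] = int(parts[2])
        except ValueError:
            continue  # third column is not an integer, skip it
    return counts


SAMPLE = """\
0.000          0    1258041          0  non-token data: nspam
0.000          0     996687          0  non-token data: nham
"""

if __name__ == "__main__":
    counts = parse_magic(SAMPLE)
    print(counts)
    print("spam per ham: %.2f" % (counts["nspam"] / counts["nham"]))
```

On the figures quoted above this works out to roughly 1.26 learned spam for every learned ham.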

And my hit rates are


EEEEK! I bet you are running system-wide Bayes for a very non-homogeneous
collection of people. I've appended my figures (not the best I have
seen but very good) below yours. Your BAYES_00 is better than mine
only if you do not consider the figure I consider most significant,
the ratio of %OFHAM/%OFSPAM. Your BAYES_99 is worse than mine, either
in absolute terms or via the %OFSPAM/%OFHAM ratio.

For HAM
RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
   1    BAYES_00     22819    24.15   54.61    1.65   96.70

   1    BAYES_00     47047    11.65   57.35    0.05   78.57

And SPAM
RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
   4    BAYES_99     10419     4.64   24.93   57.28    0.05

   1    BAYES_99     18898     4.42   23.04   85.29    0.04
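
The ratio comparison mentioned above can be worked through from the two tables. The following Python sketch is only an illustration: the labels "first" and "second" stand for the first and second row of each table (per the "appended my figures below yours" remark, the second row is the reply's), and the function names are made up here.

```python
# (%OFSPAM, %OFHAM) pairs copied from the two tables above.
# "first" is the first row of each table, "second" the row appended below it.
FIGURES = {
    "first":  {"BAYES_00": (1.65, 96.70), "BAYES_99": (57.28, 0.05)},
    "second": {"BAYES_00": (0.05, 78.57), "BAYES_99": (85.29, 0.04)},
}


def selectivity(rule, pct_spam, pct_ham):
    """Ratio of wanted hits to unwanted hits for a Bayes rule."""
    if rule == "BAYES_00":
        # Ham rule: %OFHAM / %OFSPAM, higher is better.
        return pct_ham / pct_spam
    # Spam rule (BAYES_99): %OFSPAM / %OFHAM, higher is better.
    return pct_spam / pct_ham


def ratios(figures):
    """Compute the selectivity ratio for every row and rule."""
    return {
        who: {rule: selectivity(rule, *pcts) for rule, pcts in rules.items()}
        for who, rules in figures.items()
    }


if __name__ == "__main__":
    for who, rules in ratios(FIGURES).items():
        for rule, value in rules.items():
            print(f"{who:6s} {rule}: {value:8.1f}")
```

For BAYES_00 the first row comes out near 59 and the second near 1571, which is why the higher raw %OFHAM in the first row does not make it the better result by this measure.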

That 1.65% of spam hitting BAYES_00 is spam slipping through that I later learn as spam.


The slip-through on BAYES_00 hints that you can do better. The scoring
makes me think you need to feed escaped spams back through to learn
them as spam more often, if possible.

It's been stable now for the last 5 months with about 100K emails a day.


Whereas I do not use automatic anything and process on the order of
2500 per day. This is a fine YMMV example, isn't it?

{^_^}
