From: "Rick Macdougall" <[EMAIL PROTECTED]>
Nigel Frankcom wrote:
On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe"
So it looks like I have to reset my Bayes and re-train it. I want to do
it properly this time. I will be making sure I personally review every
message that our users put into the spam folder first, to make sure they
haven't put spam into the wrong folder. However, I have a couple of
questions:
1) Am I better off feeding it a few emails a day, or waiting until I get
a few hundred and then feeding them all to sa-learn at once? Is there
really a difference?
2) How many spams should I feed it? I've heard in some places that 200
is OK; elsewhere I've heard that 10000 or more are needed.
3) Just how 'balanced' should its diet be? Should I use the same
quantity of ham as spam, or can I get away with less ham than spam?
The recommended minimum corpus is 200 spam and 200 ham, then add more
on an as-received basis. My initial corpus was around 500 of each and
my bayes has remained stable for several years. The numbers should be
about equal, though in my experience they don't have to be exact.
Though if you do 200 ham and 2000 spam you will skew the scoring in
bayes.
Here, as FPs or FNs are reported, they are trained in accordingly.
I don't use the auto-train feature; I've personally found it to be
problematic.
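The rule of thumb above (at least 200 of each, roughly equal counts, and a 10:1 split such as 200 ham to 2000 spam being too skewed) can be sketched as a quick sanity check before training. This is only an illustration: the function name `corpus_issues` and the `max_ratio=2.0` cutoff are assumptions made up for this sketch, not anything SpamAssassin defines or enforces.

```python
def corpus_issues(nspam, nham, min_each=200, max_ratio=2.0):
    """Return a list of human-readable problems with a training corpus.

    min_each follows the 200/200 minimum quoted in the thread.
    max_ratio is an assumed cutoff for "about equal"; tune to taste.
    """
    issues = []
    if nspam < min_each:
        issues.append(f"only {nspam} spam, want at least {min_each}")
    if nham < min_each:
        issues.append(f"only {nham} ham, want at least {min_each}")
    if nspam > 0 and nham > 0:
        # Compare the larger class to the smaller one, whichever it is.
        ratio = max(nspam, nham) / min(nspam, nham)
        if ratio > max_ratio:
            issues.append(f"imbalanced by {ratio:.1f}x")
    return issues


print(corpus_issues(500, 500))    # a corpus like the 500/500 one described above
print(corpus_issues(2000, 200))   # the skewed 200 ham / 2000 spam case
```

An empty list means the corpus passes both checks; anything else is a reason to collect more mail before running sa-learn.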
Hi,
I use auto-train plus feed all my personal spam to bayes (I get 100 -
400 spams a day in my personal email account because I've had the same
address since 1995 and I get postmaster, dns, hostmaster, abuse etc).
After 6 months I'm at
0.000 0 1258041 0 non-token data: nspam
0.000 0 996687 0 non-token data: nham
And my hit rates are
EEEEK! I bet you are running system wide Bayes for a very non-homogeneous
collection of people. I've appended my figures (not the best I have
seen but very good) below yours. Your BAYES_00 is better than mine
only if you do not consider the figure I consider most significant,
the ratio of %OFHAM/%OFSPAM. Your BAYES_99 is worse than mine either
in absolute terms or via the %OFSPAM/%OFHAM ratio.
For HAM
RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
   1  BAYES_00   22819     24.15    54.61     1.65   96.70
   1  BAYES_00   47047     11.65    57.35     0.05   78.57
And SPAM
RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
   4  BAYES_99   10419      4.64    24.93    57.28    0.05
   1  BAYES_99   18898      4.42    23.04    85.29    0.04
That 1.65% of spam hitting BAYES_00 is spam slipping through that I learn
later as spam.
The slip through on BAYES_00 hints you can do better. The scoring
makes me think you need to feed escaped spams back through to learn
them as spam more often, if possible.
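For anyone who wants to check the ratio argument, here is a small sketch that computes %OFHAM/%OFSPAM for BAYES_00 and %OFSPAM/%OFHAM for BAYES_99 from the four rows quoted above. The labels "first row" and "second row" and the function name `selectivity` are just names chosen for this sketch; the percentages are copied from the tables as posted.

```python
def selectivity(wanted_pct, unwanted_pct):
    """Ratio of the percentage a rule should hit to the percentage it should not."""
    return wanted_pct / unwanted_pct


# (%OFSPAM, %OFHAM) for each quoted row, in the order they appear in the thread.
bayes_00 = {"first row": (1.65, 96.70), "second row": (0.05, 78.57)}
bayes_99 = {"first row": (57.28, 0.05), "second row": (85.29, 0.04)}

for label, (spam_pct, ham_pct) in bayes_00.items():
    # BAYES_00 should fire on ham, so ham is the wanted class.
    print(f"BAYES_00 {label}: %OFHAM/%OFSPAM = {selectivity(ham_pct, spam_pct):.1f}")

for label, (spam_pct, ham_pct) in bayes_99.items():
    # BAYES_99 should fire on spam, so spam is the wanted class.
    print(f"BAYES_99 {label}: %OFSPAM/%OFHAM = {selectivity(spam_pct, ham_pct):.1f}")
```

On these numbers the first BAYES_00 row comes out near 59 against roughly 1571 for the second, even though its raw %OFHAM is higher, which is the point being made about the ratio mattering more than the absolute figure.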
It's been stable now for the last 5 months with about 100K emails a day.
Whereas I do not use automatic anything and process on the order of
2500 per day. This is a fine YMMV example, isn't it?
{^_^}