Nigel Frankcom wrote:
On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe"
<[EMAIL PROTECTED]> wrote:
So it looks like I have to reset my Bayes and re-train it. I want to do
it properly this time. I will be making sure I personally review every
message that our users put into the spam folder first, to make sure they
haven't put spam into the wrong folder. However, I have a couple of
questions:
1) Am I better off to feed it a few emails a day, or wait until I get a
few hundred, then feed them all to sa-learn at once? Is there really a
difference?
2) How many spams should I feed it? I've heard in some places that 200
is OK, I've heard elsewhere that 10000 or more are needed.
3) Just how 'balanced' should its diet be? Should I use the same
quantity of ham as spam, or can I get away with less ham than spam?
The recommended minimum corpus is 200 spam and 200 ham; after that, add
messages as they are received. My initial corpus was around 500 of each and
my bayes has remained stable for several years. The numbers should be
about equal, though in my experience they don't have to be exact.
If you train 200 ham and 2000 spam, though, you will skew the scoring in
bayes.
Here, FPs and FNs are trained in as they are reported.
I don't use the auto-train feature; I've personally found it to be
problematic.
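
A minimal sketch of the routine described above, assuming the reviewed mail sits in two directories (the paths are hypothetical) and that sa-learn is on the PATH. It only builds the sa-learn commands and checks the corpus sizes against the 200-message minimum mentioned in this thread; the max_ratio cutoff of 2.0 is an illustrative assumption, not a SpamAssassin setting.

```python
import shlex

MIN_PER_CLASS = 200  # minimum corpus size suggested in this thread


def corpus_ok(n_spam, n_ham, max_ratio=2.0):
    """True if both classes meet the minimum and neither class
    outnumbers the other by more than max_ratio (an assumed cutoff)."""
    if n_spam < MIN_PER_CLASS or n_ham < MIN_PER_CLASS:
        return False
    return max(n_spam, n_ham) / min(n_spam, n_ham) <= max_ratio


def training_commands(spam_dir, ham_dir):
    """Build (but do not run) one sa-learn call per reviewed directory."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


if __name__ == "__main__":
    print(corpus_ok(500, 500))    # roughly equal corpus
    print(corpus_ok(2000, 200))   # the skewed 10:1 case mentioned above
    # Hypothetical paths; replace with the reviewed folders.
    for cmd in training_commands("/path/to/reviewed/spam", "/path/to/reviewed/ham"):
        print(shlex.join(cmd))
```

Since sa-learn accepts a whole directory, the same two commands work whether the folders hold a day's mail or a few hundred messages.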
Hi,
I use auto-train and also feed all my personal spam to bayes (I get 100 -
400 spams a day in my personal email account because I've had the same
address since 1995 and I get postmaster, dns, hostmaster, abuse etc).
After 6 months I'm at
0.000 0 1258041 0 non-token data: nspam
0.000 0 996687 0 non-token data: nham
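
The two lines above appear to be taken from "sa-learn --dump magic" output. As a rough sketch, they can be parsed to get the trained spam-to-ham ratio; the field layout (count in the third column, label after "non-token data:") is assumed from the two lines shown here.

```python
def parse_magic(text):
    """Extract counters such as nspam/nham from dump-magic style lines."""
    counts = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:")
        # Assumed layout: the third whitespace-separated field is the count.
        counts[label.strip()] = int(fields.split()[2])
    return counts


SAMPLE = """\
0.000 0 1258041 0 non-token data: nspam
0.000 0 996687 0 non-token data: nham
"""

counts = parse_magic(SAMPLE)
ratio = counts["nspam"] / counts["nham"]
print(counts, round(ratio, 2))
```

For these figures the ratio comes out at about 1.26 spam per ham, so the database is not far from balanced.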
And my hit rates are
For HAM
RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
   1  BAYES_00   22819     24.15    54.61     1.65   96.70
And SPAM
RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
   4  BAYES_99   10419      4.64    24.93    57.28    0.05
The 1.65% of spam that hits BAYES_00 is spam slipping through, which I
later learn as spam.
It's been stable now for the last 5 months with about 100K emails a day.
Regards,
Rick