Troy Settle wrote on Tue, 18 Nov 2008 15:19:56 -0500:

> From incoming mail.

Well, but how? By auto-learning? In that case you are just compounding your 
problem: it seems a lot of spam is getting miscategorized as ham, and 
auto-learning that spam as ham reinforces the miscategorization. That is 
exactly the result you are seeing.

> 0.000          0      44946          0  non-token data: nspam
> 0.000          0      36757          0  non-token data: nham
> 0.000          0     545675          0  non-token data: ntokens

Looks fine, provided the ham tokens really were ham.

> 0.000          0 1227007705          0  non-token data: last expiry atime

> 0.000          0     393274          0  non-token data: last expire reduction count

Hm, you just did an expire that slashed your db almost in half? You may 
want to let it grow a bit.
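If the periodic expiry is what keeps trimming the db, you can raise the token 
ceiling in local.cf. A minimal sketch (the 1000000 figure is only an example, 
not a recommendation; size it to your mail volume):

```
# local.cf -- raise the token count at which Bayes expiry kicks in
# (the stock default is 150000 tokens)
bayes_expiry_max_db_size  1000000

# ...or disable automatic expiry and run "sa-learn --force-expire"
# from cron at a quiet time instead
bayes_auto_expire         0
```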

> 
> FWIW, how bad would I screw things up if I were to override the BAYES_00 
> score to 0?

Since it is causing you grief now, probably not much. It just means that real 
ham which also hits BAYES_00 will no longer enjoy the benefit of that 
negative score. Switching Bayes off entirely for a while may be the better 
option.
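Either change is a one-liner in local.cf; a sketch of both variants:

```
# local.cf -- neutralize only the BAYES_00 rule
score BAYES_00 0

# ...or take Bayes out of scoring entirely for a while
use_bayes 0
```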

I would start over with that db.

1. Stop Bayes and check how categorization works without it. In theory you 
should already see a good number of spam messages miscategorized as ham even 
without Bayes.

2. Collect some ham and spam where you can be absolutely sure they are in 
the right category, then train Bayes with these. Keep auto-learning for 
Bayes switched off for a while.

3. Switch Bayes back on with your new db and check whether it categorizes 
better now.

4. If it does, switch auto-learning back on, but move the auto-learning 
threshold for ham down a bit, so that the chance of spam creeping in is 
smaller.
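The steps above roughly translate into the following mix of local.cf lines 
and sa-learn invocations (corpus paths are placeholders; adjust to your 
setup):

```
# 1. take Bayes out of scoring for now (local.cf)
use_bayes 0

# 2. wipe the old db and retrain from hand-verified corpora,
#    with auto-learning off (local.cf: bayes_auto_learn 0)
sa-learn --clear
sa-learn --ham  /path/to/verified-ham/
sa-learn --spam /path/to/verified-spam/

# 3. re-enable Bayes (local.cf: use_bayes 1) and watch the results

# 4. if results look good, turn auto-learning back on with a
#    stricter ham threshold (stock default is 0.1; lower means
#    only very confidently hammy mail gets auto-learned as ham)
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam -0.5
```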


Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
