hello
I have been trying to retrain my Bayes database to correct whatever
strangeness had crept in to produce such dramatically different
numbers of spam and ham in the output of sa-learn --dump magic.
As recommended below, I collected 420 messages each of spam and ham
and checked them for wrongly classified ones. I then ran sa-learn
--clear, followed by sa-learn --spam /path/to/spam/mail and sa-learn
--ham /path/to/ham/mail. The sa-learn process reported that it learned
416 and 418 of the 420 supplied for the two types.
Ever since, the number of ham reported by sa-learn --dump magic has
grown much faster than the number of spam. After 3 hours nspam had
risen to 460 while nham was at 790. Now, 21 hours later, nspam is 468
while nham is 1619. At this rate the disparity that prompted my
original post will soon be reached again.
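For what it's worth, I have been watching the counters with nothing
fancier than a one-liner along these lines:

  sa-learn --dump magic | grep -E 'nspam|nham'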
Methinks this is not normal; is something wrong?
The only changes I have made from a standard package install on Debian
Linux (currently SpamAssassin 3.1.7) are the following (rough examples
of the configuration are sketched after the list):
(a) to use the sa-update mechanism to incorporate updated rules from
the channels saupdates.openprotect.com and updates.spamassassin.org
(b) occasionally adjusted downwards the scores of a few rules that
repeatedly cause false positives. Those FPs are cases where the total
score would be quite low except for particular rule hits scoring 2 or
more. The scores I have adjusted are for rules such as the MANGLED_
series (_WHILE, _TOOL, _MEN, _OFF, _GOOD, _NAIL, etc.), the TVD_
series, and individual ones like DEAR_SOMETHING and DATE_IN_FUTURE_12_24.
(c) added a couple of very simple rules capturing literal strings of
an offensive nature in Subject lines.
(d) added various blacklist_from and whitelist_from entries as
appropriate.
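To give a rough idea of what (a) through (d) look like here (the rule
name, pattern, scores and addresses below are made-up placeholders
rather than my actual entries, and the sa-update invocation is typed
from memory):

  # (a) pull rule updates from both channels (run from cron)
  sa-update --channel updates.spamassassin.org \
            --channel saupdates.openprotect.com

  # (b) in local.cf: lower the scores of rules that keep hitting ham
  score DEAR_SOMETHING 0.5
  score DATE_IN_FUTURE_12_24 0.5

  # (c) a primitive literal-string Subject rule (placeholder pattern)
  header   LOCAL_BAD_SUBJECT Subject =~ /some offensive phrase/i
  score    LOCAL_BAD_SUBJECT 2.0
  describe LOCAL_BAD_SUBJECT Subject contains a blocked phrase

  # (d) sender-based lists (placeholder addresses)
  whitelist_from *@example-friend.com
  blacklist_from *@example-spammer.net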
Are these kinds of adjustments ill-advised?
I receive far more ham than spam. Is this problematic for Bayesian
learning?
Any ideas why the ratio of ham to spam is growing at this rate, and
what changes could be made, if indeed this is actually a developing problem?
many thanks
r.
On 20/11/2007, at 4:39 PM, Matt Kettler wrote:
Rolf Loudon wrote:
What does the output of "sa-learn --dump magic" look like?
# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        297          0  non-token data: nspam
0.000          0     982365          0  non-token data: nham
0.000          0     160628          0  non-token data: ntokens
0.000          0 1195344836          0  non-token data: oldest atime
0.000          0 1195532636          0  non-token data: newest atime
0.000          0 1195532327          0  non-token data: last journal sync atime
0.000          0 1195517625          0  non-token data: last expiry atime
0.000          0     172800          0  non-token data: last expire atime delta
0.000          0      72520          0  non-token data: last expire reduction count
Thoughts?
That's a *really* unusual sa-learn dump, and would imply that bayes
was completely inactive until recently.
Note that there are over 900k messages that have been trained as ham
(ie: nonspam email), but only 297 trained as spam. That's very little
spam compared to the quantity of ham. Usually you see far more spam
than ham, not the reverse (50:1 spam to ham isn't unheard of.. but
this is 1:3307 the other way).
Did you do some really goofy hand training with sa-learn, or did the
autolearner really do that? If it's all autolearning, do you have a
lot of spam matching ALL_TRUSTED?
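One quick way to eyeball that (the path is just a placeholder; point
it at wherever your recent spam is stored, assuming the messages still
carry their X-Spam-Status headers) is to count the rule hits:

  grep -l ALL_TRUSTED /path/to/recent/spam/* | wc -l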
Also bayes won't become active until there are at least 200 spams and
200 hams, and given there's only 297 spams, it may not have crossed
that line until recently and bayes may have been disabled.
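If I recall correctly, those thresholds come from the
bayes_min_ham_num and bayes_min_spam_num settings, whose defaults look
like this in local.cf terms (shown only for illustration):

  # defaults; bayes stays inactive below these counts
  bayes_min_ham_num  200
  bayes_min_spam_num 200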
I'd be very concerned about the health of your bayes database. It's
possible the autolearner went awry and learned poorly here.
I would seriously consider doing the following, if at all possible:
1) round up a few hundred spam and nonspam messages as text files
(with complete headers)
2) run sa-learn --clear to wipe out your bayes database
3) use sa-learn --spam and sa-learn --ham to hand-train those messages
from step 1.
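Concretely, something like this (the directories are placeholders for
wherever you saved the messages from step 1; run it as the user whose
bayes database is actually in use):

  sa-learn --clear
  sa-learn --spam /path/to/saved/spam
  sa-learn --ham /path/to/saved/ham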
Once given a little hand training, usually the autolearner is fine
(with the occasional hand training to fix minor confusions, but it
looks like you're way past minor confusion...).