Re: bayes_99 matching since sa-update

Rolf Loudon Tue, 20 Nov 2007 17:56:46 -0800

What's a "sa-learn --dump magic" output look like?
# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        297          0  non-token data: nspam
0.000          0     982365          0  non-token data: nham
0.000          0     160628          0  non-token data: ntokens
0.000          0 1195344836          0  non-token data: oldest atime
0.000          0 1195532636          0  non-token data: newest atime
0.000          0 1195532327          0  non-token data: last journal sync atime
0.000          0 1195517625          0  non-token data: last expiry atime
0.000          0     172800          0  non-token data: last expire atime delta
0.000          0      72520          0  non-token data: last expire reduction count

Thoughts?
That's a *really* unusual sa-learn dump, and would imply that bayes was
completely inactive until recently.
Note that there are more than 980k messages that have been trained as
ham (ie: nonspam email), but only 297 trained as spam. That's very
little spam compared to the quantity of ham. Usually you see more spam
than ham (50:1 spam to ham isn't unheard of), but this is 1:3307 spam
to ham.
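
For anyone wanting to check the same figure on their own box, here is a minimal sketch. It assumes the dump keeps the column layout shown earlier in this thread (count in the third column, label at the end of the line). The function name ham_spam_ratio is made up for illustration and is not part of SpamAssassin.

```shell
# Hypothetical helper: reads "sa-learn --dump magic" output on stdin and
# prints how many ham messages were learned per spam message.
ham_spam_ratio() {
    awk '
        $NF == "nspam" { nspam = $3 }   # third column holds the message count
        $NF == "nham"  { nham  = $3 }
        END {
            if (nspam > 0) printf "ham per spam: %d\n", int(nham / nspam)
            else           print  "no spam learned yet"
        }
    '
}

# Usage: sa-learn --dump magic | ham_spam_ratio
```

Fed the dump above, this should report about 3307 ham per spam, which matches the 1:3307 figure.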

Did you do some really goofy hand training with sa-learn, or did the
autolearner really do that? If it's all autolearning, do you have a lot
of spam matching ALL_TRUSTED?


I have not done hand training. The autolearner did it.

Spam is not kept for very long and in the collection I now have there are no occurrences of ALL_TRUSTED.

I'd be very concerned about the health of your bayes database. It's
possible the autolearner went awry and learned poorly here.

I would seriously consider doing the following, if at all possible:

1) round up a few hundred spam and nonspam messages as text files (with
complete headers)
2) run sa-learn --clear to wipe out your bayes database
3) use sa-learn --spam and sa-learn --ham to hand-train those messages
from step 1.
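
A rough sketch of steps 2 and 3, assuming the messages from step 1 were saved one per file under two directories. The SPAM_DIR and HAM_DIR paths and the retrain_bayes function name are placeholders, not anything SpamAssassin defines. The SA_LEARN variable only exists so the command can be swapped out for a dry run.

```shell
#!/bin/sh
# Assumed corpus locations: one raw message (complete headers) per file.
SPAM_DIR="${SPAM_DIR:-./corpus/spam}"
HAM_DIR="${HAM_DIR:-./corpus/ham}"
SA_LEARN="${SA_LEARN:-sa-learn}"

retrain_bayes() {
    "$SA_LEARN" --clear            || return 1   # step 2: wipe the bayes database
    "$SA_LEARN" --spam "$SPAM_DIR" || return 1   # step 3: hand-train the spam
    "$SA_LEARN" --ham  "$HAM_DIR"  || return 1   # step 3: hand-train the ham
    "$SA_LEARN" --dump magic                     # check the new nspam/nham counts
}

# Usage: run as the user that owns the bayes database, then call
#   retrain_bayes
```

Setting SA_LEARN=echo before calling retrain_bayes prints the commands instead of running them, which is a cheap way to check the paths before wiping anything.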

I would like to do this. I have yet to find a way to extract ham from our corporate mail system. It runs IBM's Lotus Notes and a mail message is split up awkwardly into database fields and I know of no way to extract a raw form message. Is there any such tool?

SA runs on a gateway box which does not store any mail passing through it except that which it quarantines (for possible retrieval in the case of false positives) as spam. So I have a source of spam there.

Apart from hand learning, what would be the overall effect of clearing the bayes db as it currently stands and having autolearn start again?


Thanks for your help and suggestions thus far.

Once given a little hand training, usually the autolearner is fine (with
the occasional hand training to fix minor confusions, but it looks like
you're way past minor confusion...).



