What does "sa-learn --dump magic" output look like?

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        297          0  non-token data: nspam
0.000          0     982365          0  non-token data: nham
0.000          0     160628          0  non-token data: ntokens
0.000          0 1195344836          0  non-token data: oldest atime
0.000          0 1195532636          0  non-token data: newest atime
0.000          0 1195532327          0  non-token data: last journal sync atime
0.000          0 1195517625          0  non-token data: last expiry atime
0.000          0     172800          0  non-token data: last expire atime delta
0.000          0      72520          0  non-token data: last expire reduction count

Thoughts?
That's a *really* unusual sa-learn dump, and would imply that bayes was
completely inactive until recently.
Note that there are roughly 980k messages that have been trained as ham (ie:
nonspam email), but only 297 trained as spam. That's very little spam
compared to the quantity of ham. Usually you see more spam than ham, but
not by that large a margin (50:1 spam to ham isn't unheard of.. but this
is 1:3307 the other way).
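That ratio is easy to sanity-check by scripting against the dump. A minimal sketch (the dump lines are copied inline from the output above, so no live bayes database is needed):

```shell
# nspam/nham lines as they appear in "sa-learn --dump magic" output
dump='0.000          0        297          0  non-token data: nspam
0.000          0     982365          0  non-token data: nham'

# field 3 of each line carries the message count
nspam=$(printf '%s\n' "$dump" | awk '/nspam/ {print $3}')
nham=$(printf '%s\n' "$dump" | awk '/nham/ {print $3}')

echo "spam=$nspam ham=$nham ratio=1:$((nham / nspam))"
# prints: spam=297 ham=982365 ratio=1:3307
```

On a live system the same pipeline works with `sa-learn --dump magic` in place of the heredoc-style variable.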

Did you do some really goofy hand training with sa-learn, or did the
autolearner really do that? If it's all autolearning, do you have a lot
of spam matching ALL_TRUSTED?

I have not done hand training. The autolearner did it.

Spam is not kept for very long and in the collection I now have there are no occurrences of ALL_TRUSTED.

I'd be very concerned about the health of your bayes database. It's
possible the autolearner went awry and learned poorly here.

I would seriously consider doing the following, if at all possible:

1) round up a few hundred spam and nonspam messages as text files (with
complete headers)
2) run sa-learn --clear to wipe out your bayes database
3) use sa-learn --spam and sa-learn --ham to hand-train those messages
from step 1.
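Steps 2 and 3 above might look like this as a shell session (the corpus paths are placeholders, not real locations):

```shell
# 2) wipe out the existing (suspect) bayes database
sa-learn --clear

# 3) hand-train on the corpora gathered in step 1
#    (directories of individual messages with complete headers;
#    the paths here are illustrative)
sa-learn --spam /path/to/corpus/spam/
sa-learn --ham  /path/to/corpus/ham/

# verify that nspam/nham now reflect the hand-trained counts
sa-learn --dump magic
```

Run this as the same user (or with the same bayes_path) that SpamAssassin uses at scan time, or the retrained database won't be the one consulted.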

I would like to do this. I have yet to find a way to extract ham from our corporate mail system, though. It runs IBM's Lotus Notes, where a mail message is split awkwardly across database fields, and I know of no way to extract a message in raw form. Is there any such tool?

SA runs on a gateway box which does not store any mail passing through it, except what it quarantines as spam (for possible retrieval in the case of false positives). So I have a source of spam there.

Apart from hand learning, what would be the overall effect of clearing the bayes db as it currently stands and letting autolearn start again?

Thanks for your help and suggestions thus far.

Once given a little hand training, the autolearner is usually fine (with
occasional hand training to fix minor confusions, but it looks like
you're way past minor confusion...).



