What does your "sa-learn --dump magic" output look like?
# sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 297 0 non-token data: nspam
0.000 0 982365 0 non-token data: nham
0.000 0 160628 0 non-token data: ntokens
0.000 0 1195344836 0 non-token data: oldest atime
0.000 0 1195532636 0 non-token data: newest atime
0.000 0 1195532327 0 non-token data: last journal sync atime
0.000 0 1195517625 0 non-token data: last expiry atime
0.000 0 172800 0 non-token data: last expire atime delta
0.000 0 72520 0 non-token data: last expire reduction count
Thoughts?
That's a *really* unusual sa-learn dump, and would imply that bayes was
completely inactive until recently.
Note that there are over 980k messages that have been trained as ham
(i.e. nonspam email), but only 297 trained as spam. That's very little
spam compared to the quantity of ham. Usually you see more spam than
ham, though not by a huge margin (50:1 spam to ham isn't unheard of),
but this is 1:3307 spam to ham.
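The ratio can be checked straight from the dump. The following is a
minimal sketch, not part of SpamAssassin: parse_dump_magic is a
hypothetical helper name, and it assumes the value of each "non-token
data" line sits in the third column, as it does in the output pasted
above.

```python
import re

# Matches e.g. "0.000 0 297 0 non-token data: nspam".
# Assumption: the third column carries the value for these lines.
MAGIC_LINE = re.compile(
    r"\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s*(.+?)\s*$"
)

def parse_dump_magic(text):
    """Return {label: integer value} for every 'non-token data' line."""
    values = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values

# The two counters from the dump in this thread.
SAMPLE = """\
0.000 0 297 0 non-token data: nspam
0.000 0 982365 0 non-token data: nham
"""

counts = parse_dump_magic(SAMPLE)
ham_per_spam = counts["nham"] // counts["nspam"]  # whole ham messages per spam
print(f"nspam={counts['nspam']} nham={counts['nham']} "
      f"ratio=1:{ham_per_spam} spam to ham")
```

In practice the text would come from the real "sa-learn --dump magic"
output rather than the SAMPLE string.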
Did you do some really goofy hand training with sa-learn, or did the
autolearner really do that? If it's all autolearning, do you have a lot
of spam matching ALL_TRUSTED?
I have not done hand training. The autolearner did it.
Spam is not kept for very long and in the collection I now have there
are no occurrences of ALL_TRUSTED.
I'd be very concerned about the health of your bayes database. It's
possible the autolearner went awry and learned poorly here.
I would seriously consider doing the following, if at all possible:
1) round up a few hundred spam and nonspam messages as text files (with
complete headers)
2) run sa-learn --clear to wipe out your bayes database
3) use sa-learn --spam and sa-learn --ham to hand-train those messages
from step 1.
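Steps 2 and 3 can be scripted roughly as follows. This is only a sketch:
the directory names corpus/spam and corpus/ham are made up for
illustration, retrain_commands and retrain are hypothetical helpers, and
the script defaults to a dry run that prints the commands without
touching the database. It assumes sa-learn is on the PATH and is run as
the user that owns the bayes database.

```python
import subprocess

def retrain_commands(spam_dir, ham_dir):
    """Build the sa-learn invocations for steps 2 and 3, in order."""
    return [
        ["sa-learn", "--clear"],           # step 2: wipe the bayes database
        ["sa-learn", "--spam", spam_dir],  # step 3: learn the spam samples
        ["sa-learn", "--ham", ham_dir],    # step 3: learn the ham samples
    ]

def retrain(spam_dir, ham_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in retrain_commands(spam_dir, ham_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if any sa-learn call fails
            subprocess.run(cmd, check=True)

# Hypothetical directories holding the messages gathered in step 1.
retrain("corpus/spam", "corpus/ham")
```

Pass dry_run=False only once the sample directories are in place, since
--clear discards everything the database has learned so far.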
I would like to do this. I have yet to find a way to extract ham
from our corporate mail system. It runs IBM's Lotus Notes and a mail
message is split up awkwardly into database fields and I know of no
way to extract a raw form message. Is there any such tool?
SA runs on a gateway box which does not store any mail passing
through it except that which it quarantines (for possible retrieval
in the case of false positives) as spam. So I have a source of spam
there.
Apart from hand training, what would be the overall effect of
clearing the bayes db as it currently stands and letting autolearn
start again?
Thanks for your help and suggestions thus far.
Once given a little hand training, the autolearner is usually fine (with
the occasional hand training to fix minor confusions, but it looks like
you're way past minor confusion...).