From: "Gene Heskett" <[EMAIL PROTECTED]>
On Tuesday 07 February 2006 15:27, Clay Davis wrote:
Does anyone have any good techniques for capturing a sample of ham
that can be used as the ham corpus? I'm in a corporate environment
and am not keen on the idea of intercepting non-spam messages. I
will if I have to, but was hoping someone had a better idea.
I wouldn't have too guilty a conscience on that subject because
generally, you won't be reading very much other than intercepted spam.
There may be an FP (false positive) in there occasionally, but you'll
soon learn to catch those and feed them to the ham learner, and hence
move them to the correct mailbox folder. In other words, to make an
omelette, you normally have to break a few eggs. What you accidentally
read in an FP should be treated with the usual amount of salt and
otherwise forgotten.
Intercept some ham, feed it through SpamAssassin's sa-learn, forget to store
it on the way out. You don't have to know WHAT you trained with. You just
have to know it's ham.
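
A rough sketch of that "train and forget" step, in Python. It assumes
sa-learn is on the PATH and that it reads a single message from standard
input when no file argument is given; learn_ham and its parameters are
made-up names for illustration, not part of SpamAssassin.

```python
import subprocess


def learn_ham(raw_message, run=subprocess.run, sa_learn="sa-learn"):
    """Pipe one raw message to `sa-learn --ham` on stdin.

    The message only ever lives in memory here; this function never
    writes it to disk, so nobody has to read it afterwards.
    `run` is injectable so the call can be exercised without sa-learn.
    """
    cmd = [sa_learn, "--ham"]
    # check=True raises CalledProcessError if sa-learn exits non-zero.
    return run(cmd, input=raw_message, check=True)
```

Hook it in wherever your mail pipeline already has the raw message in
hand, e.g. learn_ham(message_bytes), and let the message continue on to
its recipient as usual.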
Now, if you are in a corporate environment and don't have a strong email
policy, you'd best put one in place first. Then you can sample the email,
with some discretion, legally and properly to get a test set of ham messages.
It MAY even be good corporate policy to save, for at least a short time,
all incoming and outgoing emails. 3 months to 6 months may be OK. This
will be handy if an employee is caught engaging in illegal activities
and must be terminated for cause, for example. Just make sure that the
company has a firm and clear email policy with regard to permissible
uses, and notify the employees that the company reserves the right to
read emails in and out. If you don't, your company could face some
"interesting times" if the fit hits the shan.
{^_^}
Regards,
Clay
On 2/7/2006 at 3:16 pm, in message <[EMAIL PROTECTED]>,
Matt Kettler
<[EMAIL PROTECTED]> wrote:
[EMAIL PROTECTED] wrote:
Can you just feed spamassassin spam or do you need to give it ham also?
I read the docs and it didn't say you had to feed it ham.
I then read another doc and it said you should feed it equal amounts
of spam and ham.
Yes, you really should feed it both. You also should strive for a 1:1 ratio of
spam and nonspam, but don't kill yourself to get there.
SA's use of chi-squared combining makes it very tolerant of wild imbalances in
training. However, the closer you are to a 1:1 ratio the better SA will be able
to distinguish tokens that are present in both kinds of mail and ignore them. So
this is a worthwhile goal to strive for as long as it doesn't become a burden.
My current training ratio is about 7:1 spam:nonspam, but in the past it's been
as bad as 20:1. Both of those are very far off from equal amounts,
but the imbalance has never caused me any problems.
From my sa-learn --dump magic output as of today:
0.000 0 995764 0 non-token data: nspam
0.000 0 145377 0 non-token data: nham
That works out to a ratio of 6.85:1
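
For anyone who wants to check their own ratio the same way, here is a
minimal Python sketch. It assumes the dump lines have the whitespace-
separated shape shown above (value in the third column, name after
"non-token data:"); parse_counts and spam_ham_ratio are illustrative
names, not anything shipped with SpamAssassin.

```python
def parse_counts(dump_text):
    """Collect the 'non-token data' values from `sa-learn --dump magic` text."""
    counts = {}
    for line in dump_text.splitlines():
        fields = line.split()
        # Assumed line shape: "0.000 0 <value> 0 non-token data: <name>"
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            counts[fields[6]] = int(fields[2])
    return counts


def spam_ham_ratio(dump_text):
    """Return learned spam messages per learned ham message."""
    counts = parse_counts(dump_text)
    return counts["nspam"] / counts["nham"]


SAMPLE = """\
0.000 0 995764 0 non-token data: nspam
0.000 0 145377 0 non-token data: nham
"""

print(f"{spam_ham_ratio(SAMPLE):.2f}:1")  # prints 6.85:1
```

Paste the output of sa-learn --dump magic in place of SAMPLE to get the
figure for your own database.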
--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.