Hello Bill,

Saturday, September 6, 2003, 5:46:29 PM, you wrote:

RM>> All spam is then kept to be used as part of our corpus. Our spam
RM>> corpus is nearing 20k messages -- we'll probably start deleting the
RM>> oldest spam shortly.  

BP> Here's a newbie question for you:

BP> When you say "used as part of our corpus," what are you actually
BP> using it for? Are you using it to train the Bayesian filter system?

Mostly I use it for identifying patterns within spam, creating and then
validating rules, and determining reasonable scores for my new rules.

Example: A short while ago I developed a rule which I've named
"toomany" -- the email is addressed to too many people (20 or more). I
thought that this would be a good rule for identifying spam. I found that
while it does match a whole lot of spam (456 messages), it also matches a
goodly amount of ham (118). So while I've kept the rule and am including
it as part of my anti-spam arsenal, I score it only 0.38.
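
To make the idea concrete, here is a rough Python sketch of that kind of check. It is not the actual SA rule. It assumes "addressed to" means the addresses in the To and Cc headers, and the function names are made up for illustration. The last helper only shows how the hit counts above translate into a ratio.

```python
from email import message_from_string
from email.utils import getaddresses

TOO_MANY = 20  # threshold from the description: 20 or more recipients


def recipient_count(raw_message: str) -> int:
    """Count the addresses in the To and Cc headers (an assumption about
    what "addressed to" covers; the real rule may look elsewhere)."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    return len(getaddresses(headers))


def is_toomany(raw_message: str) -> bool:
    """True when the message has TOO_MANY or more visible recipients."""
    return recipient_count(raw_message) >= TOO_MANY


def spam_fraction(spam_hits: int, ham_hits: int) -> float:
    """Fraction of all rule matches that were spam."""
    return spam_hits / (spam_hits + ham_hits)
```

With the counts above, spam_fraction(456, 118) comes out at roughly 0.79, so about one match in five is ham. That is why the rule gets a low score rather than a high one.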

Yes, I do train Bayes.

Bayes Training # 1 -- wrongly classified email

Whenever I get a false negative (spam which doesn't meet my 9.0
threshold), I feed it to Bayes, and determine what else I need to do to
flag the spam, if anything.  I use two related methods of feeding these
to Bayes, depending on the recipient.

All ham sent to POP3 users who download their email is also copied to a
hamtrap. When false negatives are found, I copy the offending email to a
"spam to learn" mailbox file.

Webmail users have a "spam to learn" folder and a "ham to learn" folder.
When they find misclassified email, they use their internal webmail links
to copy the offending email to the appropriate folder.

There's a cron job which checks for these Bayes feeder files hourly, and
issues the appropriate sa-learn command against them when found.
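
A job like that could look something like the Python sketch below. The feeder file names are hypothetical (the real ones are not given here), and emptying each file after learning is an assumption about how to avoid learning it twice. The sa-learn options used (--spam, --ham, --mbox) are the standard ones.

```python
import subprocess
from pathlib import Path

# Hypothetical feeder file names mapped to the sa-learn mode to use.
FEEDERS = {
    "spam-to-learn": "--spam",
    "ham-to-learn": "--ham",
}


def pending_commands(feeder_dir):
    """Build one sa-learn command per feeder mailbox that exists and is
    not empty. Nothing is executed here."""
    commands = []
    for name, kind in FEEDERS.items():
        mbox = Path(feeder_dir) / name
        if mbox.is_file() and mbox.stat().st_size > 0:
            commands.append(["sa-learn", kind, "--mbox", str(mbox)])
    return commands


def run_pending(feeder_dir):
    """Run each pending command, then empty the mailbox it read
    (assumption: truncating is how the file gets reset)."""
    for cmd in pending_commands(feeder_dir):
        subprocess.run(cmd, check=True)
        Path(cmd[-1]).write_text("")
```

Run hourly from cron, run_pending() would do nothing when no feeder file has content. A real job would also need to guard against mail arriving in a file while it is being processed.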

Bayes Training # 2 -- improving classification

Approximately once a week I go through the recently collected spam and
identify those that scored less than Bayes-80. I dump those into a
learn-as-spam mailbox. I then go through the recently collected and
verified ham, and dump an approximately equal number into the
learn-as-ham mailbox.
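
The selection step can be sketched in Python as follows. This assumes "less than Bayes-80" means a Bayes probability below 0.80, that each message is available as a (message_id, bayes_probability) pair, and that the ham list is already ordered most recent first. None of those details are spelled out above, so treat it as an illustration only.

```python
def pick_training_batch(spam, ham, cutoff=0.80):
    """Pick spam that Bayes scored below the cutoff, plus an equal
    number of ham messages to keep the training balanced.

    spam and ham are lists of (message_id, bayes_probability) pairs.
    Returns (spam_to_learn, ham_to_learn) as lists of message ids.
    """
    spam_to_learn = [mid for mid, prob in spam if prob < cutoff]
    # Take as many ham messages as spam; fewer if not enough ham exists.
    ham_to_learn = [mid for mid, _ in ham[:len(spam_to_learn)]]
    return spam_to_learn, ham_to_learn
```

The two returned lists correspond to the learn-as-spam and learn-as-ham mailboxes.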

Bayes Training # 3 -- Restart

When one of my domains moved from one server to another, we lost the
Bayes database for some reason. The files looked OK, but SA couldn't
access them according to the spamassassin debugging output. So I deleted
the files, and dumped the 500 most recent messages from each corpus
(spam and ham) into the appropriate sa-learn mailbox.

BP> Since I don't want the stuff appended by SA to be part of the email
BP> used to train Bayesian, I have to go through each message (I use PINE
BP> for this) and write out the ORIGINAL message to a separate file which
BP> I then use for training.   

SA knows its own headers, and ignores them during the learning process. I
make no changes at all to the messages fed into Bayes.

To me that's a non-issue anyway, since Bayes looks at ALL tokens in the
message. Since Bayes does its testing before SA adds headers to the
message, those headers aren't in the message being tested to confuse
Bayes.

BP> Is there an easier way to do this? If I had to worry about thousands
BP> of messages, that would represent a large chunk of my time, manually
BP> writing out each original message.

Don't bother cleaning up your messages.  The important thing is to make
absolutely sure you don't feed spam as ham or ham as spam. As long as you
have that set, just dump the messages into sa-learn as needed.

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk