Hello Bill,

Saturday, September 6, 2003, 5:46:29 PM, you wrote:

RM>> All spam is then kept to be used as part of our corpus. Our spam
RM>> corpus is nearing 20k messages -- we'll probably start deleting the
RM>> oldest spam shortly.

BP> Here's a newbie question for you:

BP> When you say "used as part of our corpus," what are you actually
BP> using it for? Are you using it to train the Bayesian filter system?

Mostly I use it for identifying patterns within spam, creating and then
validating rules, and determining reasonable scores for my new rules.

Example: I have a rule developed a short while ago which I've named
"toomany" -- the email is addressed to too many people (20 or more). I
thought that this would be a good rule for identifying spam. I found
that while it does match a whole lot of spam (456 messages), it also
matches a goodly amount of ham (118). So while I've kept the rule and am
including it as part of my anti-spam arsenal, I score it only 0.38.

Yes, I do train Bayes.

Bayes Training # 1 -- wrongly classified email

Whenever I get a false negative (spam which doesn't meet my 9.0
threshold), I feed it to Bayes, and determine what else I need to do to
flag the spam, if anything.

I use two related methods of feeding these to Bayes, depending on the
recipient.

All ham sent to POP3 users who download their email is also copied to a
hamtrap. When false negatives are found, I copy the offending email to a
"spam to learn" mailbox file.

Webmail users have a "spam to learn" folder and a "ham to learn" folder.
When they find misclassified email, they use their internal webmail
links to copy the offending email to the appropriate folder.

There's a cron job which checks for these Bayes feeder files hourly, and
issues the appropriate sa-learn command against them when found.

Bayes Training # 2 -- improving classification

Approximately once a week I go through the recently collected spam and
identify those that scored less than Bayes-80. I dump those into a
learn-as-spam mailbox.
I then go through the recently collected and verified ham, and dump an
approximately equal number into the learn-as-ham mailbox.

Bayes Training # 3 -- Restart

When one of my domains moved from one server to another, we lost the
Bayes database for some reason. The files looked OK, but SA couldn't
access them according to the spamassassin debugging output. So I deleted
the files, and dumped the 500 most recent messages from each into the
appropriate sa-learn mailbox.

BP> Since I don't want the stuff appended by SA to be part of the email
BP> used to train Bayesian, I have to go through each message (I use PINE
BP> for this) and write out the ORIGINAL message to a separate file which
BP> I then use for training.

SA knows its own headers, and ignores them during the learning process.
I make no changes at all to the messages fed into Bayes.

To me that's a non-issue anyway, since Bayes looks at ALL tokens in the
message. Since Bayes does its testing before SA adds headers to the
message, those headers aren't in the message being tested to confuse
Bayes.

BP> Is there an easier way to do this? If I had to worry about thousands
BP> of messages, that would represent a large chunk of my time, manually
BP> writing out each original message.

Don't bother cleaning up your messages. The important thing is to make
absolutely sure you don't feed spam as ham or ham as spam. As long as
you have that set, just dump the messages into sa-learn as needed.

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
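
For anyone wanting to try the hourly feeder job described under Bayes
Training # 1, here is a minimal sketch in Python. It is not the actual
cron job from the post, which isn't shown. The directory
/var/spool/bayes-feed, the file names spam-to-learn and ham-to-learn,
and the function names are all assumptions made up for illustration.
The only real tool it calls is sa-learn, using its --spam, --ham and
--mbox options.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one mbox feeder file per class in one directory.
FEEDER_DIR = Path("/var/spool/bayes-feed")  # assumed path, adjust to taste
FEEDERS = {
    "spam-to-learn": "--spam",  # assumed file name for false negatives
    "ham-to-learn": "--ham",    # assumed file name for false positives
}


def pending_commands(feeder_dir=FEEDER_DIR):
    """Return (mbox_path, argv) pairs for every non-empty feeder file."""
    jobs = []
    for name, flag in FEEDERS.items():
        mbox = Path(feeder_dir) / name
        if mbox.is_file() and mbox.stat().st_size > 0:
            jobs.append((mbox, ["sa-learn", flag, "--mbox", str(mbox)]))
    return jobs


def run_pending(feeder_dir=FEEDER_DIR, runner=subprocess.run):
    """Feed each pending mbox to sa-learn; empty it only if sa-learn succeeds."""
    for mbox, argv in pending_commands(feeder_dir):
        result = runner(argv)
        if result.returncode == 0:
            # Truncate the feeder file so it does not keep growing.
            mbox.write_text("")


if __name__ == "__main__":
    run_pending()
```

Run it from cron once an hour, for example with a crontab entry such as
"0 * * * * python3 /usr/local/bin/bayes_feed.py" (the script path is
hypothetical). Note that mail copied into a feeder file between the
sa-learn run and the truncation would be lost in this sketch; a real
job would probably rename the file before learning from it.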