Re: shifting the midpoint between the average spam and average ham

Joe Flowers 4 Sep 2004 20:16:04 -0000

My anti-spam system design went something like this (I integrated NetMail running on Novell NetWare with SpamAssassin running on SuSE or RedHat Linux):

1. To me, it seems like most of the "action" in SpamAssassin (by default) occurs around the Mail::SpamAssassin::PerMsgStatus::get_hits = 5.0 mark, since that is the dividing line between spam and ham.

2. From initial testing, it looked like a range of 0.0 to 10.0 is all I needed.

3. I didn't want to throw out any information in the scores. i.e., I didn't want to drop the tenths place in the scores, so I multiply the score by 10 to get a number from 0 to 100.

4. So, the header in each message that I am inserting looks something like this:

X-AddHeadr-CHASS: ++++++++++++ <NetMailMessageIDThatCanBeUsedToTrackMessagesThroughTheSystem>

5. I didn't want to deal with 0 pluses in my program, so 0.0 or anything less is remapped to 1 plus, which implies that 10.0 is remapped to 101 pluses, 5.0 is really 51 pluses, etc.

6. That's another reason why I didn't pick a range of say -5.0 to 15 instead of 0 to 10.0, because I didn't think I really needed or should add 200 pluses to email messages.

7. Once I've inserted the header, I let the message go back in the mail queue for the NetMail Rules daemon to do the actual sorting.

8. Initially, I have everyone's Rule set to 51 to correspond to the SA score of 5.0.

9. We have a web page with some explanation, of course, and allow the users to change their Spam number from 1 to 101.

10. Messages that have at least the user's set number of pluses are moved to the user's "MostlySpam" folder by the NetMail rules daemon.

11. I have worked hard to train the Bayes filter correctly, using the online documentation and the O'Reilly "SpamAssassin" book by Alan Schwartz. I've fed it a lot of spam and ham messages. I also have AWL (auto-whitelisting) enabled. I'm using SA v2.64. I'm using SA with all of its defaults.

12. In the meantime, along the lines of Pierre Thomson's idea of adding a number to the scores (because it will just shift the curves and not skew or distort them or have no effect on the averages, like multiplying will), I'll probably slide the midpoint between the average spam and average ham scores back to 5.0 by adding a number to the "get_hits" score of each message in one of my programs. That number would be based on some sort of Bayes training method: saving a large number of spam and ham messages, separating them, adding up all the pluses in the headers of the messages in each group, and dividing by the number of messages examined to get the spam and ham sample averages.
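
To make the arithmetic in steps 3 to 5 and step 12 concrete, here is a rough Python sketch. It is not the program described above; the function names are made up for illustration, and clamping scores above 10.0 to 101 pluses is an assumption (the steps above only say that 0.0 or less becomes 1 plus and that 10.0 becomes 101).

```python
from statistics import mean

MAX_PLUSES = 101  # 10.0 maps to 101 pluses (step 5)


def score_to_pluses(score: float) -> int:
    """Map an SA score to a plus count: <= 0.0 -> 1, 5.0 -> 51, 10.0 -> 101."""
    tenths = round(score * 10)  # keep the tenths place (step 3)
    # Assumption: anything above 10.0 is capped at 101 pluses.
    return max(1, min(MAX_PLUSES, tenths + 1))


def pluses_to_score(pluses: float) -> float:
    """Invert the mapping; also works on an average plus count."""
    return (pluses - 1) / 10


def format_header(score: float, message_id: str) -> str:
    """Build a header line shaped like the one in step 4."""
    return f"X-AddHeadr-CHASS: {'+' * score_to_pluses(score)} <{message_id}>"


def midpoint_correction(spam_pluses, ham_pluses, target=5.0):
    """Number to add to every score so that the midpoint between the
    average spam and average ham scores lands on `target` (step 12)."""
    spam_avg = pluses_to_score(mean(spam_pluses))
    ham_avg = pluses_to_score(mean(ham_pluses))
    midpoint = (spam_avg + ham_avg) / 2
    return target - midpoint


if __name__ == "__main__":
    # Hypothetical plus counts taken from hand-sorted sample messages.
    spam_sample = [61, 71]
    ham_sample = [1, 11]
    print(midpoint_correction(spam_sample, ham_sample))
```

Because the plus counts are clamped to the 1 to 101 range, averages computed from the headers are averages of the clamped scores, not of the raw "get_hits" values, so the correction is only as good as that approximation.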

*****At the least and in the future, it would be nice if SA had a system-wide option to add in a correction number to all of the "get_hits" scores instead of having to write it in a separate program/script somewhere.*****

*****More and in the future, it would be nice if SA was able to automatically/on-the-fly auto-correct the correction number to add to all of the "get_hits" scores.*****

Joe
