I'm going to gloss over some details of the mass-check and GA (genetic algorithm) process for SpamAssassin. This process is run every so often as part of SpamAssassin development; it isn't a realtime analysis. For example, there is currently a mass-check run going on to create the scores that will be used when 2.50 is released. The tools for this are in the masses sub-dir of the distribution if you want more information.
First, a number of contributors each create a large pile of hand-sorted real spam and real nonspam, called their corpus. Each corpus is analyzed to see which rules match the emails. As far as this process is concerned, each of the SPAM_PHRASE_XX_XX rules is a completely independent rule.
Later, the results of all the corpus analyses are run through a genetic algorithm, which assigns a set of scores to all the rules it can so as to optimize the correct placement of spam into the spam pile and nonspam into the nonspam pile.
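To make the idea concrete, here is a toy sketch in Python of fitting scores by evolutionary search. It is not the code in the masses directory; the names THRESHOLD, misclassified and evolve, the mutation scheme, and the tiny corpus format (a set of matched rule names plus a spam flag) are all assumptions made for illustration.

```python
import random

THRESHOLD = 5.0  # assumed spam threshold for this sketch


def misclassified(scores, corpus):
    """Count messages that land in the wrong pile under `scores`."""
    errors = 0
    for hits, is_spam in corpus:
        total = sum(scores[rule] for rule in hits)
        if (total >= THRESHOLD) != is_spam:
            errors += 1
    return errors


def evolve(rules, corpus, generations=200, population=20, seed=0):
    """Keep the better half of the candidates, mutate them, repeat."""
    rng = random.Random(seed)
    pool = [{r: rng.uniform(0.0, THRESHOLD) for r in rules}
            for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda s: misclassified(s, corpus))
        survivors = pool[: population // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            rule = rng.choice(rules)
            # Nudge one rule's score at random; scores stay non-negative here.
            child[rule] = max(0.0, child[rule] + rng.gauss(0.0, 1.0))
            children.append(child)
        pool = survivors + children
    return min(pool, key=lambda s: misclassified(s, corpus))


# Hypothetical corpus: rule "A" fires on spam, rule "B" also fires on nonspam.
corpus = [({"A"}, True), ({"B"}, False), ({"A", "B"}, True)]
best = evolve(["A", "B"], corpus)
print(best, misclassified(best, corpus))
```

The point is only that each rule's score is whatever value makes the whole corpus sort best, not something derived from the rule's name or position.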
Looking at the 2.43 scores, it is obvious how non-linear the patterns in real email are:
score SPAM_PHRASE_34_55 2.516
score SPAM_PHRASE_55_XX 0.505
score SPAM_PHRASE_21_34 1.856
score SPAM_PHRASE_13_21 1.337
score SPAM_PHRASE_08_13 1.385
score SPAM_PHRASE_05_08 1.640
score SPAM_PHRASE_03_05 1.084
score SPAM_PHRASE_00_01 0.781
score SPAM_PHRASE_02_03 0.758
score SPAM_PHRASE_01_02 0.500
So, for example, the low score of "SPAM_PHRASE_55_XX", even though it represents the highest number of hits against the spam-phrase test, is likely the result of really long reports which are not spam. Because they contain a lot of text, they are likely to rack up an artificially high number of phrase hits. Real spam is generally fairly short and is not likely to have that many hits anyway.
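In other words, the phrase total only selects which rule fires, and the score for that rule is then looked up, not calculated. A minimal Python sketch of that lookup follows, using the 2.43 scores above. The helper names parse_scores and bucket_for are made up, and the assumption that a rule named LO_HI covers LO <= hits < HI is my guess from the rule names, not something taken from the SpamAssassin source.

```python
SCORES_243 = """\
score SPAM_PHRASE_00_01 0.781
score SPAM_PHRASE_01_02 0.500
score SPAM_PHRASE_02_03 0.758
score SPAM_PHRASE_03_05 1.084
score SPAM_PHRASE_05_08 1.640
score SPAM_PHRASE_08_13 1.385
score SPAM_PHRASE_13_21 1.337
score SPAM_PHRASE_21_34 1.856
score SPAM_PHRASE_34_55 2.516
score SPAM_PHRASE_55_XX 0.505
"""


def parse_scores(text):
    """Read 'score NAME VALUE' lines into a {name: value} table."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "score":
            table[parts[1]] = float(parts[2])
    return table


def bucket_for(hits, table):
    """Return the rule whose range covers `hits` (assumes lo <= hits < hi)."""
    for name in table:
        lo, hi = name.rsplit("_", 2)[1:]
        upper = float("inf") if hi == "XX" else int(hi)
        if int(lo) <= hits < upper:
            return name
    return None


table = parse_scores(SCORES_243)
for hits in (4, 40, 60):
    rule = bucket_for(hits, table)
    print(hits, rule, table[rule])
```

Under that assumption, a message with 60 phrase hits gets a lower score than one with 4, which is exactly the non-linearity described above.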
At 08:23 AM 1/15/2003 -0500, Christopher Van Oosterhout wrote:
Greetings All,
I am new to SpamAssassin and hope that this question was not recently asked or covered by obvious documentation. If in fact there is documentation to cover this (I searched but did not yet find any), please point me to it.
I am trying to find the relationship between the "value" associated with specific phrases and the hits assigned to emails. My config file, which includes the list of specific phrases, has a number associated with each phrase, and a maximum total value assigned at the top line of this list. I am not sure just how this works, but I assume that the values of all the frequent phrases that are found are added up, then divided by the maximum total value. The result of the division is the number of hits assigned to the frequent phrase category.
Is my understanding correct? Could you send along any URLs for documentation that will cover this?
Thanks,
Christopher
-------------------------------------------------------
This SF.NET email is sponsored by: Take your first step towards giving your online business a competitive advantage. Test-drive a Thawte SSL certificate - our easy online guide will show you how. Click here to get started: http://ads.sourceforge.net/cgi-bin/redirect.pl?thaw0027en
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk