The scores are assigned by a genetic algorithm (GA). Essentially two piles of email are created, one of spam and one of nonspam. A SpamAssassin mass-check is run over both piles to generate one-line reports listing which rules each email matches. The GA then has the task of examining these rule-match sets and trying to assign scores that correctly categorize the most mail.
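To make that a bit more concrete, here is a toy sketch in Python of the kind of search involved. It is not SpamAssassin's actual GA. The rule names, the two tiny piles, the 5.0 threshold, the false-positive weight and the mutation settings are all made up for illustration.

```python
import random

THRESHOLD = 5.0   # assumed: mail scoring at or above this is tagged as spam
FP_WEIGHT = 10.0  # assumed: tagging one nonspam costs as much as missing ten spams

# Hypothetical mass-check results: the set of rule names each message matched.
SPAM = [{"BUY_NOW", "LONG_LINE"}, {"BUY_NOW"}, {"BUY_NOW", "FREE_OFFER"}, {"FREE_OFFER"}]
HAM = [{"LONG_LINE"}, {"LONG_LINE", "FREE_OFFER"}, {"LONG_LINE"}, set()]


def total(scores, hits):
    """Score of one message: the sum of the scores of the rules it matched."""
    return sum(scores[rule] for rule in hits)


def cost(scores):
    """Missed spam count plus heavily weighted false-positive count."""
    missed = sum(1 for hits in SPAM if total(scores, hits) < THRESHOLD)
    false_pos = sum(1 for hits in HAM if total(scores, hits) >= THRESHOLD)
    return missed + FP_WEIGHT * false_pos


def evolve(generations=200, pop_size=30, seed=0):
    """Very small evolutionary search: keep the best half, breed the rest."""
    rng = random.Random(seed)
    rules = sorted(set().union(*SPAM, *HAM))
    population = [{r: rng.uniform(-5, 5) for r in rules} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = rng.sample(survivors, 2)
            # Uniform crossover of the two parents plus a small random mutation.
            children.append({r: rng.choice((a[r], b[r])) + rng.gauss(0, 0.5) for r in rules})
        population = survivors + children
    return min(population, key=cost)
```

Calling evolve() returns the best score set found, and cost() tells you how many mistakes it still makes. On these piles, any score set with a cost of 0 has to give LONG_LINE a negative score: FREE_OFFER needs at least 5.0 on its own to catch the last spam, and the second nonspam matches both FREE_OFFER and LONG_LINE, so LONG_LINE has to pull that message back under the threshold.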
Often a rule which sounds like it should be a sign of spam gets a negative score. There are several causes of this. The score set for 2.40/2.41 seemed to be plagued with a lot of them. I know the devs have recently done a "pruning" of poor-performing rules and re-run the GA with much better results. I think the new scores will be in 2.42.

As for cases that tend to cause unexpected negative scores, here are a few I can think of:

1) Something you thought only spammers did is done by lots of nonspammers too. This is probably the case in FROM_HAS_MIXED_NUMS. All those [EMAIL PROTECTED] email addresses that people use for their personal chatter aren't spam. Idi0ts maybe, but a lot of people who aren't spammers have these as "disposable" addresses.

2) Something you think at a casual glance is a spam feature is also a feature of a few MUAs that spammers generally don't use. SUPERLONG_LINE is in this category, I think. Some spams match it, but some obscure MUAs do this to all emails (i.e. they send each paragraph as one single line). Also, most spam consists of lots of short one-line messages ("buy now!") without many lengthy paragraphs, while conversational emails tend to have very long paragraphs in them.

3) A typo or bug in a rule makes it match some common nonspam expression instead of the spam phrase. One such bug was an attempt to match "no credit" and some other common credit repair phrases, which also matched "notice: your credit card will be billed when your order is shipped". It wasn't requiring a space or word-break after the "no" part :)

4) Sometimes a rule gets "weighed down on" to correct a common, particularly high-scoring false-positive case. If there's a common set of rules causing FPs, generally the one with the fewest spam matches will wind up being pushed negative to compensate.

5) Some spam, or reports of spam, slip into the nonspam pile during evaluation.
Most of the time this is pretty low-impact, but if the rule doesn't have a lot of hits in general, a few misplaced emails can wildly swing the score. (To avoid this, the ratio of misplaced to correctly placed emails needs to be smaller than the degree to which the GA favors avoiding tagging nonspam at the expense of missing a little spam.)

6) Yes, there are some glitches in the GA itself, but those are getting better.

At 10:07 PM 9/26/2002 -0600, Danita Zanre wrote:
>I'm admittedly new to this stuff, so please bear with me. I just got a
>message with the following explanations:
>
>Trying to understand the "negative" values here - why would a line longer
>than 199 characters "decrease" the score? Also, why would the "From"
>lines having mixed numbers/no real name decrease the value?
>
>I realize I can change these values for myself if I choose, but I guess
>before I start messing with the values I'd like to understand the logic
>behind these settings.
>
>Thanks.
>
>Danita
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Spamassassin-talk mailing list
>[EMAIL PROTECTED]
>https://lists.sourceforge.net/lists/listinfo/spamassassin-talk