The scores are assigned by a genetic algorithm. Essentially, two piles of
email are created: one of spam, one of nonspam. A SpamAssassin mass-check
is run to generate a set of one-line reports showing which rules each email
in each pile matches. The GA then has the task of examining these rule-match
sets and trying to assign scores that correctly categorize the most mail.
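
To make the GA's task a bit more concrete, here is a rough sketch of the
kind of cost function it could be minimizing. This is a toy illustration
only, not SpamAssassin's actual code: the rule names, the four-email
corpus, the threshold of 5.0 and the fp_weight value are all assumptions
made up for the example, and the search over candidate score sets is left
out entirely.

```python
# Toy illustration only: the rule names, corpus entries, THRESHOLD and
# fp_weight are made up, and this is not SpamAssassin's scoring code.

THRESHOLD = 5.0  # assumed: mail scoring at or above this is tagged as spam


def total_score(rules_hit, scores):
    """Sum the scores of every rule an email matched."""
    return sum(scores.get(rule, 0.0) for rule in rules_hit)


def cost(scores, corpus, fp_weight=10.0):
    """Weighted count of misclassified mail; lower is better.

    corpus is a list of (is_spam, rules_hit) pairs, one per email,
    standing in for the one-line-per-email mass-check reports.
    """
    penalty = 0.0
    for is_spam, rules_hit in corpus:
        tagged = total_score(rules_hit, scores) >= THRESHOLD
        if tagged and not is_spam:
            penalty += fp_weight  # false positive: nonspam tagged as spam
        elif is_spam and not tagged:
            penalty += 1.0        # false negative: spam slipped through
    return penalty


corpus = [
    (True,  {"RULE_A", "RULE_B"}),
    (True,  {"RULE_A"}),
    (False, {"RULE_B", "RULE_C"}),
    (False, {"RULE_C"}),
]

# Two candidate score sets that differ only in the score given to RULE_C.
candidate_1 = {"RULE_A": 5.0, "RULE_B": 3.0, "RULE_C": 2.5}
candidate_2 = {"RULE_A": 5.0, "RULE_B": 3.0, "RULE_C": -1.0}

print(cost(candidate_1, corpus), cost(candidate_2, corpus))
```

With these made-up numbers the first candidate tags one nonspam email
(RULE_B plus RULE_C reaches 5.5), while the second avoids that by pushing
RULE_C negative, so a search that only cares about the cost would prefer
the second set regardless of what RULE_C "sounds like".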

Often a rule that sounds like it should be a sign of spam gets a negative
score. There are several causes of this. The score set for 2.40/2.41 seemed
to be plagued with a lot of them. I know the devs have recently done a
"pruning" of poor-performing rules and rerun the GA with much better
results. I think the new scores will be in 2.42.

As for cases that tend to cause unexpected negative scores, here are a few
I can think of:

1) Something you thought only spammers did is done by lots of nonspammers
too. This is probably the case with FROM_HAS_MIXED_NUMS. All those
[EMAIL PROTECTED] email addresses that people use for their personal
chatter aren't spam. Idi0ts maybe, but a lot of people who aren't spammers
keep these as "disposable" addresses.

2) Something that at a casual glance looks like a spam feature is also a
feature of a few MUAs that spammers generally don't use. SUPERLONG_LINE is
in this category, I think. Some spams match it, but some obscure MUAs also
do this to all emails (i.e., they send each paragraph as one single line).
Also, most spam consists of lots of short single-line messages ("buy now!")
without a lot of lengthy paragraphs, whereas conversational emails tend to
have very long paragraphs in them.

3) A typo or bug in a rule makes it match some common non-spam expression
instead of the spam phrase. One such bug was in a rule that tried to match
"no credit" and some other common credit repair phrases, but also matched
"notice: your credit card will be billed when your order is shipped". It
wasn't requiring a space or word break after the "no" part :)

4) Sometimes a rule gets "weighed down on" to correct a common,
particularly high-scoring false-positive case. If there's a common set of
rules causing FPs, generally the one with the fewest spam matches will wind
up being pushed negative to compensate.

5) Some spam, or reports of spam, slip into the nonspam pile during
evaluation. Most of the time this is pretty low-impact, but if the rule
doesn't have a lot of hits in general, a few misplaced emails can wildly
swing the score. (The ratio of misplaced to correctly placed emails needs
to be less than the degree to which the GA favors avoiding tagging nonspam
at the expense of missing a little spam.)

6) Yes, there are some glitches in the GA itself, but those are getting better.
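
For cause 3, the missing word break is easy to reproduce. The sketch below
uses Python's re module purely for illustration (SpamAssassin rules are
Perl regexes), and both patterns are hypothetical reconstructions; I'm not
quoting the actual rule, just showing the kind of mistake involved.

```python
import re

# Hypothetical patterns for illustration; the real rule is not quoted here.
# Buggy: "no" followed by up to 20 characters and then "credit".
buggy = re.compile(r"no.{0,20}credit", re.IGNORECASE)
# Fixed: require word boundaries around "no" so "notice" cannot match.
fixed = re.compile(r"\bno\b.{0,20}credit", re.IGNORECASE)

ham = "notice: your credit card will be billed when your order is shipped"
spam = "No credit check required!"  # made-up spam phrase

buggy_hits_ham = buggy.search(ham) is not None
fixed_hits_ham = fixed.search(ham) is not None
fixed_hits_spam = fixed.search(spam) is not None

print(buggy_hits_ham, fixed_hits_ham, fixed_hits_spam)
```

The buggy pattern fires on the order notice because the "no" in "notice"
is followed closely by "credit"; once a word boundary is required after
"no", the notice no longer matches but the spam phrase still does.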


At 10:07 PM 9/26/2002 -0600, Danita Zanre wrote:
>I'm admittedly new to this stuff, so please bear with me.  I just got a 
>message with the following explanations:
>
>Trying to understand the "negative" values here - why would a line longer 
>than 199 characters "decrease" the score?  Also, why would the "From" 
>lines having mixed numbers/no real name decrease the value?
>
>I realize I can change these values for myself if I choose, but I guess 
>before I start messing with the values I'd like to understand the logic 
>behind these settings.
>
>Thanks.
>
>Danita
>
>
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Spamassassin-talk mailing list
>[EMAIL PROTECTED]
>https://lists.sourceforge.net/lists/listinfo/spamassassin-talk



