Daniel Quinlan wrote:

DQ> If a positively-scored rule matches a spam, its goodness goes up by
DQ> its score, but if it matches a non-spam, it's goodness goes down by
DQ> its score.  The inverse is true for negatively-scored rules.  You can
DQ> weight false-positives if you want (I didn't in the below table).

The way the GA evaluates "goodness" is by trying to minimize:

    (number of false negatives)
  + weight * (number of false positives)
  + weight * log(sum of scores of false positives)
  - log(sum of scores of false negatives)

The log base is e (about 2.7, roughly half the threshold of 5.0), which is 
handy for scaling purposes.  The "weight" is the same in both places, and 
combines a "penalize false positives more heavily than false negatives" notion 
with a correction for the relative frequency of spam vs. nonspam in the corpus.

DQ> Some interesting points:
DQ> 
DQ> (1) I tried exempting all company-internal messages since that's what
DQ>     I do with my real mail (I don't exempt them when developing new
DQ>     rules, though, since some people might run internal mail through
DQ>     SA).  However, when I didn't exempt them, it changed the results
DQ>     for some negative tests, always in the same direction:
DQ> 
DQ>                              all messages         internal excluded
DQ>       FROM_AND_TO_SAME            bad                    good
DQ>       VERY_SUSP_CC_RECIPS         bad                    good
DQ>       VERY_SUSP_RECIPS            bad                    good
DQ>       X_NOT_PRESENT               bad                    good
DQ>       X_PRIORITY_HIGH             bad                    good
DQ> 
DQ>     A clearer way to put it is: the above tests don't work well if you
DQ>     run your local mail through them.  It makes sense too.

This does make sense.  Perhaps we ought to ship two sets of scores in the 
distribution, with docs suggesting that people not pass their internal mail 
through SA.  That requires more complex MTA configuration, though...

DQ> (3) Why are these negative?
DQ> 
DQ>     MAILTO_WITH_SUBJ
DQ>     HTML_WITH_BGCOLOR
DQ>     SLIGHTLY_UNSAFE_JAVASCRIPT
DQ>     OPPORTUNITY
DQ>     ALL_CAPS_HEADER
DQ>     [ and more ]
DQ> 
DQ>     Seems like they've earned negative scores even though they are
DQ>     clearly spam detectors.  Rules that aren't effective, but are not
DQ>     intended to detect legitimate mail should be scored at 0.0, not
DQ>     negatively.

Although they occur more often in spam than in nonspam, they generally correlate 
with other high-scoring rules, and so aren't needed in those cases to identify 
spam.  However, they probably also appear in nonspam, and would generate false 
positives if their scores were higher.  The GA abhors false positives.

DQ> (4) Some rules match so rarely, they might as well get deleted.  Maybe
DQ>     they match for someone else...

Yes, some match 0 times against my corpus.  However, the zero-match rules that 
are still in there are kept because they're definite signs that something is 
(more or less) guaranteed to be spam.

DQ> (5) Clearly, a few rules need to be fixed.  X_AUTH_WARNING especially.
DQ>     Even when I exempt internal mail, this one is still the worst of
DQ>     the lot.

Hmm, didn't notice when that rule went in -- Matt Cline seems to have inserted 
it, though there isn't much discussion on bugzilla #255 of the reason for the 
rules...  Matt?  Any comment?

DQ> These are sorted from worst to best.  Rules near the top need to be
DQ> reexamined, IMO.  Some may need to have their score sign-reversed.
DQ> Some just may need to be removed.  This is all very me-ish, of course.

I would recommend re-generating the list, but defining goodness in a more 
sophisticated way (possibly the way the GA does, possibly slightly differently) 
that takes false positives and false negatives into account, rather than purely 
whether a message is spam or not.  X_AUTH_WARNING could be a just-fine rule if 
it's merely giving lots of nonspams a score of 1.0 while pushing those 57 spams 
to 5.3!
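For instance, here's one hedged sketch of such a misclassification-based 
goodness metric (the function name, threshold handling, and FP weighting are 
all mine, not from any SA tool):

```python
def refined_goodness(score, matched, threshold=5.0, fp_weight=5.0):
    """Per-rule goodness based on misclassification, not raw counts.

    'matched' is a list of (is_spam, total_score) pairs, one per
    message the rule hit, where total_score is the message's final
    SpamAssassin score under the full ruleset.
    """
    g = 0.0
    for is_spam, total in matched:
        if is_spam:
            if total >= threshold:
                g += abs(score)           # helped catch a spam
            # spam still under threshold: rule tried; count as neutral
        elif total >= threshold:
            g -= fp_weight * abs(score)   # contributed to a false positive
        # matched nonspam that stayed under threshold: harmless, ignore
    return g
```

Under a metric like this, a 1.0-point rule that hits 1274 nonspams (none of 
which cross the threshold) but helps 57 spams reach 5.3 comes out positive, 
instead of looking like the worst rule in the table.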

C

DQ> Scores that are not GA-evolved have a "*" next to them.
DQ> 
DQ> Table:
DQ> 
DQ> rule                              goodness     spams non-spams
DQ> ------------------------------------------------------------------------
DQ> X_AUTH_WARNING                       -1217        57      1274
DQ> COPYRIGHT_CLAIMED                     -130        89         6

<chomp>

Also, you can look at Bugzilla #143, which has the "freqs" and "analysis" files 
for 2.20 -- the first is basically your file, but with counts of occurrence 
rather than sums of scores, and the latter identifies which rules are most 
often involved in misidentifications (false positives or negatives).

C


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk