Daniel Quinlan wrote:

DQ> If a positively-scored rule matches a spam, its goodness goes up by
DQ> its score, but if it matches a non-spam, its goodness goes down by
DQ> its score. The inverse is true for negatively-scored rules. You can
DQ> weight false positives if you want (I didn't in the table below).
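To make that metric concrete, here is a minimal Python sketch of the
per-rule goodness calculation, assuming per-rule hit counts from
hand-classified spam and non-spam corpora. The function name and inputs
are illustrative, not Daniel's actual script:

    # Minimal sketch of the per-rule "goodness" metric described above.
    # Hit counts would come from running each rule over hand-classified
    # spam and non-spam corpora; names here are illustrative.

    def rule_goodness(score, spam_hits, nonspam_hits, weight=1.0):
        """Positively-scored rules gain goodness for each spam hit and
        lose it for each non-spam hit; negatively-scored rules are the
        inverse.  `weight` optionally penalizes hits in the "wrong"
        corpus more heavily (Daniel used 1.0, i.e. no extra weighting)."""
        if score >= 0:
            return score * spam_hits - weight * score * nonspam_hits
        # For a negative rule, non-spam hits are the "correct" ones.
        return -score * nonspam_hits - weight * -score * spam_hits

    # X_AUTH_WARNING from the table below (57 spams, 1274 non-spams),
    # assuming its score is 1.0, as Craig's reply later suggests:
    print(rule_goodness(1.0, 57, 1274))  # -1217.0, matching the table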
The way the GA evaluates "goodness" is by trying to minimize:

    (number of false negatives)
    + (weight * number of false positives)
    + (weight * log(sum of scores of false positives))
    - log(sum of scores of false negatives)

The logs are natural, so the base e (about 2.718) is roughly half of
the 5.0 threshold, which is handy for scaling purposes. The "weight" is
the same in both places; it combines a "false positives are worse than
false negatives" notion with balancing for the relative frequency of
spam vs. nonspam in the corpus. (A rough code sketch of this objective
appears at the end of this reply.)

DQ> Some interesting points:
DQ>
DQ> (1) I tried exempting all company-internal messages, since that's what
DQ>     I do with my real mail (I don't exempt them when developing new
DQ>     rules, though, since some people might run internal mail through
DQ>     SA). However, when I didn't exempt them, it changed the results
DQ>     for some negative tests, always in the same direction:
DQ>
DQ>                              all messages    internal excluded
DQ>     FROM_AND_TO_SAME         bad             good
DQ>     VERY_SUSP_CC_RECIPS      bad             good
DQ>     VERY_SUSP_RECIPS         bad             good
DQ>     X_NOT_PRESENT            bad             good
DQ>     X_PRIORITY_HIGH          bad             good
DQ>
DQ> A clearer way to put it is: the above tests don't work well if you
DQ> run your local mail through them. It makes sense too.

This does make sense. Perhaps we ought to ship two sets of scores in
the distribution, with docs suggesting that people not pass their
internal mail through SA. That requires more complex configuration of
the MTA, though...

DQ> (3) Why are these negative?
DQ>
DQ>     MAILTO_WITH_SUBJ
DQ>     HTML_WITH_BGCOLOR
DQ>     SLIGHTLY_UNSAFE_JAVASCRIPT
DQ>     OPPORTUNITY
DQ>     ALL_CAPS_HEADER
DQ>     [ and more ]
DQ>
DQ> Seems like they've earned negative scores even though they are
DQ> clearly spam detectors. Rules that aren't effective, but are not
DQ> intended to detect legitimate mail, should be scored at 0.0, not
DQ> negatively.

Although they occur more often in spam than in nonspam, they generally
correlate with other high-scoring rules, so they aren't needed in those
cases to identify the spam. However, they probably also appear in some
nonspam, and would generate false positives if their scores were
higher. The GA abhors false positives.

DQ> (4) Some rules match so rarely, they might as well get deleted. Maybe
DQ>     they match for someone else...

Yes, some match zero times against my corpus. However, the ones that
match zero times and are still in there are kept because they are
definite signs that something is absolutely guaranteed to be spam (more
or less).

DQ> (5) Clearly, a few rules need to be fixed. X_AUTH_WARNING especially.
DQ>     Even when I exempt internal mail, this one is still the worst of
DQ>     the lot.

Hmm, I didn't notice when that rule went in -- Matt Cline seems to have
added it, though there isn't much discussion in bugzilla #255 of the
reasoning behind the rules... Matt? Any comment?

DQ> These are sorted from worst to best. Rules near the top need to be
DQ> reexamined, IMO. Some may need to have their score sign-reversed.
DQ> Some just may need to be removed. This is all very me-ish, of course.

I would recommend re-generating the list, but defining goodness in a
more sophisticated way (possibly the way the GA does, possibly slightly
differently) that takes into account false positives and false
negatives rather than purely whether a message is spam or not.
X_AUTH_WARNING could be a just-fine rule if it's merely pushing lots of
nonspams to a score of 1.0 while pushing those 57 spams to 5.3!

C
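As promised above, here is a rough Python sketch of that GA objective.
The function and its inputs are illustrative; this is not SA's actual
GA code:

    import math

    def ga_fitness(fp_scores, fn_scores, weight):
        """Quantity the GA tries to minimize.

        fp_scores: total scores of the false-positive messages
        fn_scores: total scores of the false-negative messages
        weight:    penalty making false positives cost more than false
                   negatives, and balancing the corpus spam/nonspam ratio
        Logs are natural, i.e. base e (~2.718, about half the 5.0
        threshold)."""
        fitness = len(fn_scores) + weight * len(fp_scores)
        if fp_scores:  # guard: log of an empty sum is undefined
            fitness += weight * math.log(sum(fp_scores))
        if fn_scores:
            fitness -= math.log(sum(fn_scores))
        return fitness

Note the subtracted log term: false negatives scoring just under the
5.0 threshold shrink the penalty, so near-misses cost less than spams
that score near zero.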
DQ> Scores that are not GA-evolved have a "*" next to them.
DQ>
DQ> Table:
DQ>
DQ> rule                  goodness    spams    non-spams
DQ> ----------------------------------------------------
DQ> X_AUTH_WARNING           -1217       57         1274
DQ> COPYRIGHT_CLAIMED         -130       89            6

<chomp>

Also, you can look at Bugzilla #143, which has the "freqs" and
"analysis" files for 2.20 -- the first is basically your file, but with
counts of occurrence rather than sums of scores, and the latter
identifies which rules are most often involved in false identifications
(positive or negative).

C
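For the curious, the kind of per-rule tally behind an "analysis"-style
file can be sketched roughly as below. The message-record format here
is hypothetical, not the actual mass-check output:

    from collections import Counter

    def misclassification_involvement(messages, threshold=5.0):
        """messages: iterable of (is_spam, total_score, rules_hit) tuples.
        Returns per-rule counts of involvement in false positives and
        false negatives."""
        fp_counts = Counter()  # rules firing on non-spam flagged as spam
        fn_counts = Counter()  # rules firing on spam that slipped through
        for is_spam, total_score, rules_hit in messages:
            flagged = total_score >= threshold
            if flagged and not is_spam:
                fp_counts.update(rules_hit)
            elif is_spam and not flagged:
                fn_counts.update(rules_hit)
        return fp_counts, fn_counts

    # Made-up example records:
    msgs = [
        (False, 6.2, ["X_AUTH_WARNING", "HTML_WITH_BGCOLOR"]),  # false pos
        (True,  3.1, ["OPPORTUNITY"]),                          # false neg
    ]
    fps, fns = misclassification_involvement(msgs)
    print(fps.most_common())  # rules most often implicated in false positives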