Michael C. Berch <[EMAIL PROTECTED]> writes:

> I came up with the name "Five-Card Charlie", which is a reference to the
> game of Blackjack, where under some rules the player wins if he has any
> hand of five cards and does not bust (exceed 21).  I figured if any
> message tripped 5 positive tests, the chances of it being non-spam were
> very small, so that would tip it over into the SPAM=yes category
>
> So if anyone has coded this up, I'd be happy to test it.  Otherwise I'll
> play around with the idea a bit.

This is very easy to test by seeing if you can use multiple matches to
reduce the number of false negatives (missed spam) without increasing
the number of false positives (caught nonspam).

Using my test corpus of 6146 messages (1322 spam, 4824 nonspam), let's
test some multiple matches on false negatives and false positives.

First we can test the false negatives on the spam.  The first column is
the number of rules matched, the second is the corresponding number of
false negatives (out of 149 total).

  rules  count
      0      1
      1      6
      2     19
      3     26
      4     35
      5     29
      6     11
      7     16
      8      6

Now, let's do the same on the non-spam.  We have to test every message
that was previously (and correctly) not caught, since we're trying to
avoid additional false positives.

  rules  count
      0   1299
      1   1720
      2   1342
      3    363
      4     79
      5     12
      6      2
      7      1
     10      1

So, for "five-card charlie" type rules, here are the ratios of spam to
non-spam.  The counts in columns 2 and 3 are the number of messages
matched.

  rules   spam  nonspam
      0      1     1299
      1      6     1720
      2     19     1342
      3     26      363
      4     35       79
      5     29       12
      6     11        2
      7     16        1
      8      6        0
      9      0        0
     10      0        1

For any number of rules under 5, far more non-spam than spam would be
caught.  For 5 and more, more spam than non-spam would be caught, but
the ratio is not especially good.

I suspect this is because the GA is working as designed.  Lots of very
small scores do not easily add up to a message being spam.

For 5 rules, the ratio is 29/(29+12) = 0.71, which is very roughly
comparable to a GA score of about -0.17 (I'm just taking the average of
similarly accurate scores).  In other words, the GA would probably not
rate it very highly, perhaps between -1 and 1.

Let's add 1 to the score of any message with 5 or more matched rules
and see how that affects things:

  additional caught spam    = 28
  additional caught nonspam = 2

For 6 rules, the ratio is 11/(11+2) = 0.85, which is very roughly
comparable to a GA score of about 0.14, so the score would be a bit
better.  Probably positive.

What about the effect of adding 1 to the score of messages that match 6
or more rules?

  additional caught spam    = 21
  additional caught nonspam = 1

Maybe it's an improvement, but I'm not so sure.  Previously, I had 5
false positives and 149 false negatives, a ratio of false positives to
false negatives of .0336.  Now, it went up to .0579 (5 or more rules)
and .0469 (6 or more rules).  That's bad.

I guess I'm trying to say that it doesn't seem like a "knock the ball
out of the park" type of rule.  For every spam that gets through with
lots of matched rules, you're ignoring many nonspams with a similar
number of matched rules that are intentionally let through.

My gut feeling is that effort is better spent improving the overall
performance of the GA and how we score rules that aren't GA-evolved
(like AWL and RBL rules).  I think second-guessing the GA is a waste of
time.

This is probably not much different from the effect of lowering the
required spam score from 5.0 to 4.5 or 4.0.

Dan
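
For anyone who wants to re-check the arithmetic in this message, here is
a minimal sketch in Python.  The counts are copied from the tables
above; the function and variable names are made up for illustration and
are not part of SpamAssassin or the GA.

```python
# Counts copied from the tables in this message: rules matched -> messages.
# missed_spam: spam that was NOT caught (the 149 false negatives).
# passed_nonspam: nonspam that was correctly not caught.
missed_spam = {0: 1, 1: 6, 2: 19, 3: 26, 4: 35, 5: 29, 6: 11, 7: 16, 8: 6}
passed_nonspam = {0: 1299, 1: 1720, 2: 1342, 3: 363, 4: 79,
                  5: 12, 6: 2, 7: 1, 10: 1}


def spam_fraction(rules):
    """Fraction of messages matching exactly `rules` rules that are spam.

    Returns None when no message matched that many rules.
    """
    spam = missed_spam.get(rules, 0)
    nonspam = passed_nonspam.get(rules, 0)
    total = spam + nonspam
    return spam / total if total else None


def fp_fn_ratio(extra_spam, extra_nonspam, base_fp=5, base_fn=149):
    """False positive / false negative ratio after a score bonus.

    extra_spam: additional spam caught (removes false negatives).
    extra_nonspam: additional nonspam caught (adds false positives).
    """
    return (base_fp + extra_nonspam) / (base_fn - extra_spam)


if __name__ == "__main__":
    for n in (5, 6):
        print(f"{n} rules: spam fraction {spam_fraction(n):.2f}")
    print(f"baseline:        {fp_fn_ratio(0, 0):.4f}")
    print(f"+1 for 5+ rules: {fp_fn_ratio(28, 2):.4f}")
    print(f"+1 for 6+ rules: {fp_fn_ratio(21, 1):.4f}")
```

Running it should reproduce the .0336, .0579 and .0469 ratios quoted
above.  Note that it only redoes the division; the "additional caught"
counts themselves came from re-scoring the corpus and are simply passed
in as inputs here.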