Michael C. Berch <[EMAIL PROTECTED]> writes:

> I came up with the name "Five-Card Charlie", which is a reference to the 
> game of Blackjack, where under some rules the player wins if he has any 
> hand of five cards and does not bust (exceed 21).   I figured if any 
> message tripped 5 positive tests, the chances of it being non-spam were 
> very small, so that would tip it over into the SPAM=yes category.
> 
> So if anyone has coded this up, I'd be happy to test it.  Otherwise I'll 
> play around with the idea a bit.

This is very easy to test by seeing if you can use multiple matches to
reduce the number of false negatives (missed spam) without increasing
the number of false positives (caught nonspam).

Using my test corpus of 6146 messages (1322 spam, 4824 nonspam), let's
test some multiple matches on false negatives and false positives.

First we can test the false negatives on the spam.  The first column
is the number of rules matched, the second is the corresponding number
of false negatives (out of 149 total).

    rules   count   
     0          1
     1          6
     2         19
     3         26
     4         35
     5         29
     6         11
     7         16
     8          6

Now, let's do the same on the non-spam.  We have to test every one
that was previously correctly not caught since we're trying to avoid
additional false positives.

    rules   count
     0       1299
     1       1720
     2       1342
     3        363
     4         79
     5         12
     6          2
     7          1
    10          1

So, for "five-card charlie" type rules, here are the ratios of spam to
non-spam.  The counts in columns 2 and 3 are the number of messages
matched.

    rules    spam    nonspam
     0          1       1299
     1          6       1720
     2         19       1342
     3         26        363
     4         35         79
     5         29         12
     6         11          2
     7         16          1
     8          6          0
     9          0          0
    10          0          1

For any number of rules under 5, far more non-spam than spam would be
caught.  For 5 and more, more spam than non-spam would be caught, but
the ratio is not especially good.  I suspect this is because the GA is
working as designed.  Lots of very small scores do not easily add up
to a message being spam.
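
The per-count ratios can be worked out directly from the table above.
Here is a minimal Python sketch with the counts copied from that
table; the names spam, nonspam, spam_fraction and at_least are just
for illustration and are not part of SpamAssassin.

```python
# Counts copied from the table above: rules matched -> number of messages.
spam = {0: 1, 1: 6, 2: 19, 3: 26, 4: 35, 5: 29,
        6: 11, 7: 16, 8: 6, 9: 0, 10: 0}
nonspam = {0: 1299, 1: 1720, 2: 1342, 3: 363, 4: 79, 5: 12,
           6: 2, 7: 1, 8: 0, 9: 0, 10: 1}


def spam_fraction(n):
    """Fraction of messages matching exactly n rules that are spam.

    Returns None when no message matched exactly n rules.
    """
    total = spam[n] + nonspam[n]
    return spam[n] / total if total else None


def at_least(counts, n):
    """Number of messages in counts that matched n or more rules."""
    return sum(c for k, c in counts.items() if k >= n)


for n in sorted(spam):
    frac = spam_fraction(n)
    shown = "n/a" if frac is None else f"{frac:.2f}"
    print(f"{n:5d} {spam[n]:8d} {nonspam[n]:8d} {shown:>8}")
```

By this tally, 62 spam and 16 nonspam messages match five or more
rules.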

For 5 rules, the ratio is 29/(29+12) = 0.71, which is very roughly
comparable to a GA score of about -0.17 (I'm just taking the average
of similarly accurate scores).  That means the GA would probably not
rate it very highly, perhaps between -1 and 1.

Let's add 1 to the score of any message with 5 or more matched rules
and see how that affects things:

    additional caught spam = 28
    additional caught nonspam = 2
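
For reference, the adjustment being tested amounts to something like
the following.  This is only a sketch in Python, not SpamAssassin
code: adjusted_score, is_spam, min_rules and bonus are hypothetical
names, and it assumes a message is flagged once its score reaches the
required score of 5.0.

```python
REQUIRED_SCORE = 5.0  # assumed threshold: flag a message at or above this


def adjusted_score(score, rules_matched, min_rules=5, bonus=1.0):
    """Add a flat bonus when a message matches at least min_rules rules."""
    if rules_matched >= min_rules:
        return score + bonus
    return score


def is_spam(score, rules_matched, min_rules=5, bonus=1.0):
    """Apply the bonus, then compare against the required score."""
    return adjusted_score(score, rules_matched, min_rules, bonus) >= REQUIRED_SCORE
```

With these defaults, a message scoring 4.2 on five matched rules is
flagged, while the same score on four matched rules is not.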

For 6 rules, the ratio is 11/(11+2) = 0.85, which is very roughly
comparable to a GA score of about 0.14, so the score would be a bit
better.  Probably positive.

What about the effect of adding 1 to the score of messages that match
6 or more rules?

    additional caught spam = 21
    additional caught nonspam = 1

Maybe it's an improvement, but I'm not so sure.  Previously, I had 5
false positives and 149 false negatives, a ratio of false positives to
false negatives of .0336.  With the bonus, that ratio went up to .0579
(5 or more rules) and .0469 (6 or more rules).  That's bad.
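
The arithmetic behind those ratios, as a short Python sketch (the
function names are made up for illustration): each additional caught
spam removes one false negative, and each additional caught nonspam
adds one false positive.

```python
BASELINE_FP = 5    # false positives before the change
BASELINE_FN = 149  # false negatives before the change


def fp_fn_ratio(false_pos, false_neg):
    """Ratio of false positives to false negatives."""
    return false_pos / false_neg


def ratio_after_bonus(extra_spam, extra_nonspam):
    """Ratio once the newly caught spam and nonspam are accounted for."""
    return fp_fn_ratio(BASELINE_FP + extra_nonspam, BASELINE_FN - extra_spam)


print(f"before:          {fp_fn_ratio(BASELINE_FP, BASELINE_FN):.4f}")  # 5/149
print(f"5 or more rules: {ratio_after_bonus(28, 2):.4f}")               # 7/121
print(f"6 or more rules: {ratio_after_bonus(21, 1):.4f}")               # 6/128
```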

I guess I'm trying to say that it doesn't seem like a "knock the ball
out of the park" type of rule.  For every spam that gets through with
lots of matched rules, you're ignoring many nonspams with a similar
number of matched rules that intentionally get through.

My gut feeling is that effort is better spent improving the overall
performance of the GA and how we score rules that aren't GA-evolved
(like AWL and RBL rules).  I think second-guessing the GA is a waste
of time.  This is probably not much different from the effect of
lowering the required spam score from 5.0 to 4.5 or 4.0.

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
