On Thursday, May 30, 2002, at 02:19 PM, Daniel Quinlan wrote:

> Michael C. Berch <[EMAIL PROTECTED]> writes:
>> I came up with the name "Five-Card Charlie", which is a reference to the
>> game of Blackjack, where under some rules the player wins if he has any
>> hand of five cards and does not bust (exceed 21). I figured if any
>> message tripped 5 positive tests, the chances of it being non-spam were
>> very small, so that would tip it over into the SPAM=yes category
>>
>> So if anyone has coded this up, I'd be happy to test it. Otherwise I'll
>> play around with the idea a bit.
>
> This is very easy to test by seeing if you can use multiple matches to
> reduce the number of false negatives (missed spam) without increasing
> the number of false positives (caught nonspam).
>
> Using my test corpus of 6146 messages (1322 spam, 4824 nonspam), let's
> test some multiple matches on false negatives and false positives.
Daniel,

Thanks very much for the testing. Very interesting analysis.

As a very concise summary, I would say that it looks like 5-card charlie is probably not a winner, but 6-card charlie might be, overall.

But for me, even 5-card charlie might be worth it. One thing that I've noticed in reading and paying attention to SAtalk is that the user community is not at all homogeneous, and that we have very different interests and values in running SA. For me, catching an additional 20 uncaught spams at the cost of 1 or even 2 false positives is a win. This is undoubtedly not the case for everyone, especially those running SA site-wide or at an ISP, where false positives are a support problem and also reduce user confidence.

As for me, I just have procmail stick all the stuff SA marks as spam into a mailbox on my server, then I look in on it daily or more often, scanning headers very quickly. I don't mind if a message or two gets stuck in there falsely, and I can reduce the chances of that by manual whitelisting (or AWL). But I *hate* it when stuff gets through, especially things that are slow to render in my MUA, e.g., image-rich messages, fancy HTML, and so forth. Killing them, even at the cost of a few extra FPs, is worth it to me. (And simply lowering the threshold score does not seem to be very effective at that, although I'm considering it, maybe from 5.0 to 4.6 after testing.)

I like to use my own recent archive as a corpus, since it more accurately reflects the spam ratio I actually see. Having had the same email address for about 12 years, I get a *lot* of spam.

--
Michael C. Berch
[EMAIL PROTECTED]

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
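
P.S. For anyone who wants to try the same comparison on their own corpus, here is a rough Python sketch of the "N-card charlie" idea: flag a message if it reaches the score threshold, or if it trips at least N positive tests regardless of score. This is not SpamAssassin code. The names Message, is_flagged and compare are made up for illustration, the messages in the usage example are invented, and I am assuming a message counts as spam when its score is at or above the threshold.

```python
from dataclasses import dataclass


@dataclass
class Message:
    score: float        # total score the filter assigned (hypothetical field)
    positive_hits: int  # number of positive-scoring tests that matched
    is_spam: bool       # hand-verified label from the corpus


def is_flagged(msg, threshold=5.0, min_hits=None):
    """Flag on score alone, or also on hit count when min_hits is set."""
    if msg.score >= threshold:
        return True
    return min_hits is not None and msg.positive_hits >= min_hits


def compare(corpus, threshold=5.0, min_hits=5):
    """Return (extra spam caught, extra false positives) from adding the rule."""
    extra_caught = 0
    extra_fp = 0
    for msg in corpus:
        before = is_flagged(msg, threshold)
        after = is_flagged(msg, threshold, min_hits)
        if after and not before:
            # Only messages the hit-count rule newly flags are counted.
            if msg.is_spam:
                extra_caught += 1
            else:
                extra_fp += 1
    return extra_caught, extra_fp


# Invented toy corpus, only to show the call; use your own labeled archive.
toy_corpus = [
    Message(score=6.2, positive_hits=7, is_spam=True),
    Message(score=4.1, positive_hits=5, is_spam=True),
    Message(score=3.0, positive_hits=2, is_spam=False),
    Message(score=4.4, positive_hits=6, is_spam=False),
    Message(score=2.0, positive_hits=5, is_spam=True),
]

for n in (5, 6):
    caught, fps = compare(toy_corpus, threshold=5.0, min_hits=n)
    print(f"{n}-card charlie: +{caught} spam caught, +{fps} false positives")
```

Whether a given N is a win then comes down to how many extra false positives you are willing to trade for each extra spam caught, which is exactly where individual users and site-wide installs will differ.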