On Fri, 08 Oct 2004 15:49:08 -0700, [EMAIL PROTECTED] (Justin Mason) writes:
> However, that doesn't take into account the situation where multiple
> rules are hitting mostly the same mail; for example, like this:
>
>          S1 S2 S3 S4 S5 H1 H2 H3 H4 H5
> RULE1:   x  x  x  x
> RULE2:   x  x  x  x
> RULE3:            x  x  x
> RULE4:               x
>
> Obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one
> should be removed, or (b) both should share half the score as equal
> contributors.  (b) is what the perceptron currently does.
>
> RULE3, by contrast, would be considered a lousy rule under our current
> scheme, because it hits ham 33% of the time; however, in this case it's
> actually quite informative to a certain extent, because it's hitting
> spam that the others cannot hit.
>
> RULE4 is even better than RULE3, because it's hitting the mail that
> RULE1 and RULE2 miss, yet it doesn't appear that good because:
>
> - it has a hit-rate half that of RULE3
> - it has a hit-rate 4 times lower than RULE1 and RULE2
>
> This is the kind of effect we do see now -- a lot of our rules are
> actually firing in combination, and some rules that hit e.g. 0.5% of
> spam are in effect more useful than some rules that hit 20%, because
> they're hitting the 0.5% of spam that *gets past* the other rules.
>
> So, what I'm looking for is a statistical method to measure this effect,
> and report
>
> - (a) that RULE1 and RULE2 overlap almost entirely

Cross entropy / information gain between the two rules. Cross entropy can
also identify whether one rule is redundant with respect to, e.g., two
different rules. I think it may be possible to create a formula akin to
CE / IG, but biased toward avoiding FPs.

> - (b) that RULE3 is worthwhile, because it can hit that 20% of the
>   messages the other rules cannot

Information gain of RULE3 over the set of email that the other rules miss.

> - (c) that RULE4 is better than RULE3 because it has a lower
>   false-positive rate
>
> So -- statisticians? any tips? ;)  (if anyone can fwd this on
> to their resident stats guy, that would be appreciated, too.)
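To make the suggestions concrete, here's a rough sketch (not SpamAssassin code) of both measurements over the toy RULE1-RULE4 table above: pairwise overlap between rules, and the information gain of a rule restricted to the messages the other rules miss. The exact column layout for RULE3/RULE4 is a plausible reading of the table, not taken from real data.

```python
from math import log2

# Toy hit matrix from the example: 5 spam (S1-S5), then 5 ham (H1-H5).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = spam, 0 = ham
rules = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],  # hits S4, S5, H1 (33% ham)
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # hits only S5
}

def entropy(ys):
    """Shannon entropy of a 0/1 label list."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(hits, ys):
    """Reduction in label entropy from splitting on whether the rule hit."""
    on  = [y for h, y in zip(hits, ys) if h]
    off = [y for h, y in zip(hits, ys) if not h]
    n = len(ys)
    return entropy(ys) - len(on) / n * entropy(on) - len(off) / n * entropy(off)

def overlap(a, b):
    """Jaccard overlap of two rules' hit sets (1.0 = identical coverage)."""
    both   = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# (a) RULE1 and RULE2 overlap entirely:
print(overlap(rules["RULE1"], rules["RULE2"]))        # 1.0

# (b) information gain of RULE3 on the residual set -- the messages
#     that neither RULE1 nor RULE2 hits:
residual = [i for i in range(len(labels))
            if not rules["RULE1"][i] and not rules["RULE2"][i]]
r3 = [rules["RULE3"][i] for i in residual]
ys = [labels[i] for i in residual]
print(round(info_gain(r3, ys), 3))                    # 0.317
```

On the residual set RULE3 has positive gain even though its global ham rate is 33%, which is exactly the effect described above.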
A google for 'cross entropy' and 'information gain' found this thread,
which looks to have a few citations:

http://www.mail-archive.com/perl-ai@perl.org/msg00127.html

This question also makes me think very strongly of decision tree
algorithms -- if a rule doesn't have a prominent place in the decision
tree, it probably doesn't contribute much.

Scott
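The decision-tree intuition can be sketched the same way: greedily build the spine of an ID3-style tree over the toy hit matrix, repeatedly picking the rule with the highest information gain and recursing on the mail it did not hit. A rule chosen near the root carries independent signal; a rule never chosen is largely redundant with the rules above it. This is an illustrative sketch, not SpamAssassin's actual scoring code.

```python
from math import log2

# Same toy data as in the quoted example (column layout assumed).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = spam, 0 = ham
rules = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}

def entropy(ys):
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def gain(hits, idx):
    """Information gain of a rule on the messages indexed by idx."""
    on  = [labels[i] for i in idx if hits[i]]
    off = [labels[i] for i in idx if not hits[i]]
    ys  = [labels[i] for i in idx]
    n = len(idx)
    return entropy(ys) - len(on) / n * entropy(on) - len(off) / n * entropy(off)

# Greedy "spine" of the tree: take the best rule, then re-rank on the
# messages it missed, until the remainder is pure or no rule helps.
idx, order = list(range(len(labels))), []
while idx and entropy([labels[i] for i in idx]) > 0:
    best = max(rules, key=lambda r: gain(rules[r], idx))
    if gain(rules[best], idx) <= 0:
        break
    order.append(best)
    idx = [i for i in idx if not rules[best][i]]

print(order)   # ['RULE1', 'RULE4']
```

RULE2 never appears (fully redundant with RULE1), and low-hit-rate RULE4 lands right below the root, since on the mail RULE1 misses it has the highest gain -- which matches the intuition that tree position, not raw hit rate, measures a rule's contribution.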