On Fri, 08 Oct 2004 15:49:08 -0700, [EMAIL PROTECTED] (Justin Mason) writes:
> However, that doesn't take into account the situation where multiple
> rules are hitting mostly the same mail; for example, like this:
>
>          S1 S2 S3 S4 S5 H1 H2 H3 H4 H5
> RULE1:   x  x  x  x
> RULE2:   x  x  x  x
> RULE3:            x  x  x
> RULE4:               x
>
> Obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one
> should be removed, or (b) both should share half the score as equal
> contributors.  (b) is what the perceptron currently does.
>
> RULE3, by contrast, would be considered a lousy rule under our current
> scheme, because it hits ham 33% of the time; however, in this case it's
> actually quite informative to a certain extent, because it's hitting
> spam that the others cannot hit.
>
> RULE4 is even better than RULE3, because it's hitting the mail that
> RULE1 and RULE2 miss, yet it doesn't appear that good because:
>
> - it has a hit-rate half that of RULE3
> - it has a hit-rate 4 times lower than RULE1 and RULE2
>
> This is the kind of effect we do see now -- a lot of our rules are
> actually firing in combination, and some rules that hit e.g. 0.5% of
> spam are in effect more useful than some rules that hit 20%, because
> they're hitting the 0.5% of spam that *gets past* the other rules.
>
> So, what I'm looking for is a statistical method to measure this effect,
> and report
>
> - (a) that RULE1 and RULE2 overlap almost entirely

Cross entropy / information gain between the two rules. Cross entropy can
also identify whether one rule is redundant with respect to, e.g., two
different rules. I think it may be possible to create a formula akin to
CE / IG, but biased toward avoiding FPs.

> - (b) that RULE3 is worthwhile, because it can hit that 20% of the
>   messages the other rules cannot

Information gain of RULE3 over the set of email that the other rules miss.

> - (c) that RULE4 is better than RULE3 because it has a lower
>   false-positive rate
>
> So -- statisticians? any tips? ;)  (if anyone can fwd this on
> to their resident stats guy, that would be appreciated, too.)
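To make the suggestions concrete, here's a rough sketch (not SpamAssassin code) of both measurements over the toy RULE1-RULE4 table above: pairwise overlap between rules, and the information gain of a rule restricted to the messages the other rules miss. The exact column layout for RULE3/RULE4 is a plausible reading of the table, not taken from real data.

```python
from math import log2

# Toy hit matrix from the example: 5 spam (S1-S5), then 5 ham (H1-H5).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = spam, 0 = ham
rules = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],  # hits S4, S5, H1 (33% ham)
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # hits only S5
}

def entropy(ys):
    """Shannon entropy of a 0/1 label list."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(hits, ys):
    """Reduction in label entropy from splitting on whether the rule hit."""
    on  = [y for h, y in zip(hits, ys) if h]
    off = [y for h, y in zip(hits, ys) if not h]
    n = len(ys)
    return entropy(ys) - len(on) / n * entropy(on) - len(off) / n * entropy(off)

def overlap(a, b):
    """Jaccard overlap of two rules' hit sets (1.0 = identical coverage)."""
    both   = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# (a) RULE1 and RULE2 overlap entirely:
print(overlap(rules["RULE1"], rules["RULE2"]))        # 1.0

# (b) information gain of RULE3 on the residual set -- the messages
#     that neither RULE1 nor RULE2 hits:
residual = [i for i in range(len(labels))
            if not rules["RULE1"][i] and not rules["RULE2"][i]]
r3 = [rules["RULE3"][i] for i in residual]
ys = [labels[i] for i in residual]
print(round(info_gain(r3, ys), 3))                    # 0.317
```

On the residual set RULE3 has positive gain even though its global ham rate is 33%, which is exactly the effect described above.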
A google for 'cross entropy' and 'information gain' found this thread,
which looks to have a few citations:

http://www.mail-archive.com/perl-ai@perl.org/msg00127.html

This question also makes me think very strongly of decision tree
algorithms -- if a rule doesn't have a prominent place in the decision
tree, it probably doesn't contribute much.

Scott
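The decision-tree intuition can be sketched the same way: greedily build the spine of an ID3-style tree over the toy hit matrix, repeatedly picking the rule with the highest information gain and recursing on the mail it did not hit. A rule chosen near the root carries independent signal; a rule never chosen is largely redundant with the rules above it. This is an illustrative sketch, not SpamAssassin's actual scoring code.

```python
from math import log2

# Same toy data as in the quoted example (column layout assumed).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = spam, 0 = ham
rules = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}

def entropy(ys):
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def gain(hits, idx):
    """Information gain of a rule on the messages indexed by idx."""
    on  = [labels[i] for i in idx if hits[i]]
    off = [labels[i] for i in idx if not hits[i]]
    ys  = [labels[i] for i in idx]
    n = len(idx)
    return entropy(ys) - len(on) / n * entropy(on) - len(off) / n * entropy(off)

# Greedy "spine" of the tree: take the best rule, then re-rank on the
# messages it missed, until the remainder is pure or no rule helps.
idx, order = list(range(len(labels))), []
while idx and entropy([labels[i] for i in idx]) > 0:
    best = max(rules, key=lambda r: gain(rules[r], idx))
    if gain(rules[best], idx) <= 0:
        break
    order.append(best)
    idx = [i for i in idx if not rules[best][i]]

print(order)   # ['RULE1', 'RULE4']
```

RULE2 never appears (fully redundant with RULE1), and low-hit-rate RULE4 lands right below the root, since on the mail RULE1 misses it has the highest gain -- which matches the intuition that tree position, not raw hit rate, measures a rule's contribution.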