Justin Mason wrote:

        S1 S2 S3 S4 S5 H1 H2 H3 H4 H5
RULE1: x x x x
RULE2: x x x x
RULE3: x x x
RULE4: x


(S1-S5 = 5 spam mails; H1-H5 = 5 ham/nonspam mails. "x" means a "hit"
by a rule, " " means no hit -- our rules are boolean.)


...

So, what I'm looking for is a statistical method to measure this effect,
and report

- (a) that RULE1 and RULE2 overlap almost entirely
- (b) that RULE3 is worthwhile, because it can hit that 20% of the
messages the other rules cannot
- (c) that RULE4 is better than RULE3 because it has a lower
false-positive rate


Well, it should be fairly straightforward to calculate correlation values between all of the rules, but I'm not sure how far that will take you.
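As a rough sketch of the correlation idea, the phi coefficient (Pearson correlation for two boolean variables) can be computed for every pair of rules. The names HITS and phi are made up for illustration, and the column placement of the hits is an assumption chosen to match points (a) to (c) in the quoted message, since the table does not pin down which messages each rule hits.

```python
from itertools import combinations
from math import sqrt

# Hypothetical hit matrix: columns are S1..S5 then H1..H5 (1 = hit).
# The exact column placement is an assumption, not taken from the post.
HITS = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}

def phi(a, b):
    """Phi coefficient between two equal-length boolean hit vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A rule that hits everything or nothing has no defined correlation.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

for r1, r2 in combinations(HITS, 2):
    print(f"{r1} vs {r2}: {phi(HITS[r1], HITS[r2]):+.2f}")
```

With this assumed layout, RULE1 and RULE2 come out at a correlation of 1.0, which flags the overlap in (a). Correlation alone says nothing about (b) or (c), because it does not distinguish spam columns from ham columns.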

I've got an idea of something which will give you a ruleset which:
 1 - Maximizes the amount of detected spam,
 2 - Minimizes false-positives, and
 3 - Contains the fewest rules possible.

But I'm not sure that's what you want either, because the algorithm would gravitate toward giving you a ruleset wherein each spam would be matched by a single rule in the set... which makes me uneasy.
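The message does not say which algorithm is meant. One way to approximate the three goals is a greedy set-cover pass, sketched below under that assumption. The names greedy_ruleset and fp_cost, the hit matrix, and the penalty of one point per ham hit are all hypothetical.

```python
# Hypothetical hit matrix: columns are S1..S5 then H1..H5 (1 = hit).
# The column placement is an assumption made for illustration.
HITS = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}
N_SPAM = 5  # the first five columns are spam

def greedy_ruleset(hits, n_spam, fp_cost=1.0):
    """Greedily pick rules that cover new spam, penalizing ham hits."""
    uncovered = set(range(n_spam))   # spam not yet hit by a chosen rule
    remaining = dict(hits)           # rules still available to pick
    chosen = []
    while uncovered and remaining:
        def gain(name):
            row = remaining[name]
            new_spam = sum(1 for i in uncovered if row[i])
            ham_hits = sum(row[n_spam:])
            return new_spam - fp_cost * ham_hits
        best = max(remaining, key=gain)  # ties go to the first rule listed
        if gain(best) <= 0:
            break                        # nothing left is worth adding
        chosen.append(best)
        uncovered -= {i for i in uncovered if remaining[best][i]}
        del remaining[best]
    return chosen

print(greedy_ruleset(HITS, N_SPAM))
```

On this assumed data the pass keeps RULE1 and RULE4 and drops RULE2 as redundant, so every spam ends up matched by exactly one rule. That is the behaviour described above as the uneasy part.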

Frankly, I'd rather have a set of rules which hit on every spam I receive (provided that they don't increase my false-positives), because doing so only sends the spam score of the spam messages higher... which widens the numerical gap between my ham and spam scores... which I regard as a good thing.
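To make the score-gap argument concrete, here is a small sketch comparing a minimal ruleset with a redundant one. It assumes every rule hit adds one point, which is a simplification of real per-rule scores, and the hit matrix and the name mean_gap are hypothetical.

```python
# Hypothetical hit matrix: columns are S1..S5 then H1..H5 (1 = hit).
# The column placement is an assumption made for illustration.
HITS = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}
N_SPAM = 5  # the first five columns are spam

def scores(rules):
    """Per-message score, assuming each rule hit is worth one point."""
    n_msgs = len(next(iter(HITS.values())))
    return [sum(HITS[r][i] for r in rules) for i in range(n_msgs)]

def mean_gap(rules):
    """Mean spam score minus mean ham score for the given ruleset."""
    s = scores(rules)
    spam, ham = s[:N_SPAM], s[N_SPAM:]
    return sum(spam) / len(spam) - sum(ham) / len(ham)

minimal = ["RULE1", "RULE4"]             # one rule per spam
redundant = ["RULE1", "RULE2", "RULE4"]  # overlapping rules, no ham hits

print("minimal:  ", mean_gap(minimal))
print("redundant:", mean_gap(redundant))
```

Under these assumptions, keeping the overlapping RULE2 raises the average spam score without touching any ham score, so the gap widens even though RULE2 detects no additional spam.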

- Joe


