On Mon, 12 Oct 2009 10:49:06 -0700 Ted Mittelstaedt <t...@ipinc.net> wrote:
> I think if you sit down and start trying to define examples
> and run them through large databases of spam and ham you
> will find that it doesn't work the way you think it does. That
> is what I was talking about when I said that statistical
> mathematics has parts that are non-intuitive.

I think what you are saying is that you tried it and it didn't work for you. That doesn't mean it can't be made to work - the basic principle is sound.

One way I think it might be done is to tokenize large corpora of ham and spam (mainly fraud), and look for token combinations that are very strong spam indicators. For example, I suspect the simple two-token combination of lottery+barrister is a pretty reliable indicator. Meta-rules would be an inefficient way of implementing it, though.

> The reason you probably think that "meta" rules work better
> is because you have created meta rules that are in reality,
> a grouping of useless rules with a useful rule. Thus, giving
> the illusion that "a rule that isn't scoring individually"
> actually is scoring when in a meta rule.

Sometimes meta-rules just make more sense: "paypal" and "yahoo" in the same From header is worth scoring; "paypal" or "yahoo" on its own isn't.
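
To make the token-combination idea concrete, here is a rough sketch of what I have in mind, in Python. It is only an illustration under my own assumptions: the function names, the thresholds and the four toy messages are all made up, and a real run would need proper ham and spam corpora and a smarter tokenizer. It counts how many messages each pair of tokens appears in together, then keeps the pairs that are common in spam and absent from ham.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic words, one entry per message.
    return set(re.findall(r"[a-z]+", text.lower()))


def pair_counts(messages):
    # Count the number of messages in which each unordered token pair co-occurs.
    counts = Counter()
    for msg in messages:
        tokens = sorted(tokenize(msg))  # sorted so each pair has one canonical form
        counts.update(combinations(tokens, 2))
    return counts


def strong_pairs(spam, ham, min_spam_hits=2, max_ham_hits=0):
    # Keep pairs seen in at least min_spam_hits spam messages
    # and in at most max_ham_hits ham messages (assumed thresholds).
    spam_pairs = pair_counts(spam)
    ham_pairs = pair_counts(ham)
    return sorted(
        pair
        for pair, hits in spam_pairs.items()
        if hits >= min_spam_hits and ham_pairs[pair] <= max_ham_hits
    )


# Toy data, invented for illustration only.
spam = [
    "You won the lottery, contact our barrister",
    "Barrister confirms your lottery prize",
]
ham = [
    "Office lottery pool this Friday",
    "My barrister sent the contract",
]

print(strong_pairs(spam, ham))
```

Note that in the toy data each of "lottery" and "barrister" also turns up in ham by itself, so neither would score well individually; only the combination separates the two sets. The number of pairs grows with the square of the tokens per message, so on a large corpus you would want to restrict the candidate tokens first.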