RW wrote:
On Fri, 09 Oct 2009 23:40:01 -0700
Ted Mittelstaedt <t...@ipinc.net> wrote:


I know that it seems like the idea of building up "meta" rules with
a lot of small rules will give you a more accurate hit rate, but
this is one of those non-intuitive things that statistical
mathematics can show: the concept won't work.  Or
rather, it won't work any better than the existing paradigm.

I think you just made that up. It clearly depends on the
circumstances.

No, it doesn't.

If two rules correlate strongly in spam and weakly
correlate or anti-correlate in ham, there's a case for creating a
meta rule.

That is true only for a mail message that meets the very specific
criteria matching those rules.  It's going to be overridden by
the law of averages, though.

In some cases it's possible to create useful meta-rules out
of rules that aren't worth scoring individually.

I think if you sit down and start trying to define examples
and run them through large databases of spam and ham you
will find that it doesn't work the way you think it does.  That
is what I was talking about when I said that statistical
mathematics has parts that are non-intuitive.
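One way to run that kind of check is to count how often each rule, and
the conjunction of the two, fires on each side of a labeled corpus. The
following is only a minimal sketch: the rule names RULE_A and RULE_B are
hypothetical, and the toy counts are invented purely to exercise the
code. They are not measurements, and whether real rules ever behave
this way is exactly the point in dispute; only a large real corpus can
answer that.

```python
# Hypothetical toy corpus: each entry is (is_spam, set of rule names that fired).
# RULE_A / RULE_B and all counts below are invented for illustration only.
corpus = (
    [(True, {"RULE_A", "RULE_B"})] * 40
    + [(True, {"RULE_A"})] * 10
    + [(True, {"RULE_B"})] * 10
    + [(True, set())] * 40
    + [(False, {"RULE_A", "RULE_B"})] * 5
    + [(False, {"RULE_A"})] * 45
    + [(False, {"RULE_B"})] * 45
    + [(False, set())] * 5
)


def hit_rate(corpus, is_spam, predicate):
    """Fraction of messages in one class for which predicate(hits) is true."""
    class_hits = [hits for spam, hits in corpus if spam == is_spam]
    return sum(1 for hits in class_hits if predicate(hits)) / len(class_hits)


def compare(corpus, a, b):
    """Return {test name: (spam hit rate, ham hit rate)} for a, b, and a && b."""
    tests = {
        a: lambda hits: a in hits,
        b: lambda hits: b in hits,
        f"{a} && {b}": lambda hits: a in hits and b in hits,
    }
    return {
        name: (hit_rate(corpus, True, p), hit_rate(corpus, False, p))
        for name, p in tests.items()
    }


results = compare(corpus, "RULE_A", "RULE_B")
for name, (spam_rate, ham_rate) in results.items():
    print(f"{name:20s} spam={spam_rate:.2f} ham={ham_rate:.2f}")
```

Swapping the toy list for hit data from a real ham and spam collection
would show whether a given conjunction separates the two classes any
better than its component rules do.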

Many people have tried doing exactly this with other kinds of
correlations.  You often see this in sports predictions, for
example - amateurs work out "systems" that look for these kinds of
false causalities and use the results to predict the winner of
the next Super Bowl, for example.  It might even work a few
times - but over the long run they fall down.

Fundamentally, a spam rule is either worth scoring or not.  If
it is completely useless - for example, an anti-spam rule that
assumes that any sender with an e-mail address shorter than 15
characters is more likely a spammer - then if you analyze the
times the rule triggers with a large volume of both ham and
spam, drawn from a wide variety of sources, you will find it
triggers equally on both ham and spam.

However if a rule does have some point-value, it's going to
ALWAYS trigger more on one side - either trigger more on the
ham side, or more on the spam side.

If it triggers more on the spam side then you calculate the
percentage of hits on each side and use that to assign a point value.

If it triggers more on the ham side then it's useful because
it can be scored to SUBTRACT from the point score.
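As a rough illustration of turning hit percentages into a signed point
value, here is a minimal sketch. The function name sketch_score, the
add-one smoothing, and the log-ratio formula are all assumptions chosen
for the example; this is not how SpamAssassin itself assigns scores.

```python
import math


def sketch_score(spam_hits, spam_total, ham_hits, ham_total, scale=1.0):
    """Hypothetical scoring sketch: signed log-ratio of per-class hit rates.

    Positive -> the rule fires more often on spam (adds points).
    Negative -> the rule fires more often on ham (subtracts points).
    Near zero -> the rule fires about equally on both and is not worth scoring.
    """
    # Add-one smoothing (an assumption) avoids division by zero and log(0)
    # when a rule never fires on one side.
    p_spam = (spam_hits + 1) / (spam_total + 2)
    p_ham = (ham_hits + 1) / (ham_total + 2)
    return scale * math.log(p_spam / p_ham)


# A rule that fires equally on both sides gets no points.
print(sketch_score(500, 1000, 500, 1000))
# A rule that leans toward spam gets a positive score.
print(sketch_score(400, 1000, 50, 1000))
# A rule that leans toward ham gets a negative score.
print(sketch_score(50, 1000, 400, 1000))
```

The sign of the result matches the three cases above: zero for a
useless rule, positive for a spam-side rule, negative for a ham-side
rule.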

The reason you probably think that "meta" rules work better
is that you have created meta rules that are, in reality,
a grouping of useless rules with a useful rule.  That gives
the illusion that "a rule that isn't scoring individually"
actually is scoring when in a meta rule.

Most of the focus in SA has been in the search for the "killer
rule" that will ALWAYS score on the spam side and NEVER score
on the ham side - because naturally, people want to believe that
content filtering is black-and-white and that there's somewhere
an elusive magic "thang" that separates the ham from the
spam.

But in reality, what is happening in the spam war is that as
time passes, the more easily recognizable spam is being eliminated
by the "low-hanging fruit" anti-spam rules that are being added -
and the spammers are adapting, by making their spam look less
and less like spam and more and more like ham.

One of these days the spam will be so indistinguishable from the
ham that the differences will only be detectable by computer
in corpora of thousands to tens of thousands of pieces of ham
and spam.  At that time, SA will hopefully be advanced enough to
keep up - because we will be approaching the complexity of
the rulesets used by the human brain to distinguish between
ham and spam.

Fun stuff!

Ted
