Marc Perkel wrote:
I've brought this idea up over the years but I'll try to explain it in a
different way. Maybe we can do this with a lot of meta rules.
What we need are rules that combine a lot of simple rules into concepts
and then combine those rules into rules that score - and score big. As
an example, let's take a standard Nigerian scam email.
From <> reply to:
[I don't know you] Dear stranger, I am Mr., Ms., Mrs., my name is
[I am connected] I am a soldier in Iraq, I am the daughter of an
African president, I work at a bank in Hong Kong
[I have money] I have the sum of 56 million dollars USD
[the money is hot] no beneficiaries, sneak it out of the country,
oppressive regime
[transfer to your account] splitting the funds, wire to your account
[I need your information] name, address, account number
[I want you to contact me] by email, phone
[keep this a secret] confidential discretion
So - we create a lot of simple rules, worth no points, that match key
words and phrases, and then combine these rules using meta rules to get
these concepts. That way we have meta rules like "they don't know me",
"they are talking about transferring millions", "they want my
information", "they are talking about hot money". Then you combine those
concepts into rules that can definitively determine it is spam.
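
A minimal Python sketch of the layering described above, not SpamAssassin
code. The concept names, phrase lists, threshold and score are all
hypothetical, picked from the example email for illustration. Phrases
score nothing on their own; a concept fires when any of its phrases
matches; the final rule scores big only when enough concepts co-occur.

```python
import re

# Hypothetical concepts, each built from zero-score phrase rules.
CONCEPTS = {
    "dont_know_you": [r"dear stranger", r"my name is"],
    "have_money": [r"million dollars", r"the sum of"],
    "hot_money": [r"no beneficiaries", r"out of the country",
                  r"oppressive regime"],
    "transfer": [r"splitting the funds", r"wire to your account"],
    "need_info": [r"account number", r"your address"],
    "secret": [r"confidential", r"discretion"],
}


def concepts_hit(text):
    """Return the names of concepts with at least one matching phrase."""
    return {
        name
        for name, patterns in CONCEPTS.items()
        if any(re.search(p, text, re.IGNORECASE) for p in patterns)
    }


def scam_score(text, min_concepts=4, big_score=10.0):
    """Score big only when enough distinct concepts appear together.

    min_concepts and big_score are assumed values, not tuned ones.
    """
    return big_score if len(concepts_hit(text)) >= min_concepts else 0.0


sample = ("Dear stranger, I have the sum of 56 million dollars with no "
          "beneficiaries. I propose splitting the funds; send your account "
          "number and keep this confidential.")

print(sorted(concepts_hit(sample)))
print(scam_score(sample))
print(scam_score("Please keep this confidential."))
```

A message that hits only one or two concepts, such as the last call,
scores nothing under this scheme.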
And - I am still looking for someone who might do Bayesian or some other
automatic system that looks for rule combinations and increases scores
based on that.
I know it seems like building up "meta" rules out of a lot of small
rules will give you a more accurate hit rate, but this is one of
those non-intuitive things where statistical mathematics shows
that the concept won't work. Or rather, it won't work any better
than the existing paradigm.
In other words, the current system of assigning little points to
a lot of little rules will yield the same result for any given
set of spam messages as organizing all these small rules into
groups that have bigger point values.
The only thing the grouping does is help humans understand what
is going on. This is because how humans think about math like
statistics is a lot different from how a computer works with it.
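
One way to see the equivalence, as a small Python sketch. It assumes
a group's score is simply the sum of the scores of its member rules
that hit; it does not cover other ways of combining rules. The rule
names, scores and groups are hypothetical.

```python
# Hypothetical per-rule scores, for illustration only.
RULE_SCORES = {
    "stranger": 0.5,
    "big_sum": 1.0,
    "no_beneficiaries": 1.0,
    "wire_transfer": 0.75,
    "account_number": 0.5,
    "confidential": 0.25,
}

# Hypothetical grouping of the same rules into named concepts.
GROUPS = {
    "have_hot_money": ["big_sum", "no_beneficiaries"],
    "wants_my_info": ["wire_transfer", "account_number"],
    "secretive_stranger": ["stranger", "confidential"],
}


def flat_total(hits):
    """Add up the little scores of every rule that hit."""
    return sum(RULE_SCORES[rule] for rule in hits)


def grouped_total(hits):
    """Score each group as the sum of its members that hit, then add groups."""
    total = 0.0
    for members in GROUPS.values():
        total += sum(RULE_SCORES[rule] for rule in members if rule in hits)
    return total


hits = {"big_sum", "no_beneficiaries", "account_number"}
print(flat_total(hits), grouped_total(hits))
```

Under that additive assumption the two totals agree for any set of
hits, since every rule belongs to exactly one group.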
Ted