Title: RE: SA needs a new paradigm for rule structure


Chris Santerre wrote:


> -----Original Message-----
> From: Ted Mittelstaedt [mailto:t...@ipinc.net]
> Sent: 2009-10-10 02:40
> To: Marc Perkel
> Cc: users@spamassassin.apache.org
> Subject: Re: SA needs a new paradigm for rule structure
>
>
> Marc Perkel wrote:
> > I've brought this idea up over the years but I'll try to explain it in a
> > different way. Maybe we can do this with a lot of meta rules.
> >
> > What we need are rules that combine a lot of simple rules into concepts
> > and then combine those rules into rules that score - and score big. As
> > an example, let's take a standard Nigerian scam email.
> >
> >  From <> reply to:
> >
> > [I don't know you] Dear stranger, I am mr, ms. mrs. my name is
> >
> > [I am connected] I am a soldier in Iraq, I am the daughter of an
> > African president, I work at a bank in Hong Kong
> >
> > [I have money] I have the sum of 56 million dollars USD
> >
> > [the money is hot] no beneficiaries, sneak it out of the country,
> > oppressive regime
> >
> > [transfer to your account] splitting the funds, wire to your account
> >
> > [I need your information] name, address, account number
> >
> > [I want you to contact me] by email, phone
> >
> > [keep this a secret] confidential discretion
> >
> > So - we create a lot of simple rules with no points with key words and
> > phrases and then combine these rules using meta rules to get these
> > concepts. That way we have meta rules like, "they don't know me" "they
> > are talking about transferring millions" "they want my information"
> > "they are talking about hot money". Then you combine those concepts into
> > rules that can definitively determine it is spam.
> >
> > And - I am still looking for someone who might do Bayesian or some other
> > automatic system that looks for rule combinations and increases scores
> > based on that.
> >
>
> I know that it seems like the idea of building up "meta" rules from
> a lot of small rules will give you a more accurate hit rate, but
> this is one of those non-intuitive cases where statistical
> mathematics can show that the concept won't work.  Or
> rather, it won't work any better than the existing paradigm.
>
> In other words, the current system of assigning little points to
> a lot of little rules will yield the same result for any given
> set of spam messages as organizing all these small rules into
> groups that have bigger point values.
>
> The only thing the organization does is help humans understand
> what is going on.  This is because how humans think about
> statistics is a lot different from how a computer
> works with statistics.
>
> Ted

I seem to remember from a few years back that Bayesian chains had a 10% increase in capture rate over straight Bayes rules. I would think that this is similar.

The problem with meta rules is that they can be fooled by a single change. Hit 4 out of 5 and you don't get the 7.0 score because the spammer changed one single thing. But with single rules at least those 4 things would have scored.

You would need to constantly tweak the meta rules.
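
To make the 4-out-of-5 problem concrete, here is a minimal Python sketch. It is a toy comparison, not SpamAssassin code: the rule names and the individual scores are invented for illustration, and only the 7.0 meta score comes from the example above. It compares plain additive scoring, an all-or-nothing meta rule, and a softer meta rule that fires on a count of hits.

```python
# Hypothetical sub-rules and scores, chosen only for illustration.
SUB_RULE_SCORES = {
    "STRANGER_GREETING": 1.0,
    "KEEP_SECRET": 1.0,
    "LARGE_SUM": 1.5,
    "HOT_MONEY": 1.5,
    "WANTS_ACCOUNT_INFO": 2.0,
}
META_SCORE = 7.0  # the "score big" value from the example above


def additive_score(hits):
    """Sum the individual scores of every sub-rule that hit."""
    return sum(SUB_RULE_SCORES[name] for name in set(hits))


def strict_meta_score(hits):
    """All-or-nothing meta rule: scores only if every sub-rule hit."""
    return META_SCORE if set(SUB_RULE_SCORES) <= set(hits) else 0.0


def threshold_meta_score(hits, needed=4):
    """Softer meta rule: scores if at least `needed` sub-rules hit."""
    count = len(set(hits) & set(SUB_RULE_SCORES))
    return META_SCORE if count >= needed else 0.0


all_five = list(SUB_RULE_SCORES)
four_of_five = all_five[:-1]  # the spammer dropped one phrase

for label, hits in (("5 of 5", all_five), ("4 of 5", four_of_five)):
    print(label,
          "additive:", additive_score(hits),
          "strict meta:", strict_meta_score(hits),
          "threshold meta:", threshold_meta_score(hits))
```

With one phrase changed, the strict meta rule contributes nothing while the additive rules still score the four remaining hits. SpamAssassin meta expressions can do arithmetic over rule hits, so a count-based test like the last function is one possible way to soften this, though the threshold would still need tuning.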

I like the idea, and have thought about it before. I understand Ted's point on the statistics. I think it can be made better, but not with current SA code. And I know the old quote from JM, "All code samples are always welcome." :-)  So I hope to one day get something written to try.

--Chris


I've always thought that a second Bayesian filter that would just look at rule hits would be worth trying. No message content, just rule combinations.
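
As a rough sketch of what that might look like, here is a small naive Bayes classifier in Python that is trained only on the names of the rules that hit. Everything in it is an assumption for illustration: the class name, the rule names, the add-one smoothing, and the choice to treat each pair of co-occurring rules as an extra feature so that combinations carry their own weight. It ignores rules that did not hit, and it is not how SpamAssassin's existing Bayes works.

```python
import math
from collections import Counter
from itertools import combinations


def features(rule_hits):
    """Single rule names plus every pair of rules that hit together."""
    rules = sorted(set(rule_hits))
    return set(rules) | {"+".join(pair) for pair in combinations(rules, 2)}


class RuleHitBayes:
    """Toy naive Bayes over rule hits only (no message content)."""

    def __init__(self):
        self.msg_counts = {"spam": 0, "ham": 0}
        self.feature_counts = {"spam": Counter(), "ham": Counter()}

    def train(self, rule_hits, label):
        """Record one message's rule hits under 'spam' or 'ham'."""
        self.msg_counts[label] += 1
        self.feature_counts[label].update(features(rule_hits))

    def spam_probability(self, rule_hits):
        """Estimate P(spam) from the features present in this message."""
        n_spam = self.msg_counts["spam"]
        n_ham = self.msg_counts["ham"]
        # Start from the smoothed class prior, then add per-feature evidence.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for feat in features(rule_hits):
            p_spam = (self.feature_counts["spam"][feat] + 1) / (n_spam + 2)
            p_ham = (self.feature_counts["ham"][feat] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up training data: the two rules appear together only in spam.
model = RuleHitBayes()
for _ in range(3):
    model.train(["LARGE_SUM", "HOT_MONEY"], "spam")
model.train(["LARGE_SUM"], "ham")
model.train(["HOT_MONEY"], "ham")
model.train([], "ham")

print(model.spam_probability(["LARGE_SUM", "HOT_MONEY"]))
print(model.spam_probability(["LARGE_SUM"]))
```

The pair features are what let the combination score higher than either rule alone. With many rules the number of pairs grows quickly, so a real attempt would probably need to prune rare combinations.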
