Re: SA needs a new paradigm for rule structure

Adam Katz Tue, 13 Oct 2009 16:19:44 -0700

Chris Santerre wrote:
>> I thought I remembered a few years back that Bayesian chains had a
>> 10% increase in capture rate over straight Bayes rules. I would
>> think that this is similar.

Marc Perkel wrote:
> I've always thought that a second Bayesian filter that would just
> look at rule hits would be worth trying. No message content, just
> rule combinations.

SA uses "naive Bayes" which mathematicians sometimes call "idiot
Bayes" for its simplicity.  No priors are used, so everything is
judged at the same level.

That said, SA's point-based system actually accomplishes this to a
degree, since rules are automatically re-scored based on their
likelihood of matching ham.  This is basically Bayesian, though
simplified into a point-based system rather than a probabilistic one,
presumably for computational simplicity.
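
To make the "basically Bayesian" comparison concrete, here is a
minimal sketch (not SA's actual scoring code) of how summing per-rule
points resembles a naive Bayes calculation when each rule's points are
a log-likelihood ratio.  The rule names, hit rates, and function names
are all made up for illustration, and only rules that hit contribute,
as in a point-based system.

```python
import math

# Hypothetical per-rule hit rates: (P(hit | spam), P(hit | ham)).
# These numbers are invented for illustration, not masscheck results.
RULE_RATES = {
    "RULE_A": (0.60, 0.01),
    "RULE_B": (0.30, 0.05),
    "RULE_C": (0.10, 0.20),
}


def rule_points(p_spam, p_ham):
    """Log-likelihood ratio: positive favors spam, negative favors ham."""
    return math.log(p_spam / p_ham)


def total_points(hits):
    """Sum the points of the rules that hit, like a point-based score."""
    return sum(rule_points(*RULE_RATES[name]) for name in hits)


def spam_probability(hits, prior_spam=0.5):
    """Turn summed points back into a probability (logistic function).

    Assumes rules fire independently given the class, which is the
    "naive" part.  Rules that did not hit are ignored here.
    """
    prior_log_odds = math.log(prior_spam / (1.0 - prior_spam))
    log_odds = prior_log_odds + total_points(hits)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    for hits in ([], ["RULE_A"], ["RULE_A", "RULE_C"]):
        print(hits, round(total_points(hits), 3),
              round(spam_probability(hits), 3))
```

Under these assumptions a fixed point threshold is just a fixed
probability threshold in disguise; the two differ mainly in how the
per-rule numbers are chosen.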

As to how much ground would be gained by migrating from a point-based
system to a probabilistic one (versus the added complexity), that's a
great question.  I'm sure ground would be gained, though I'm not sure
it would warrant the change, especially as it makes writing custom
rules quite difficult (they'd need to be thrown into the wild and
automatically scored, even though custom rules are often written to
address short-term problems).

I'm going to refer to the current bag-of-words Bayes db as
bayes-words and the SA rules-rescoring Bayes db as bayes-scores.

Now to really dig into the issue:  Typically, Bayes uses users'
evaluation of their own data to judge their own future data.  SA's
scoring system uses ~developers' evaluation of masscheck data to judge
third-party future data.  If we move to a Bayesian scoring system of
the rules themselves, whose input does it reflect, and is that wise?

My immediate thought is to use the best of everything.  Each
deployment could regularly pull the official masscheck bayes-scores db
and merge it with its own bayes-scores db, perhaps with some
user-configurable weight favoring the masscheck db (defaulting to 70%
masscheck).  The bayes-words dbs would be merged the same way, though
there the weight should favor local data (perhaps 30% masscheck).  The
bayes-scores result acts as a prior for the bayes-words test, and the
final outcome is derived from that.
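
A rough sketch of that pipeline, under loud assumptions: the merge is
shown as a weighted average of per-message probabilities rather than a
merge of the underlying dbs (which would be the more faithful
reading), and the bayes-words result is assumed to have been computed
with a neutral 0.5 prior so that its odds can stand in for a
likelihood ratio.  All function names and the clamping constant are
hypothetical.

```python
EPS = 1e-9  # clamp to keep the odds finite at p = 0 or p = 1


def merge(masscheck_p, local_p, masscheck_weight):
    """Weighted average of the masscheck and local spam probabilities."""
    return masscheck_weight * masscheck_p + (1.0 - masscheck_weight) * local_p


def to_odds(p):
    """Convert a probability into odds, clamped away from 0 and 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return p / (1.0 - p)


def combine_with_prior(prior_p, words_p):
    """Use prior_p (bayes-scores) as the prior for words_p (bayes-words).

    Assumes words_p came from a neutral 0.5 prior, so multiplying the
    two odds is a Bayesian update of the prior by the word evidence.
    """
    odds = to_odds(prior_p) * to_odds(words_p)
    return odds / (1.0 + odds)


def classify(mc_scores_p, local_scores_p, mc_words_p, local_words_p,
             scores_weight=0.7, words_weight=0.3):
    """Merge each pair, then let bayes-scores act as the prior."""
    scores_p = merge(mc_scores_p, local_scores_p, scores_weight)
    words_p = merge(mc_words_p, local_words_p, words_weight)
    return combine_with_prior(scores_p, words_p)


if __name__ == "__main__":
    # Invented inputs: masscheck and local disagree somewhat.
    print(classify(0.9, 0.5, 0.6, 0.8))
```

With a neutral bayes-scores result (0.5) the output is simply the
merged bayes-words probability, which is the behavior you'd want from
a prior.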

Rather than merging masscheck probabilities with local probabilities,
another option is to take four weighted passes.
