Re: [SAtalk] Bayesian attack

Ross Vandegrift Wed, 20 Nov 2002 19:15:38 -0800

On Wed, Nov 20, 2002 at 11:16:52AM -0500, Sean Redmond wrote:
> Maybe the weakness is in converting a probability to a score. Quoting 
> Paul Graham again:
> 
> <quote>But the real advantage of the Bayesian approach, of course, is 
> that you know what you're measuring. Feature-recognizing filters like 
> SpamAssassin assign a spam "score" to email. The Bayesian approach 
> assigns an actual probability. The problem with a "score" is that no one 
> knows what it means.


LOL!!!  As someone studying mathematical statistics this semester, this
is *complete* bullshit.  How do you think Bayesian statistics determines
the probability?  Scores.  It's just normalized to some value that
people think is meaningful.  We call that value "probability".  There's
no reason that SA scores couldn't be normalized to a probability - it
would just be pointless.

While someone in theory has the advantage of being able to calculate a
confidence interval with probability, I'd need to see a lot of work
before believing this is what the article meant.

> The user doesn't know what it means, but worse 
> still, neither does the developer of the filter. How many points should 
> an email get for having the word "sex" in it?

Let A be a domain of email.  Let B\subseteq A contain only spam and
C\subseteq A contain only non-spam.  Let X be a random variable that takes
value 0 if a given email is non-spam, 1 if it is spam.  Then we need to calculate
two values:

1) The mean \mu (also called E(X)).  This will be the mean score for all
emails in A.  Since SA scores are calculated relative to a certain corpus,
we can calculate this number (it's binomially distributed, by the way
I've formulated this).

2) The variance \sigma^2.  This is defined as E(X^2)-(E(X))^2.  For the
same reason as above we can calculate this number.
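
As a rough sketch of the two quantities above, assuming a tiny made-up
labeled corpus (the `labels` list is hypothetical: 1 for spam, 0 for
non-spam), both can be computed straight from the definitions:

```python
# Hypothetical labels for a small corpus: 1 = spam, 0 = non-spam.
labels = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]


def mean(values):
    """E(X): the arithmetic mean of the observed values."""
    return sum(values) / len(values)


def variance(values):
    """sigma^2 = E(X^2) - (E(X))^2, as defined in the text."""
    return mean([v * v for v in values]) - mean(values) ** 2


mu = mean(labels)
sigma2 = variance(labels)
```

For a 0/1 variable, X^2 = X, so the variance reduces to mu * (1 - mu).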

Then, by the central limit theorem, the scores of all emails \in A are
normally distributed, and can therefore be expressed in terms of the
standard normal distribution.  This would in turn allow us to calculate the
probability that an email is spam, given a scoring from SA.

If he claims an email should be 99.97% sure to be spam if it just has
the words "sexy" and "sex", then the SA score for "sex" can be calculated
from the inverse of the normal distribution.
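
A minimal sketch of that mapping, assuming scores really are normally
distributed.  The mean and standard deviation below are made-up
placeholders, not values from any real SA corpus, and reading the normal
CDF at a score as "the probability" is itself an assumption:

```python
from statistics import NormalDist

# Hypothetical mean and standard deviation of SA scores over a corpus.
SCORE_MEAN = 5.0
SCORE_STDDEV = 4.0

scores = NormalDist(mu=SCORE_MEAN, sigma=SCORE_STDDEV)


def score_to_probability(score):
    """Normal CDF: fraction of the assumed distribution at or below score."""
    return scores.cdf(score)


def probability_to_score(probability):
    """Inverse normal CDF: the score sitting at the given probability."""
    return scores.inv_cdf(probability)


# Score that would correspond to the 99.97% figure under these assumptions.
target_score = probability_to_score(0.9997)
```

The two functions are inverses of each other, so going from 99.97% to a
score and back returns 99.97%.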

\qed.

(Sorry I went crazy there - though the above does have some interesting
consequences.  We *could* assign a probability to emails based on the
score.  Personally I think it's pointless, but someone might like the
idea...)

> And Bayes' Rule, equally 
> unambiguous, says that an email containing both words would, in the 
> (unlikely) absence of any other evidence, have a 99.97% chance of being 
> a spam.</quote>
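
For reference, the quoted 99.97% can be reproduced with the usual
two-feature form of Bayes' rule.  This is only a sketch: the per-word
probabilities 0.97 and 0.99 do not appear in this message and are assumed
here because they give the quoted figure, and the formula further assumes
the two words are independent and that spam and non-spam are equally
likely beforehand.

```python
def combine(p_a, p_b):
    """Combine two per-word spam probabilities, assuming independence
    and equal prior odds of spam and non-spam."""
    spam = p_a * p_b
    ham = (1 - p_a) * (1 - p_b)
    return spam / (spam + ham)


# Assumed per-word probabilities (not taken from this message).
combined = combine(0.97, 0.99)
```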

Since when is Bayes's rule unambiguous?  For those who aren't familiar
with it, Bayes's rule needs the user to define a risk.  Guess what it is?
It's basically a score threshold, just like the one Graham lambasted
above.  This is why a lot of statisticians think Bayesian work is full of
it - it's absolutely dependent on a completely subjective element.
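
To make the "risk is a threshold" point concrete, here is a small sketch
of a two-class decision with user-chosen costs.  The function names and
the cost values in the usage note are hypothetical; the only thing assumed
is that the user picks a cost for a false positive and a cost for a false
negative, then flags mail whenever that gives the lower expected loss:

```python
def spam_threshold(cost_false_positive, cost_false_negative):
    """Probability above which flagging has the lower expected loss.

    Flagging costs (1 - p) * cost_false_positive on average; passing
    costs p * cost_false_negative.  Flagging wins when
    p > cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def is_spam(probability, cost_false_positive, cost_false_negative):
    """Flag the mail only if its probability exceeds the threshold."""
    return probability > spam_threshold(cost_false_positive,
                                        cost_false_negative)
```

With a false positive costed at 9 and a false negative at 1, the threshold
is 0.9; cost the false positive at 99 instead and the same email at 0.95
is no longer flagged.  The cutoff comes entirely from the chosen costs.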

I wish I had known all of this before the article came out.  In general, I
think his naive statistical approach has merit - but it's definitely not
Bayesian.  I also disagree that it's better than SA's filtering - I like
not having to constantly train my filter.  I like even more not having to
constantly train my users :-).

Both have their advantages and disadvantages.

-- 
Ross Vandegrift
[EMAIL PROTECTED]

A Pope has a Water Cannon.                               It is a Water Cannon.
He fires Holy-Water from it.                        It is a Holy-Water Cannon.
He Blesses it.                                 It is a Holy Holy-Water Cannon.
He Blesses the Hell out of it.          It is a Wholly Holy Holy-Water Cannon.
He has it pierced.                It is a Holey Wholly Holy Holy-Water Cannon.
He makes it official.      It is a Cannon Holey Wholly Holy Holy-Water Cannon.
Batman and Robin arrive.                                       He shoots them.

