On Mon, 22 Aug 2016 09:55:10 +1200
Sidney Markowitz wrote:

>  I'm one of those people he mentions who understands
> how Bayesian spam filtering works who has yet to wrap my head around
> what he is presenting. (For now I'm staying agnostic about it until I
> do understand it better.)

What it amounts to is:

Training: 

- tokenize a corpus of spam and ham 
- compile a list of tokens that occur only in spam and a list of
  tokens that occur only in ham

Classification:

- tokenize the email
- count how many of the tokens appear in each of the two lists
- compare the two counts
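The two phases above can be sketched in a few lines of Python (the naive
whitespace tokenizer and the toy corpora are mine, just for illustration):

```python
def tokenize(text):
    # Naive tokenizer standing in for whatever the real filter uses.
    return set(text.lower().split())

def train(spam_corpus, ham_corpus):
    """Return (spam_only, ham_only) token sets from the two corpora."""
    spam_tokens = set().union(*map(tokenize, spam_corpus))
    ham_tokens = set().union(*map(tokenize, ham_corpus))
    # Keep only tokens unique to one class; shared tokens are discarded.
    return spam_tokens - ham_tokens, ham_tokens - spam_tokens

def classify(email, spam_only, ham_only):
    """Spam iff more spam-only tokens than ham-only tokens."""
    tokens = tokenize(email)
    return "spam" if len(tokens & spam_only) > len(tokens & ham_only) else "ham"
```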


In Bayes, if you set Robinson's S parameter to 0, then tokens that
occur only in spam or only in ham get a token probability of exactly 1
or 0, respectively.

Tokens that have been seen in both spam and ham get a probability
strictly between 0 and 1, so if you then set MIN_PROB_STRENGTH to 0.5
you discard all of these.
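For concreteness, Robinson's adjustment is f(w) = (s*x + n*p(w)) / (s + n),
where n is how often the token has been seen, p(w) is the raw spam ratio, and
x is the prior. A sketch of the two knobs (the function names and the use of
the raw ratio for p(w) are my simplification, not SpamAssassin's actual code):

```python
def token_prob(spam_count, ham_count, s=0.0, x=0.5):
    """Robinson's degree-of-belief adjustment: with s > 0 this pulls
    rarely-seen tokens toward the prior x; with s = 0 it is just the
    raw ratio, so spam-only tokens get 1.0 and ham-only tokens 0.0."""
    n = spam_count + ham_count
    p = spam_count / n
    return (s * x + n * p) / (s + n)

MIN_PROB_STRENGTH = 0.5

def is_strong(prob):
    # With the threshold at 0.5, only probabilities of exactly 0 or 1 survive.
    return abs(prob - 0.5) >= MIN_PROB_STRENGTH
```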

All of the remaining tokens have probabilities of 0 or 1, so running
them through the chi-squared calculation (or any sensible symmetric
combining algorithm) and then comparing the result to 0.5 gives the
same result as comparing the counts of spam-only and ham-only tokens.
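To see that, here's a rough sketch of chi-squared combining in the
SpamBayes/Robinson style; clamping the "certain" probabilities slightly
away from exact 0 and 1 is my addition, needed to keep the logs finite:

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Chi-squared combining: returns a score in [0, 1]; > 0.5 leans spam."""
    n = len(probs)
    # -2 * ln of the products, accumulated as sums of logs.
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(spam_stat, 2 * n)
    H = 1.0 - chi2Q(ham_stat, 2 * n)
    return (S - H + 1.0) / 2.0

CLAMP = 1e-4  # keeps ln(p) and ln(1-p) finite for the "certain" tokens

def score(spam_only, ham_only):
    """Combine spam_only tokens at prob ~1 and ham_only tokens at prob ~0."""
    probs = [1.0 - CLAMP] * spam_only + [CLAMP] * ham_only
    return chi_combine(probs)
```

With only (near-)0 and (near-)1 probabilities, the score lands above 0.5
exactly when the spam-only count exceeds the ham-only count, and at 0.5
when they tie, which is the count comparison in disguise.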

In short, it's mathematically equivalent to Bayes with different
tokenization and different constants; and on the face of it,
the values of S and MIN_PROB_STRENGTH are very sub-optimal.

OTOH it wouldn't surprise me if the tokenization is much better.
