On Mon, 22 Aug 2016 09:55:10 +1200 Sidney Markowitz wrote:
> I'm one of those people he mentions who understands how Bayesian
> spam filtering works who has yet to wrap my head around what he is
> presenting (For now I'm staying agnostic about it until I do
> understand it better).

What it amounts to is:

Training:
- tokenize a corpus of spam and ham
- compile a list of tokens that occur only in spam and a list of
  tokens that occur only in ham

Classification:
- tokenize the email
- count how many of its tokens are in each of the two lists
- compare the two counts

In Bayes, if you set Robinson's S parameter to 0, then tokens that
occur only in spam or only in ham get a token probability of exactly 1
or 0 respectively. Tokens that have been seen in both spam and ham get
a probability strictly between 0 and 1, so if you then set
MIN_PROB_STRENGTH to 0.5 you discard all of those. All of the
remaining tokens have probabilities of 0 or 1, so running them through
the chi-squared calculation (or any sensible symmetric combining
algorithm) and then comparing the result to 0.5 gives the same result
as comparing the number of spam-only tokens to the number of ham-only
tokens.

In short, it's mathematically equivalent to Bayes with different
tokenization and different constants; and on the face of it the values
of S and MIN_PROB_STRENGTH are very sub-optimal. OTOH it wouldn't
surprise me if the tokenization is much better.
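The training/classification procedure above can be sketched in a few lines of Python. The tokenizer here (lowercase, split on whitespace) is a placeholder assumption; the scheme being described presumably uses something more elaborate, and the tie-breaking rule (ties go to ham) is my choice, not specified in the description:

```python
def tokenize(text):
    """Placeholder tokenizer: lowercase, split on whitespace."""
    return set(text.lower().split())

def train(spam_corpus, ham_corpus):
    """Return (spam_only, ham_only): tokens seen exclusively in one class."""
    spam_tokens = set()
    ham_tokens = set()
    for msg in spam_corpus:
        spam_tokens |= tokenize(msg)
    for msg in ham_corpus:
        ham_tokens |= tokenize(msg)
    return spam_tokens - ham_tokens, ham_tokens - spam_tokens

def classify(msg, spam_only, ham_only):
    """Count spam-only vs ham-only tokens and compare the two counts.

    Ties are resolved in favour of ham (an arbitrary choice here).
    """
    tokens = tokenize(msg)
    spam_hits = len(tokens & spam_only)
    ham_hits = len(tokens & ham_only)
    return "spam" if spam_hits > ham_hits else "ham"
```

Tokens seen in both classes fall out of both lists at training time, which is exactly the role MIN_PROB_STRENGTH plays in the Bayes formulation below.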