On Wed, 20 Jan 2016 12:11:02 -0800
Marc Perkel wrote:

> Again - it's not about matching as Bayes does. It's about not
> matching.
>
> In the subject line of the message the phrase "method for blocking
> spam" makes the message ham. Spammers never use the phrase "method
> for blocking spam". No other tests needed. My system result 100% ham.
> To bayes it's just some words.
It is to Bayes, but most statistical filters do use phrases as tokens.

> What makes it ham is what doesn't match, not what does.

Right, but it's not about the count of phrases that don't match anything.
You use phrases that occur in spam but not in ham, and vice versa, so it
is about matching too.

What you are doing is equivalent to a statistical filter doing multiword
tokenization, dropping the tokens that appear in both spam and ham, and
then simply counting the spammy and hammy tokens to produce a result. A
filter like bogofilter can do exactly this if you turn on multiword
tokenization and configure it with some very sub-optimal parameters.
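
To make the equivalence concrete, here is a minimal Python sketch of the
scheme just described. It is only an illustration: it is not bogofilter's
code and not your actual system. The function names (phrases, train,
classify), the 4-word phrase limit, and the plain whitespace tokenization
are all assumptions of mine.

```python
from typing import Iterable, Set, Tuple


def phrases(text: str, max_words: int = 4) -> Set[str]:
    """Multiword tokenization: every run of 1..max_words consecutive words.

    Assumption: lowercase + whitespace splitting stands in for a real tokenizer.
    """
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def train(spam: Iterable[str], ham: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Collect tokens per corpus, then drop those seen in both spam and ham."""
    spam_seen: Set[str] = set()
    for message in spam:
        spam_seen |= phrases(message)
    ham_seen: Set[str] = set()
    for message in ham:
        ham_seen |= phrases(message)
    # What survives is "occurs in spam but never in ham" and vice versa.
    return spam_seen - ham_seen, ham_seen - spam_seen


def classify(text: str, spam_only: Set[str], ham_only: Set[str]) -> str:
    """Count spam-only and ham-only tokens in the message; the larger count wins."""
    tokens = phrases(text)
    spammy = len(tokens & spam_only)
    hammy = len(tokens & ham_only)
    if spammy > hammy:
        return "spam"
    if hammy > spammy:
        return "ham"
    return "unsure"
```

Trained on a ham corpus containing "method for blocking spam" and a spam
corpus that never uses it, classify() calls a message with that subject
ham purely because those phrases matched the ham-only set. There is no
probability weighting anywhere, which is why I call it a statistical
filter with sub-optimal parameters.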