On 20.01.2016 at 21:11, Marc Perkel wrote:
On 01/20/16 12:05, RW wrote:

On 01/20/16 10:26, Shawn Bakhtiar wrote:

Sorry.. how is this different than Naive Bayes filtering??

On Wed, 20 Jan 2016 10:52:58 -0800 Marc Perkel wrote:

Yes - you missed something. It is about intersecting one corpus and NOT intersecting the other. This is about what doesn't match - not what does.

What you are doing is a special case of an ordinary Bayesian filter. If you remove Robinson's correction for low-count tokens, or adjust the Robinson parameters so it has no effect, you end up with tokens that only occur in spam having a probability of 1, tokens that only occur in ham having a probability of 0, and tokens that occur in both having a probability in between. If you set a cut-off of 0.499999... you leave only the pure tokens behind. And because all the probabilities are 0 or 1, the chi-squared test reduces to comparing the number of spammy and hammy tokens, just as you are doing.

Your multi-word tokenization is exactly the same as in Bogofilter, and most of what you are doing can be done in Bogofilter with a few lines in the configuration file. Any value in your scheme must be in the selection of what you tokenize. The rest is likely holding it back.

Again - it's not about matching as Bayes does. It's about not matching. In the subject line of the message the phrase "method for blocking spam" makes the message ham. Spammers never use the phrase "method for blocking spam". No other tests needed. My system's result: 100% ham. To Bayes it's just some words. What makes it ham is what doesn't match, not what does.
"Spammers never use the phrase" is pure bullshit - sorry, no way to express it nicer!
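
For reference, RW's reduction argument in the quoted thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bogofilter's code and not Marc's system: it uses single whitespace-separated words as tokens (the thread discusses multi-word tokens), computes an uncorrected per-token probability s / (s + h) from raw message counts, and the names `token_probs`, `classify`, and `min_dev` are made up for this example.

```python
from collections import Counter


def token_probs(spam_msgs, ham_msgs):
    """Uncorrected per-token spam probability p = s / (s + h).

    s and h are the numbers of spam and ham messages containing the token.
    No low-count correction is applied, so a token seen only in spam gets
    exactly 1.0 and a token seen only in ham gets exactly 0.0.
    """
    spam = Counter(t for m in spam_msgs for t in set(m.split()))
    ham = Counter(t for m in ham_msgs for t in set(m.split()))
    return {t: spam[t] / (spam[t] + ham[t]) for t in spam.keys() | ham.keys()}


def classify(message, probs, min_dev=0.499999):
    """Drop tokens closer than min_dev to 0.5, then count what is left.

    With min_dev this high, only near-pure tokens survive the cut-off, so
    the decision comes down to comparing spam-only and ham-only token counts.
    """
    kept = [probs[t] for t in set(message.split())
            if t in probs and abs(probs[t] - 0.5) >= min_dev]
    spammy = sum(1 for p in kept if p > 0.5)
    hammy = sum(1 for p in kept if p < 0.5)
    if spammy > hammy:
        return "spam"
    if hammy > spammy:
        return "ham"
    return "unsure"


# Toy corpora, invented for this example only.
spam_msgs = ["buy cheap pills now", "cheap pills online now"]
ham_msgs = ["method for blocking spam", "meeting notes for now"]
probs = token_probs(spam_msgs, ham_msgs)

print(classify("new method for blocking spam", probs))
print(classify("cheap pills now", probs))
```

In this toy setup the word "now" occurs in both corpora, gets a probability of 2/3, and is discarded by the cut-off; only the tokens unique to one corpus take part in the decision, which is the "what doesn't match" comparison described above.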