On 01/20/16 12:14, Reindl Harald wrote:


On 20.01.2016 at 21:11, Marc Perkel wrote:

On 01/20/16 12:05, RW wrote:
On 01/20/16 10:26, Shawn Bakhtiar wrote:
Sorry.. how is this different than Naive Bayes filtering??
On Wed, 20 Jan 2016 10:52:58 -0800
Marc Perkel wrote:

Yes - you missed something. It is about intersecting one corpus and
NOT intersecting the other.

This is about what doesn't match - not what does.

What you are doing is a special case of an ordinary Bayesian filter. If
you remove Robinson's correction for low-count tokens, or adjust the
Robinson parameters so it has no effect, you end up with tokens that
only occur in spam having a probability of 1, tokens that only occur
in ham having a probability of 0 and tokens that occur in both having a
probability in-between. If you set a cut-off of 0.499999... you leave
only the pure tokens behind. And because all the probabilities are 0 or
1 the chi-squared test reduces to comparing the number of spammy and
hammy tokens just as you are doing.
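
To make that reduction concrete, here is a minimal Python sketch. It assumes Robinson's usual f(w) = (s*x + n*p(w)) / (s + n) with the strength s set to 0, and a cut-off on the distance from 0.5. The names (raw_prob, robinson, min_dev, count_pure_tokens) and the toy counts are illustrative only and are not taken from Bogofilter or from Marc's system.

```python
def raw_prob(spam_count, ham_count, n_spam, n_ham):
    """Raw spamminess p(w) from per-corpus token frequencies."""
    s = spam_count / n_spam
    h = ham_count / n_ham
    return s / (s + h)


def robinson(p, n, s=0.0, x=0.5):
    """Robinson's f(w) = (s*x + n*p) / (s + n). With s = 0 this is just p."""
    return (s * x + n * p) / (s + n)


def count_pure_tokens(tokens, counts, n_spam, n_ham, min_dev=0.499999):
    """Count tokens that survive the cut-off, split into spammy and hammy.

    counts maps token -> (messages in spam containing it,
                          messages in ham containing it).
    """
    spammy = hammy = 0
    for tok in set(tokens):
        spam_count, ham_count = counts.get(tok, (0, 0))
        n = spam_count + ham_count
        if n == 0:
            continue  # never seen in either corpus
        f = robinson(raw_prob(spam_count, ham_count, n_spam, n_ham), n)
        if abs(f - 0.5) < min_dev:
            continue  # token occurs in both corpora, dropped by the cut-off
        if f > 0.5:
            spammy += 1
        else:
            hammy += 1
    return spammy, hammy


# Toy data (made up): 100 spam and 100 ham messages.
counts = {"viagra": (50, 0), "meeting": (0, 40), "free": (30, 10)}
result = count_pure_tokens(
    ["viagra", "free", "meeting", "meeting", "unknown"], counts, 100, 100
)
print(result)  # (1, 1): "free" is mixed and dropped, "unknown" is unseen
```

Only the tokens with a probability of exactly 0 or 1 get past the cut-off, so all that is left to compare is how many of each kind the message contains.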

Your multi-word tokenization is exactly the same as in Bogofilter and
most of what you are doing can be done in Bogofilter with a few lines
in the configuration file.

Any value in your scheme must be in the selection of what you
tokenize. The rest is likely holding it back.


Again - it's not about matching as Bayes does. It's about not matching.

In the subject line of the message the phrase "method for blocking spam"
makes the message ham. Spammers never use the phrase "method for
blocking spam". No other tests needed. My system's result: 100% ham. To
Bayes it's just some words.

What makes it ham is what doesn't match, not what does

"Spammers never use the phrase" is pure bullshit - sorry, no way to express it nicer!


The way I know what spammers never use is that I store what spammers do use and see whether a phrase fails to match it. I've processed more than 100 million spams, and it's amazing how many common words and phrases spammers never use.
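
A minimal Python sketch of that idea, for illustration only. It assumes whitespace tokenization and word n-grams of up to four words; the function names and the tiny corpora are hypothetical, and this is not the actual junkemailfilter.com code.

```python
def ngrams(text, max_len=4):
    """All word n-grams of 1..max_len words, lowercased."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def build_seen(corpus, max_len=4):
    """Store every phrase that occurs anywhere in a corpus."""
    seen = set()
    for message in corpus:
        seen |= ngrams(message, max_len)
    return seen


def score(message, spam_seen, ham_seen, max_len=4):
    """Count phrases seen in one corpus and never seen in the other."""
    phrases = ngrams(message, max_len)
    ham_only = {p for p in phrases if p in ham_seen and p not in spam_seen}
    spam_only = {p for p in phrases if p in spam_seen and p not in ham_seen}
    return len(ham_only), len(spam_only)


# Tiny made-up corpora, just to exercise the functions.
spam_seen = build_seen(["buy cheap pills now", "stop spam with cheap pills"])
ham_seen = build_seen(["a method for blocking spam",
                       "notes on blocking spam at the gateway"])

result = score("method for blocking spam", spam_seen, ham_seen)
print(result)  # (9, 0): every phrase except the single word "spam" is ham-only
```

The word "spam" shows up in both corpora, so it counts for neither side. The decision rests entirely on the phrases the spam corpus has never contained.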

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
