Sorry, how is this different from Naive Bayes filtering?

"Naive Bayes classifiers work by correlating the use of tokens (typically 
words, or sometimes other things), with spam and non-spam e-mails and then 
using Bayes' theorem to calculate a probability that an email is or is not 
spam."
— https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering

"the set of fingerprints of the test message is intersected with the spam and 
ham corpi creating sub sets of matches. Then you do a set diff both ways (ham - 
spam) (spam - ham) and whichever side is bigger wins. Generally it will match 
on only one side or very predominately on one side.” — Marc Perkel

You are still looking up words/phrases in a dictionary set, and coming up with 
a probability factor of which side it falls on (an application of Bayes' 
theorem).

Or did I miss something?



On Jan 20, 2016, at 9:17 AM, Wrolf <wr...@wrolf.net> wrote:

Good luck with your patent application, it should be in the infinitely elastic 
queue right after my perpetual motion machine.

Not sure how you will deal with the number of ham tokens in spam messages. Also 
not sure how much ham will get canned as spam - but then, maybe people 
shouldn't be sending each other poetry?

haiku by email
blossoms in my inbox
drink morning coffee


;-)


Wrolf
wr...@wrolf.net

On Wed, Jan 20, 2016 at 11:52 AM, Marc Perkel <supp...@junkemailfilter.com> 
wrote:
OK - following up on this. I have my provisional patent filed. I'm still doing 
development to improve it and working on a licensing contract. But the license 
will be based on the Creative Commons patent with some restrictions added. 
Basically I want to get a license fee from the big guys and my spam filtering 
competitors. So unless you are in the spam filtering business or have more than 
10,000 email addresses it's not going to cost you anything.

I'm going to describe the concept here. I'm not going to share my code because 
my code is specific to my system and it is a combination of bash scripts, Redis, 
Pascal, PHP, and Exim rules. And the open source programmers are likely to 
implement it better than I have. Basically I'm trying not to put myself out of 
business and this new method is a bigger breakthrough than Bayesian filtering.

Maybe I should call it a new plan for spam?

So - I'm just going to introduce the concept right now about how it works. Once 
you know what I'm doing it should be easy to implement. I had it working in a 
couple of days and I'm not an outstanding programmer. One thing to keep in mind 
is this is a paradigm shift. It's not about matching - it's about NOT matching. 
And although it is far better at catching spam, its best feature is actively 
identifying good email.

The secret sauce

Suppose I get an email with the subject line "Let's get some lunch". I know 
it's a good email because spammers never say "Let's get some lunch". In fact 
there are an infinite number of words and phrases that are used in good email 
that are never ever used in spam. And if I'm using words and phrases never used 
in spam that are used in ham - it's good email. And similarly - if I'm using 
words and phrases that are used in spam and never used in ham - it's spam.

So - how do I get a list of words and phrases never used in spam? I create a 
list of words and phrases that are used in spam and check to see if it's not on 
the list.

What I do is tokenize the spammiest parts of the email, like the subject line, 
into phrases of 1, 2, 3, and 4 words.

the quick brown fox jumps over the lazy dog - becomes

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox" "brown 
fox" "quick brown fox" "the quick brown fox" "jumps" "fox jumps" "brown fox 
jumps" "quick brown fox jumps" "over" "jumps over" "fox jumps over" "brown fox 
jumps over" "the" "over the" "jumps over the" "fox jumps over the" "lazy" "the 
lazy" "over the lazy" "jumps over the lazy" "dog" "lazy dog" "the lazy dog" 
"over the lazy dog"
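
The phrase expansion above can be sketched in a few lines of Python. This is a 
minimal illustration, not the author's code: the function name `tokenize`, the 
whitespace splitting, and the lowercasing are assumptions made for the example.

```python
def tokenize(text, max_words=4):
    """Return every phrase of 1..max_words consecutive words in text.

    Assumptions for this sketch: words are split on whitespace and
    lowercased. Phrases are emitted in the same order as the example
    above: for each word, the phrases that END at that word, shortest
    first.
    """
    words = text.lower().split()
    tokens = []
    for end in range(len(words)):
        for size in range(1, max_words + 1):
            start = end - size + 1
            if start < 0:
                break  # not enough preceding words for a longer phrase
            tokens.append(" ".join(words[start:end + 1]))
    return tokens


if __name__ == "__main__":
    for token in tokenize("the quick brown fox jumps over the lazy dog"):
        print(repr(token))
```

A nine-word subject line yields 30 phrases this way, which is why a whole 
message produces hundreds of tokens.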

These tokens are learned as ham or spam and added to sets. I'm using Redis to 
do this because it has extremely fast set operations. I don't know of anything 
other than Redis that can do this. So think about Redis as the way to implement 
this.

A new message comes in. It is tokenized and fingerprinted and hundreds of 
fingerprints are generated. Then it's all set operations. The set of 
fingerprints of the test message is intersected with the spam and ham corpora, 
creating subsets of matches. Then you do a set diff both ways (ham - spam) 
(spam - ham) and whichever side is bigger wins. Generally it will match on only 
one side or very predominantly on one side.
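
The set operations described above can be sketched with Python's built-in sets 
standing in for Redis (in Redis, the SINTER and SDIFF commands would play the 
same roles). The names `ngrams` and `classify`, the tiny example corpora, and 
the "unsure" result on a tie are assumptions made for this sketch; the post 
does not say how ties are handled.

```python
def ngrams(text, max_words=4):
    """Set of all phrases of 1..max_words consecutive words (lowercased)."""
    words = text.lower().split()
    return {
        " ".join(words[start:start + size])
        for size in range(1, max_words + 1)
        for start in range(len(words) - size + 1)
    }


def classify(message_tokens, ham_corpus, spam_corpus):
    """Intersect with each corpus, diff both ways, larger side wins."""
    ham_matches = message_tokens & ham_corpus    # tokens seen in ham
    spam_matches = message_tokens & spam_corpus  # tokens seen in spam
    ham_only = ham_matches - spam_matches        # seen in ham, never in spam
    spam_only = spam_matches - ham_matches       # seen in spam, never in ham
    if len(ham_only) > len(spam_only):
        return "ham"
    if len(spam_only) > len(ham_only):
        return "spam"
    return "unsure"  # assumption: a tie is left undecided


# Hypothetical learned corpora, built from a few example subject lines.
ham_corpus = ngrams("let's get some lunch") | ngrams("are you free for lunch today")
spam_corpus = ngrams("get cheap pills today")

if __name__ == "__main__":
    print(classify(ngrams("let's get lunch today"), ham_corpus, spam_corpus))
    print(classify(ngrams("cheap pills today"), ham_corpus, spam_corpus))
```

Tokens that appear in both corpora (here "get" and "today") cancel out in the 
two differences, so only the tokens unique to one side count toward the result.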

So I'm not just tokenizing the subject. I also tokenize the first 25 words of 
the message, the text of links in the message, the name part of the From 
address, the header names, the attachment names, the PHP script if there is 
one, and various behavior characteristics (slow, no quit, no RDNS, number of 
MIME parts, multiple recipients, etc.).
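
As a rough sketch of pulling some of those fields out of a message, the example 
below uses Python's standard `email` package. The function name 
`extract_fields` and the dictionary keys are hypothetical, and only a subset of 
the listed sources is covered: link text, PHP scripts, and the SMTP-level 
behavior characteristics are not available from the message text alone and are 
left out.

```python
import email
from email import policy
from email.utils import parseaddr


def extract_fields(raw_message):
    """Collect a few of the message parts that would later be tokenized."""
    msg = email.message_from_string(raw_message, policy=policy.default)

    # Plain-text body, if the message has one.
    body_part = msg.get_body(preferencelist=("plain",))
    body_text = body_part.get_content() if body_part is not None else ""

    # Display name from the From header ("Alice Example", not the address).
    display_name, _address = parseaddr(str(msg.get("From", "")))

    return {
        "subject": str(msg.get("Subject", "")),
        "body_start": " ".join(body_text.split()[:25]),  # first 25 words
        "from_name": display_name,
        "header_names": list(msg.keys()),
        "attachment_names": [
            part.get_filename()
            for part in msg.iter_attachments()
            if part.get_filename()
        ],
    }


if __name__ == "__main__":
    raw = (
        "From: Alice Example <alice@example.com>\n"
        "Subject: Let's get some lunch\n"
        "X-Mailer: Demo\n"
        "\n"
        "Hi Bob, are you free at noon today?\n"
    )
    for name, value in extract_fields(raw).items():
        print(name, "=", repr(value))
```

Each extracted field would then go through the same phrase tokenization as the 
subject line.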

SpamAssassin is all about matching rules. This is all about not matching. Not 
matching allows you to compare to an infinite set rather than a finite set. So 
when spammers start misspelling words to not match the rules, my system catches 
that and makes its own rules. The tricks that spammers use now make it easier 
to catch them using this method.


I will post a link to a better explanation later when I write one. But wanted 
to let you all know this wasn't just a tease from some crazy person.

So - here's what I want to see happen.

I'd like to see SA implement this. I will provide a license to include with it 
giving most people a free license, sort of like how Spamhaus isn't free to 
everyone, but it's in SA. Then the new method will take off and eventually I'll 
get a little something for this.

This new method (I'm calling it the Evolution Spam Filter because the algorithm 
mimics evolution) doesn't just block spammers, it decimates spammers. It's 
not just a treatment - it's the cure. I hate spam and although I could have 
kept this secret and made money having the best spam filter on the planet, I 
decided I had a moral obligation to make this generally available. I think this 
will save the global economy billions of dollars in recovered productivity and 
crime and fraud prevention.

I'm seeing close to 100% accuracy. It is so accurate it's scary and I think my 
implementation is crude at best. I think if it were done right it could even 
get closer to 100% than I have. Once you wrap your brain around the concept 
it's almost scary how well it works.

A side effect is that this is a very fast and simple recursive learner. What 
happens is that as people converse by email it learns more words and phrases 
about the stuff that people talk about that are never used in spam. It doesn't 
have to know what language you are using, it will learn it on its own. It's 
like having SA with 100 million accurate rules where it writes new rules itself.

I will leave you with that and I'll have more later.



