OK - following up on this. I have my provisional patent filed. I'm still
doing development to improve it and working on a licensing contract. But
the license will be based on the Creative Commons patent with some
restrictions added. Basically I want to get a license fee from the big
guys and my spam filtering competitors. So unless you are in the spam
filtering business or have more than 10,000 email addresses it's not
going to cost you anything.
I'm going to describe the concept here. I'm not going to share my code
because my code is specific to my system and it is a combination of bash
scripts, Redis, Pascal, PHP, and Exim rules. And the open source
programmers are likely to implement it better than I have. Basically I'm
trying not to put myself out of business and this new method is a bigger
breakthrough than Bayesian filtering.
Maybe I should call it a new plan for spam?
So - I'm just going to introduce the concept right now about how it
works. Once you know what I'm doing it should be easy to implement. I
had it working in a couple of days and I'm not an outstanding
programmer. One thing to keep in mind is this is a paradigm shift. It's
not about matching - *it's about NOT matching*. And although it is far
better at catching spam, its best feature is actively identifying good email.
The secret sauce
Suppose I get an email with the subject line "Let's get some lunch". I
know it's a good email because spammers never say "Let's get some lunch".
In fact there are an infinite number of words and phrases that are used
in good email that are never ever used in spam. And if I'm using words
and phrases *never used in spam* that are used in ham - it's good email.
And similarly - if I'm using words and phrases that are used in spam and
*never used in ham* - it's spam.
So - how do I get a list of words and phrases never used in spam? I
create a list of words and phrases that are used in spam and check to
see if a given word or phrase is *not on the list*.
What I do is tokenize the spammiest parts of the email, like the subject
line, into words and phrases of 1, 2, 3, and 4 words.
the quick brown fox jumps over the lazy dog - becomes
"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox"
"brown fox" "quick brown fox" "the quick brown fox" "jumps" "fox jumps"
"brown fox jumps" "quick brown fox jumps" "over" "jumps over" "fox jumps
over" "brown fox jumps over" "the" "over the" "jumps over the" "fox
jumps over the" "lazy" "the lazy" "over the lazy" "jumps over the lazy"
"dog" "lazy dog" "the lazy dog" "over the lazy dog"
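Here is a minimal Python sketch of that tokenizing step. It is not my
actual code. The function name ngram_tokens and the lowercasing are
assumptions made for illustration; the 4-word limit comes from the
description above.

```python
def ngram_tokens(text, max_words=4):
    """Return every phrase of 1..max_words consecutive words.

    Phrases are emitted in the same order as the example above: for each
    word, the word itself, then the 2-, 3- and 4-word phrases ending on it.
    Lowercasing is an assumption, not something the method requires.
    """
    words = text.lower().split()
    tokens = []
    for end in range(1, len(words) + 1):
        for size in range(1, max_words + 1):
            start = end - size
            if start < 0:
                break  # not enough preceding words for a longer phrase
            tokens.append(" ".join(words[start:end]))
    return tokens


tokens = ngram_tokens("the quick brown fox jumps over the lazy dog")
print(tokens)
```

The list keeps duplicates (the word "the" shows up twice), but they
collapse once the tokens are added to a set.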
These tokens are learned as ham or spam and added to sets. I'm using
Redis to do this because it has extremely fast set operations. I don't
know of anything other than Redis that can do this. So think about Redis
as the way to implement this.
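To make the learning step concrete, here is a small sketch that uses
plain Python sets as a stand-in for the two Redis sets. The names corpus
and learn are made up for this example. In Redis the equivalent of
learn would be an SADD of the tokens to a spam key or a ham key.

```python
# Hypothetical in-memory stand-in for two Redis sets.
corpus = {"spam": set(), "ham": set()}


def learn(tokens, label):
    """Add a message's tokens to the chosen corpus set (like Redis SADD)."""
    if label not in corpus:
        raise ValueError("label must be 'spam' or 'ham'")
    corpus[label].update(tokens)


learn(["cheap", "pills", "cheap pills"], "spam")
learn(["lunch", "get some lunch"], "ham")
learn(["cheap", "cheap pills"], "spam")  # repeats are absorbed by the set

print(sorted(corpus["spam"]))
print(sorted(corpus["ham"]))
```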
A new message comes in. It is tokenized and fingerprinted and hundreds
of fingerprints are generated. Then it's all set operations. The set of
fingerprints of the test message is intersected with the spam and ham
corpora, creating subsets of matches. Then you do a set diff both ways
(ham - spam) (spam - ham) and whichever side is bigger wins. Generally
it will match on only one side or very predominantly on one side.
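The scoring step can be sketched with ordinary Python sets. This is an
illustration, not my implementation: the function name classify and the
"unsure" result on a tie are assumptions. In Redis the same steps map
onto SINTER for the intersections and SDIFF for the two differences.

```python
def classify(message_tokens, ham_corpus, spam_corpus):
    """Intersect the message with each corpus, diff both ways, compare sizes."""
    message = set(message_tokens)
    ham_matches = message & ham_corpus      # tokens seen before in ham
    spam_matches = message & spam_corpus    # tokens seen before in spam
    ham_only = ham_matches - spam_matches   # seen in ham, never in spam
    spam_only = spam_matches - ham_matches  # seen in spam, never in ham
    if len(ham_only) > len(spam_only):
        verdict = "ham"
    elif len(spam_only) > len(ham_only):
        verdict = "spam"
    else:
        verdict = "unsure"  # assumption: a tie is left undecided
    return verdict, ham_only, spam_only


ham_corpus = {"the", "lunch", "get some lunch", "meeting"}
spam_corpus = {"the", "cheap", "pills", "cheap pills"}

print(classify(["the", "lunch", "get some lunch"], ham_corpus, spam_corpus))
print(classify(["the", "cheap", "cheap pills"], ham_corpus, spam_corpus))
```

Notice that "the" is in both corpora, so it drops out of both diffs and
has no vote. Only the tokens that one side has never used count.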
So I'm not just tokenizing the subject. I also tokenize the first 25
words of the message, the text of links in the message, the name part
of the From address, the header names, the attachment names, the PHP
script if there is one, and various behavior characteristics (slow, no
quit, no RDNS, number of MIME parts, multiple recipients, etc.).
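One possible way to turn several parts of a message into a single set
of fingerprints is sketched below. Everything specific here is an
assumption made for the example: the field names, the "field:token"
prefix format, and the behavior flag names. Only the 25-word limit on
the body and the 1 to 4 word phrases come from the description above.

```python
def field_fingerprints(field, text, max_words=4):
    """Phrases of 1..max_words words from one field, prefixed with the field."""
    words = text.lower().split()
    prints = set()
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size > len(words):
                break
            phrase = " ".join(words[start:start + size])
            prints.add(f"{field}:{phrase}")
    return prints


def message_fingerprints(parts, behaviors):
    """Combine fingerprints from every text part plus behavior flags."""
    prints = set()
    for field, text in parts.items():
        if field == "body":
            text = " ".join(text.split()[:25])  # only the first 25 words
        prints |= field_fingerprints(field, text)
    prints |= {f"behavior:{flag}" for flag in behaviors}
    return prints


example = message_fingerprints(
    {"subject": "lunch today", "from_name": "Alice"},
    ["no_rdns"],
)
print(sorted(example))
```

Prefixing keeps a word in the subject separate from the same word in a
link or a From name, so each part of the message gets its own vote.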
SpamAssassin is all about matching rules. This is all about not
matching. Not matching allows you to compare to an infinite set rather
than a finite set. So when spammers start misspelling words to not match
the rules, my system catches that and makes its own rules. The tricks
that spammers use now make it easier to catch them using this method.
I will post a link to a better explanation later when I write one. But
I wanted to let you all know this wasn't just a tease from some crazy person.
So - here's what I want to see happen.
I'd like to see SA implement this. I will provide a license to include
with it giving most people a free license. Sort of like how Spamhaus
isn't free to everyone, but it's in SA. Then the new method will take
off and eventually I'll get a little something for this.
This new method (I'm calling it the Evolution Spam Filter because the
algorithm mimics evolution) doesn't just block spammers, it
decimates spammers. It's not just a treatment - it's the cure. I hate
spam and although I could have kept this secret and made money having
the best spam filter on the planet, I decided I had a moral obligation
to make this generally available. I think this will save the global
economy billions of dollars in recovered productivity and crime and
fraud prevention.
I'm seeing close to 100% accuracy. It is so accurate it's scary and I
think my implementation is crude at best. I think if it were done right
it could even get closer to 100% than I have. Once you wrap your brain
around the concept it's almost scary how well it works.
A side effect is that this is a very fast and simple recursive learner.
What happens is that as people converse by email it learns more words
and phrases about the stuff that people talk about that are never used
in spam. It doesn't have to know what language you are using, it will
learn it on its own. It's like having SA with 100 million accurate
rules where it writes new rules itself.
I will leave you with that and I'll have more later.