Over the last few weeks I have tried to create custom rules for several
spam messages that weren't caught (mostly German), and it's always the
same procedure:
- identify catchy phrases that (hopefully) only appear in that kind of spam
- write non-scoring sub-rules for those catchy phrases
- write meta rules that combine a certain number of those phrases
- guess a score for the meta rule and hope it is appropriate, pushing the
  total score just over 5 and no further (roughly as in the sketch below)
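
For illustration, this is roughly what such a hand-made rule set ends up
looking like (the phrase patterns, rule names and score here are made up,
not real rules):

  body     __DE_PHRASE_1  /jetzt bestellen und sparen/i
  body     __DE_PHRASE_2  /nur heute gratis/i
  body     __DE_PHRASE_3  /klicken sie hier/i

  # fire when at least two of the hand-picked phrases are present
  meta     LOCAL_DE_SPAM_PHRASES  (__DE_PHRASE_1 + __DE_PHRASE_2 + __DE_PHRASE_3 >= 2)
  describe LOCAL_DE_SPAM_PHRASES  Combination of hand-picked German spam phrases
  score    LOCAL_DE_SPAM_PHRASES  2.5
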
In the end, I'm doing nothing different from a Bayes filter, only with
phrases of 2-4 words instead of single words, and rating everything
manually instead of letting the Bayes math derive it from analyzed mail.
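
To make the idea concrete, the only part that would have to change is the
tokenizer: instead of (or in addition to) single words it would emit word
n-grams of 2-4 words as tokens. A minimal sketch in plain Python (not
SpamAssassin code, just my assumption of how the token stream would look):

  import re

  def phrase_tokens(text, min_n=2, max_n=4):
      """Yield all phrases of min_n..max_n consecutive words as tokens."""
      words = re.findall(r"[^\W\d_]+", text.lower())   # crude word split
      for n in range(min_n, max_n + 1):
          for i in range(len(words) - n + 1):
              yield " ".join(words[i:i + n])

  # A 6-word sentence yields 5 bigrams, 4 trigrams and 3 four-grams:
  print(list(phrase_tokens("click here to claim your prize")))

Everything downstream (the token database, the probability combining)
could stay as it is; the tokens would just happen to contain spaces.
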
I cannot believe that nobody has tried a Bayes variant that identifies
phrases of 2-4 words and feeds them into the database instead of only
single words. So why isn't this implemented yet?
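
As far as I can tell the scoring side would not even notice the
difference. A Robinson-style per-token estimate (just a generic sketch, I
am not claiming this is SpamAssassin's exact combining formula) treats a
phrase no differently from a word:

  # Hypothetical counters filled during training: token -> number of
  # spam/ham messages it appeared in.
  spam_count, ham_count = {}, {}
  n_spam, n_ham = 1000, 1000        # made-up training corpus sizes

  def spamminess(token, s=1.0, x=0.5):
      """Smoothed P(spam | token); the token may be a word or a phrase."""
      sc, hc = spam_count.get(token, 0), ham_count.get(token, 0)
      if sc == 0 and hc == 0:
          return x
      p = (sc / n_spam) / ((sc / n_spam) + (hc / n_ham))
      n = sc + hc
      return (s * x + n * p) / (s + n)   # shrink rare tokens toward 0.5
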
If somebody did run such an experiment, what was the outcome? Is there
mathematical or empirical evidence that this is no more effective than
single words?
My intuition (or guess) is that processing phrases would be vastly
superior to processing single words, because as a spammer you have to
convince your victim to click on that link, and there are only so many
phrases to do this with, far fewer than there are single words. But since
it isn't implemented, I assume there are arguments that invalidate the
approach. Which ones?
I understand that building a Bayes engine capable of handling phrases
would have to be somewhat more complex. It would have to handle
overlapping phrases and prevent them from scoring more than once for any
given message. Would the database size of such an engine grow without
bounds? Would the processing time get too high?
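
To make the overlap and size concerns concrete (again only a sketch under
my own assumptions, not how any existing engine does it): overlaps could
be handled by greedily keeping only the strongest phrase for any stretch
of text, and a message of W words contributes roughly 3*W phrase tokens
(bigrams + trigrams + four-grams), i.e. three to four times the unigram
token volume, which is what would drive database growth and lookup time.

  def pick_non_overlapping(scored):
      """scored: list of (start_word_index, n_words, strength), where
      strength is e.g. the distance of the token's spamminess from 0.5.
      Greedily keep the strongest phrases whose word ranges don't overlap."""
      chosen, used = [], set()
      for start, n, strength in sorted(scored, key=lambda t: -t[2]):
          span = set(range(start, start + n))
          if not span & used:          # skip phrases overlapping a chosen one
              chosen.append((start, n, strength))
              used |= span
      return chosen

Even so, a 500-word message would still mean on the order of 1500 phrase
lookups per scan, and some kind of expiry of rarely-seen phrases would
probably be unavoidable to keep the database bounded.
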
Alex