Over the last few weeks I have tried to create custom rules for several
spam messages that weren't caught (mostly German), and it's always the
same procedure:
- identify catchy phrases that (hopefully) only appear in that kind of spam
- write non-scoring sub-rules for those catchy phrases
- write meta rules that combine a certain number of those phrases
- guess a score for the meta rule and hope it is appropriate, pushing the
  total score just over 5 and no further (roughly as in the sketch below)
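
For illustration, this is roughly what such a hand-made rule set ends up
looking like (the phrase patterns, rule names and score here are made up,
not real rules):

  body     __DE_PHRASE_1  /jetzt bestellen und sparen/i
  body     __DE_PHRASE_2  /nur heute gratis/i
  body     __DE_PHRASE_3  /klicken sie hier/i

  # fire when at least two of the hand-picked phrases are present
  meta     LOCAL_DE_SPAM_PHRASES  (__DE_PHRASE_1 + __DE_PHRASE_2 + __DE_PHRASE_3 >= 2)
  describe LOCAL_DE_SPAM_PHRASES  Combination of hand-picked German spam phrases
  score    LOCAL_DE_SPAM_PHRASES  2.5
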
In the end, I'm doing nothing different from a Bayes filter, only with
phrases of 2-4 words instead of single words, and rating everything
manually instead of letting the Bayes math derive it from analyzed mail.
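
To make the idea concrete, the only part that would have to change is the
tokenizer: instead of (or in addition to) single words it would emit word
n-grams of 2-4 words as tokens. A minimal sketch in plain Python (not
SpamAssassin code, just my assumption of how the token stream would look):

  import re

  def phrase_tokens(text, min_n=2, max_n=4):
      """Yield all phrases of min_n..max_n consecutive words as tokens."""
      words = re.findall(r"[^\W\d_]+", text.lower())   # crude word split
      for n in range(min_n, max_n + 1):
          for i in range(len(words) - n + 1):
              yield " ".join(words[i:i + n])

  # A 6-word sentence yields 5 bigrams, 4 trigrams and 3 four-grams:
  print(list(phrase_tokens("click here to claim your prize")))

Everything downstream (the token database, the probability combining)
could stay as it is; the tokens would just happen to contain spaces.
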
I cannot believe that nobody has tried a Bayes variant that identifies
phrases of 2-4 words and feeds them into the database instead of only
single words. So why isn't this implemented yet?
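
As far as I can tell the scoring side would not even notice the
difference. A Robinson-style per-token estimate (just a generic sketch, I
am not claiming this is SpamAssassin's exact combining formula) treats a
phrase no differently from a word:

  # Hypothetical counters filled during training: token -> number of
  # spam/ham messages it appeared in.
  spam_count, ham_count = {}, {}
  n_spam, n_ham = 1000, 1000        # made-up training corpus sizes

  def spamminess(token, s=1.0, x=0.5):
      """Smoothed P(spam | token); the token may be a word or a phrase."""
      sc, hc = spam_count.get(token, 0), ham_count.get(token, 0)
      if sc == 0 and hc == 0:
          return x
      p = (sc / n_spam) / ((sc / n_spam) + (hc / n_ham))
      n = sc + hc
      return (s * x + n * p) / (s + n)   # shrink rare tokens toward 0.5
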
If somebody did run such an experiment, what was the outcome? Is there
mathematical or empirical evidence that this is no more effective than
single words?
My intuition (or guess) is that processing phrases would be vastly
superior to processing single words, because as a spammer you have to
convince your victim to click on that link, and there are only so many
phrases to do this with, far fewer than there are single words. But since
it isn't implemented, I assume there are arguments that invalidate the
approach. Which ones?
I understand that building a Bayes engine capable of handling phrases
would have to be somewhat more complex. It would have to handle
overlapping phrases and prevent them from scoring more than once for any
given message. Would the database size of such an engine grow without
bounds? Would the processing time get too high?
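
To make the overlap and size concerns concrete (again only a sketch under
my own assumptions, not how any existing engine does it): overlaps could
be handled by greedily keeping only the strongest phrase for any stretch
of text, and a message of W words contributes roughly 3*W phrase tokens
(bigrams + trigrams + four-grams), i.e. three to four times the unigram
token volume, which is what would drive database growth and lookup time.

  def pick_non_overlapping(scored):
      """scored: list of (start_word_index, n_words, strength), where
      strength is e.g. the distance of the token's spamminess from 0.5.
      Greedily keep the strongest phrases whose word ranges don't overlap."""
      chosen, used = [], set()
      for start, n, strength in sorted(scored, key=lambda t: -t[2]):
          span = set(range(start, start + n))
          if not span & used:          # skip phrases overlapping a chosen one
              chosen.append((start, n, strength))
              used |= span
      return chosen

Even so, a 500-word message would still mean on the order of 1500 phrase
lookups per scan, and some kind of expiry of rarely-seen phrases would
probably be unavoidable to keep the database bounded.
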
Alex