On 9/25/2011 5:37 PM, RW wrote:
On Sun, 25 Sep 2011 09:28:32 -0700
Marc Perkel wrote:
Here's what I'd like to be able to do. I'd like a program of some
sort where I could take word tokens - like the names of the rules that
were triggered - and look for rule combinations that indicate spam or ham.
For example, a message triggers 4 rules A B C and D. These rules are
combined as follows:
A
...
ABCD
Each rule combo is then looked up for how often it occurs in spam and
how often it occurs in ham. Then the results are combined into some
sort of likelihood of being spam or ham.
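A rough Python sketch of the lookup-and-combine idea described above. It is only an illustration under stated assumptions, not an existing SpamAssassin feature: the count tables (spam_counts, ham_counts), the function names, and the add-one smoothing are all made up for the example, and summing log-likelihood ratios treats overlapping combinations as if they were independent, which they are not.

```python
from itertools import combinations
from math import exp, log


def rule_combos(rules):
    """Return every non-empty combination of the triggered rules as sorted tuples."""
    rules = sorted(set(rules))
    return [c for r in range(1, len(rules) + 1) for c in combinations(rules, r)]


def spam_probability(rules, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-combination evidence into a single spam probability.

    spam_counts / ham_counts are hypothetical dicts mapping a combo tuple to
    the number of spam / ham messages in which that combo was seen.
    """
    log_odds = log(n_spam / n_ham)  # prior odds from the corpus sizes
    for combo in rule_combos(rules):
        # Add-one smoothing so unseen combos do not produce log(0).
        p_spam = (spam_counts.get(combo, 0) + 1) / (n_spam + 2)
        p_ham = (ham_counts.get(combo, 0) + 1) / (n_ham + 2)
        log_odds += log(p_spam / p_ham)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1 / (1 + exp(-log_odds))
    e = exp(log_odds)
    return e / (1 + e)
```

Note that a message hitting n rules produces 2^n - 1 combinations (15 for A B C D), so in practice the combination size would probably have to be capped.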
There are a couple of problems with this. The first is that most SA
rules are either neutral or strong spam indicators, which makes them
unsuitable for the sort of techniques used in Bayes.
The second is that most of the scope for meaningful combinations is in
high-scoring spam. Low-scoring spams are low-scoring because SA couldn't
find much evidence - in these you're going to end up with
meaningless strong+neutral combinations like BAYES_99+SPF_PASS.
That's not to say that it can't be done in a more general sense; the
scoring system is a way of converting rule combinations into a
classification.
Similar questions have been asked before. IIRC someone came up with
an alternative way of getting a classification from the rule hits
based on learning, and made a basic plugin that tweaked the score
accordingly.
Here's the kind of thing I'm seeing. Spam talks about money - low score.
Spam talks about Jesus - low score. Spam talks about money and Jesus and
throws in a "dear someone" and it's spam. I'm hoping to detect combinations
automatically.
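
One way the automatic detection might look, as a minimal Python sketch. Everything here is an assumption for illustration: the rule names in the usage note are made up, the corpora are just lists of rule-hit lists, only pairs are considered, and the min_ratio threshold is arbitrary. It flags pairs whose smoothed spam-to-ham ratio is high and also higher than that of either rule alone.

```python
from collections import Counter
from itertools import combinations


def count_singles_and_pairs(messages):
    """Count how many messages hit each single rule and each pair of rules."""
    counts = Counter()
    for rules in messages:
        rules = sorted(set(rules))
        for size in (1, 2):
            counts.update(combinations(rules, size))
    return counts


def interesting_pairs(spam_msgs, ham_msgs, min_ratio=5.0):
    """Return (pair, ratio) for pairs that are far spammier than their parts."""
    spam = count_singles_and_pairs(spam_msgs)
    ham = count_singles_and_pairs(ham_msgs)
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)

    def ratio(combo):
        # Smoothed spam frequency divided by smoothed ham frequency.
        return ((spam[combo] + 1) / (n_spam + 2)) / ((ham[combo] + 1) / (n_ham + 2))

    found = []
    for combo in spam:
        if len(combo) != 2:
            continue
        pair_ratio = ratio(combo)
        best_single = max(ratio((rule,)) for rule in combo)
        if pair_ratio >= min_ratio and pair_ratio > best_single:
            found.append((combo, pair_ratio))
    return sorted(found, key=lambda item: -item[1])
```

Given a spam corpus where hypothetical MONEY and JESUS rules mostly fire together and a ham corpus where they only fire separately, interesting_pairs would report the (JESUS, MONEY) pair even though each rule on its own is close to neutral.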
--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400