On Jun 30, 2007, at 11:55 PM, Loren Wilton wrote:
Unfortunately I'm not using the SpamAssassin Bayes modules -- I wrote
my own Bayes engine because I wanted to, and then thought
about including the rule results from SpamAssassin. I don't
know where this might be going, but it seems to be working
extremely well for me based on a training set of just a couple
hundred emails in total.
I don't see this as a problem. Someone, I forget who, has a Bayes
filter chained to an SA setup; I think the Bayes comes first, but I
don't recall. He was claiming good results from chained classifiers
using slightly different data and methods. This seems like a
reasonably plausible claim to me.
If you have a pre-existing Bayes mail filter, and it runs as a
filter in a pipe or the like, then basically what you want to do
seems very simple to me, at least conceptually. Just run the mail
through SA first and then into your classifier. The rule names hit
along with their scores will be in the header of the mail you
process in your classifier, and thus, as long as you don't ignore
header data, the rule names are there to process. No need even to
modify SA. In fact you can get a header with just the rule names
hit without the scores, so you don't have the score values being
scored as tokens.
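
A minimal sketch of pulling the rule names out of the header as tokens, assuming the classifier is written in Python and that SA has added its usual `X-Spam-Status` header with a `tests=` list. The function name `sa_rule_tokens`, the `sa:` token prefix, and the sample message are made up for illustration.

```python
import email
import re

def sa_rule_tokens(raw_message, prefix="sa:"):
    """Return the rule names listed in X-Spam-Status as prefixed tokens."""
    msg = email.message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The tests= list is often folded across lines after a comma;
    # drop that whitespace so the list becomes one unbroken word.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S*)", status)
    if not match:
        return []
    names = [n for n in match.group(1).split(",") if n and n != "none"]
    # The prefix keeps rule names apart from ordinary word tokens.
    return [prefix + n for n in names]

# Made-up sample message with a folded X-Spam-Status header.
sample = (
    "X-Spam-Status: Yes, score=7.2 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
    "\tMIME_HTML_ONLY autolearn=no version=3.2.0\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
print(sa_rule_tokens(sample))
```

Only the names are taken, so the numeric score never becomes a token.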
The only case where you would have to modify SA (in either Check or
PMS, I think) is if you really did want to bloat every mail with the
names of all of the rules in the SA database, rather than just
those pertinent to the mail at hand.
I think the trick is simply looking at your mail chain and figuring
out how to insert a call to SA before the call to your own Bayes
module.
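
One way the chaining could look, as a sketch only: it assumes the stock `spamassassin` script is on the PATH and that the local classifier is a Python callable taking the raw message bytes. The `-L` switch asks the script for local tests only; the names `filter_through` and `classify_with_sa` are hypothetical.

```python
import subprocess

# Assumption: the stock spamassassin script is installed and on PATH.
# -L restricts it to local tests (no network lookups).
SA_CMD = ("spamassassin", "-L")

def filter_through(cmd, raw_message):
    """Pipe raw_message (bytes) through cmd and return its stdout."""
    done = subprocess.run(list(cmd), input=raw_message,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout

def classify_with_sa(raw_message, classify, sa_cmd=SA_CMD):
    """Run SA first, then hand the SA-annotated message to the classifier."""
    annotated = filter_through(sa_cmd, raw_message)
    return classify(annotated)
```

A call such as `classify_with_sa(raw_bytes, my_bayes)` would then see the headers SA added, where `my_bayes` stands in for whatever the existing filter exposes.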
Actually I have this, but I don't have it writing the headers into
the email. It's sending the SA data as attached information so I
can keep track of where it came from (header/body/metadata). I'm not
sure that the scoring is going to cost me anything or cause any
performance issues compared to getting the hits/misses. I think
we're debating the cpu involved to determine a number for the score,
not the scoring process itself.
I have a question about the sub rules -- are they themselves adding
up to an overall rule by means of hit/miss?
Is there any conceptual advantage to pulling both rules and sub_rules
into this process?
And the more I think about it, the more I don't need to "bloat every
mail with the names of all the rules".
But sub_rules might be more useful.
---
By not putting in all the SA rules it might make it easier to
establish the contribution of the scoring, but you have to know the
intended target (RULE => spam or RULE => ham) which isn't an issue
with today's rules (but you never know). Once you know this, the
effectiveness of a rule would be measured by its distance in
probability from 0.500 toward 1.00. I can track this eventually, but
I think I need to reset my database to be certain of its value. Not
a problem, I am my own admin.
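
A sketch of how that measurement might be tracked, assuming per-rule hit counts are kept separately for spam and ham. The estimate below is a simple Graham-style ratio of hit rates, which is an assumption and not necessarily what my engine computes; the function name and the counts in the example are made up.

```python
def rule_effectiveness(spam_hits, ham_hits, n_spam, n_ham):
    """Return (p_spam, distance), where distance = |p_spam - 0.5|."""
    # Per-class hit rates, so unequal corpus sizes do not skew the estimate.
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5, 0.0  # rule never hit: no evidence either way
    p_spam = spam_rate / (spam_rate + ham_rate)
    return p_spam, abs(p_spam - 0.5)

# Hypothetical counts: a rule that hit 90 of 100 spam and 10 of 100 ham.
p, distance = rule_effectiveness(90, 10, 100, 100)
print(round(p, 3), round(distance, 3))
```

A value of `p_spam` above 0.5 suggests a spam-targeted rule and below 0.5 a ham-targeted one, so the intended target can be compared against what the data shows.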
But the real challenge for me, as has always been the case with SA,
is the proper care and feeding of the application when not using the
standard spamc/spamd and spamassassin scripts. I suspect this starts
with a lot of RTFM and then I can get to some real questions. The
difficulty for me is trimming out all the steps in the application
that I won't be benefitting from. I would like to start with
something that is approximately: local "static" rules only, no
user-specific preferences, no learning or Bayes or white/black listing.
By local "static" I mean using the rules based on email content
analysis, without network consultation (DNS, RBL, DCC...).