Re: A different approach to scoring spamassassin hits

Tom Allison Sun, 01 Jul 2007 08:08:01 -0700


On Jun 30, 2007, at 11:55 PM, Loren Wilton wrote:

Unfortunately I'm not on the SpamAssassin Bayes modules -- I wrote my own Bayes Engine because I wanted to do that and then thought about including the Rules results from SpamAssassin. I don't know where this might be going, but it seems to be working extremely well for me based on a training set of just a couple hundred emails in total.
Don't see this as a problem. Someone, I forget who, has a Bayes chained to an SA setup, I think the Bayes comes first, but I don't recall. He was claiming good results from chained classifiers using slightly different data and methods. This seems like a reasonably possible contention to me.
If you have a pre-existing Bayes mail filter, and it runs as a filter in a pipe or the like, then basically what you want to do seems very simple to me, at least conceptually. Just run the mail through SA first and then into your classifier. The rule names hit along with their scores will be in the header of the mail you process in your classifier, and thus, as long as you don't ignore header data, the rule names are there to process. No need even to modify SA. In fact you can get a header with just the rule names hit without the scores, so you don't have the score values being scored as tokens.
The only case where you would have to modify SA, in I think either Check or PMS, is if you really did want to bloat every mail with the names of all of the rules in the SA database, rather than just those pertinent to the mail at hand.
I think the trick is simply looking at your mail chain and figuring out how to insert a call to SA before the call to your own Bayes module.
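
As a rough illustration of the chaining described above, here is a minimal Python sketch that pulls the names of the rules hit out of a message that has already been through SA, so they can be fed to a separate classifier as tokens. It assumes the message carries the usual X-Spam-Status header with a tests=... list; the function name sa_rule_tokens and the "SA_RULE:" token prefix are made up for this example and are not part of SA or of anyone's existing setup.

```python
import re
from email import message_from_string


def sa_rule_tokens(raw_message, prefix="SA_RULE:"):
    """Return one token per SA rule name found in X-Spam-Status.

    Assumes a header of the form
    'X-Spam-Status: Yes, score=7.2 required=5.0 tests=A,B autolearn=no ...'.
    The prefix is a hypothetical namespace so rule names cannot collide
    with ordinary word tokens in the classifier.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The header is often folded across lines after a comma; rejoin the list.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S*)", status)
    if not match:
        return []
    names = [n for n in match.group(1).split(",") if n and n.lower() != "none"]
    return [prefix + n for n in names]
```

Only the rule names are kept, not the score, so the score value itself never becomes a token.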

Actually I have this but I don't have it writing the headers into the email. It's sending the SA data as attached information so I can keep track of where it came from (header/body/metadata). I'm not sure that the scoring is going to cost me anything or cause any performance issues compared to getting the hits/misses. I think we're debating the cpu involved to determine a number for the score, not the scoring process itself.

I have a question about the sub rules -- are they themselves adding up to an overall rule by means of hit/miss? Is there any conceptual advantage to pulling in rules and sub_rules to this process?

And the more I think about it, the more I don't need to "bloat every mail with the names of all the rules".

But sub_rules might be more useful.

---

By not putting in all the SA rules it might make it easier to establish the contribution of the scoring, but you have to know the intended target (RULE => spam or RULE => ham), which isn't an issue with today's rules (but you never know). Once you know this, the effectiveness of a rule would be measured by its distance in probability from 0.500 toward 1.00. I can track this eventually, but I think I need to reset my database to be certain of its value. Not a problem, I am my own admin.
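
To make the "distance from 0.500" idea concrete, here is a small Python sketch. It assumes a simple per-token estimate of the kind many Bayes filters use (the rule's hit rate in spam divided by the sum of its hit rates in spam and ham); that formula, the function names, and the counts are assumptions for illustration, not what my engine or SA actually does.

```python
def rule_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | rule hit) from training counts.

    Assumed estimator: spam hit rate / (spam hit rate + ham hit rate).
    Returns 0.5 (no information) if the rule was never seen.
    """
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5
    return spam_rate / (spam_rate + ham_rate)


def rule_effectiveness(probability, target="spam"):
    """Signed distance from 0.5 in the direction the rule is meant to point.

    For a spam rule this runs from 0.5 toward 1.0, for a ham rule from 0.5
    toward 0.0. A negative value means the rule points the wrong way.
    """
    if target == "spam":
        return probability - 0.5
    return 0.5 - probability
```

With these assumptions, a spam rule that hits 90 of 100 spam and 10 of 100 ham gets a probability of 0.9 and an effectiveness of about 0.4 out of a possible 0.5.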

But the real challenge for me, as has always been the case with SA, is the proper care and feeding of the application when not using the standard spamc/spamd and spamassassin scripts. I suspect this starts with a lot of RTFM and then I can get to some real questions. The difficulty for me is trimming out all the steps in the application that I won't be benefitting from. I would like to start with something that is approximately: local "static" rules only, no user-specific preferences, no learning or bayes or white/black listing. By local "static" I mean to use the rules based on email content analysis without network consultation (DNS, RBL, DCC...).
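
As a starting point only, here is a Python sketch of calling the stock spamassassin script with network tests turned off and Bayes disabled. It assumes the spamassassin command is on the PATH and relies on its -L (local tests only) switch and its --cf switch for passing a configuration line (here use_bayes 0). It does not deal with user preferences or white/black lists, and the function names are invented for this example.

```python
import subprocess


def build_local_only_command():
    """Build an argv list for a local-rules-only, no-Bayes SA run.

    -L                  run local tests only (no DNS, RBL, DCC lookups)
    --cf=use_bayes 0    pass one config line that turns the Bayes rules off
    """
    return ["spamassassin", "-L", "--cf=use_bayes 0"]


def run_local_only(raw_message):
    """Pipe one message through spamassassin and return the marked-up copy.

    Requires spamassassin to be installed; raises CalledProcessError if
    the command exits with a non-zero status.
    """
    result = subprocess.run(
        build_local_only_command(),
        input=raw_message,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The output of run_local_only is the original message with SA's headers added, which is what a downstream classifier would then read.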
