Re: update on floating dividing score between spam and ham messages

Kai Schaetzl Tue, 12 Jul 2005 11:31:32 -0700

Joe Flowers wrote on Tue, 12 Jul 2005 11:55:36 -0400:

> > That's bad, really bad detection ...
>
> No. It's good, really good detection.


Sorry, I don't want to be rude by repeating myself, but if your average spam 
score is something like 6-something, the *detection* *is* bad. Maybe not the 
end result, but the pure spam detection. And that's also the reason why you had 
to try and find a method which lowers the threshold without giving you too 
many false positives. If your spam scored high enough, you simply wouldn't 
need to do that. That's, by the way, exactly what you said yourself:

> > But anything you can do that widens the 
> > typical score distribution between ham and spam is a good thing. 
>  
> Amen!!!!



> For lack of a better term in mind, I used "normalized". If the score of 
> a message is more than 30 points (or 25, I'm not going to waste time 
> looking back at the code) away from the nearest average, then I set the 
> score for the message back to 30 points away from the nearest average. 

Ok, I see you want to avoid peaks, makes sense.
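
For readers following along, the clamping described in the quote can be sketched roughly as below. This is only an illustration of the idea as worded above, not Joe's actual code: the function name, its parameters, and the default of 30 points are assumptions.

```python
def clamp_to_nearest_average(score, ham_avg, spam_avg, max_distance=30.0):
    """Pull an outlying score back to max_distance from the nearer average.

    Hypothetical helper: ham_avg and spam_avg are the running average
    scores of ham and spam; max_distance is the 30 (or 25) points
    mentioned in the quoted message.
    """
    # Pick whichever average the score is closer to.
    if abs(score - ham_avg) <= abs(score - spam_avg):
        nearest = ham_avg
    else:
        nearest = spam_avg

    offset = score - nearest
    if abs(offset) <= max_distance:
        return score  # close enough, leave untouched

    # Too far out: keep the direction, cap the distance.
    if offset > 0:
        return nearest + max_distance
    return nearest - max_distance
```

With made-up averages of 0 for ham and 6 for spam, a message scoring 50 would be set back to 36, and one scoring -40 would be set back to -30.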

> It sounds like you have put in a lot of time to become an expert in the 
> traditional wisdom of SA and to tune it accordingly.

Not more than others here. Not really too much time.

> And, I assume you 
> spend a lot of time keeping it tuned and dealing with SA upgrades.

Not at all. I carefully crafted a combination of my own rules plus SARE rules 
some time ago, trained it on a lot of spam and ham at first, and now just let 
it run. SARE updates are done automatically by rulesdujour. I haven't paid 
much attention to it for probably a year now. Just some upgrading to SA 
3.1.* recently and maybe choosing a different SARE ruleset here and there.

> I'm 
> glad you have that time.... But, my situation is different and I agree 
> with some of the critics of SA - that it requires or almost requires an 
> expert to tune it properly and to keep it tuned properly

You do need some time to understand how it all works together, but after that 
you don't need to give it much care anymore, really. Of course, you should 
stay up-to-date with releases and the rulesets you use, but that's not a daily 
chore at all.

>  
> And again, you are wrong. It is a very good setup (the proof is in the 
> pudding)

As I said earlier, if you look closer at the pudding, I'm sure that your false 
positive rate is much higher than ours. Do you have a proven figure for your FP 
rate?
To make it clear: I don't want to say that you have bad results from your 
setup. But I'm quite convinced that your FP rate could be much better if you 
tried to widen the gap between ham and spam by applying more rules that 
are able to classify spam and maybe by fine-tuning a few rule scores (e.g. if 
your BAYES_99 is reliable you should boost it to 3 or 4; it's overly low in the 
3.0 setups). Your peaks are only 8 score points away from each other. Ours are 
more than 20 points away from each other and the valley between them is really 
low. Which means even if I slide the threshold by one absolute score point, 
from 5 to 6 or down to 4, I won't get a much different detection rate because 
there are so few messages scoring in that range. I'm sure that many on this list 
would have similar results if they did that. But I suppose you can't do that 
because your gap is simply too small. What happens if you move your threshold 
one score point down or up?
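
One way to answer that question from a score log is to count how many messages change verdict when the threshold moves. The following is a minimal sketch, assuming a message counts as spam when its score is at or above the threshold; the function name and the sample scores are invented for illustration and are not data from either setup.

```python
def verdict_changes(scores, base_threshold, new_threshold):
    """Count messages whose spam/ham verdict flips when the threshold moves.

    Assumes a message is tagged as spam when score >= threshold, so the
    messages that flip are those scoring between the two thresholds.
    """
    low, high = sorted((base_threshold, new_threshold))
    return sum(1 for s in scores if low <= s < high)


# Made-up score lists, purely for illustration.
wide_gap = [-2.0, -1.5, 0.3, 1.0, 22.0, 25.5, 30.1]
narrow_gap = [0.5, 2.0, 4.2, 4.8, 5.3, 5.9, 6.4, 8.0]

for name, scores in (("wide gap", wide_gap), ("narrow gap", narrow_gap)):
    up = verdict_changes(scores, 5, 6)
    down = verdict_changes(scores, 5, 4)
    print(f"{name}: {up} flip moving 5 -> 6, {down} flip moving 5 -> 4")
```

If the valley between the peaks is nearly empty, both counts stay close to zero and the exact threshold hardly matters; with a narrow gap, a one-point move flips a noticeable share of the mail.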



Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
