John Rudd wrote:

I wouldn't do that.

Please note that I said it the short way; of course I don't jump straight to disabling rules. I first check whether the message could reasonably have been flagged as spam (a "reasonable" FP). If so, that's life. If possible, I see whether I can create a rule that gets it hammed without breaking the whole filter. If the tests that made it classify as spam don't make sense to me, I check whether I can lower some of them, but some tests just get disabled.
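For example (rule and domain names here are purely illustrative), a narrow compensating rule in local.cf usually does the job without touching the stock scores:

    # hypothetical example: a known-good sender whose mail keeps tripping
    # a couple of stock rules; give it back a few points instead of
    # changing the stock rules themselves
    header   LOCAL_GOODSENDER  From:addr =~ /\@newsletter\.example\.com$/i
    describe LOCAL_GOODSENDER  Known-good sender that trips some stock rules
    score    LOCAL_GOODSENDER  -2.0

whitelist_from_rcvd is another option when the sending relay is stable.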




Just because legitimate mail triggers some rule doesn't mean that the rule is flawed. Using your example, triggering "no_real_name" does not mean that the message is spam, it means that the message has _some_ similarity to at least some spam messages (the higher the score, the stronger the similarity). And, that's absolutely true: statistically, when looking at the corpus which was used to create the rules database, a higher percentage of "no_real_name" messages were spam.

As I already said in another thread, the statistical results depend on the attributes you are checking. The perceptron will not wake up and say "hey, come on, this attribute is no good". So if you run a mass-check with rules like:
- IP parity
- first letter of sender
- mailer (e.g. "The Bat!")
- relay = comcast, free.fr, ...
...

then the perceptron will give you what you asked for: scores.
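To make the point concrete, here is the kind of throwaway test I mean (rule names made up); run a mass-check over it and the perceptron will dutifully hand back a score whether or not the attribute means anything:

    # deliberately meaningless tests: mass-check will still produce hit
    # statistics for them, and the perceptron will still score them
    header   L_FIRST_LETTER_A  From:addr =~ /^a/i
    describe L_FIRST_LETTER_A  Sender address starts with the letter A

    header   L_MAILER_THEBAT   X-Mailer =~ /\bThe Bat!/i
    describe L_MAILER_THEBAT   Message was sent with The Bat!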

I also understand that people in the US may see fewer encoded subjects, but at least in .fr we get them all the time (because of our accented letters, and because many companies still use software that predates MIME). And if I find a legitimate IP in a DNSBL used by SA, then I just stop using that DNSBL.
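In practice that is a one-line change in local.cf: zero out the rule that corresponds to the list. RCVD_IN_SORBS_DUL is only an example; use whichever rule maps to the list that is causing trouble:

    # stop scoring on a DNSBL whose listings I don't trust for my traffic
    score RCVD_IN_SORBS_DUL 0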


Now, if legit messages were not just triggering those rules, but also triggering enough rules to be flagged as spam ... then I would lower the value of those rules, but not disable those rules.

I disable the rules, and if I get false negatives, I see what I can do. So far, the (very few) spam messages that slipped through would have been missed anyway.
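Either way it is a one-line change in local.cf; SUBJ_ILLEGAL_CHARS is just an example of a rule that fires constantly on perfectly normal accented subjects here:

    # lowering the score (John's approach) ...
    score SUBJ_ILLEGAL_CHARS 1.0
    # ... versus disabling the rule outright (mine); use one or the other
    score SUBJ_ILLEGAL_CHARS 0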

But I would only do that if I could see that a large percentage of should-be-ham messages were being flagged as spam by that rule AND that the rule wasn't being useful in flagging spam messages. The reason is: if the message is being flagged, but it shouldn't have been, then perhaps my "corpus" of messages differs significantly enough from the SA internal corpus that my score values need to be different. But that doesn't mean that the rules are so disjoint from tracking spam that they should be entirely disabled. They just don't have the same weighting that my corpus needs.

If, instead, most messages passing through my mail servers that triggered that rule really did seem to be spam, then I wouldn't alter the score at all. I would just pass the should-have-been-ham message into my Bayesian learner and hope that a low Bayes score for messages like that would offset the rules that had flagged it as spam.
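In practice that amounts to feeding the misflagged message back to the Bayes database, for example (the path is of course illustrative):

    # teach Bayes that this message was ham, so similar mail gets a
    # lower BAYES_* score next time
    sa-learn --ham /path/to/misflagged-message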


Everybody has their own situation. I am very FP-sensitive: I would rather receive spam than lose an important mail. After all, I do review my spam, so the fewer FPs there are, the faster I can review my junk folder.
