John Rudd wrote:
I wouldn't do that.
Please note that I said it the short way: of course I don't jump straight to
disabling rules. I first check whether the message should have been flagged
as spam (a "reasonable" FP); if so, that's life. If possible, I see whether
I can create a rule to make it classify as ham without breaking the whole
filter. If, however, the tests that made it classify as spam are not
clear to me, then I check whether I can lower some of their scores. But
some tests just get disabled.
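As a sketch of what "lowering some" versus disabling looks like in practice (assuming a stock local.cf; the rule names below are examples, and the "trusted domain" rule is entirely hypothetical):

```
# local.cf -- example adjustments
# Lower the weight of a test that fires too often on my ham:
score NO_REAL_NAME 0.2

# Disable a test entirely (a score of 0 turns a rule off):
score SOME_NOISY_RULE 0

# A compensating "ham" rule: a negative score for mail from a
# known-good local domain that keeps tripping other tests
# (hypothetical rule name and domain):
header   LOCAL_GOOD_SENDER From =~ /\@example\.fr>?\s*$/i
score    LOCAL_GOOD_SENDER -2.0
describe LOCAL_GOOD_SENDER Mail from a trusted local domain
```

The negative-scoring rule is the "make it get hammed" option: it offsets the other tests for that sender without touching their scores globally.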
Just because legitimate mail triggers some rule doesn't mean that the
rule is flawed. Using your example, triggering "no_real_name" does not
mean that the message is spam, it means that the message has _some_
similarity to at least some spam messages (the higher the score, the
stronger the similarity). And, that's absolutely true: statistically,
when looking at the corpus which was used to create the rules database,
a higher percentage of "no_real_name" messages were spam.
As I already said in another thread, the statistical results depend on
the attributes you are checking. The perceptron will not wake up and say
"hey, come on, this attribute is no good". So, if you run a mass-check
with rules like:
- IP parity
- first letter of sender
- mailer: "The Bat!", for instance
- relay = comcast, free.fr, ...
...
then the perceptron will give you what you asked for: scores.
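To make the point concrete, here is what such semantically empty attributes could look like as SpamAssassin rules (hypothetical rule names; a mass-check plus the perceptron would dutifully assign each of them a score):

```
# Hypothetical attribute rules -- scorable, but meaningless as spam signs
header   L_FIRST_LETTER_A From =~ /\ba[^\s@]*\@/i
describe L_FIRST_LETTER_A Sender localpart starts with the letter "a"

header   L_MAILER_THEBAT  X-Mailer =~ /\bThe Bat!/
describe L_MAILER_THEBAT  Message sent with The Bat! mail client

header   L_RELAY_FREE_FR  Received =~ /\bfree\.fr\b/
describe L_RELAY_FREE_FR  Message relayed through free.fr
```

Each of these will correlate with spam or ham to some degree in any given corpus, so each will get a non-zero weight; that says nothing about whether the attribute is a sensible test.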
I also understand that US guys may see fewer encoded subjects, but at
least in .fr we get them all the time (because of our accented letters,
and because many companies still use software that predates MIME). And
if I find a legitimate IP in a DNSBL used by SA, then I just remove that
DNSBL.
Now, if legit messages were not just triggering those rules, but also
triggering enough rules to be flagged as spam ... then I would lower the
scores of those rules, but not disable them.
I disable the rules, and if I get false negatives, I see what I can do.
So far, the (very few) missed spams would have been missed anyway.
But I would only do
that if I could see that there was a large percentage of should-be-ham
messages being flagged as spam by that rule AND that rule wasn't being
useful in flagging spam messages. The reason is: if the message is
being flagged, but it shouldn't have been, then perhaps my "corpus" of
messages differs significantly enough from the SA internal corpus that
my score values need to be different. But that doesn't mean that the
rules are so disjoint from tracking spam that they should be entirely
disabled. They just don't have the same weighting that my corpus needs.
If, instead, most messages passing through my mail servers that
triggered that rule really did seem to be spam, then I wouldn't alter
the score at all. I would just pass the should-have-been-ham message
into my Bayesian learner and hope that a low Bayes score for messages
like that would offset the rules that had flagged it as spam.
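Feeding such a message back as ham is a one-liner with SpamAssassin's own training tool (the file path here is just a placeholder):

```
# Teach the Bayes database that this particular message is ham, so
# similar future messages get a low, negative-scoring BAYES_* hit:
sa-learn --ham /path/to/should-have-been-ham.eml
```

This leaves the rule scores alone and lets the Bayes subsystem pull the total back under the spam threshold for that class of mail.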
Everybody has their own situation. I am very FP-sensitive: I would
rather get spam than lose an important mail. After all, I do review my
spam, so the fewer FPs there are, the faster I can review my junk
folder.