Theo Van Dinter wrote:
On Tue, Sep 19, 2006 at 10:58:46PM +0200, mouss wrote:
URI_NOVOWEL fires with things like href="#id" where id is a string that
starts with 7 "no-vowel" chars.
uri URI_NOVOWEL m%^https?://[^/?]*[bcdfghjklmnpqrstvwxz]{7}%i
uri URI_NOVOWEL m%^https?://[^/?\#]*[bcdfghjklmnpqrstvwxz]{7}%i
is this correct?
That depends on your definition of "correct". The RE looks ok, but the
hitrate could change dramatically. It's hard to say without testing.
my understanding is that the rule looks for "dummy" hostnames in the
server part. unfortunately, the way URIs are "exposed" by SA, this rule
also applies to anything that resembles a URI. This is a problem with
relative URIs (aka href="foo.html" if foo matches the rule). [In the
past, I have reported problems with things like ldap strings, ... that
were interpreted as URIs by SA and caught by some rules].
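
A minimal sketch of the difference between the two versions of the rule,
using Python's re module in place of SA's Perl regexes (the pattern is
simple enough that both should behave the same). The sample URIs are
made up, and the "http://#..." form is only an assumption about how SA
might expose href="#id"; the \# escape needed in the SA config file is
not needed here.

```python
import re

# Rule as it stands: the host part may contain anything except "/" or "?".
ORIGINAL = re.compile(r"^https?://[^/?]*[bcdfghjklmnpqrstvwxz]{7}", re.I)

# Proposed rule: "#" also ends the host part, so a fragment cannot match.
PROPOSED = re.compile(r"^https?://[^/?#]*[bcdfghjklmnpqrstvwxz]{7}", re.I)


def fires(rule, uri):
    """Return True if the rule matches the URI string."""
    return rule.search(uri) is not None


# Hypothetical URI strings; ASSUMPTION: a fragment-only href shows up
# with an http:// prefix by the time the rule sees it.
SAMPLES = [
    "http://xkcdqwrtzp.example.com/",   # no-vowel run in the hostname
    "http://#bcdfghjk",                 # fragment only
    "http://example.com#bcdfghjk",      # fragment directly after the host
    "http://www.example.com/#bcdfghjk", # fragment after a path
]

for uri in SAMPLES:
    print(uri, "original:", fires(ORIGINAL, uri), "proposed:", fires(PROPOSED, uri))
```

Under these assumptions the proposed rule still hits the no-vowel
hostname but no longer fires on the fragment cases, while a fragment
that follows a "/" is ignored by both versions.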
in the present case, the FP occurred for a "silly" newsletter that I whitelisted
(they trigger other rules, but I am not the recipient; otherwise I'd
block 'em at smtp time). so whether this is a real FP or not is debatable.
however, my understanding of the rule is that it was not designed to
catch such relative URIs. If so, then it should be fixed. thus my question.
In other words, should we "fix" the rule because it catches things it was
not designed to catch, or should we be happy that it detects spam it was
not supposed to catch? This is a general question of course.
I personally tend to believe that when Bayes is used, "logical" rules
should only catch what they were supposed to catch. and I do use Bayes
(I disabled Bayes for two months to see the effect, and although this
was done on a single installation, the results showed that Bayes is very
helpful).