John D. Hardin wrote:
On Thu, 21 Sep 2006, mouss wrote:
Theo Van Dinter wrote:
On Tue, Sep 19, 2006 at 10:58:46PM +0200, mouss wrote:
URI_NOVOWEL fires with things like href="#id" where id is a string that
starts with 7 "no-vowel" chars.
uri URI_NOVOWEL m%^https?://[^/?]*[bcdfghjklmnpqrstvwxz]{7}%i
uri URI_NOVOWEL m%^https?://[^/?\#]*[bcdfghjklmnpqrstvwxz]{7}%i
Is this correct?
That depends on your definition of "correct". The RE looks ok, but the
hitrate could change dramatically. It's hard to say without testing.
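One quick way to see what the extra \# changes before running a real mass-check is to try both patterns on a few strings. Below is a minimal sketch in Python rather than Perl. The sample URIs are made up for illustration, and the names ORIGINAL, PROPOSED and hits are hypothetical. It says nothing about hit rates on real mail.

```python
import re

# Consonant class copied from the rule (note that "y" is not in it).
CONS = "[bcdfghjklmnpqrstvwxz]{7}"

# Python equivalents of the two Perl patterns; re.I mirrors the /i flag.
ORIGINAL = re.compile(r"^https?://[^/?]*" + CONS, re.I)
PROPOSED = re.compile(r"^https?://[^/?#]*" + CONS, re.I)

# Hypothetical test URIs, not taken from real mail.
SAMPLES = [
    "http://xkcdfghjkl.example/",       # consonant run in the host
    "http://www.example.com#bcdfghjk",  # consonant run only in the fragment
    "http://www.example.com/bcdfghjk",  # consonant run only in the path
]

def hits(uri):
    """Return (original_matches, proposed_matches) for one URI."""
    return bool(ORIGINAL.search(uri)), bool(PROPOSED.search(uri))

for uri in SAMPLES:
    print(uri, hits(uri))
```

On these samples the two patterns differ only for the fragment case: the original pattern lets [^/?]* run across the "#", while the proposed one stops there.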
My understanding is that the rule looks for "dummy" hostnames in the
server part. Unfortunately, because of the way URIs are "exposed" by SA,
this rule also applies to anything that resembles a URI. This is a problem
with relative URIs (e.g. href="foo.html", if foo matches the rule).
Erm. How can it match relative and "#gibberish" URIs at all if the RE
is explicitly anchored to "https?://" at the start of the URI?
This is what I meant by "exposed". The URI module "converts" things to
URI format even though they do not appear in that form in the message.
The goal is to catch things like "www.spammer.example" without multiplying
the number of regexes. Of course, rawbody won't catch that.
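To make the "converts" point concrete, here is a toy model in Python. It is not SpamAssassin's actual code, and the real canonicalization is more involved. The function toy_canonify and its "prepend http:// when there is no scheme" behaviour are assumptions made purely for illustration. Under that assumption, an anchored rule can still see a bare fragment.

```python
import re

# Python equivalents of the rule before and after adding "#" to the class.
NOVOWEL = re.compile(r"^https?://[^/?]*[bcdfghjklmnpqrstvwxz]{7}", re.I)
NOVOWEL_FIXED = re.compile(r"^https?://[^/?#]*[bcdfghjklmnpqrstvwxz]{7}", re.I)

def toy_canonify(raw):
    """Toy model only (an assumption, not SA's logic):
    prepend http:// when the string has no scheme of its own."""
    if re.match(r"^[a-z][a-z0-9+.-]*:", raw, re.I):
        return raw
    return "http://" + raw

# Hypothetical inputs: a schemeless hostname and a bare fragment href.
for raw in ["www.spammer.example", "#bcdfghjk"]:
    uri = toy_canonify(raw)
    print(raw, "->", uri,
          bool(NOVOWEL.search(uri)), bool(NOVOWEL_FIXED.search(uri)))
```

In this model the schemeless hostname gains a scheme, so anchored rules can apply to it, and the fragment becomes something the original pattern matches but the \# version does not.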