On Fri, 2014-10-24 at 19:05 -0700, John Hardin wrote:
> On Fri, 24 Oct 2014, John Hardin wrote:
> 
> > On Sat, 25 Oct 2014, Martin Gregorie wrote:
> >
> > >  Less obviously, it doesn't seem to matter whether you write the rule
> > >  as /\.link\b/  or /\.link$/ - both give identical matches. Both match
> > >  the following URIs just as you'd expect:
> > >    http://www.linkedin.com/home/user/data.link
> > >    http://www.example.link
> > >
> > >  but, less obviously, both also match this:
> > >    http://www.example.link/path/to/file.txt
> >
> > {boggle}
> >
> > >  ...but
> > >    "grep -P '\.link\b'" matches it, but
> > >    "grep -P '\.link$'"  does not.
> > >
> > >  I presume that this means that the uri rule tests against two strings:
> > >  one being just the domain name and the other being the whole URI and
> > >  declares a rule hit if either string matches.

Basically correct. SA uri rules are not only tested against the raw URI
as extracted from the message, but also some normalized variations.
Without going into details, and off the top of my head, this includes
un-escaping, adding a protocol prefix (if missing), and path stripping.

  $ echo -e "\n apache.org/path/" |
  ./spamassassin -D -L --cf="uri URI_DOMAIN /^http:\/\/[^\/]+$/"

  dbg: rules: ran uri rule URI_DOMAIN ======> got hit: "http://apache.org";

Note that the regex only matches a "domain only" string: the
anything-but-slash [^/]+ run has to extend to the end of the string, so
a URI that still carries its path cannot match. Also note that the URI
in the input message lacks a protocol, yet the rule hit shows the
(default) protocol that SA added in one of the variations.
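
To make the effect on the \.link$ rule concrete, here is a rough Python
sketch of the idea. It is not SA's actual code (SA is Perl, and its real
list of variations may well differ); the helper names uri_variants and
rule_hits and the "http://" default are assumptions for illustration only.

```python
import re
from urllib.parse import unquote

SCHEME = re.compile(r'^[a-z][a-z0-9+.-]*://', re.I)

def uri_variants(raw):
    """Return illustrative variants of one extracted URI.

    Hypothetical helper: raw form, un-escaped form, form with a default
    protocol added, and form with the path stripped.
    """
    variants = [raw]

    def add(v):
        if v not in variants:
            variants.append(v)

    unescaped = unquote(raw)          # un-escape %xx sequences
    add(unescaped)

    # Add a default protocol if the URI has none (assumed to be http://).
    with_proto = unescaped if SCHEME.match(unescaped) else 'http://' + unescaped
    add(with_proto)

    # Strip the path: keep only scheme://host.
    m = re.match(r'^([a-z][a-z0-9+.-]*://[^/]+)', with_proto, re.I)
    if m:
        add(m.group(1))
    return variants

def rule_hits(pattern, raw):
    """Return the variants of `raw` that the regex matches."""
    rx = re.compile(pattern)
    return [v for v in uri_variants(raw) if rx.search(v)]

if __name__ == '__main__':
    uri = 'http://www.example.link/path/to/file.txt'
    for pat in (r'\.link\b', r'\.link$'):
        print(pat, rule_hits(pat, uri))
```

Under these assumptions, \.link$ misses the full URI but still hits the
path-stripped variant, which would explain why the uri rule fires where
grep on the raw string does not.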


> > I don't think so, but I'm not positive.
> >
> > If you have a testing environment set up, try adding this and see what you 
> > get in the log:
> >
> >    uri    __ALL_URI  /.*/
> 
> oops. This too:
> 
>       tflags __ALL_URI  multiple
> 
> Sorry for forgetting that bit, it's rather important. :)

That seemingly straightforward approach does not work in this case. The
tflags multiple option does not make uri rules match multiple times on a
single URI extracted from the message. The rule still generates at most
one hit per extracted URI; it does not hit again on that URI's
normalized variations.

The tflags multiple option on a uri rule enables it to match multiple
times on different URIs extracted from the message.
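
As a rough model of that counting behaviour, consider the Python sketch
below. It only mirrors the description above and is not taken from SA's
source; count_hits and the hand-written variant lists are hypothetical.

```python
import re

def count_hits(pattern, extracted, multiple=False):
    """Count rule hits over a message's URIs.

    `extracted` is a list with one entry per URI found in the message;
    each entry is the list of variants (raw, path-stripped, ...) of
    that URI. Hypothetical model, not SA's implementation.
    """
    rx = re.compile(pattern)
    hits = 0
    for variants in extracted:
        # One URI contributes at most one hit, however many of its
        # variants match.
        if any(rx.search(v) for v in variants):
            hits += 1
            if not multiple:
                break   # without "multiple", the rule hits only once
    return hits

if __name__ == '__main__':
    message_uris = [
        ['http://www.example.link/path/to/file.txt', 'http://www.example.link'],
        ['http://apache.org/path/', 'http://apache.org'],
    ]
    print(count_hits(r'.*', message_uris, multiple=True))
    print(count_hits(r'\.link\b', message_uris, multiple=True))
```

In this model a catch-all /.*/ rule with tflags multiple reports one hit
per extracted URI, so the normalized variations never show up as
separate hits in the log.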


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
