On Fri, 2014-10-24 at 19:05 -0700, John Hardin wrote:
> On Fri, 24 Oct 2014, John Hardin wrote:
>
> > On Sat, 25 Oct 2014, Martin Gregorie wrote:
> >
> > > Less obviously, it doesn't seem to matter whether you write the rule
> > > as /\.link\b/ or /\.link$/ - both give identical matches. Both match
> > > the following regexes just as you'd expect:
> > >   http://www.linkedin.com/home/user/data.link
> > >   http://www.example.link
> > >
> > > but, less obviously, both also match this:
> > >   http://www.example.link/path/to/file.txt
> >
> > {boggle}
> >
> > > ...but
> > >   "grep -P '\.link\b'" matches it, but
> > >   "grep -P '\.link$'" does not.
> > >
> > > I presume that this means that the uri rule tests against two strings:
> > > one being just the domain name and the other being the whole URI and
> > > declares a rule hit if either string matches.
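
To make the quoted presumption concrete, here is a rough Python sketch of a
rule being tried against several forms of one URI. It is a toy model only,
not SpamAssassin's code: the helper names uri_variations and uri_rule_hits
are made up for illustration, and the set of variations (un-escaped form,
default http:// prefix, path stripped) is an assumption about what gets
tested.

```python
import re
from urllib.parse import unquote


def uri_variations(raw):
    """Return the raw URI plus a few normalized forms (toy model, not SA code)."""
    forms = [raw]

    # Variation: percent-escapes decoded.
    unescaped = unquote(raw)
    forms.append(unescaped)

    # Variation: assumed default protocol added when none is present.
    with_proto = unescaped if "://" in unescaped else "http://" + unescaped
    forms.append(with_proto)

    # Variation: path stripped, leaving only scheme://host.
    m = re.match(r"^([A-Za-z][A-Za-z0-9+.-]*://[^/]+)", with_proto)
    if m:
        forms.append(m.group(1))

    # Drop duplicates while keeping the order in which forms were produced.
    unique = []
    for form in forms:
        if form not in unique:
            unique.append(form)
    return unique


def uri_rule_hits(pattern, raw):
    """Return every variation of `raw` that the regex `pattern` matches."""
    rx = re.compile(pattern)
    return [form for form in uri_variations(raw) if rx.search(form)]


if __name__ == "__main__":
    uri = "http://www.example.link/path/to/file.txt"
    print(uri_rule_hits(r"\.link$", uri))    # only the path-stripped form can match
    print(uri_rule_hits(r"\.link\b", uri))   # raw and path-stripped forms both match
```

Under this model a rule counts as hit when the returned list is non-empty,
which would explain why /\.link$/ hits a URI with a path in a uri rule while
grep, which only sees the one raw string, does not.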

Basically correct. SA uri rules are not only tested against the raw URI
as extracted from the message, but also against some normalized
variations. Without going into details, off the top of my head this
includes un-escaping, adding a protocol prefix (if missing) and path
stripping.

  $ echo -e "\n apache.org/path/" | ./spamassassin -D -L --cf="uri URI_DOMAIN /^http:\/\/[^\/]+$/"
  dbg: rules: ran uri rule URI_DOMAIN ======> got hit: "http://apache.org"

Note the regex matching a "domain only" anything-but-slash [^/]+
substring anchored at the end of the string. Also note that the input
message's URI lacks a protocol, but the rule hit shows the (default)
protocol added by SA in one variation.

> > I don't think so, but I'm not positive.
> >
> > If you have a testing environment set up, try adding this and see what you
> > get in the log:
> >
> >   uri __ALL_URI /.*/
>
> oops. This too:
>
>   tflags __ALL_URI multiple
>
> Sorry for forgetting that bit, it's rather important. :)

That seemingly straightforward approach does not work in this case. The
tflags multiple option does not make uri rules match multiple times on a
single URI extracted from the message. It still generates only a single
hit per extracted URI, and does not add further hits for that URI's
normalized variations.

What the tflags multiple option does on a uri rule is enable it to match
multiple times on different URIs extracted from the message.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}