On Sun, 2011-10-23 at 18:12 -0500, Dave Funk wrote:
> On Sun, 23 Oct 2011, Karsten Bräckelmann wrote:

> > > header FROM_ENLARG              From: =~
> >                                      ^
> > Drop the colon, the header name is a plain "From".
> >
> > > /(\bsex\b|\bfree\b|\btrial\b|\benlarge.*|\bpils|sample.*)/i
> >                                        ^^              ^^
> > These are unnecessary at the end of the match.
> >
> > You should carefully watch the anchors for your RE, in your case the \b
> > word boundaries. The above matches a "sample" string anywhere, even
> > embedded inside a name -- or the email address for that matter.
> >
> > If you want to match the "real name" part only, the :name modifier comes
> > in handy. Note that the modifier is delimited by a leading colon. The
> > colon (as per above) is not part of the "From" header name.
> >
> >  header FOO  From:name =~ /\b(sex|free|trial|enlarge)\b/i
> >
> > Here I also moved the \b word boundaries outside the alternation, so you
> > cannot forget it when adding more words. ;)
> 
> Karsten's example is a clear win efficiency-wise over Jakub's, but it's
> also more restrictive. Because of the \b bounding on the outside,
> Karsten's rule will match "From: enlarge now <b...@ha.ya>" but not
> "From: enlargement now <b...@ha.ya>".

Good point, Dave. Being more restrictive was intended, but the variants
are a very valid point -- and probably the reason for that trailing /.*/
in the first place.


> That can be achieved by adding trailing character matches on those
> words that you want to be 'extendable'. EG:
> 
>    header FOO  From:name =~ /\b(sex|free|trial|enlarge\w{0,5})\b/i
> 
> This will match on "enlarge", "enlargement", "enlarged", etc. and still
> keep the efficiency.
> Note that by using the 'word' match meta-character ('\w') rather than
> the generic wild-card match character ('.') you limit back-tracking of the
> pattern-match engine (as well as putting a fixed upper bound on the match).
> 
> This tactic does need to be used with caution to avoid FPs. The greater
> the usage of non-fixed pattern matches, the larger the group of matched
> strings and thus the greater the possibility of FPs.

Agreed, matching on words like this always needs caution. I was in a
hurry, and initially wanted to point out the issue with the colon only,
to make the header rule work at all...

Being cautious is pretty much the opposite of scoring that beast a
whopping 5 points, no matter how restrictive and isolated (specific
header only) it is.

There are quite a few ways to improve this. One is a non-scoring sub-rule
with tflags multiple, scoring 1, 2 or 3+ occurrences accordingly. Another
is to meta it with that empty Return-Path as shown in the sample. I
thought about that, briefly, but then again I was in a hurry and figured
explaining all this might be a little heavy on the OP anyway.
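
A minimal sketch of the sub-rule approach described above, not tested
against a live setup. The rule names, the word list, the scores and the
Return-Path pattern are all illustrative assumptions. The sample message
is not quoted here, so the exact form of its empty Return-Path is a
guess.

```
# Non-scoring sub-rule (leading double underscore). With tflags multiple
# every hit is counted instead of stopping at the first match.
header  __FROM_SPAMWORD   From:name =~ /\b(?:sex|free|trial|enlarge\w{0,5})\b/i
tflags  __FROM_SPAMWORD   multiple

# Tiered scoring by hit count. The scores are placeholders only.
meta    FROM_SPAMWORD_1   (__FROM_SPAMWORD == 1)
score   FROM_SPAMWORD_1   1.0
meta    FROM_SPAMWORD_2   (__FROM_SPAMWORD == 2)
score   FROM_SPAMWORD_2   2.0
meta    FROM_SPAMWORD_3   (__FROM_SPAMWORD >= 3)
score   FROM_SPAMWORD_3   3.0

# Alternative: only fire together with an empty Return-Path.
# The pattern below is an assumption about what the sample looks like.
header  __RP_EMPTY        Return-Path =~ /^\s*<>\s*$/
meta    FROM_SPAMWORD_RP  (__FROM_SPAMWORD && __RP_EMPTY)
score   FROM_SPAMWORD_RP  2.0
```

The tiers are mutually exclusive, so at most one of them scores per
message. The Return-Path meta would add on top of whichever tier hits,
so in practice one would probably pick one variant or the other.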

As a quick band-aid to make the rule less prone to FPs, lowering this
rule's score to, say, 2 or 3 might already help a lot -- depending on
the other rules hit and what these usually score.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
