On Wed, 2014-05-28 at 14:16 -0400, Alex wrote:
> I'm trying to write a body rule that will catch an email exactly
> containing any number of characters up to 15, followed by a URI,
> followed by any number of characters, up to 15. My attempt has failed
> miserably, and hoped someone could help.
>
> body LOC_SHORT_BODY_URI m{^.{0,15}(https?://.{1,50}).{0,15}$}
>
> This catches pretty much everything and I can't figure out why.

Oh, come on, Alex. We've had that topic just recently in your "Help with
short bodys with URLs" thread. Which wasn't the first time either...

The "body" is all textual parts, rendered and normalized. Consecutive
whitespace is condensed to a single space. An empty line (double newline)
delimits paragraphs. The Subject becomes the first paragraph of the body.

The regex pattern is matched against the "body" one paragraph at a time.
A body rule with beginning and end anchors /^ $/ as you posted matches
complete paragraphs. Not the full body.

> Any help on how to do this more efficiently and effectively would be
> greatly appreciated.

First, you will need to use a rawbody rule.

  rawbody __SHORT_BODY_URI  m~^.{0,15} https?://[^ ]+ .{0,15}$~

Entirely untested, and favoring simplicity and readability over
correctness. (The simple spaces should better be /\s?/ any whitespace,
optional.)

Note the regex pattern for the URI. Unlike you, I don't limit its length,
but simply let it consume everything up until a natural URI end:
whitespace. That way I also ensure the trailing string is no longer than
15 chars. Your regex doesn't distinguish between the URI and the trailing
part, effectively allowing a much longer string after a short URI.

Since rawbody rules are matched against chunks of 1-2 kByte, we also need
to make sure there is no other chunk than the one matching
__SHORT_BODY_URI. Since you already have a rule identifying short
messages <= 200 chars, we can simply reuse it here.

  rawbody __RB_GT_200  /^.{201}/s
  meta    __RB_LE_200  !__RB_GT_200

Another approach would be to actually ensure there is only a single
chunk. And finally, meta them together.

  rawbody __CHUNK  /^./
  tflags  __CHUNK  multiple

  meta SHORT_BODY_URI  __SHORT_BODY_URI && (__CHUNK == 1)

That all said, the rule you are currently trying to write pretty much
sounds like the "has URI and short body" LOC_SHORT rule we discussed back
in Oct 2013...
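
To see both effects outside of SpamAssassin, here is a rough Python sketch.
It is not how SpamAssassin renders a message, only a simplified model of the
paragraph splitting described above. The helper names (rendered_paragraphs,
paragraph_hits) and the sample message are made up for illustration, and
SUGGESTED is the rawbody pattern with /\s?/ swapped in for the plain spaces.

```python
import re

# The pattern from the question: ".{1,50}" also matches spaces, so the "URI"
# part can swallow trailing text.
ORIGINAL = re.compile(r"^.{0,15}(https?://.{1,50}).{0,15}$")

# The pattern suggested in the reply, with \s? in place of the plain spaces.
# \S+ ends the URI at the first whitespace character.
SUGGESTED = re.compile(r"^.{0,15}\s?https?://\S+\s?.{0,15}$")


def rendered_paragraphs(subject, text):
    """Rough model of the "body": Subject first, blank lines split paragraphs,
    consecutive whitespace condensed to a single space."""
    parts = [subject] + re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def paragraph_hits(pattern, subject, text):
    """Paragraphs matched by an anchored pattern, tested one at a time."""
    return [p for p in rendered_paragraphs(subject, text) if pattern.search(p)]


# Hypothetical long message that happens to contain one short paragraph.
long_message = (
    "Dear customer,\n\n"
    + "This is a long and perfectly ordinary paragraph of text. " * 5
    + "\n\nSee http://example.com/ now\n\nKind regards"
)

# One short paragraph is enough for the anchored pattern to match.
print(paragraph_hits(ORIGINAL, "Your invoice", long_message))

# A short URI followed by 42 characters of trailing text.
line = "http://a.example/x this trailing text is far too long to pass"
print(bool(ORIGINAL.search(line)), bool(SUGGESTED.search(line)))
```

The first print shows only the short "See http://example.com/ now" paragraph,
even though the message as a whole is long. The second prints True False:
the original pattern accepts the long trailer because .{1,50} eats into it,
while the \S+ variant stops the URI at the first whitespace.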

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}