On Wed, 2014-05-28 at 21:55 -0400, Alex wrote:
> On Wed, May 28, 2014 at 5:36 PM, Karsten Bräckelmann wrote:

> > Oh, come on, Alex. We've had that topic just recently in your "Help with
> > short bodys with URLs" thread. Which wasn't the first time either...
>
> I know, I know. I actually started with that, and it became too
> complex for me to modify/update when a particular false-negative came
> in with enough preceding HTML to cause our rule to fail. So, I thought
> trying to write a simple "body" rule be easy enough, with the help of
> the team here.

That's the trade-off. Rawbody including raw (sic) HTML markup, or
rendered body, but split at paragraphs.

> > The "body" are all textual parts, rendered and normalized. Consecutive
> > whitespace is condensed to a single space. An empty line (double
> > newline) delimits paragraphs. The Subject becomes the first paragraph of
> > the body.
> >
> > The regex pattern is matched against the "body" one paragraph at a time.
> >
> > A body rule with beginning and end anchors /^ $/ as you posted matches
> > complete paragraphs. Not the full body.
>
> I don't think I realized multiple buffers weren't considered
> simultaneously.

I don't get "buffer", nor "simultaneously", in this context.

For body rules, all textual MIME-parts are rendered, then split up into
an array of strings, each consisting of one paragraph. The regex pattern
is matched against these strings (the "buffer" you referred to?) in the
array. It is matched against all ("simultaneously"?) of them (with
tflags multiple, or in case there is no hit), or else only until the
first match is found.

> > First, you will need to use a rawbody rule.
> >
> >   rawbody __SHORT_BODY_URI  m~^.{,15} https?://[^ ]+ .{,15}$~
> >
> > Entirely untested, and favoring simplicity and readability over
> > correctness. (The simple spaces should better be /\s?/ any whitespace,
> > optional.)
>
> I tested this briefly on my sample, and it doesn't match because
> __CHUNK hits twice. The HTML section is larger than fifteen chars
> before and fifteen chars after, just as the LOC_SHORT doesn't match
> for the rawbody being larger than 200 chars.

Indeed, each MIME-part is split up into chunks separately. Thus, with a
text/plain and text/html part each below 1k, __CHUNK will equal 2.

Working around that can get complex, and at least 3 different ways to do
that just popped up in my head. Too much to describe in detail, or even
get straight right now. Plus, ultimately it depends heavily on the
samples.

> Is it possible to only match on text/plain instead of text/html?

No.

Well, yes, with a really bad-ass full pattern rule (don't even think
about that), or a custom plugin.

> > That all said, the rule you are currently trying to write pretty much
> > sounds like the "has URI and short body" LOC_SHORT rule we discussed
> > back in Oct 2013...
>
> So this doesn't match just as the LOC_SHORT rule doesn't match.

It is way too late tonight, but if I get my hands on a meaningful
sample, I might enjoy writing some rules tomorrow... ;)


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
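
P.S. To make the paragraph-wise matching described above concrete, here
is a rough Python model of it. This is only an illustrative sketch, not
SpamAssassin's actual code: the function names split_paragraphs and
count_hits, the sample message and the "multiple" flag are all made up
for the example, and the pattern is the rawbody one from above with
explicit {0,15} bounds and optional spaces. The flag also only counts
one hit per paragraph, which is a simplification of tflags multiple.

```python
import re


def split_paragraphs(rendered_body: str) -> list[str]:
    """Split rendered text at empty lines, condensing whitespace runs."""
    parts = re.split(r"\n\s*\n", rendered_body.strip())
    return [re.sub(r"\s+", " ", p).strip() for p in parts if p.strip()]


def count_hits(pattern: str, paragraphs: list[str], multiple: bool = False) -> int:
    """Try the pattern on each paragraph in turn.

    Stops at the first hit unless `multiple` is set (a simplified,
    hypothetical stand-in for 'tflags multiple').
    """
    hits = 0
    for paragraph in paragraphs:
        if re.search(pattern, paragraph):
            hits += 1
            if not multiple:
                break
    return hits


# Hypothetical sample: the Subject becomes the first paragraph.
subject = "Check this out"
text = "Hi there,\n\nhttp://example.com/offer\n\nBye"
paragraphs = split_paragraphs(subject + "\n\n" + text)

# Anchored pattern: at most 15 chars before and after a single URL.
anchored = r"^.{0,15} ?https?://[^ ]+ ?.{0,15}$"

# The anchors apply to one paragraph at a time, never to the full body.
paragraph_hits = count_hits(anchored, paragraphs)
full_body_match = re.search(anchored, " ".join(paragraphs))

print(paragraphs)
print(paragraph_hits, full_body_match)
```

In this model the URL-only paragraph satisfies the anchored pattern on
its own, while the same pattern fails against the paragraphs joined into
one string. That is the trade-off mentioned above: a rendered "body"
rule never sees more than one paragraph per match attempt.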