Re: Bayes refinement

Karsten Bräckelmann Wed, 21 May 2014 18:13:16 -0700

On Wed, 2014-05-21 at 17:32 -0700, Ian Zimmerman wrote:
> > The test message does not have that string. Maybe it uses DOS
> > flavor "\r\n". Or what appears to be a bunch of linebreaks
> > actually has spaces mixed in.
> 
> Well, no.  I looked at the message (the same data I fed to s.a. --debug)
> with hexdump -C.  It definitely has 10 consecutive 0a's.
> 
> For rawbody rules, is really _the whole_ body fed to the matcher at once?


Well, no. Rawbody rules are applied to the raw, textual body parts,
merely decoded, with HTML and linebreaks left intact -- split up into 1-2
kByte chunks. It *is* possible the sub-string you're trying to match is
unfortunately placed and gets split across a chunk boundary.
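
To illustrate why a split can hide a match, here is a minimal sketch in
Python. It is not SpamAssassin's actual chunking code: split_chunks() and
the fixed 1024-byte size are assumptions made up for the demonstration.
It only shows that a run of 10 newlines straddling a boundary matches in
the whole body but in neither chunk.

```python
import re

# Hypothetical fixed-size chunking, for illustration only.
# SpamAssassin's real splitting logic is different.
def split_chunks(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]

pattern = re.compile(r"\n{10}")

# The run of 10 newlines starts at offset 1019, so it straddles
# the boundary at offset 1024: 5 newlines land in each chunk.
body = "a" * 1019 + "\n" * 10 + "b" * 1019
chunks = split_chunks(body, 1024)

whole_match = bool(pattern.search(body))
chunk_match = any(pattern.search(c) for c in chunks)

print("whole body matches:", whole_match)
print("any chunk matches: ", chunk_match)
```

Moving the run a few bytes in either direction makes the per-chunk
match succeed again, which is why this kind of miss looks random.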

To have a closer look at the occurrences of consecutive newlines and
their respective lengths, you can use this rule for testing:

  rawbody __BLANKS  /\n{2,}/
  tflags  __BLANKS  multiple

The -D debug output will show all matches. The number of directly
following "[...]" continuation lines per hit equals the number of
consecutive newline chars matched. Unlike the resulting rule, this
debugging variant needs an "or more" quantifier. Adjust the minimum to
filter out short matches, while still being able to easily find the
largest occurrence.
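
As a cross-check outside of SpamAssassin, the same "N or more" idea can
be sketched in Python. The helper name newline_runs() and the sample
string are made up for illustration. Note this looks at a plain string,
not at what rawbody rules see after decoding and chunking.

```python
import re

def newline_runs(text, minimum=2):
    """Return (offset, length) for each run of at least `minimum` newlines."""
    pattern = re.compile(r"\n{%d,}" % minimum)
    return [(m.start(), len(m.group())) for m in pattern.finditer(text)]

# Made-up sample: a run of 2 and a run of 4 newlines.
sample = "one\n\ntwo\n\n\n\nthree\n"
runs = newline_runs(sample)

# The longest run is the one of interest when hunting for a threshold.
longest = max(runs, key=lambda r: r[1])

for offset, length in runs:
    print("offset %d: %d newlines" % (offset, length))
```

Raising the minimum argument filters out short runs, just like
adjusting the quantifier in the test rule.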

Modifying your sample, or stripping it down to a minimal test case, will
show whether this is just an unfortunate edge case.

In either case, having a sample would speed up this ping-pong style
debugging. And I am curious. ;)  Mind putting your sample up on a pastebin?


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
