On Fri, 2010-08-06 at 13:51 -0400, Randy Ramsdell wrote:
> > > body            __RCR_MEGADK                    /.*(M.*E.*G.*A.*D.*K).*/
> > 
> > There are a few things that strike me as peculiar about that rule. Not 
> > least of which is that it would appear to match the following - 
> > hypothetical, but plausible - message. The presence of seven 
> > unrestricted greedy specifiers makes it perfectly plausible to me that 
> > it would take quite a long time to process any moderately long message.

> It does take a long time to process a message and a very short message 
> to boot. In fact, it never finishes and runs the cpus to 100% so the 
> rule has been removed. I still wonder if this is a bug.

It is a bug with the RE. It leads straight to backtracking hell.

OK, what happened there? First, the greedy /.*/ at the beginning slurps
in the entire string. Then the Perl RE engine removes one char at a
time, in an effort to match an M -- the last occurrence.

If it found one, the second greedy /.*/ slurps in the entire remainder
of the string, and the backtracking game begins again, this time
matching an E. And so on.

However, each time this process does not result in a match for the
entire RE, backtracking will cause the above process to go back a step
(a char in that RE), and start from there.


It is not a bug with SA, neither with Perl. It is the pathological RE
that causes an insane amount of backtracking with long strings, in
particular if there are many candidates -- that means upper-case chars.


The worst problem is the use of multiple, unbound, greedy /.*/ patterns.
As has been mentioned before, it would be much better to use a bound
variant like /.{0,10}/ instead. Also, using non-greedy variants likely
would help, too. As would using a char-class, rather than accepting any
char -- at the very least, that will help accuracy and prevent matches
on innocent text.

  guenther


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Reply via email to