On Thu, 26 Jun 2008 11:20:01 -0500, Peter Pearson wrote: > On 25 Jun 2008 15:20:04 GMT, Kirk <[EMAIL PROTECTED]> wrote: [snip] >> the following regular expression matching seems to enter in a infinite >> loop: [snip] >> import re >> text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) >> una ' >> re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9] >> *[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text) [snip] > > If it will help some smarter person identify the problem, it can > be simplified to this: > > re.findall('[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$', > "XXXXXXXXXXXXXXXXX (X" ) > > This doesn't actually hang, it just takes a long time. The > time taken increases quickly as the chain of X's gets longer.
... and can be further reduced to this: re.findall('(X+)+(X+)*$', "XXXXXXXXXXXXXXXXXXXy" ) which bears considerable resemblance to a regexp that was called to my attention inconjunction with this report on pathological regular expressions: http://www.cs.rice.edu/~scrosby/hash/ that resulted in my measuring the following data for the regular expression '(x+x+)+y' in strings consisting of many x's and maybe a terminal y: seconds to scan ... ------------------- # x's xxx...xy xxx...x ----- -------- ------- 10 0.0 0.0 11 0.0 0.0 12 0.0 0.0 13 0.0 0.01 14 0.0 0.0 15 0.0 0.01 16 0.0 0.03 17 0.0 0.04 18 0.0 0.09 19 0.0 0.16 20 0.0 0.33 21 0.0 0.65 22 0.0 1.3 23 0.0 2.58 24 0.0 5.18 25 0.0 10.27 26 0.0 20.41 Bottom line: given any regular-expression matcher, you can find a regular expression and a family of strings such that the matching time increases exponentially in the length of the string. -- To email me, substitute nowhere->spamcop, invalid->net. -- http://mail.python.org/mailman/listinfo/python-list