On Jul 4, 4:43 pm, "Filipe Fernandes" <[EMAIL PROTECTED]> wrote: > On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <[EMAIL PROTECTED]> wrote: > > Henning_Thornblad wrote: > > >> What can be the cause of the large difference between re.search and > >> grep? > > > grep uses a smarter algorithm ;) > > >> This script takes about 5 min to run on my computer: > >> #!/usr/bin/env python > >> import re > > >> row="" > >> for a in range(156000): > >> row+="a" > >> print re.search('[^ "=]*/',row) > > >> While doing a simple grep: > >> grep '[^ "=]*/' input (input contains 156.000 a in > >> one row) > >> doesn't even take a second. > > >> Is this a bug in python? > > > You could call this a performance bug, but it's not common enough in real > > code to get the necessary brain cycles from the core developers. > > So you can either write a patch yourself or use a workaround. > > > re.search('[^ "=]*/', row) if "/" in row else None > > > might be good enough. > > Wow... I'm rather surprised at how slow this is... using re.match > yields much quicker results, but of course it's not quite the same as > re.search > > Incidentally, if you add the '/' to "row" at the end of the string, > re.search returns instantly with a match object.
This behavior is showing that you're getting n-squared performance; the regexp seems to be checking 156000*(156000-1)/2 substrings for a match. I don't think it's possible to avoid quadratic behavior in regexps in general, but clearly there are ways to optimize in some cases. I'm guessing that grep builds a table of locations of individual characters as it scans and, when the regexp exhausts the input, it tries to backtrack to the last slash it saw, except there wasn't one so it knows the regexp cannot be satisfied and it exits early. > @ Peter > I'm not versed enough in regex to tell if this is a bug or not > (although I suspect it is), I'm pretty sure it isn't: the regexp documentation makes no claims as to the performance of the regexp that I'm aware of. > but why would you say this particular > regex isn't common enough in real code? When re.search regexps start with things like [^...]* or .*, typically the excluded characters are a typically found frequently in the input. For example, the pattern .*hello.* could be used to find a line with hello in it, with the expectation that there are lots of newlines. But if there aren't any newlines the regexp wouldn't be very useful. Carl Banks -- http://mail.python.org/mailman/listinfo/python-list