Re: re.search much slower then grep on some regular expressions

Kris Kennaway Tue, 08 Jul 2008 09:03:27 -0700

samwyse wrote:

On Jul 4, 6:43 am, Henning_Thornblad <[EMAIL PROTECTED]>
wrote:

What can be the cause of the large difference between re.search and
grep?

While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?


You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
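To make the linear-time claim in that quote concrete, here is a rough sketch of a hand-written single-pass scan for this one pattern. It is not Plex's DFA and not how grep works internally; `linear_search` and `EXCLUDED` are made-up names for illustration. It assumes the goal is the same leftmost, greedy match that `re.search` would report for `[^ "=]*/`.

```python
import re

# The pattern from the original question.
PATTERN = re.compile(r'[^ "=]*/')

# Characters that the bracket expression [^ "=] refuses to match.
EXCLUDED = set(' "=')


def linear_search(text):
    """Return (start, end) of the leftmost greedy match of [^ "=]*/ or None.

    Each character is examined exactly once, so the cost grows linearly
    with len(text), whether or not the text contains a '/'.
    """
    run_start = 0    # start of the current run of non-excluded characters
    last_slash = -1  # index of the last '/' seen in the current run, if any
    for i, ch in enumerate(text):
        if ch in EXCLUDED:
            # The run ends here. If it contained a '/', the match spans
            # from the start of the run through its last '/'.
            if last_slash >= 0:
                return (run_start, last_slash + 1)
            run_start = i + 1
        elif ch == '/':
            last_slash = i
    if last_slash >= 0:
        return (run_start, last_slash + 1)
    return None
```

On a line made only of the letter a, `linear_search` walks the string once and returns None. A backtracking matcher that retries the pattern from every start position ends up rescanning the rest of the line each time, which would explain the slowdown described in the question.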

I haven't tested this, but I think it would do what you want:

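from Plex import *

# TEXT returns the matched text as the token; IGNORE discards the match.
lexicon = Lexicon([
    (Rep(AnyBut(' "=')) + Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print(token)
    # A token whose first element is None signals the end of the input.
    if token[0] is None:
        break
f.close()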

Hmm, unfortunately it's still orders of magnitude slower than grep in my own application that involves matching lots of strings and regexps against large files (I killed it after 400 seconds, compared to 1.5 for grep), and that's leaving aside the much longer compilation time (over a minute). If the matching was fast then I could possibly pickle the lexer though (but it's not).


Kris

--
http://mail.python.org/mailman/listinfo/python-list
