On Jul 9, 10:06 pm, Kris Kennaway <[EMAIL PROTECTED]> wrote: > John Machin wrote: > >> Hmm, unfortunately it's still orders of magnitude slower than grep in my > >> own application that involves matching lots of strings and regexps > >> against large files (I killed it after 400 seconds, compared to 1.5 for > >> grep), and that's leaving aside the much longer compilation time (over a > >> minute). If the matching was fast then I could possibly pickle the > >> lexer though (but it's not). > > > Can you give us some examples of the kinds of patterns that you are > > using in practice and are slow using Python re? > > Trivial stuff like: > > (Str('error in pkg_delete'), ('mtree', 'mtree')), > (Str('filesystem was touched prior to .make install'), > ('mtree', 'mtree')), > (Str('list of extra files and directories'), ('mtree', 'mtree')), > (Str('list of files present before this port was installed'), > ('mtree', 'mtree')), > (Str('list of filesystem changes from before and after'), > ('mtree', 'mtree')), > > (re('Configuration .* not supported'), ('arch', 'arch')), > > (re('(configure: error:|Script.*configure.*failed > unexpectedly|script.*failed: here are the contents of)'), > ('configure_error', 'configure')), > ... > > There are about 150 of them and I want to find which is the first match > in a text file that ranges from a few KB up to 512MB in size. > > > How large is "large"? > > > What kind of text? > > It's compiler/build output. > > > Instead of grep, you might like to try nrgrep ... google("nrgrep > > Navarro Raffinot"): PDF paper about it on Citeseer (if it's up), > > postscript paper and C source findable from Gonzalo Navarro's home- > > page. > > Thanks, looks interesting but I don't think it is the best fit here. I > would like to avoid spawning hundreds of processes to process each file > (since I have tens of thousands of them to process). >
Uh-huh ... try this, then: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ You could use this to find the "Str" cases and the prefixes of the "re" cases (which seem to be no more complicated than 'foo.*bar.*zot') and use something slower like Python's re to search the remainder of the line for 'bar.*zot'. Cheers, John -- http://mail.python.org/mailman/listinfo/python-list