J. Cliff Dyer wrote:
On Wed, 2008-07-09 at 12:29 -0700, samwyse wrote:
On Jul 8, 11:01 am, Kris Kennaway <[EMAIL PROTECTED]> wrote:
samwyse wrote:
You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute).  If the matching was fast then I could possibly pickle the
lexer though (but it's not).
That's funny, the compilation is almost instantaneous for me.
However, I just tested it to several files, the first containing
4875*'a', the rest each twice the size of the previous.  And you're
right, for each doubling of the file size, the match take four times
as long, meaning O(n^2).  156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs

Sounds like a good strategy would be to find the smallest chunk of the
file that matches can't cross, and iterate your search on units of those
chunks.  For example, if none of your regexes cross line boundaries,
search each line of the file individually.  That may help turn around
the speed degradation you're seeing.

That's what I'm doing. I've also tried various other things like mmapping the file and searching it at once, etc, but almost all of the time is spent in the regexp engine so optimizing other things only gives marginal improvement.

Kris
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to