On Jul 8, 11:01 am, Kris Kennaway <[EMAIL PROTECTED]> wrote:
> > samwyse wrote:
> > > You might want to look at Plex.
> > > http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
> > >
> > > "Another advantage of Plex is that it compiles all of the regular
> > > expressions into a single DFA. Once that's done, the input can be
> > > processed in a time proportional to the number of characters to be
> > > scanned, and independent of the number or complexity of the regular
> > > expressions. Python's existing regular expression matchers do not have
> > > this property."
>
> Hmm, unfortunately it's still orders of magnitude slower than grep in my
> own application that involves matching lots of strings and regexps
> against large files (I killed it after 400 seconds, compared to 1.5 for
> grep), and that's leaving aside the much longer compilation time (over a
> minute). If the matching was fast then I could possibly pickle the
> lexer though (but it's not).

That's funny, the compilation is almost instantaneous for me. However, I
just tested it on several files, the first containing 4875*'a', the rest
each twice the size of the previous. And you're right: for each doubling
of the file size, the match takes four times as long, meaning O(n^2).
156000*'a' would probably take about eight hours.
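The quadratic trend is easy to confirm from the timings below. Here is a quick sketch (plain Python, using only the measured numbers from this post) that fits the scaling exponent between consecutive doublings and extrapolates to the 156000-character case:

```python
import math

# Measured Plex scan times from this post (seconds), for files of
# 4875, 9750, and 19500 'a' characters respectively.
sizes = [4875, 9750, 19500]
times = [24.8322969831, 99.3956799681, 398.349623132]

# If time ~ n**k, each doubling of n multiplies the time by 2**k,
# so k = log(t2/t1) / log(n2/n1) for consecutive measurements.
for i in range(1, len(times)):
    k = math.log(times[i] / times[i - 1]) / math.log(float(sizes[i]) / sizes[i - 1])
    print("%5d -> %5d chars: exponent ~ %.2f" % (sizes[i - 1], sizes[i], k))

# Extrapolate to 156000 characters assuming exactly O(n**2):
# 156000 / 4875 = 32, so the time should grow by a factor of 32**2 = 1024.
predicted = times[0] * (156000.0 / 4875) ** 2
print("predicted time for 156000*'a': %.0f secs (~%.1f hours)" % (predicted, predicted / 3600))
```

Both doublings fit an exponent of almost exactly 2, and the extrapolation lands at roughly 25000 seconds, i.e. around seven hours, consistent with the "eight hours" guess above.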
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs

And here's my (probably over-engineered) testbed:

from __future__ import with_statement
from os.path import exists
from timeit import Timer
from Plex import *

filename = "file-%d.txt"

def create_files(n):
    for x in range(0, n):
        fname = filename % x
        if not exists(fname):
            print 'creating', fname
            with open(fname, 'w') as f:
                print >>f, (4875*2**x)*'a',

def compile_lexicon():
    global lexicon
    lexicon = Lexicon([
        (Rep(AnyBut(' "='))+Str('/'), TEXT),
        (AnyBut('\n'), IGNORE),
    ])

def test(fname):
    with open(fname, 'r') as f:
        scanner = Scanner(lexicon, f, fname)
        while 1:
            token = scanner.read()
            #print token
            if token[0] is None:
                break

def my_timed_test(func_name, *args):
    stmt = func_name + '(' + ','.join(map(repr, args)) + ')'
    t = Timer(stmt, "from __main__ import " + func_name)
    print stmt, 'took', t.timeit(1), 'secs'

if __name__ == '__main__':
    create_files(6)
    my_timed_test('compile_lexicon')
    for x in range(0, 4):
        my_timed_test('test', filename % x)

--
http://mail.python.org/mailman/listinfo/python-list