Robin Becker wrote:

> #sscan1.py thanks to Skip
> import sys, time, mmap, os, re
> fn = sys.argv[1]
> fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
> s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
> l=n=0
> t0 = time.time()
> for mat in re.split("XXXXX", s):

re.split() returns a list, not a generator, and this list may consume a lot
of memory (see the small example after the quoted script).

>     n += 1
>     l += len(mat)
> t1 = time.time()
> 
> print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

I wrote a generator replacement for re.split(), but as you might expect, its
performance is nowhere near that of re.split(). For your large data it might
still help somewhat because of its smaller memory footprint.

import re

def splititer(regex, data):
    # Like re.split(), but lazy: yields the pieces one at a time instead of
    # building the whole list (and never yields the separators).
    if not hasattr(regex, "finditer"):
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        # span() gives (start of separator, end of separator) in data.
        end, new_start = match.span()
        yield data[start:end]       # the piece before this separator
        start = new_start
    yield data[start:]              # whatever follows the last separator
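
Dropping it into your timing script in place of the re.split() loop would look
roughly like this (an untested sketch; finditer() and slicing should both
accept the mmap object, since it supports the buffer interface):

l = n = 0
t0 = time.time()
for mat in splititer("XXXXX", s):   # s is the mmap'ed file from your script
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))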

Peter
