Robin Becker wrote:

> #sscan1.py thanks to Skip
> import sys, time, mmap, os, re
> fn = sys.argv[1]
> fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
> s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
> l=n=0
> t0 = time.time()
> for mat in re.split("XXXXX", s):
re.split() returns a list, not a generator, and this list may consume
a lot of memory.

>     n += 1
>     l += len(mat)
> t1 = time.time()
>
> print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

I wrote a generator replacement for re.split(), but as you might expect,
the performance is nowhere near re.split(). For your large data it might
help somewhat because of its smaller memory footprint.

def splititer(regex, data):
    # like re.split(), but never yields the separators
    if not hasattr(regex, "finditer"):
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        end, new_start = match.span()
        yield data[start:end]
        start = new_start
    yield data[start:]

Peter
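Plugging splititer() into Robin's benchmark would look roughly like this
(a hypothetical sscan2.py -- untested, and it assumes the splititer()
definition above is pasted into the same file):

# sscan2.py -- untested sketch: sscan1.py with splititer() in place of
# re.split(); splititer() as defined above
import sys, time, mmap, os, re

fn = sys.argv[1]
fh = os.open(fn, os.O_BINARY | os.O_RDONLY)   # O_BINARY exists on Windows only
s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
l = n = 0
t0 = time.time()
for mat in splititer("XXXXX", s):   # lazily yields one chunk at a time
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

Because re.finditer() walks the mmap'd buffer lazily, only the current
chunk is held as a string at any time, instead of the full list that
re.split() builds up front.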