Robin> I implemented a simple scanning algorithm in two ways. First
Robin> buffered scan tscan0.py; second mmapped scan tscan1.py.
...

Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
Robin> len=139583265 w=103 time=110.91

Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
Robin> len=139583265 w=103 time=140.53

I'm not sure why the mmap() solution is so much slower for you. Perhaps on
some systems files opened for reading are mmap'd under the covers; I'm sure
it's highly platform-dependent. (My results on Mac OS X - see below - are
somewhat better.)

Let me return to your original problem, though: doing regex operations on
files. I modified your two scripts slightly.

tscan0.py:

    import sys, time, re

    fn = sys.argv[1]
    f = open(fn, 'rb')

    n = 0
    t0 = time.time()
    while 1:
        buf = f.read(4096)
        if not buf:
            break
        for i in re.split("XXXXX", buf):
            n += 1
    t1 = time.time()

    print "n=%d time=%.2f" % (n, (t1-t0))

tscan1.py:

    import sys, time, mmap, os, re

    fn = sys.argv[1]
    fh = os.open(fn, os.O_RDONLY)
    s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)

    t0 = time.time()
    n = 0
    for mat in re.split("XXXXX", s):
        n += 1
    t1 = time.time()

    print "n=%d time=%.2f" % (n, (t1-t0))

The mmap version is almost obviously correct, assuming what we want to do is
split the file on "XXXXX". The buffered-read version is almost certainly
incorrect, given our understanding that corner cases lurk at buffer
boundaries.

I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file, then ran the two programs against it:

    % python tscan1.py splitX
    n=2112001 time=8.88
    % python tscan0.py splitX
    n=2139845 time=10.26

So in this test the buffered-read version actually takes about 15% longer
than the mmap'd one, and the mmap'd one doesn't have to solve any corner
cases (note the different values of n). Even on systems where mmap is
somewhat slower, as in your runs, I'm happy to take the extra runtime in
exchange for simpler code.

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list
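
For reference, a correct buffered-read split would have to carry the
unfinished tail of each buffer into the next read, so that a separator
straddling a buffer boundary is still seen. Here is a minimal sketch of that
bookkeeping, assuming a literal separator; the helper name count_fields and
its default buffer size are illustrative, not from the thread:

    import re

    def count_fields(fn, sep="XXXXX", bufsize=4096):
        # Count the sep-separated fields in fn using buffered reads.
        # 'carry' holds the text after the last separator seen so far;
        # prepending it to the next buffer lets re.split find a separator
        # that straddles a buffer boundary.
        n = 0
        carry = ""
        f = open(fn, "rb")
        while 1:
            buf = f.read(bufsize)
            if not buf:
                break
            pieces = re.split(sep, carry + buf)
            carry = pieces.pop()      # last piece may be an incomplete field
            n += len(pieces)          # one per separator consumed
        f.close()
        return n + 1                  # the final carry is the last field

With that carry-over in place the count agrees with re.split over the whole
mmap'd file, which is exactly the bookkeeping the mmap version lets us skip.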