Skip Montanaro wrote:
......

I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.)


I'll have a go at doing the experiment on some other platforms I have available. The problem is certainly paging-related. Perhaps the fact that we don't need to write dirty pages back is moot when the system is busy writing out other processes' pages to make room for the ones the CPU hog is faulting in; I know I cannot control that in detail. It's also entirely possible that file caching, readahead and so on skew the results.


All my old compiler texts recommend the buffered-read approach, but that may simply be because mmap and friends weren't around then. Perhaps some compiler expert can say? I also suspect that in a low-level language the minor overhead of the buffer bookkeeping is lower than that of the paging code.

Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:

......
I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file.  I then ran the above two programs against it:

    % python tscan1.py splitX
    n=2112001 time=8.88
    % python tscan0.py splitX
    n=2139845 time=10.26

So the mmap'd version is within 15% of the performance of the buffered read
version and we don't have to solve the problem of any corner cases (note the
different values of n).  I'm happy to take the extra runtime in exchange for
simpler code.

Skip

I will have a go at repeating this on my system, perhaps with Bengt's code in the buffered case, as that would be more realistic.


It has been my experience that all systems crawl once driven into swapping, and some users of our code seem keen to run huge print jobs.
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list