Re: regex over files

2005-04-29 Thread Robin Becker
Peter Otten wrote: Robin Becker wrote: #sscan1.py thanks to Skip import sys, time, mmap, os, re fn = sys.argv[1] fh=os.open(fn,os.O_BINARY|os.O_RDONLY) s=mmap.mmap(fh,0,access=mmap.ACCESS_READ) l=n=0 t0 = time.time() for mat in re.split("X", s): re.split() returns a list, not a generator, and

Re: regex over files

2005-04-29 Thread Peter Otten
Robin Becker wrote: > #sscan1.py thanks to Skip > import sys, time, mmap, os, re > fn = sys.argv[1] > fh=os.open(fn,os.O_BINARY|os.O_RDONLY) > s=mmap.mmap(fh,0,access=mmap.ACCESS_READ) > l=n=0 > t0 = time.time() > for mat in re.split("X", s): re.split() returns a list, not a generator, and th
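Peter's point that re.split() builds the whole list up front can be sketched as follows. This is only an illustration, not code from the thread: iter_split() and demo() are hypothetical helpers, the code assumes Python 3 (where a regex over an mmap needs a bytes pattern), and it uses re.finditer() to hand out one piece at a time.

```python
import mmap
import os
import re
import tempfile


def iter_split(pattern, data):
    """Yield the pieces of `data` between matches of `pattern`, one at a time.

    Unlike re.split(), no list of all pieces is built up front.
    Capture groups in the pattern are ignored here, unlike re.split().
    """
    start = 0
    for match in re.finditer(pattern, data):
        yield data[start:match.start()]
        start = match.end()
    yield data[start:]


def demo():
    # Write a small throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"aaXbbbXc")
        path = tmp.name
    try:
        with open(path, "rb") as fh:
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                # Only the length of each piece is kept, as in a scan loop.
                return [len(piece) for piece in iter_split(rb"X", mm)]
    finally:
        os.remove(path)
```

Each yielded piece is still copied out of the mapping as a bytes object, so the saving is in not holding every piece at once, not in avoiding the copies.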

Re: regex over files

2005-04-28 Thread Bengt Richter
On Thu, 28 Apr 2005 20:35:43 +, Robin Becker <[EMAIL PROTECTED]> wrote: >Jeremy Bowers wrote: > > > > > As you try to understand mmap, make sure your mental model can take into > > account the fact that it is easy and quite common to mmap a file several > > times larger than your physical

Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote: . Let me return to your original problem though, doing regex operations on files. I modified your two scripts slightly: . Skip I'm sure my results depend on something other than the coding style; I suspect the file/disk cache and paging are at work here. Note that we

Re: regex over files

2005-04-28 Thread Robin Becker
Jeremy Bowers wrote: . As you try to understand mmap, make sure your mental model can take into account the fact that it is easy and quite common to mmap a file several times larger than your physical memory, and it does not even *try* to read the whole thing in at any given time. You may benef

Re: regex over files

2005-04-28 Thread Robin Becker
Robin Becker wrote: Skip Montanaro wrote: .. I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.) ...

Re: regex over files

2005-04-28 Thread Robin Becker
Jeremy Bowers wrote: > > As you try to understand mmap, make sure your mental model can take into > account the fact that it is easy and quite common to mmap a file several > times larger than your physical memory, and it does not even *try* to read > the whole thing in at any given time. You

Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote: ... I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.) Let me return to your origina

Re: regex over files

2005-04-28 Thread Skip Montanaro
Bengt> To be fairer, I think you'd want to hoist the re compilation out Bengt> of the loop. The re module compiles and caches regular expressions, so I doubt it would affect the runtime of either version. Bengt> But also to be fairer, maybe include the overhead of splitting Bengt
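The two styles being compared here can be sketched side by side. This is a hypothetical illustration (the function names and sample lines are made up, not taken from the scripts in the thread); it only shows that both forms give the same answer, with the module-level call relying on the re module's internal cache as Skip describes.

```python
import re

# Made-up sample input for the illustration.
LINES = ["alpha X beta", "no match here", "X marks the spot"]


def count_with_module_function(pattern, lines):
    # re.search() looks the pattern up in the module's cache on each call,
    # so the pattern is normally compiled only once.
    return sum(1 for line in lines if re.search(pattern, line))


def count_with_hoisted_pattern(pattern, lines):
    # Compile once, outside the loop, and reuse the pattern object.
    compiled = re.compile(pattern)
    return sum(1 for line in lines if compiled.search(line))
```

Hoisting still skips the per-call cache lookup, so it may matter in a very tight loop; whether it is measurable for a file scan would need timing on the actual data.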

Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote: .. I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.) I'll have a go at doing th

Re: regex over files

2005-04-27 Thread Bengt Richter
On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro <[EMAIL PROTECTED]> wrote: > >Robin> I implemented a simple scanning algorithm in two ways. First > buffered scan >Robin> tscan0.py; second mmapped scan tscan1.py. > >... > >Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.

Re: regex over files

2005-04-27 Thread Skip Montanaro
Robin> I implemented a simple scanning algorithm in two ways. First buffered scan Robin> tscan0.py; second mmapped scan tscan1.py. ... Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat Robin> len=139583265 w=103 time=110.91 Robin> C:\code\reportlab\de

Re: regex over files

2005-04-27 Thread Robin Becker
Jeremy Bowers wrote: On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote: Skip Montanaro wrote: ... If I mmap() a file, it's not slurped into main memory immediately, though as you pointed out, it's charged to my process's virtual memory. As I access bits of the file's contents, it will page i

Re: regex over files

2005-04-26 Thread Bengt Richter
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker <[EMAIL PROTECTED]> wrote: >Is there any way to get regexes to work on non-string/unicode objects. I would >like to split large files by regex and it seems relatively hard to do so >without >having the whole file in memory. Even with buffers it s

Re: regex over files

2005-04-26 Thread Jeremy Bowers
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote: > Skip Montanaro wrote: > ... >> If I mmap() a file, it's not slurped into main memory immediately, though as >> you pointed out, it's charged to my process's virtual memory. As I access >> bits of the file's contents, it will page in only w

Re: regex over files

2005-04-26 Thread Robin Becker
Skip Montanaro wrote: ... If I mmap() a file, it's not slurped into main memory immediately, though as you pointed out, it's charged to my process's virtual memory. As I access bits of the file's contents, it will page in only what's necessary. If I mmap() a huge file, then print out a few bytes
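The access pattern Skip describes, mapping a file and then touching only a few bytes of it, might look like the sketch below. It assumes Python 3; bytes_at() and demo() are hypothetical names, and the 1 MiB test file is far too small to show any paging effect. It only shows the shape of the code.

```python
import mmap
import os
import tempfile


def bytes_at(path, offset, count):
    """Return `count` bytes starting at `offset` in the file at `path`."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested bytes out of the mapping;
            # which pages get read from disk is up to the operating system.
            return mm[offset:offset + count]


def demo(size=1 << 20):
    # Build a throwaway file: zero bytes with a marker in the middle.
    half = size // 2
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * half + b"MARKER" + b"\0" * half)
        path = tmp.name
    try:
        return bytes_at(path, half, 6)
    finally:
        os.remove(path)
```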

Re: regex over files

2005-04-26 Thread Skip Montanaro
>> It's hard to imagine how sliding a small window onto a file within Python >> would be more efficient than the operating system's paging system. ;-) Robin> well it might be if I only want to scan forward through the file Robin> (think lexical analysis). Most lexical analyzers use

Re: regex over files

2005-04-26 Thread Jeremy Bowers
On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote: > Skip Montanaro wrote: >> Robin> So we avoid dirty page writes etc etc. However, I still think I >> Robin> could get away with a small window into the file which would be >> Robin> more efficient. >> >> It's hard to imagine how

Re: regex over files

2005-04-26 Thread Robin Becker
Skip Montanaro wrote: Robin> So we avoid dirty page writes etc etc. However, I still think I Robin> could get away with a small window into the file which would be Robin> more efficient. It's hard to imagine how sliding a small window onto a file within Python would be more efficient th

Re: regex over files

2005-04-26 Thread Skip Montanaro
Robin> So we avoid dirty page writes etc etc. However, I still think I Robin> could get away with a small window into the file which would be Robin> more efficient. It's hard to imagine how sliding a small window onto a file within Python would be more efficient than the operating sys

Re: regex over files

2005-04-26 Thread Robin Becker
Steve Holden wrote: . thanks I'll give it a whirl Whoops, I don't think it's a regex search :-( You should be able to adapt the logic fairly easily, I hope. The buffering logic is half the problem; doing it quickly is the other half. The third half of the problem is getting re to co-o

Re: regex over files

2005-04-26 Thread Steve Holden
Robin Becker wrote: Steve Holden wrote: .. I seem to remember that the Medusa code contains a fairly good overlapped search for a terminator string, if you want to chunk the file. Take a look at the handle_read() method of class async_chat in the standard library's asynchat.py. . thanks
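The chunk-and-carry-over idea for a fixed terminator can be sketched roughly as below. This is not the asynchat code Steve mentions, only a minimal hypothetical version (split_stream() is a made-up name) that keeps the unfinished tail of each chunk so a terminator straddling two reads is still found. It handles a literal terminator, not a regex.

```python
import io


def split_stream(stream, terminator, chunk_size=4096):
    """Yield records from a binary stream separated by `terminator`."""
    if not terminator:
        raise ValueError("terminator must be non-empty")
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # All pieces but the last are complete records. The last piece may
        # end with part of a terminator, so it is carried into the next read.
        *records, buffer = buffer.split(terminator)
        yield from records
    # Whatever is left after the final read is the last record.
    yield buffer


# Usage: a tiny chunk size forces the terminator to straddle two reads.
records = list(split_stream(io.BytesIO(b"aaXYbbbXYc"), b"XY", chunk_size=3))
```

The whole carried-over buffer is rescanned after every read, which is simple but wasteful when records are much longer than the chunk size.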

Re: regex over files

2005-04-26 Thread Robin Becker
Steve Holden wrote: .. I seem to remember that the Medusa code contains a fairly good overlapped search for a terminator string, if you want to chunk the file. Take a look at the handle_read() method of class async_chat in the standard library's asynchat.py. . thanks I'll give it a whirl

Re: regex over files

2005-04-26 Thread Steve Holden
Robin Becker wrote: Richard Brodie wrote: "Robin Becker" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] Gerald Klix wrote: Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. that's a good idea, but I wonder if it actually saves

Re: regex over files

2005-04-26 Thread Robin Becker
Richard Brodie wrote: "Robin Becker" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] Gerald Klix wrote: Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. that's a good idea, but I wonder if it actually saves on memory? I just t

Re: regex over files

2005-04-26 Thread Richard Brodie
"Robin Becker" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Gerald Klix wrote: > > Map the file into RAM by using the mmap module. > > The file's contents are then available as a searchable string. > > > > that's a good idea, but I wonder if it actually saves on memory? I just tried

Re: regex over files

2005-04-25 Thread Robin Becker
Gerald Klix wrote: Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. that's a good idea, but I wonder if it actually saves on memory? I just tried regexing through a 25 MB file and end up with 40 MB as working set (it rose linearly as the l

Re: regex over files

2005-04-25 Thread Gerald Klix
Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. HTH, Gerald Robin Becker wrote: Is there any way to get regexes to work on non-string/unicode objects. I would like to split large files by regex and it seems relatively hard to do so wi