Peter Otten wrote:
Robin Becker wrote:
#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in re.split("X", s):
re.split() returns a list, not a generator, and
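The archive preview cuts sscan1.py off right after the for line, so here is a hedged, runnable reconstruction. The loop body is an assumption (totalling field lengths and counting them, as the len=/w= output elsewhere in the thread suggests), and the separator b"X" is taken from the preview:

```python
import mmap
import os
import re
import sys
import time

def scan(fn, sep=b"X"):
    # os.O_BINARY exists only on Windows; default to 0 elsewhere.
    fh = os.open(fn, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        with mmap.mmap(fh, 0, access=mmap.ACCESS_READ) as s:
            l = n = 0
            t0 = time.time()
            # Assumed loop body: total the field lengths and count the fields.
            # s[:] copies the mapped file into bytes for the demo's sake;
            # re can also search the mmap object directly.
            for mat in re.split(sep, s[:]):
                l += len(mat)
                n += 1
            return l, n, time.time() - t0
    finally:
        os.close(fh)

if __name__ == "__main__" and len(sys.argv) > 1:
    l, n, t = scan(sys.argv[1])
    print("len=%d n=%d time=%.2f" % (l, n, t))
```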
Robin Becker wrote:
> #sscan1.py thanks to Skip
> import sys, time, mmap, os, re
> fn = sys.argv[1]
> fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
> s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
> l=n=0
> t0 = time.time()
> for mat in re.split("X", s):
re.split() returns a list, not a generator, and th
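The list-vs-generator point suggests a lazy alternative: re.finditer yields matches one at a time, so the pieces between separators can be produced on demand instead of materializing the whole result list. A sketch (not from the thread):

```python
import re

def iter_split(data, pattern):
    """Yield the pieces of `data` between matches of `pattern`, lazily.

    Unlike re.split(), which builds the entire result list at once,
    this generator produces one piece at a time.
    """
    pos = 0
    for m in re.finditer(pattern, data):
        yield data[pos:m.start()]
        pos = m.end()
    yield data[pos:]       # trailing piece after the last separator

pieces = iter_split(b"aaXbbbXcc", b"X")
assert next(pieces) == b"aa"            # produced on demand
assert list(pieces) == [b"bbb", b"cc"]
```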
On Thu, 28 Apr 2005 20:35:43 +, Robin Becker <[EMAIL PROTECTED]> wrote:
>Jeremy Bowers wrote:
>
> >
> > As you try to understand mmap, make sure your mental model can take into
> > account the fact that it is easy and quite common to mmap a file several
> > times larger than your physical
Skip Montanaro wrote:
.
Let me return to your original problem though, doing regex operations on
files. I modified your two scripts slightly:
.
Skip
I'm sure my results are dependent on something other than the coding style.
I suspect file/disk caching and paging operate here. Note that we
Jeremy Bowers wrote:
.
As you try to understand mmap, make sure your mental model can take into
account the fact that it is easy and quite common to mmap a file several
times larger than your physical memory, and it does not even *try* to read
the whole thing in at any given time. You may benef
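Jeremy's point is easy to demonstrate: map a file much larger than anything actually written to disk and touch single bytes; the whole file is addressable, but only the touched pages are faulted in. A small sketch (the 50 MB size is arbitrary):

```python
import mmap
import os
import tempfile

# ftruncate just sets the file size without writing data, so on most
# filesystems this 50 MB file is sparse.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 50 * 1024 * 1024)

with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    size = len(m)        # the whole file is addressable...
    first = m[0]         # ...but touching a byte faults in only that page
    last = m[size - 1]

os.close(fd)
os.remove(path)
```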
Robin Becker wrote:
Skip Montanaro wrote:
..
I'm not sure why the mmap() solution is so much slower for you. Perhaps on
some systems files opened for reading are mmap'd under the covers. I'm sure
it's highly platform-dependent. (My results on MacOSX - see below - are
somewhat better.)
...
Jeremy Bowers wrote:
>
> As you try to understand mmap, make sure your mental model can take into
> account the fact that it is easy and quite common to mmap a file several
> times larger than your physical memory, and it does not even *try* to read
> the whole thing in at any given time. You
Skip Montanaro wrote:
...
I'm not sure why the mmap() solution is so much slower for you. Perhaps on
some systems files opened for reading are mmap'd under the covers. I'm sure
it's highly platform-dependent. (My results on MacOSX - see below - are
somewhat better.)
Let me return to your original problem though, doing regex operations on files.
Bengt> To be fairer, I think you'd want to hoist the re compilation out
Bengt> of the loop.
The re module compiles and caches regular expressions, so I doubt it would
affect the runtime of either version.
Bengt> But also to be fairer, maybe include the overhead of splitting
Bengt
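Skip's claim about the re module's cache can be checked directly. In CPython the cache is observable because compiling the same pattern twice returns the very same object (an implementation detail, not a guaranteed API):

```python
import re

re.purge()                 # clear re's internal pattern cache for a clean demo
a = re.compile("X")
b = re.compile("X")
same = a is b              # True under CPython: the second call hits the cache,
                           # so module-level re.split("X", ...) in a loop does
                           # not recompile the pattern each iteration
```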
Skip Montanaro wrote:
..
I'm not sure why the mmap() solution is so much slower for you. Perhaps on
some systems files opened for reading are mmap'd under the covers. I'm sure
it's highly platform-dependent. (My results on MacOSX - see below - are
somewhat better.)
I'll have a go at doing th
On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro <[EMAIL PROTECTED]> wrote:
>
>Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
>Robin> tscan0.py; second mmapped scan tscan1.py.
>
>...
>
>Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.
Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
Robin> tscan0.py; second mmapped scan tscan1.py.
...
Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
Robin> len=139583265 w=103 time=110.91
Robin> C:\code\reportlab\de
Jeremy Bowers wrote:
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote:
Skip Montanaro wrote:
...
If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory. As I access
bits of the file's contents, it will page in only what's necessary.
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker <[EMAIL PROTECTED]> wrote:
>Is there any way to get regexes to work on non-string/unicode objects. I would
>like to split large files by regex and it seems relatively hard to do so without
>having the whole file in memory. Even with buffers it s
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote:
> Skip Montanaro wrote:
> ...
>> If I mmap() a file, it's not slurped into main memory immediately, though as
>> you pointed out, it's charged to my process's virtual memory. As I access
>> bits of the file's contents, it will page in only what's necessary.
Skip Montanaro wrote:
...
If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory. As I access
bits of the file's contents, it will page in only what's necessary. If I
mmap() a huge file, then print out a few bytes
>> It's hard to imagine how sliding a small window onto a file within Python
>> would be more efficient than the operating system's paging system. ;-)
Robin> well it might be if I only want to scan forward through the file
Robin> (think lexical analysis). Most lexical analyzers use
On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:
> Skip Montanaro wrote:
>> Robin> So we avoid dirty page writes etc etc. However, I still think I
>> Robin> could get away with a small window into the file which would be
>> Robin> more efficient.
>>
>> It's hard to imagine how sliding a small window onto a file within Python
>> would be more efficient than the operating system's paging system. ;-)
Skip Montanaro wrote:
Robin> So we avoid dirty page writes etc etc. However, I still think I
Robin> could get away with a small window into the file which would be
Robin> more efficient.
It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating system's paging system. ;-)
Robin> So we avoid dirty page writes etc etc. However, I still think I
Robin> could get away with a small window into the file which would be
Robin> more efficient.
It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating system's paging system. ;-)
Steve Holden wrote:
.
thanks I'll give it a whirl
Whoops, I don't think it's a regex search :-(
You should be able to adapt the logic fairly easily, I hope.
The buffering logic is half the problem; doing it quickly is the other half.
The third half of the problem is getting re to co-operate.
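Steve's "buffering logic is half the problem" remark can be sketched concretely: scan the file chunk by chunk, carrying an unterminated tail across chunk boundaries so a separator that straddles two reads is not missed. This version assumes a fixed-string separator; a general regex would additionally need a bound on the longest possible match to know how much tail to retain:

```python
import io

def split_stream(fobj, sep=b"X", chunksize=1 << 16):
    """Yield fields from a binary file object, splitting on `sep`.

    Holds only one chunk plus any unterminated tail in memory at a time.
    """
    tail = b""
    while True:
        chunk = fobj.read(chunksize)
        if not chunk:
            if tail:
                yield tail          # final, unterminated field
            return
        buf = tail + chunk
        pieces = buf.split(sep)
        tail = pieces.pop()         # last piece may continue in the next chunk
        for p in pieces:
            yield p

# A tiny chunksize forces the separator to straddle a boundary.
fields = list(split_stream(io.BytesIO(b"aaXbbbXcc"), chunksize=4))
```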
Robin Becker wrote:
Steve Holden wrote:
..
I seem to remember that the Medusa code contains a fairly good
overlapped search for a terminator string, if you want to chunk the file.
Take a look at the handle_read() method of class async_chat in the
standard library's asynchat.py.
.
thanks
Steve Holden wrote:
..
I seem to remember that the Medusa code contains a fairly good
overlapped search for a terminator string, if you want to chunk the file.
Take a look at the handle_read() method of class async_chat in the
standard library's asynchat.py.
.
thanks I'll give it a whirl
Robin Becker wrote:
Richard Brodie wrote:
"Robin Becker" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents then are available as a searchable string.
that's a good idea, but I wonder if it actually saves
Richard Brodie wrote:
"Robin Becker" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents then are available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just
tried
"Robin Becker" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Gerald Klix wrote:
> > Map the file into RAM by using the mmap module.
> > The file's contents then are available as a searchable string.
> >
>
> that's a good idea, but I wonder if it actually saves on memory? I just tried
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents then are available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just tried
regexing through a 25Mb file and end up with 40Mb as working set (it rose
linearly as the l
Map the file into RAM by using the mmap module.
The file's contents then are available as a searchable string.
HTH,
Gerald
Robin Becker schrieb:
Is there any way to get regexes to work on non-string/unicode objects. I
would like to split large files by regex and it seems relatively hard to
do so without having the whole file in memory.
Is there any way to get regexes to work on non-string/unicode objects. I would
like to split large files by regex and it seems relatively hard to do so without
having the whole file in memory. Even with buffers it seems hard to get regexes
to indicate that they failed because of buffer terminati
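The direction the thread settles on can be sketched as follows: mmap the file and let the regex engine scan it in place, since re's bytes patterns accept bytes-like objects such as mmap. The sample file and the pattern below are illustrative, not from the thread:

```python
import mmap
import os
import re
import tempfile

# Create a small sample file for the demonstration.
fd, path = tempfile.mkstemp()
os.write(fd, b"one,two,three")
os.close(fd)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # finditer walks the mapped file; slicing the mmap by match
        # offsets pulls out only the matched regions.
        tokens = [m[mt.start():mt.end()] for mt in re.finditer(rb"[a-z]+", m)]

os.remove(path)
```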