On Sat, 29 Oct 2005 10:34:24 +0200, Peter Otten <[EMAIL PROTECTED]> wrote:
>Bengt Richter wrote:
>
>> On Fri, 28 Oct 2005 20:03:17 -0700, [EMAIL PROTECTED] (Alex Martelli)
>> wrote:
>>
>>>Mike Meyer <[EMAIL PROTECTED]> wrote:
>>>   ...
>>>> Except if you can't read the file into memory because it's too large,
>>>> there's a pretty good chance you won't be able to mmap it either. To
>>>> deal with huge files, the only option is to read the file in in
>>>> chunks, count the occurrences in each chunk, and then do some fiddling
>>>> to deal with the pattern landing on a boundary.
>>>
>>>That's the kind of things generators are for...:
>>>
>>>def byblocks(f, blocksize, overlap):
>>>    block = f.read(blocksize)
>>>    yield block
>>>    while block:
>>>        block = block[-overlap:] + f.read(blocksize-overlap)
>>>        if block: yield block
>>>
>>>Now, to look for a substring of length N in an open binary file f:
>>>
>>>f = open(whatever, 'b')
>>>count = 0
>>>for block in byblocks(f, 1024*1024, len(subst)-1):
>>>    count += block.count(subst)
>>>f.close()
>>>
>>>not much "fiddling" needed, as you can see, and what little "fiddling"
>>>is needed is entirely encompassed by the generator...
>>>
>> Do I get a job at Google if I find something wrong with the above? ;-)
>
>Try it with a subst of length 1. Seems like you missed an opportunity :-)
>
I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was

 >>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
 >>> [gen.next() for i in xrange(10)]
 ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list
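
For what it's worth, here is one way the generator might be repaired. This is only a sketch, written for Python 3 with bytes rather than the Python 2 strings used in the thread. Two things seem to go wrong in the quoted version: with overlap == 0, block[-0:] is the whole block, so nothing is ever discarded (the subst-of-length-1 case), and at end of file block[-overlap:] is still non-empty, so the loop keeps yielding the same tail forever (the 'no' session above). The sketch keeps the tail separately and stops as soon as read() returns nothing. The helper count_occurrences and the ValueError check are my own additions, not anything from the thread, and the sketch assumes f.read(n) only comes back short at end of file (true for regular files and io.BytesIO).

```python
import io


def byblocks(f, blocksize, overlap):
    """Yield successive blocks from binary file f.

    Every block after the first begins with the last `overlap` bytes of
    the previous block, so a pattern of length overlap + 1 that straddles
    a boundary still shows up whole in exactly one block.
    """
    if not 0 <= overlap < blocksize:
        raise ValueError("need 0 <= overlap < blocksize")
    tail = b""
    while True:
        data = f.read(blocksize - overlap)
        if not data:
            # End of file: stop instead of re-yielding the leftover tail.
            break
        block = tail + data
        yield block
        # block[-0:] would be the whole block, so treat overlap == 0 apart.
        tail = block[-overlap:] if overlap else b""


def count_occurrences(f, subst, blocksize=1024 * 1024):
    """Hypothetical helper: count subst in f without loading all of f."""
    return sum(block.count(subst)
               for block in byblocks(f, blocksize, len(subst) - 1))


if __name__ == "__main__":
    # The case from the session above now terminates after one block.
    print(list(byblocks(io.BytesIO(b"no"), 1024, len(b"end?") - 1)))
    # A pattern of length 1, with a deliberately tiny block size.
    print(count_occurrences(io.BytesIO(b"abcabcabc"), b"c", blocksize=4))
```

On a real file this would be used as count_occurrences(open(path, 'rb'), subst), ideally inside a with block. One caveat: for a pattern that can overlap itself (say b"aa" in b"aaa"), the chunked total can differ from a single bytes.count over the whole data, since count only tallies non-overlapping matches within each block.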