On Sat, 29 Oct 2005 21:10:11 +0100, Steve Holden <[EMAIL PROTECTED]> wrote:
>Peter Otten wrote: >> Bengt Richter wrote: >> >> >>>What struck me was >>> >>> >>>>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1) >>>>>> [gen.next() for i in xrange(10)] >>> >>>['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no'] >> >> >> Ouch. Seems like I spotted the subtle cornercase error and missed the big >> one. >> > >No, you just realised subconsciously that we'd all spot the obvious one >and decided to point out the bug that would remain after the obvious one >had been fixed. > I still smelled a bug in the counting of substring in the overlap region, and you motivated me to find it (obvious in hindsight, but aren't most ;-) A substring can get over-counted if the "overlap" region joins infelicitously with the next input. E.g., try counting 'xx' in 10*'xx' with a read chunk of 4 instead of 1024*1024: Assuming corrections so far posted as I understand them: >>> def byblocks(f, blocksize, overlap): ... block = f.read(blocksize) ... yield block ... if overlap>0: ... while True: ... next = f.read(blocksize-overlap) ... if not next: break ... block = block[-overlap:] + next ... yield block ... else: ... while True: ... next = f.read(blocksize) ... if not next: break ... yield next ... >>> def countsubst(f, subst, blksize=1024*1024): ... count = 0 ... for block in byblocks(f, blksize, len(subst)-1): ... count += block.count(subst) ... f.close() ... return count ... >>> from StringIO import StringIO as S >>> countsubst(S('xx'*10), 'xx', 4) 13 >>> ('xx'*10).count('xx') 10 >>> list(byblocks(S('xx'*10), 4, len('xx')-1)) ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx'] Of course, a large read chunk will make the problem either go away >>> countsubst(S('xx'*10), 'xx', 1024) 10 or might make it low probability depending on the data. Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list