Bengt Richter wrote: > I still smelled a bug in the counting of substring in the overlap region, > and you motivated me to find it (obvious in hindsight, but aren't most ;-) > > A substring can get over-counted if the "overlap" region joins > infelicitously with the next input. E.g., try counting 'xx' in 10*'xx' > with a read chunk of 4 instead of 1024*1024: > > Assuming corrections so far posted as I understand them: > > >>> def byblocks(f, blocksize, overlap): > ... block = f.read(blocksize) > ... yield block > ... if overlap>0: > ... while True: > ... next = f.read(blocksize-overlap) > ... if not next: break > ... block = block[-overlap:] + next > ... yield block > ... else: > ... while True: > ... next = f.read(blocksize) > ... if not next: break > ... yield next > ... > >>> def countsubst(f, subst, blksize=1024*1024): > ... count = 0 > ... for block in byblocks(f, blksize, len(subst)-1): > ... count += block.count(subst) > ... f.close() > ... return count > ... > > >>> from StringIO import StringIO as S > >>> countsubst(S('xx'*10), 'xx', 4) > 13 > >>> ('xx'*10).count('xx') > 10 > >>> list(byblocks(S('xx'*10), 4, len('xx')-1)) > ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx'] > > Of course, a large read chunk will make the problem either > go away > > >>> countsubst(S('xx'*10), 'xx', 1024) > 10 > > or might make it low probability depending on the data.
[David Rasmussen] > First of all, this isn't a text file, it is a binary file. Secondly, > substrings can overlap. In the sequence 0010010 the substring 0010 > occurs twice. Coincidentally the "always overlap" case seems the easiest to fix. It suffices to replace the count() method with def count_overlap(s, token): pos = -1 n = 0 while 1: try: pos = s.index(token, pos+1) except ValueError: break n += 1 return n Or so I hope, without the thorough tests that are indispensable as we should have learned by now... Peter -- http://mail.python.org/mailman/listinfo/python-list