On Mon, 31 Oct 2005 09:19:10 +0100, Peter Otten <[EMAIL PROTECTED]> wrote:
>Bengt Richter wrote: > >> I still smelled a bug in the counting of substring in the overlap region, >> and you motivated me to find it (obvious in hindsight, but aren't most ;-) >> >> A substring can get over-counted if the "overlap" region joins >> infelicitously with the next input. E.g., try counting 'xx' in 10*'xx' >> with a read chunk of 4 instead of 1024*1024: >> >> Assuming corrections so far posted as I understand them: >> >> >>> def byblocks(f, blocksize, overlap): >> ... block = f.read(blocksize) >> ... yield block >> ... if overlap>0: >> ... while True: >> ... next = f.read(blocksize-overlap) >> ... if not next: break >> ... block = block[-overlap:] + next >> ... yield block >> ... else: >> ... while True: >> ... next = f.read(blocksize) >> ... if not next: break >> ... yield next >> ... >> >>> def countsubst(f, subst, blksize=1024*1024): >> ... count = 0 >> ... for block in byblocks(f, blksize, len(subst)-1): >> ... count += block.count(subst) >> ... f.close() >> ... return count >> ... >> >> >>> from StringIO import StringIO as S >> >>> countsubst(S('xx'*10), 'xx', 4) >> 13 >> >>> ('xx'*10).count('xx') >> 10 >> >>> list(byblocks(S('xx'*10), 4, len('xx')-1)) >> ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx'] >> >> Of course, a large read chunk will make the problem either >> go away >> >> >>> countsubst(S('xx'*10), 'xx', 1024) >> 10 >> >> or might make it low probability depending on the data. > >[David Rasmussen] > >> First of all, this isn't a text file, it is a binary file. Secondly, >> substrings can overlap. In the sequence 0010010 the substring 0010 >> occurs twice. The OP didn't reply to my post re the above for some reason http://groups.google.com/group/comp.lang.python/msg/dd4125bf38a54b7c?hl=en& > >Coincidentally the "always overlap" case seems the easiest to fix. It >suffices to replace the count() method with > >def count_overlap(s, token): > pos = -1 > n = 0 > while 1: > try: > pos = s.index(token, pos+1) > except ValueError: > break > n += 1 > return n > >Or so I hope, without the thorough tests that are indispensable as we should >have learned by now... > Unfortunately, there is such a thing as a correct implementation of an incorrect spec ;-) I have some doubts about the OP's really wanting to count overlapping patterns as above, which is what I asked about in the above referenced post. Elsewhere he later reveals: [David Rasmussen] >> If you must know, the above one-liner actually counts the number of >> frames in an MPEG2 file. I want to know this number for a number of >> files for various reasons. I don't want it to take forever. In which case I doubt whether he wants to count as above. Scanning for the particular 4 bytes would assume that non-frame-marker data is escaped one way or another so it can't contain the marker byte sequence. (If it did, you'd want to skip it, not count it, I presume). Robust streaming video format would presumably be designed for unambigous re-synching, meaning the data stream can't contain the sync mark. But I don't know if that is guaranteed in conversion from file to stream a la HDLC or some link packet protocol or whether it is actually encoded with escaping in the file. If framing in the file is with length-specifying packet headers and no data escaping, then the filebytes.count(pattern) approach is not going to do the job reliably, as Lasse was pointing out. Requirements, requirements ;-) Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list