"Paul Watson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] >>I want to scan a file byte for byte for occurences of the the four byte >> pattern 0x00000100. I've tried with this: >> >> # start >> import sys >> >> numChars = 0 >> startCode = 0 >> count = 0 >> >> inputFile = sys.stdin >> >> while True: >> ch = inputFile.read(1) >> numChars += 1 >> >> if len(ch) < 1: break >> >> startCode = ((startCode << 8) & 0xffffffff) | (ord(ch)) >> if numChars < 4: continue >> >> if startCode == 0x00000100: >> count = count + 1 >> >> print count >> # end >> >> But it is very slow. What is the fastest way to do this? Using some >> native call? Using a buffer? Using whatever? >> >> /David
Here is a better one that counts, and not just detects, the substring. This is -much- faster than using mmap; especially for a large file that may cause paging to start. Using mmap can be -very- slow. #!/usr/bin/env python import sys fn = 't2.dat' ss = '\x00\x00\x01\x00' be = len(ss) - 1 # length of overlap to check blocksize = 64 * 1024 # need to ensure that blocksize > overlap fp = open(fn, 'rb') b = fp.read(blocksize) count = 0 while len(b) > be: count += b.count(ss) b = b[-be:] + fp.read(blocksize) fp.close() print count sys.exit(0) -- http://mail.python.org/mailman/listinfo/python-list