On 06.07.2011 at 07:54, Phlip wrote:
Pythonistas:

Consider this hashing code:

   import hashlib
   file = open(path)
   m = hashlib.md5()
   m.update(file.read())
   digest = m.hexdigest()
   file.close()

If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

So if I do the stream trick - read one byte, update one byte, in a
loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
CPU. So that's the same problem; it would still be slow.

Yes. That is why you should read with a reasonable block size: not too small and not too big.

def filechunks(f, size=8192):
    while True:
        s = f.read(size)
        if not s: break
        yield s
#    f.close() # maybe...

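import hashlib
m = hashlib.md5()
# Binary mode: hash the bytes exactly as they are stored on disk.
# The with block closes the file even if reading fails.
with open(path, 'rb') as file:
    for chunk in filechunks(file):
        m.update(chunk)
digest = m.hexdigest()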

So you are reading in 8 KiB chunks. Feel free to modify this: maybe use os.stat(path).st_blksize instead (which is AFAIK the recommended minimum; note that os.stat() wants the path, not the file object), or a value of about 1 MiB...
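To make the block size easy to experiment with, the loop can be wrapped in a small helper. This is only a sketch: the name hash_file and its parameters are made up for illustration, and the 1 MiB default is just one of the values suggested above, not a measured optimum.

```python
import hashlib


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, read in chunks.

    `hash_file`, `algorithm` and `chunk_size` are illustrative names,
    not part of hashlib.
    """
    h = hashlib.new(algorithm)
    # Binary mode, so the bytes are hashed exactly as stored on disk.
    with open(path, "rb") as f:
        # iter() with a sentinel calls f.read(chunk_size) repeatedly
        # and stops at the first empty bytes object, i.e. at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

With that, digest = hash_file(path) replaces the whole snippet, and at most chunk_size bytes of the file are held in memory at a time.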


So now I try this:

   sum = os.popen('sha256sum %r' % path).read()

This is not as nice as the above, especially not with a path containing strange characters. What about, at least,

def shellquote(*strs):
        return " ".join([
                "'"+st.replace("'","'\\''")+"'"
                for st in strs
        ])

sum = os.popen('sha256sum %s' % shellquote(path)).read()


or, even better,

import subprocess
sp = subprocess.Popen(['sha256sum', path],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
sp.stdin.close() # generate EOF
sum = sp.stdout.read()
sp.wait()

?
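Note that what comes back from sha256sum is the whole output line (digest, two spaces, file name), not just the digest. A sketch of a wrapper that extracts the digest could look like the following; the function name sha256sum_external is made up here, and it assumes a sha256sum binary on the PATH that behaves like the GNU coreutils one.

```python
import subprocess


def sha256sum_external(path):
    """Return the hex digest reported by the external sha256sum tool.

    Illustrative helper; assumes a GNU-style sha256sum is installed.
    """
    # An argument list bypasses the shell, so no quoting is needed.
    # "--" keeps a path starting with "-" from being read as an option.
    out = subprocess.check_output(["sha256sum", "--", path])
    # The output line is "<hexdigest>  <file name>"; take the first field.
    first = out.split()[0].decode("ascii")
    # GNU sha256sum starts the line with a backslash when the file name
    # contains a backslash or newline; drop it if present.
    return first.lstrip("\\")
```

check_output() raises CalledProcessError if sha256sum exits with a non-zero status, for example when the file does not exist, so a failure cannot be mistaken for a digest.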


Does hashlib have a file-ready mode, to hide the streaming inside some
clever DMA operations?

AFAIK not.
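That holds for the Python versions current in 2011. Python 3.11 and later do provide hashlib.file_digest(), which takes an open binary file object and does the chunked reading itself. A sketch that uses it when available and falls back to a manual loop otherwise (digest_of_file and the 64 KiB fallback block size are arbitrary choices for illustration):

```python
import hashlib


def digest_of_file(path, algorithm="md5"):
    """Hex digest of a file, using hashlib.file_digest() if present.

    `digest_of_file` is an illustrative name, not a hashlib function.
    """
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: hashlib reads the file in chunks itself.
            return hashlib.file_digest(f, algorithm).hexdigest()
        # Older versions: read fixed-size blocks by hand.
        h = hashlib.new(algorithm)
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
        return h.hexdigest()
```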


Thomas
--
http://mail.python.org/mailman/listinfo/python-list