On Sun, Feb 14, 2016 at 12:44 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a...@netcabo.pt> wrote:
>> What happens if, after hashing each file (and returning from this
>> function), you call gc.collect()? If that reduces your RAM usage, you
>> have reference cycles somewhere.
>>
> I have used gc and del. No luck.
>
> The most probable cause seems to be hashlib not correctly handling big
> buffer updates. I am working on one computer and testing on another.
> For the second part, maybe I somehow forgot to transfer the change to
> the other computer. Unlikely but possible.
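
A quick way to check the reference-cycle theory is to force a collection
after each file and see how much gc actually frees. This is only a rough
sketch: hash_file() and the example paths below stand in for whatever the
real per-file hashing code looks like.

import gc
import hashlib

def hash_file(path, chunk_size=4 * 1024 * 1024):
    # Stand-in for the real per-file routine: read in big chunks and
    # feed them to hashlib.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for path in ["file1.bin", "file2.bin"]:   # example paths only
    print(path, hash_file(path))
    # gc.collect() returns the number of unreachable objects it found;
    # a consistently nonzero count here means cycles were keeping
    # memory alive until the collector ran.
    print("gc.collect() found", gc.collect(), "unreachable objects")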
I'd like to see the problem boiled down to just the hashlib calls.
Something like this:

import hashlib

data = b"*" * 4*1024*1024
lastdig = None
while "simulating files":
    h = hashlib.sha256()
    hu = h.update
    for chunk in range(100):
        hu(data)
    dig = h.hexdigest()
    if lastdig is None:
        lastdig = dig
        print("Digest:", dig)
    else:
        if lastdig != dig:
            print("Digest fail!")

Running this on my system (Python 3.6 on Debian Linux) produces a
long-running process with stable memory usage, which is exactly what
I'd expect. Even using different data doesn't change that:

import hashlib
import itertools

byte = itertools.count()
data = b"*" * 4*1024*1024
while "simulating files":
    h = hashlib.sha256()
    hu = h.update
    for chunk in range(100):
        hu(data + bytes([next(byte) & 255]))
    dig = h.hexdigest()
    print("Digest:", dig)

Somewhere between my code and yours is something that consumes all that
memory. Can you neuter the actual disk reading (replacing it with
constants, like this) and make a complete and shareable program that
leaks all that memory?

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
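
For concreteness, a "neutered" shareable program might look roughly like
the sketch below. The hash_chunks() helper, the chunk size, and the file
count are all made up for illustration, since the actual reading code
hasn't been posted.

import hashlib

CHUNK = b"*" * (4 * 1024 * 1024)   # constant data instead of disk reads
CHUNKS_PER_FILE = 100              # pretend every "file" is 400 MB
NUM_FILES = 1000                   # arbitrary number of simulated files

def hash_chunks(chunks):
    # Same shape as a real implementation, but fed constant data.
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

digests = {}
for n in range(NUM_FILES):
    # A real program would read chunks from disk here; using a constant
    # keeps the I/O layer out of the picture, so only hashlib and the
    # surrounding bookkeeping can be responsible for any memory growth.
    digests["file%d" % n] = hash_chunks(CHUNK for _ in range(CHUNKS_PER_FILE))

print("Hashed", len(digests), "simulated files")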