At 02:21 on 14-02-2016, Steven D'Aprano wrote:
> On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:
...
Thanks Steven for your advice.

This is a small script to solve a specific problem. It will be used in
the future to solve other similar problems, probably with small changes.
When I found it eating memory, and it still ate memory after what I
thought was the first cause had been fixed, I started thinking of
something less obvious. After all it seems there is nothing wrong with
it (see my other post).

> That's your first clue that, perhaps, you should be reading in relatively
> small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
> shows that typically you should read from files in small-ish chunks, and
> that trying to read in large chunks is often counter-productive:
>
> https://duckduckgo.com/html/?q=file+read+buffer+size
>
> The first three links all talk about optimal sizes being measured in small
> multiples of 4K, not 40MB.

I didn't know about this! Most of my files are around 30MB or larger, so
I chose 40MB to avoid Python loops. After all, I assumed Python would be
able to optimize those things.

> You can try to increase the system buffer, by changing the "open" line to:
>
> with open(pathname, 'rb', buffering=40*M) as f:

This is another thing. One thing is the amount of data I request per
read; another is choosing the actual buffer size. (I didn't know about
this argument - thanks.)

...

> By the way, do you need a cryptographic checksum? sha256 is expensive to
> calculate. If all you are doing is trying to match files which could have
> the same content, you could use a cheaper hash, like md5 or even crc32.

I don't know the collision probability of each of them. The script has
sha256 and md5 as options; for the failed execution I had chosen sha256.
I didn't check whether it takes much more time. A collision might cause
data loss, so ...

Thank you.
Paulo
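
For reference, a minimal sketch of the chunked hashing discussed above,
assuming the script's per-file loop looks roughly like this (pathname,
the 64K chunk size and the buffering value are illustrative, not the
script's actual values):

    import hashlib

    M = 1024 * 1024

    def file_digest(pathname, algo="sha256", chunk_size=64 * 1024):
        # algo is "sha256" or "md5", matching the script's two options.
        h = hashlib.new(algo)
        # buffering= only sizes Python's internal buffer; each read()
        # below still asks for just chunk_size bytes at a time.
        with open(pathname, "rb", buffering=1 * M) as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

Reading a small chunk per call keeps memory flat regardless of file
size, and the loop overhead is negligible because the hashing itself
dominates the cost.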
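
A quick, rough way to check whether sha256 really costs much more than
md5 on this machine (the block size and repeat count here are
arbitrary):

    import hashlib, os, timeit

    data = os.urandom(1024 * 1024)  # one 1MB block of random bytes

    for algo in ("md5", "sha256"):
        t = timeit.timeit(lambda: hashlib.new(algo).update(data), number=100)
        print(algo, "%.3f s for 100 x 1MB" % t)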