On 03/07/2019 18.37, Israel Brewster wrote:
> I have a script that benefits greatly from multiprocessing (it’s
> generating a bunch of images from data). Of course, as expected, each
> process uses a chunk of memory, and the more processes there are, the
> more memory is used. The amount used per process can vary from around
> 3 GB (yes, gigabytes) to over 40 or 50 GB, depending on the amount of
> data being processed (usually closer to 10 GB; the 40/50 is fairly
> rare). This puts me in a position of needing to balance the number of
> processes against memory usage, such that I maximize resource
> utilization (running one process at a time would simply take WAY too
> long) while not overloading RAM (which at best would slow things down
> due to swap).
>
> Obviously this process will be run on a machine with lots of RAM, but
> as I don’t know how large the datasets that will be fed to it are, I
> wanted to see if I could build some intelligence into the program such
> that it doesn’t overload the memory. A couple of approaches I thought
> of:
>
> 1) Determine the total amount of RAM in the machine (how?), assume an
> average of 10 GB per process, and only launch as many processes as
> calculated to fit. Easy, but it would run the risk of under-utilizing
> the processing capabilities and taking longer to run if most of the
> processes were using significantly less than 10 GB.
>
> 2) Somehow monitor the memory usage of the various processes, and if
> one process needs a lot, pause the others until that one is complete.
> Of course, I’m not sure if this is even possible.
>
> 3) Other approaches?
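For your option 1, the standard library doesn't expose total RAM
directly, but the third-party psutil package does, and sizing a
multiprocessing.Pool from it only takes a few lines. A rough, untested
sketch (the 10 GB figure and the make_image()/datasets names are just
stand-ins for your own code):

import multiprocessing

import psutil  # third-party: pip install psutil

ASSUMED_GB_PER_PROCESS = 10  # your estimated average

def n_workers():
    """Pick a pool size from total RAM and CPU count."""
    total_gb = psutil.virtual_memory().total / 2**30
    fit_in_ram = max(1, int(total_gb // ASSUMED_GB_PER_PROCESS))
    return min(fit_in_ram, multiprocessing.cpu_count())

def make_image(dataset):      # stand-in for your per-dataset work
    ...

if __name__ == "__main__":
    datasets = []             # stand-in for whatever you feed the workers
    with multiprocessing.Pool(processes=n_workers()) as pool:
        pool.map(make_image, datasets)

Your option 2 is possible too: psutil can report each child's resident
size and can suspend/resume a process (SIGSTOP/SIGCONT on Unix), so a
parent loop can pause everything except the hungriest worker whenever
free memory gets low. Another rough, untested sketch along those lines
(headroom_gb and the 1-second poll interval are arbitrary; also beware
that suspending a worker that holds a shared lock can stall the others):

import time

import psutil  # third-party: pip install psutil

def babysit(children, headroom_gb=5):
    """children: list of already-started multiprocessing.Process objects."""
    procs = {c.pid: psutil.Process(c.pid) for c in children}
    paused = set()
    while any(c.is_alive() for c in children):
        alive = {pid: p for pid, p in procs.items() if p.is_running()}
        avail_gb = psutil.virtual_memory().available / 2**30
        if avail_gb < headroom_gb and len(alive) > 1:
            # Memory is tight: keep the biggest consumer running,
            # suspend the rest until things ease off.
            biggest = max(alive, key=lambda pid: alive[pid].memory_info().rss)
            for pid, p in alive.items():
                if pid != biggest and pid not in paused:
                    p.suspend()
                    paused.add(pid)
        elif avail_gb > 2 * headroom_gb and paused:
            # Plenty of room again: let the suspended workers continue.
            for pid in list(paused):
                if pid in alive:
                    alive[pid].resume()
                paused.discard(pid)
        time.sleep(1)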
Are you familiar with Dask? <https://docs.dask.org/en/latest/>

I don't know it myself other than through hearsay, but I have a feeling
it may have a ready-to-go solution to your problem. You'd have to look
into Dask in more detail than I have...
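In case it helps, dask.distributed has a LocalCluster that takes a
per-worker memory_limit, and its workers try to stay under that cap
(spilling to disk or pausing work) instead of letting the machine swap.
Something like this, going by the docs, again untested and with
make_image()/datasets as placeholders:

from dask.distributed import Client, LocalCluster

def make_image(dataset):      # stand-in for your per-dataset work
    ...

if __name__ == "__main__":
    datasets = []             # stand-in for whatever you feed the workers
    cluster = LocalCluster(n_workers=4, threads_per_worker=1,
                           memory_limit="10GB")   # cap is per worker
    client = Client(cluster)
    futures = client.map(make_image, datasets)
    client.gather(futures)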