I have an application running across 20 machines where each machine has 60 GB RAM. For some large inputs, some ranks require 45-50 GB RAM. The behavior I'm seeing is that for some of these large cases, my application will run for 10-15 minutes and then one rank will be killed; based on watching top in the past, the application's memory usage gradually increases until it eventually hits 60 GB and is killed (presumably by the OOM killer).
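For reference, here is roughly the kind of per-rank check I have in mind to watch this more closely than top does, just sampling VmRSS out of /proc/self/status around the big phases (the helper names are mine, not anything from the real code):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: read this process's resident set size (VmRSS) from
 * /proc/self/status and print it tagged with the MPI rank. */
static long rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

static void log_rss(const char *tag)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] %s: VmRSS = %ld kB\n", rank, tag, rss_kb());
    fflush(stdout);
}

Calling log_rss("before exchange") / log_rss("after exchange") around the communication phases should at least tell me which rank grows and when.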
There are a few possibilities that come to mind:

1. While I compute all memory requirements upfront and allocate one large ping/pong buffer to reuse throughout the application, there are some other (believed to be small) allocations here and there. For large inputs, some of these may not be quite as small as I think.
2. There's a memory leak.
3. Open MPI is allocating very large buffers for transferring data, potentially because I am *not* using synchronous sends anywhere in the application.

I can track down 1 and 2 myself, but I'm wondering whether there's some kind of debug/logging mode I can run in to see Open MPI's buffer management (for 3, the quick check I have in mind is sketched in the P.S. below). All I really care about is the total amount of memory Open MPI allocates, but if I need to parse a list of buffers and sizes to infer the total, that's fine.

Thanks for the help.

-Adam
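P.S. For possibility 3, the quick experiment I can think of is to force synchronous sends and see whether the growth disappears. Something like the wrapper below (the wrapper name and environment variable are mine) would let me flip between MPI_Send and MPI_Ssend without touching every call site; if memory stays flat with MPI_Ssend, that would point at eager/unexpected-message buffering rather than my own allocations:

#include <mpi.h>
#include <stdlib.h>

/* Sketch of a drop-in send wrapper: set USE_SSEND=1 in the environment
 * to route point-to-point sends through MPI_Ssend instead of MPI_Send. */
static int my_send(const void *buf, int count, MPI_Datatype type,
                   int dest, int tag, MPI_Comm comm)
{
    const char *s = getenv("USE_SSEND");
    if (s && *s == '1')
        return MPI_Ssend(buf, count, type, dest, tag, comm);
    return MPI_Send(buf, count, type, dest, tag, comm);
}

But I'd still prefer a way to see Open MPI's own accounting directly, if such a debug/logging mode exists.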