Unfortunately, there is no way to share memory across nodes. Running out of memory as you describe can be due to several factors, the most typical being:
* a memory leak in the application, or the application simply growing too
  big for the environment
* one rank running slow, causing it to build a backlog of messages that
  eventually consumes too much memory

What you can do:

* add --display-map to your mpirun cmd line to see which MPI ranks are
  running on each node
* run "top" on each node to see which process is getting too large.
  Generally, the pids of the processes run in order from the lowest local
  rank to the highest, since that is the order in which they are spawned,
  so you can figure out which rank is causing the problem.
* run valgrind on that rank. This isn't as hard as it might sound, though
  the cmd line is a tad tricky. If the problem rank is the M-th of N ranks
  in the job, you would do this:

    mpirun -n (M-1) myapp : -n 1 valgrind myapp : -n (N-M) myapp

  Obviously, you can add whatever options you want to the various pieces
  of the cmd line. This will affect the timing, so a race condition might
  be hidden - but it is a start at tracking the problem down.

If you find that this isn't a leak, but rather legitimate behavior, then
you can try using the mapping and ranking options to redistribute the
processes - maybe oversubscribe some of the nodes, or increase the size
of the allocation so the memory hog can run alone.

HTH
Ralph

On Tue, Nov 18, 2014 at 7:10 AM, Jerry Mersel <jerry.mer...@weizmann.ac.il> wrote:

> Hi all:
>
> I am running openmpi 1.6.5 and a job which is memory intensive.
> The job runs on 7 hosts using 16 cores on each. On one of the hosts the
> memory is exhausted, so the kernel starts to kill the processes.
>
> It could be that there is plenty of free memory on one of the other
> hosts. Is there a way for openmpi to use the memory on one of the other
> hosts when the memory on one host is gone?
>
> If yes, please tell me how.
>
> Thank you.
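To make the valgrind trick above concrete, here is a small sketch. The rank counts (N=112, matching the 7 hosts x 16 cores in the question) and the suspect position M=41 are assumptions for illustration, as is the application name `myapp`; note that Open MPI's MPMD syntax uses a single mpirun with colon-separated app contexts, and only the middle context runs under valgrind:

```shell
# Hypothetical values: N ranks total, the M-th rank (counting from 1)
# is the one with the memory problem.
N=112   # 7 hosts x 16 cores, as in the original question
M=41    # assumed position of the suspect rank
BEFORE=$((M - 1))   # ranks launched before the suspect
AFTER=$((N - M))    # ranks launched after it

# One mpirun, three colon-separated app contexts (MPMD syntax);
# only the middle context is wrapped in valgrind.
echo "mpirun -n ${BEFORE} ./myapp : -n 1 valgrind --leak-check=full ./myapp : -n ${AFTER} ./myapp"
```

Any extra mpirun options (binding, hostfile, etc.) can be added per context as usual.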
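The "run top on each node" step can also be scripted rather than done interactively. A minimal sketch using standard procps options, sorted by resident set size; in practice you would run this on each compute node, e.g. over ssh:

```shell
# Show the five largest processes by resident memory (RSS, in KB) on
# the local node; repeat on every node to spot the oversized rank.
ps -eo pid,rss,comm --sort=-rss | head -n 5
```

Since ranks are spawned in local-rank order, matching the oversized pid against the --display-map output identifies the guilty rank.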
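For the redistribution option, a sketch of what a thinner layout might look like: with a larger allocation (say 14 nodes instead of 7, an assumed number), mpirun's -npernode option caps the ranks placed on each node, halving each node's memory footprint. The command is only echoed here; `myapp` is a placeholder:

```shell
# Same 112 ranks, but spread 8 per node over 14 nodes instead of
# 16 per node over 7 - each node now hosts half the memory load.
echo "mpirun -npernode 8 -n 112 ./myapp"
```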
>
> With Blessings, always,
>
> Jerry Mersel
>
> System Administrator
> IT Infrastructure Branch | Division of Information Systems
> Weizmann Institute of Science
> Rehovot 76100, Israel
> Tel: +972-8-9342363
>
> "allow our heart, the heart of each and every one of us, to see the
> good qualities of our friends and not their shortcomings..."
>   --Reb Elimelech of Lizhensk
>
> "We learn something by doing it. There is no other way."
>   --John Holt
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25839.php