Thank you for your response.
I will investigate further.

With Blessings, always,

   Jerry Mersel


   System Administrator
   IT Infrastructure Branch | Division of Information Systems
    Weizmann Institute of Science
    Rehovot 76100, Israel

   Tel:  +972-8-9342363

“allow our heart, the heart of each and every one of us, to see the good 
qualities of our friends and not their shortcomings...”
      --Reb Elimelech of Lizhensk

"We learn something by doing it. There is no other way."
—John Holt


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, November 18, 2014 5:56 PM
To: Open MPI Users
Subject: Re: [OMPI users] job running out of memory

Unfortunately, there is no way to share memory across nodes. Running out of 
memory as you describe can be due to several factors, the most typical being:

* a memory leak in the application, or the application simply growing too big 
for the environment

* one rank running slowly, causing it to build up a backlog of messages that 
eventually consumes too much memory

What you can do:

* add --display-map to your mpirun cmd line to see what MPI ranks are running 
on each node

* run "top" on each node to see which process is getting too large. Generally, 
the pids of the processes are assigned in order from the lowest local rank to 
the highest, since that is the order in which they are spawned, so you can 
figure out which rank is causing the problem.

* run valgrind on that rank. This isn't as hard as it might sound, though the 
cmd line is a tad tricky. If rank M (out of a total job of N ranks) is the one 
with the memory problem, then you would do this:

mpirun -n (M-1) myapp : -n 1 valgrind myapp : -n (N-M) myapp

Obviously, you can add whatever options you want to the various pieces of the 
cmd line (a worked example follows below).

This will affect the timing, so a race condition might be hidden - but it is a 
start at tracking it down.
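
For a concrete sketch (the rank number, binary name, and valgrind option below 
are made up for illustration): with 112 ranks (7 hosts x 16 cores each) and the 
suspect being rank 37 counting from 1, the command line would look like

mpirun --display-map -n 36 ./myapp : -n 1 valgrind --leak-check=full ./myapp : -n 75 ./myapp

Note that mpirun itself appears only once; each colon-separated section just 
carries the per-app options and the program to launch, and ranks are assigned 
in the order the sections are listed (36 + 1 + 75 = 112).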

If you find that this isn't a leak, but rather legitimate behavior, then you 
can try using the mapping and ranking options to redistribute the processes: 
maybe oversubscribe some of the nodes, or increase the size of the allocation 
so the memory hog can run alone.
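
As a sketch of that last idea, assuming the allocation were grown from 7 to 8 
hosts and using the mapping flags of the 1.6 series (1.7 and later replace 
these with --map-by/--rank-by), you could thin out the ranks per node or spread 
them round-robin:

mpirun -npernode 14 -n 112 ./myapp    # 14 ranks per node across 8 hosts instead of 16 across 7
mpirun --bynode -n 112 ./myapp        # place ranks round-robin across nodes instead of filling each host in turn

Here ./myapp and the rank counts are placeholders; adjust them to your own 
binary and allocation.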

HTH
Ralph


On Tue, Nov 18, 2014 at 7:10 AM, Jerry Mersel 
<jerry.mer...@weizmann.ac.il> wrote:

Hi all:

  I am running Open MPI 1.6.5 and a job which is memory intensive.
  The job runs on 7 hosts using 16 cores on each. On one of the hosts the memory 
is exhausted, so the kernel starts to kill the processes.

  It could be that there is plenty of free memory on one of the other hosts.
  Is there a way for Open MPI to use the memory on one of the other hosts when 
the memory on one host is gone?

  If yes, please tell me how.

   Thank you.


With Blessings, always,

   Jerry Mersel


   System Administrator
   IT Infrastructure Branch | Division of Information Systems
    Weizmann Institute of Science
    Rehovot 76100, Israel

   Tel:  +972-8-9342363

“allow our heart, the heart of each and every one of us, to see the good 
qualities of our friends and not their shortcomings...”
      --Reb Elimelech of Lizhensk

"We learn something by doing it. There is no other way."
—John Holt



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/11/25839.php
