Hi, I'm just following up on this to say that the problem was not
related to preconnection itself, but to very large memory usage in
jobs with high core counts.

Preconnecting merely fired off a large number of isend/irecv
messages, which was enough to trigger the memory consumption.
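
For anyone following along, the pattern I mean is roughly the ring of
small nonblocking send/receive pairs sketched below. This is only an
illustration of the idea, not the actual ompi_mpi_preconnect.c code:

    /* Sketch only: force a connection to every peer by exchanging a
     * tiny nonblocking message with each other rank. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char sbuf = 0, rbuf = 0;
        for (int i = 1; i < size; i++) {
            int to   = (rank + i) % size;
            int from = (rank - i + size) % size;
            MPI_Request reqs[2];
            MPI_Irecv(&rbuf, 1, MPI_CHAR, from, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(&sbuf, 1, MPI_CHAR, to,   0, MPI_COMM_WORLD, &reqs[1]);
            /* Waiting inside the loop keeps only one pair in flight;
             * posting messages to all peers up front is where the
             * per-connection memory adds up. */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }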

I tried experimenting a bit with XRC, mostly just by copying the
values specified in the FAQ here:

http://www.open-mpi.org/faq/?category=openfabrics#ib-receive-queues

but it seems that I brought down some nodes in the process!
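
In case it helps anyone reproduce this, what I was trying was along
the lines of the FAQ's XRC example; the queue sizes below are
illustrative values of that form, not something I have tuned:

    mpirun -mca btl_openib_receive_queues \
        X,128,256,192,128:X,2048,1024,1008,64:X,12288,512,512,32:X,65536,512,512,32 \
        -np 5600 mpi_init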

Is this the right way to reduce my memory consumption per node? Is
there some other way to go about it? (Or a safe way that doesn't cause
kernel panics? :) )

On Wed, Oct 22, 2014 at 1:40 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> At those sizes it is possible you are running into resource
> exhaustion issues. Some of the resource exhaustion code paths still lead
> to hangs. If the code does not need to be fully connected I would
> suggest not using mpi_preconnect_mpi but instead tracking down why the
> initial MPI_Allreduce hangs. I would suggest the stack trace analysis
> tool (STAT). It might help you narrow down where the problem is
> occurring.
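>
> (If memory serves, the command-line front end is stat-cl; a rough
> sketch of the usual invocation is below, with exact options varying
> by site, so treat it as an illustration rather than a recipe:)
>
>     # Attach to the hung job via the launcher's PID; the merged
>     # per-rank stack traces can then be browsed with stat-view.
>     stat-cl <pid of mpirun>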
>
> -Nathan Hjelm
> HPC-5, LANL
>
> On Tue, Oct 21, 2014 at 01:12:21PM +1100, Marshall Ward wrote:
>> Thanks, it's at least good to know that the behaviour isn't normal!
>>
>> Could it be some sort of memory leak in the call? The code in
>>
>>     ompi/runtime/ompi_mpi_preconnect.c
>>
>> looks reasonably safe, though maybe doing thousands of isend/irecv
>> pairs is causing problems with the buffers used for point-to-point
>> (ptp) messages?
>>
>> I'm trying to see if valgrind can see anything, but nothing from
>> ompi_init_preconnect_mpi is coming up (although there are some other
>> warnings).
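>>
>> (Roughly what I'm running, in case I'm holding it wrong; the
>> valgrind flags are just the usual memcheck options, nothing Open MPI
>> specific:)
>>
>>     mpirun -np 4 -mca mpi_preconnect_mpi 1 \
>>         valgrind --leak-check=full --track-origins=yes ./mpi_init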
>>
>>
>> On Sun, Oct 19, 2014 at 2:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> >
>> >> On Oct 17, 2014, at 3:37 AM, Marshall Ward <marshall.w...@gmail.com> 
>> >> wrote:
>> >>
>> >> I currently have a numerical model that, for reasons unknown, requires
>> >> preconnection to avoid hanging on an initial MPI_Allreduce call.
>> >
>> > That is indeed odd - it might take a while for all the connections to 
>> > form, but it shouldn’t hang
>> >
>> >> But
>> >> when we try to scale out beyond around 1000 cores, we are unable to
>> >> get past MPI_Init's preconnection phase.
>> >>
>> >> To test this, I have a basic C program containing only MPI_Init() and
>> >> MPI_Finalize() named `mpi_init`, which I compile and run using `mpirun
>> >> -mca mpi_preconnect_mpi 1 mpi_init`.
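>> >>
>> >> (For reference, a minimal version of that test program; this is
>> >> the obvious few lines, not necessarily character-for-character
>> >> what I ran:)
>> >>
>> >>     #include <mpi.h>
>> >>
>> >>     int main(int argc, char **argv)
>> >>     {
>> >>         MPI_Init(&argc, &argv);
>> >>         MPI_Finalize();
>> >>         return 0;
>> >>     }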
>> >
>> > I doubt preconnect has been tested in a rather long time as I’m unaware of 
>> > anyone still using it (we originally provided it for some legacy code that 
>> > otherwise took a long time to initialize). However, I could give it a try 
>> > and see what happens. FWIW: because it was so targeted and hasn’t been 
>> > used in a long time, the preconnect algo is really not very efficient. 
>> > Still, it shouldn’t have anything to do with memory footprint.
>> >
>> >>
>> >> This preconnection seems to consume a large amount of memory, and it
>> >> exceeds the available memory on our nodes (~2 GiB/core) as the core
>> >> count gets into the thousands (~4000 or so). If we try to preconnect
>> >> around ~6000 cores, we start to see hangs and crashes.
>> >>
>> >> A failed 5600-core preconnection gave this warning (~10k times) while
>> >> hanging for 30 minutes:
>> >>
>> >>    [warn] opal_libevent2021_event_base_loop: reentrant invocation.
>> >> Only one event_base_loop can run on each event_base at once.
>> >>
>> >> A failed 6000-core preconnection job crashed almost immediately with
>> >> the following error.
>> >>
>> >>    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> >> file ras_tm_module.c at line 159
>> >>    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> >> file ras_tm_module.c at line 85
>> >>    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> >> file base/ras_base_allocate.c at line 187
>> >
>> > This doesn’t have anything to do with preconnect - it indicates that 
>> > mpirun was unable to open the Torque allocation file. However, it 
>> > shouldn’t have “crashed”, but instead simply exited with an error message.
>> >
>> >>
>> >> Should we expect to use very large amounts of memory for
>> >> preconnections of thousands of CPUs? And can these requirements be
>> >> reduced?
>> >>
>> >> I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) with an FDR
>> >> InfiniBand network. This is probably not enough information, but I'll
>> >> try to provide more if necessary. My knowledge of the implementation
>> >> is unfortunately very limited.