Adam,

Are you using btl/tcp (i.e. plain TCP/IP) for inter-node
communications, or are you using libfabric on top of the latest EC2 drivers?

There is no flow control in btl/tcp, which means, for example, that if
all your nodes send messages to rank 0, a lot of unexpected messages
can pile up on that rank.
With btl/tcp, each unexpected message is buffered with malloc() on
rank 0 until the application actually receives it.
If rank 0 is overwhelmed, the node will likely end up swapping to
death (or the kernel's OOM killer will take your app down if you have
little or no swap).
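
To make that concrete, here is a minimal sketch of the pattern (not
your code; NITER and MSG_SIZE are made-up values): every rank keeps
posting small eager sends to rank 0, and because btl/tcp never
throttles the senders, rank 0 buffers each early-arriving message with
malloc() until it finally posts the matching receive.

/* Sketch only: all ranks flood rank 0 with small (eager) messages.
 * With btl/tcp there is no flow control, so if rank 0 falls behind,
 * every early-arriving message sits in malloc()'d memory on rank 0
 * until the matching MPI_Recv is posted. */
#include <mpi.h>
#include <stdlib.h>

#define NITER    100000   /* made-up values for illustration */
#define MSG_SIZE 1024

int main(int argc, char **argv)
{
    int rank, size;
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NITER; i++) {
        if (rank != 0) {
            /* eager sends complete immediately, so senders can run
             * arbitrarily far ahead of rank 0 */
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* rank 0 drains one message per sender per iteration; if
             * it is slower than the senders (e.g. it also does I/O),
             * the unexpected-message backlog grows without bound */
            for (int src = 1; src < size; src++) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}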

If you are using collective operations, make sure the coll/sync module
is selected.
This module inserts an MPI_Barrier() every n collectives on a given
communicator. That forces your processes to synchronize, which in turn
forces the pending messages to be received. (Think of the previous
example, but now running MPI_Scatter(root=0) in a loop; a hand-rolled
version is sketched below.)
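
If you prefer to see the effect explicitly, the sketch below does by
hand roughly what coll/sync does for you (SYNC_INTERVAL, NITER and
CHUNK are arbitrary, made-up values): a barrier every few iterations
of the scatter loop keeps the ranks loosely in step, so a fast root
cannot flood the slower ranks with unexpected messages.

/* Sketch: re-synchronize every SYNC_INTERVAL collectives, which is
 * roughly what the coll/sync module does automatically.
 * SYNC_INTERVAL, NITER and CHUNK are made-up values. */
#include <mpi.h>
#include <stdlib.h>

#define NITER         10000
#define CHUNK         4096
#define SYNC_INTERVAL 100

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *recvbuf = malloc(CHUNK * sizeof(double));
    double *sendbuf = NULL;
    if (rank == 0) {
        sendbuf = malloc((size_t)size * CHUNK * sizeof(double));
    }

    for (int i = 0; i < NITER; i++) {
        MPI_Scatter(sendbuf, CHUNK, MPI_DOUBLE,
                    recvbuf, CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* without this, a fast root can run many scatters ahead of a
         * slow rank and flood it with unexpected messages */
        if ((i + 1) % SYNC_INTERVAL == 0) {
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    free(recvbuf);
    free(sendbuf);
    MPI_Finalize();
    return 0;
}

As for enabling coll/sync itself, the exact MCA parameter names can
vary between releases, so check your install with something like
"ompi_info --param coll sync --level 9" (look for the sync priority
and the barrier-every-n-collectives settings) rather than taking my
word for the spelling.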

Cheers,

Gilles

On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
>
> This case is actually quite small - 10 physical machines with 18 physical 
> cores each, 1 rank per machine.  These are AWS R4 instances (Intel Xeon E5 
> Broadwell processors).  OpenMPI version 2.1.0, using TCP (10 Gbps).
>
> I calculate the memory needs of my application upfront (in this case ~225 GB 
> per machine), allocate one buffer upfront, and reuse this buffer for valid 
> data and scratch space throughout processing.  This is running on RHEL 7 - 
> I'm measuring memory usage via top, where I see it go up to 248 GB in an 
> MPI-intensive portion of processing.
>
> I thought I was being quite careful with my memory allocations and that 
> there weren't any other stray allocations going on, but of course it's 
> possible there's a large temp buffer somewhere that I've missed.  Based on 
> what you're saying, this is way more memory than should be attributed to 
> OpenMPI - is there a way I can query OpenMPI to confirm that?  If the OS is 
> unable to keep up with the network traffic, is it possible there's some 
> low-level system buffer that gets allocated to gradually work off the TCP 
> traffic?
>
> Thanks.
>
> On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users 
> <users@lists.open-mpi.org> wrote:
>>
>> How many nodes are you using? How many processes per node? What kind of 
>> processor? Open MPI version? 25 GB is several orders of magnitude more 
>> memory than should be used except at extreme scale (1M+ processes). Also, 
>> how are you calculating memory usage?
>>
>> -Nathan
>>
>> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com> wrote:
>> >
>> > Is there a way at runtime to query OpenMPI to ask it how much memory it's 
>> > using for internal buffers?  Is there a way at runtime to set a max amount 
>> > of memory OpenMPI will use for these buffers?  I have an application where 
>> > for certain inputs OpenMPI appears to be allocating ~25 GB and I'm not 
>> > accounting for this in my memory calculations (and thus bricking the 
>> > machine).
>> >
>> > Thanks.
>> > -Adam