Gilles,

It is btl/tcp (we'll be upgrading to newer EC2 instance types next year to
take advantage of libfabric).  To say this definitively, I still need to
write a script that logs and timestamps the memory usage of the process as
reported by /proc/<pid>/stat and syncs that up with the application's log
of what it's doing (a rough sketch of what I have in mind is below).  But
based on what I've watched in 'top' so far, I think these big allocations
are happening in two areas where I call MPI_Allgatherv() - every rank has
roughly 1/numRanks of the data (but not divided exactly evenly, so I need
to use MPI_Allgatherv).  The ranks reuse a pre-allocated buffer to store
their local results and then pass that same buffer into MPI_Allgatherv()
to bring in results from all ranks (a simplified sketch of the pattern is
also below), so there is a lot of communication across all ranks at these
points.  Does your comment about using the coll/sync module apply in this
case?  I'm not familiar with that module - is it something I specify at
OpenMPI compile time, or a runtime option that I enable?
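
For the logging piece, something along these lines is what I have in mind
(a rough, untested sketch - the field positions come from proc(5), and the
PID argument and one-second poll interval are just placeholders for however
I end up wiring it into the job):

/* memlog.c - rough sketch: poll /proc/<pid>/stat once a second and print a
 * timestamp plus the process's virtual size and resident set size.
 * Per proc(5), vsize is field 23 (bytes) and rss is field 24 (pages). */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
    const long page_kb = sysconf(_SC_PAGESIZE) / 1024;

    for (;;) {
        char line[4096];
        FILE *fp = fopen(path, "r");
        if (!fp || !fgets(line, sizeof(line), fp)) {
            fprintf(stderr, "could not read %s (process gone?)\n", path);
            if (fp) fclose(fp);
            return 1;
        }
        fclose(fp);

        /* comm (field 2) is wrapped in parentheses and may contain spaces,
         * so parse from the last ')' and then skip ahead to field 23. */
        char *p = strrchr(line, ')');
        unsigned long long vsize = 0, rss_pages = 0;
        for (int i = 0; p != NULL && i < 21; ++i)
            p = strchr(p + 1, ' ');
        if (p != NULL)
            sscanf(p + 1, "%llu %llu", &vsize, &rss_pages);

        /* Timestamp each sample so it can be lined up with the app's log. */
        time_t now = time(NULL);
        char stamp[32];
        strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%S", localtime(&now));
        printf("%s vsize_kb=%llu rss_kb=%llu\n",
               stamp, vsize / 1024, rss_pages * (unsigned long long)page_kb);
        fflush(stdout);

        sleep(1);
    }
}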
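
And for reference, the MPI_Allgatherv() pattern looks roughly like the
following.  This is a simplified sketch rather than my actual code - the
buffer sizes and names are made up, and I've written it with MPI_IN_PLACE
since each rank's local results are already sitting in its own slice of the
shared pre-allocated buffer:

/* allgatherv_sketch.c - simplified illustration of the pattern described
 * above: each rank owns roughly 1/numRanks of the data (not exactly even),
 * writes its local results into its slice of one pre-allocated buffer, and
 * the same buffer is then filled in from all ranks with MPI_Allgatherv. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Total element count; deliberately not divisible by the rank count. */
    const long long total = 1000003;

    /* Every rank computes everyone's count and displacement the same way. */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    long long offset = 0;
    for (int r = 0; r < size; ++r) {
        long long begin = (total * r) / size;
        long long end   = (total * (r + 1)) / size;
        counts[r] = (int)(end - begin);
        displs[r] = (int)offset;
        offset += counts[r];
    }

    /* One buffer sized for the full result, allocated up front and reused. */
    double *buf = malloc(total * sizeof(double));

    /* Each rank writes its local results directly into its own slice ... */
    for (int i = 0; i < counts[rank]; ++i)
        buf[displs[rank] + i] = rank + 0.001 * i;   /* stand-in for real work */

    /* ... then gathers everyone else's slices into the same buffer.
     * MPI_IN_PLACE tells MPI the local contribution is already in place. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   buf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gathered %lld elements\n", total);

    free(buf);
    free(counts);
    free(displs);
    MPI_Finalize();
    return 0;
}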

Thanks for the detailed help.
-Adam

On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> Are you using btl/tcp (e.g. plain TCP/IP) for internode communications?
> Or are you using libfabric on top of the latest EC2 drivers?
>
> There is no flow control in btl/tcp, which means, for example, that if
> all your nodes send messages to rank 0, this can create a lot of
> unexpected messages on that rank.
> In the case of btl/tcp, this means a lot of malloc() on rank 0 until
> these messages are received by the app.
> If rank 0 is overwhelmed, the node will likely end up swapping to death
> (or your app will be killed if you have little or no swap).
>
> If you are using collective operations, make sure the coll/sync module
> is selected.
> This module inserts an MPI_Barrier() every n collectives on a given
> communicator. This forces your processes to synchronize and can force
> messages to be received. (Think of the previous example if you run
> MPI_Scatter(root=0) in a loop.)
>
> Cheers,
>
> Gilles
>
> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
> >
> > This case is actually quite small - 10 physical machines with 18
> physical cores each, 1 rank per machine.  These are AWS R4 instances (Intel
> Xeon E5 Broadwell processors).  OpenMPI version 2.1.0, using TCP (10 Gbps).
> >
> > I calculate the memory needs of my application upfront (in this case
> ~225 GB per machine), allocate one buffer upfront, and reuse this buffer
> for valid data and scratch space throughout processing.  This is running
> on RHEL 7 - I'm measuring memory usage via top, where I see it go up to
> 248 GB in an MPI-intensive portion of processing.
> >
> > I thought I was being quite careful with my memory allocations and that there
> weren't any other stray allocations going on, but of course it's possible
> there's a large temp buffer somewhere that I've missed... based on what
> you're saying, this is way more memory than should be attributed to OpenMPI
> - is there a way I can query OpenMPI to confirm that?  If the OS is unable
> to keep up with the network traffic, is it possible there's some low-level
> system buffer that gets allocated to gradually work off the TCP traffic?
> >
> > Thanks.
> >
> > On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> How many nodes are you using? How many processes per node? What kind of
> processor? Open MPI version? 25 GB is several orders of magnitude more
> memory than should be used except at extreme scale (1M+ processes). Also,
> how are you calculating memory usage?
> >>
> >> -Nathan
> >>
> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com> wrote:
> >> >
> >> > Is there a way at runtime to query OpenMPI to ask it how much memory
> it's using for internal buffers?  Is there a way at runtime to set a max
> amount of memory OpenMPI will use for these buffers?  I have an application
> where for certain inputs OpenMPI appears to be allocating ~25 GB and I'm
> not accounting for this in my memory calculations (and thus bricking the
> machine).
> >> >
> >> > Thanks.
> >> > -Adam