I did some additional profiling of my code.  While the application uses 10
ranks, this particular image breaks into two completely independent pieces,
and we split the world communicator accordingly, so this section of the code
is really only using 5 ranks.
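
For reference, the split is just the usual color-based MPI_Comm_split -
roughly like the sketch below, with illustrative names rather than the actual
code:

/* Ranks working on the same image piece share a sub-communicator, so the
 * collectives in this section only involve 5 of the 10 ranks. */
#include <mpi.h>

MPI_Comm make_piece_comm(int piece /* 0 or 1: which image piece this rank owns */)
{
    int world_rank;
    MPI_Comm piece_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_split(MPI_COMM_WORLD, piece /* color */, world_rank /* key */,
                   &piece_comm);
    return piece_comm;
}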

Part of the issue was a ~16 GB allocation buried several layers deep inside
some classes that I was not tracking in my total memory calculations...
obviously nothing to do with Open MPI.

For the MPI_Allgatherv() stage, we're gathering ~13 GB of data spread roughly
evenly across the 5 ranks.  During that call I see an extra 6-7 GB allocated,
which must be the underlying buffers used for the transfer.  I tried
PMPI_Allgatherv() followed by MPI_Barrier() but saw the same 6-7 GB spike.
Looking at the code more closely, there is a way I can rearchitect this to
send less data between ranks (each rank really only needs several rows above
and below its own block, not the entire global data set).
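
Roughly, the rearchitected exchange would look like the sketch below (names,
buffer layout, and the halo width are illustrative assumptions, not the actual
code) - each rank trades a few boundary rows with its two neighbors via
MPI_Sendrecv instead of gathering the whole image:

/* Each rank owns local_rows x cols doubles and only needs HALO rows from the
 * ranks above and below, so two MPI_Sendrecv calls replace the full
 * MPI_Allgatherv.  Assumed layout: [HALO ghost rows][owned rows][HALO ghost rows]. */
#include <mpi.h>

#define HALO 4  /* illustrative halo width in rows */

void exchange_halo(double *local, int local_rows, int cols, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    double *top_ghost = local;                                   /* rows from the rank above */
    double *top_owned = local + (size_t)HALO * cols;             /* first owned rows */
    double *bot_owned = top_owned + (size_t)(local_rows - HALO) * cols; /* last owned rows */
    double *bot_ghost = top_owned + (size_t)local_rows * cols;   /* rows from the rank below */

    /* send my top rows up, receive my bottom ghost rows from below */
    MPI_Sendrecv(top_owned, HALO * cols, MPI_DOUBLE, up,   0,
                 bot_ghost, HALO * cols, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my bottom rows down, receive my top ghost rows from above */
    MPI_Sendrecv(bot_owned, HALO * cols, MPI_DOUBLE, down, 1,
                 top_ghost, HALO * cols, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}

That bounds each rank's communication to a few halo rows instead of the full
~13 GB image.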

So, I think I'm set for now - thanks for the help.

On Thu, Dec 20, 2018 at 7:49 PM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> You can rewrite MPI_Allgatherv() in your app: it should simply invoke
> PMPI_Allgatherv() (note the leading 'P') with the same arguments,
> followed by MPI_Barrier() on the same communicator (feel free to also
> call MPI_Barrier() before PMPI_Allgatherv()).
> That can make your code slower, but it will force the unexpected
> messages related to the allgatherv to be received.
> If it helps with respect to memory consumption, that means we have a lead.
>
> Cheers,
>
> Gilles
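
For reference, the wrapper Gilles describes looks roughly like this (a sketch
against the standard MPI_Allgatherv signature, not my exact code):

/* Intercept MPI_Allgatherv via the PMPI profiling interface: do the real
 * allgatherv, then force everyone to synchronize so unexpected messages
 * are drained before the call returns. */
#include <mpi.h>

int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm)
{
    int rc = PMPI_Allgatherv(sendbuf, sendcount, sendtype,
                             recvbuf, recvcounts, displs, recvtype, comm);
    if (rc != MPI_SUCCESS) {
        return rc;
    }
    return MPI_Barrier(comm);
}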
>
> On Fri, Dec 21, 2018 at 5:00 AM Jeff Hammond <jeff.scie...@gmail.com>
> wrote:
> >
> > You might try replacing MPI_Allgatherv with the equivalent Send+Recv
> > followed by Broadcast.  I don't think MPI_Allgatherv is particularly
> > optimized (since it is hard to do and not a very popular function), so
> > the replacement might improve your memory utilization.
> >
> > Jeff
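
A rough sketch of what that replacement could look like (buffer names, the
datatype, and the displacement layout are illustrative assumptions, not from
the actual application):

/* Hand-rolled alternative to MPI_Allgatherv, per the suggestion above:
 * gather every rank's block to rank 0 with point-to-point messages, then
 * broadcast the assembled buffer.  Assumes displs[] is in increasing order. */
#include <mpi.h>

void allgatherv_via_bcast(const double *mine, int mycount,
                          double *all, const int counts[], const int displs[],
                          MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* copy rank 0's own block, then receive every other rank's block */
        for (int i = 0; i < mycount; i++)
            all[displs[0] + i] = mine[i];
        for (int r = 1; r < size; r++)
            MPI_Recv(all + displs[r], counts[r], MPI_DOUBLE, r, 0,
                     comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(mine, mycount, MPI_DOUBLE, 0, 0, comm);
    }

    /* total size = last displacement + last count */
    int total = displs[size - 1] + counts[size - 1];
    MPI_Bcast(all, total, MPI_DOUBLE, 0, comm);
}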
> >
> > On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester <op8...@gmail.com> wrote:
> >>
> >> Gilles,
> >>
> >> It is btl/tcp (we'll be upgrading to newer EC2 instance types next year
> >> to take advantage of libfabric).  To say this definitively I need to
> >> write a script that logs and timestamps the process's memory usage as
> >> reported by /proc/<pid>/stat and syncs that up with the application's
> >> log of what it's doing, but based on what I've watched in 'top' so far,
> >> I think the big allocations are happening in two areas where I'm doing
> >> MPI_Allgatherv() - every rank has roughly 1/numRanks of the data (but
> >> not divided exactly evenly, hence MPI_Allgatherv)... the ranks reuse a
> >> pre-allocated buffer to store their local results and then pass that
> >> same buffer into MPI_Allgatherv() to bring in results from all ranks.
> >> So there is a lot of communication across all ranks at these points.
> >> Does your comment about using the coll/sync module apply in this case?
> >> I'm not familiar with this module - is it something I specify at
> >> Open MPI compile time or a runtime option that I enable?
> >>
> >> Thanks for the detailed help.
> >> -Adam
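
The reuse pattern described above amounts to MPI_Allgatherv with MPI_IN_PLACE -
a minimal sketch, with illustrative names:

/* Each rank writes its local results directly into its own slice of one big
 * pre-allocated buffer, then MPI_IN_PLACE gathers the other ranks' slices
 * around it (sendcount/sendtype are ignored when MPI_IN_PLACE is used). */
#include <mpi.h>

void gather_results_in_place(double *big, const int counts[], const int displs[],
                             MPI_Comm comm)
{
    /* big + displs[rank] already holds this rank's counts[rank] results */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   big, counts, displs, MPI_DOUBLE, comm);
}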
> >>
> >> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <
> >> gilles.gouaillar...@gmail.com> wrote:
> >>>
> >>> Adam,
> >>>
> >>> Are you using btl/tcp (i.e. plain TCP/IP) for internode communications,
> >>> or are you using libfabric on top of the latest EC2 drivers?
> >>>
> >>> There is no flow control in btl/tcp, which means, for example, that if
> >>> all your nodes send messages to rank 0, that can create a lot of
> >>> unexpected messages on that rank.
> >>> In the case of btl/tcp, this means a lot of malloc() on rank 0 until
> >>> these messages are received by the app.
> >>> If rank 0 is overwhelmed, that will likely end up with the node
> >>> swapping to death (or your app being killed if you have little or no
> >>> swap).
> >>>
> >>> If you are using collective operations, make sure the coll/sync module
> >>> is selected.
> >>> This module inserts an MPI_Barrier() every n collectives on a given
> >>> communicator.  This forces your processes to synchronize and can force
> >>> messages to be received.  (Think of the previous example if you run
> >>> MPI_Scatter(root=0) in a loop.)
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
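
What coll/sync does amounts to something like the hand-written loop below (a
conceptual sketch only - the real module is a component selected at runtime,
and n = 10 is an arbitrary illustration):

/* After every n collectives on a communicator, synchronize so unexpected
 * messages get received instead of piling up (cf. the MPI_Scatter(root=0)
 * loop mentioned above). */
#include <mpi.h>

void scatter_loop(double *sendbuf, double *recvbuf, int count,
                  int num_iters, MPI_Comm comm)
{
    for (int i = 0; i < num_iters; i++) {
        MPI_Scatter(sendbuf, count, MPI_DOUBLE,
                    recvbuf, count, MPI_DOUBLE, 0 /* root */, comm);
        if ((i + 1) % 10 == 0)
            MPI_Barrier(comm);   /* force everyone to catch up */
    }
}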
> >>>
> >>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com>
> >>> wrote:
> >>> >
> >>> > This case is actually quite small - 10 physical machines with 18
> >>> > physical cores each, 1 rank per machine.  These are AWS R4 instances
> >>> > (Intel Xeon E5 Broadwell processors).  Open MPI version 2.1.0, using
> >>> > TCP (10 Gbps).
> >>> >
> >>> > I calculate the memory needs of my application upfront (in this case
> >>> > ~225 GB per machine), allocate one buffer upfront, and reuse this
> >>> > buffer for both valid data and scratch space throughout processing.
> >>> > This is running on RHEL 7 - I'm measuring memory usage via top, where
> >>> > I see it go up to 248 GB in an MPI-intensive portion of processing.
> >>> >
> >>> > I thought I was being quite careful with my memory allocations and
> >>> > that there weren't any other stray allocations going on, but of course
> >>> > it's possible there's a large temp buffer somewhere that I've missed...
> >>> > Based on what you're saying, this is way more memory than should be
> >>> > attributed to Open MPI - is there a way I can query Open MPI to
> >>> > confirm that?  If the OS is unable to keep up with the network
> >>> > traffic, is it possible there's some low-level system buffer that gets
> >>> > allocated to gradually work off the TCP traffic?
> >>> >
> >>> > Thanks.
> >>> >
> >>> >> On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users <
> >>> >> users@lists.open-mpi.org> wrote:
> >>> >>
> >>> >> How many nodes are you using?  How many processes per node?  What
> >>> >> kind of processor?  Open MPI version?  25 GB is several orders of
> >>> >> magnitude more memory than should be used except at extreme scale
> >>> >> (1M+ processes).  Also, how are you calculating memory usage?
> >>> >>
> >>> >> -Nathan
> >>> >>
> >>> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com>
> >>> >> > wrote:
> >>> >> >
> >>> >> > Is there a way at runtime to query Open MPI to ask how much memory
> >>> >> > it's using for internal buffers?  Is there a way at runtime to set
> >>> >> > a maximum amount of memory Open MPI will use for these buffers?  I
> >>> >> > have an application where, for certain inputs, Open MPI appears to
> >>> >> > be allocating ~25 GB that I'm not accounting for in my memory
> >>> >> > calculations (and thus bricking the machine).
> >>> >> >
> >>> >> > Thanks.
> >>> >> > -Adam
> >
> > --
> > Jeff Hammond
> > jeff.scie...@gmail.com
> > http://jeffhammond.github.io/