Gilles,

It is btl/tcp (we'll be upgrading to newer EC2 instance types next year to take advantage of libfabric).

To say this definitively, I need to write a script that logs and timestamps the memory usage of the process as reported by /proc/<pid>/stat and correlates it with the application's own log of what it's doing. Based on what I've watched in 'top' so far, though, I think the big allocations are happening in two places where I call MPI_Allgatherv(): every rank holds roughly 1/numRanks of the data (not divided exactly evenly, hence MPI_Allgatherv rather than MPI_Allgather). The ranks reuse the pre-allocated buffer to store their local results and then pass that same pre-allocated buffer into MPI_Allgatherv() to bring in the results from all ranks, so there is a lot of communication across all ranks at these points. A simplified sketch of that pattern is below.

So, does your comment about using the coll/sync module apply in this case? I'm not familiar with that module - is it something I specify at Open MPI compile time, or a runtime option that I enable?
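For reference, here is a minimal sketch of roughly that pattern. It is entirely illustrative - the sizes and names are made up, and I'm assuming MPI_IN_PLACE is the mechanism by which the one pre-allocated buffer serves as both source and destination:

/* Illustrative sketch only - sizes and names are made up, and
 * MPI_IN_PLACE is assumed as the way the single pre-allocated
 * buffer is reused.  Compile with mpicc. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Uneven split of totalElems across ranks (deliberately not
     * divisible by nranks, which is why MPI_Allgatherv is needed). */
    const int totalElems = 1000003;
    int *counts = malloc(nranks * sizeof(int));
    int *displs = malloc(nranks * sizeof(int));
    int offset = 0;
    for (int r = 0; r < nranks; ++r) {
        counts[r] = totalElems / nranks + (r < totalElems % nranks ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }

    /* One buffer, allocated up front, holds everyone's results. */
    double *buf = malloc(totalElems * sizeof(double));

    /* Each rank computes its local results directly into its own
     * slice of the shared buffer... */
    for (int i = 0; i < counts[rank]; ++i) {
        buf[displs[rank] + i] = rank + 0.001 * i;  /* placeholder "work" */
    }

    /* ...then that same buffer is passed to MPI_Allgatherv.  With
     * MPI_IN_PLACE the send arguments are ignored and each rank's
     * contribution is read from its own slice of buf. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   buf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(buf);
    free(displs);
    free(counts);
    MPI_Finalize();
    return 0;
}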
Thanks for the detailed help.
-Adam

On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Adam,
>
> Are you using btl/tcp (i.e. plain TCP/IP) for internode communications? Or are you using libfabric on top of the latest EC2 drivers?
>
> There is no control flow in btl/tcp, which means, for example, that if all your nodes send messages to rank 0, that can create a lot of unexpected messages on that rank.
> In the case of btl/tcp, this means a lot of malloc() on rank 0 until these messages are received by the app.
> If rank 0 is overwhelmed, that will likely end with the node swapping to death (or your app being killed if you have little or no swap).
>
> If you are using collective operations, make sure the coll/sync module is selected.
> This module inserts MPI_Barrier() every n collectives on a given communicator. This forces your processes to synchronize and can force messages to be received. (Think of the previous example if you run MPI_Scatter(root=0) in a loop.)
>
> Cheers,
>
> Gilles
>
> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester <op8...@gmail.com> wrote:
> >
> > This case is actually quite small - 10 physical machines with 18 physical cores each, 1 rank per machine. These are AWS R4 instances (Intel Xeon E5 Broadwell processors). Open MPI version 2.1.0, using TCP (10 Gbps).
> >
> > I calculate the memory needs of my application upfront (in this case ~225 GB per machine), allocate one buffer upfront, and reuse this buffer for valid data and scratch space throughout processing. This is running on RHEL 7 - I'm measuring memory usage via top, where I see it go up to 248 GB in an MPI-intensive portion of processing.
> >
> > I thought I was being quite careful with my memory allocations and that there weren't any other stray allocations going on, but of course it's possible there's a large temp buffer somewhere that I've missed... based on what you're saying, this is way more memory than should be attributed to Open MPI - is there a way I can query Open MPI to confirm that? If the OS is unable to keep up with the network traffic, is it possible there's some low-level system buffer that gets allocated to gradually work off the TCP traffic?
> >
> > Thanks.
> >
> > On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users <users@lists.open-mpi.org> wrote:
> >>
> >> How many nodes are you using? How many processes per node? What kind of processor? Open MPI version? 25 GB is several orders of magnitude more memory than should be used except at extreme scale (1M+ processes). Also, how are you calculating memory usage?
> >>
> >> -Nathan
> >>
> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester <op8...@gmail.com> wrote:
> >> >
> >> > Is there a way at runtime to query Open MPI to ask it how much memory it's using for internal buffers? Is there a way at runtime to set a max amount of memory Open MPI will use for these buffers? I have an application where for certain inputs Open MPI appears to be allocating ~25 GB, and I'm not accounting for this in my memory calculations (and thus bricking the machine).
> >> >
> >> > Thanks.
> >> > -Adam
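To make sure I understand what coll/sync would be doing: below is a minimal sketch of the manual equivalent Gilles describes - inserting an MPI_Barrier() every n collectives on a communicator. The loop count, the value of N, and the buffer sizes are purely illustrative, not from my application:

/* Minimal sketch of manually inserting a barrier every N collectives,
 * i.e. the behavior Gilles describes for the coll/sync module; this is
 * not the module itself.  All sizes and the value of N are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int numIters = 1000;  /* many collectives in a loop            */
    const int chunk = 4096;     /* elements scattered to each rank       */
    const int N = 50;           /* synchronize every N collectives       */

    double *sendbuf = NULL;
    if (rank == 0) {
        sendbuf = malloc((size_t)nranks * chunk * sizeof(double));
        for (long i = 0; i < (long)nranks * chunk; ++i)
            sendbuf[i] = (double)i;  /* placeholder data */
    }
    double *recvbuf = malloc(chunk * sizeof(double));

    for (int iter = 0; iter < numIters; ++iter) {
        /* Without flow control, the root can race ahead and its eagerly
         * sent messages pile up as unexpected messages (malloc) on the
         * slower receivers. */
        MPI_Scatter(sendbuf, chunk, MPI_DOUBLE,
                    recvbuf, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* A periodic barrier forces everyone to catch up, bounding the
         * number of in-flight/unexpected messages. */
        if ((iter + 1) % N == 0) {
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    free(recvbuf);
    free(sendbuf);
    MPI_Finalize();
    return 0;
}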
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users