I couldn't find the docs on mpool_hints, but shouldn't there be a way to disable registration via MPI_Info rather than patching the source?
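Something along the lines of the sketch below is what I have in mind. To be clear, the "mpool_hints" key is clearly read in alloc_mem.c, but the value I pass here is purely hypothetical -- I don't know of any value that is actually recognized as "skip registration", which is exactly what I couldn't find documented.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    void *buf;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Hypothetical value: I am not aware of any documented setting that
     * tells the mpool layer to skip eager registration for this buffer. */
    MPI_Info_set(info, "mpool_hints", "no_register");

    /* 1 MiB buffer that the application never uses for communication. */
    MPI_Alloc_mem(1 << 20, info, &buf);
    MPI_Info_free(&info);

    /* ... purely local work with buf ... */

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}
```

If something like that were honored, apps such as CP2K could keep routing every allocation through MPI_Alloc_mem and simply opt out of eager registration for buffers that never touch the network.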
Jeff

PS Jeff Squyres: ;-) ;-) ;-)

On Wed, Mar 15, 2017 at 5:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> It looks like there were 3 separate threads on this CP2K issue, but I think
> we developers got sidetracked because there was a bunch of talk in the other
> threads about PSM, non-IB(verbs) networks, etc.
>
> So: the real issue is that an app is experiencing a lot of slowdown when
> calling MPI_ALLOC_MEM/MPI_FREE_MEM while the openib BTL is in use.
>
> The MPI_*_MEM calls are "slow" when used with the openib BTL because we
> register the memory every time you call MPI_ALLOC_MEM and deregister it
> every time you call MPI_FREE_MEM.  This was intended as an optimization so
> that the memory is already registered when you invoke an MPI communication
> function with that buffer.  I guess we didn't really anticipate the case
> where *every* allocation goes through ALLOC_MEM...
>
> Meaning: if the app is aggressive in using MPI_*_MEM *everywhere* -- even
> for buffers that aren't used for MPI communication -- you could end up with
> lots of useless registration/deregistration.  If the app does that a lot, it
> could be the source of quite a lot of needless overhead.
>
> We don't have a run-time bypass of this behavior (i.e., we assumed that if
> you're calling MPI_*_MEM, you mean to do so).  But let's try an experiment:
> can you apply this patch and see if it removes the slowness?  It basically
> removes the registration/deregistration from ALLOC_MEM/FREE_MEM; registration
> then happens lazily, on demand, when buffers are passed to MPI functions:
>
> ```patch
> diff --git a/ompi/mpi/c/alloc_mem.c b/ompi/mpi/c/alloc_mem.c
> index 8c8fb8cd54..c62c8ff706 100644
> --- a/ompi/mpi/c/alloc_mem.c
> +++ b/ompi/mpi/c/alloc_mem.c
> @@ -74,6 +74,7 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>
>      OPAL_CR_ENTER_LIBRARY();
>
> +#if 0
>      if (MPI_INFO_NULL != info) {
>          int flag;
>          (void) ompi_info_get (info, "mpool_hints", MPI_MAX_INFO_VAL, info_value, &flag);
> @@ -84,6 +85,9 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>
>      *((void **) baseptr) = mca_mpool_base_alloc ((size_t) size, (struct opal_info_t *) info,
>                                                   mpool_hints);
> +#else
> +    *((void **) baseptr) = malloc(size);
> +#endif
>      OPAL_CR_EXIT_LIBRARY();
>      if (NULL == *((void **) baseptr)) {
>          return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
> diff --git a/ompi/mpi/c/free_mem.c b/ompi/mpi/c/free_mem.c
> index 4498fc8bb1..4c65ea2339 100644
> --- a/ompi/mpi/c/free_mem.c
> +++ b/ompi/mpi/c/free_mem.c
> @@ -50,10 +50,16 @@ int MPI_Free_mem(void *baseptr)
>
>         If you call MPI_ALLOC_MEM with a size of 0, you get NULL
>         back.  So don't consider a NULL==baseptr an error. */
> +#if 0
>      if (NULL != baseptr && OMPI_SUCCESS != mca_mpool_base_free(baseptr)) {
>          OPAL_CR_EXIT_LIBRARY();
>          return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM, FUNC_NAME);
>      }
> +#else
> +    if (NULL != baseptr) {
> +        free(baseptr);
> +    }
> +#endif
>
>      OPAL_CR_EXIT_LIBRARY();
>      return MPI_SUCCESS;
> ```
>
> This will at least tell us whether the innards of our ALLOC_MEM/FREE_MEM
> (i.e., most likely the registration/deregistration) are causing the issue.
>
>
> > On Mar 15, 2017, at 1:27 PM, Dave Love <dave.l...@manchester.ac.uk> wrote:
> >
> > Paul Kapinos <kapi...@itc.rwth-aachen.de> writes:
> >
> >> Nathan,
> >> unfortunately '--mca memory_linux_disable 1' does not help on this
> >> issue - it does not change the behaviour at all.
> >> Note that the pathological behaviour is present in Open MPI 2.0.2 as
> >> well as in 1.10.x, and Intel OmniPath (OPA) network-capable nodes are
> >> affected only.
> >
> > [I guess that should have been "too" rather than "only".  It's loading
> > the openib btl that is the problem.]
> >
> >> The known workaround is to disable the InfiniBand fallback with '--mca btl
> >> ^tcp,openib' on nodes with an OPA network.  (On IB nodes, the same tweak
> >> led to a 5% performance improvement on single-node jobs;
> >
> > It was a lot more than that in my cp2k test.
> >
> >> but obviously
> >> disabling IB on nodes connected via IB is not a solution for
> >> multi-node jobs, huh).
> >
> > But it works OK with libfabric (ofi mtl).  Is there a problem with
> > libfabric?
> >
> > Has anyone reported this issue to the cp2k people?  I know it's not
> > their problem, but I assume they'd like to know for users' sake,
> > particularly if it's not going to be addressed.  I wonder what else
> > might be affected.
>
> --
> Jeff Squyres
> jsquy...@cisco.com

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users