Ok. I talked with Nathan about this a bit. Here's what we think we should do:

1. Add an MCA param to disable (de)registration as part of ALLOC/FREE_MEM. Because that's just the Open MPI way (moar MCA paramz!).

2. If memory hooks are enabled, default to *not* (de)registering as part of ALLOC/FREE_MEM -- i.e., the lazy method seems to be working much better for this scenario already. If memory hooks are not enabled, then we'll do the (de)registration as part of ALLOC/FREE_MEM.

Paul/etc.: can you run with the CALLGRAPH option that Alfio mentioned (https://www.mail-archive.com/users@lists.open-mpi.org/msg30785.html)? From what Alfio described, it sounds like CP2K is trying to minimize the calls to ALLOC/FREE_MEM, but somehow they are clearly still getting invoked a lot. It would be good to understand how/why. I.e.: is the bug that OMPI's ALLOC/FREE_MEM is slow (which -- to be honest -- is somewhat expected), or is there a bug in CP2K such that it is calling ALLOC/FREE_MEM more than it should?

Hristo: you mentioned "70%" of the run time was spent in ALLOC/FREE_MEM. How long was the run?
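If getting the CALLGRAPH build going is a hassle, another option is a tiny PMPI interposer that just counts and times the MPI_*_MEM calls. Here's a rough sketch (untested; it assumes CP2K's Fortran MPI calls end up in the C MPI_Alloc_mem/MPI_Free_mem underneath -- depending on how the Fortran bindings were built they may go straight to the PMPI_ entry points, so that's worth double-checking) that you could compile into a shared library and LD_PRELOAD into the run:

```c
/* allocmem_count.c -- hypothetical PMPI interposer (sketch, untested).
 * Counts and times MPI_Alloc_mem / MPI_Free_mem per rank and prints a
 * summary at MPI_Finalize.  Build as a shared library, e.g.:
 *   mpicc -shared -fPIC -o liballocmem_count.so allocmem_count.c
 * and run with LD_PRELOAD=./liballocmem_count.so. */
#include <mpi.h>
#include <stdio.h>

static long long alloc_calls = 0, free_calls = 0;
static double alloc_secs = 0.0, free_secs = 0.0;

int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Alloc_mem(size, info, baseptr);
    alloc_secs += MPI_Wtime() - t0;
    alloc_calls++;
    return rc;
}

int MPI_Free_mem(void *base)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Free_mem(base);
    free_secs += MPI_Wtime() - t0;
    free_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] MPI_Alloc_mem: %lld calls, %.3f s total; "
           "MPI_Free_mem: %lld calls, %.3f s total\n",
           rank, alloc_calls, alloc_secs, free_calls, free_secs);
    return PMPI_Finalize();
}
```

That would answer both questions at once: how many times ALLOC/FREE_MEM really get invoked, and how much of the wall-clock time those calls account for.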
> On Mar 16, 2017, at 2:22 PM, Paul Kapinos <kapi...@itc.rwth-aachen.de> wrote:
>
> Jeff, I confirm: your patch did it.
>
> (Tried on 1.10.6 -- we did not even need to rebuild cp2k.popt, just load another Open MPI version compiled with Jeff's patch.)
>
> (On Intel Omni-Path this gives the same speed as with --mca btl ^tcp,openib.)
>
>
> On 03/16/17 01:03, Jeff Squyres (jsquyres) wrote:
>> It looks like there were 3 separate threads on this CP2K issue, but I think we developers got sidetracked because there was a bunch of talk in the other threads about PSM, non-IB(verbs) networks, etc.
>>
>> So: the real issue is an app is experiencing a lot of slowdown when calling MPI_ALLOC_MEM/MPI_FREE_MEM when the openib BTL is involved.
>>
>> The MPI_*_MEM calls are "slow" when used with the openib BTL because we're registering the memory every time you call MPI_ALLOC_MEM and deregistering the memory every time you call MPI_FREE_MEM. This was intended as an optimization such that the memory is already registered when you invoke an MPI communications function with that buffer. I guess we didn't really anticipate the case where *every* allocation goes through ALLOC_MEM...
>>
>> Meaning: if the app is aggressive in using MPI_*_MEM *everywhere* -- even for buffers that aren't used for MPI communication -- I guess you could end up with lots of useless registration/deregistration. If the app does it a lot, that could be the source of quite a lot of needless overhead.
>>
>> We don't have a run-time bypass of this behavior (i.e., we assumed that if you're calling MPI_*_MEM, you mean to do so). But let's try an experiment -- can you try applying this patch and see if it removes the slowness?
>> This patch basically removes the registration/deregistration with ALLOC/FREE_MEM (and instead handles it lazily / upon demand when buffers are passed to MPI functions):
>>
>> ```patch
>> diff --git a/ompi/mpi/c/alloc_mem.c b/ompi/mpi/c/alloc_mem.c
>> index 8c8fb8cd54..c62c8ff706 100644
>> --- a/ompi/mpi/c/alloc_mem.c
>> +++ b/ompi/mpi/c/alloc_mem.c
>> @@ -74,6 +74,7 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>>
>>      OPAL_CR_ENTER_LIBRARY();
>>
>> +#if 0
>>      if (MPI_INFO_NULL != info) {
>>          int flag;
>>          (void) ompi_info_get (info, "mpool_hints", MPI_MAX_INFO_VAL, info_value, &f
>> @@ -84,6 +85,9 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>>
>>      *((void **) baseptr) = mca_mpool_base_alloc ((size_t) size, (struct opal_info_t
>>                                                   mpool_hints);
>> +#else
>> +    *((void **) baseptr) = malloc(size);
>> +#endif
>>      OPAL_CR_EXIT_LIBRARY();
>>      if (NULL == *((void **) baseptr)) {
>>          return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
>> diff --git a/ompi/mpi/c/free_mem.c b/ompi/mpi/c/free_mem.c
>> index 4498fc8bb1..4c65ea2339 100644
>> --- a/ompi/mpi/c/free_mem.c
>> +++ b/ompi/mpi/c/free_mem.c
>> @@ -50,10 +50,16 @@ int MPI_Free_mem(void *baseptr)
>>
>>         If you call MPI_ALLOC_MEM with a size of 0, you get NULL
>>         back.  So don't consider a NULL==baseptr an error. */
>> +#if 0
>>     if (NULL != baseptr && OMPI_SUCCESS != mca_mpool_base_free(baseptr)) {
>>         OPAL_CR_EXIT_LIBRARY();
>>         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM, FUNC_NAME);
>>     }
>> +#else
>> +    if (NULL != baseptr) {
>> +        free(baseptr);
>> +    }
>> +#endif
>>
>>     OPAL_CR_EXIT_LIBRARY();
>>     return MPI_SUCCESS;
>> ```
>>
>> This will at least tell us if the innards of our ALLOC_MEM/FREE_MEM (i.e., likely the registration/deregistration) are causing the issue.
>>
>>
>>> On Mar 15, 2017, at 1:27 PM, Dave Love <dave.l...@manchester.ac.uk> wrote:
>>>
>>> Paul Kapinos <kapi...@itc.rwth-aachen.de> writes:
>>>
>>>> Nathan,
>>>> unfortunately '--mca memory_linux_disable 1' does not help on this issue - it does not change the behaviour at all.
>>>> Note that the pathological behaviour is present in Open MPI 2.0.2 as well as in 1.10.x, and Intel OmniPath (OPA) network-capable nodes are affected only.
>>>
>>> [I guess that should have been "too" rather than "only". It's loading the openib btl that is the problem.]
>>>
>>>> The known workaround is to disable InfiniBand failback by '--mca btl ^tcp,openib' on nodes with OPA network. (On IB nodes, the same tweak lead to 5% performance improvement on single-node jobs;
>>>
>>> It was a lot more than that in my cp2k test.
>>>
>>>> but obviously disabling IB on nodes connected via IB is not a solution for multi-node jobs, huh).
>>>
>>> But it works OK with libfabric (ofi mtl). Is there a problem with libfabric?
>>>
>>> Has anyone reported this issue to the cp2k people? I know it's not their problem, but I assume they'd like to know for users' sake, particularly if it's not going to be addressed. I wonder what else might be affected.
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
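P.S. To put a number on the raw per-call cost independent of CP2K, something like the following quick microbenchmark (a sketch, untested; iteration count and buffer size are arbitrary) could be run on both the IB and OPA nodes, with and without the patch:

```c
/* allocmem_bench.c -- rough microbenchmark sketch (not from this thread):
 * compares the per-call cost of MPI_Alloc_mem/MPI_Free_mem against plain
 * malloc/free.  With the openib BTL active, the difference is roughly the
 * registration/deregistration cost. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    const MPI_Aint size = 1 << 20;      /* 1 MiB per allocation */
    void *buf;
    double t0, t_mpi, t_libc;

    MPI_Init(&argc, &argv);

    /* MPI_Alloc_mem / MPI_Free_mem (may register/deregister with the NIC) */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
        ((char *) buf)[0] = 0;          /* touch so the pair can't be optimized away */
        MPI_Free_mem(buf);
    }
    t_mpi = MPI_Wtime() - t0;

    /* plain malloc / free for comparison */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        buf = malloc(size);
        ((char *) buf)[0] = 0;
        free(buf);
    }
    t_libc = MPI_Wtime() - t0;

    printf("MPI_Alloc_mem/Free_mem: %.1f us per pair; malloc/free: %.1f us per pair\n",
           1e6 * t_mpi / iters, 1e6 * t_libc / iters);

    MPI_Finalize();
    return 0;
}
```

If the MPI numbers drop to malloc/free levels once the patch is applied, that confirms the registration path is where the time is going.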
--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users