Ok.  I talked with Nathan about this a bit.  Here's what we think we should do:

1. Add an MCA param to disable (de)registration as part of ALLOC/FREE_MEM.  
Because that's just the Open MPI way (moar MCA paramz!).  (A hypothetical 
invocation is sketched just after this list.)

2. If memory hooks are enabled, default to not (de)registering as part of 
ALLOC/FREE_MEM.  I.e., the lazy method seems to be working much better for this 
scenario already.  If memory hooks are not enabled, then we'll do the 
(de)registration as part of ALLOC/FREE_MEM.
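
To make (1) concrete from the user's side, something like the line below is 
what I have in mind.  The parameter name is purely hypothetical; whatever we 
actually commit may well be spelled differently:

```sh
# Hypothetical parameter name, shown only to illustrate the proposed knob:
# disable eager (de)registration inside MPI_ALLOC_MEM / MPI_FREE_MEM.
mpirun --mca mpi_alloc_mem_register 0 ./cp2k.popt ...
```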

Paul/etc.: can you run with the CALLGRAPH option that Alfio mentioned 
(https://www.mail-archive.com/users@lists.open-mpi.org/msg30785.html)?  From 
what Alfio described, it sounds like CP2K is trying to minimize its calls to 
ALLOC/FREE_MEM, but somehow they are clearly still getting invoked a lot.  It 
would be good to understand how/why.  I.e.: is the bug that OMPI's 
ALLOC/FREE_MEM is slow (which -- to be honest -- is somewhat expected), or is 
there a bug in CP2K such that it is calling ALLOC/FREE_MEM more than it should?
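
(For anyone who hasn't used it: going from memory of the CP2K documentation 
rather than from Alfio's mail, the call graph is switched on in the &GLOBAL 
section of the input file, roughly as below, and the resulting .callgraph file 
can be opened with kcachegrind.)

```
&GLOBAL
  ! Going from memory of the CP2K manual -- double-check against Alfio's mail.
  ! MASTER traces rank 0 only; ALL traces every rank.
  CALLGRAPH MASTER
&END GLOBAL
```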

Hristo: you mentioned "70%" of the run time was spent in ALLOC/FREE_MEM.  How 
long was the run?
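
Relatedly, if anyone wants to quantify the raw per-call cost outside of CP2K, 
a small timing loop like the sketch below (mine, not from this thread) should 
make the difference between MPI_ALLOC/FREE_MEM and plain malloc/free obvious 
on an openib-enabled node.  Running it once with and once without '--mca btl 
^tcp,openib' would also show how much of that 70% is really the registration 
path:

```c
/* Sketch of a micro-benchmark: time N MPI_Alloc_mem/MPI_Free_mem pairs
 * against plain malloc/free to expose any per-call (de)registration cost.
 * Build with something like: mpicc alloc_bench.c -o alloc_bench            */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 10000;             /* number of alloc/free pairs   */
    const MPI_Aint bytes = 1 << 20;  /* 1 MiB per allocation         */
    void *buf;
    double t0, t_mpi, t_libc;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();
    for (int i = 0; i < n; ++i) {
        MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf);
        ((char *) buf)[0] = 1;       /* touch the buffer so it is really mapped */
        MPI_Free_mem(buf);
    }
    t_mpi = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    for (int i = 0; i < n; ++i) {
        buf = malloc(bytes);
        ((char *) buf)[0] = 1;
        free(buf);
    }
    t_libc = MPI_Wtime() - t0;

    printf("MPI_Alloc_mem/MPI_Free_mem: %.3f s    malloc/free: %.3f s\n",
           t_mpi, t_libc);
    MPI_Finalize();
    return 0;
}
```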




> On Mar 16, 2017, at 2:22 PM, Paul Kapinos <kapi...@itc.rwth-aachen.de> wrote:
> 
> Jeff, I confirm: your patch did it.
> 
> (Tried on 1.10.6; we do not even need to rebuild cp2k.popt, just load 
> another Open MPI version compiled with Jeff's patch.)
> 
> (On Intel OmniPath, it gives the same speed as with --mca btl ^tcp,openib.)
> 
> 
> On 03/16/17 01:03, Jeff Squyres (jsquyres) wrote:
>> It looks like there were 3 separate threads on this CP2K issue, but I think 
>> we developers got sidetracked because there was a bunch of talk in the other 
>> threads about PSM, non-IB(verbs) networks, etc.
>> 
>> So: the real issue is that an app is experiencing a lot of slowdown when 
>> calling MPI_ALLOC_MEM/MPI_FREE_MEM when the openib BTL is involved.
>> 
>> The MPI_*_MEM calls are "slow" when used with the openib BTL because we're 
>> registering the memory every time you call MPI_ALLOC_MEM and deregistering 
>> the memory every time you call MPI_FREE_MEM.  This was intended as an 
>> optimization such that the memory is already registered when you invoke an 
>> MPI communications function with that buffer.  I guess we didn't really 
>> anticipate the case where *every* allocation goes through ALLOC_MEM...
>> 
>> Meaning: if the app is aggressive in using MPI_*_MEM *everywhere* -- even 
>> for buffers that aren't used for MPI communication -- I guess you could end 
>> up with lots of useless registration/deregistration.  If the app does it a 
>> lot, that could be the source of quite a lot of needless overhead.
>> 
>> We don't have a run-time bypass of this behavior (i.e., we assumed that if 
>> you're calling MPI_*_MEM, you mean to do so).  But let's try an experiment 
>> -- can you try applying this patch and see if it removes the slowness?  This 
>> patch basically removes the registration / deregistration with 
>> ALLOC/FREE_MEM (and instead handles it lazily / upon demand when buffers are 
>> passed to MPI functions):
>> 
>> ```patch
>> diff --git a/ompi/mpi/c/alloc_mem.c b/ompi/mpi/c/alloc_mem.c
>> index 8c8fb8cd54..c62c8ff706 100644
>> --- a/ompi/mpi/c/alloc_mem.c
>> +++ b/ompi/mpi/c/alloc_mem.c
>> @@ -74,6 +74,7 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>> 
>>     OPAL_CR_ENTER_LIBRARY();
>> 
>> +#if 0
>>     if (MPI_INFO_NULL != info) {
>>         int flag;
>>         (void) ompi_info_get (info, "mpool_hints", MPI_MAX_INFO_VAL, info_value, &flag);
>> @@ -84,6 +85,9 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>> 
>>     *((void **) baseptr) = mca_mpool_base_alloc ((size_t) size, (struct opal_info_t*)info,
>>                                                  mpool_hints);
>> +#else
>> +    *((void **) baseptr) = malloc(size);
>> +#endif
>>     OPAL_CR_EXIT_LIBRARY();
>>     if (NULL == *((void **) baseptr)) {
>>         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
>> diff --git a/ompi/mpi/c/free_mem.c b/ompi/mpi/c/free_mem.c
>> index 4498fc8bb1..4c65ea2339 100644
>> --- a/ompi/mpi/c/free_mem.c
>> +++ b/ompi/mpi/c/free_mem.c
>> @@ -50,10 +50,16 @@ int MPI_Free_mem(void *baseptr)
>> 
>>        If you call MPI_ALLOC_MEM with a size of 0, you get NULL
>>        back.  So don't consider a NULL==baseptr an error. */
>> +#if 0
>>     if (NULL != baseptr && OMPI_SUCCESS != mca_mpool_base_free(baseptr)) {
>>         OPAL_CR_EXIT_LIBRARY();
>>         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM, FUNC_NAME);
>>     }
>> +#else
>> +    if (NULL != baseptr) {
>> +        free(baseptr);
>> +    }
>> +#endif
>> 
>>     OPAL_CR_EXIT_LIBRARY();
>>     return MPI_SUCCESS;
>> ```
>> 
>> This will at least tell us if the innards of our ALLOC_MEM/FREE_MEM (i.e., 
>> likely the registration/deregistration) are causing the issue.
>> 
>> 
>> 
>> 
>>> On Mar 15, 2017, at 1:27 PM, Dave Love <dave.l...@manchester.ac.uk> wrote:
>>> 
>>> Paul Kapinos <kapi...@itc.rwth-aachen.de> writes:
>>> 
>>>> Nathan,
>>>> unfortunately '--mca memory_linux_disable 1' does not help on this
>>>> issue - it does not change the behaviour at all.
>>>> Note that the pathological behaviour is present in Open MPI 2.0.2 as
>>>> well as in 1.10.x, and Intel OmniPath (OPA) network-capable nodes are
>>>> affected only.
>>> 
>>> [I guess that should have been "too" rather than "only".  It's loading
>>> the openib btl that is the problem.]
>>> 
>>>> The known workaround is to disable the InfiniBand fallback with '--mca btl
>>>> ^tcp,openib' on nodes with an OPA network. (On IB nodes, the same tweak
>>>> led to a 5% performance improvement in single-node jobs;
>>> 
>>> It was a lot more than that in my cp2k test.
>>> 
>>>> but obviously
>>>> disabling IB on nodes connected via IB is not a solution for
>>>> multi-node jobs, huh).
>>> 
>>> But it works OK with libfabric (ofi mtl).  Is there a problem with
>>> libfabric?
>>> 
>>> Has anyone reported this issue to the cp2k people?  I know it's not
>>> their problem, but I assume they'd like to know for users' sake,
>>> particularly if it's not going to be addressed.  I wonder what else
>>> might be affected.
>> 
>> 
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, IT Center
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
> 


-- 
Jeff Squyres
jsquy...@cisco.com
