Jeff, I confirm: your patch did it.

(tried on 1.10.6 - there is no need even to rebuild cp2k.popt; just load another Open MPI version compiled with Jeff's patch)

(On Intel OmniPath, this gives the same speed as with --mca btl ^tcp,openib)


On 03/16/17 01:03, Jeff Squyres (jsquyres) wrote:
It looks like there were 3 separate threads on this CP2K issue, but I think we 
developers got sidetracked because there was a bunch of talk in the other 
threads about PSM, non-IB(verbs) networks, etc.

So: the real issue is that an app experiences a lot of slowdown when calling 
MPI_ALLOC_MEM/MPI_FREE_MEM while the openib BTL is involved.

The MPI_*_MEM calls are "slow" when used with the openib BTL because we're 
registering the memory every time you call MPI_ALLOC_MEM and deregistering the memory 
every time you call MPI_FREE_MEM.  This was intended as an optimization such that the 
memory is already registered when you invoke an MPI communications function with that 
buffer.  I guess we didn't really anticipate the case where *every* allocation goes 
through ALLOC_MEM...

Meaning: if the app is aggressive in using MPI_*_MEM *everywhere* -- even for 
buffers that aren't used for MPI communication -- I guess you could end up with 
lots of useless registration/deregistration.  If the app does it a lot, that 
could be the source of quite a lot of needless overhead.
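
To make that concrete, here is a minimal sketch of the eager-registration 
pattern (my simplification, not the actual mpool code; assume `pd` is an 
already-created libibverbs protection domain, and error handling is reduced 
to the bare minimum):

```c
/* Sketch only: eager registration on every allocation, roughly the
 * pattern described above.  NOT the actual Open MPI mpool code. */
#include <infiniband/verbs.h>
#include <stdlib.h>

/* ALLOC_MEM analogue: allocate, then immediately pin the pages and
 * program the HCA's translation tables -- the expensive part. */
static void *alloc_registered(struct ibv_pd *pd, size_t size,
                              struct ibv_mr **mr_out)
{
    void *buf = malloc(size);
    if (NULL == buf) {
        return NULL;
    }
    *mr_out = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    if (NULL == *mr_out) {
        free(buf);
        return NULL;
    }
    return buf;
}

/* FREE_MEM analogue: unpin, then free -- again kernel-mediated and
 * slow when done once per allocation. */
static void free_registered(void *buf, struct ibv_mr *mr)
{
    (void) ibv_dereg_mr(mr);
    free(buf);
}
```

Doing that pair of verbs calls once per allocation is what adds up when an 
app routes every allocation through MPI_*_MEM.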

We don't have a run-time bypass of this behavior (i.e., we assumed that if 
you're calling MPI_*_MEM, you mean to do so).  But let's try an experiment -- 
can you try applying this patch and see if it removes the slowness?  This patch 
basically removes the registration / deregistration with ALLOC/FREE_MEM (and 
instead handles it lazily / upon demand when buffers are passed to MPI 
functions):

```patch
diff --git a/ompi/mpi/c/alloc_mem.c b/ompi/mpi/c/alloc_mem.c
index 8c8fb8cd54..c62c8ff706 100644
--- a/ompi/mpi/c/alloc_mem.c
+++ b/ompi/mpi/c/alloc_mem.c
@@ -74,6 +74,7 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)

     OPAL_CR_ENTER_LIBRARY();

+#if 0
     if (MPI_INFO_NULL != info) {
         int flag;
         (void) ompi_info_get (info, "mpool_hints", MPI_MAX_INFO_VAL, info_value, &flag);
@@ -84,6 +85,9 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)

     *((void **) baseptr) = mca_mpool_base_alloc ((size_t) size, (struct opal_info_t*) info,
                                                  mpool_hints);
+#else
+    *((void **) baseptr) = malloc(size);
+#endif
     OPAL_CR_EXIT_LIBRARY();
     if (NULL == *((void **) baseptr)) {
         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
diff --git a/ompi/mpi/c/free_mem.c b/ompi/mpi/c/free_mem.c
index 4498fc8bb1..4c65ea2339 100644
--- a/ompi/mpi/c/free_mem.c
+++ b/ompi/mpi/c/free_mem.c
@@ -50,10 +50,16 @@ int MPI_Free_mem(void *baseptr)

        If you call MPI_ALLOC_MEM with a size of 0, you get NULL
        back.  So don't consider a NULL==baseptr an error. */
+#if 0
     if (NULL != baseptr && OMPI_SUCCESS != mca_mpool_base_free(baseptr)) {
         OPAL_CR_EXIT_LIBRARY();
         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM, FUNC_NAME);
     }
+#else
+    if (NULL != baseptr) {
+        free(baseptr);
+    }
+#endif

     OPAL_CR_EXIT_LIBRARY();
     return MPI_SUCCESS;
```
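
For reference, "lazily / upon demand" means a registration cache: register a 
buffer the first time a communication call sees it and remember the result. 
Here is a toy sketch of the idea (the names `reg_entry` and 
`lookup_or_register` and the fixed-size linear cache are all made up for 
illustration; Open MPI's real rcache is far more elaborate, with interval 
trees, eviction, and hooks into free()):

```c
/* Toy registration cache: register on first use, reuse afterwards.
 * Illustration only -- not Open MPI's actual rcache. */
#include <infiniband/verbs.h>
#include <stddef.h>

#define CACHE_SLOTS 64

struct reg_entry {
    void          *base;
    size_t         len;
    struct ibv_mr *mr;
};

static struct reg_entry cache[CACHE_SLOTS];

/* Return a registration covering [buf, buf+len), registering on a miss. */
static struct ibv_mr *lookup_or_register(struct ibv_pd *pd,
                                         void *buf, size_t len)
{
    /* Cache hit: the buffer lies inside an already-registered region,
     * so no syscall is needed at communication time. */
    for (int i = 0; i < CACHE_SLOTS; ++i) {
        if (NULL != cache[i].mr &&
            (char *) buf >= (char *) cache[i].base &&
            (char *) buf + len <= (char *) cache[i].base + cache[i].len) {
            return cache[i].mr;
        }
    }

    /* Cache miss: pay the registration cost once, then remember it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (NULL != mr) {
        for (int i = 0; i < CACHE_SLOTS; ++i) {
            if (NULL == cache[i].mr) {
                cache[i].base = buf;
                cache[i].len  = len;
                cache[i].mr   = mr;
                break;
            }
        }
    }
    return mr;
}
```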

Applying the patch will at least tell us if the innards of our ALLOC_MEM/FREE_MEM 
(i.e., likely the registration/deregistration) are causing the issue.
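
If you want a quick number that is independent of CP2K, a micro-benchmark 
along these lines (just a sketch; the iteration count and the 1 MiB buffer 
size are arbitrary choices) compares the per-call cost of 
MPI_Alloc_mem/MPI_Free_mem against plain malloc/free:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const MPI_Aint size = 1 << 20;   /* 1 MiB per allocation */
    void *buf;
    double t0;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
        ((char *) buf)[0] = 0;       /* touch so nothing is optimized away */
        MPI_Free_mem(buf);
    }
    printf("MPI_Alloc_mem/MPI_Free_mem: %8.2f us/iter\n",
           (MPI_Wtime() - t0) / iters * 1e6);

    t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        buf = malloc((size_t) size);
        ((char *) buf)[0] = 0;
        free(buf);
    }
    printf("plain malloc/free:          %8.2f us/iter\n",
           (MPI_Wtime() - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}
```

With the openib BTL loaded, the MPI pair should be dramatically slower than 
malloc/free before the patch and roughly comparable after it.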




On Mar 15, 2017, at 1:27 PM, Dave Love <dave.l...@manchester.ac.uk> wrote:

Paul Kapinos <kapi...@itc.rwth-aachen.de> writes:

Nathan,
unfortunately '--mca memory_linux_disable 1' does not help on this
issue - it does not change the behaviour at all.
Note that the pathological behaviour is present in Open MPI 2.0.2 as
well as in 1.10.x, and Intel OmniPath (OPA) network-capable nodes are
affected only.

[I guess that should have been "too" rather than "only".  It's loading
the openib btl that is the problem.]

The known workaround is to disable the InfiniBand fallback with '--mca btl
^tcp,openib' on nodes with an OPA network. (On IB nodes, the same tweak
leads to a 5% performance improvement on single-node jobs;

It was a lot more than that in my cp2k test.

but obviously
disabling IB on nodes connected via IB is not a solution for
multi-node jobs, huh).

But it works OK with libfabric (ofi mtl).  Is there a problem with
libfabric?

Has anyone reported this issue to the cp2k people?  I know it's not
their problem, but I assume they'd like to know for users' sake,
particularly if it's not going to be addressed.  I wonder what else
might be affected.




--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
