It looks like there were 3 separate threads on this CP2K issue, but I think we
developers got sidetracked because there was a bunch of talk in the other
threads about PSM, non-IB(verbs) networks, etc.
So: the real issue is that an app experiences a lot of slowdown when calling
MPI_ALLOC_MEM/MPI_FREE_MEM while the openib BTL is in use.
The MPI_*_MEM calls are "slow" when used with the openib BTL because we're
registering the memory every time you call MPI_ALLOC_MEM and deregistering the
memory every time you call MPI_FREE_MEM. This was intended as an optimization
such that the memory is already registered when you invoke an MPI
communications function with that buffer. I guess we didn't really anticipate
the case where *every* allocation goes through ALLOC_MEM...
Meaning: if the app aggressively uses MPI_*_MEM *everywhere* -- even for
buffers that are never used for MPI communication -- you end up with lots of
useless registration/deregistration. If the app does that often enough, it can
add up to quite a lot of needless overhead.
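If it's useful to see that effect in isolation from CP2K, here's a quick
micro-benchmark sketch (my own throwaway code, not from the app; the buffer
size and iteration count are arbitrary) that times an
MPI_ALLOC_MEM/MPI_FREE_MEM loop against plain malloc/free. With the openib BTL
loaded, the first loop should be dramatically slower even though it never
communicates:
```c
/* Throwaway micro-benchmark: compare MPI_Alloc_mem/MPI_Free_mem cost to
   plain malloc/free.  No communication happens, so any big difference is
   allocator-side overhead (e.g., registration/deregistration). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    const MPI_Aint bytes = 1 << 20;   /* 1 MiB per allocation (arbitrary) */
    void *buf;
    double t0;

    MPI_Init(&argc, &argv);

    /* Loop 1: MPI-managed allocation/free */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf);
        MPI_Free_mem(buf);
    }
    printf("MPI_Alloc_mem/MPI_Free_mem: %.3f us/iter\n",
           1e6 * (MPI_Wtime() - t0) / iters);

    /* Loop 2: plain malloc/free for comparison */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        buf = malloc(bytes);
        free(buf);
    }
    printf("malloc/free:                %.3f us/iter\n",
           1e6 * (MPI_Wtime() - t0) / iters);

    MPI_Finalize();
    return 0;
}
```
Compile with mpicc and run it with and without "--mca btl ^openib"; the gap
between the two timings should collapse when openib is excluded.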
We don't have a run-time bypass of this behavior (i.e., we assumed that if
you're calling MPI_*_MEM, you mean to do so). But let's try an experiment --
can you try applying this patch and see if it removes the slowness? This patch
basically removes the registration / deregistration with ALLOC/FREE_MEM (and
instead handles it lazily / upon demand when buffers are passed to MPI
functions):
```patch
diff --git a/ompi/mpi/c/alloc_mem.c b/ompi/mpi/c/alloc_mem.c
index 8c8fb8cd54..c62c8ff706 100644
--- a/ompi/mpi/c/alloc_mem.c
+++ b/ompi/mpi/c/alloc_mem.c
@@ -74,6 +74,7 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
OPAL_CR_ENTER_LIBRARY();
+#if 0
if (MPI_INFO_NULL != info) {
int flag;
        (void) ompi_info_get (info, "mpool_hints", MPI_MAX_INFO_VAL, info_value, &flag);
@@ -84,6 +85,9 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
    *((void **) baseptr) = mca_mpool_base_alloc ((size_t) size, (struct opal_info_t *) info,
                                                 mpool_hints);
+#else
+ *((void **) baseptr) = malloc(size);
+#endif
OPAL_CR_EXIT_LIBRARY();
if (NULL == *((void **) baseptr)) {
return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
diff --git a/ompi/mpi/c/free_mem.c b/ompi/mpi/c/free_mem.c
index 4498fc8bb1..4c65ea2339 100644
--- a/ompi/mpi/c/free_mem.c
+++ b/ompi/mpi/c/free_mem.c
@@ -50,10 +50,16 @@ int MPI_Free_mem(void *baseptr)
If you call MPI_ALLOC_MEM with a size of 0, you get NULL
back. So don't consider a NULL==baseptr an error. */
+#if 0
if (NULL != baseptr && OMPI_SUCCESS != mca_mpool_base_free(baseptr)) {
OPAL_CR_EXIT_LIBRARY();
return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
FUNC_NAME);
}
+#else
+ if (NULL != baseptr) {
+ free(baseptr);
+ }
+#endif
OPAL_CR_EXIT_LIBRARY();
return MPI_SUCCESS;
```
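To test it: apply the patch to your Open MPI source tree (e.g., with "git
apply" or "patch -p1" from the top-level directory), rebuild and re-install,
and then re-run the CP2K case (or the micro-benchmark above) that showed the
slowdown.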
This will at least tell us if the innards of our ALLOC_MEM/FREE_MEM (i.e.,
likely the registration/deregistration) are causing the issue.
> On Mar 15, 2017, at 1:27 PM, Dave Love <[email protected]> wrote:
>
> Paul Kapinos <[email protected]> writes:
>
>> Nathan,
>> unfortunately '--mca memory_linux_disable 1' does not help on this
>> issue - it does not change the behaviour at all.
>> Note that the pathological behaviour is present in Open MPI 2.0.2 as
>> well as in /1.10.x, and Intel OmniPath (OPA) network-capable nodes are
>> affected only.
>
> [I guess that should have been "too" rather than "only". It's loading
> the openib btl that is the problem.]
>
>> The known workaround is to disable InfiniBand failback by '--mca btl
>> ^tcp,openib' on nodes with OPA network. (On IB nodes, the same tweak
> led to 5% performance improvement on single-node jobs;
>
> It was a lot more than that in my cp2k test.
>
>> but obviously
>> disabling IB on nodes connected via IB is not a solution for
>> multi-node jobs, huh).
>
> But it works OK with libfabric (ofi mtl). Is there a problem with
> libfabric?
>
> Has anyone reported this issue to the cp2k people? I know it's not
> their problem, but I assume they'd like to know for users' sake,
> particularly if it's not going to be addressed. I wonder what else
> might be affected.
--
Jeff Squyres
[email protected]
_______________________________________________
users mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/users