I couldn't find the docs on mpool_hints, but shouldn't there be a way to
disable registration via MPI_Info rather than patching the source?
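
Something like the sketch below is what I have in mind -- note that the
"none" value is purely a guess on my part, since I couldn't find what
values mpool_hints actually accepts, so treat this as an illustration of
the mechanism rather than a documented Open MPI setting:

```c
#include <mpi.h>

/* Hypothetical sketch: ask the implementation to skip eager registration
 * for this allocation.  The "mpool_hints" value below is an assumption,
 * not a documented Open MPI hint; per the MPI standard, unrecognized
 * info keys/values are simply ignored. */
void *alloc_unregistered(MPI_Aint size)
{
    void *buf = NULL;
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "mpool_hints", "none");   /* assumed value */
    MPI_Alloc_mem(size, info, &buf);
    MPI_Info_free(&info);
    return buf;
}
```

If an info value like that were honored, an application could opt out
per-allocation without rebuilding Open MPI.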

Jeff

PS Jeff Squyres: ;-) ;-) ;-)

On Wed, Mar 15, 2017 at 5:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> It looks like there were 3 separate threads on this CP2K issue, but I
> think we developers got sidetracked because there was a bunch of talk
> in the other threads about PSM, non-IB(verbs) networks, etc.
>
> So: the real issue is that an app is experiencing a lot of slowdown when
> calling MPI_ALLOC_MEM/MPI_FREE_MEM while the openib BTL is involved.
>
> The MPI_*_MEM calls are "slow" when used with the openib BTL because
> we're registering the memory every time you call MPI_ALLOC_MEM and
> deregistering it every time you call MPI_FREE_MEM.  This was intended
> as an optimization, so that the memory is already registered when you
> invoke an MPI communication function with that buffer.  I guess we
> didn't really anticipate the case where *every* allocation goes through
> ALLOC_MEM...
>
> Meaning: if the app aggressively uses MPI_*_MEM *everywhere* -- even for
> buffers that are never used for MPI communication -- I guess you could
> end up with a lot of useless registration/deregistration.  If the app
> does that a lot, it could be the source of quite a bit of needless
> overhead.
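>
> As a quick sanity check, a tiny (hypothetical) microbenchmark along these
> lines would make the per-call cost visible by timing
> MPI_ALLOC_MEM/MPI_FREE_MEM against plain malloc/free:
>
> ```c
> /* Sketch of a microbenchmark (not part of the patch below): compare the
>  * per-call cost of MPI_Alloc_mem/MPI_Free_mem against malloc/free. */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     const int iters = 10000;
>     const MPI_Aint size = 1 << 20;   /* 1 MiB per allocation */
>     MPI_Init(&argc, &argv);
>
>     double t0 = MPI_Wtime();
>     for (int i = 0; i < iters; ++i) {
>         void *buf;
>         MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
>         ((char *) buf)[0] = 0;       /* touch so it isn't optimized away */
>         MPI_Free_mem(buf);
>     }
>     double t_mpi = MPI_Wtime() - t0;
>
>     t0 = MPI_Wtime();
>     for (int i = 0; i < iters; ++i) {
>         void *buf = malloc(size);
>         ((char *) buf)[0] = 0;
>         free(buf);
>     }
>     double t_libc = MPI_Wtime() - t0;
>
>     printf("alloc_mem/free_mem: %.2f us/iter, malloc/free: %.2f us/iter\n",
>            1e6 * t_mpi / iters, 1e6 * t_libc / iters);
>     MPI_Finalize();
>     return 0;
> }
> ```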
>
> We don't have a run-time bypass of this behavior (i.e., we assumed that
> if you're calling MPI_*_MEM, you mean to do so).  But let's try an
> experiment -- can you apply the patch below and see if it removes the
> slowness?  The patch basically removes the registration/deregistration
> from ALLOC_MEM/FREE_MEM (registration instead happens lazily, on demand,
> when buffers are passed to MPI functions):
>
> ```patch
> diff --git a/ompi/mpi/c/alloc_mem.c b/ompi/mpi/c/alloc_mem.c
> index 8c8fb8cd54..c62c8ff706 100644
> --- a/ompi/mpi/c/alloc_mem.c
> +++ b/ompi/mpi/c/alloc_mem.c
> @@ -74,6 +74,7 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>
>      OPAL_CR_ENTER_LIBRARY();
>
> +#if 0
>      if (MPI_INFO_NULL != info) {
>          int flag;
>          (void) ompi_info_get (info, "mpool_hints", MPI_MAX_INFO_VAL, info_value, &flag);
> @@ -84,6 +85,9 @@ int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
>
>      *((void **) baseptr) = mca_mpool_base_alloc ((size_t) size, (struct opal_info_t*)info,
>                                                   mpool_hints);
> +#else
> +    *((void **) baseptr) = malloc(size);
> +#endif
>      OPAL_CR_EXIT_LIBRARY();
>      if (NULL == *((void **) baseptr)) {
>          return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM,
> diff --git a/ompi/mpi/c/free_mem.c b/ompi/mpi/c/free_mem.c
> index 4498fc8bb1..4c65ea2339 100644
> --- a/ompi/mpi/c/free_mem.c
> +++ b/ompi/mpi/c/free_mem.c
> @@ -50,10 +50,16 @@ int MPI_Free_mem(void *baseptr)
>
>         If you call MPI_ALLOC_MEM with a size of 0, you get NULL
>         back.  So don't consider a NULL==baseptr an error. */
> +#if 0
>      if (NULL != baseptr && OMPI_SUCCESS != mca_mpool_base_free(baseptr)) {
>          OPAL_CR_EXIT_LIBRARY();
>          return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_NO_MEM, FUNC_NAME);
>      }
> +#else
> +    if (NULL != baseptr) {
> +        free(baseptr);
> +    }
> +#endif
>
>      OPAL_CR_EXIT_LIBRARY();
>      return MPI_SUCCESS;
> ```
>
> This will at least tell us if the innards of our ALLOC_MEM/FREE_MEM
> (i.e., likely the registration/deregistration) are causing the issue.
>
>
>
>
> > On Mar 15, 2017, at 1:27 PM, Dave Love <dave.l...@manchester.ac.uk> wrote:
> >
> > Paul Kapinos <kapi...@itc.rwth-aachen.de> writes:
> >
> >> Nathan,
> >> unfortunately '--mca memory_linux_disable 1' does not help on this
> >> issue - it does not change the behaviour at all.
> >> Note that the pathological behaviour is present in Open MPI 2.0.2 as
> >> well as in 1.10.x, and Intel OmniPath (OPA) network-capable nodes are
> >> affected only.
> >
> > [I guess that should have been "too" rather than "only".  It's loading
> > the openib btl that is the problem.]
> >
> >> The known workaround is to disable the InfiniBand fallback with '--mca
> >> btl ^tcp,openib' on nodes with an OPA network. (On IB nodes, the same
> >> tweak led to a 5% performance improvement on single-node jobs;
> >
> > It was a lot more than that in my cp2k test.
> >
> >> but obviously
> >> disabling IB on nodes connected via IB is not a solution for
> >> multi-node jobs, huh).
> >
> > But it works OK with libfabric (ofi mtl).  Is there a problem with
> > libfabric?
> >
> > Has anyone reported this issue to the cp2k people?  I know it's not
> > their problem, but I assume they'd like to know for users' sake,
> > particularly if it's not going to be addressed.  I wonder what else
> > might be affected.
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>




--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
