Just to follow up…

This turned out to be a bug in OpenMPI+UCX.

   https://github.com/openucx/ucx/issues/2921
   https://github.com/open-mpi/ompi/pull/5878

I cherry-picked the patch from the GitHub master and applied it to 3.1.2.  The
gadget/gizmo test case has been running since yesterday without the previously
observed growth in RSS.
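
In case it saves someone the trouble, backporting is straightforward.  A rough
sketch of the sort of thing that works (illustrative only; the PR is written
against master, so it may need minor adjustment, and the paths/filenames below
are just examples, not exactly what we ran):

   # fetch the fix from the PR and apply it to the 3.1.2 source tree
   curl -LO https://github.com/open-mpi/ompi/pull/5878.patch
   tar xjf openmpi-3.1.2.tar.bz2
   cd openmpi-3.1.2
   patch -p1 < ../5878.patch
   # ...then rebuild/repackage as usual (we build via rpmbuild; see the
   # configuration in the quoted message below)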

Thanks to Yossi Itigin (yos...@mellanox.com) for the fix.

Charlie Taylor
UF Research Computing

> On Oct 4, 2018, at 5:39 PM, Charles A Taylor <chas...@ufl.edu> wrote:
> 
> 
> We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for
> that matter) built with UCX support.  The leak shows up whether the "ucx" PML
> is specified for the run or not.  The applications in question are arepo and
> gizmo, but I have no reason to believe that others are not affected as well.
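> 
> For what it's worth, the behavior is the same whether I force the PML or let
> Open MPI choose on its own.  Illustrative invocations only (the real jobs run
> under SLURM, and the application/arguments here are just placeholders):
> 
>    # explicitly request the ucx PML
>    mpirun --mca pml ucx ./gizmo gizmo.params
>    # vs. letting Open MPI pick the PML itself
>    mpirun ./gizmo gizmo.params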
> 
> Basically the MPI processes grow without bound until SLURM kills the job or
> the host memory is exhausted.
> If I configure and build with "--without-ucx" the problem goes away.
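> 
> What I mean by "grow without bound": sampling RSS on a compute node while the
> job runs shows it climbing steadily until the limit is hit.  A crude way to
> watch it (the process name is just a placeholder):
> 
>    # print per-rank PID and RSS (kB) once a minute
>    while true; do ps -C gizmo -o pid=,rss=,comm=; sleep 60; done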
> 
> I didn’t see anything about this on the UCX github site so I thought I’d ask 
> here.  Anyone else seeing the same or similar?
> 
> What version of UCX is OpenMPI 3.1.x tested against?
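> 
> For anyone comparing setups, the installed UCX version and whether Open MPI
> actually picked up UCX support can be checked with something like:
> 
>    ucx_info -v               # reports the installed UCX version
>    ompi_info | grep -i ucx   # lists UCX-related components Open MPI was built with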
> 
> Regards,
> 
> Charlie Taylor
> UF Research Computing
> 
> Details:
> —————————————
> RHEL7.5
> OpenMPI 3.1.2 (and any other version I’ve tried).
> ucx 1.2.2-1.el7 (RH native)
> RH native IB stack
> Mellanox FDR/EDR IB fabric
> Intel Parallel Studio 2018.1.163
> 
> Configuration Options:
> —————————————————
> CFG_OPTS=""
> CFG_OPTS="$CFG_OPTS CC=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" LDFLAGS=\"\" "
> CFG_OPTS="$CFG_OPTS --enable-static"
> CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
> CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
> CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
> CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
> CFG_OPTS="$CFG_OPTS --with-libevent=external"
> CFG_OPTS="$CFG_OPTS --with-hwloc=external"
> CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
> CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
> CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
> CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
> CFG_OPTS="$CFG_OPTS --with-mxm=no"
> CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
> CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
> CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
> CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"
> 
> rpmbuild --ba \
>         --define '_name openmpi' \
>         --define "_version $OMPI_VER" \
>         --define "_release ${RELEASE}" \
>         --define "_prefix $PREFIX" \
>         --define '_mandir %{_prefix}/share/man' \
>         --define '_defaultdocdir %{_prefix}' \
>         --define 'mflags -j 8' \
>         --define 'use_default_rpm_opt_flags 1' \
>         --define 'use_check_files 0' \
>         --define 'install_shell_scripts 1' \
>         --define 'shell_scripts_basename mpivars' \
>         --define "configure_options $CFG_OPTS " \
>         openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log
> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
