Just to follow up… This turned out to be a bug in OpenMPI+UCX.
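The backport itself was nothing exotic. A minimal sketch of the cherry-pick workflow, demonstrated on a throwaway local repo (the file name, branch names, and commit messages below are made up for illustration; the real fix is the commit referenced in the ompi PR linked below):

```shell
# Sketch of backporting a single fix via cherry-pick, on a throwaway
# local repo. Names and contents here are illustrative only.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.email reporter@example.com
git config user.name "Reporter"

# Baseline shared by both branches (stands in for the 3.1.2 tree).
echo "leaky" > pml_ucx.c
git add pml_ucx.c
git commit -qm "v3.1.2 baseline"
git branch release-3.1.2

# The fix lands on master.
echo "fixed" > pml_ucx.c
git commit -qam "Fix RSS growth in UCX PML"
fix_sha=$(git rev-parse HEAD)

# Backport: apply just that one commit to the release branch.
git checkout -q release-3.1.2
git cherry-pick "$fix_sha" > /dev/null
cat pml_ucx.c   # -> fixed
```

In practice the same effect can be had with `git format-patch -1 <sha>` on the upstream clone and `patch -p1` (or `git am`) in the 3.1.2 source tree before rebuilding the RPM.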
https://github.com/openucx/ucx/issues/2921
https://github.com/open-mpi/ompi/pull/5878

I cherry-picked the patch from the github master and applied it to 3.1.2. The gadget/gizmo test case has been running since yesterday without the previously observed growth in RSS.

Thanks to Yossi Itigin (yos...@mellanox.com) for the fix.

Charlie Taylor
UF Research Computing

> On Oct 4, 2018, at 5:39 PM, Charles A Taylor <chas...@ufl.edu> wrote:
>
> We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for
> that matter) built with UCX support. The leak shows up whether the “ucx” PML
> is specified for the run or not. The applications in question are arepo and
> gizmo, but I have no reason to believe that others are not affected as well.
>
> Basically, the MPI processes grow without bound until SLURM kills the job or
> the host memory is exhausted. If I configure and build with “--without-ucx”,
> the problem goes away.
>
> I didn’t see anything about this on the UCX github site, so I thought I’d
> ask here. Anyone else seeing the same or similar?
>
> What version of UCX is OpenMPI 3.1.x tested against?
>
> Regards,
>
> Charlie Taylor
> UF Research Computing
>
> Details:
> —————————————
> RHEL7.5
> OpenMPI 3.1.2 (and any other version I’ve tried).
> ucx 1.2.2-1.el7 (RH native)
> RH native IB stack
> Mellanox FDR/EDR IB fabric
> Intel Parallel Studio 2018.1.163
>
> Configuration Options:
> —————————————————
> CFG_OPTS=""
> CFG_OPTS="$CFG_OPTS CC=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" LDFLAGS=\"\" "
> CFG_OPTS="$CFG_OPTS --enable-static"
> CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
> CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
> CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
> CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
> CFG_OPTS="$CFG_OPTS --with-libevent=external"
> CFG_OPTS="$CFG_OPTS --with-hwloc=external"
> CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
> CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
> CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
> CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
> CFG_OPTS="$CFG_OPTS --with-mxm=no"
> CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
> CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
> CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
> CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"
>
> rpmbuild --ba \
>   --define '_name openmpi' \
>   --define "_version $OMPI_VER" \
>   --define "_release ${RELEASE}" \
>   --define "_prefix $PREFIX" \
>   --define '_mandir %{_prefix}/share/man' \
>   --define '_defaultdocdir %{_prefix}' \
>   --define 'mflags -j 8' \
>   --define 'use_default_rpm_opt_flags 1' \
>   --define 'use_check_files 0' \
>   --define 'install_shell_scripts 1' \
>   --define 'shell_scripts_basename mpivars' \
>   --define "configure_options $CFG_OPTS " \
>   openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users