Hello,

I can comment from the PMIx/UCX perspective. Do you have "MpiDefault=pmix" set in your slurm.conf, or did you pass "--mpi=pmix" to your srun command? If not, you are not running PMIx, and therefore not UCX either (UCX support exists only in the PMIx plugin). The log output you provided seems to confirm this: I don't see any traces of the PMIx plugin.
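To check the configuration side quickly, something like the following might help. It is only a rough sketch: the helper name mpi_default is made up for illustration, and the slurm.conf path in the usage comment is an assumption based on your --prefix (adjust it to wherever your config actually lives).

```shell
# Hypothetical helper (not part of Slurm): print the MpiDefault value
# found in the slurm.conf file given as $1. Prints nothing if it is unset.
mpi_default() {
  awk -F= '
    /^[ \t]*#/ { next }                      # skip full-line comments
    {
      key = $1
      gsub(/[ \t]/, "", key)                 # drop whitespace around the key
      if (tolower(key) == "mpidefault") {    # compare the key ignoring case
        val = $2
        sub(/[ \t]*#.*/, "", val)            # drop a trailing comment
        gsub(/[ \t]/, "", val)               # drop whitespace around the value
        print val
      }
    }' "$1"
}

# Usage (path is an assumption, derived from --prefix=/opt/slurm17):
#   mpi_default /opt/slurm17/etc/slurm.conf
#
# On a live cluster you can also ask Slurm directly:
#   srun --mpi=list             # list the MPI plugin types this build offers
#   srun --mpi=pmix sleep 30    # request the PMIx plugin for a single step
```

If the helper prints nothing and you do not pass --mpi=pmix to srun, the PMIx plugin is not being selected for your job steps.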
Fri, 17 Aug 2018 at 20:43, zhangtao102...@126.com <zhangtao102...@126.com>:

> Hi,
> I have installed SLURM 18.08.0-0pre2 on my cluster based on RHEL7.4
> (x86_64).
> My configure parameters look like this:
> ./configure --prefix=/opt/slurm17 --with-munge=/opt/munge
> --with-pmix=/opt/pmix --with-ucx=/opt/openucx --with-hwloc=/usr
> (openucx version is 1.5.0, pmix version is 3.0.0, hwloc version is 1.11.8)
>
> After completing the installation and configuration, it looks like slurm
> is working normally. But when I submitted a simple test job with sbatch
> sleep.sh (just call srun sleep 30 at single computing node), I found that
> the job (ID=1032) state was R, but the job did not start normally on the
> computation node (no process found).
>
> The appendix is the output log of the computing node of the management
> node.
> I can't tell if the cause of this problem is related to the compilation
> parameters I specify (such as pmix, ucx), and I've never seen anything
> similar in earlier versions.
> Has anyone ever responded to a similar phenomenon with me? How to solve
> the problem?
>
> Best regards
>
> ------------------------------
> zhangtao102...@126.com
>

--
Best regards, Artem Y. Polyakov