I am facing the same problem that was reported a while ago (2019) on this mailing list:
https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html
but with a more recent version of Slurm, i.e.:

  Slurm 21.08.8-2
  PMIx 2.2.5 (pmix-2.2.5-1.el8.src.rpm)
  Open MPI 4.1.5

As in the earlier report, running heterogeneous MPI jobs (OSU benchmarks) with this Slurm + PMIx version installed on the host sporadically gives this type of error:

>>>
slurmstepd: error: mpi/pmix_v2: _tcp_connect: lxbk1177 [0]: pmixp_dconn_tcp.c:139: Cannot establish the connection
slurmstepd: error: mpi/pmix_v2: pmixp_dconn_connect: lxbk1177 [0]: pmixp_dconn.h:246: Cannot establish direct connection to lxbk1177 (0)
slurmstepd: error: mpi/pmix_v2: _process_extended_hdr: lxbk1177 [0]: pmixp_server.c:738: Unable to connect to 0
slurmstepd: error: mpi/pmix_v2: pmixp_coll_ring_check: lxbk1177 [0]: pmixp_coll_ring.c:618: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, expected is 1
slurmstepd: error: mpi/pmix_v2: _process_server_request: lxbk1177 [0]: pmixp_server.c:942: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, coll->seq=0, seq=0
>>>

So it is a very similar problem indeed. Additionally, when the job completes, from time to time it does not finish properly and stays in the RUNNING state, and one needs to cancel the job manually.

Does the hetjob functionality really support this case? If yes, any ideas what could be wrong here?

Job submission details:
==================

- submit script:

    sbatch --ntasks 1 --ntasks-per-core 1 --cpus-per-task 2 -p main -D ./data -o %j.out.log -e %j.err.log \
         : --ntasks 1 --ntasks-per-core 1 --cpus-per-task 1 -p main -D ./data -o %j.out.log -e %j.err.log \
         ./run-file.sh

- run-file.sh:

    export CONT=<std_singularity_container>.sif
    srun -vv --mpi=pmix --export=ALL : $CONT collective/osu_allreduce -f -i 100 -x 10

---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz