I am facing the same problem that was reported long ago (2019) in this
mailing list thread:


https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html


but with a more recent version of Slurm, i.e.:


Slurm 21.08.8-2
PMIx 2.2.5 (pmix-2.2.5-1.el8.src.rpm)
Open MPI 4.1.5

As in the earlier report, running heterogeneous MPI jobs (OSU benchmarks)
with this Slurm + PMIx version installed on the host sporadically produces
this type of error:

>>>
slurmstepd: error:  mpi/pmix_v2: _tcp_connect: lxbk1177 [0]: pmixp_dconn_tcp.c:139: Cannot establish the connection
slurmstepd: error:  mpi/pmix_v2: pmixp_dconn_connect: lxbk1177 [0]: pmixp_dconn.h:246: Cannot establish direct connection to lxbk1177 (0)
slurmstepd: error:  mpi/pmix_v2: _process_extended_hdr: lxbk1177 [0]: pmixp_server.c:738: Unable to connect to 0
slurmstepd: error:  mpi/pmix_v2: pmixp_coll_ring_check: lxbk1177 [0]: pmixp_coll_ring.c:618: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, expected is 1
slurmstepd: error:  mpi/pmix_v2: _process_server_request: lxbk1177 [0]: pmixp_server.c:942: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, coll->seq=0, seq=0
>>>

So this is indeed a very similar problem.
Additionally, when a job completes, it sometimes does not finish properly
and stays in the RUNNING state, so it has to be cancelled manually.

Does the hetjob functionality really support this case?
If yes, any ideas what could be wrong here?



Job submission details:
==================


- submit script:

sbatch --ntasks 1 --ntasks-per-core 1 --cpus-per-task 2 -p main -D ./data \
    -o %j.out.log -e %j.err.log : \
    --ntasks 1 --ntasks-per-core 1 --cpus-per-task 1 -p main -D ./data \
    -o %j.out.log -e %j.err.log ./run-file.sh



- run-file.sh:



export CONT=<std_singularity_container>.sif

srun -vv --mpi=pmix --export=ALL : $CONT collective/osu_allreduce -f -i 100 -x 10
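
Since the error only shows up sporadically, it may help to put a number on
how often it happens. Below is a minimal sketch that counts how many job
stderr logs contain the direct-connection error. It assumes the logs end up
in ./data as <jobid>.err.log, as implied by the "-D ./data -e %j.err.log"
options above. The helper name count_pmix_failures is only illustrative and
is not part of Slurm.

```shell
#!/bin/sh
# Count job stderr logs that contain the PMIx direct-connection error.
# Assumption: logs are named <jobid>.err.log inside the given directory.
count_pmix_failures() {
    dir=${1:-./data}
    total=0
    failed=0
    for f in "$dir"/*.err.log; do
        [ -e "$f" ] || continue    # the glob matched nothing
        total=$((total + 1))
        if grep -q 'Cannot establish the connection' "$f"; then
            failed=$((failed + 1))
        fi
    done
    # Print "<logs with the error>/<logs inspected>"
    echo "$failed/$total"
}

count_pmix_failures ./data
```

The output has the form failed/total, so it can be compared across repeated
submissions of the same job.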




---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz
