Hi there,

Using Slurm v24.11.0 together with Open MPI 5.0.7 built with OpenPMIx v5.0.6, I 
am facing a systematic crash in the process wiring-up phase when launching a 
standard MPI job (OSU benchmarks) on our new AMD compute nodes 
(AMD EPYC 9654, 192 physical cores + HT) running Rocky Linux 9.4.

The typical error reads:

slurmstepd: error:  mpi/pmix_v5: pmixp_p2p_send: ccexe0094 [4]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
slurmstepd: error:  mpi/pmix_v5: _slurm_send: ccexe0094 [4]: pmixp_server.c:1581: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.656.0, size = 46979, hostlist: (null)
srun: error: Node failure on ccexe0091


After such an error, as you can see, the node moves to the DOWN state.
It looks like the slurmstepd PMIx server cannot use the local socket at 
/var/spool/slurmd/stepd.slurm.pmix.job_id.0 for inter-node communication.
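
For anyone who wants to look at the socket on an affected node, here is a minimal Python sketch that only inspects the socket file (it does not connect to it, so it should not disturb a running step). The helper name describe_socket_path is hypothetical, and I am assuming that checking existence, file type, permissions and owner is a useful first look; it is not a confirmed diagnosis of the failure.

```python
import os
import stat
import sys


def describe_socket_path(path):
    """Return basic facts about `path` without connecting to it."""
    try:
        info = os.stat(path)
    except OSError as exc:
        # Missing file, permission problem on a parent directory, etc.
        return {"path": path, "exists": False, "error": exc.strerror}
    return {
        "path": path,
        "exists": True,
        "is_socket": stat.S_ISSOCK(info.st_mode),  # True for a Unix-domain socket
        "mode": oct(stat.S_IMODE(info.st_mode)),   # permission bits only
        "uid": info.st_uid,                        # numeric owner
    }


if __name__ == "__main__":
    # Paths to inspect are passed on the command line.
    for p in sys.argv[1:]:
        print(describe_socket_path(p))
```

It could be run on the node while the job step is starting, passing the path from the error message (here /var/spool/slurmd/stepd.slurm.pmix.656.0) as the argument.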

  *   On a single AMD node (same Slurm version, same cluster setup), wiring-up 
works smoothly even at core saturation (192 cores used).
  *   On Intel nodes (Intel Xeon Gold 6248R, 48 cores), wiring-up works even with 
multiple nodes, without any problem.
  *   When the problematic AMD nodes are set up as dynamic 
nodes <https://slurm.schedmd.com/dynamic_nodes.html>, 
the wiring-up phase with multiple nodes works perfectly, without any issue.

Has anybody experienced this kind of problem?
Any idea what could be the reason for that?




Cheers,

Denis

---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
