Hi,

We are trying to enable PMIx support for OpenMPI in the OpenHPC project, but
we are running into issues.
With PMIx enabled, jobs submitted via Slurm and/or OpenPBS seem to just hang
in network calls. Without PMIx (i.e. no --with-pmix=...) the same jobs run
successfully, and everything also works fine when using MPICH
("module swap openmpi4 mpich/3.4.3-ofi").

$ ompi_info | grep -i pmix
  Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
'--disable-static' '--enable-builtin-atomics' '--with-sge'
'--enable-mpi-cxx' '--with-hwloc=/opt/ohpc/pub/libs/hwloc'
'--with-pmix=/opt/ohpc/admin/pmix' '--with-libevent=external'
'--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
'--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0' '--without-verbs'
'--with-tm=/opt/pbs/'
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component
v4.1.5)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.5)


In an environment with three bare-metal machines (one manager and two
compute nodes) managed by OpenPBS, "strace mpirun hostname" ends with:

...
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 21
setsockopt(21, SOL_SOCKET, SO_LINGER, {l_onoff=1, l_linger=5}, 8) = 0
connect(21, {sa_family=AF_INET, sin_port=htons(15003),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(21, "PKTV1\0\0\0\0, +2+22+26181.openhpc-o"..., 11307) = 11307
write(21, "PKTV1\0\0\0\0, +2+22+26181.openhpc-o"..., 11307) = 11307
ppoll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN},
{fd=20, events=POLLIN}], 4, NULL, NULL, 0

More complete output at https://pastebin.com/hmVeCifF
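
Note that the trace above only follows the top-level mpirun process; if it
helps, we can re-run it with fork following so any child processes are
captured as well (plain strace options, one output file per PID):

$ strace -f -ff -o /tmp/mpirun-trace mpirun hostname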

Another example is our simplified CI on GitHub Actions, where we use Slurm:

 Log file: ./tests/rms-harness/tests/family-gnu12-openmpi4/test_harness.log
 not ok 1 [RMS/harness] Verify zero exit code from MPI job runs OK
(slurm/gnu12/openmpi4)
 (from function `run_mpi_binary' in file ./common/functions, line 399,
  in test file test_harness, line 23)
   `run_mpi_binary ./mpi_exit 0 $NODES $TASKS' failed
 job script = /tmp/job.ohpc.18553
 Batch job 6 submitted

 Job 6 failed...
 Reason=NonZeroExitCode

 [prun] Master compute host = bd2a644aa87c
 [prun] Resource manager = slurm
 [prun] Launch cmd = srun --mpi=pmix ./mpi_exit 0 (family=openmpi4)
 srun: launch/slurm: launch_p_step_launch: StepId=6.0 aborted before
step completely launched.
 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
 srun: error: task 1 launch failed: Unspecified error
 srun: error: c0: task 0: Killed
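
On the Slurm side, one quick check is whether the pmix plugin that srun is
asked for is actually available (standard srun option, prints the supported
--mpi types):

 $ srun --mpi=list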



Any hints on how to debug this?
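
For example, is there an MCA verbosity setting that would show where the PMIx
handshake stalls? Something along these lines, assuming the usual framework
verbosity parameters still apply to 4.1.x:

$ mpirun --mca pmix_base_verbose 10 --mca plm_base_verbose 10 hostname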

OpenMPI 4.1.5
(https://github.com/openhpc/ohpc/blob/3.x/components/mpi-families/openmpi/SPECS/openmpi.spec)
PMIx 4.2.4
(https://github.com/openhpc/ohpc/blob/3.x/components/rms/pmix/SPECS/pmix.spec)

Regards,
Martin
