I compiled PMIx, Slurm, and Open MPI with the following configure options:

---pmix
./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 \
    --disable-debug
---slurm
./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 \
    --with-pmix=/hpc/pmix/2.2
---openmpi
./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external \
    --with-libevent=external --with-slurm=/hpc/slurm/18.08 \
    --with-pmix=/hpc/pmix/2.2

Everything seemed to compile fine, but when I launch with srun I get the
errors below. However, if I salloc and then mpirun, it seems to work
fine. I'm not sure where the breakdown is or how to debug it.
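
A minimal sketch of two checks that might help narrow this down, assuming
the srun and ompi_info from the builds above are first on PATH. has_plugin
is a hypothetical helper written for this example, not part of Slurm or
Open MPI, and neither check is a confirmed diagnosis of the failure.

```shell
# Hypothetical helper (not part of Slurm or Open MPI): succeeds if the
# text on stdin contains the given plugin name as a whole word.
has_plugin() {
    grep -qw -- "$1"
}

# Ask srun which MPI plugin types it was built with. Both stdout and
# stderr are captured because srun may print the list on either.
if command -v srun >/dev/null 2>&1; then
    if srun --mpi=list 2>&1 | has_plugin pmix_v2; then
        echo "srun offers pmix_v2"
    else
        echo "srun does not list pmix_v2"
    fi
fi

# Show which PMIx components this Open MPI build reports.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i pmix || echo "ompi_info reports no pmix components"
fi
```

If both tools report PMIx support, the next thing to compare would be
whether srun and mpirun end up loading the same libpmix at run time.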

---

[ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
[labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
...snipped...
[labcmp6:18355] *** An error occurred in MPI_Init
[labcmp6:18355] *** reported by process [140726281390153,15]
[labcmp6:18355] *** on a NULL communicator
[labcmp6:18355] *** Unknown error
[labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18355] ***    and potentially your MPI job)
[labcmp6:18352] *** An error occurred in MPI_Init
[labcmp6:18352] *** reported by process [1677936713,12]
[labcmp6:18352] *** on a NULL communicator
[labcmp6:18352] *** Unknown error
[labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18352] ***    and potentially your MPI job)
[labcmp6:18354] *** An error occurred in MPI_Init
[labcmp6:18354] *** reported by process [140726281390153,14]
[labcmp6:18354] *** on a NULL communicator
[labcmp6:18354] *** Unknown error
[labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18354] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
[labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--------------------------------------------------------------------------
[labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
srun: error: labcmp6: tasks 12-15: Exited with exit code 1
srun: error: labcmp3: tasks 0-3: Killed
srun: error: labcmp4: tasks 4-7: Killed
srun: error: labcmp5: tasks 8-11: Killed