I compiled PMIx, Slurm, and Open MPI as follows:

---pmix
./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 --disable-debug

---slurm
./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 --with-pmix=/hpc/pmix/2.2

---openmpi
./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external --with-libevent=external --with-slurm=/hpc/slurm/18.08 --with-pmix=/hpc/pmix/2.2

Everything seemed to compile fine, but when I launch with srun I get the errors below. However, if I salloc and then mpirun, it seems to work fine. I'm not quite sure where the breakdown is or how to debug it.

---
[ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
[labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
...snipped...
[labcmp6:18355] *** An error occurred in MPI_Init
[labcmp6:18355] *** reported by process [140726281390153,15]
[labcmp6:18355] *** on a NULL communicator
[labcmp6:18355] *** Unknown error
[labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18355] ***    and potentially your MPI job)
[labcmp6:18352] *** An error occurred in MPI_Init
[labcmp6:18352] *** reported by process [1677936713,12]
[labcmp6:18352] *** on a NULL communicator
[labcmp6:18352] *** Unknown error
[labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18352] ***    and potentially your MPI job)
[labcmp6:18354] *** An error occurred in MPI_Init
[labcmp6:18354] *** reported by process [140726281390153,14]
[labcmp6:18354] *** on a NULL communicator
[labcmp6:18354] *** Unknown error
[labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[labcmp6:18354] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
[labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--------------------------------------------------------------------------
[labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
[labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
srun: error: labcmp6: tasks 12-15: Exited with exit code 1
srun: error: labcmp3: tasks 0-3: Killed
srun: error: labcmp4: tasks 4-7: Killed
srun: error: labcmp5: tasks 8-11: Killed

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users