Hello!

I'm currently trying to run OpenMPI 1.6.3 binaries directly under SLURM 2.5.1 [1]. OpenMPI is built with --with-slurm, and $SLURM_STEP_RESV_PORTS is successfully set by SLURM. Judging from the error message below, I assume a shared library couldn't be found, but unfortunately I can't spot a failed stat() or open() in an strace.

[1] http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
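In case it helps with debugging, these are the kinds of checks I can run to compare what a login shell and an srun-launched task see; the prefix /opt/openmpi-1.6.3 below is only a placeholder for my actual install path:

$ ompi_info | grep grpcomm                       # components visible from a login shell
$ srun -n 1 ompi_info | grep grpcomm             # components visible inside a SLURM step
$ srun -n 1 sh -c 'ls /opt/openmpi-1.6.3/lib/openmpi/mca_grpcomm_*'
$ srun -n 1 env | grep -E 'LD_LIBRARY_PATH|OPAL_PREFIX|OMPI_|SLURM_STEP_RESV_PORTS'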
It's probably a stupid mistake on my side. It drives me crazy, as I already had such setups working in the early OpenMPI 1.5 days :-/

Using mpirun works:

[dalco@master imb]$ salloc -n 2 mpirun ./IMB-MPI1
salloc: Granted job allocation 72
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
...

Direct invocation fails:

[dalco@master imb]$ salloc -n 2 srun ./IMB-MPI1
salloc: Granted job allocation 74
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node30.cluster
Framework: grpcomm
Component: hier
--------------------------------------------------------------------------
[node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 93
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_grpcomm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file ess_slurmd_module.c at line 385
[node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node30.cluster:42203] *** An error occurred in MPI_Init_thread
[node30.cluster:42203] *** on a NULL communicator
[node30.cluster:42203] *** Unknown error
[node30.cluster:42203] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: node30.cluster
  PID:        42203
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node30.cluster
Framework: grpcomm
Component: hier
--------------------------------------------------------------------------
[node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 93
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_grpcomm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file ess_slurmd_module.c at line 385
[node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node30.cluster:42204] *** An error occurred in MPI_Init_thread
[node30.cluster:42204] *** on a NULL communicator
[node30.cluster:42204] *** Unknown error
[node30.cluster:42204] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: node30.cluster
  PID:        42204
--------------------------------------------------------------------------
srun: error: node30: tasks 0-1: Exited with exit code 1
salloc: Relinquishing job allocation 74
salloc: Job allocation 74 has been revoked.

Thanks for any input!
Beat

--
     \|/                           Beat Rubischon <b...@0x1b.ch>
   ( 0-0 )                             http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
My experiences, thoughts and dreams:    http://www.0x1b.ch/blog/
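P.S. For completeness, the SLURM side is set up along the lines of the mpi_guide above. A minimal sketch of how I understand the direct-launch configuration (the port range is only an example, not my literal slurm.conf):

# slurm.conf: select the Open MPI plugin and reserve a port range for the steps
MpiDefault=openmpi
MpiParams=ports=12000-12999

# direct launch; --resv-ports asks SLURM to export SLURM_STEP_RESV_PORTS to the tasks
$ salloc -n 2 srun --resv-ports ./IMB-MPI1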