Sadly, we incorrectly removed the grpcomm component required to make that work. I'm restoring it this weekend, and we will be issuing a 1.6.4 shortly.

Meantime, you can use the PMI support in its place. Just configure OMPI with --with-pmi=<path-to-slurm's-pmi.h> and you will be able to direct-launch your job. Sorry for the problem.
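For example, a rebuild along these lines should get direct launch under srun working. This is only a minimal sketch: the install prefix, the /usr location, and the -j4 are placeholder assumptions, so point --with-pmi at wherever your SLURM installation actually keeps pmi.h and libpmi:

  $ ./configure --prefix=/opt/openmpi-1.6.3-pmi \
        --with-slurm \
        --with-pmi=/usr        # directory containing SLURM's include/pmi.h and lib/libpmi
  $ make -j4 && make install
  $ ompi_info | grep -i pmi    # should now list PMI-enabled components if configure found PMI
  $ salloc -n 2 srun ./IMB-MPI1

With a PMI-enabled build, srun starts and wires up the MPI processes itself, so no mpirun (and no orted) is involved in the job step.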
On Jan 12, 2013, at 7:32 AM, Beat Rubischon <b...@0x1b.ch> wrote:

> Hello!
>
> I'm currently trying to run OpenMPI 1.6.3 binaries directly under SLURM
> 2.5.1 [1]. OpenMPI is built using --with-slurm, $SLURM_STEP_RESV_PORTS
> is successfully set by SLURM. According to the error message I assume a
> shared library couldn't be found, unfortunately I'm not able to find a
> failed stat() or open() in strace.
>
> [1] http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
>
> It's probably a stupid mistake on my side. It drives me crazy as I
> already realized such setups in the early OpenMPI 1.5 days :-/
>
> Using mpirun works:
>
> $ salloc -n 2 mpirun ./IMB-MPI1
> [dalco@master imb]$ salloc -n 2 mpirun ./IMB-MPI1
> salloc: Granted job allocation 72
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
> ...
>
> Direct invocation fails:
>
> [dalco@master imb]$ salloc -n 2 srun ./IMB-MPI1
> salloc: Granted job allocation 74
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: node30.cluster
> Framework: grpcomm
> Component: hier
> --------------------------------------------------------------------------
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 93
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_grpcomm_base_open failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> ess_slurmd_module.c at line 385
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.
> There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [node30.cluster:42203] *** An error occurred in MPI_Init_thread
> [node30.cluster:42203] *** on a NULL communicator
> [node30.cluster:42203] *** Unknown error
> [node30.cluster:42203] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: node30.cluster
> PID: 42203
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: node30.cluster
> Framework: grpcomm
> Component: hier
> --------------------------------------------------------------------------
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 93
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_grpcomm_base_open failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> ess_slurmd_module.c at line 385
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [node30.cluster:42204] *** An error occurred in MPI_Init_thread
> [node30.cluster:42204] *** on a NULL communicator
> [node30.cluster:42204] *** Unknown error
> [node30.cluster:42204] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: node30.cluster
> PID: 42204
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> srun: error: node30: tasks 0-1: Exited with exit code 1
> salloc: Relinquishing job allocation 74
> salloc: Job allocation 74 has been revoked.
>
> Thanks for any input!
> Beat
>
> --
> \|/ Beat Rubischon <b...@0x1b.ch>
> ( 0-0 ) http://www.0x1b.ch/~beat/
> oOO--(_)--OOo---------------------------------------------------
> My experiences, thoughts and dreams: http://www.0x1b.ch/blog/
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users