Sadly, we incorrectly removed the grpcomm component required to make that work. I'm restoring it this weekend, and we will be issuing a 1.6.4 shortly.

Meantime, you can use the PMI support in its place. Just configure OMPI with --with-pmi=<path-to-slurm's-pmi.h> and you will be able to direct-launch your job. Sorry for the problem.
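For example, a rebuild along these lines should get direct launch under srun working. This is only a minimal sketch: the install prefix, the /usr location, and the -j4 are placeholder assumptions, so point --with-pmi at wherever your SLURM installation actually keeps pmi.h and libpmi:

  $ ./configure --prefix=/opt/openmpi-1.6.3-pmi \
        --with-slurm \
        --with-pmi=/usr        # directory containing SLURM's include/pmi.h and lib/libpmi
  $ make -j4 && make install
  $ ompi_info | grep -i pmi    # should now list PMI-enabled components if configure found PMI
  $ salloc -n 2 srun ./IMB-MPI1

With a PMI-enabled build, srun starts and wires up the MPI processes itself, so no mpirun (and no orted) is involved in the job step.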
On Jan 12, 2013, at 7:32 AM, Beat Rubischon <b...@0x1b.ch> wrote:

> Hello!
>
> I'm currently trying to run OpenMPI 1.6.3 binaries directly under SLURM
> 2.5.1 [1]. OpenMPI is built using --with-slurm, $SLURM_STEP_RESV_PORTS
> is successfully set by SLURM. According to the error message I assume a
> shared library couldn't be found, unfortunately I'm not able to find a
> failed stat() or open() in strace.
>
> [1] http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
>
> It's probably a stupid mistake on my side. It drives me crazy as I
> already realized such setups in the early OpenMPI 1.5 days :-/
>
> Using mpirun works:
>
> $ salloc -n 2 mpirun ./IMB-MPI1
> [dalco@master imb]$ salloc -n 2 mpirun ./IMB-MPI1
> salloc: Granted job allocation 72
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
> ...
>
> Direct invocation fails:
>
> [dalco@master imb]$ salloc -n 2 srun ./IMB-MPI1
> salloc: Granted job allocation 74
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: node30.cluster
> Framework: grpcomm
> Component: hier
> --------------------------------------------------------------------------
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 93
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_grpcomm_base_open failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> ess_slurmd_module.c at line 385
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.
> There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [node30.cluster:42203] *** An error occurred in MPI_Init_thread
> [node30.cluster:42203] *** on a NULL communicator
> [node30.cluster:42203] *** Unknown error
> [node30.cluster:42203] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: node30.cluster
> PID: 42203
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: node30.cluster
> Framework: grpcomm
> Component: hier
> --------------------------------------------------------------------------
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 93
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_grpcomm_base_open failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> ess_slurmd_module.c at line 385
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [node30.cluster:42204] *** An error occurred in MPI_Init_thread
> [node30.cluster:42204] *** on a NULL communicator
> [node30.cluster:42204] *** Unknown error
> [node30.cluster:42204] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: node30.cluster
> PID: 42204
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> srun: error: node30: tasks 0-1: Exited with exit code 1
> salloc: Relinquishing job allocation 74
> salloc: Job allocation 74 has been revoked.
>
> Thanks for any input!
> Beat
>
> --
> \|/ Beat Rubischon <b...@0x1b.ch>
> ( 0-0 ) http://www.0x1b.ch/~beat/
> oOO--(_)--OOo---------------------------------------------------
> My experiences, thoughts and dreams: http://www.0x1b.ch/blog/
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users