Hello!

I'm currently trying to run OpenMPI 1.6.3 binaries directly under SLURM
2.5.1 [1]. OpenMPI is built with --with-slurm, and $SLURM_STEP_RESV_PORTS
is set correctly by SLURM. Based on the error message I assume a shared
library can't be found; unfortunately, I'm not able to spot a failed
stat() or open() in strace.

        [1] http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
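
In case the exact invocation matters, this is roughly how I ran strace
(the options and output location are reconstructed from memory, so take
them as an approximation rather than a verbatim transcript):

# one strace output file per task PID (-ff appends .<pid> to the -o name)
$ salloc -n 2 srun strace -ff -e trace=open,stat -o ./imb-trace ./IMB-MPI1
# the component named in the error below (grpcomm/hier) should live in
# mca_grpcomm_hier.so, so a failed open()/stat() on it ought to show up here
$ grep mca_grpcomm ./imb-trace.*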

It's probably a silly mistake on my side. It's driving me crazy, as I
already had such setups working back in the early OpenMPI 1.5 days :-/

Using mpirun works:

[dalco@master imb]$ salloc -n 2 mpirun ./IMB-MPI1
salloc: Granted job allocation 72
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
...
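
If it matters, this is how I would compare the environment the tasks
actually see under the two launchers, just as a sanity check that srun
passes along the same LD_LIBRARY_PATH that mpirun sets up:

# dump the relevant variables as seen by the tasks under each launcher
$ salloc -n 2 srun env | grep -i -e LD_LIBRARY_PATH -e OMPI -e OPAL
$ salloc -n 2 mpirun env | grep -i -e LD_LIBRARY_PATH -e OMPI -e OPAL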

Direct invocation via srun fails:

[dalco@master imb]$ salloc -n 2 srun ./IMB-MPI1
salloc: Granted job allocation 74
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node30.cluster
Framework: grpcomm
Component: hier
--------------------------------------------------------------------------
[node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
base/ess_base_std_app.c at line 93
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_grpcomm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
ess_slurmd_module.c at line 385
[node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node30.cluster:42203] *** An error occurred in MPI_Init_thread
[node30.cluster:42203] *** on a NULL communicator
[node30.cluster:42203] *** Unknown error
[node30.cluster:42203] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: node30.cluster
  PID:        42203
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node30.cluster
Framework: grpcomm
Component: hier
--------------------------------------------------------------------------
[node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
base/ess_base_std_app.c at line 93
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_grpcomm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
ess_slurmd_module.c at line 385
[node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node30.cluster:42204] *** An error occurred in MPI_Init_thread
[node30.cluster:42204] *** on a NULL communicator
[node30.cluster:42204] *** Unknown error
[node30.cluster:42204] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: node30.cluster
  PID:        42204
--------------------------------------------------------------------------
srun: error: node30: tasks 0-1: Exited with exit code 1
salloc: Relinquishing job allocation 74
salloc: Job allocation 74 has been revoked.
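
In case it helps, this is how I'd double-check on the compute node that
the component file itself is present and that its dependencies resolve
(the install prefix below is just a placeholder, not my real path):

# does the compute node still see the hier component and its libraries?
$ salloc -w node30 -n 1 srun ompi_info | grep grpcomm
$ salloc -w node30 -n 1 srun ldd /opt/openmpi-1.6.3/lib/openmpi/mca_grpcomm_hier.so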

Thanks for any input!
Beat

-- 
     \|/                           Beat Rubischon <b...@0x1b.ch>
   ( 0-0 )                             http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/
