Hello,

trying to run the Intel MPI Benchmarks with Open MPI 1.4.1 fails while
initializing the openib component.  The system is Debian GNU/Linux 5.0.4.
The command to start the job (under Torque 2.4.7) was:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 ./IMB-MPI1 -npmin 2 PingPong

and results in these messages:

----------------------------8<----------------------------------------------

[beo-15:20933] mca: base: components_open: Looking for btl components
[beo-16:20605] mca: base: components_open: Looking for btl components
[beo-15:20933] mca: base: components_open: opening btl components
[beo-15:20933] mca: base: components_open: found loaded component openib
[beo-15:20933] mca: base: components_open: component openib has no register function
[beo-15:20933] mca: base: components_open: component openib open function successful
[beo-15:20933] mca: base: components_open: found loaded component self
[beo-15:20933] mca: base: components_open: component self has no register function
[beo-15:20933] mca: base: components_open: component self open function successful
[beo-16:20605] mca: base: components_open: opening btl components
[beo-16:20605] mca: base: components_open: found loaded component openib
[beo-16:20605] mca: base: components_open: component openib has no register function
[beo-16:20605] mca: base: components_open: component openib open function successful
[beo-16:20605] mca: base: components_open: found loaded component self
[beo-16:20605] mca: base: components_open: component self has no register function
[beo-16:20605] mca: base: components_open: component self open function successful
[beo-15:20933] select: initializing btl component openib
[beo-15:20933] select: init of component openib returned failure
[beo-15:20933] select: module openib unloaded
[beo-15:20933] select: initializing btl component self
[beo-15:20933] select: init of component self returned success
[beo-16:20605] select: initializing btl component openib
[beo-16:20605] select: init of component openib returned failure
[beo-16:20605] select: module openib unloaded
[beo-16:20605] select: initializing btl component self
[beo-16:20605] select: init of component self returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[4887,1],0]) is on host: beo-15
  Process 2 ([[4887,1],1]) is on host: beo-16
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[beo-15:20933] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
orterun has exited due to process rank 0 with PID 20933 on
node beo-15 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[beo-16:20605] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[beo-15:20930] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure

----------------------------8<----------------------------------------------

Running another benchmark (OSU) succeeds in loading the openib component.
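That run used the same BTL selection; the command was along these lines (a sketch from memory, the exact OSU binary and path may differ):

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 ./osu_latency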

"ibstat |grep -i state" on both nodes gives:

----------------------------8<----------------------------------------------
                State: Active
                Physical state: LinkUp
----------------------------8<----------------------------------------------
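For a second opinion, the port state could also be checked with ibv_devinfo from libibverbs (not yet done here); it should report PORT_ACTIVE:

ibv_devinfo | grep -i state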

Running with "mpi_abort_delay -1" and attaching an strace on the process
is not very helpful it loops with:

----------------------------8<----------------------------------------------
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART, 0x2aee59d44f60}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({5, 0}, {5, 0})               = 0
----------------------------8<----------------------------------------------
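For reference, the attach was done roughly like this; since mpi_abort_delay -1 just keeps the process sleeping, a gdb backtrace could be taken the same way (untried so far):

----------------------------8<----------------------------------------------
strace -f -p <PID>   # produced the loop above
gdb -p <PID>         # then "bt" to see the stack at the abort
----------------------------8<----------------------------------------------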

Does anybody have an idea what is wrong, or how we can get more debugging
information about the initialization of the openib module?
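
In case it is useful, this is what I plan to try next (parameter names taken from ompi_info and the Open MPI FAQ, so treat this as a sketch):

----------------------------8<----------------------------------------------
# list all MCA parameters of the openib BTL
ompi_info --param btl openib

# re-run with maximum BTL verbosity
mpirun.openmpi-1.4.1 --mca btl_base_verbose 100 --mca btl_openib_verbose 1 \
    --mca btl self,openib -n 2 ./IMB-MPI1 -npmin 2 PingPong

# a too-low locked-memory limit inside the Torque job is a known
# cause of openib init failures, so check it from within the job:
ulimit -l
----------------------------8<----------------------------------------------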

Thanks for any help,

  Peter
