Try running with:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50  --mca btl self,openib -n 2 
--mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong
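
If the verbose output still doesn't say why the openib component's init failed, it may also be worth double-checking that the Open MPI 1.4.1 install you're invoking actually contains the openib BTL and what its parameters are set to.  Something along these lines (just a suggestion; adjust to your install):

  # confirm the openib BTL component is present in this build
  ompi_info | grep openib
  # list the openib BTL parameters and their current values
  ompi_info --param btl openib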

Also, are you saying that running the same command line with osu_latency works 
just fine?  That would be really weird...
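
One more thought on the mpi_abort_delay approach: strace will only ever show the nanosleep() loop that the delay puts the process into, so a debugger backtrace is usually more informative.  Something like the following, where <PID> is whichever process is sitting in the sleep loop:

  # attach to the looping process and dump all thread backtraces
  gdb -p <PID>
  (gdb) thread apply all bt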


On May 18, 2010, at 6:18 AM, Peter Kruse wrote:

> Hello,
> 
> Trying to run the Intel MPI Benchmarks with Open MPI 1.4.1 fails while
> initializing the openib component.  The system is Debian GNU/Linux 5.0.4.
> The command to start the job (under Torque 2.4.7) was:
> 
> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50  --mca btl self,openib -n 2
> ./IMB-MPI1 -npmin 2 PingPong
> 
> and results in these messages:
> 
> ----------------------------8<----------------------------------------------
> 
> [beo-15:20933] mca: base: components_open: Looking for btl components
> [beo-16:20605] mca: base: components_open: Looking for btl components
> [beo-15:20933] mca: base: components_open: opening btl components
> [beo-15:20933] mca: base: components_open: found loaded component openib
> [beo-15:20933] mca: base: components_open: component openib has no register
> function
> [beo-15:20933] mca: base: components_open: component openib open function
> successful
> [beo-15:20933] mca: base: components_open: found loaded component self
> [beo-15:20933] mca: base: components_open: component self has no register 
> function
> [beo-15:20933] mca: base: components_open: component self open function 
> successful
> [beo-16:20605] mca: base: components_open: opening btl components
> [beo-16:20605] mca: base: components_open: found loaded component openib
> [beo-16:20605] mca: base: components_open: component openib has no register
> function
> [beo-16:20605] mca: base: components_open: component openib open function
> successful
> [beo-16:20605] mca: base: components_open: found loaded component self
> [beo-16:20605] mca: base: components_open: component self has no register 
> function
> [beo-16:20605] mca: base: components_open: component self open function 
> successful
> [beo-15:20933] select: initializing btl component openib
> [beo-15:20933] select: init of component openib returned failure
> [beo-15:20933] select: module openib unloaded
> [beo-15:20933] select: initializing btl component self
> [beo-15:20933] select: init of component self returned success
> [beo-16:20605] select: initializing btl component openib
> [beo-16:20605] select: init of component openib returned failure
> [beo-16:20605] select: module openib unloaded
> [beo-16:20605] select: initializing btl component self
> [beo-16:20605] select: init of component self returned success
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>    Process 1 ([[4887,1],0]) is on host: beo-15
>    Process 2 ([[4887,1],1]) is on host: beo-16
>    BTLs attempted: self
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>    PML add procs failed
>    --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init_thread
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [beo-15:20933] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> orterun has exited due to process rank 0 with PID 20933 on
> node beo-15 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by orterun (as reported here).
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init_thread
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [beo-16:20605] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> [beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt /
> unreachable proc
> [beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [beo-15:20930] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
> 
> ----------------------------8<----------------------------------------------
> 
> Running another benchmark (OSU) succeeds in loading the openib component.
> 
> "ibstat |grep -i state" on both nodes gives:
> 
> ----------------------------8<----------------------------------------------
>                  State: Active
>                  Physical state: LinkUp
> ----------------------------8<----------------------------------------------
> 
> Running with "mpi_abort_delay -1" and attaching an strace on the process
> is not very helpful it loops with:
> 
> ----------------------------8<----------------------------------------------
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2aee59d44f60}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> nanosleep({5, 0}, {5, 0})               = 0
> ----------------------------8<----------------------------------------------
> 
> Does anybody have an idea what is wrong, or how we can get more debugging
> information about the initialization of the openib module?
> 
> Thanks for any help,
> 
>    Peter
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

