On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote:
My ultimate goal is to get Open MPI working with the openIB stack.
First, I had installed lam-mpi. I know it doesn't have support for
openIB, but it's still relevant to some of the questions I will ask.
Here is the setup I have:
Yes, keep in mind throughout that while Open MPI does support MVAPI,
LAM/MPI will fall back to using IP over IB for communication.
I have two machines, pe830-01 and pe830-02. Both have an ethernet
interface and an HCA interface. The IP addresses follow:
            eth0          ib0
pe830-01    10.12.4.32    192.168.1.32
pe830-02    10.12.4.34    192.168.1.34
So this has worked even though the lamhosts file is configured to use
the ib0 interfaces. I further verified with the tcpdump command that
none of this went to eth0.
Anyhow, if I change the lamhosts file to use the eth0 IPs, things work
just the same with no issues. And in that case I see some traffic on
eth0 with tcpdump.
Ok, so at least it sounds like your TCP network is sanely configured.
Now, when I installed and used Open MPI, things didn't work as
easily. Here is what happens, after recompiling the sources with the
mpicc that comes with open-mpi:
$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
    --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
    --host 10.12.4.34,10.12.4.32 /path/to/hello_world
Hello, world, I am 0 of 2 and this is on : pe830-02.
Hello, world, I am 1 of 2 and this is on: pe830-01.
So far so good, using eth0 interfaces.. hello_world works just
fine. Now,
when i try the broadcast program:
In reality, you always need to include two BTLs when specifying the
list yourself: the one you want to use (mvapi, openib, tcp, etc.) and
"self", e.g. --mca btl tcp,self. You can run into issues otherwise.
$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
    --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
    --host 10.12.4.34,10.12.4.32 /path/to/broadcast
It just hangs there; it doesn't show me the "Enter the vector length:"
prompt. So I just enter a number anyway, since I know the behavior of
the program:
10
Enter the vector length: i am: 0 , and i have 5 vector elements
i am: 1 , and i have 5 vector elements
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
[0] 10.000000
So, that's the first bump with the openmpi.. Now , if i try to
use ib0
interfaces instead of eth0 ones, i get:
I'm actually surprised this worked in LAM/MPI, to be honest. There
should be an fflush() after the printf() to make sure that the output
is actually sent out of the application.
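
As a minimal sketch of that flush-after-prompt pattern, here is a small
helper in plain C with no MPI calls, so it compiles on its own. The
function name prompt_vector_length and its FILE * parameters are
hypothetical and only for illustration; broadcast.c is not shown in this
thread, so this is not its actual code.

```c
#include <stdio.h>

/* Hypothetical helper (not taken from broadcast.c): print a prompt,
 * flush it, then read one integer from the input stream.
 * Returns the value read, or -1 if no integer could be parsed. */
int prompt_vector_length(FILE *out, FILE *in)
{
    int n = -1;

    fprintf(out, "Enter the vector length: ");
    /* The prompt has no trailing newline, so a buffered stream may hold
     * it back. fflush() pushes it out before we block waiting for input. */
    fflush(out);

    if (fscanf(in, "%d", &n) != 1) {
        return -1;
    }
    return n;
}
```

In an MPI program the same idea would apply on the rank that prints the
prompt: call fflush(stdout) right after the printf(), since stdout may be
fully buffered when it is not attached to a terminal.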
$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
    --mca pls_rsh_agent ssh --mca btl openib -np 2 \
    --host 192.168.1.34,192.168.1.32 /path/to/hello_world
--------------------------------------------------------------------------
No available btl components were found!
This means that there are no components of this type installed on your
system or all the components reported that they could not be used.
This is a fatal error; your MPI process is likely to abort. Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system. You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--------------------------------------------------------------------------
[pe830-01.domain.com:05942]
I know, it thinks that it doesn't have openib components installed;
however, ompi_info on both machines says otherwise:
$ ompi_info | grep openib
MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0.2)
MCA btl: openib (MCA v1.0, API v1.0, Component v1.0.2)
I don't think it will help, but can you try again with --mca btl
openib,self? For some reason, it appears that the openib component
is saying that it can't run.
Now the questions are...
1 - In the case of using lam/mpi over ib0 interfaces: does lam/mpi
automatically just use IPoIB?
Yes, LAM has no idea what that Open IB thing is -- it just treats ib0
as another IP network device, the same way it would an ethernet device.
2 - Is there a tcpdump-like utility to dump the traffic on
Infiniband HCAs?
I'm not aware of any, but one may exist.
3 - In the case of Open MPI, does the --mca btl option have to be
passed every time? For example,
$ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
    --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
    --host 10.12.4.34,10.12.4.32 /path/to/hello_world
works just fine, but the same command without the "--mca btl tcp" bit
gives:
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
error ...
This makes it sound like Open IB is failing to set up properly. I'm a
bit out of my league on this one -- is there any application you
can run
4 - How come the behavior of broadcast.c was different on Open MPI
than it is on lam/mpi?
I think I answered this one already.
5 - Any ideas as to why I am getting the "no btl components" error
when I want to use openib, even though ompi_info shows it? If it helps
any further, I have the following openib modules:
This usually (but not always) indicates that something is going wrong
with initializing the hardware interface. ompi_info only tries to
load the module, but not initialize the network device.
Brian
--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/