Dagnabbit... I was specifying ib, not openib. When I specified
openib, I got this error:

"
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
"

I can run it with openib,self locally, even with multiple processes
(-np greater than one). But once the other node is in the picture, I
get this error. Hmm, does the error message help to troubleshoot?
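
For reference, the invocation is along these lines (the same pattern as
the earlier commands in this thread, just with openib,self and the ib0
addresses; paths as before):

   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl openib,self -np 2 --host 192.168.1.34,192.168.1.32 /path/to/hello_world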

Thanks,
gurhan
On 5/11/06, Brian Barrett <brbar...@open-mpi.org> wrote:
On May 11, 2006, at 10:10 PM, Gurhan Ozen wrote:

> Brian,
> Thanks for the very clear answers.
>
> I did change my code to include fflush() calls after printf() ...
>
> And I did try with --mca btl ib,self. Interesting result: with --mca
> btl ib,self, hello_world works fine, but broadcast hangs after I
> enter the vector length.
>
> At any rate though, with --mca btl ib,self it looks like the traffic
> goes over the ethernet device. I couldn't find any documentation on the
> "self" argument of mca; does it mean to explore alternatives if the
> desired btl (in this case ib) doesn't work?

No, self is the loopback device, for sending messages to self.  It is
never used for message routing outside of the current process, but is
required for almost all transports, as send to self can be a sticky
issue.
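
To picture it, here is a minimal sketch (not your program) of the kind
of send-to-self traffic that only the self BTL handles, which is why it
should always appear in the list, e.g. --mca btl openib,self:

    /* sketch: every rank sends one int to itself */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, out, in;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        out = rank;
        /* combined send/recv to our own rank avoids any blocking-send
           deadlock concerns */
        MPI_Sendrecv(&out, 1, MPI_INT, rank, 0,
                     &in,  1, MPI_INT, rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d got %d from itself\n", rank, in);
        fflush(stdout);
        MPI_Finalize();
        return 0;
    }

Collectives can end up doing the same sort of thing internally, so self
is worth listing even if your code never explicitly sends to itself.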

You are specifying openib, not ib, as the argument to mpirun,
correct?  Either way, I'm not really sure how data could be going
over TCP -- the TCP transport would definitely be disabled in that
case.  At this point, I don't know enough about the Open IB driver to
be of help -- one of the other developers is going to have to jump in
and provide assistance.

> Speaking of documentation, it looks like Open MPI didn't come with a
> man page for mpirun. I thought I had seen in one of the slides of the
> Open MPI developers' workshop that it did have mpirun.1. Do I need to
> check it out from svn?

That's one option, or wait for us to release Open MPI 1.0.3 / 1.1.

Brian


> On 5/11/06, Brian Barrett <brbar...@open-mpi.org> wrote:
>> On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote:
>>
>>>  My ultimate goal is to get Open MPI working with the OpenIB stack.
>>>  First, I had installed lam-mpi; I know it doesn't have support for
>>>  OpenIB, but it's still relevant to some of the questions I will ask.
>>>  Here is the setup I have:
>>
>> Yes, keep in mind throughout that while Open MPI does support MVAPI,
>> LAM/MPI will fall back to using IP over IB for communication.
>>
>>>  I have two machines, pe830-01 and pe830-02. Both have an ethernet
>>>  interface and an HCA interface. The IP addresses follow:
>>>                          eth0                 ib0
>>>  pe830-01     10.12.4.32      192.168.1.32
>>>  pe830-02     10.12.4.34      192.168.1.34
>>>
>>>    So this has worked even though the lamhosts file is configured to
>>>    use ib0 interfaces. I further verified with the tcpdump command
>>>    that none of this went to eth0.
>>>
>>>    Anyhow, if I change the lamhosts file to use the eth0 IPs, things
>>>    work just the same with no issues. And in that case I see some
>>>    traffic on eth0 with tcpdump.
>>
>> Ok, so at least it sounds like your TCP network is sanely configured.
>>
>>>    Now, when I installed and used Open MPI, things didn't work as
>>>    easily. Here is what happens. After recompiling the sources with
>>>    the mpicc that comes with Open MPI:
>>>
>>>    $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32 /path/to/hello_world
>>>    Hello, world, I am 0 of 2 and this is on : pe830-02.
>>>    Hello, world, I am 1 of 2 and this is on: pe830-01.
>>>
>>>    So far so good; using eth0 interfaces, hello_world works just
>>>    fine. Now, when I try the broadcast program:
>>
>> In reality, you always need to include two BTLs when specifying.  You
>> need both the one you want to use (mvapi,openib,tcp,etc.) and
>> "self".  You can run into issues otherwise.
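>>
>> For example, the tcp runs quoted above would become something like
>> (same command, just with ",self" added):
>>
>>    $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl tcp,self -np 2 --host 10.12.4.34,10.12.4.32 /path/to/hello_world
>>
>> and the same pattern with openib,self when you want the HCAs.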
>>
>>>    $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32 /path/to/broadcast
>>>
>>>    It just hangs there; it doesn't prompt me with the "Enter the
>>>    vector length:" string. So I just enter a number anyway, since I
>>>    know the behavior of the program:
>>>
>>>    10
>>>    Enter the vector length: i am: 0 , and i have 5 vector elements
>>>    i am: 1 , and i have 5 vector elements
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>    [0] 10.000000
>>>
>>>    So, that's the first bump with Open MPI. Now, if I try to use ib0
>>>    interfaces instead of eth0 ones, I get:
>>
>> I'm actually surprised this worked in LAM/MPI, to be honest.  There
>> should be an fflush() after the printf() to make sure that the output
>> is actually sent out of the application.
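>>
>> Roughly, the pattern in question would be (a sketch, since broadcast.c
>> itself isn't shown here):
>>
>>    printf("Enter the vector length: ");
>>    fflush(stdout);   /* force the prompt out of the stdio buffer */
>>    scanf("%d", &len);
>>
>> Without the fflush(), stdio may hold the prompt in a buffer when stdout
>> is not a terminal (as it is under mpirun), so it shows up late or not
>> at all.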
>>
>>>    $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl openib -np 2 --host 192.168.1.34,192.168.1.32 /path/to/hello_world
>>>
>>> --------------------------------------------------------------------------
>>>    No available btl components were found!
>>>
>>>    This means that there are no components of this type installed on your
>>>    system or all the components reported that they could not be used.
>>>
>>>    This is a fatal error; your MPI process is likely to abort.  Check the
>>>    output of the "ompi_info" command and ensure that components of this
>>>    type are available on your system.  You may also wish to check the
>>>    value of the "component_path" MCA parameter and ensure that it has at
>>>    least one directory that contains valid MCA components.
>>> --------------------------------------------------------------------------
>>>    [pe830-01.domain.com:05942]
>>>
>>>    I know, it thinks that it doesn't have openib components installed;
>>>    however, ompi_info on both machines says otherwise:
>>>
>>>    $ ompi_info | grep openib
>>>    MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0.2)
>>>    MCA btl: openib (MCA v1.0, API v1.0, Component v1.0.2)
>>
>> I don't think it will help, but can you try again with --mca btl
>> openib,self?  For some reason, it appears that the openib component
>> is saying that it can't run.
>>
>>>    Now the questions are...
>>>    1 - In the case of using LAM/MPI over ib0 interfaces, does LAM/MPI
>>>    automatically just use IPoIB?
>>
>> Yes, LAM has no idea what that Open IB thing is -- it just uses the
>> ethernet device.
>>
>>>    2 - Is there a tcpdump-like utility to dump the traffic on
>>>    InfiniBand HCAs?
>>
>> I'm not aware of any, but one may exist.
>>
>>>    3 - In the case of Open MPI, does the --mca btl arg option have to
>>>    be passed every time? For example,
>>>
>>>    $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi --mca pls_rsh_agent ssh --mca btl tcp -np 2 --host 10.12.4.34,10.12.4.32 /path/to/hello_world
>>>
>>>    works just fine, but the same command without the "--mca btl tcp"
>>>    bit gives the:
>>>
>>>
>>> --------------------------------------------------------------------------
>>>    It looks like MPI_INIT failed for some reason; your parallel process is
>>>    likely to abort.  There are many reasons that a parallel process can
>>>    fail during MPI_INIT; some of which are due to configuration or environment
>>>    problems.  This failure appears to be an internal failure; here's some
>>>    additional information (which may only be relevant to an Open MPI
>>>    developer):
>>>
>>>      PML add procs failed
>>>      --> Returned value -2 instead of OMPI_SUCCESS
>>>
>>> --------------------------------------------------------------------------
>>>    *** An error occurred in MPI_Init
>>>    *** before MPI was initialized
>>>    *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>
>>>    error ...
>>
>> This makes it sound like Open IB is failing to set up properly.  I'm a
>> bit out of my league on this one -- is there any application you
>> can run
>>
>>>    4 - How come the behavior of broadcast.c was different on Open MPI
>>>    than it is on LAM/MPI?
>>
>> I think I answered this one already.
>>
>>>    5 - Any ideas as to why I am getting the "no btl components" error
>>>    when I want to use openib even though ompi_info shows it? If it
>>>    helps any further, I have the following openib modules:
>>
>> This usually (but not always) indicates that something is going wrong
>> with initializing the hardware interface.  ompi_info only tries to
>> load the module, but not initialize the network device.
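>>
>> If you want to check whether the HCA itself can be opened, independent
>> of Open MPI, a small libibverbs test along these lines should do it
>> (my sketch, assuming libibverbs is installed; compile with -libverbs):
>>
>>    /* list the IB devices and try to open the first one -- roughly
>>       what the openib BTL has to do at MPI_Init time */
>>    #include <stdio.h>
>>    #include <infiniband/verbs.h>
>>
>>    int main(void)
>>    {
>>        int num = 0;
>>        struct ibv_device **devs = ibv_get_device_list(&num);
>>        if (devs == NULL || num == 0) {
>>            fprintf(stderr, "no IB devices found\n");
>>            return 1;
>>        }
>>        printf("found %d device(s); first is %s\n",
>>               num, ibv_get_device_name(devs[0]));
>>        struct ibv_context *ctx = ibv_open_device(devs[0]);
>>        if (ctx == NULL) {
>>            fprintf(stderr, "ibv_open_device failed\n");
>>            ibv_free_device_list(devs);
>>            return 1;
>>        }
>>        printf("device opened OK\n");
>>        ibv_close_device(ctx);
>>        ibv_free_device_list(devs);
>>        return 0;
>>    }
>>
>> If that fails on either node, the problem is in the OpenIB stack rather
>> than in Open MPI.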
>>
>>
>> Brian
>>
>> --
>>    Brian Barrett
>>    Open MPI developer
>>    http://www.open-mpi.org/
>>
>>

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

