IIRC, Open MPI internally uses networks (subnets) and not interface names.
What did you use in your tests? Can you try with networks?
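
For example (just a sketch, using the three subnets you listed), specifying networks means giving CIDR blocks instead of interface names:

mpirun --mca btl_tcp_if_include 10.1.10.0/24,10.10.10.0/24,10.1.11.0/24 -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 ./mpitest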

Cheers,

Gilles

On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:

> Hello Gilles
>
> Thanks for your prompt follow-up. It looks like this issue is somehow
> specific to the Broadcom NIC. If I take it out, the rest of them work in
> any combination. On further investigation, I found that the name that
> 'ifconfig' shows for this interface is different from what it is named in
> internal scripts. Could be a bug in CentOS, but at least it does not look
> like an Open MPI issue.
>
> Sorry for raising a false alarm.
>
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Sat, May 14, 2016 at 12:02 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> at first I recommend you test 7 cases
>> - one network only (3 cases)
>> - two networks only (3 cases)
>> - three networks (1 case)
>>
>> and see when things hang
>>
>> you might also want to run
>> mpirun --mca oob_tcp_if_include 10.1.10.0/24 ...
>> to ensure no hang happens in the oob
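>>
>> for example (just a sketch, based on the subnets in your description), a
>> one-network-only run over the Chelsio network combined with the oob setting
>> above could look like:
>> mpirun --mca oob_tcp_if_include 10.1.10.0/24 --mca btl_tcp_if_include 10.10.10.0/24 -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 ./mpitest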
>>
>> as usual, double check that no firewall is running and that your hosts can
>> ping each other
>>
>> Cheers,
>>
>> Gilles
>>
>> On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>>
>>> Dear developers
>>>
>>> I have been observing this issue all along on the master branch, but
>>> have been brushing it off as something to do with my installation.
>>>
>>> Right now, I just downloaded a fresh checkout (via git pull), built and
>>> installed it (after deleting /usr/local/lib/openmpi/), and I can reproduce
>>> the hang 100% of the time.
>>>
>>> Description of the setup:
>>>
>>> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
>>> 2. Four network interfaces, 3 running IP:
>>>     Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
>>>     Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>>>     QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
>>>     LSI Logic Fibre Channel (not running IP; I don't think this matters)
>>>
>>> All of the NICs have their link UP, are in separate IP subnets, and are
>>> connected back to back.
>>>
>>> With this, the following command hangs:
>>> (The hostfile is:
>>> 10.10.10.10 slots=1
>>> 10.10.10.11 slots=1)
>>>
>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
>>> -mca pml ob1 ./mpitest
>>>
>>> with the following output:
>>>
>>> Hello world from processor smallMPI, rank 0 out of 2 processors
>>> Hello world from processor bigMPI, rank 1 out of 2 processors
>>> smallMPI sent haha!, rank 0
>>> bigMPI received haha!, rank 1
>>>
>>> The stack trace at rank 0 is:
>>>
>>> (gdb) bt
>>> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
>>> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540,
>>> tv=0x7ffc065d01b0) at poll.c:165
>>> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop
>>> (base=0xddb540, flags=2) at event.c:1630
>>> #3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
>>> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40
>>> <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at
>>> ../opal/threads/condition.h:76
>>> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2,
>>> requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
>>> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1,
>>> stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>>>     at base/coll_base_barrier.c:63
>>> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs
>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
>>> base/coll_base_barrier.c:313
>>> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed
>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
>>> coll_tuned_decision_fixed.c:196
>>> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280
>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at
>>> mpitest.c:27
>>>
>>> and at rank 1 is:
>>>
>>> (gdb) bt
>>> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
>>> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540,
>>> tv=0x7ffd73013710) at poll.c:165
>>> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop
>>> (base=0x1d54540, flags=2) at event.c:1630
>>> #3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
>>> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40
>>> <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at
>>> ../opal/threads/condition.h:76
>>> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2,
>>> requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
>>> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0,
>>> stag=-16, source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>>>     at base/coll_base_barrier.c:63
>>> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs
>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
>>> base/coll_base_barrier.c:313
>>> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed
>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
>>> coll_tuned_decision_fixed.c:196
>>> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280
>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at
>>> mpitest.c:27
>>>
>>> The code for the test program is:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     int world_size, world_rank, name_len;
>>>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>>     MPI_Get_processor_name(hostname, &name_len);
>>>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>>            hostname, world_rank, world_size);
>>>     if (world_rank == 1)
>>>     {
>>>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>>>     }
>>>     else
>>>     {
>>>         strcpy(buf, "haha!");
>>>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>>>     }
>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> I have a strong feeling that there is an issue in this kind of multi-NIC,
>>> multi-subnet setup. I'll be more than happy to run further tests if someone
>>> asks me to.
>>>
>>> Thank you
>>> Durga
>>>
>>> The surgeon general advises you to eat right, exercise regularly and
>>> quit ageing.
>>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/05/29196.php
>>
>
>
