No, I used IP addresses in all my tests. What I found was that if I used
the IP address of the Broadcom NIC in the hostfile and restricted OMPI to
that network exclusively (via btl_tcp_if_include), the mpirun command hung
silently. If I used the IP address of another NIC in the hostfile but still
restricted btl_tcp_if_include to the Broadcom network exclusively, mpirun
crashed, saying the remote process is unreachable. If I used either of the
other two networks exclusively (with any of their IP addresses in the
hostfile), everything worked fine.
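
To make that concrete, the commands I ran were along these lines (just a
sketch; the value given to btl_tcp_if_include, whether an address or a
subnet, and the IPs in the hostfile varied from test to test, with
10.1.10.0/24 standing in for the Broadcom network here):

mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 \
       -mca btl_tcp_if_include 10.1.10.0/24 ./mpitest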

Since TCP itself does not care what the underlying NIC is, this is most
likely some kind of firewall issue, as you guessed (I did disable the
firewall, but there could be other related issues). In any case, I believe
it has nothing to do with OMPI. One thing that is different about the
Broadcom NIC is that it is connected to the WAN side and thus gets its IP
via DHCP, whereas the rest have static IPs. I don't see why that would make
a difference, but it is possible that CentOS is enforcing some kind of
security policy that I am not aware of.
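
In case it helps, the things I still plan to double-check on both nodes are
along these lines (just the usual suspects I can think of; there may well
be others):

systemctl status firewalld           # confirm firewalld really is stopped
iptables -L -n                       # look for any leftover filter rules
sysctl net.ipv4.conf.all.rp_filter   # strict rp_filter can drop traffic on multi-homed hosts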

Thank you for your feedback.

Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Sat, May 14, 2016 at 1:13 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> iirc, ompi internally uses networks and not interface names.
> what did you use in your tests ?
> can you try with networks ?
>
> Cheers,
>
> Gilles
>
> On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>
>> Hello Gilles
>>
>> Thanks for your prompt follow-up. It looks like this issue is somehow
>> specific to the Broadcom NIC. If I take it out, the rest of them work in
>> any combination. On further investigation, I found that the name that
>> 'ifconfig' shows for this interface is different from what it is called in
>> the internal scripts. That could be a bug in CentOS, but at least it does
>> not look like an Open MPI issue.
>>
>> Sorry for raising the false alarm.
>>
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Sat, May 14, 2016 at 12:02 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> at first I recommend you test 7 cases
>>> - one network only (3 cases)
>>> - two networks only (3 cases)
>>> - three networks (1 case)
>>>
>>> and see when things hang
>>>
>>> you might also want to
>>> mpirun --mca oob_tcp_if_include 10.1.10.0/24 ...
>>> to ensure no hang will happen in oob
>>>
>>> as usual, double check no firewall is running, and your hosts can ping
>>> each other
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>>>
>>>> Dear developers
>>>>
>>>> I have been observing this issue all along on the master branch, but
>>>> have been brushing it off as something to do with my installation.
>>>>
>>>> Right now, I just downloaded a fresh checkout (via git pull), built and
>>>> installed it (after deleting /usr/local/lib/openmpi/) and I can reproduce
>>>> the hang 100% of the time.
>>>>
>>>> Description of the setup:
>>>>
>>>> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
>>>> 2. Four network interfaces, 3 of them running IP:
>>>>     Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
>>>>     Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>>>>     QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
>>>>     LSI Logic Fibre Channel (not running IP; I don't think this matters)
>>>>
>>>> All of the NICs have their links up, and all of them are in separate IP
>>>> subnets, connected back to back.
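>>>>
>>>> (For completeness, the per-interface addresses and routes can be
>>>> double-checked with 'ip -4 addr show' and 'ip route' on each node; I am
>>>> omitting that output here for brevity.)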
>>>>
>>>> With this, the following command hangs:
>>>> The hostfile is:
>>>>
>>>> 10.10.10.10 slots=1
>>>> 10.10.10.11 slots=1
>>>>
>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl
>>>> self,tcp -mca pml ob1 ./mpitest
>>>>
>>>> with the following output:
>>>>
>>>> Hello world from processor smallMPI, rank 0 out of 2 processors
>>>> Hello world from processor bigMPI, rank 1 out of 2 processors
>>>> smallMPI sent haha!, rank 0
>>>> bigMPI received haha!, rank 1
>>>>
>>>> The stack trace at rank 0 is:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
>>>> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540,
>>>> tv=0x7ffc065d01b0) at poll.c:165
>>>> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop
>>>> (base=0xddb540, flags=2) at event.c:1630
>>>> #3  0x00007f9cb7851e74 in opal_progress () at
>>>> runtime/opal_progress.c:171
>>>> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40
>>>> <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at
>>>> ../opal/threads/condition.h:76
>>>> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2,
>>>> requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
>>>> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1,
>>>> stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>>>>     at base/coll_base_barrier.c:63
>>>> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs
>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
>>>> base/coll_base_barrier.c:313
>>>> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed
>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
>>>> coll_tuned_decision_fixed.c:196
>>>> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280
>>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>>> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at
>>>> mpitest.c:27
>>>>
>>>> and at rank 1 is:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
>>>> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540,
>>>> tv=0x7ffd73013710) at poll.c:165
>>>> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop
>>>> (base=0x1d54540, flags=2) at event.c:1630
>>>> #3  0x00007f1101287e74 in opal_progress () at
>>>> runtime/opal_progress.c:171
>>>> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40
>>>> <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at
>>>> ../opal/threads/condition.h:76
>>>> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2,
>>>> requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
>>>> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0,
>>>> stag=-16, source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>>>>     at base/coll_base_barrier.c:63
>>>> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs
>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
>>>> base/coll_base_barrier.c:313
>>>> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed
>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
>>>> coll_tuned_decision_fixed.c:196
>>>> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280
>>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>>> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at
>>>> mpitest.c:27
>>>>
>>>> The code for the test program is:
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <string.h>
>>>> #include <stdlib.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     int world_size, world_rank, name_len;
>>>>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>>>     MPI_Get_processor_name(hostname, &name_len);
>>>>     printf("Hello world from processor %s, rank %d out of %d processors\n", hostname, world_rank, world_size);
>>>>     if (world_rank == 1)
>>>>     {
>>>>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>>>>     }
>>>>     else
>>>>     {
>>>>         strcpy(buf, "haha!");
>>>>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>>>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>>>>     }
>>>>     MPI_Barrier(MPI_COMM_WORLD);   /* mpitest.c:27 -- where both ranks hang */
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
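>>>>
>>>> (Nothing special about the build, by the way; it is just the equivalent
>>>> of 'mpicc -o mpitest mpitest.c' against the freshly installed library.)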
>>>>
>>>> I have a strong feeling that there is a real issue with this kind of
>>>> multi-homed, multi-NIC setup. I'll be more than happy to run further
>>>> tests if someone asks me to.
>>>>
>>>> Thank you
>>>> Durga
>>>>
>>>> The surgeon general advises you to eat right, exercise regularly and
>>>> quit ageing.
>>>>
>>>
>>
>>
>
