An update on this issue:

If I restrict Open MPI to only the IP interface that matches the address
specified in the hostfile, the program terminates successfully; i.e., the
following command works:

mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 -mca
btl_tcp_if_include enp35s0 ./mpitest

(enp35s0 is in the 10.10.10.X network)
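
For what it's worth, I believe the same restriction can also be expressed
as a CIDR subnet instead of an interface name (assuming the Chelsio NICs
are the only interfaces on 10.10.10.0/24), which may be more convenient if
the interface names differ between the hosts:

mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 -mca
btl_tcp_if_include 10.10.10.0/24 ./mpitest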

If the way I am running this is incorrect, please let me know.

An unrelated issue: a pull from github.com seems unusually slow (a simple
'Already up-to-date' message takes minutes to complete). Is anyone else
experiencing the same?

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Fri, May 13, 2016 at 11:24 PM, dpchoudh . <dpcho...@gmail.com> wrote:

> Dear developers
>
> I have been observing this issue all along on the master branch, but have
> been brushing it off as something to do with my installation.
>
> Right now, I just downloaded a fresh checkout (via git pull), built and
> installed it (after deleting /usr/local/lib/openmpi/) and I can reproduce
> the hang 100% of the time.
>
> Description of the setup:
>
> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
> 2. Four network interfaces, three of them running IP:
>     Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
>     Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>     QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
>     LSI Logic Fibre Channel (not running IP; I don't think this matters)
>
> All of the NICs have their link UP. All the NICs are in separate IP
> subnets, connected back to back.
>
> With this setup, the following command hangs. The hostfile is:
>
> 10.10.10.10 slots=1
> 10.10.10.11 slots=1
>
> and the command is:
>
> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
> -mca pml ob1 ./mpitest
>
> It produces the following output before hanging:
>
> Hello world from processor smallMPI, rank 0 out of 2 processors
> Hello world from processor bigMPI, rank 1 out of 2 processors
> smallMPI sent haha!, rank 0
> bigMPI received haha!, rank 1
>
> The stack trace at rank 0 is:
>
> (gdb) bt
> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540, tv=0x7ffc065d01b0)
> at poll.c:165
> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop
> (base=0xddb540, flags=2) at event.c:1630
> #3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40
> <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at
> ../opal/threads/condition.h:76
> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2,
> requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16,
> source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>     at base/coll_base_barrier.c:63
> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs
> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
> base/coll_base_barrier.c:313
> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed
> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
> coll_tuned_decision_fixed.c:196
> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280
> <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at
> mpitest.c:27
>
> and at rank 1 is:
>
> (gdb) bt
> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540,
> tv=0x7ffd73013710) at poll.c:165
> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop
> (base=0x1d54540, flags=2) at event.c:1630
> #3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40
> <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at
> ../opal/threads/condition.h:76
> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2,
> requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16,
> source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>     at base/coll_base_barrier.c:63
> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
> base/coll_base_barrier.c:313
> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
> coll_tuned_decision_fixed.c:196
> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280
> <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at
> mpitest.c:27
>
> The code for the test program is:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d
> processors\n", hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>     MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>     }
>     else
>     {
>     strcpy(buf, "haha!");
>     MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>     printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
> I have a strong feeling that there is an issue with this kind of
> multi-NIC, multi-subnet configuration. I'll be more than happy to run
> further tests if someone asks me to.
>
> Thank you
> Durga
>
