First, I recommend you test 7 cases:
- one network only (3 cases)
- two networks only (3 cases)
- three networks (1 case)

and see when things hang
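
for example, to test one network at a time, something like this
(a sketch; btl_tcp_if_include restricts the tcp btl to the listed
subnets, so adjust the CIDRs to match your actual addressing):

mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.1.10.0/24 ...
mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.10.10.0/24 ...
mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.1.11.0/24 ...

and for the two-network cases, comma-separate the subnets, e.g.

mpirun --mca btl self,tcp --mca btl_tcp_if_include 10.1.10.0/24,10.10.10.0/24 ...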

you might also want to run
mpirun --mca oob_tcp_if_include 10.1.10.0/24 ...
to rule out a hang in the oob
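
for example, you can pin the oob to the GbE while exercising the other
networks via the btl (again a sketch, adjust the subnets as needed):

mpirun --mca oob_tcp_if_include 10.1.10.0/24 --mca btl_tcp_if_include 10.10.10.0/24 ...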

as usual, double-check that no firewall is running and that your hosts
can ping each other on each subnet
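
a quick sanity check would be something like this (assuming Linux hosts
with iptables; repeat the ping for each subnet):

ping -c 3 10.1.10.11
sudo iptables -L -n    # look for any REJECT/DROP rules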

Cheers,

Gilles

On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:

> Dear developers
>
> I have been observing this issue all along on the master branch, but had
> been brushing it off as something to do with my installation.
>
> Right now, I just downloaded a fresh checkout (via git pull), built and
> installed it (after deleting /usr/local/lib/openmpi/) and I can reproduce
> the hang 100% of the time.
>
> Description of the setup:
>
> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
> 2. Four network interfaces, 3 running IP:
>     Broadcom GbE (IP 10.1.10.X/24), BW 1 Gbps
>     Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>     QLogic InfiniBand (IP 10.1.11.X/24), BW 20 Gbps
>     LSI Logic Fibre Channel (not running IP; I don't think this matters)
>
> All of the NICs have their links up; they are in separate IP subnets,
> connected back to back.
>
> With this, the following command hangs:
> (The hostfile is:
> 10.10.10.10 slots=1
> 10.10.10.11 slots=1)
>
> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
> -mca pml ob1 ./mpitest
>
> with the following output:
>
> Hello world from processor smallMPI, rank 0 out of 2 processors
> Hello world from processor bigMPI, rank 1 out of 2 processors
> smallMPI sent haha!, rank 0
> bigMPI received haha!, rank 1
>
> The stack trace at rank 0 is:
>
> (gdb) bt
> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540, tv=0x7ffc065d01b0)
> at poll.c:165
> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop
> (base=0xddb540, flags=2) at event.c:1630
> #3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40
> <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at
> ../opal/threads/condition.h:76
> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2,
> requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16,
> source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>     at base/coll_base_barrier.c:63
> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs
> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
> base/coll_base_barrier.c:313
> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed
> (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
> coll_tuned_decision_fixed.c:196
> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280
> <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at
> mpitest.c:27
>
> and at rank 1 is:
>
> (gdb) bt
> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540,
> tv=0x7ffd73013710) at poll.c:165
> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop
> (base=0x1d54540, flags=2) at event.c:1630
> #3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40
> <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at
> ../opal/threads/condition.h:76
> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2,
> requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16,
> source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>     at base/coll_base_barrier.c:63
> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
> base/coll_base_barrier.c:313
> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
> coll_tuned_decision_fixed.c:196
> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280
> <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at
> mpitest.c:27
>
> The code for the test program is:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d
> processors\n", hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>     MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>     }
>     else
>     {
>     strcpy(buf, "haha!");
>     MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>     printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
> I have a strong feeling that there is an issue with this kind of
> multi-network setup. I'll be more than happy to run further tests if
> someone asks me to.
>
> Thank you
> Durga
>
