As a first step, I recommend you test 7 cases:
 - one network only (3 cases)
 - two networks only (3 cases)
 - all three networks (1 case)
and see which combinations hang (some example commands are sketched below).
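For example, you can use the btl_tcp_if_include MCA parameter to limit the TCP BTL to specific subnets. Roughly something like this (untested sketch; the subnets are taken from your description, so adjust them to whatever your interfaces actually use):

    # GbE only (one of the three single-network cases)
    mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp \
        -mca btl_tcp_if_include 10.1.10.0/24 ./mpitest

    # GbE + iWARP (one of the three two-network cases)
    mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp \
        -mca btl_tcp_if_include 10.1.10.0/24,10.10.10.0/24 ./mpitest

    # all three networks
    mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp \
        -mca btl_tcp_if_include 10.1.10.0/24,10.10.10.0/24,10.1.11.0/24 ./mpitest

Repeating this with each single subnet and each pair of subnets covers all 7 cases.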
You might also want to run with

    mpirun --mca oob_tcp_if_include 10.1.10.0/24 ...

to ensure no hang happens in the OOB layer. As usual, double-check that no firewall is running and that your hosts can ping each other.

Cheers,

Gilles

On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
> Dear developers
>
> I have been observing this issue all along on the master branch, but have
> been brushing it off as something to do with my installation.
>
> Right now, I just downloaded a fresh checkout (via git pull), built and
> installed it (after deleting /usr/local/lib/openmpi/), and I can reproduce
> the hang 100% of the time.
>
> Description of the setup:
>
> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
> 2. Four network interfaces, 3 running IP:
>    Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
>    Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>    QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
>    LSI Logic Fibre Channel (not running IP; I don't think this matters)
>
> All of the NICs have their link UP. All the NICs are in separate IP
> subnets, connected back to back.
>
> With this, the following command hangs. The hostfile is:
>
> 10.10.10.10 slots=1
> 10.10.10.11 slots=1
>
> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp \
>     -mca pml ob1 ./mpitest
>
> with the following output:
>
> Hello world from processor smallMPI, rank 0 out of 2 processors
> Hello world from processor bigMPI, rank 1 out of 2 processors
> smallMPI sent haha!, rank 0
> bigMPI received haha!, rank 1
>
> The stack trace at rank 0 is:
>
> (gdb) bt
> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540, tv=0x7ffc065d01b0) at poll.c:165
> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop (base=0xddb540, flags=2) at event.c:1630
> #3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40 <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2, requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at base/coll_base_barrier.c:313
> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at coll_tuned_decision_fixed.c:196
> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280 <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at mpitest.c:27
>
> and at rank 1 is:
>
> (gdb) bt
> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540, tv=0x7ffd73013710) at poll.c:165
> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop (base=0x1d54540, flags=2) at event.c:1630
> #3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40 <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2, requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at base/coll_base_barrier.c:313
> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at coll_tuned_decision_fixed.c:196
> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280 <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at mpitest.c:27
>
> The code for the test program is:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>     }
>     else
>     {
>         strcpy(buf, "haha!");
>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
> I have a strong feeling that there is an issue in this kind of situation.
> I'll be more than happy to run further tests if someone asks me to.
>
> Thank you
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.