IIRC, OMPI internally uses networks and not interface names. What did you use in your tests? Can you try with networks?
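For instance, something along these lines (an untested sketch; both btl_tcp_if_include and oob_tcp_if_include accept either interface names or a.b.c.d/x networks, the CIDR below is simply the Chelsio subnet from your hostfile) would pin both the TCP BTL and the OOB to one network:

mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 \
       -mca btl_tcp_if_include 10.10.10.0/24 \
       -mca oob_tcp_if_include 10.10.10.0/24 ./mpitest

If that behaves differently from the interface-name form, it would help narrow down where the hang comes from.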
Cheers,

Gilles

On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:

> Hello Gilles
>
> Thanks for your prompt follow-up. It looks like this issue is somehow
> specific to the Broadcom NIC. If I take it out, the rest of them work in
> any combination. On further investigation, I found that the name that
> 'ifconfig' shows for this interface is different from what it is named in
> internal scripts. Could be a bug in CentOS, but at least it does not look
> like an OpenMPI issue.
>
> Sorry for raising the false alarm.
>
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Sat, May 14, 2016 at 12:02 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
>> at first I recommend you test 7 cases
>> - one network only (3 cases)
>> - two networks only (3 cases)
>> - three networks (1 case)
>>
>> and see when things hang
>>
>> you might also want to
>> mpirun --mca oob_tcp_if_include 10.1.10.0/24 ...
>> to ensure no hang will happen in oob
>>
>> as usual, double check no firewall is running, and your hosts can ping
>> each other
>>
>> Cheers,
>>
>> Gilles
>>
>> On Saturday, May 14, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>>
>>> Dear developers
>>>
>>> I have been observing this issue all along on the master branch, but
>>> have been brushing it off as something to do with my installation.
>>>
>>> Right now, I just downloaded a fresh checkout (via git pull), built and
>>> installed it (after deleting /usr/local/lib/openmpi/), and I can
>>> reproduce the hang 100% of the time.
>>>
>>> Description of the setup:
>>>
>>> 1. Two x86_64 boxes (dual Xeons, 6 cores each)
>>> 2. Four network interfaces, 3 running IP:
>>>    Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
>>>    Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
>>>    QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
>>>    LSI Logic Fibre Channel (not running IP; I don't think this matters)
>>>
>>> All of the NICs have their link UP. All the NICs are in separate IP
>>> subnets, connected back to back.
>>>
>>> With this, the following command hangs. (The hostfile is this:
>>> 10.10.10.10 slots=1
>>> 10.10.10.11 slots=1 )
>>>
>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 ./mpitest
>>>
>>> with the following output:
>>>
>>> Hello world from processor smallMPI, rank 0 out of 2 processors
>>> Hello world from processor bigMPI, rank 1 out of 2 processors
>>> smallMPI sent haha!, rank 0
>>> bigMPI received haha!, rank 1
>>>
>>> The stack trace at rank 0 is:
>>>
>>> (gdb) bt
>>> #0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
>>> #1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540, tv=0x7ffc065d01b0) at poll.c:165
>>> #2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop (base=0xddb540, flags=2) at event.c:1630
>>> #3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
>>> #4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40 <ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
>>> #5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2, requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
>>> #6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
>>> #7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at base/coll_base_barrier.c:313
>>> #8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at coll_tuned_decision_fixed.c:196
>>> #9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280 <ompi_mpi_comm_world>) at pbarrier.c:63
>>> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at mpitest.c:27
>>>
>>> and at rank 1 is:
>>>
>>> (gdb) bt
>>> #0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
>>> #1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540, tv=0x7ffd73013710) at poll.c:165
>>> #2  0x00007f1101363180 in opal_libevent2022_event_base_loop (base=0x1d54540, flags=2) at event.c:1630
>>> #3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
>>> #4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40 <ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
>>> #5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2, requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
>>> #6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
>>> #7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at base/coll_base_barrier.c:313
>>> #8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at coll_tuned_decision_fixed.c:196
>>> #9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280 <ompi_mpi_comm_world>) at pbarrier.c:63
>>> #10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at mpitest.c:27
>>>
>>> The code for the test program is:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     int world_size, world_rank, name_len;
>>>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>>     MPI_Get_processor_name(hostname, &name_len);
>>>     printf("Hello world from processor %s, rank %d out of %d processors\n", hostname, world_rank, world_size);
>>>     if (world_rank == 1)
>>>     {
>>>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>>>     }
>>>     else
>>>     {
>>>         strcpy(buf, "haha!");
>>>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>>>     }
>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> I have a strong feeling that there is an issue in this kind of situation.
>>> I'll be more than happy to run further tests if someone asks me to.
>>>
>>> Thank you
>>> Durga
>>>
>>> The surgeon general advises you to eat right, exercise regularly and
>>> quit ageing.
>>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29196.php