Dear developers,

I have been observing this issue on the master branch for a while, but had
been brushing it off as something to do with my installation.

Just now, I downloaded a fresh checkout (via git pull), built and
installed it (after deleting /usr/local/lib/openmpi/), and I can reproduce
the hang 100% of the time.

Description of the setup:

1. Two x86_64 boxes (dual Xeons, 6 cores each)
2. Four network interfaces, three of them running IP:
    Broadcom GbE (IP 10.01.10.X/24), BW 1 Gbps
    Chelsio iWARP (IP 10.10.10.X/24), BW 10 Gbps
    QLogic InfiniBand (IP 10.01.11.X/24), BW 20 Gbps
    LSI Logic Fibre Channel (not running IP; I don't think this matters)

All of the NICs have their links up. All the NICs are in separate IP
subnets, connected back to back.

The hostfile is:

10.10.10.10 slots=1
10.10.10.11 slots=1

With this, the following command hangs:

[durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
-mca pml ob1 ./mpitest

with the following output:

Hello world from processor smallMPI, rank 0 out of 2 processors
Hello world from processor bigMPI, rank 1 out of 2 processors
smallMPI sent haha!, rank 0
bigMPI received haha!, rank 1

The stack trace at rank 0 is:

(gdb) bt
#0  0x00007f9cb844769d in poll () from /lib64/libc.so.6
#1  0x00007f9cb79354d6 in poll_dispatch (base=0xddb540, tv=0x7ffc065d01b0)
at poll.c:165
#2  0x00007f9cb792d180 in opal_libevent2022_event_base_loop (base=0xddb540,
flags=2) at event.c:1630
#3  0x00007f9cb7851e74 in opal_progress () at runtime/opal_progress.c:171
#4  0x00007f9cb89bc47d in opal_condition_wait (c=0x7f9cb8f37c40
<ompi_request_cond>, m=0x7f9cb8f37bc0 <ompi_request_lock>) at
../opal/threads/condition.h:76
#5  0x00007f9cb89bcadf in ompi_request_default_wait_all (count=2,
requests=0x7ffc065d0360, statuses=0x7ffc065d0330) at request/req_wait.c:287
#6  0x00007f9cb8a95469 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16,
source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
    at base/coll_base_barrier.c:63
#7  0x00007f9cb8a95b86 in ompi_coll_base_barrier_intra_two_procs
(comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
base/coll_base_barrier.c:313
#8  0x00007f9cb8ac6d1c in ompi_coll_tuned_barrier_intra_dec_fixed
(comm=0x601280 <ompi_mpi_comm_world>, module=0xeb4a00) at
coll_tuned_decision_fixed.c:196
#9  0x00007f9cb89dc689 in PMPI_Barrier (comm=0x601280
<ompi_mpi_comm_world>) at pbarrier.c:63
#10 0x0000000000400b11 in main (argc=1, argv=0x7ffc065d0648) at mpitest.c:27

and at rank 1 is:

(gdb) bt
#0  0x00007f1101e7d69d in poll () from /lib64/libc.so.6
#1  0x00007f110136b4d6 in poll_dispatch (base=0x1d54540, tv=0x7ffd73013710)
at poll.c:165
#2  0x00007f1101363180 in opal_libevent2022_event_base_loop
(base=0x1d54540, flags=2) at event.c:1630
#3  0x00007f1101287e74 in opal_progress () at runtime/opal_progress.c:171
#4  0x00007f11023f247d in opal_condition_wait (c=0x7f110296dc40
<ompi_request_cond>, m=0x7f110296dbc0 <ompi_request_lock>) at
../opal/threads/condition.h:76
#5  0x00007f11023f2adf in ompi_request_default_wait_all (count=2,
requests=0x7ffd730138c0, statuses=0x7ffd73013890) at request/req_wait.c:287
#6  0x00007f11024cb469 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16,
source=0, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
    at base/coll_base_barrier.c:63
#7  0x00007f11024cbb86 in ompi_coll_base_barrier_intra_two_procs
(comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
base/coll_base_barrier.c:313
#8  0x00007f11024cde3c in ompi_coll_tuned_barrier_intra_dec_fixed
(comm=0x601280 <ompi_mpi_comm_world>, module=0x1e2ebc0) at
coll_tuned_decision_fixed.c:196
#9  0x00007f1102412689 in PMPI_Barrier (comm=0x601280
<ompi_mpi_comm_world>) at pbarrier.c:63
#10 0x0000000000400b11 in main (argc=1, argv=0x7ffd73013ba8) at mpitest.c:27

The code for the test program is:

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int world_size, world_rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           hostname, world_rank, world_size);
    if (world_rank == 1) {
        /* Rank 1 receives the 6-byte message from rank 0. */
        MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s received %s, rank %d\n", hostname, buf, world_rank);
    } else {
        /* Rank 0 sends a 6-byte message to rank 1. */
        strcpy(buf, "haha!");
        MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
    }
    MPI_Barrier(MPI_COMM_WORLD);   /* both ranks hang here */
    MPI_Finalize();
    return 0;
}

I have a strong feeling that the hang is related to running with multiple
active NICs on separate IP subnets: the point-to-point send/receive
completes, but both ranks then block inside MPI_Barrier. I'll be more than
happy to run further tests if someone asks me to.
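For example, one test I could run (assuming the problem is interface
selection across the multiple subnets) would be to restrict the TCP BTL to
a single subnet with the btl_tcp_if_include MCA parameter, which accepts
interface names or CIDR subnets:

```shell
# Same command as above, but limit the TCP BTL to the Chelsio
# iWARP subnet only (10.10.10.0/24), so only one NIC is used:
mpirun -np 2 -hostfile ~/hostfile \
       -mca btl self,tcp \
       -mca pml ob1 \
       -mca btl_tcp_if_include 10.10.10.0/24 \
       ./mpitest
```

If that makes the hang go away, it would point at the TCP BTL's handling
of multiple back-to-back subnets rather than at the test program itself.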

Thank you
Durga
