Hello Gilles and all
I am sorry to be bugging the developers, but this issue seems to be nagging
me, and I am surprised it does not seem to affect anybody else. But then
again, I am using the master branch, and most users are probably using a
released version.
This time I am using a totally different cluster. It has NO verbs-capable
interface: just two Ethernet ports (one of which has no IP address and hence
is unusable) plus one proprietary interface that currently supports only IP
traffic. The two usable IP interfaces (Ethernet and proprietary) are on
different IP subnets.
My test program is as follows:
#include <stdio.h>
#include <string.h>
#include "mpi.h"
int main(int argc, char *argv[])
{
    char host[128];
    int n;
    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(host, &n);
    printf("Hello from %s\n", host);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    printf("The world has %d nodes\n", n);
    MPI_Comm_rank(MPI_COMM_WORLD, &n);
    printf("My rank is %d\n", n);
//#if 0
    if (n == 0)
    {
        strcpy(host, "ha!");
        MPI_Send(host, strlen(host) + 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
        printf("sent %s\n", host);
    }
    else
    {
        //int len = strlen(host) + 1;
        bzero(host, 128);
        MPI_Recv(host, 4, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Received %s from rank 0\n", host);
    }
//#endif
    MPI_Finalize();
    return 0;
}
This program, when run between two nodes, hangs. The command was:
[durga@b-1 ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml
ob1 -mca btl_tcp_if_include eno1 ./mpitest
The hang occurs after the following output (eno1 is one of the GigE
interfaces, and it also carries the OOB traffic):
Hello from b-1
The world has 2 nodes
My rank is 0
Hello from b-2
The world has 2 nodes
My rank is 1
Note that if I uncomment the #if 0 / #endif pair (i.e. comment out the
MPI_Send()/MPI_Recv() part), the program runs to completion. Also note that
the printf()s following MPI_Send()/MPI_Recv() do not show up on the console.
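In case it helps narrow this down, here is a nonblocking variant of the same
exchange that I could try next. This is only a sketch along the lines of the
program above (I have not run it); it should tell whether the hang is specific
to blocking MPI_Send()/MPI_Recv() or also happens with MPI_Isend()/MPI_Irecv()
plus MPI_Wait():

#include <stdio.h>
#include "mpi.h"

/* Intended to be run with -np 2, using the same mpirun options as above. */
int main(int argc, char *argv[])
{
    char buf[8] = "ha!";   /* rank 0 sends this; rank 1 overwrites it */
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Isend(buf, 4, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &req);
    else
        MPI_Irecv(buf, 4, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("rank %d done: %s\n", rank, buf);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}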
Upon attaching gdb, the stack trace from the master node is as follows:
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-78.el7.x86_64 libpciaccess-0.13.4-2.el7.x86_64
(gdb) bt
#0 0x00007f72a533eb7d in poll () from /lib64/libc.so.6
#1 0x00007f72a4cb7146 in poll_dispatch (base=0xee33d0, tv=0x7fff81057b70)
at poll.c:165
#2 0x00007f72a4caede0 in opal_libevent2022_event_base_loop (base=0xee33d0,
flags=2) at event.c:1630
#3 0x00007f72a4c4e692 in opal_progress () at runtime/opal_progress.c:171
#4 0x00007f72a0d07ac1 in opal_condition_wait (
c=0x7f72a5bb1e00 <ompi_request_cond>, m=0x7f72a5bb1d80
<ompi_request_lock>)
at ../../../../opal/threads/condition.h:76
#5 0x00007f72a0d07ca2 in ompi_request_wait_completion (req=0x113eb80)
at ../../../../ompi/request/request.h:383
#6 0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0, count=4,
datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280
<ompi_mpi_comm_world>)
at pml_ob1_isend.c:251
#7 0x00007f72a58d6be3 in PMPI_Send (buf=0x7fff81057db0, count=4,
type=0x601080 <ompi_mpi_char>, dest=1, tag=1,
comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78
#8 0x0000000000400afa in main (argc=1, argv=0x7fff81057f18) at mpitest.c:19
(gdb)
And the backtrace on the non-master node is:
(gdb) bt
#0 0x00007ff3b377e48d in nanosleep () from /lib64/libc.so.6
#1 0x00007ff3b37af014 in usleep () from /lib64/libc.so.6
#2 0x00007ff3b0c922de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0, nprocs=0,
info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100
#3 0x00007ff3b0c5f1a6 in pmix120_fence (procs=0x0, collect_data=0)
at pmix120_client.c:258
#4 0x00007ff3b3cf8f4b in ompi_mpi_finalize ()
at runtime/ompi_mpi_finalize.c:242
#5 0x00007ff3b3d23295 in PMPI_Finalize () at pfinalize.c:47
#6 0x0000000000400958 in main (argc=1, argv=0x7fff785e8788) at mpitest.c:30
(gdb)
The hostfile is as follows:
[durga@b-1 ~]$ cat hostfile
10.4.70.10 slots=1
10.4.70.11 slots=1
#10.4.70.12 slots=1
And the ifconfig output from the master node is as follows (the other node
is similar; all the IP interfaces are in their respective subnets):
[durga@b-1 ~]$ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.4.70.10 netmask 255.255.255.0 broadcast 10.4.70.255
inet6 fe80::21e:c9ff:fefe:13df prefixlen 64 scopeid 0x20<link>
ether 00:1e:c9:fe:13:df txqueuelen 1000 (Ethernet)
RX packets 48215 bytes 27842846 (26.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 52746 bytes 7817568 (7.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16
eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 00:1e:c9:fe:13:e0 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 17
lf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2016
inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::3002:ff:fe33:3333 prefixlen 64 scopeid 0x20<link>
ether 32:02:00:33:33:33 txqueuelen 1000 (Ethernet)
RX packets 10 bytes 512 (512.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 22 bytes 1536 (1.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 26 bytes 1378 (1.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 26 bytes 1378 (1.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Please help me with this. I am stuck with the TCP transport, which is the
most basic of all transports.
Thanks in advance
Durga
1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!
On Tue, Apr 12, 2016 at 9:32 PM, Gilles Gouaillardet <[email protected]>
wrote:
> This is quite unlikely, and fwiw, your test program works for me.
>
> I suggest you check that your 3 TCP networks are usable, for example:
>
> $ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 --mca
> btl_tcp_if_include xxx ./mpitest
>
> in which xxx is an interface name or a comma-separated list of names:
> eth0
> eth1
> ib0
> eth0,eth1
> eth0,ib0
> ...
> eth0,eth1,ib0
>
> and see where the problem starts occurring.
>
> BTW, are your 3 interfaces in 3 different subnets? Is routing required
> between two interfaces of the same type?
>
> Cheers,
>
> Gilles
>
> On 4/13/2016 7:15 AM, dpchoudh . wrote:
>
> Hi all
>
> I have reported this issue before, but then had brushed it off as
> something that was caused by my modifications to the source tree. It looks
> like that is not the case.
>
> Just now, I did the following:
>
> 1. Cloned a fresh copy from master.
> 2. Configured with the following flags, built and installed it in my
> two-node "cluster".
> --enable-debug --enable-debug-symbols --disable-dlopen
> 3. Compiled the following program, mpitest.c with these flags: -g3 -Wall
> -Wextra
> 4. Ran it like this:
> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
> -mca pml ob1 ./mpitest
>
> With this, the code hangs at MPI_Barrier() on both nodes, after generating
> the following output:
>
> Hello world from processor smallMPI, rank 0 out of 2 processors
> Hello world from processor bigMPI, rank 1 out of 2 processors
> smallMPI sent haha!
> bigMPI received haha!
> <Hangs until killed by ^C>
> Attaching to the hung process at one node gives the following backtrace:
>
> (gdb) bt
> #0 0x00007f55b0f41c3d in poll () from /lib64/libc.so.6
> #1 0x00007f55b03ccde6 in poll_dispatch (base=0x70e7b0, tv=0x7ffd1bb551c0)
> at poll.c:165
> #2 0x00007f55b03c4a90 in opal_libevent2022_event_base_loop
> (base=0x70e7b0, flags=2) at event.c:1630
> #3 0x00007f55b02f0144 in opal_progress () at runtime/opal_progress.c:171
> #4 0x00007f55b14b4d8b in opal_condition_wait (c=0x7f55b19fec40
> <ompi_request_cond>, m=0x7f55b19febc0 <ompi_request_lock>) at
> ../opal/threads/condition.h:76
> #5 0x00007f55b14b531b in ompi_request_default_wait_all (count=2,
> requests=0x7ffd1bb55370, statuses=0x7ffd1bb55340) at request/req_wait.c:287
> #6 0x00007f55b157a225 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16,
> source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
> at base/coll_base_barrier.c:63
> #7 0x00007f55b157a92a in ompi_coll_base_barrier_intra_two_procs
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at
> base/coll_base_barrier.c:308
> #8 0x00007f55b15aafec in ompi_coll_tuned_barrier_intra_dec_fixed
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at
> coll_tuned_decision_fixed.c:196
> #9 0x00007f55b14d36fd in PMPI_Barrier (comm=0x601280
> <ompi_mpi_comm_world>) at pbarrier.c:63
> #10 0x0000000000400b0b in main (argc=1, argv=0x7ffd1bb55658) at
> mpitest.c:26
> (gdb)
>
> Thinking that this might be a bug in the tuned collectives, since that is
> what the stack shows, I ran the program like this (basically adding the
> ^tuned part):
>
> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
> -mca pml ob1 -mca coll ^tuned ./mpitest
>
> It still hangs, but now with a different stack trace:
> (gdb) bt
> #0 0x00007f910d38ac3d in poll () from /lib64/libc.so.6
> #1 0x00007f910c815de6 in poll_dispatch (base=0x1a317b0,
> tv=0x7fff43ee3610) at poll.c:165
> #2 0x00007f910c80da90 in opal_libevent2022_event_base_loop
> (base=0x1a317b0, flags=2) at event.c:1630
> #3 0x00007f910c739144 in opal_progress () at runtime/opal_progress.c:171
> #4 0x00007f910db130f7 in opal_condition_wait (c=0x7f910de47c40
> <ompi_request_cond>, m=0x7f910de47bc0 <ompi_request_lock>)
> at ../../../../opal/threads/condition.h:76
> #5 0x00007f910db132d8 in ompi_request_wait_completion (req=0x1b07680) at
> ../../../../ompi/request/request.h:383
> #6 0x00007f910db1533b in mca_pml_ob1_send (buf=0x0, count=0,
> datatype=0x7f910de1e340 <ompi_mpi_byte>, dst=1, tag=-16,
> sendmode=MCA_PML_BASE_SEND_STANDARD,
> comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259
> #7 0x00007f910d9c3b38 in ompi_coll_base_barrier_intra_basic_linear
> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1b092c0) at
> base/coll_base_barrier.c:368
> #8 0x00007f910d91c6fd in PMPI_Barrier (comm=0x601280
> <ompi_mpi_comm_world>) at pbarrier.c:63
> #9 0x0000000000400b0b in main (argc=1, argv=0x7fff43ee3a58) at
> mpitest.c:26
> (gdb)
>
> The mpitest.c program is as follows:
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
>
> int main(int argc, char** argv)
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d processors\n", hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("%s received %s\n", hostname, buf);
>     }
>     else
>     {
>         strcpy(buf, "haha!");
>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>         printf("%s sent %s\n", hostname, buf);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
> The hostfile is as follows:
> 10.10.10.10 slots=1
> 10.10.10.11 slots=1
>
> The two nodes are connected by three physical and three logical networks:
> Physical: Gigabit Ethernet, 10G iWARP, 20G Infiniband
> Logical: IP (all 3), PSM (QLogic Infiniband), Verbs (iWARP and Infiniband)
>
> Please note again that this is a fresh, brand new clone.
>
> Is this a bug (perhaps a side effect of --disable-dlopen) or something I
> am doing wrong?
>
> Thanks
> Durga
>
> We learn from history that we never learn from history.
>
>