Dear developers,
Thank you all for jumping in to help. Here is what I have found so far:
1. Running Netpipe (NPmpi) between the two nodes (in either order) was successful, but following this test, my original code still hung.
2. Following Gilles's advice, I then added an MPI_Barrier() at the end of the code, just before MPI_Finalize(), and, to my surprise, the code ran to completion! (A minimal sketch of this barrier variant is included further below.)
3. Then I took out the barrier, leaving the code the way it was before, and it still ran to completion!
4. I tried several variations of the call sequence, and all of them ran successfully.

I can't explain why the runtime behavior seems to depend on the phase of the moon, but, although I cannot prove it, I have a gut feeling there is a bug somewhere in the development branch. I never run into this issue when running the release branch. (I sometimes work as an MPI application developer, in which case I use the release branch, and sometimes as an MPI developer, in which case I use the master branch.)

Thank you all, again.
Durga

1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!

On Mon, Apr 18, 2016 at 8:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote: > Durga, > > Can you run a simple netpipe over TCP using either of the two interfaces you > mentioned? > > George > On Apr 18, 2016 11:08 AM, "Gilles Gouaillardet" < > gilles.gouaillar...@gmail.com> wrote: > >> Another test is to swap the hostnames. >> If the single barrier test fails, this can hint at a firewall. >> >> Cheers, >> >> Gilles >> >> Gilles Gouaillardet <gil...@rist.or.jp> wrote: >> sudo make uninstall >> will not remove modules that are no longer built >> sudo rm -rf /usr/local/lib/openmpi >> is safe though >> >> i confirm i did not see any issue on a system with two networks >> >> Cheers, >> >> Gilles >> >> On 4/18/2016 2:53 PM, dpchoudh . wrote: >> >> Hello Gilles >> >> I did a >> sudo make uninstall >> followed by a >> sudo make install >> on both nodes. But that did not make a difference. I will try your >> tarball build suggestion a bit later. >> >> What I find a bit strange is that only I seem to be getting into this >> issue. What could I be doing wrong? Or am I discovering an obscure bug? >> >> Thanks >> Durga >> >> 1% of the executables have 99% of CPU privilege! >> Userspace code! Unite!! Occupy the kernel!!! >> >> On Mon, Apr 18, 2016 at 1:21 AM, Gilles Gouaillardet <gil...@rist.or.jp> >> wrote: >> >>> so you might want to >>> rm -rf /usr/local/lib/openmpi >>> and run >>> make install >>> again, just to make sure old stuff does not get in the way >>> >>> Cheers, >>> >>> Gilles >>> >>> >>> On 4/18/2016 2:12 PM, dpchoudh . wrote: >>> >>> Hello Gilles >>> >>> Thank you very much for your feedback.
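For reference, the barrier variant mentioned in point 2 above looks roughly like the sketch below. This is a minimal sketch rather than the exact file I ran; it has essentially the same structure as the mpitest.c quoted later in this thread (trimmed slightly), with a single MPI_Barrier() added just before MPI_Finalize().

/* Minimal sketch of the "barrier before finalize" variant from point 2
 * above; not the exact file that was run, but the same shape as the
 * mpitest.c quoted further down in this thread. */
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    char host[128];
    int n;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(host, &n);
    printf("Hello from %s\n", host);
    MPI_Comm_rank(MPI_COMM_WORLD, &n);
    if (n == 0)
    {
        strcpy(host, "ha!");
        MPI_Send(host, strlen(host) + 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
        printf("sent %s\n", host);
    }
    else
    {
        memset(host, 0, sizeof(host));
        MPI_Recv(host, 4, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Received %s from rank 0\n", host);
    }
    /* The only change relative to the original test: a barrier
     * immediately before MPI_Finalize(). */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

It was launched the same way as the earlier runs (mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 -mca btl_tcp_if_include eno1 ./mpitest).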
You are right that my original >>> stack trace was on code that was several weeks behind, but updating it just >>> now did not seem to make a difference: I am copying the stack from the >>> latest code below: >>> >>> On the master node: >>> >>> (gdb) bt >>> #0 0x00007fc0524cbb7d in poll () from /lib64/libc.so.6 >>> #1 0x00007fc051e53116 in poll_dispatch (base=0x1aabbe0, >>> tv=0x7fff29fcb240) at poll.c:165 >>> #2 0x00007fc051e4adb0 in opal_libevent2022_event_base_loop >>> (base=0x1aabbe0, flags=2) at event.c:1630 >>> #3 0x00007fc051de9a00 in opal_progress () at runtime/opal_progress.c:171 >>> #4 0x00007fc04ce46b0b in opal_condition_wait (c=0x7fc052d3cde0 >>> <ompi_request_cond>, >>> m=0x7fc052d3cd60 <ompi_request_lock>) at >>> ../../../../opal/threads/condition.h:76 >>> #5 0x00007fc04ce46cec in ompi_request_wait_completion (req=0x1b7b580) >>> at ../../../../ompi/request/request.h:383 >>> #6 0x00007fc04ce48d4f in mca_pml_ob1_send (buf=0x7fff29fcb480, count=4, >>> datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1, >>> sendmode=MCA_PML_BASE_SEND_STANDARD, >>> comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259 >>> #7 0x00007fc052a62d73 in PMPI_Send (buf=0x7fff29fcb480, count=4, >>> type=0x601080 <ompi_mpi_char>, dest=1, >>> tag=1, comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78 >>> #8 0x0000000000400afa in main (argc=1, argv=0x7fff29fcb5e8) at >>> mpitest.c:19 >>> (gdb) >>> >>> And on the non-master node >>> >>> (gdb) bt >>> #0 0x00007fad2c32148d in nanosleep () from /lib64/libc.so.6 >>> #1 0x00007fad2c352014 in usleep () from /lib64/libc.so.6 >>> #2 0x00007fad296412de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0, >>> nprocs=0, info=0x0, ninfo=0) >>> at src/client/pmix_client_fence.c:100 >>> #3 0x00007fad2960e1a6 in pmix120_fence (procs=0x0, collect_data=0) at >>> pmix120_client.c:258 >>> #4 0x00007fad2c89b2da in ompi_mpi_finalize () at >>> runtime/ompi_mpi_finalize.c:242 >>> #5 0x00007fad2c8c5849 in PMPI_Finalize () at pfinalize.c:47 >>> #6 0x0000000000400958 in main (argc=1, argv=0x7fff163879c8) at >>> mpitest.c:30 >>> (gdb) >>> >>> And my configuration was done as follows: >>> >>> $ ./configure --enable-debug --enable-debug-symbols >>> >>> I double checked to ensure that there is not an older installation of >>> OpenMPI that is getting mixed up with the master branch. >>> sudo yum list installed | grep -i mpi >>> shows nothing on both nodes, and pmap -p <pid> shows that all the >>> libraries are coming from /usr/local/lib, which seems to be correct. I >>> am also quite sure about the firewall issue (that there is none). I will >>> try out your suggestion on installing from a tarball and see how it goes. >>> >>> Thanks >>> Durga >>> >>> 1% of the executables have 99% of CPU privilege! >>> Userspace code! Unite!! Occupy the kernel!!! >>> >>> On Mon, Apr 18, 2016 at 12:47 AM, Gilles Gouaillardet < >>> <gil...@rist.or.jp>gil...@rist.or.jp> wrote: >>> >>>> here is your stack trace >>>> >>>> #6 0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0, count=4, >>>> datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1, >>>> sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280 >>>> <ompi_mpi_comm_world>) >>>> >>>> at line 251 >>>> >>>> >>>> that would be line 259 in current master, and this file was updated 21 >>>> days ago >>>> and that suggests your master is not quite up to date. >>>> >>>> even if the message is sent eagerly, the ob1 pml does use an internal >>>> request it will wait for. >>>> >>>> btw, did you configure with --enable-mpi-thread-multiple ? 
>>>> did you configure with --enable-mpirun-prefix-by-default ? >>>> did you configure with --disable-dlopen ? >>>> >>>> at first, i d recommend you download a tarball from >>>> https://www.open-mpi.org/nightly/master, >>>> configure && make && make install >>>> using a new install dir, and check if the issue is still here or not. >>>> >>>> there could be some side effects if some old modules were not removed >>>> and/or if you are >>>> not using the modules you expect. >>>> /* when it hangs, you can pmap <pid> and check the path of the openmpi >>>> libraries are the one you expect */ >>>> >>>> what if you do not send/recv but invoke MPI_Barrier multiple times ? >>>> what if you send/recv a one byte message instead ? >>>> did you double check there is no firewall running on your nodes ? >>>> >>>> Cheers, >>>> >>>> Gilles >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 4/18/2016 1:06 PM, dpchoudh . wrote: >>>> >>>> Thank you for your suggestion, Ralph. But it did not make any >>>> difference. >>>> >>>> Let me say that my code is about a week stale. I just did a git pull >>>> and am building it right now. The build takes quite a bit of time, so I >>>> avoid doing that unless there is a reason. But what I am trying out is the >>>> most basic functionality, so I'd think a week or so of lag would not make a >>>> difference. >>>> >>>> Does the stack trace suggest something to you? It seems that the send >>>> hangs; but a 4 byte send should be sent eagerly. >>>> >>>> Best regards >>>> 'Durga >>>> >>>> 1% of the executables have 99% of CPU privilege! >>>> Userspace code! Unite!! Occupy the kernel!!! >>>> >>>> On Sun, Apr 17, 2016 at 11:55 PM, Ralph Castain < <r...@open-mpi.org> >>>> r...@open-mpi.org> wrote: >>>> >>>>> Try adding -mca oob_tcp_if_include eno1 to your cmd line and see if >>>>> that makes a difference >>>>> >>>>> On Apr 17, 2016, at 8:43 PM, dpchoudh . < <dpcho...@gmail.com> >>>>> dpcho...@gmail.com> wrote: >>>>> >>>>> Hello Gilles and all >>>>> >>>>> I am sorry to be bugging the developers, but this issue seems to be >>>>> nagging me, and I am surprised it does not seem to affect anybody else. >>>>> But >>>>> then again, I am using the master branch, and most users are probably >>>>> using >>>>> a released version. >>>>> >>>>> This time I am using a totally different cluster. This has NO verbs >>>>> capable interface; just 2 Ethernet (1 of which has no IP address and hence >>>>> is unusable) plus 1 proprietary interface that currently supports only IP >>>>> traffic. The two IP interfaces (Ethernet and proprietary) are on different >>>>> IP subnets. 
>>>>> >>>>> My test program is as follows: >>>>> >>>>> #include <stdio.h> >>>>> #include <string.h> >>>>> #include "mpi.h" >>>>> int main(int argc, char *argv[]) >>>>> { >>>>> char host[128]; >>>>> int n; >>>>> MPI_Init(&argc, &argv); >>>>> MPI_Get_processor_name(host, &n); >>>>> printf("Hello from %s\n", host); >>>>> MPI_Comm_size(MPI_COMM_WORLD, &n); >>>>> printf("The world has %d nodes\n", n); >>>>> MPI_Comm_rank(MPI_COMM_WORLD, &n); >>>>> printf("My rank is %d\n",n); >>>>> //#if 0 >>>>> if (n == 0) >>>>> { >>>>> strcpy(host, "ha!"); >>>>> MPI_Send(host, strlen(host) + 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD); >>>>> printf("sent %s\n", host); >>>>> } >>>>> else >>>>> { >>>>> //int len = strlen(host) + 1; >>>>> bzero(host, 128); >>>>> MPI_Recv(host, 4, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); >>>>> printf("Received %s from rank 0\n", host); >>>>> } >>>>> //#endif >>>>> MPI_Finalize(); >>>>> return 0; >>>>> } >>>>> >>>>> This program, when run between two nodes, hangs. The command was: >>>>> [durga@b-1 ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp >>>>> -mca pml ob1 -mca btl_tcp_if_include eno1 ./mpitest >>>>> >>>>> And the hang is with the following output: (eno1 is one of the gigEth >>>>> interfaces, that takes OOB traffic as well) >>>>> >>>>> Hello from b-1 >>>>> The world has 2 nodes >>>>> My rank is 0 >>>>> Hello from b-2 >>>>> The world has 2 nodes >>>>> My rank is 1 >>>>> >>>>> Note that if I uncomment the #if 0 - #endif (i.e. comment out the >>>>> MPI_Send()/MPI_Recv() part, the program runs to completion. Also note that >>>>> the printfs following MPI_Send()/MPI_Recv() do not show up on console. >>>>> >>>>> Upon attaching gdb, the stack trace from the master node is as follows: >>>>> >>>>> Missing separate debuginfos, use: debuginfo-install >>>>> glibc-2.17-78.el7.x86_64 libpciaccess-0.13.4-2.el7.x86_64 >>>>> (gdb) bt >>>>> #0 0x00007f72a533eb7d in poll () from /lib64/libc.so.6 >>>>> #1 0x00007f72a4cb7146 in poll_dispatch (base=0xee33d0, >>>>> tv=0x7fff81057b70) >>>>> at poll.c:165 >>>>> #2 0x00007f72a4caede0 in opal_libevent2022_event_base_loop >>>>> (base=0xee33d0, >>>>> flags=2) at event.c:1630 >>>>> #3 0x00007f72a4c4e692 in opal_progress () at >>>>> runtime/opal_progress.c:171 >>>>> #4 0x00007f72a0d07ac1 in opal_condition_wait ( >>>>> c=0x7f72a5bb1e00 <ompi_request_cond>, m=0x7f72a5bb1d80 >>>>> <ompi_request_lock>) >>>>> at ../../../../opal/threads/condition.h:76 >>>>> #5 0x00007f72a0d07ca2 in ompi_request_wait_completion (req=0x113eb80) >>>>> at ../../../../ompi/request/request.h:383 >>>>> #6 0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0, >>>>> count=4, >>>>> datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1, >>>>> sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280 >>>>> <ompi_mpi_comm_world>) >>>>> at pml_ob1_isend.c:251 >>>>> #7 0x00007f72a58d6be3 in PMPI_Send (buf=0x7fff81057db0, count=4, >>>>> type=0x601080 <ompi_mpi_char>, dest=1, tag=1, >>>>> comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78 >>>>> #8 0x0000000000400afa in main (argc=1, argv=0x7fff81057f18) at >>>>> mpitest.c:19 >>>>> (gdb) >>>>> >>>>> And the backtrace on the non-master node is: >>>>> >>>>> (gdb) bt >>>>> #0 0x00007ff3b377e48d in nanosleep () from /lib64/libc.so.6 >>>>> #1 0x00007ff3b37af014 in usleep () from /lib64/libc.so.6 >>>>> #2 0x00007ff3b0c922de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0, >>>>> nprocs=0, >>>>> info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100 >>>>> #3 0x00007ff3b0c5f1a6 in pmix120_fence (procs=0x0, collect_data=0) >>>>> at pmix120_client.c:258 
>>>>> #4 0x00007ff3b3cf8f4b in ompi_mpi_finalize () >>>>> at runtime/ompi_mpi_finalize.c:242 >>>>> #5 0x00007ff3b3d23295 in PMPI_Finalize () at pfinalize.c:47 >>>>> #6 0x0000000000400958 in main (argc=1, argv=0x7fff785e8788) at >>>>> mpitest.c:30 >>>>> (gdb) >>>>> >>>>> The hostfile is as follows: >>>>> >>>>> [durga@b-1 ~]$ cat hostfile >>>>> 10.4.70.10 slots=1 >>>>> 10.4.70.11 slots=1 >>>>> #10.4.70.12 slots=1 >>>>> >>>>> And the ifconfig output from the master node is as follows (the other >>>>> node is similar; all the IP interfaces are in their respective subnets) : >>>>> >>>>> [durga@b-1 ~]$ ifconfig >>>>> eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 >>>>> inet 10.4.70.10 netmask 255.255.255.0 broadcast 10.4.70.255 >>>>> inet6 fe80::21e:c9ff:fefe:13df prefixlen 64 scopeid >>>>> 0x20<link> >>>>> ether 00:1e:c9:fe:13:df txqueuelen 1000 (Ethernet) >>>>> RX packets 48215 bytes 27842846 (26.5 MiB) >>>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>>> TX packets 52746 bytes 7817568 (7.4 MiB) >>>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>>> device interrupt 16 >>>>> >>>>> eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 >>>>> ether 00:1e:c9:fe:13:e0 txqueuelen 1000 (Ethernet) >>>>> RX packets 0 bytes 0 (0.0 B) >>>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>>> TX packets 0 bytes 0 (0.0 B) >>>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>>> device interrupt 17 >>>>> >>>>> lf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2016 >>>>> inet 192.168.1.2 netmask 255.255.255.0 broadcast >>>>> 192.168.1.255 >>>>> inet6 fe80::3002:ff:fe33:3333 prefixlen 64 scopeid 0x20<link> >>>>> ether 32:02:00:33:33:33 txqueuelen 1000 (Ethernet) >>>>> RX packets 10 bytes 512 (512.0 B) >>>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>>> TX packets 22 bytes 1536 (1.5 KiB) >>>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>>> >>>>> lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 >>>>> inet 127.0.0.1 netmask 255.0.0.0 >>>>> inet6 ::1 prefixlen 128 scopeid 0x10<host> >>>>> loop txqueuelen 0 (Local Loopback) >>>>> RX packets 26 bytes 1378 (1.3 KiB) >>>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>>> TX packets 26 bytes 1378 (1.3 KiB) >>>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>>> >>>>> Please help me with this. I am stuck with the TCP transport, which is >>>>> the most basic of all transports. >>>>> >>>>> Thanks in advance >>>>> Durga >>>>> >>>>> >>>>> 1% of the executables have 99% of CPU privilege! >>>>> Userspace code! Unite!! Occupy the kernel!!! >>>>> >>>>> On Tue, Apr 12, 2016 at 9:32 PM, Gilles Gouaillardet < >>>>> <gil...@rist.or.jp>gil...@rist.or.jp> wrote: >>>>> >>>>>> This is quite unlikely, and fwiw, your test program works for me. >>>>>> >>>>>> i suggest you check your 3 TCP networks are usable, for example >>>>>> >>>>>> $ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 >>>>>> --mca btl_tcp_if_include xxx ./mpitest >>>>>> >>>>>> in which xxx is a [list of] interface name : >>>>>> eth0 >>>>>> eth1 >>>>>> ib0 >>>>>> eth0,eth1 >>>>>> eth0,ib0 >>>>>> ... >>>>>> eth0,eth1,ib0 >>>>>> >>>>>> and see where problem start occuring. >>>>>> >>>>>> btw, are your 3 interfaces in 3 different subnet ? is routing >>>>>> required between two interfaces of the same type ? >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Gilles >>>>>> >>>>>> On 4/13/2016 7:15 AM, dpchoudh . 
wrote: >>>>>> >>>>>> Hi all >>>>>> >>>>>> I have reported this issue before, but then had brushed it off as >>>>>> something that was caused by my modifications to the source tree. It >>>>>> looks >>>>>> like that is not the case. >>>>>> >>>>>> Just now, I did the following: >>>>>> >>>>>> 1. Cloned a fresh copy from master. >>>>>> 2. Configured with the following flags, built and installed it in my >>>>>> two-node "cluster". >>>>>> --enable-debug --enable-debug-symbols --disable-dlopen >>>>>> 3. Compiled the following program, mpitest.c with these flags: -g3 >>>>>> -Wall -Wextra >>>>>> 4. Ran it like this: >>>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl >>>>>> self,tcp -mca pml ob1 ./mpitest >>>>>> >>>>>> With this, the code hangs at MPI_Barrier() on both nodes, after >>>>>> generating the following output: >>>>>> >>>>>> Hello world from processor smallMPI, rank 0 out of 2 processors >>>>>> Hello world from processor bigMPI, rank 1 out of 2 processors >>>>>> smallMPI sent haha! >>>>>> bigMPI received haha! >>>>>> <Hangs until killed by ^C> >>>>>> Attaching to the hung process at one node gives the following >>>>>> backtrace: >>>>>> >>>>>> (gdb) bt >>>>>> #0 0x00007f55b0f41c3d in poll () from /lib64/libc.so.6 >>>>>> #1 0x00007f55b03ccde6 in poll_dispatch (base=0x70e7b0, >>>>>> tv=0x7ffd1bb551c0) at poll.c:165 >>>>>> #2 0x00007f55b03c4a90 in opal_libevent2022_event_base_loop >>>>>> (base=0x70e7b0, flags=2) at event.c:1630 >>>>>> #3 0x00007f55b02f0144 in opal_progress () at >>>>>> runtime/opal_progress.c:171 >>>>>> #4 0x00007f55b14b4d8b in opal_condition_wait (c=0x7f55b19fec40 >>>>>> <ompi_request_cond>, m=0x7f55b19febc0 <ompi_request_lock>) at >>>>>> ../opal/threads/condition.h:76 >>>>>> #5 0x00007f55b14b531b in ompi_request_default_wait_all (count=2, >>>>>> requests=0x7ffd1bb55370, statuses=0x7ffd1bb55340) at >>>>>> request/req_wait.c:287 >>>>>> #6 0x00007f55b157a225 in ompi_coll_base_sendrecv_zero (dest=1, >>>>>> stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>) >>>>>> at base/coll_base_barrier.c:63 >>>>>> #7 0x00007f55b157a92a in ompi_coll_base_barrier_intra_two_procs >>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at >>>>>> base/coll_base_barrier.c:308 >>>>>> #8 0x00007f55b15aafec in ompi_coll_tuned_barrier_intra_dec_fixed >>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at >>>>>> coll_tuned_decision_fixed.c:196 >>>>>> #9 0x00007f55b14d36fd in PMPI_Barrier (comm=0x601280 >>>>>> <ompi_mpi_comm_world>) at pbarrier.c:63 >>>>>> #10 0x0000000000400b0b in main (argc=1, argv=0x7ffd1bb55658) at >>>>>> mpitest.c:26 >>>>>> (gdb) >>>>>> >>>>>> Thinking that this might be a bug in tuned collectives, since that is >>>>>> what the stack shows, I ran the program like this (basically adding the >>>>>> ^tuned part) >>>>>> >>>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl >>>>>> self,tcp -mca pml ob1 -mca coll ^tuned ./mpitest >>>>>> >>>>>> It still hangs, but now with a different stack trace: >>>>>> (gdb) bt >>>>>> #0 0x00007f910d38ac3d in poll () from /lib64/libc.so.6 >>>>>> #1 0x00007f910c815de6 in poll_dispatch (base=0x1a317b0, >>>>>> tv=0x7fff43ee3610) at poll.c:165 >>>>>> #2 0x00007f910c80da90 in opal_libevent2022_event_base_loop >>>>>> (base=0x1a317b0, flags=2) at event.c:1630 >>>>>> #3 0x00007f910c739144 in opal_progress () at >>>>>> runtime/opal_progress.c:171 >>>>>> #4 0x00007f910db130f7 in opal_condition_wait (c=0x7f910de47c40 >>>>>> <ompi_request_cond>, m=0x7f910de47bc0 <ompi_request_lock>) >>>>>> at 
../../../../opal/threads/condition.h:76 >>>>>> #5 0x00007f910db132d8 in ompi_request_wait_completion >>>>>> (req=0x1b07680) at ../../../../ompi/request/request.h:383 >>>>>> #6 0x00007f910db1533b in mca_pml_ob1_send (buf=0x0, count=0, >>>>>> datatype=0x7f910de1e340 <ompi_mpi_byte>, dst=1, tag=-16, >>>>>> sendmode=MCA_PML_BASE_SEND_STANDARD, >>>>>> comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259 >>>>>> #7 0x00007f910d9c3b38 in ompi_coll_base_barrier_intra_basic_linear >>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1b092c0) at >>>>>> base/coll_base_barrier.c:368 >>>>>> #8 0x00007f910d91c6fd in PMPI_Barrier (comm=0x601280 >>>>>> <ompi_mpi_comm_world>) at pbarrier.c:63 >>>>>> #9 0x0000000000400b0b in main (argc=1, argv=0x7fff43ee3a58) at >>>>>> mpitest.c:26 >>>>>> (gdb) >>>>>> >>>>>> The mpitest.c program is as follows: >>>>>> #include <mpi.h> >>>>>> #include <stdio.h> >>>>>> #include <string.h> >>>>>> >>>>>> int main(int argc, char** argv) >>>>>> { >>>>>> int world_size, world_rank, name_len; >>>>>> char hostname[MPI_MAX_PROCESSOR_NAME], buf[8]; >>>>>> >>>>>> MPI_Init(&argc, &argv); >>>>>> MPI_Comm_size(MPI_COMM_WORLD, &world_size); >>>>>> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); >>>>>> MPI_Get_processor_name(hostname, &name_len); >>>>>> printf("Hello world from processor %s, rank %d out of %d >>>>>> processors\n", hostname, world_rank, world_size); >>>>>> if (world_rank == 1) >>>>>> { >>>>>> MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, >>>>>> MPI_STATUS_IGNORE); >>>>>> printf("%s received %s\n", hostname, buf); >>>>>> } >>>>>> else >>>>>> { >>>>>> strcpy(buf, "haha!"); >>>>>> MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD); >>>>>> printf("%s sent %s\n", hostname, buf); >>>>>> } >>>>>> MPI_Barrier(MPI_COMM_WORLD); >>>>>> MPI_Finalize(); >>>>>> return 0; >>>>>> } >>>>>> >>>>>> The hostfile is as follows: >>>>>> 10.10.10.10 slots=1 >>>>>> 10.10.10.11 slots=1 >>>>>> >>>>>> The two nodes are connected by three physical and 3 logical networks: >>>>>> Physical: Gigabit Ethernet, 10G iWARP, 20G Infiniband >>>>>> Logical: IP (all 3), PSM (Qlogic Infiniband), Verbs (iWARP and >>>>>> Infiniband) >>>>>> >>>>>> Please note again that this is a fresh, brand new clone. >>>>>> >>>>>> Is this a bug (perhaps a side effect of --disable-dlopen) or >>>>>> something I am doing wrong? >>>>>> >>>>>> Thanks >>>>>> Durga >>>>>> >>>>>> We learn from history that we never learn from history. >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing listus...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/users/2016/04/28930.php >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> <us...@open-mpi.org>us...@open-mpi.org >>>>>> Subscription: <http://www.open-mpi.org/mailman/listinfo.cgi/users> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> Link to this post: >>>>>> <http://www.open-mpi.org/community/lists/users/2016/04/28932.php> >>>>> >>>>> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2016/04/28951.php >> ... 
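For completeness, the barrier-only diagnostic Gilles suggested earlier in the thread (invoking MPI_Barrier() several times, with no send/recv at all) would look roughly like the following sketch; the iteration count of 10 is an arbitrary choice for illustration.

/* Rough sketch of the barrier-only diagnostic suggested by Gilles:
 * no point-to-point traffic, just repeated MPI_Barrier() calls.
 * The iteration count (10) is arbitrary. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 10; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("completed barrier %d\n", i);
    }
    MPI_Finalize();
    return 0;
}

If this variant also hangs, the problem is unlikely to be specific to the point-to-point path exercised by MPI_Send/MPI_Recv in the original test.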