Durga,

Can you run a simple NetPIPE test over TCP using either of the two
interfaces you mentioned?
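
For reference, a minimal NetPIPE TCP run would look roughly like this
(assuming the NPtcp binary that ships with NetPIPE; the addresses are the
ones from your hostfile, so adjust as needed):

[on 10.4.70.11] $ NPtcp
[on 10.4.70.10] $ NPtcp -h 10.4.70.11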

George
On Apr 18, 2016 11:08 AM, "Gilles Gouaillardet" <
gilles.gouaillar...@gmail.com> wrote:

> Another test is to swap the hostnames.
> If the single barrier test fails, that can hint at a firewall.
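>
> For example, on a RHEL/CentOS 7 style node (which your el7 glibc packages
> suggest), a quick firewall check would be something like:
>
> $ sudo systemctl status firewalld
> $ sudo iptables -L -n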
>
> Cheers,
>
> Gilles
>
> Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> sudo make uninstall
> will not remove modules that are no longer built.
> sudo rm -rf /usr/local/lib/openmpi
> is safe though.
>
> I confirm I did not see any issue on a system with two networks.
>
> Cheers,
>
> Gilles
>
> On 4/18/2016 2:53 PM, dpchoudh . wrote:
>
> Hello Gilles
>
> I did a
> sudo make uninstall
> followed by a
> sudo make install
> on both nodes. But that did not make a difference. I will try your tarball
> build suggestion a bit later.
>
> What I find a bit strange is that only I seem to be running into this
> issue. What could I be doing wrong? Or am I discovering an obscure bug?
>
> Thanks
> Durga
>
> 1% of the executables have 99% of CPU privilege!
> Userspace code! Unite!! Occupy the kernel!!!
>
> On Mon, Apr 18, 2016 at 1:21 AM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
>> so you might want to
>> rm -rf /usr/local/lib/openmpi
>> and run
>> make install
>> again, just to make sure old stuff does not get in the way
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 4/18/2016 2:12 PM, dpchoudh . wrote:
>>
>> Hello Gilles
>>
>> Thank you very much for your feedback. You are right that my original
>> stack trace was on code that was several weeks behind, but updating it just
>> now did not seem to make a difference: I am copying the stack from the
>> latest code below:
>>
>> On the master node:
>>
>> (gdb) bt
>> #0  0x00007fc0524cbb7d in poll () from /lib64/libc.so.6
>> #1  0x00007fc051e53116 in poll_dispatch (base=0x1aabbe0,
>> tv=0x7fff29fcb240) at poll.c:165
>> #2  0x00007fc051e4adb0 in opal_libevent2022_event_base_loop
>> (base=0x1aabbe0, flags=2) at event.c:1630
>> #3  0x00007fc051de9a00 in opal_progress () at runtime/opal_progress.c:171
>> #4  0x00007fc04ce46b0b in opal_condition_wait (c=0x7fc052d3cde0
>> <ompi_request_cond>,
>>     m=0x7fc052d3cd60 <ompi_request_lock>) at
>> ../../../../opal/threads/condition.h:76
>> #5  0x00007fc04ce46cec in ompi_request_wait_completion (req=0x1b7b580)
>>     at ../../../../ompi/request/request.h:383
>> #6  0x00007fc04ce48d4f in mca_pml_ob1_send (buf=0x7fff29fcb480, count=4,
>>     datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
>> sendmode=MCA_PML_BASE_SEND_STANDARD,
>>     comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259
>> #7  0x00007fc052a62d73 in PMPI_Send (buf=0x7fff29fcb480, count=4,
>> type=0x601080 <ompi_mpi_char>, dest=1,
>>     tag=1, comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78
>> #8  0x0000000000400afa in main (argc=1, argv=0x7fff29fcb5e8) at
>> mpitest.c:19
>> (gdb)
>>
>> And on the non-master node
>>
>> (gdb) bt
>> #0  0x00007fad2c32148d in nanosleep () from /lib64/libc.so.6
>> #1  0x00007fad2c352014 in usleep () from /lib64/libc.so.6
>> #2  0x00007fad296412de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0,
>> nprocs=0, info=0x0, ninfo=0)
>>     at src/client/pmix_client_fence.c:100
>> #3  0x00007fad2960e1a6 in pmix120_fence (procs=0x0, collect_data=0) at
>> pmix120_client.c:258
>> #4  0x00007fad2c89b2da in ompi_mpi_finalize () at
>> runtime/ompi_mpi_finalize.c:242
>> #5  0x00007fad2c8c5849 in PMPI_Finalize () at pfinalize.c:47
>> #6  0x0000000000400958 in main (argc=1, argv=0x7fff163879c8) at
>> mpitest.c:30
>> (gdb)
>>
>> And my configuration was done as follows:
>>
>>  $ ./configure --enable-debug --enable-debug-symbols
>>
>> I double-checked to ensure that there is no older installation of
>> Open MPI getting mixed up with the master branch:
>> sudo yum list installed | grep -i mpi
>> shows nothing on either node, and pmap -p <pid> shows that all the
>> libraries are coming from /usr/local/lib, which seems correct. I am also
>> quite sure there is no firewall issue. I will try out your suggestion of
>> installing from a tarball and see how it goes.
>>
>> Thanks
>> Durga
>>
>> 1% of the executables have 99% of CPU privilege!
>> Userspace code! Unite!! Occupy the kernel!!!
>>
>> On Mon, Apr 18, 2016 at 12:47 AM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> here is your stack trace
>>>
>>> #6  0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0, count=4,
>>>     datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
>>>     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280
>>> <ompi_mpi_comm_world>)
>>>
>>> at line 251
>>>
>>>
>>> that would be line 259 in current master, and this file was updated 21
>>> days ago, which suggests your master is not quite up to date.
>>>
>>> even if the message is sent eagerly, the ob1 PML does use an internal
>>> request that it waits on.
>>>
>>> btw, did you configure with --enable-mpi-thread-multiple ?
>>> did you configure with --enable-mpirun-prefix-by-default ?
>>> did you configure with --disable-dlopen ?
>>>
>>> at first, I'd recommend you download a tarball from
>>> https://www.open-mpi.org/nightly/master,
>>> configure && make && make install
>>> using a new install dir, and check whether the issue is still there.
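>>> For example (the prefix below is just a placeholder, use any fresh
>>> directory):
>>> $ ./configure --prefix=$HOME/ompi-master-test --enable-debug
>>> $ make -j && make install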
>>>
>>> there could be some side effects if some old modules were not removed
>>> and/or if you are not using the modules you expect.
>>> /* when it hangs, you can pmap <pid> and check that the paths of the
>>> openmpi libraries are the ones you expect */
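>>> for example, something along these lines (the pgrep pattern is only
>>> illustrative):
>>> $ pmap -p $(pgrep -f mpitest) | grep -E 'openmpi|mca_'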
>>>
>>> what if you do not send/recv but invoke MPI_Barrier multiple times?
>>> what if you send/recv a one-byte message instead?
>>> did you double-check there is no firewall running on your nodes?
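>>>
>>> for the barrier-only test, I mean a minimal sketch like this:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     int i, rank;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     /* several barriers, no send/recv at all */
>>>     for (i = 0; i < 10; i++)
>>>     {
>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>         printf("rank %d passed barrier %d\n", rank, i);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }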
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 4/18/2016 1:06 PM, dpchoudh . wrote:
>>>
>>> Thank you for your suggestion, Ralph. But it did not make any difference.
>>>
>>> Let me say that my code is about a week stale. I just did a git pull and
>>> am building it right now. The build takes quite a bit of time, so I avoid
>>> doing that unless there is a reason. But what I am trying out is the most
>>> basic functionality, so I'd think a week or so of lag would not make a
>>> difference.
>>>
>>> Does the stack trace suggest something to you? It seems that the send
>>> hangs, but a 4-byte send should be sent eagerly.
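>>> As a sanity check on the eager path, I believe the TCP eager limit can
>>> be queried with something like:
>>> $ ompi_info --param btl tcp --level 9 | grep eager_limit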
>>>
>>> Best regards
>>> 'Durga
>>>
>>> 1% of the executables have 99% of CPU privilege!
>>> Userspace code! Unite!! Occupy the kernel!!!
>>>
>>> On Sun, Apr 17, 2016 at 11:55 PM, Ralph Castain <r...@open-mpi.org>
>>> wrote:
>>>
>>>> Try adding -mca oob_tcp_if_include eno1 to your cmd line and see if
>>>> that makes a difference
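>>>> i.e. something along the lines of:
>>>>
>>>> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 -mca btl_tcp_if_include eno1 -mca oob_tcp_if_include eno1 ./mpitest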
>>>>
>>>> On Apr 17, 2016, at 8:43 PM, dpchoudh . <dpcho...@gmail.com> wrote:
>>>>
>>>> Hello Gilles and all
>>>>
>>>> I am sorry to keep bugging the developers, but this issue keeps nagging
>>>> me, and I am surprised it does not seem to affect anybody else. Then
>>>> again, I am using the master branch, and most users are probably using
>>>> a released version.
>>>>
>>>> This time I am using a totally different cluster. It has NO
>>>> verbs-capable interface; just two Ethernet interfaces (one of which has
>>>> no IP address and hence is unusable) plus one proprietary interface that
>>>> currently supports only IP traffic. The two usable IP interfaces
>>>> (Ethernet and proprietary) are on different IP subnets.
>>>>
>>>> My test program is as follows:
>>>>
>>>> #include <stdio.h>
>>>> #include <string.h>
>>>> #include "mpi.h"
>>>> int main(int argc, char *argv[])
>>>> {
>>>>     char host[128];
>>>>     int n;
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Get_processor_name(host, &n);
>>>>     printf("Hello from %s\n", host);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &n);
>>>>     printf("The world has %d nodes\n", n);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &n);
>>>>     printf("My rank is %d\n", n);
>>>> //#if 0
>>>>     if (n == 0)
>>>>     {
>>>>         strcpy(host, "ha!");
>>>>         MPI_Send(host, strlen(host) + 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
>>>>         printf("sent %s\n", host);
>>>>     }
>>>>     else
>>>>     {
>>>>         //int len = strlen(host) + 1;
>>>>         bzero(host, 128);
>>>>         MPI_Recv(host, 4, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>         printf("Received %s from rank 0\n", host);
>>>>     }
>>>> //#endif
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>>
>>>> This program, when run between two nodes, hangs. The command was:
>>>> [durga@b-1 ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
>>>> -mca pml ob1 -mca btl_tcp_if_include eno1 ./mpitest
>>>>
>>>> And the hang occurs with the following output (eno1 is one of the GigE
>>>> interfaces, which carries OOB traffic as well):
>>>>
>>>> Hello from b-1
>>>> The world has 2 nodes
>>>> My rank is 0
>>>> Hello from b-2
>>>> The world has 2 nodes
>>>> My rank is 1
>>>>
>>>> Note that if I uncomment the #if 0 / #endif lines (i.e. compile out the
>>>> MPI_Send()/MPI_Recv() part), the program runs to completion. Also note
>>>> that the printfs following MPI_Send()/MPI_Recv() do not show up on the
>>>> console.
>>>>
>>>> Upon attaching gdb, the stack trace from the master node is as follows:
>>>>
>>>> Missing separate debuginfos, use: debuginfo-install
>>>> glibc-2.17-78.el7.x86_64 libpciaccess-0.13.4-2.el7.x86_64
>>>> (gdb) bt
>>>> #0  0x00007f72a533eb7d in poll () from /lib64/libc.so.6
>>>> #1  0x00007f72a4cb7146 in poll_dispatch (base=0xee33d0,
>>>> tv=0x7fff81057b70)
>>>>     at poll.c:165
>>>> #2  0x00007f72a4caede0 in opal_libevent2022_event_base_loop
>>>> (base=0xee33d0,
>>>>     flags=2) at event.c:1630
>>>> #3  0x00007f72a4c4e692 in opal_progress () at
>>>> runtime/opal_progress.c:171
>>>> #4  0x00007f72a0d07ac1 in opal_condition_wait (
>>>>     c=0x7f72a5bb1e00 <ompi_request_cond>, m=0x7f72a5bb1d80
>>>> <ompi_request_lock>)
>>>>     at ../../../../opal/threads/condition.h:76
>>>> #5  0x00007f72a0d07ca2 in ompi_request_wait_completion (req=0x113eb80)
>>>>     at ../../../../ompi/request/request.h:383
>>>> #6  0x00007f72a0d09cd5 in mca_pml_ob1_send (buf=0x7fff81057db0, count=4,
>>>>     datatype=0x601080 <ompi_mpi_char>, dst=1, tag=1,
>>>>     sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280
>>>> <ompi_mpi_comm_world>)
>>>>     at pml_ob1_isend.c:251
>>>> #7  0x00007f72a58d6be3 in PMPI_Send (buf=0x7fff81057db0, count=4,
>>>>     type=0x601080 <ompi_mpi_char>, dest=1, tag=1,
>>>>     comm=0x601280 <ompi_mpi_comm_world>) at psend.c:78
>>>> #8  0x0000000000400afa in main (argc=1, argv=0x7fff81057f18) at
>>>> mpitest.c:19
>>>> (gdb)
>>>>
>>>> And the backtrace on the non-master node is:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007ff3b377e48d in nanosleep () from /lib64/libc.so.6
>>>> #1  0x00007ff3b37af014 in usleep () from /lib64/libc.so.6
>>>> #2  0x00007ff3b0c922de in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0,
>>>> nprocs=0,
>>>>     info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100
>>>> #3  0x00007ff3b0c5f1a6 in pmix120_fence (procs=0x0, collect_data=0)
>>>>     at pmix120_client.c:258
>>>> #4  0x00007ff3b3cf8f4b in ompi_mpi_finalize ()
>>>>     at runtime/ompi_mpi_finalize.c:242
>>>> #5  0x00007ff3b3d23295 in PMPI_Finalize () at pfinalize.c:47
>>>> #6  0x0000000000400958 in main (argc=1, argv=0x7fff785e8788) at
>>>> mpitest.c:30
>>>> (gdb)
>>>>
>>>> The hostfile is as follows:
>>>>
>>>> [durga@b-1 ~]$ cat hostfile
>>>> 10.4.70.10 slots=1
>>>> 10.4.70.11 slots=1
>>>> #10.4.70.12 slots=1
>>>>
>>>> And the ifconfig output from the master node is as follows (the other
>>>> node is similar; all the IP interfaces are in their respective subnets) :
>>>>
>>>> [durga@b-1 ~]$ ifconfig
>>>> eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>>>>         inet 10.4.70.10  netmask 255.255.255.0  broadcast 10.4.70.255
>>>>         inet6 fe80::21e:c9ff:fefe:13df  prefixlen 64  scopeid 0x20<link>
>>>>         ether 00:1e:c9:fe:13:df  txqueuelen 1000  (Ethernet)
>>>>         RX packets 48215  bytes 27842846 (26.5 MiB)
>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>         TX packets 52746  bytes 7817568 (7.4 MiB)
>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>         device interrupt 16
>>>>
>>>> eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
>>>>         ether 00:1e:c9:fe:13:e0  txqueuelen 1000  (Ethernet)
>>>>         RX packets 0  bytes 0 (0.0 B)
>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>         TX packets 0  bytes 0 (0.0 B)
>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>         device interrupt 17
>>>>
>>>> lf0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2016
>>>>         inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
>>>>         inet6 fe80::3002:ff:fe33:3333  prefixlen 64  scopeid 0x20<link>
>>>>         ether 32:02:00:33:33:33  txqueuelen 1000  (Ethernet)
>>>>         RX packets 10  bytes 512 (512.0 B)
>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>         TX packets 22  bytes 1536 (1.5 KiB)
>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>
>>>> lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
>>>>         inet 127.0.0.1  netmask 255.0.0.0
>>>>         inet6 ::1  prefixlen 128  scopeid 0x10<host>
>>>>         loop  txqueuelen 0  (Local Loopback)
>>>>         RX packets 26  bytes 1378 (1.3 KiB)
>>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>         TX packets 26  bytes 1378 (1.3 KiB)
>>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>
>>>> Please help me with this. I am stuck with the TCP transport, which is
>>>> the most basic of all transports.
>>>>
>>>> Thanks in advance
>>>> Durga
>>>>
>>>>
>>>> 1% of the executables have 99% of CPU privilege!
>>>> Userspace code! Unite!! Occupy the kernel!!!
>>>>
>>>> On Tue, Apr 12, 2016 at 9:32 PM, Gilles Gouaillardet <gil...@rist.or.jp
>>>> > wrote:
>>>>
>>>>> This is quite unlikely, and fwiw, your test program works for me.
>>>>>
>>>>> I suggest you check that your 3 TCP networks are usable, for example
>>>>>
>>>>> $ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1
>>>>> --mca btl_tcp_if_include xxx ./mpitest
>>>>>
>>>>> where xxx is an interface name or a comma-separated list of interface names:
>>>>> eth0
>>>>> eth1
>>>>> ib0
>>>>> eth0,eth1
>>>>> eth0,ib0
>>>>> ...
>>>>> eth0,eth1,ib0
>>>>>
>>>>> and see where the problem starts occurring.
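>>>>>
>>>>> for example as a quick loop, substituting your actual interface names:
>>>>>
>>>>> $ for ifs in eth0 eth1 ib0 eth0,eth1 eth0,ib0 eth0,eth1,ib0; do mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 -mca btl_tcp_if_include $ifs ./mpitest; done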
>>>>>
>>>>> btw, are your 3 interfaces in 3 different subnets? is routing required
>>>>> between two interfaces of the same type?
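>>>>> (the output of ip addr show and ip route on each node would answer
>>>>> both questions at a glance)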
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 4/13/2016 7:15 AM, dpchoudh . wrote:
>>>>>
>>>>> Hi all
>>>>>
>>>>> I have reported this issue before, but then had brushed it off as
>>>>> something that was caused by my modifications to the source tree. It looks
>>>>> like that is not the case.
>>>>>
>>>>> Just now, I did the following:
>>>>>
>>>>> 1. Cloned a fresh copy from master.
>>>>> 2. Configured with the following flags, built and installed it in my
>>>>> two-node "cluster".
>>>>> --enable-debug --enable-debug-symbols --disable-dlopen
>>>>> 3. Compiled the following program, mpitest.c with these flags: -g3
>>>>> -Wall -Wextra
>>>>> 4. Ran it like this:
>>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl
>>>>> self,tcp -mca pml ob1 ./mpitest
>>>>>
>>>>> With this, the code hangs at MPI_Barrier() on both nodes, after
>>>>> generating the following output:
>>>>>
>>>>> Hello world from processor smallMPI, rank 0 out of 2 processors
>>>>> Hello world from processor bigMPI, rank 1 out of 2 processors
>>>>> smallMPI sent haha!
>>>>> bigMPI received haha!
>>>>> <Hangs until killed by ^C>
>>>>> Attaching to the hung process at one node gives the following
>>>>> backtrace:
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x00007f55b0f41c3d in poll () from /lib64/libc.so.6
>>>>> #1  0x00007f55b03ccde6 in poll_dispatch (base=0x70e7b0,
>>>>> tv=0x7ffd1bb551c0) at poll.c:165
>>>>> #2  0x00007f55b03c4a90 in opal_libevent2022_event_base_loop
>>>>> (base=0x70e7b0, flags=2) at event.c:1630
>>>>> #3  0x00007f55b02f0144 in opal_progress () at
>>>>> runtime/opal_progress.c:171
>>>>> #4  0x00007f55b14b4d8b in opal_condition_wait (c=0x7f55b19fec40
>>>>> <ompi_request_cond>, m=0x7f55b19febc0 <ompi_request_lock>) at
>>>>> ../opal/threads/condition.h:76
>>>>> #5  0x00007f55b14b531b in ompi_request_default_wait_all (count=2,
>>>>> requests=0x7ffd1bb55370, statuses=0x7ffd1bb55340) at 
>>>>> request/req_wait.c:287
>>>>> #6  0x00007f55b157a225 in ompi_coll_base_sendrecv_zero (dest=1,
>>>>> stag=-16, source=1, rtag=-16, comm=0x601280 <ompi_mpi_comm_world>)
>>>>>     at base/coll_base_barrier.c:63
>>>>> #7  0x00007f55b157a92a in ompi_coll_base_barrier_intra_two_procs
>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at
>>>>> base/coll_base_barrier.c:308
>>>>> #8  0x00007f55b15aafec in ompi_coll_tuned_barrier_intra_dec_fixed
>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x7c2630) at
>>>>> coll_tuned_decision_fixed.c:196
>>>>> #9  0x00007f55b14d36fd in PMPI_Barrier (comm=0x601280
>>>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>>>> #10 0x0000000000400b0b in main (argc=1, argv=0x7ffd1bb55658) at
>>>>> mpitest.c:26
>>>>> (gdb)
>>>>>
>>>>> Thinking that this might be a bug in tuned collectives, since that is
>>>>> what the stack shows, I ran the program like this (basically adding the
>>>>> ^tuned part)
>>>>>
>>>>> [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl
>>>>> self,tcp -mca pml ob1 -mca coll ^tuned ./mpitest
>>>>>
>>>>> It still hangs, but now with a different stack trace:
>>>>> (gdb) bt
>>>>> #0  0x00007f910d38ac3d in poll () from /lib64/libc.so.6
>>>>> #1  0x00007f910c815de6 in poll_dispatch (base=0x1a317b0,
>>>>> tv=0x7fff43ee3610) at poll.c:165
>>>>> #2  0x00007f910c80da90 in opal_libevent2022_event_base_loop
>>>>> (base=0x1a317b0, flags=2) at event.c:1630
>>>>> #3  0x00007f910c739144 in opal_progress () at
>>>>> runtime/opal_progress.c:171
>>>>> #4  0x00007f910db130f7 in opal_condition_wait (c=0x7f910de47c40
>>>>> <ompi_request_cond>, m=0x7f910de47bc0 <ompi_request_lock>)
>>>>>     at ../../../../opal/threads/condition.h:76
>>>>> #5  0x00007f910db132d8 in ompi_request_wait_completion (req=0x1b07680)
>>>>> at ../../../../ompi/request/request.h:383
>>>>> #6  0x00007f910db1533b in mca_pml_ob1_send (buf=0x0, count=0,
>>>>> datatype=0x7f910de1e340 <ompi_mpi_byte>, dst=1, tag=-16,
>>>>> sendmode=MCA_PML_BASE_SEND_STANDARD,
>>>>>     comm=0x601280 <ompi_mpi_comm_world>) at pml_ob1_isend.c:259
>>>>> #7  0x00007f910d9c3b38 in ompi_coll_base_barrier_intra_basic_linear
>>>>> (comm=0x601280 <ompi_mpi_comm_world>, module=0x1b092c0) at
>>>>> base/coll_base_barrier.c:368
>>>>> #8  0x00007f910d91c6fd in PMPI_Barrier (comm=0x601280
>>>>> <ompi_mpi_comm_world>) at pbarrier.c:63
>>>>> #9  0x0000000000400b0b in main (argc=1, argv=0x7fff43ee3a58) at
>>>>> mpitest.c:26
>>>>> (gdb)
>>>>>
>>>>> The mpitest.c program is as follows:
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <string.h>
>>>>>
>>>>> int main(int argc, char** argv)
>>>>> {
>>>>>     int world_size, world_rank, name_len;
>>>>>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>>>>     MPI_Get_processor_name(hostname, &name_len);
>>>>>     printf("Hello world from processor %s, rank %d out of %d processors\n", hostname, world_rank, world_size);
>>>>>     if (world_rank == 1)
>>>>>     {
>>>>>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>         printf("%s received %s\n", hostname, buf);
>>>>>     }
>>>>>     else
>>>>>     {
>>>>>         strcpy(buf, "haha!");
>>>>>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>>>>         printf("%s sent %s\n", hostname, buf);
>>>>>     }
>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> The hostfile is as follows:
>>>>> 10.10.10.10 slots=1
>>>>> 10.10.10.11 slots=1
>>>>>
>>>>> The two nodes are connected by three physical and three logical networks:
>>>>> Physical: Gigabit Ethernet, 10G iWARP, 20G InfiniBand
>>>>> Logical: IP (all 3), PSM (QLogic InfiniBand), verbs (iWARP and
>>>>> InfiniBand)
>>>>>
>>>>> Please note again that this is a fresh, brand new clone.
>>>>>
>>>>> Is this a bug (perhaps a side effect of --disable-dlopen) or something
>>>>> I am doing wrong?
>>>>>
>>>>> Thanks
>>>>> Durga
>>>>>
>>>>> We learn from history that we never learn from history.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2016/04/28930.php
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> <http://www.open-mpi.org/community/lists/users/2016/04/28932.php>
>>>>
>>>>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/04/28951.php
> ...
