Hello Gilles,

Thank you for finding the bug; it was not in the original code. I introduced it while trying to 'simplify' the code.
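
For the record, the fix was what you pointed out: the receive in slave() uses MPI_ANY_TAG again. A minimal sketch of what that part of mpi_hello_master_slave.c now looks like (the buffer size, variable names and the plain printf are only illustrative, not the exact code):

#include <mpi.h>
#include <stdio.h>

static void slave(void)
{
    char       msg[256];
    MPI_Status status;

    /* accept whatever the master (rank 0) sends, regardless of tag */
    MPI_Recv(msg, 255, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    msg[255] = '\0';   /* make sure the buffer is printable */
    printf("%s\n", msg);
}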

With the bug fixed, the code now runs in the last scenario (scenario 4 below). But it still hangs with the following command line (even after updating to the latest git tree, rebuilding and reinstalling):

mpirun -np 2 -mca btl self,sm ./mpi_hello_master_slave

and the stack is still as before:

(gdb) bt
#0  0x00007f4e4bd60117 in sched_yield () from /lib64/libc.so.6
#1  0x00007f4e4ba3d875 in amsh_ep_connreq_wrap () from /lib64/libpsm_infinipath.so.1
#2  0x00007f4e4ba3e254 in amsh_ep_connect () from /lib64/libpsm_infinipath.so.1
#3  0x00007f4e4ba470df in psm_ep_connect () from /lib64/libpsm_infinipath.so.1
#4  0x00007f4e4c4c8975 in ompi_mtl_psm_add_procs (mtl=0x7f4e4c846500 <ompi_mtl_psm>, nprocs=2, procs=0x23bb420) at mtl_psm.c:312
#5  0x00007f4e4c52ef6b in mca_pml_cm_add_procs (procs=0x23bb420, nprocs=2) at pml_cm.c:134
#6  0x00007f4e4c2e7d0f in ompi_mpi_init (argc=1, argv=0x7fffe930f9b8, requested=0, provided=0x7fffe930f78c) at runtime/ompi_mpi_init.c:770
#7  0x00007f4e4c324aff in PMPI_Init (argc=0x7fffe930f7bc, argv=0x7fffe930f7b0) at pinit.c:66
#8  0x000000000040101f in main (argc=1, argv=0x7fffe930f9b8) at mpi_hello_master_slave.c:94

As you can see, OMPI is trying to communicate over the PSM link, even though that link is down and PSM is not mentioned in the arguments to mpirun. (The arguments do not even mention multiple nodes.) Is this the expected behaviour, or is it a bug?

Thanks
Durga

1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!

On Sun, Apr 24, 2016 at 8:12 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> two comments :
>
> - the program is incorrect : slave() should MPI_Recv(..., MPI_ANY_TAG, ...)
>
> - current master uses pmix114, and your traces mention pmix120,
>   so your master is out of sync, or pmix120 is an old module that was not manually removed.
>   fwiw, once in a while, I rm -rf /.../ompi_install_dir/lib/openmpi to get rid of the removed modules.
>
> Cheers,
>
> Gilles
>
> On 4/25/2016 7:34 AM, dpchoudh . wrote:
>> Hello all
>>
>> Attached is a simple MPI program (a modified version of a similar program that was posted by another user).
>> This program, when run on a single-node machine, hangs most of the time, as follows (in all cases the OS was CentOS 7):
>>
>> Scenario 1: OMPI v1.10, single-socket quad-core machine with a Chelsio T3 card (link down) and GigE (link up)
>>
>> mpirun -np 2 <progname>
>>
>> Backtraces of the two spawned processes are as follows:
>>
>> (gdb) bt
>> #0  0x00007f6471647aba in mca_btl_vader_component_progress () at btl_vader_component.c:708
>> #1  0x00007f6475c6722a in opal_progress () at runtime/opal_progress.c:187
>> #2  0x00007f64767b7685 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../opal/threads/condition.h:78
>> #3  ompi_request_default_wait_all (count=2, requests=0x7ffd1d921530, statuses=0x7ffd1d921540) at request/req_wait.c:281
>> #4  0x00007f64709dd591 in ompi_coll_tuned_sendrecv_zero (stag=-16, rtag=-16, comm=<optimized out>, source=1, dest=1) at coll_tuned_barrier.c:78
>> #5  ompi_coll_tuned_barrier_intra_two_procs (comm=0x6022c0 <ompi_mpi_comm_world>, module=<optimized out>) at coll_tuned_barrier.c:324
>> #6  0x00007f64767c92e6 in PMPI_Barrier (comm=0x6022c0 <ompi_mpi_comm_world>) at pbarrier.c:70
>> #7  0x00000000004010bd in main (argc=1, argv=0x7ffd1d9217d8) at mpi_hello_master_slave.c:115
>> (gdb)
>>
>> (gdb) bt
>> #0  mca_pml_ob1_progress () at pml_ob1_progress.c:45
>> #1  0x00007feeae7dc22a in opal_progress () at runtime/opal_progress.c:187
>> #2  0x00007feea9e125c5 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../../../../opal/threads/condition.h:78
>> #3  ompi_request_wait_completion (req=0xe55200) at ../../../../ompi/request/request.h:381
>> #4  mca_pml_ob1_recv (addr=<optimized out>, count=255, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fff4a618000) at pml_ob1_irecv.c:118
>> #5  0x00007feeaf35068f in PMPI_Recv (buf=0x7fff4a618020, count=255, type=0x6020c0 <ompi_mpi_char>, source=<optimized out>, tag=<optimized out>, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fff4a618000) at precv.c:78
>> #6  0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
>> #7  0x00000000004010b3 in main (argc=1, argv=0x7fff4a6184d8) at mpi_hello_master_slave.c:113
>> (gdb)
>>
>> Scenario 2:
>> Dual-socket hex-core machine with QLogic IB, Chelsio iWARP and Fibre Channel (all links down) and GigE (link up), OpenMPI compiled from the master branch. This crashes as follows:
>>
>> [durga@smallMPI Desktop]$ mpirun -np 2 ./mpi_hello_master_slave
>>
>> mpi_hello_master_slave:39570 terminated with signal 11 at PC=20 SP=7ffd438c00b8. Backtrace:
>>
>> mpi_hello_master_slave:39571 terminated with signal 11 at PC=20 SP=7ffee5903e08. Backtrace:
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 0 on node smallMPI exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> Scenario 3:
>> Exactly the same as scenario 2, but with a more explicit command line:
>>
>> [durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm ./mpi_hello_master_slave
>>
>> This hangs with the following backtraces:
>>
>> (gdb) bt
>> #0  0x00007ff6639f049d in nanosleep () from /lib64/libc.so.6
>> #1  0x00007ff663a210d4 in usleep () from /lib64/libc.so.6
>> #2  0x00007ff662f72796 in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0, nprocs=0, info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100
>> #3  0x00007ff662f4f0bc in pmix120_fence (procs=0x0, collect_data=0) at pmix120_client.c:255
>> #4  0x00007ff663f941af in ompi_mpi_init (argc=1, argv=0x7ffc18c9afd8, requested=0, provided=0x7ffc18c9adac) at runtime/ompi_mpi_init.c:813
>> #5  0x00007ff663fc9c33 in PMPI_Init (argc=0x7ffc18c9addc, argv=0x7ffc18c9add0) at pinit.c:66
>> #6  0x000000000040101f in main (argc=1, argv=0x7ffc18c9afd8) at mpi_hello_master_slave.c:94
>> (gdb) q
>>
>> (gdb) bt
>> #0  0x00007f5af7646117 in sched_yield () from /lib64/libc.so.6
>> #1  0x00007f5af7323875 in amsh_ep_connreq_wrap () from /lib64/libpsm_infinipath.so.1
>> #2  0x00007f5af7324254 in amsh_ep_connect () from /lib64/libpsm_infinipath.so.1
>> #3  0x00007f5af732d0df in psm_ep_connect () from /lib64/libpsm_infinipath.so.1
>> #4  0x00007f5af7d94a69 in ompi_mtl_psm_add_procs (mtl=0x7f5af80f8500 <ompi_mtl_psm>, nprocs=2, procs=0xf53e60) at mtl_psm.c:312
>> #5  0x00007f5af7df3630 in mca_pml_cm_add_procs (procs=0xf53e60, nprocs=2) at pml_cm.c:134
>> #6  0x00007f5af7bcc0d1 in ompi_mpi_init (argc=1, argv=0x7ffc485a2f98, requested=0, provided=0x7ffc485a2d6c) at runtime/ompi_mpi_init.c:777
>> #7  0x00007f5af7c01c33 in PMPI_Init (argc=0x7ffc485a2d9c, argv=0x7ffc485a2d90) at pinit.c:66
>> #8  0x000000000040101f in main (argc=1, argv=0x7ffc485a2f98) at mpi_hello_master_slave.c:94
>>
>> This seems to suggest that it is trying to connect over PSM even though that link was down and PSM was not mentioned on the command line. Is this behaviour expected?
>>
>> Scenario 4:
>> Exactly the same as scenario 3, but with an even more explicit command line:
>>
>> [durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm -mca pml ob1 ./mpi_hello_master_slave
>>
>> This hangs towards the end, after printing the output (as opposed to scenario 3, where it hangs at the connection-setup stage without printing anything):
>>
>> Process 0 of 2 running on host smallMPI
>>
>> Now 1 slave tasks are sending greetings.
>>
>> Process 1 of 2 running on host smallMPI
>> Greetings from task 1:
>> message type: 3
>> msg length: 141 characters
>> message:
>> hostname: smallMPI
>> operating system: Linux
>> release: 3.10.0-327.13.1.el7.x86_64
>> processor: x86_64
>>
>> Backtraces of the two processes are as follows:
>>
>> (gdb) bt
>> #0  opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:180
>> #1  0x00007f10f46e50e4 in opal_progress () at runtime/opal_progress.c:161
>> #2  0x00007f10f58a9d8b in opal_condition_wait (c=0x7f10f5df3c40 <ompi_request_cond>, m=0x7f10f5df3bc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
>> #3  0x00007f10f58aa31b in ompi_request_default_wait_all (count=2, requests=0x7ffe7edd5a80, statuses=0x7ffe7edd5a50) at request/req_wait.c:287
>> #4  0x00007f10f596f225 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0x6022c0 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
>> #5  0x00007f10f596f92a in ompi_coll_base_barrier_intra_two_procs (comm=0x6022c0 <ompi_mpi_comm_world>, module=0xd5a7f0) at base/coll_base_barrier.c:308
>> #6  0x00007f10f599ffec in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6022c0 <ompi_mpi_comm_world>, module=0xd5a7f0) at coll_tuned_decision_fixed.c:196
>> #7  0x00007f10f58c86fd in PMPI_Barrier (comm=0x6022c0 <ompi_mpi_comm_world>) at pbarrier.c:63
>> #8  0x00000000004010bd in main (argc=1, argv=0x7ffe7edd5d48) at mpi_hello_master_slave.c:115
>>
>> (gdb) bt
>> #0  0x00007fffe9d6a988 in clock_gettime ()
>> #1  0x00007f704bf64edd in clock_gettime () from /lib64/libc.so.6
>> #2  0x00007f704b4deea5 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:183
>> #3  0x00007f704b2f50e4 in opal_progress () at runtime/opal_progress.c:161
>> #4  0x00007f704c6cc39c in opal_condition_wait (c=0x7f704ca03c40 <ompi_request_cond>, m=0x7f704ca03bc0 <ompi_request_lock>) at ../../../../opal/threads/condition.h:76
>> #5  0x00007f704c6cc560 in ompi_request_wait_completion (req=0x165e580) at ../../../../ompi/request/request.h:383
>> #6  0x00007f704c6cd724 in mca_pml_ob1_recv (addr=0x7fffe9cafa10, count=255, datatype=0x6020c0 <ompi_mpi_char>, src=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0) at pml_ob1_irecv.c:123
>> #7  0x00007f704c4ff434 in PMPI_Recv (buf=0x7fffe9cafa10, count=255, type=0x6020c0 <ompi_mpi_char>, source=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0) at precv.c:79
>> #8  0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
>> #9  0x00000000004010b3 in main (argc=1, argv=0x7fffe9cafec8) at mpi_hello_master_slave.c:113
>> (gdb) q
>>
>> I am going to try the tarball shortly, but hopefully someone can get some insight out of this much information. BTW, the code was compiled with the following flags:
>>
>> -Wall -Wextra -g3 -O0
>>
>> Let me rehash that NO network communication was involved in any of these experiments; they were all single-node shared-memory (sm btl) jobs.
>>
>> Thanks
>> Durga
>>
>> 1% of the executables have 99% of CPU privilege!
>> Userspace code! Unite!! Occupy the kernel!!!
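
P.S. One more experiment I plan to try, to take PSM out of the equation entirely, is to exclude that MTL by name (the backtraces show mtl_psm.c, so the component should be 'psm' in this build; please correct me if that guess is wrong):

ompi_info | grep mtl
mpirun -np 2 -mca btl self,sm -mca mtl ^psm ./mpi_hello_master_slave

The first command just lists the MTL components this installation actually contains; the second uses the standard '^' exclusion syntax for MCA component lists. Of course this would only be a workaround; the question of why PSM is being considered at all still stands.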