Hello all
Attached is a simple MPI program (a modified version of a similar
program that was posted by another user). When run on a single-node
machine, this program hangs most of the time, as described in the
scenarios below. In all cases the OS was CentOS 7.
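Since not everyone may see the attachment, here is a rough sketch of
the shape of the program. This is a simplified approximation, NOT the
exact attached file; the slave() function, the 255-character receive
buffer and the final MPI_Barrier are taken from the backtraces below,
and everything else (names, tags, the exact messages) is illustrative:

/*
 * Rough sketch of the attached program (not the exact file): rank 0
 * acts as the master, every other rank runs slave().  A slave waits
 * for a request from the master, replies with a greeting describing
 * its host, and all ranks then meet in a final MPI_Barrier.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

#define MSG_LEN 255   /* matches count=255 in the MPI_Recv backtraces */

static void slave(void)
{
    char buf[MSG_LEN + 1];
    struct utsname uts;
    MPI_Status status;

    /* Wait for the master's request (this is the MPI_Recv the slave
     * process is stuck in, in the backtraces below). */
    MPI_Recv(buf, MSG_LEN, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);

    uname(&uts);
    snprintf(buf, sizeof(buf),
             "hostname: %s operating system: %s release: %s processor: %s",
             uts.nodename, uts.sysname, uts.release, uts.machine);

    /* Send the greeting back to the master. */
    MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 0, 2, MPI_COMM_WORLD);
}

static void master(int nprocs)
{
    char buf[MSG_LEN + 1];
    char request[] = "send your greeting";
    MPI_Status status;
    int i;

    printf("Now %d slave tasks are sending greetings.\n", nprocs - 1);
    for (i = 1; i < nprocs; i++) {
        /* Ask slave i for its greeting and collect the reply. */
        MPI_Send(request, (int)sizeof(request), MPI_CHAR, i, 1, MPI_COMM_WORLD);
        MPI_Recv(buf, MSG_LEN, MPI_CHAR, i, 2, MPI_COMM_WORLD, &status);
        printf("Greetings from task %d:\n  message: %s\n", i, buf);
    }
}

int main(int argc, char *argv[])
{
    int rank, nprocs, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(host, &len);
    printf("Process %d of %d running on host %s\n", rank, nprocs, host);

    if (rank == 0)
        master(nprocs);
    else
        slave();

    /* The hangs reported below show both ranks stuck at or before
     * this final barrier. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}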
Scenario 1: Open MPI v1.10, single-socket quad-core machine, with a
Chelsio T3 card (link down) and GigE (link up):
mpirun -np 2 <progname>
Backtraces of the two spawned processes are as follows:
(gdb) bt
#0 0x00007f6471647aba in mca_btl_vader_component_progress () at btl_vader_component.c:708
#1 0x00007f6475c6722a in opal_progress () at runtime/opal_progress.c:187
#2 0x00007f64767b7685 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../opal/threads/condition.h:78
#3 ompi_request_default_wait_all (count=2, requests=0x7ffd1d921530, statuses=0x7ffd1d921540) at request/req_wait.c:281
#4 0x00007f64709dd591 in ompi_coll_tuned_sendrecv_zero (stag=-16, rtag=-16, comm=<optimized out>, source=1, dest=1) at coll_tuned_barrier.c:78
#5 ompi_coll_tuned_barrier_intra_two_procs (comm=0x6022c0 <ompi_mpi_comm_world>, module=<optimized out>) at coll_tuned_barrier.c:324
#6 0x00007f64767c92e6 in PMPI_Barrier (comm=0x6022c0 <ompi_mpi_comm_world>) at pbarrier.c:70
#7 0x00000000004010bd in main (argc=1, argv=0x7ffd1d9217d8) at mpi_hello_master_slave.c:115
(gdb)
(gdb) bt
#0 mca_pml_ob1_progress () at pml_ob1_progress.c:45
#1 0x00007feeae7dc22a in opal_progress () at runtime/opal_progress.c:187
#2 0x00007feea9e125c5 in opal_condition_wait (c=<optimized out>, m=<optimized out>) at ../../../../opal/threads/condition.h:78
#3 ompi_request_wait_completion (req=0xe55200) at ../../../../ompi/request/request.h:381
#4 mca_pml_ob1_recv (addr=<optimized out>, count=255, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fff4a618000) at pml_ob1_irecv.c:118
#5 0x00007feeaf35068f in PMPI_Recv (buf=0x7fff4a618020, count=255, type=0x6020c0 <ompi_mpi_char>, source=<optimized out>, tag=<optimized out>, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fff4a618000) at precv.c:78
#6 0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
#7 0x00000000004010b3 in main (argc=1, argv=0x7fff4a6184d8) at mpi_hello_master_slave.c:113
(gdb)
Scenario 2:
Dual-socket hex-core machine with QLogic IB, Chelsio iWARP and Fibre
Channel (all links down) and GigE (link up); Open MPI compiled from
the master branch. It crashes as follows:
[durga@smallMPI Desktop]$ mpirun -np 2 ./mpi_hello_master_slave
mpi_hello_master_slave:39570 terminated with signal 11 at PC=20 SP=7ffd438c00b8. Backtrace:
mpi_hello_master_slave:39571 terminated with signal 11 at PC=20 SP=7ffee5903e08. Backtrace:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node smallMPI
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Scenario 3:
Exactly the same as scenario 2, but with a more explicit command line:
[durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm ./mpi_hello_master_slave
This hangs, with the following backtraces:
(gdb) bt
#0 0x00007ff6639f049d in nanosleep () from /lib64/libc.so.6
#1 0x00007ff663a210d4 in usleep () from /lib64/libc.so.6
#2 0x00007ff662f72796 in OPAL_PMIX_PMIX120_PMIx_Fence (procs=0x0, nprocs=0, info=0x0, ninfo=0) at src/client/pmix_client_fence.c:100
#3 0x00007ff662f4f0bc in pmix120_fence (procs=0x0, collect_data=0) at pmix120_client.c:255
#4 0x00007ff663f941af in ompi_mpi_init (argc=1, argv=0x7ffc18c9afd8, requested=0, provided=0x7ffc18c9adac) at runtime/ompi_mpi_init.c:813
#5 0x00007ff663fc9c33 in PMPI_Init (argc=0x7ffc18c9addc, argv=0x7ffc18c9add0) at pinit.c:66
#6 0x000000000040101f in main (argc=1, argv=0x7ffc18c9afd8) at mpi_hello_master_slave.c:94
(gdb) q
(gdb) bt
#0 0x00007f5af7646117 in sched_yield () from /lib64/libc.so.6
#1 0x00007f5af7323875 in amsh_ep_connreq_wrap () from /lib64/libpsm_infinipath.so.1
#2 0x00007f5af7324254 in amsh_ep_connect () from /lib64/libpsm_infinipath.so.1
#3 0x00007f5af732d0df in psm_ep_connect () from /lib64/libpsm_infinipath.so.1
#4 0x00007f5af7d94a69 in ompi_mtl_psm_add_procs (mtl=0x7f5af80f8500 <ompi_mtl_psm>, nprocs=2, procs=0xf53e60) at mtl_psm.c:312
#5 0x00007f5af7df3630 in mca_pml_cm_add_procs (procs=0xf53e60, nprocs=2) at pml_cm.c:134
#6 0x00007f5af7bcc0d1 in ompi_mpi_init (argc=1, argv=0x7ffc485a2f98, requested=0, provided=0x7ffc485a2d6c) at runtime/ompi_mpi_init.c:777
#7 0x00007f5af7c01c33 in PMPI_Init (argc=0x7ffc485a2d9c, argv=0x7ffc485a2d90) at pinit.c:66
#8 0x000000000040101f in main (argc=1, argv=0x7ffc485a2f98) at mpi_hello_master_slave.c:94
This seems to suggest that Open MPI is trying to connect over PSM even
though the link is down and PSM was not mentioned on the command line.
Is this behavior expected?
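(If it is, I am assuming the workaround would be to exclude the PSM
MTL explicitly, along the lines of

mpirun -np 2 -mca btl self,sm -mca mtl ^psm ./mpi_hello_master_slave

though I have not verified that this is the right knob; scenario 4
below forces the ob1 PML instead.)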
Scenario 4:
Exactly the same as scenario 3, but with an even more explicit command line:
[durga@smallMPI Desktop]$ mpirun -np 2 -mca btl self,sm -mca pml ob1 ./mpi_hello_master_slave
This hangs towards the end, after printing the following output (as
opposed to scenario 3, which hangs at the connection-setup stage
without printing anything):
Process 0 of 2 running on host smallMPI
Now 1 slave tasks are sending greetings.
Process 1 of 2 running on host smallMPI
Greetings from task 1:
message type: 3
msg length: 141 characters
message:
hostname: smallMPI
operating system: Linux
release: 3.10.0-327.13.1.el7.x86_64
processor: x86_64
Backtraces of the two processes are as follows:
(gdb) bt
#0 opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:180
#1 0x00007f10f46e50e4 in opal_progress () at runtime/opal_progress.c:161
#2 0x00007f10f58a9d8b in opal_condition_wait (c=0x7f10f5df3c40 <ompi_request_cond>, m=0x7f10f5df3bc0 <ompi_request_lock>) at ../opal/threads/condition.h:76
#3 0x00007f10f58aa31b in ompi_request_default_wait_all (count=2, requests=0x7ffe7edd5a80, statuses=0x7ffe7edd5a50) at request/req_wait.c:287
#4 0x00007f10f596f225 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0x6022c0 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:63
#5 0x00007f10f596f92a in ompi_coll_base_barrier_intra_two_procs (comm=0x6022c0 <ompi_mpi_comm_world>, module=0xd5a7f0) at base/coll_base_barrier.c:308
#6 0x00007f10f599ffec in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x6022c0 <ompi_mpi_comm_world>, module=0xd5a7f0) at coll_tuned_decision_fixed.c:196
#7 0x00007f10f58c86fd in PMPI_Barrier (comm=0x6022c0 <ompi_mpi_comm_world>) at pbarrier.c:63
#8 0x00000000004010bd in main (argc=1, argv=0x7ffe7edd5d48) at mpi_hello_master_slave.c:115
(gdb) bt
#0 0x00007fffe9d6a988 in clock_gettime ()
#1 0x00007f704bf64edd in clock_gettime () from /lib64/libc.so.6
#2 0x00007f704b4deea5 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:183
#3 0x00007f704b2f50e4 in opal_progress () at runtime/opal_progress.c:161
#4 0x00007f704c6cc39c in opal_condition_wait (c=0x7f704ca03c40 <ompi_request_cond>, m=0x7f704ca03bc0 <ompi_request_lock>) at ../../../../opal/threads/condition.h:76
#5 0x00007f704c6cc560 in ompi_request_wait_completion (req=0x165e580) at ../../../../ompi/request/request.h:383
#6 0x00007f704c6cd724 in mca_pml_ob1_recv (addr=0x7fffe9cafa10, count=255, datatype=0x6020c0 <ompi_mpi_char>, src=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0) at pml_ob1_irecv.c:123
#7 0x00007f704c4ff434 in PMPI_Recv (buf=0x7fffe9cafa10, count=255, type=0x6020c0 <ompi_mpi_char>, source=0, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x7fffe9caf9f0) at precv.c:79
#8 0x0000000000400d49 in slave () at mpi_hello_master_slave.c:67
#9 0x00000000004010b3 in main (argc=1, argv=0x7fffe9cafec8) at mpi_hello_master_slave.c:113
(gdb) q
I am going to try the tarball shortly, but hopefully someone can glean
some insight from this much information. BTW, the code was compiled
with the following flags:
-Wall -Wextra -g3 -O0
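That is, roughly the following, assuming Open MPI's mpicc wrapper (the
output name is just illustrative):

mpicc -Wall -Wextra -g3 -O0 -o mpi_hello_master_slave mpi_hello_master_slave.c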
Let me reiterate that NO network communication was involved in any of
these experiments; they were all single-node shared-memory (sm BTL)
jobs.
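If it would help, I can also re-run any of these with the
component-selection verbosity turned up, e.g. by adding something like
-mca btl_base_verbose 100 (and -mca pml_base_verbose 100) to the
command lines above, and post that output as well.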
Thanks
Durga
1% of the executables have 99% of CPU privilege!
Userspace code! Unite!! Occupy the kernel!!!