George Bosilca wrote:
[.....]
I don't think the root crashed. I guess that one of the other nodes
crashed, the root got a bad socket (which is what the first error
message seems to indicate), and got terminated. As the output is not
synchronized between the nodes, one cannot rely on its order or
contents. Moreover, mpirun reports that the root was killed with signal
15, which is how we clean up the remaining processes when we detect
that something really bad (like a seg fault) happened in the parallel
application.
Sorry, I should have phrased that as a question ("is it the root?").
I'm not that familiar with the debug output of OpenMPI yet, so I
included it in case somebody could make more sense of it than I did.
There are many differences between the rooted and non-rooted
collectives. All errors that you reported so far are related to rooted
collectives, which makes sense. I didn't say that it is normal for
Open MPI to misbehave. I wonder if you can get such errors with
non-rooted collectives (such as allreduce, allgather and alltoall), or
with messages larger than the eager size?
You're right, I haven't seen any crashes with the All*-variants.
TCP eager limit is set to 65536 (output from ompi_info):
MCA btl: parameter "btl_tcp_eager_limit" (current value: "65536")
MCA btl: parameter "btl_tcp_min_send_size" (current value: "65536")
MCA btl: parameter "btl_tcp_max_send_size" (current value: "131072")
I observed crashes with Broadcasts and Reduces of 131072 bytes. I'm
playing around with larger messages now, and while Reduce with 16 nodes
seems stable at 262144-byte messages, it still crashes with 44 nodes.
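For reference, the inner loop of my test is basically just the chosen
rooted collective repeated back to back, with nothing else in between.
The real ompi-crash takes the count, benchmark number and iteration
count on the command line (as in the runs further down); the following
is only a minimal stand-alone sketch of that pattern with the sizes
hard-coded, not the actual source:

    /* reduce-stress.c: hammer a rooted collective in a tight loop.
     * Sketch only; the sizes mirror one of the runs shown below. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int count = 4096;   /* ints per operation (16384 bytes) */
        const int iters = 3000;
        int rank, i;
        int *sendbuf, *recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        sendbuf = malloc(count * sizeof(int));
        recvbuf = malloc(count * sizeof(int));
        for (i = 0; i < count; i++)
            sendbuf[i] = rank;

        /* Rooted collective: every iteration funnels data towards
         * rank 0, with no barrier or other flow control in between. */
        for (i = 0; i < iters; i++)
            MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM,
                       0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("done: %d iterations of MPI_Reduce(%d ints)\n",
                   iters, count);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }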
If you type "ompi_info --param btl tcp", you will see the eager size
for the TCP BTL. Everything smaller than this size will be sent
eagerly; it can therefore become unexpected on the receiver side, which
can lead to this problem. As a quick test, you can add
"--mca btl_tcp_eager_limit 2048" to your mpirun command line, and this
problem will not happen for sizes over 2K. This was the original
solution for the flow control problem. If you know your application
will generate thousands of unexpected messages, then you should set
the eager limit to zero.
I tried running Reduce with 4096 ints (16384 bytes), 16 nodes and eager
limit 2048:
mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 2048 ./ompi-crash 4096 2 3000
{ 'groupsize' : 16, 'count' : 4096, 'bytes' : 16384, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 2
[compute-2-2][0,1,10][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
[compute-3-2][0,1,14][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=104
mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 30407 on node compute-0-0 exited
on signal 15 (Terminated).
15 additional processes aborted (not shown)
This one tries to run Reduce with 1 integer per node and also crashes
(with eager size 0):
mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 0 ./ompi-crash 1 2 3000
...
This is puzzling.
I'm mostly familiarizing myself with OpenMPI at the moment, as well as
poking around to see how the collective operations work and perform
compared to LAM. This is partly because I have some ideas I'd like to
test out, and partly because I'm considering moving some student
exercises over from LAM to OpenMPI. I don't expect to write actual
applications that treat MPI like this myself, but on the other hand,
not having to do throttling on top of MPI could be an advantage in some
application patterns.
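By throttling I mean something like inserting an occasional barrier so
that no rank can run far ahead of the others. Roughly, using the loop
from the sketch above (the interval of 16 is an arbitrary number picked
just for illustration):

    /* Throttled variant: a barrier every THROTTLE iterations keeps
     * fast ranks from flooding the root with unexpected eager
     * messages. */
    #define THROTTLE 16
    for (i = 0; i < iters; i++) {
        MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if ((i + 1) % THROTTLE == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }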
Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/