George Bosilca wrote:

[.....]

I don't think the root crashed. I guess that one of the other nodes crashed, the root got a bad socket (which is what the first error message seems to indicate), and got terminated. As the output is not synchronized between the nodes, one cannot rely on its order or contents. Moreover, mpirun reports that the root was killed with signal 15, which is how we clean up the remaining processes when we detect that something really bad (like a seg fault) happened in the parallel application.

Sorry, I should have rephrased that as a question ("is it the root?"). I'm not that familiar with the debug output of OpenMPI yet, so I included it in case somebody could make more sense of it than I did.


There are many differences between rooted and non-rooted collectives. All the errors that you have reported so far are related to rooted collectives, which makes sense. I didn't state that it is normal that Open MPI does not behave [sic]. I wonder if you can get such errors with non-rooted collectives (such as allreduce, allgather and alltoall), or with messages larger than the eager size?

You're right, I haven't seen any crashes with the All*-variants.

TCP eager limit is set to 65536 (output from ompi_info):

    MCA btl: parameter "btl_tcp_eager_limit" (current value: "65536")
    MCA btl: parameter "btl_tcp_min_send_size" (current value: "65536")
    MCA btl: parameter "btl_tcp_max_send_size" (current value: "131072")

I observed crashes with Broadcast and Reduce at 131072 bytes. I'm experimenting with larger messages now; while Reduce with 16 nodes seems stable at 262144-byte messages, it still crashes with 44 nodes.


If you type "ompi_info --param btl tcp", you will see the eager size for the TCP BTL. Everything smaller than this size will be sent eagerly; such messages have the opportunity to become unexpected on the receiver side, which can lead to this problem. As a quick test, you can add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and this problem will not happen for sizes over 2K. This was the original solution for the flow control problem. If you know your application will generate thousands of unexpected messages, then you should set the eager limit to zero.

I tried running Reduce with 4096 ints (16384 bytes), 16 nodes and eager limit 2048:

mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 2048 ./ompi-crash 4096 2 3000
{ 'groupsize' : 16, 'count' : 4096, 'bytes' : 16384, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 2 }
[compute-2-2][0,1,10][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
[compute-3-2][0,1,14][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 30407 on node compute-0-0 exited on signal 15 (Terminated).
15 additional processes aborted (not shown)

This one runs Reduce with 1 integer per node and also crashes (with the eager limit set to 0):

mpirun -hostfile lamhosts.all.r360 -np 16 --mca btl_tcp_eager_limit 0 ./ompi-crash 1 2 3000
...

This is puzzling.
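
In case it makes the tests clearer, the Reduce case in ompi-crash boils down to roughly the loop below. This is only a simplified sketch: the real program also selects the benchmark from argv[2] ('bmno'), times the iterations and prints the parameter dictionary shown above, and the names here are illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* argv[2] selects the benchmark in the real program; only the
         * Reduce case is sketched here. */
        int count = (argc > 1) ? atoi(argv[1]) : 4096;  /* ints per Reduce */
        int iters = (argc > 3) ? atoi(argv[3]) : 3000;  /* iterations */
        int rank, size, i;
        int *sendbuf, *recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        sendbuf = calloc(count, sizeof(int));
        recvbuf = calloc(count, sizeof(int));

        if (rank == 0)
            printf("{ 'groupsize' : %d, 'count' : %d, 'iters' : %d }\n",
                   size, count, iters);

        /* No synchronization between iterations, so the non-root ranks can
         * run far ahead of the root and pile up unexpected messages there. */
        for (i = 0; i < iters; i++)
            MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, 0,
                       MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }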


I'm mostly familiarizing myself with OpenMPI at the moment, as well as poking around to see how the collective operations work and perform compared to LAM. Partly because I have some ideas I'd like to test out, and partly because I'm considering moving some student exercises over from LAM to OpenMPI. I don't expect to write actual applications that treat MPI like this myself, but on the other hand, not having to do throttling on top of MPI (roughly the kind of thing sketched below) could be an advantage in some application patterns.
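
For completeness, the throttling I mean is just a periodic synchronization point on top of the collective, something like the sketch below (the interval of 100 is arbitrary, and the function name is mine):

    #include <mpi.h>

    /* Application-level throttling: a barrier every 'interval' iterations
     * keeps the non-root ranks from running arbitrarily far ahead of the
     * root and flooding it with unexpected messages. */
    static void throttled_reduce_loop(int *sendbuf, int *recvbuf, int count,
                                      int iters, int interval)
    {
        int i;
        for (i = 0; i < iters; i++) {
            MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if ((i + 1) % interval == 0)
                MPI_Barrier(MPI_COMM_WORLD);
        }
    }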


Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/

