Hi, and thanks for the feedback everyone.

George Bosilca wrote:
Brian is completely right. Here is a more detailed description of this problem.
[....]
On the other hand, I hope that not many users write such applications. Overloading one process with messages is the best way to completely kill the performance of any MPI implementation. This is exactly what MPI_Reduce and MPI_Gather do: one process gets the final result and all other processes only have to send some data. This behavior only arises when the gather or the reduce uses a very flat tree, and only for short messages. Because the messages are short, there is no handshake between the sender and the receiver, which makes all messages unexpected, and the flat tree guarantees that there will be a lot of small messages. If you add a barrier every now and then (every 100 iterations, say), this problem will never happen.

I have done some more testing. Among the parameters I tested, I'm observing this behaviour with group sizes from 16 to 44 and with 1 to 32768 integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes from 16 to 44 and with 1 to 4096 integers (per node).

In other words, it actually happens with other tree configurations and larger packet sizes :-/
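
To make the mechanism George describes a bit more concrete, here is a toy model of the root's unexpected-message queue. It is only a sketch: the function name, the step-based timing and all the numbers are made up for illustration, and it does not reflect how Open MPI actually buffers messages. It assumes every sender eagerly fires one short message per iteration, the root drains a fixed number of messages per step, and a barrier cannot complete until the root has consumed everything sent before it.

```python
def peak_unexpected(senders, iters, drain_per_step, barrier_every=None):
    """Toy model: peak number of messages queued (unexpected) at the root.

    All parameters are hypothetical knobs for this sketch, not MPI settings.
    """
    queued = 0
    peak = 0
    for i in range(1, iters + 1):
        # Every sender eagerly sends one short message, with no handshake.
        queued += senders
        peak = max(peak, queued)
        # The root consumes as many messages as it can in this step.
        queued -= min(queued, drain_per_step)
        if barrier_every and i % barrier_every == 0:
            # The root only enters the barrier after finishing its earlier
            # receives, and nobody leaves before the root has entered,
            # so the backlog is gone once the barrier completes.
            queued = 0
    return peak


if __name__ == "__main__":
    # 15 senders, a root that handles 10 messages per step, 3000 iterations.
    print("no barrier:        ", peak_unexpected(15, 3000, 10))
    print("barrier every 100: ", peak_unexpected(15, 3000, 10, barrier_every=100))
```

In this model the backlog without a barrier grows linearly with the number of iterations, while with a barrier it is capped by the barrier interval, however long the loop runs.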

By the way, I'm also observing crashes with MPI_Bcast (groups of size 4 to 44, with the root process (rank 0) broadcasting integer arrays of size 16384 and 32768). It looks like the root process is crashing. Can a sender crash because it runs out of buffer space as well?

---------- snip --------------
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
---------- snip --------------
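
For reading the parameter line in the output above, the sizes appear to be related as sketched below. This is an inference from the printed numbers, not something the benchmark source confirms: it assumes 4-byte integers and that 'bufbytes' is the per-rank byte count multiplied by the group size. The helper name is hypothetical.

```python
INT_BYTES = 4  # assumption: a 4-byte integer type


def buffer_sizes(groupsize, count, int_bytes=INT_BYTES):
    """Return (bytes per rank, bytes across the whole group) for `count` integers."""
    bytes_per_rank = count * int_bytes
    # Assumed relationship: the larger buffer holds one block per rank.
    return bytes_per_rank, bytes_per_rank * groupsize


if __name__ == "__main__":
    # Values taken from the run above: groupsize 4, count 16384.
    print(buffer_sizes(4, 16384))
```

With the values from the run above, this reproduces the 'bytes' and 'bufbytes' fields in the output.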

One more thing: running a lot of collectives in a loop and computing the total time is not the correct way to evaluate the cost of any collective communication, simply because you will favor all algorithms based on pipelining. There is plenty of literature about this topic.

  george.

As I said in the original e-mail, I had only thrown them in for a bit of sanity checking. I expected funny numbers, but not that OpenMPI would crash.

The original idea was just to make a quick comparison of Allreduce, Allgather and Alltoall in LAM and OpenMPI. The opportunity for pipelining the operations there is rather small since they can't get much out of phase with each other.
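
George's point about timing loops can be illustrated with a small toy model. It is a simplification with made-up names and numbers, not a measurement: it assumes one collective takes `latency` time units from start to finish, and that a pipelined algorithm lets the timed process start the next call after only `gap` units.

```python
def loop_average(latency, gap, iters):
    """Average per-call time reported by a loop of `iters` back-to-back calls.

    latency: time for one call to complete end to end (hypothetical units)
    gap:     time before the next call can start; gap == latency means
             there is no overlap between consecutive calls
    """
    # Only the last call is paid for in full; the earlier ones overlap.
    total = latency + (iters - 1) * gap
    return total / iters


if __name__ == "__main__":
    print("single call:         ", loop_average(10.0, 2.0, 1))
    print("pipelined, 1000x:    ", loop_average(10.0, 2.0, 1000))
    print("not pipelined, 1000x:", loop_average(10.0, 10.0, 1000))
```

In this model the loop average for the pipelined case approaches `gap` rather than `latency`, which is why a long loop flatters pipelined algorithms. When the calls cannot drift out of phase, `gap` stays close to `latency` and the loop average is a fair estimate.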


Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/

