Hi all - and sorry for the multiple postings, but I have more information.

1: After a reboot of the two nodes I ran the test again, and the inter-node freeze didn't happen until the third iteration. I take that to mean that the basic communication works, but that something is saturating. Is there some notion of buffer size somewhere in the MPI system that could explain this?
2: The nodes have 4 ethernet cards each. Could the mapping be a problem?
3: The CPUs are running at 100% for all processes involved in the freeze.
4: The same test program (http://code.google.com/p/pypar/source/browse/source/mpi_test.c) works fine when run within one node, so the problem must be with MPI and/or our network. (A simplified sketch of its ring-passing pattern is included after the output below.)
5: The network and ssh otherwise work fine.

Again, many thanks for any hint that can get us going again. The main thing we need is some diagnostics that may point to what causes this problem for MPI.

Cheers
Ole Nielsen

------
Here's the output which shows the freeze in the third iteration:

nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node5,node6 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node6
P2: Waiting to receive from to P1
P2: Sending to to P3
I am process 3 on node node6
P3: Waiting to receive from to P2
I am process 1 on node node5
P1: Waiting to receive from to P0
P1: Sending to to P2
P1: Waiting to receive from to P0
I am process 0 on node node5
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P3
P2: Waiting to receive from to P1
P3: Sending to to P0
P3: Waiting to receive from to P2
P1: Sending to to P2
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P3
P1: Waiting to receive from to P0
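PS: In case it saves a click, the communication pattern in mpi_test.c is essentially the ring below. This is only a simplified sketch I wrote for illustration (message size, contents and the exact print statements are assumptions); the actual program is the one at the URL above.

/*
 * Simplified sketch of the ring pattern used by mpi_test.c (illustration
 * only; see the URL above for the real program). In each run, P0 starts a
 * token around the ring P0 -> P1 -> ... -> P(N-1) -> P0, so every process
 * sits in a blocking MPI_Recv until its left neighbour's message arrives.
 * Assumes at least 2 processes; hostname reporting is omitted here.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, run;
    char buf[64] = "token";            /* small payload; size is an assumption */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (run = 0; run < 3; run++) {    /* "Test repeated 3 times for reliability" */
        if (rank == 0) {
            printf("Run %d of 3\n", run + 1);
            printf("P0: Sending to P1\n");
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", size - 1);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, size - 1, 0, MPI_COMM_WORLD, &status);
        } else {
            printf("P%d: Waiting to receive from P%d\n", rank, rank - 1);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD, &status);
            printf("P%d: Sending to P%d\n", rank, (rank + 1) % size);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

The point being that a single message that never makes it between node5 and node6 is enough to stall the whole ring, with every rank stuck in a blocking call - and if I understand correctly Open MPI busy-polls while waiting, which would also explain the 100% CPU mentioned in point 3.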