On Sep 25, 2007, at 4:25 AM, Rayne wrote:
Hi all, I'm using the SGE system on my school network,
and would like to know if the errors I received below
mean there's something wrong with my MPI_Recv
function.
[0,1,3][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
[0,1,2][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
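Generally, these errors indicate that the remote process has died.
On Linux, errno 104 is ECONNRESET ("Connection reset by peer"), so the
process at the other end of the TCP connection went away unexpectedly.
That usually means an abnormal termination due to a segmentation
fault or the like, so the problem is not necessarily in your MPI_Recv
call. You might want to run the code under a debugger
to see if it shows anything useful. If your cluster doesn't have a
parallel debugger like TotalView or DDT available, you can (for small
numbers of processes) get away with using xterm and gdb, something like: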
mpirun -np X -d xterm -e gdb <application>
This opens X xterms (one per process), each with a gdb running one
instance of the application.
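
As an aside, if you want to check what a numeric errno such as the 104
in the messages above means on a given machine, a small script can look
it up. This is only a sketch: describe_errno is a helper name made up
for illustration, and errno numbering is platform-specific, so the
result reflects the machine the script runs on, not necessarily the
node that logged the error.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform.

    describe_errno is a hypothetical helper for illustration only.
    """
    # errno.errorcode maps numeric values to names such as 'ECONNRESET'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 is the value reported in the btl_tcp_frag.c messages.
    name, message = describe_errno(104)
    print(f"errno=104 -> {name}: {message}")
```

Run it on one of the compute nodes rather than on your desktop, since
the numbering can differ between operating systems.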
Good luck,
Brian