Open MPI's handling of failure scenarios has been improving over time.
It looks like the version you are using did not detect the failure,
did not cleanly shut down the remaining MPI processes afterwards, or
both.
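
As a quick sanity check, the errno=104 in the quoted log can be decoded
with a few lines of Python. This is only a sketch: the helper name
describe_errno is made up for illustration, and errno numbering is
platform-specific. On typical Linux x86 systems, 104 is ECONNRESET
("Connection reset by peer"), which matches the FAQ entry quoted below.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    # errno.errorcode maps numeric values to names such as "ECONNRESET".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's human-readable message.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 is the value reported in the log lines; its meaning depends on
    # the operating system, so run this on the machine that logged it.
    name, message = describe_errno(104)
    print(f"errno=104 -> {name}: {message}")
```

If the name comes back as ECONNRESET, the peer process on the other end
of the TCP connection most likely went away while this process was
still trying to write to it.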
On Nov 6, 2007, at 9:01 PM, Teng Lin wrote:
Hi,
I just realized that I have a job that has been running for a long
time, even though some of its nodes have already died. Is there any way
to tell the other nodes to quit?
[kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
[kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with errno=104
The FAQ says this error is related to:
Connection reset by peer: These types of errors usually occur after
MPI_INIT has completed, and typically indicate that an MPI process has
died unexpectedly (e.g., due to a seg fault). The specific error
message indicates that a peer MPI process tried to write to the now-
dead MPI process and failed.
Thanks,
Teng
--
Jeff Squyres
Cisco Systems