Hi,

I just realized that one of my jobs has been running for a long time even though some of its nodes have already died. Is there any way to tell the remaining nodes to quit?


[kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
[kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with errno=104

The FAQ says this is related to:
Connection reset by peer: These types of errors usually occur after MPI_INIT has completed, and typically indicate that an MPI process has died unexpectedly (e.g., due to a seg fault). The specific error message indicates that a peer MPI process tried to write to the now-dead MPI process and failed.
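
For reference, here is a small sketch of how the numeric errno in the log can be checked against its symbolic name. It is not part of Open MPI; describe_errno is a hypothetical helper name, and errno numbers are platform-specific, so I am assuming it is run on the same kind of Linux system that produced the log, where 104 should come back as ECONNRESET.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform.

    Hypothetical helper for illustration only; the numbering is
    platform-specific, so run it on the machine that wrote the log.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 is the value reported by mca_btl_tcp_frag_send in the log above.
    name, message = describe_errno(104)
    print(f"errno=104 -> {name}: {message}")
```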

Thanks,
Teng
