Hi,

An MPI job is running on two nodes and everything seems to be fine. However, in the middle of the run, the program aborts with the following error:
[compute-0-1.local][[47664,1],14][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-3.local][[47664,1],11][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-3.local][[47664,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 4989 on node compute-0-1 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------

There are 8 processes on that node and each consumes about 150MB of memory. The total memory usage is about 1% of the node's memory. There are some discussions on the web about memory errors, but none of them gives a clear answer. What does "Illegal instruction" mean here?

Regards,
Mahmood
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users