Hi,
An MPI job runs on two nodes, and at first everything seems fine.
However, in the middle of the run the program aborts with the following
error:


[compute-0-1.local][[47664,1],14][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-3.local][[47664,1],11][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-3.local][[47664,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 4989 on node compute-0-1 exited
on signal 4 (Illegal instruction).
--------------------------------------------------------------------------
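
For reference, "signal 4" in the mpirun message above is the numeric form
of SIGILL on Linux. A minimal sketch for checking that mapping on a node,
assuming Python 3.8 or later is installed there (the helper name
describe_signal is hypothetical and is not part of Open MPI):

```python
import signal


def describe_signal(signum: int) -> str:
    """Return 'NAME: description' for a signal number on this platform."""
    # signal.Signals maps the number to the platform's signal enum member.
    name = signal.Signals(signum).name
    # signal.strsignal returns the system's description, or None if unknown.
    text = signal.strsignal(signum) or "no description available"
    return f"{name}: {text}"


if __name__ == "__main__":
    # 4 is the signal number reported by mpirun in the log above.
    print(describe_signal(4))
```

This only confirms which signal was delivered. It does not show why rank 0
received it.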


There are 8 processes on that node, and each consumes about 150 MB of memory.
Total memory usage is about 1% of the node's memory.

I found some discussions on the web about memory errors, but none of them
gives a clear answer. What does the "Illegal instruction" signal mean here?




Regards,
Mahmood
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
