In the absence of a clear error message, btl_tcp_frag-related error messages can suggest that a process was killed by the oom-killer. That is not the case here, since rank 0 died because of an illegal instruction.
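An illegal instruction (signal 4, SIGILL) usually means the CPU hit an opcode it does not implement, so one quick check is to compare CPU feature flags between the build node and each compute node. Below is a minimal sketch, assuming Linux nodes that expose /proc/cpuinfo; has_cpu_flag is a hypothetical helper name, not part of Open MPI, and avx2 is only an example flag.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]   (hypothetical helper, for illustration only)
# Exits 0 if FLAG is listed on the first "flags" line (x86) or
# "Features" line (ARM) of the cpuinfo file, non-zero otherwise.
# Assumption: a Linux-style /proc/cpuinfo is available.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    grep -E '^(flags|Features)[[:space:]]*:' "$cpuinfo" \
        | head -n 1 \
        | sed 's/^[^:]*://' \
        | tr ' \t' '\n\n' \
        | grep -qxF -- "$flag"
}

# Report the result for the local node, if cpuinfo is readable.
if [ -r /proc/cpuinfo ]; then
    if has_cpu_flag avx2; then
        echo "$(uname -n): avx2 supported"
    else
        echo "$(uname -n): avx2 NOT supported"
    fi
fi
```

Running this on the node where the libraries and the application were compiled and on each compute node (for example over ssh) shows whether they disagree. A flag that is present on the build node but missing on a compute node would be consistent with the illegal instruction.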
Are you running under a batch manager? On which architecture? Do your compute nodes have exactly the same architecture as the node used to compile your libraries and applications?

This kind of error can occur if your app was built with AVX2 instructions (e.g. for the latest Intel Xeon) but runs on an earlier-generation processor that is not AVX2 capable. I guess the same thing can occur if different ARM versions are involved.

Can you run ulimit -c unlimited and then mpirun again? Hopefully you will get a core file that points you to the illegal instruction.

Cheers,

Gilles

On Tuesday, August 30, 2016, Mahmood Naderan <mahmood...@gmail.com> wrote:

> Hi,
> An MPI job is running on two nodes and everything seems to be fine.
> However, in the middle of the run, the program aborts with the following
> error
>
> [compute-0-1.local][[47664,1],14][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [compute-0-3.local][[47664,1],11][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [compute-0-3.local][[47664,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 4989 on node compute-0-1
> exited on signal 4 (Illegal instruction).
> --------------------------------------------------------------------------
>
> There are 8 processes on that node and each consumes about 150MB of
> memory. The total memory usage is about 1% of the memory.
>
> There are some discussions on the web about memory error but there is no
> clear answer for that. What does that illegal instruction mean?
>
> Regards,
> Mahmood