In the absence of a clear error message, the btl_tcp_frag errors can
suggest that a process was killed by the OOM killer.
That is not the case here, since rank 0 died because of an illegal
instruction.

Are you running under a batch manager?
On which architecture?
Do your compute nodes have exactly the same architecture as the node used
to compile your libs and apps?
That kind of error can occur if your app was built with AVX2 instructions
(e.g. for the latest Intel Xeon) but runs on a previous-generation
processor that is not AVX2 capable.
I guess the same thing can occur if different ARM versions are involved.
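
A quick way to compare the nodes is to check which CPU feature flags each
one advertises. The following is only a minimal sketch, assuming Linux
nodes where /proc/cpuinfo has a "flags" line (x86) or a "Features" line
(ARM). The helper name has_cpu_flag is made up for this example, and its
optional second argument exists only so it can be pointed at a saved copy
of cpuinfo.

```shell
# Return success if the CPU feature flag $1 is listed in the cpuinfo file.
# $2 is optional and defaults to /proc/cpuinfo (Linux only).
has_cpu_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Keep only the feature lines, then look for the flag as a whole word
    # so that "avx" does not match "avx2".
    grep -E '^(flags|Features)[[:space:]]*:' "$cpuinfo" | grep -q -w -- "$flag"
}

if has_cpu_flag avx2; then
    echo "$(uname -n): avx2 supported"
else
    echo "$(uname -n): avx2 NOT supported"
fi
```

Run it on the build node and on each compute node (for example through
ssh). If the build node reports avx2 and a compute node does not, that
would be consistent with the illegal instruction.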

Can you run
ulimit -c unlimited
and then mpirun again?
Hopefully you will get a core file that points you to the illegal
instruction.
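
Spelled out, the steps could look like the sketch below. The application
name ./my_app, the process count and the core file name are placeholders,
not taken from your setup; the actual core file name depends on how the
node is configured. Note also that ssh or a batch manager may apply its
own limits on the remote nodes, so it can be worth checking the limit
there too.

```shell
# Raise the core file size limit for this shell; processes started from
# it inherit the limit. Print a warning if the hard limit forbids it.
if ulimit -c unlimited 2>/dev/null; then
    echo "core file limit: $(ulimit -c)"
else
    echo "could not raise the core file limit (hard limit: $(ulimit -H -c))"
fi

# Then, by hand (placeholder names, adjust to your job):
#   mpirun -np 16 ./my_app
#   gdb ./my_app <core file>
#   (gdb) bt          # backtrace of the crashed rank
#   (gdb) x/i $pc     # disassemble the instruction at the faulting address
```

If the instruction shown by gdb is a vector instruction that the CPU on
that node does not support, the build and the compute node do not match.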



Cheers,

Gilles

On Tuesday, August 30, 2016, Mahmood Naderan <mahmood...@gmail.com> wrote:

> Hi,
> An MPI job is running on two nodes and everything seems to be fine.
> However, in the middle of the run, the program aborts with the following
> error
>
>
> [compute-0-1.local][[47664,1],14][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [compute-0-3.local][[47664,1],11][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [compute-0-3.local][[47664,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 4989 on node compute-0-1
> exited on signal 4 (Illegal instruction).
> --------------------------------------------------------------------------
>
>
> There are 8 processes on that node and each consumes about 150MB of
> memory. The total memory usage is about 1% of the memory.
>
> There are some discussions on the web about memory error but there is no
> clear answer for that. What does that illegal instruction mean?
>
>
>
>
> Regards,
> Mahmood
>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
