John --

+1 on what Gilles said.

The initial error says that a broadcast message was truncated.  This likely 
indicates that one process is calling MPI_Bcast with a different message size 
than its peers (it *could* instead be what Gilles mentioned about 
different-but-supposed-to-be-compatible datatypes, but more often than not, 
it's a simple accounting error in message lengths).
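
As a rough sketch of how you might hunt for such a mismatch: have each rank 
compute the number of bytes it is about to broadcast (count times the result 
of MPI_Type_size), collect those values on one rank (e.g., with MPI_Gather) 
right before the suspect MPI_Bcast, and compare them.  The helper below is 
hypothetical (the name first_mismatched_rank is mine, not part of MPI or 
Open MPI) and only does the comparison step, so it assumes the sizes have 
already been gathered into an array indexed by rank.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical debugging helper; not an MPI or Open MPI function.
 *
 * bytes_per_rank[i] is the message size in bytes that rank i is about to
 * pass to MPI_Bcast. Returns the lowest rank whose size differs from the
 * root's size, or -1 if every rank agrees with the root.
 */
int first_mismatched_rank(const long *bytes_per_rank, int nranks, int root)
{
    assert(bytes_per_rank != NULL);
    assert(nranks > 0 && root >= 0 && root < nranks);

    for (int rank = 0; rank < nranks; ++rank) {
        if (bytes_per_rank[rank] != bytes_per_rank[root]) {
            return rank;  /* this rank disagrees with the root */
        }
    }
    return -1;  /* all ranks use the same message size */
}
```

Because it compares byte sizes rather than datatypes, this check flags 
accounting errors in message lengths but treats one vector of N MPI_DOUBLE 
and N MPI_DOUBLE as equal, which is what you want for compatible signatures.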

Also, as a side note: I notice you're running Open MPI 1.6.5.  That's pretty 
ancient.  Any chance you can upgrade to something more modern, like Open MPI 
1.10.x?



On February 15, 2016 at 7:04:15 PM, Gilles Gouaillardet (gil...@rist.or.jp) 
wrote:
> John,
>  
> the readv error is likely a consequence of the abort, and not the root
> cause of the issue.
>  
> an obvious user error is if not all MPI tasks call MPI_Bcast with
> compatible signatures.
>  
> the coll/tuned module is known to be broken when using different but
> compatible signatures.
> for example, one process calls MPI_Bcast with one vector of N MPI_DOUBLE,
> and another process calls MPI_Bcast with N MPI_DOUBLE.
>  
> you can try to
>  
> mpirun --mca coll ^tuned ...
>  
> and see if it helps
>  
> fwiw, OpenMPI 1.6.5 is quite old nowadays...
>  
> Cheers,
>  
> Gilles
> On 2/16/2016 7:28 AM, JR Cary wrote:
> > We have distributed a binary to a person with a Linux cluster. When
> > he runs our binary, he gets
> >
> > [server1:10978] *** An error occurred in MPI_Bcast
> > [server1:10978] *** on communicator MPI COMMUNICATOR 8 DUP FROM 7
> > [server1:10978] *** MPI_ERR_TRUNCATE: message truncated
> > [server1:10978] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > [server2][[14125,1],2][/..../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> >
> > Anyone have any ideas on how to debug this?
> >
> > Thanks......John Cary
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2016/02/28534.php
> >
>  
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/02/28535.php  
>  

--  
Jeff Squyres
jsquy...@cisco.com  
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
