John -- +1 on what Gilles said.
The initial error says that a broadcast message was truncated. This likely indicates that at least one process is calling MPI_Bcast with a different message size than its peers (it *could* indicate what Gilles mentioned about different-but-supposed-to-be-compatible datatypes, but more often than not, it's a simple accounting error in message lengths).

Also, as a side note: I notice you're running Open MPI 1.6.5. That's pretty ancient. Any chance you can upgrade to something more modern, like Open MPI 1.10.x?

On February 15, 2016 at 7:04:15 PM, Gilles Gouaillardet (gil...@rist.or.jp) wrote:
> John,
>
> the readv error is likely a consequence of the abort, and not the root
> cause of the issue.
>
> an obvious user error is if not all MPI tasks call MPI_Bcast with
> compatible signatures.
>
> coll/tuned module is known to be broken when using different but
> compatible signatures.
> for example, one process MPI_Bcast one vector of N MPI_DOUBLE, and one
> other process MPI_Bcast N MPI_DOUBLE.
>
> you can try to
>
> mpirun --mca coll ^tuned ...
>
> and see if it helps
>
> fwiw, OpenMPI 1.6.5 is quite old nowadays...
>
> Cheers,
>
> Gilles
> On 2/16/2016 7:28 AM, JR Cary wrote:
> > We have distributed a binary to a person with a Linux cluster. When
> > he runs our binary, he gets
> >
> > [server1:10978] *** An error occurred in MPI_Bcast
> > [server1:10978] *** on communicator MPI COMMUNICATOR 8 DUP FROM 7
> > [server1:10978] *** MPI_ERR_TRUNCATE: message truncated
> > [server1:10978] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > [server2][[14125,1],2][/..../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> >
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> >
> > Anyone have any ideas on how to debug this?
> >
> > Thanks......John Cary
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2016/02/28534.php
> >
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28535.php
>
-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
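
For the "accounting error in message lengths" case discussed in this thread, a minimal sketch of the comparison one might run in a debug build. It is plain C with no MPI calls, and it assumes the count each rank is about to pass to MPI_Bcast has already been collected into an array on one rank (for instance with MPI_Gather of one int per rank). The helper name first_mismatched_rank and its return codes are hypothetical, not part of Open MPI or the MPI standard.

```c
/* Hypothetical debug helper (not an MPI or Open MPI function).
 *
 * counts[r] is the element count that rank r intends to pass to
 * MPI_Bcast; the array is assumed to have been gathered beforehand.
 *
 * Returns the lowest rank whose count differs from the root's count,
 * -1 if every rank agrees with the root, or -2 on invalid arguments. */
int first_mismatched_rank(const int *counts, int nranks, int root)
{
    if (counts == 0 || nranks <= 0 || root < 0 || root >= nranks) {
        return -2;
    }
    for (int r = 0; r < nranks; ++r) {
        if (counts[r] != counts[root]) {
            return r;   /* this rank would send or expect a different size */
        }
    }
    return -1;
}
```

With counts of {100, 100, 64, 100} and root 0, the helper reports rank 2, which is the kind of mismatch that can surface as MPI_ERR_TRUNCATE. Note that it only compares counts; it does not examine datatypes, so it would not catch the different-but-compatible signature case that Gilles describes.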