On Apr 20, 2010, at 8:55 AM, Jonathan Dursi wrote:

> We've got OpenMPI 1.4.1 and Intel MPI running on our 3000 node system. We
> like OpenMPI for large jobs, because the startup time is much faster (and
> startup is more reliable) than the current defaults with IntelMPI; but we're
> having some pretty serious problems when the jobs are actually running.
> When running medium- to large-sized jobs (say, anything over 500 cores) over
> ethernet using OpenMPI, several of our users, using a variety of very
> different sorts of codes, report errors like this:
>
> [gpc-f102n010][[30331,1],212][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
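For context, that error string is printed when a readv() on an already-connected socket fails; the number in parentheses is the Linux errno (110 = ETIMEDOUT, and the 104 seen further below is ECONNRESET). A minimal sketch of that kind of receive path is shown here -- illustrative only, not the actual Open MPI TCP BTL code; the name recv_fragment is made up:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Illustrative sketch only -- not the actual Open MPI source.  It shows
 * how a readv() failure on an established socket turns into a
 * "readv failed: Connection timed out (110)" style message. */
static int recv_fragment(int sd, struct iovec *iov, int iov_count)
{
    ssize_t cnt = readv(sd, iov, iov_count);
    if (cnt < 0) {
        switch (errno) {
        case EINTR:
        case EAGAIN:
            return 0;               /* transient: try again later */
        default:
            /* ETIMEDOUT (110) and ECONNRESET (104) end up here */
            fprintf(stderr, "readv failed: %s (%d)\n",
                    strerror(errno), errno);
            return -1;              /* caller closes the socket / aborts */
        }
    }
    if (cnt == 0) {
        /* orderly shutdown by the peer: also treated as a lost connection */
        fprintf(stderr, "peer closed connection\n");
        return -1;
    }
    return (int)cnt;                /* bytes received */
}

int main(void)
{
    int sv[2];
    char buf[64];
    struct iovec iov = { buf, sizeof(buf) };

    /* Tiny self-test: close one end so the other sees the peer hang up. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return 1;
    close(sv[1]);
    recv_fragment(sv[0], &iov, 1);      /* prints "peer closed connection" */
    close(sv[0]);
    return 0;
}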
That's odd -- this error message indicates that the TCP BTL had previously successfully established the connection and was trying to receive an MPI message on the socket. But then reading from the socket timed out. Hmm.

> which sometimes hang the job, or sometimes kill it outright:
>
> [gpc-f114n073][[23186,1],109][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> [gpc-f114n075][[23186,1],125][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

This one is a little different -- it indicates that n075 saw its peer hang up the socket, which it determined to be a fatal error, and therefore aborted. (Sidenote: I'm not sure why we don't judge a connection timeout to be the same kind of fatal error... hmmm...)

It's likely that n073 saw the timeout, closed the socket, and then n075 saw that hangup. It might not happen in all scenarios, because n075 might not see the hangup unless it's actively trying to read something from that peer's socket. Regardless, the real question is: why is the socket timing out?

> Unfortunately, this only happens intermittently, and only with large jobs, so
> it is hard to track down. It seems to happen more reliably with larger
> numbers of processors, but I don't know if that tells us something real about
> the issue, or just that larger N -> better statistics. For one user's
> code, it definitely occurs during an MPI_Wait (this particular code has been
> run on a wide variety of machines with a wide variety of MPIs -- which isn't
> proof of correctness of course, but everything looks fine), for others it is
> less clear.

I think it's reasonable to see this in MPI_Wait -- it means that OMPI was notified that there was something to read off a particular socket file descriptor and was trying to read it (and then timed out). I'll bet that the others all died in some kind of communication with a specific peer (regardless of whether it was in a collective or a point-to-point communication call).

> I don't know if it's an OpenMPI issue, or just represents a network issue
> which Intel's MPI happens to be more tolerant of with the default set of
> parameters. It's also unclear whether or not this issue occurred with
> earlier OpenMPI versions.
>
> Where should I start looking to find out what is going on? Are there
> parameters that can be adjusted to play with timeouts to see if the issue can
> be localized, or worked around?

Can you see if there are any kernel parameters to adjust how many fds you can have open simultaneously, and the length of TCP socket timeouts?

--
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
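On that last suggestion, a minimal sketch of how one might dump the per-process fd limit and a few of the Linux TCP timeout/retry settings on a compute node is shown below. It is illustrative only: which of these knobs actually matters for this cluster is an assumption, and the /proc paths are Linux-specific.

#include <stdio.h>
#include <sys/resource.h>

/* Print one kernel setting from /proc (Linux-specific). */
static void show_sysctl(const char *path)
{
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fgets(buf, sizeof(buf), f)) {
            printf("%-40s %s", path, buf);
        }
        fclose(f);
    }
}

int main(void)
{
    struct rlimit rl;

    /* How many fds can this process have open at once? */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
        printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    }

    /* TCP retry/keepalive knobs that influence how long the kernel waits
     * before giving up on a connection (i.e., before ETIMEDOUT). */
    show_sysctl("/proc/sys/net/ipv4/tcp_retries2");
    show_sysctl("/proc/sys/net/ipv4/tcp_keepalive_time");
    show_sysctl("/proc/sys/net/ipv4/tcp_syn_retries");
    return 0;
}

Equivalently, "ulimit -n" and "sysctl net.ipv4.tcp_retries2" (etc.) from a shell show the same information; comparing a node that produced the timeout against a healthy one would be a reasonable first step.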