Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-05-15 Thread Jeff Squyres
On May 10, 2010, at 11:00 AM, Guanyinzhu wrote: > Did "--mca mpi_preconnect_all 1" work? > > I also face this problem "readv failed: Connection timed out" in the > production environment, and our engineer has reproduced this scenario on 20 > nodes with gigabit Ethernet and limit one ethernet sp
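
As a rough illustration of what "preconnecting" amounts to (a sketch under my own assumptions, not Open MPI internals; the preconnect_all() helper is hypothetical): each rank exchanges a zero-byte message with every peer at startup, so the connections are set up eagerly instead of on first use, which is roughly what the mpi_preconnect_all parameter asks the library to do.

    /* Hypothetical manual preconnect sketch; not taken from the thread. */
    #include <mpi.h>

    static void preconnect_all(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int offset = 1; offset < size; offset++) {
            int dest = (rank + offset) % size;
            int src  = (rank - offset + size) % size;
            /* A zero-byte sendrecv touches the connection without moving data. */
            MPI_Sendrecv(NULL, 0, MPI_BYTE, dest, 0,
                         NULL, 0, MPI_BYTE, src,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        preconnect_all(MPI_COMM_WORLD);
        /* ... application communication ... */
        MPI_Finalize();
        return 0;
    }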

Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-05-10 Thread Guanyinzhu
endpoint_recv_handler(), the nonblocking readv() call on a failed connection's fd returns -1 and sets errno to 110, which means the connection timed out. > From: ljdu...@scinet.utoronto.ca > Date: Tue, 20 Apr 2010 09:24:17 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI us
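
A minimal sketch of the failure mode being described, assuming a plain POSIX socket handler rather than the actual Open MPI BTL code: readv() on a connection that has failed returns -1 with errno set to ETIMEDOUT (110 on Linux), which is where the "readv failed: Connection timed out" message comes from.

    /* Sketch only; the function name and buffer size are assumptions. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>

    void handle_recv(int fd)
    {
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        ssize_t n = readv(fd, &iov, 1);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return;                       /* nothing to read yet; try again later */
            if (errno == ETIMEDOUT)           /* errno 110: "Connection timed out" */
                fprintf(stderr, "readv failed: %s\n", strerror(errno));
            /* other errors (ECONNRESET, EPIPE, ...) would also mean the peer is gone */
        }
    }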

Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-04-21 Thread Jeff Squyres
On Apr 20, 2010, at 8:55 AM, Jonathan Dursi wrote: > We've got OpenMPI 1.4.1 and Intel MPI running on our 3000-node system. We > like OpenMPI for large jobs, because the startup time is much faster (and > startup is more reliable) than the current defaults with Intel MPI; but we're > having so

Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-04-20 Thread Jonathan Dursi
On 2010-04-20, at 9:18 AM, Terry Dontje wrote: > Hi Jonathan, > > Do you know what the top-level function or communication pattern is? Is it > some type of collective, or a pattern that has a many-to-one? Ah, should have mentioned. The best-characterized code that we're seeing this with is an

Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-04-20 Thread Terry Dontje
Hi Jonathan, Do you know what the top-level function or communication pattern is? Is it some type of collective, or a pattern that has a many-to-one? What might be happening is that, since OMPI uses lazy connections by default, if all processes are trying to establish communications to the same
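
For illustration only (my own example, not the poster's application): a many-to-one pattern like the reduction below makes every rank open its lazily-created connection to the root at the same moment, which is exactly the kind of simultaneous connection burst being asked about.

    /* Sketch of a many-to-one hot spot: all ranks talk to rank 0 at once. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, value, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        value = rank;
        /* With lazy connections, the first call here forces every rank's
         * TCP connection to rank 0 to be established in one burst. */
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }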

[OMPI users] 'readv failed: Connection timed out' issue

2010-04-20 Thread Jonathan Dursi
Hi: We've got OpenMPI 1.4.1 and Intel MPI running on our 3000-node system. We like OpenMPI for large jobs, because the startup time is much faster (and startup is more reliable) than the current defaults with Intel MPI; but we're having some pretty serious problems when the jobs are actually r