Re: [OMPI users] MPI daemon error
There are some timeout issues you can see with large clusters on Torque - check the Torque web site for explanations and instructions on what to do about it. However, that doesn't appear to be the problem here. If our daemon doesn't report back, it is typically due to one or more of the following reasons:

1. it couldn't start because it didn't find the required libraries
2. it couldn't report back because it hit a firewall
3. it couldn't report back because it didn't find a network that would get it back to mpirun

From your other note, it sounds like #3 might be the problem here. Do you have some nodes that are configured with "eth0" pointing to your 10.x network, and other nodes with "eth0" pointing to your 192.x network? I have found that having interfaces that share a name but are on different IP networks sometimes causes OMPI to mis-connect. If you randomly got some of those nodes in your allocation, that might explain why your jobs sometimes work and sometimes don't.

On May 28, 2010, at 3:23 PM, Rahul Nabar wrote:

> On Fri, May 28, 2010 at 3:53 PM, Ralph Castain wrote:
>> What environment are you running on the cluster, and what version of OMPI?
>> Not sure that error message is coming from us.
>
> openmpi-1.4.1
> The cluster runs PBS-Torque. So I guess that could be the other error source.
>
> --
> Rahul
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI daemon error
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:

> From your other note, it sounds like #3 might be the problem here. Do you
> have some nodes that are configured with "eth0" pointing to your 10.x
> network, and other nodes with "eth0" pointing to your 192.x network? I have
> found that having interfaces that share a name but are on different IP
> addresses sometimes causes OMPI to miss-connect.
>
> If you randomly got some of those nodes in your allocation, that might
> explain why your jobs sometimes work and sometimes don't.

That is exactly true. On some nodes eth0 is 1Gig and on others 10Gig, and vice versa. Is that going to be a problem, and is there a workaround? I mean, 192.168 is always the 10Gig and 10.0 the 1Gig, but the correspondence with eth0 vs eth1 is not consistent. I'd have liked that, but couldn't figure out a way to guarantee the order of the eth interfaces.

--
Rahul
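[Since the interface names are inconsistent, one way to see which name carries which network on each node is to key off the address rather than the name. A minimal sketch - the sample text below stands in for real `ip -4 -o addr show` output (addresses are made up); on the cluster you would run that command on each node, e.g. over ssh:]

```shell
# Fake 'ip -4 -o addr show' output from one hypothetical node; field 2 is
# the interface name, field 4 the address/prefix it carries.
sample='2: eth0    inet 10.0.1.15/16 brd 10.0.255.255 scope global eth0
3: eth1    inet 192.168.1.15/24 brd 192.168.1.255 scope global eth1'

# Print whichever interface holds a 192.168.x address on this node
iface=$(printf '%s\n' "$sample" | awk '$4 ~ /^192\.168\./ {print $2}')
echo "$iface"
```

On the node whose sample output is shown above, this prints eth1; on a node with the names swapped, it would print eth0, which is exactly the inconsistency being described.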
Re: [OMPI users] MPI daemon error
On May 29, 2010, at 11:35 AM, Rahul Nabar wrote:

> On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:
>
>>> From your other note, it sounds like #3 might be the problem here. Do you
>>> have some nodes that are configured with "eth0" pointing to your 10.x
>>> network, and other nodes with "eth0" pointing to your 192.x network? I have
>>> found that having interfaces that share a name but are on different IP
>>> addresses sometimes causes OMPI to miss-connect.
>>
>> If you randomly got some of those nodes in your allocation, that might
>> explain why your jobs sometimes work and sometimes don't.
>
> That is exactly true. On some nodes eth0 is 1Gig and on others 10Gig
> and vice versa. Is that going to be a problem and is there a
> workaround? I mean 192.168 is always the 10Gig and 10.0 the 1Gig but
> the correspondence with eth0 vs eth1 is not consistent. I'd have liked
> that but couldn't figure out a way to guarantee the order of the eth
> interfaces.

Just set the MCA param oob_tcp_if_include to 192.168 and you should be okay. I forget the exact param syntax for specifying an IP network instead of an interface name, but you can get it by running:

  ompi_info --param oob tcp

> --
> Rahul
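[For reference, the suggested setting can also be put in an MCA params file so every job picks it up without editing mpirun command lines. A sketch only - the values are illustrative, and whether CIDR notation is accepted here (as opposed to interface names only) depends on the OMPI version; as noted above, `ompi_info --param oob tcp` shows the accepted syntax for your install:]

```
# $HOME/.openmpi/mca-params.conf  (values are illustrative)
# Keep daemon wire-up (OOB) and MPI message (BTL) traffic on the 192.168
# network, regardless of whether it is eth0 or eth1 on a given node.
oob_tcp_if_include = 192.168.0.0/16
btl_tcp_if_include = 192.168.0.0/16
```

The same pair can be passed per job as `mpirun --mca oob_tcp_if_include ... --mca btl_tcp_if_include ...` if you prefer not to set it globally.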