Re: [OMPI users] MPI daemon error

2010-05-29 Thread Ralph Castain
There are some timeout issues you can see with large clusters on Torque - check 
the Torque web site for explanations and instructions on what to do about it. 
However, that doesn't appear to be the problem here.

If our daemon doesn't report back, it is typically due to one or more of the 
following reasons:

1. it couldn't start because it didn't find the required libraries.

2. it couldn't report back because it hit a firewall

3. it couldn't report back because it didn't find a network that would get it 
back to mpirun
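
A quick way to rule out the first two causes on a suspect node is to check the 
daemon's library dependencies and the node's firewall rules by hand. This is 
only a sketch: the hostname "node042" and the install path are placeholders 
for your own setup.

```shell
# Cause 1: missing libraries -- look for unresolved shared-library deps
# of the Open MPI daemon on the compute node (adjust the install path).
ssh node042 'ldd /opt/openmpi-1.4.1/bin/orted | grep "not found"'

# Cause 2: firewall -- list packet-filter rules that could block the
# daemon's TCP callback to mpirun (usually requires root).
ssh node042 'iptables -L -n'
```

If the first command prints nothing and the second shows no REJECT/DROP rules 
on the relevant interfaces, cause #3 (interface selection) is the likely culprit.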

From your other note, it sounds like #3 might be the problem here. Do you have 
some nodes that are configured with "eth0" pointing to your 10.x network, and 
other nodes with "eth0" pointing to your 192.x network? I have found that 
having interfaces that share a name but are on different IP addresses 
sometimes causes OMPI to mis-connect.

If you randomly got some of those nodes in your allocation, that might explain 
why your jobs sometimes work and sometimes don't.


On May 28, 2010, at 3:23 PM, Rahul Nabar wrote:

> On Fri, May 28, 2010 at 3:53 PM, Ralph Castain  wrote:
>> What environment are you running on the cluster, and what version of OMPI? 
>> Not sure that error message is coming from us.
> 
> openmpi-1.4.1
> The cluster runs PBS-Torque. So I guess, that could be the other error source.
> 
> -- 
> Rahul
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI daemon error

2010-05-29 Thread Rahul Nabar
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain  wrote:

>
> From your other note, it sounds like #3 might be the problem here. Do you
> have some nodes that are configured with "eth0" pointing to your 10.x
> network, and other nodes with "eth0" pointing to your 192.x network? I have
> found that having interfaces that share a name but are on different IP
> addresses sometimes causes OMPI to mis-connect.
>
> If you randomly got some of those nodes in your allocation, that might 
> explain why your jobs sometimes work and sometimes don't.

That is exactly true. On some nodes eth0 is the 1 Gig interface and on
others it is the 10 Gig, and vice versa for eth1. Is that going to be a
problem, and is there a workaround? 192.168 is always the 10 Gig network
and 10.0 the 1 Gig, but the correspondence with eth0 vs. eth1 is not
consistent. I'd have liked it to be, but couldn't figure out a way to
guarantee the order of the eth interfaces.
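
For reference, the inconsistent name-to-subnet mapping can be surveyed across 
nodes with something like the following sketch (the host list is illustrative; 
substitute your real node names):

```shell
# Print each node's IPv4 address per interface to spot which nodes have
# eth0 on 10.x and which have it on 192.168.x.
for host in node01 node02; do
  echo "== $host =="
  ssh "$host" "/sbin/ip -4 -o addr show | grep -E 'eth[01]'"
done
```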

-- 
Rahul



Re: [OMPI users] MPI daemon error

2010-05-29 Thread Ralph Castain

On May 29, 2010, at 11:35 AM, Rahul Nabar wrote:

> On Sat, May 29, 2010 at 8:19 AM, Ralph Castain  wrote:
> 
>> 
>>> From your other note, it sounds like #3 might be the problem here. Do you 
>>> have some nodes that are configured with "eth0" pointing to your 10.x 
>>> network, and other nodes with "eth0" pointing to your 192.x network? I have 
>>> found that having interfaces that share a name but are on different IP 
>>> addresses sometimes causes OMPI to mis-connect.
>> 
>> If you randomly got some of those nodes in your allocation, that might 
>> explain why your jobs sometimes work and sometimes don't.
> 
> That is exactly true. On some nodes eth0 is the 1 Gig interface and on
> others it is the 10 Gig, and vice versa for eth1. Is that going to be a
> problem, and is there a workaround? 192.168 is always the 10 Gig network
> and 10.0 the 1 Gig, but the correspondence with eth0 vs. eth1 is not
> consistent. I'd have liked it to be, but couldn't figure out a way to
> guarantee the order of the eth interfaces.

Just set the MCA param oob_tcp_if_include to your 192.168 network and you 
should be okay. I forget the exact param syntax for specifying an IP network 
instead of an interface name, but you can get it by using

ompi_info --param oob tcp
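
As a sketch, the restriction would be passed on the mpirun command line like 
this. The CIDR form shown is an assumption (as noted above, verify the accepted 
syntax with ompi_info); the process count and application name are placeholders:

```shell
# Pin the out-of-band daemon callback traffic to the 192.168 network,
# and optionally the MPI point-to-point TCP traffic as well.
mpirun --mca oob_tcp_if_include 192.168.0.0/16 \
       --mca btl_tcp_if_include 192.168.0.0/16 \
       -np 16 ./my_app
```

Selecting by subnet rather than by interface name sidesteps the inconsistent 
eth0/eth1 ordering entirely, since the 192.168 network is the same on every node.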


> 
> -- 
> Rahul
> 