On May 29, 2010, at 11:35 AM, Rahul Nabar wrote:
> On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:
>
>>
>>> From your other note, it sounds like #3 might be the problem here. Do you
>>> have some nodes that are configured with "eth0" pointing to your 10.x
>>> network, and other nodes w
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:
>
> > From your other note, it sounds like #3 might be the problem here. Do you
> > have some nodes that are configured with "eth0" pointing to your 10.x
> > network, and other nodes with "eth0" pointing to your 192.x network? I have
> > found
There are some timeout issues you can see with large clusters on Torque - check
the Torque web site for explanations and instructions on what to do about it.
However, that doesn't appear to be the problem here.
If our daemon doesn't report back, it is typically due to one or more of the
followi
On Fri, May 28, 2010 at 3:53 PM, Ralph Castain wrote:
> What environment are you running on the cluster, and what version of OMPI?
> Not sure that error message is coming from us.
openmpi-1.4.1
The cluster runs PBS-Torque, so I guess that could be the other source of the error.
--
Rahul
What environment are you running on the cluster, and what version of OMPI? Not
sure that error message is coming from us.
On May 28, 2010, at 1:18 PM, Rahul Nabar wrote:
> Often when I try to run larger jobs on our cluster, I get an error of
> this sort from some of the compute servers:
>
>