I am trying to debug a really weird issue. I have a
simple MPI hello world application (attached) that hangs when I
try to run on our cluster using 256 nodes with 16 cores on each
node. The cluster uses QDR IB.
I am able to run the test over Ethernet by excluding openib from
the btl. However ...
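
The message does not show the actual command lines, so the following is
only a sketch of the usual way to exclude openib (or to force TCP) with
the btl MCA parameter; the binary name ./hello and the process count
(256 nodes x 16 cores = 4096) are assumptions:

    mpirun --mca btl ^openib  -np 4096 ./hello    # run without the openib BTL
    mpirun --mca btl tcp,self -np 4096 ./hello    # or: allow only TCP (plus self)
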
On our cluster we are noticing intermittent job launch failures when
using OpenMPI. We are currently using OpenMPI-1.6.1 on our cluster and
it is integrated with Torque-4.1.3. It fails even for a simple MPI
hello world application. The issue is that orted gets launched on all
the nodes but the ...
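
The hello world program itself was attached to the original post and is
not reproduced in the archive; a minimal equivalent in C would look
roughly like this (the output format is illustrative only):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                    /* every rank must get here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("Hello from rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }
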
> ...haven't heard of it to-date.
>
>
> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh wrote:
>
> > On our cluster we are noticing intermittent job launch failures when using
> > OpenMPI. We are currently using OpenMPI-1.6.1 on our cluster and it is
> > integrated with Torque-4.1.3. ...
> ...each orted
> report back to mpirun after it launches. If not, then it is likely that
> something is blocking it.
>
> You could also try updating to 1.6.3/4 in case there is some race condition
> in 1.6.1, though we haven't heard of it to-date.
>
>
> On Feb 14, 2013, ...
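
The advice above (watch whether every orted reports back to mpirun) is
normally checked by launching with the daemon debug options; a sketch,
assuming the test binary is ./hello:

    mpirun --debug-daemons --leave-session-attached \
           --mca plm_base_verbose 5 ./hello

--debug-daemons makes each orted print what it is doing, and
--leave-session-attached keeps the launch session open so those
messages are not lost; plm_base_verbose raises the launcher's verbosity.
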
> ...
>
> Might be worth a try?
>
>
> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh wrote:
>
> > Is there any way to prevent the output of more than one node being
> > written to the same line? I tried setting --output-filename,
> > which didn't help. For some reason only ...
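
Two mpirun options are relevant to the interleaved-output question; a
sketch (the log path is illustrative, and the exact per-rank file
suffix varies between Open MPI versions):

    mpirun --output-filename /tmp/hello.out ./hello   # one output file per rank
    mpirun --tag-output ./hello                       # tag each line with the producing rank
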
> ...something in 1.6 that might relate to this - a race
> condition in the startup. You might try updating to the 1.6.4 release
> candidate.
>
>
> On Feb 14, 2013, at 11:04 AM, Bharath Ramesh wrote:
>
> > When I set OPAL_OUTPUT_STDERR_FD=0 I receive a whole bunch of ...
>
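
The environment variable mentioned above is typically set in the shell
or forwarded to the remote nodes with mpirun's -x option; a sketch (the
choice of file descriptor 0 simply mirrors what the poster tried):

    export OPAL_OUTPUT_STDERR_FD=0              # for the local mpirun
    mpirun -x OPAL_OUTPUT_STDERR_FD ./hello     # -x forwards the variable to the remote nodes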