[OMPI users] Simple MPI hello world hangs over IB

2013-02-04 Thread Bharath Ramesh
I am trying to debug an issue which is really weird. I have a simple MPI hello world application (attached) that hangs when I try to run it on our cluster using 256 nodes with 16 cores on each node. The cluster uses QDR IB. I am able to run the test over ethernet by excluding openib from the btl. Howev

[OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
On our cluster we are noticing intermittent job launch failures when using OpenMPI. We are currently using OpenMPI-1.6.1 on our cluster and it is integrated with Torque-4.1.3. It fails even for a simple MPI hello world application. The issue is that orted gets launched on all the nodes but the

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
't heard of it to-date. > > > On Feb 14, 2013, at 7:21 AM, Bharath Ramesh wrote: > > > On our cluster we are noticing intermittent job launch failures when using > > OpenMPI. We are currently using OpenMPI-1.6.1 on our cluster and it is > > integrated with Torq

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
each orted > report back to mpirun after it launches. If not, then it is likely that > something is blocking it. > > You could also try updating to 1.6.3/4 in case there is some race condition > in 1.6.1, though we haven't heard of it to-date. > > > On Feb 14, 2013,

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
. > > Might be worth a try? > > > On Feb 14, 2013, at 8:38 AM, Bharath Ramesh wrote: > > > Is there any way to prevent the output of more than one node > > from being written to the same line? I tried setting --output-filename, > > which didn't help. For some reason only

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
ething in 1.6 that might relate to this - a race > condition in the startup. You might try updating to the 1.6.4 release > candidate. > > > On Feb 14, 2013, at 11:04 AM, Bharath Ramesh wrote: > > > When I set the OPAL_OUTPUT_STDERR_FD=0 I receive a whole bunch of >