Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
I ran 15 test jobs using 1.6.4rc3 and all of them were successful, unlike 1.6.1, where around 40% of my jobs would fail. Thanks for the help, I really appreciate it. -- Bharath On Thu, Feb 14, 2013 at 11:59:06AM -0800, Ralph Castain wrote: > Rats - sorry. > > I seem to recall fixing something in 1.6

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Ralph Castain
Rats - sorry. I seem to recall fixing something in 1.6 that might relate to this - a race condition in the startup. You might try updating to the 1.6.4 release candidate. On Feb 14, 2013, at 11:04 AM, Bharath Ramesh wrote: > When I set the OPAL_OUTPUT_STDERR_FD=0 I receive a whole bunch of >
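In case it is useful, a rough sketch of building a 1.6.4 release candidate tarball with Torque (tm) support; the tarball name, install prefix, and Torque path below are illustrative assumptions, not details from this thread:

    tar xjf openmpi-1.6.4rc3.tar.bz2
    cd openmpi-1.6.4rc3
    # --with-tm points configure at the Torque install so mpirun can use the tm launcher
    ./configure --prefix=$HOME/opt/openmpi-1.6.4rc3 --with-tm=/usr/local/torque
    make -j4 && make install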

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
When I set OPAL_OUTPUT_STDERR_FD=0, I receive a whole bunch of mca_oob_tcp_message_recv_complete: invalid message type errors, and the job just hangs even though all the nodes have fired off the MPI application. -- Bharath On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote: > I don't

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
After manually fixing some of the issues, I see that the failed nodes never receive the command to launch the local processes. I am going to ask the admins to look into the logs for any dropped connections. On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote: > Sounds like the orteds are

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Ralph Castain
I don't think this is documented anywhere, but it is an available trick (not sure if it is in 1.6.1, but might be): if you set OPAL_OUTPUT_STDERR_FD=N in your environment, we will direct all our error outputs to that file descriptor. If it is "0", then it goes to stdout. Might be worth a try?
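A minimal sketch of trying this from a job script; the application name is a placeholder:

    # route Open MPI error output to fd 0, which (per the note above) sends it to stdout
    export OPAL_OUTPUT_STDERR_FD=0
    mpirun ./hello_world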

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
Is there any way to prevent the output of more than one node from being written to the same line? I tried setting --output-filename, which didn't help; for some reason only stdout was written to the files, making an output file close to 6M a little hard to read. -- Bharath On Thu, Feb 14, 2013 at 07:35:
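For reference, a sketch of the per-process redirection being attempted; the prefix is a placeholder, and in the 1.6 series each rank's output should end up in its own file derived from that prefix:

    # give each rank its own output file instead of interleaving everything on the terminal
    mpirun --output-filename /tmp/hello_out ./hello_world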

Re: [OMPI users] OpenMPI job launch failures

2013-02-14 Thread Ralph Castain
Sounds like the orteds aren't reporting back to mpirun after launch. The MPI_proctable observation just means that the procs didn't launch in those cases where it is absent, which is something you already observed. Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted report
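A sketch of adding the suggested verbosity to the launch; the application name is a placeholder:

    # raise the process launch module (plm) verbosity so each orted's report-back is logged
    mpirun -mca plm_base_verbose 5 ./hello_world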

[OMPI users] OpenMPI job launch failures

2013-02-14 Thread Bharath Ramesh
On our cluster we are noticing intermittent job launch failures when using OpenMPI. We are currently using OpenMPI-1.6.1, integrated with Torque-4.1.3. It fails even for a simple MPI hello world application. The issue is that orted gets launched on all the nodes but the
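For anyone trying to reproduce this, a sketch of the kind of Torque job script that exercises the failing launch path; the node counts, walltime, and binary name are illustrative assumptions:

    #!/bin/bash
    #PBS -l nodes=4:ppn=8
    #PBS -l walltime=00:10:00
    # with tm integration, mpirun picks up the allocated node list from Torque
    cd $PBS_O_WORKDIR
    mpirun ./hello_world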