I ran 15 test jobs using 1.6.4rc3, and all of them were successful,
unlike 1.6.1, where around 40% of my jobs would fail. Thanks for the
help; I really appreciate it.

-- 
Bharath


On Thu, Feb 14, 2013 at 11:59:06AM -0800, Ralph Castain wrote:
> Rats - sorry.
> 
> I seem to recall fixing something in 1.6 that might relate to this - a race 
> condition in the startup. You might try updating to the 1.6.4 release 
> candidate.
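> 
> In case it is useful, here is a minimal sketch of a standard source
> build of the release candidate (the tarball name and install prefix
> are placeholders; adjust to whatever is current on the v1.6 download
> page):
> 
>   # unpack the release candidate tarball
>   tar xjf openmpi-1.6.4rc3.tar.bz2
>   cd openmpi-1.6.4rc3
> 
>   # --with-tm enables the Torque (TM) launcher; point it at your
>   # Torque install if the headers are in a non-default location
>   ./configure --prefix=$HOME/opt/openmpi-1.6.4rc3 --with-tm
>   make -j4 && make install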
> 
> 
> On Feb 14, 2013, at 11:04 AM, Bharath Ramesh <bram...@vt.edu> wrote:
> 
> > When I set OPAL_OUTPUT_STDERR_FD=0, I receive a whole bunch of
> > "mca_oob_tcp_message_recv_complete: invalid message type" errors,
> > and the job just hangs even though all the nodes have fired off the
> > MPI application.
> > 
> > 
> > -- 
> > Bharath
> > 
> > On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote:
> >> I don't think this is documented anywhere, but it is an available trick 
> >> (not sure if it is in 1.6.1, but might be): if you set 
> >> OPAL_OUTPUT_STDERR_FD=N in your environment, we will direct all our error 
> >> outputs to that file descriptor. If it is "0", then it goes to stdout.
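> >> 
> >> For example, a sketch (the launch line itself is just a stand-in
> >> for your real job):
> >> 
> >>   # per the convention above, fd "0" sends our error output to
> >>   # stdout, so errors land in the same stream as the app's output
> >>   export OPAL_OUTPUT_STDERR_FD=0
> >>   mpirun -np 4 ./hello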
> >> 
> >> Might be worth a try?
> >> 
> >> 
> >> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh <bram...@vt.edu> wrote:
> >> 
> >>> Is there any way to prevent the output of more than one node from
> >>> being written to the same line? I tried setting --output-filename,
> >>> which didn't help; for some reason only stdout was written to the
> >>> files. That makes an output file of close to 6M a little hard to
> >>> read.
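> >>> 
> >>> The invocation was along these lines (path and rank count are just
> >>> placeholders; the per-rank file suffix scheme depends on the
> >>> version):
> >>> 
> >>>   # each rank's output should land in its own file under /tmp/out.*
> >>>   mpirun --output-filename /tmp/out -np 64 ./hello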
> >>> 
> >>> -- 
> >>> Bharath
> >>> 
> >>> On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
> >>>> Sounds like the orteds aren't reporting back to mpirun after launch. The 
> >>>> MPI_proctable observation just means that the procs didn't launch in 
> >>>> those cases where it is absent, which is something you already observed.
> >>>> 
> >>>> Set "-mca plm_base_verbose 5" on your cmd line. You should see each 
> >>>> orted report back to mpirun after it launches. If not, then it is likely 
> >>>> that something is blocking it.
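> >>>> 
> >>>> Something like this (the -np count and application name are
> >>>> placeholders):
> >>>> 
> >>>>   # at verbosity 5 the plm framework logs each daemon's callback
> >>>>   # as it checks in with mpirun; a node that never reports back
> >>>>   # is the one to look at
> >>>>   mpirun -mca plm_base_verbose 5 -np 64 ./hello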
> >>>> 
> >>>> You could also try updating to 1.6.3/4 in case there is some race
> >>>> condition in 1.6.1, though we haven't heard of it to date.
> >>>> 
> >>>> 
> >>>> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bram...@vt.edu> wrote:
> >>>> 
> >>>>> On our cluster we are noticing intermittent job launch failures
> >>>>> when using Open MPI. We are currently using Open MPI 1.6.1,
> >>>>> integrated with Torque 4.1.3. It fails even for a simple MPI hello
> >>>>> world application. The issue is that orted gets launched on all
> >>>>> the nodes, but a bunch of nodes don't launch the actual MPI
> >>>>> application. No errors are reported when the job gets killed
> >>>>> because the walltime expires. Enabling --debug-daemons doesn't
> >>>>> show any errors either. The only difference is that successful
> >>>>> runs have MPI_proctable listed, while for failures it is absent.
> >>>>> Any help in debugging this issue is greatly appreciated.
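> >>>>> 
> >>>>> For reference, the failing jobs are of this shape (the source
> >>>>> file name and process count are illustrative):
> >>>>> 
> >>>>>   # inside the Torque job script; with TM integration mpirun
> >>>>>   # discovers the allocated nodes itself, no hostfile needed
> >>>>>   mpicc hello.c -o hello
> >>>>>   mpirun -np 64 --debug-daemons ./hello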
> >>>>> 
> >>>>> -- 
> >>>>> Bharath
> >>>>> 
