Vipul,

You can also use the orte_launch_agent MCA parameter to debug that.

Long story short,
    mpirun --mca orte_launch_agent /.../agent.sh a.out
will run
    qrsh ... /.../agent.sh <orted params>
instead of
    qrsh ... orted <orted params>

As a first step, you can write a trivial agent that simply dumps its
command line. You might also want to dump the environment and run
ldd /.../orted to make sure there is no accidental mix of libraries.
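
A minimal sketch of such an agent (the /.../ paths are placeholders for
your actual install, and the log location is just a suggestion):

#!/bin/sh
# trivial launch agent: record how orted would have been started
LOG=/tmp/launch_agent.$(hostname).$$.log
{
    echo "command line: $0 $*"
    echo "--- environment ---"
    env
    echo "--- ldd of orted ---"
    ldd /.../orted            # fill in the real orted path
} > "$LOG" 2>&1
# once things look sane, chain to the real daemon so the job proceeds:
# exec /.../orted "$@"

Make the script executable and reachable on every node before running mpirun.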

Cheers,

Gilles

On Tue, Jun 2, 2020 at 6:20 AM Ralph Castain via users
<users@lists.open-mpi.org> wrote:
>
> Afraid I have no real ideas here. The best I can suggest is to take the qrsh 
> command line from the prior debug output and try running it manually. This 
> might give you a chance to manipulate it and see what, if anything, is 
> causing it a problem. Without mpirun executing, the daemons will bark about 
> being unable to connect back, so you might need to use some other test 
> program for this purpose; see the sketch below.
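>
> For example (the host name here is hypothetical; copy the exact qrsh 
> command and options from your plm_base_verbose output, and swap in a 
> harmless test program for orted):
>
>     qrsh <options from the debug output> cod4 hostname
>
> If even that fails or emits the stray "Server daemon ..." lines, the 
> problem is on the qrsh side rather than in Open MPI.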
>
> I agree with Jeff - you should check to see where these messages are coming 
> from:
>
>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
> >
>
>
> Could be a clue as to what is actually happening.
>
>
> > On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users 
> > <users@lists.open-mpi.org> wrote:
> >
> > Thanks, Jeff & Ralph, for your responses.
> >
> > I tried changing the verbose level to 5 using the option suggested by 
> > Ralph, but there was no difference in the output (no additional 
> > information).
> >
> > I also tried replacing the grid submission script with a command-line qsub 
> > job submission, but got the same issue. Without the job submission script, 
> > the qsub command looks like the one below. It uses the mpirun option "--N 1" 
> > to ensure that mpirun launches only 1 process on each host.
> >
> > Do you have any suggestions on how I can investigate the root cause of the 
> > problem I am facing? I am able to run mpirun successfully if I specify the 
> > same set of hosts (as allocated by grid) using an mpirun host file. I have 
> > also pasted the verbose output with the host file; the orted command looks 
> > very similar to the one generated for the grid submission (except that it 
> > uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh).
> >
> > Thanks,
> > Vipul
> >
> >
> > qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" 
> > -q all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1  -x 
> > LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
> > --merge-stderr-to-stdout --output-filename 
> > ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca 
> > orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca 
> > plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> >
> > $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x 
> > VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x 
> > LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
> > --merge-stderr-to-stdout --output-filename 
> > ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca 
> > orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca 
> > plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> > [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
> >        /usr/bin/ssh <template>     set path = ( 
> > /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 
> > 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv 
> > LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> > $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH 
> > /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( 
> > $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 
> > 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> > $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
> > /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
> > /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca 
> > ess_base_jobid "1740767232" -mca ess_base_vpid "<template>" -mca 
> > ess_base_num_procs "6" -mca orte_node_regex 
> > "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca 
> > orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca 
> > orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca 
> > plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename 
> > "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix 
> > "^s1,s2,cray,isolated"
> > [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> > [[26562,0],5]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> > [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> > [[26562,0],4]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> > [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> > [[26562,0],1]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> > [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> > [[26562,0],2]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> > [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> > [[26562,0],3]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> > [26562,1]
> >
> > -----Original Message-----
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Monday, June 1, 2020 4:15 PM
> > To: Open MPI User's List <users@lists.open-mpi.org>
> > Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> > Subject: Re: [OMPI users] Running mpirun with grid
> >
> > On top of what Ralph said, I think that this output is unexpected:
> >
> >> Starting server daemon at host "cod5"Starting server daemon at host
> >> "cod6"Starting server daemon at host "has4"Starting server daemon at host 
> >> "co d4"
> >>
> >>
> >>
> >> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
> >>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >
> > I don't think that's coming from Open MPI.
> >
> > My guess is that something is trying to parse (or run?) that output, 
> > getting confused because the output is unexpected, and producing these 
> > errors:
> >
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
> >
> > And the Open MPI helper daemon doesn't actually start.  Therefore you get 
> > this error:
> >
> >> --------------------------------------------------------------------------
> >> ORTE was unable to reliably start one or more daemons.
> >> This usually is caused by:
> > ...etc.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
>
>
