Vipul,

You can also use the orte_launch_agent MCA parameter to debug that.

Long story short, mpirun --mca orte_launch_agent /.../agent.sh a.out will run qrsh ... /.../agent.sh <orted params> instead of qrsh ... orted <orted params>.

As a first step, you can write a trivial agent that simply dumps its command line. You might also want to dump the environment and run ldd /.../orted to make sure there is no accidental mix of libraries.
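For example, a minimal agent.sh could look something like the sketch below. This is untested and the paths are only placeholders (the orted and log locations must be adjusted to your installation); the idea is just to capture how the daemon is invoked on each node and then hand off to the real orted:

    #!/bin/sh
    # Minimal debug launch agent: record how we were invoked, then exec the real orted.
    LOG=/tmp/launch_agent.$(hostname).$$.log
    {
      echo "command line: $0 $*"
      echo "--- environment ---"
      env
      echo "--- ldd orted ---"
      ldd /build/openmpi/openmpi-4.0.1/rhel6/bin/orted
    } > "$LOG" 2>&1
    # Hand off to the real orted with the same arguments so the job can still start.
    exec /build/openmpi/openmpi-4.0.1/rhel6/bin/orted "$@"

Make sure the script is executable and lives on a filesystem that is visible to the compute nodes, since qrsh will try to run it there.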
Cheers,

Gilles

On Tue, Jun 2, 2020 at 6:20 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
>
> Afraid I have no real ideas here. The best I can suggest is taking the qrsh cmd line from the prior debug output and trying to run it manually. This might give you a chance to manipulate it and see if you can identify what, if anything, is causing it a problem. Without mpirun executing, the daemons will bark about being unable to connect back, so you might need to use some other test program for this purpose.
>
> I agree with Jeff - you should check to see where these messages are coming from:
>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
>
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
>
> Could be a clue as to what is actually happening.
>
> > On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:
> >
> > Thanks, Jeff and Ralph, for your responses.
> >
> > I tried changing the verbose level to 5 using the option suggested by Ralph, but there was no difference in the output (so no additional information).
> >
> > I also tried replacing the grid submission script with a command-line qsub job submission, but got the same issue. Without the job submission script, the qsub command looks like the one below. It uses the mpirun option "--N 1" to ensure that mpirun launches only 1 process on each host.
> >
> > Do you have any suggestion on how I can go about investigating the root cause of the problem I am facing? I am able to run mpirun successfully if I specify the same set of hosts (as allocated by grid) in an mpirun host file. I have also pasted the verbose output with the host file; the orted command looks very similar to the one generated for the grid submission (except that it uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh).
> >
> > Thanks,
> > Vipul
> >
> >
> > qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" -q all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1 -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH --merge-stderr-to-stdout --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> >
> > $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH --merge-stderr-to-stdout --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> > [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
> >     /usr/bin/ssh <template> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca ess_base_jobid "1740767232" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "6" -mca orte_node_regex "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix "^s1,s2,cray,isolated"
> > [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],5]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],4]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],1]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],2]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],3]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> >
> > -----Original Message-----
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Monday, June 1, 2020 4:15 PM
> > To: Open MPI User's List <users@lists.open-mpi.org>
> > Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> > Subject: Re: [OMPI users] Running mpirun with grid
> >
> > On top of what Ralph said, I think that this output is unexpected:
> >
> >> Starting server daemon at host "cod5"Starting server daemon at host "cod6"Starting server daemon at host "has4"Starting server daemon at host "co d4"
> >>
> >>
> >>
> >> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
> >>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >
> > I don't think that's coming from Open MPI.
> >
> > My guess is that something is trying to parse (or run?) that output, it's getting confused because the output is unexpected, and then you get these errors:
> >
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
> >
> > And the Open MPI helper daemon doesn't actually start, so you then get this error:
> >
> >> --------------------------------------------------------------------------
> >> ORTE was unable to reliably start one or more daemons.
> >> This usually is caused by:
> > ...etc.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com