As a suggestion, can we see the configuration of your Parallel Environment?

    qconf -spl
    qconf -sp orte2

On Mon, 1 Jun 2020 at 22:20, Ralph Castain via users <users@lists.open-mpi.org> wrote:

> Afraid I have no real ideas here. Best I can suggest is taking the qrsh
> cmd line from the prior debug output and trying to run it manually. This
> might give you a chance to manipulate it and see if you can identify what
> is causing it an issue, if anything. Without mpirun executing, the daemons
> will bark about being unable to connect back, so you might need to use
> some other test program for this purpose.
>
> I agree with Jeff - you should check to see where these messages are
> coming from:
>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >>
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
>
> Could be a clue as to what is actually happening.
>
> > On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users
> > <users@lists.open-mpi.org> wrote:
> >
> > Thanks, Jeff & Ralph, for your responses.
> >
> > I tried changing the verbose level to 5 using the option suggested by
> > Ralph, but there was no difference in the output (so no additional
> > information in the output).
> >
> > I also tried replacing the grid submission script with a command-line
> > qsub job submission, but got the same issue. Without the job submission
> > script, the qsub command looks like the one below. It uses the mpirun
> > option "--N 1" to ensure that only 1 process is launched by mpirun on
> > each host.
> >
> > Do you have some suggestion on how I can go about investigating the
> > root cause of the problem I am facing? I am able to run mpirun
> > successfully if I specify the same set of hosts (as allocated by grid)
> > using an mpirun host file.
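[Aside for readers following along: the "host file" referred to here is Open MPI's plain-text hostfile format, one host per line, optionally with a slot count capping how many processes mpirun may place on that node. A hypothetical host.txt for this setup might look like the following; the host names are borrowed from the task ids in the pasted output, and the slot counts are an assumption matching "--N 1":]

```
# Open MPI hostfile: one line per node; "slots" caps processes per node
cod4  slots=1
cod5  slots=1
cod6  slots=1
has4  slots=1
has6  slots=1
hpb12 slots=1
```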
> > I have also pasted the verbose output with the host file; the orted
> > command looks very similar to the one generated for the grid submission
> > (except that it uses /usr/bin/ssh instead of
> > /grid2/sge/bin/lx-amd64/qrsh).
> >
> > Thanks,
> > Vipul
> >
> > qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l
> > "os=redhat6.7*" -q all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun
> > --N 1 -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib
> > -x PATH=$PATH --merge-stderr-to-stdout
> > --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5
> > --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5
> > --mca plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> > $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt
> > -x VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN
> > -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH
> > --merge-stderr-to-stdout
> > --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5
> > --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5
> > --mca plm_rsh_no_tree_spawn 1 <application with arguments>
> >
> > [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
> >   /usr/bin/ssh <template>
> >     set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ;
> >     if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ;
> >     if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ;
> >     if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ;
> >     if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ;
> >     if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ;
> >     if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;
> >     /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env"
> >       -mca ess_base_jobid "1740767232" -mca ess_base_vpid "<template>"
> >       -mca ess_base_num_procs "6"
> >       -mca orte_node_regex "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)"
> >       -mca orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496"
> >       --mca orte_base_help_aggregate "0" --mca plm_base_verbose "5"
> >       --mca plm_rsh_no_tree_spawn "1" -mca plm "rsh"
> >       -mca orte_output_filename "./veloce.log/velsyn/dvelsyn:nojobid,nocopy"
> >       -mca pmix "^s1,s2,cray,isolated"
> > [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],5]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],4]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],1]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],2]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> > [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],3]
> > [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> >
> > -----Original Message-----
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Monday, June 1, 2020 4:15 PM
> > To: Open MPI User's List <users@lists.open-mpi.org>
> > Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> > Subject: Re: [OMPI users] Running mpirun with grid
> >
> > On top of what Ralph said, I think that this output is unexpected:
> >
> >> Starting server daemon at host "cod5"
> >> Starting server daemon at host "cod6"
> >> Starting server daemon at host "has4"
> >> Starting server daemon at host "cod4"
> >> Starting server daemon at host "hpb12"
> >> Starting server daemon at host "has6"
> >>
> >> Server daemon successfully started with task id "1.cod4"
> >> Server daemon successfully started with task id "1.cod5"
> >> Server daemon successfully started with task id "1.cod6"
> >> Server daemon successfully started with task id "1.has6"
> >> Server daemon successfully started with task id "1.hpb12"
> >> Server daemon successfully started with task id "1.has4"
> >
> > I don't think that's coming from Open MPI.
> >
> > My guess is that something is trying to parse (or run?) that output,
> > getting confused because the output is unexpected, and then you get
> > these errors:
> >
> >> Unmatched ".
> >> Unmatched ".
> >> Unmatched ".
> >
> > And the Open MPI helper daemon doesn't actually start. Therefore you
> > get this error:
> >
> >> --------------------------------------------------------------------------
> >> ORTE was unable to reliably start one or more daemons.
> >> This usually is caused by:
> >
> > ...etc.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
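[Aside: one way to act on the "check where these messages are coming from" suggestion is to scan the captured job output for lines whose double quotes do not balance, since an unterminated `"` is the classic trigger for csh's `Unmatched ".` error. A minimal sketch in Python; the helper name and the sample lines are illustrative, not from the actual job log:]

```python
# Flag lines of captured launcher output whose double quotes are
# unbalanced -- the kind of line that can make csh report: Unmatched ".
def unbalanced_quote_lines(text):
    """Return (line_number, line) pairs with an odd number of '"' characters."""
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if line.count('"') % 2 == 1]

sample = '''Server daemon successfully started with task id "1.cod4"
Server daemon successfully started with task id "1.cod5
Unmatched ".'''

for n, line in unbalanced_quote_lines(sample):
    print(f"line {n}: {line}")
```

Running something like this over the real job's stdout/stderr capture would point at the first line the shell failed to tokenize. (It only checks double quotes; a csh parser can trip on other metacharacters as well.)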