Thanks.

$ qconf -spl
OpenMP
dist
make
mc4
oneper
orte
orte2
perf
run15
run25
run5
run50
thread
turbo

$ qconf -sp orte2
pe_name              orte2
slots                99999
used_slots           0
bound_slots          0
user_lists           NONE
xuser_lists          NONE
start_proc_args      NONE
stop_proc_args       NONE
per_pe_task_prolog   NONE
per_pe_task_epilog   NONE
allocation_rule      2
control_slaves       TRUE
job_is_first_task    FALSE
urgency_slots        min
accounting_summary   FALSE
daemon_forks_slaves  FALSE
master_forks_slaves  FALSE
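For context, the two PE settings that usually matter most for Open MPI's tight gridengine integration are control_slaves, which must be TRUE so that the orted daemons may be started on the slave hosts via "qrsh -inherit", and allocation_rule, where an integer value such as the "2" above pins exactly that many slots per host while $fill_up or $round_robin lets the scheduler spread the slots. A rough sketch of the key fields, with illustrative values only (not taken from this cluster):

allocation_rule    $fill_up     # or $round_robin, or an integer for N slots per host
control_slaves     TRUE         # required so orted can be launched with "qrsh -inherit"
job_is_first_task  FALSE
start_proc_args    NONE
stop_proc_args     NONE

# Edit the PE definition in $EDITOR (needs manager privileges), e.g. to try a
# different allocation_rule:
$ qconf -mp orte2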
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of John Hearns via users
Sent: Tuesday, June 2, 2020 2:25 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: John Hearns <hear...@gmail.com>
Subject: Re: [OMPI users] Running mpirun with grid

As a suggestion, can we see the configuration of your Parallel Environment?

qconf -spl
qconf -sp orte2

On Mon, 1 Jun 2020 at 22:20, Ralph Castain via users <users@lists.open-mpi.org> wrote:

Afraid I have no real ideas here. Best I can suggest is taking the qrsh command line from the prior debug output and trying to run it manually. This might give you a chance to manipulate it and see if you can identify what, if anything, is causing it a problem. Without mpirun executing, the daemons will bark about being unable to connect back, so you might need to use some other test program for this purpose.

I agree with Jeff - you should check to see where these messages are coming from:

>> Server daemon successfully started with task id "1.cod4"
>> Server daemon successfully started with task id "1.cod5"
>> Server daemon successfully started with task id "1.cod6"
>> Server daemon successfully started with task id "1.has6"
>> Server daemon successfully started with task id "1.hpb12"
>> Server daemon successfully started with task id "1.has4"
>
>> Unmatched ".
>> Unmatched ".
>> Unmatched ".

Could be a clue as to what is actually happening.

> On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:
>
> Thanks to Jeff & Ralph for your responses.
>
> I tried changing the verbose level to 5 using the option suggested by Ralph, but there was no difference in the output (so no additional information in the output).
>
> I also tried replacing the grid submission script with a command-line qsub job submission, but got the same issue. Without the job submission script, the qsub command looks like the one below. It uses the mpirun option "--N 1" to ensure that only one process is launched by mpirun on each host.
>
> Do you have a suggestion on how I can go about investigating the root cause of the problem I am facing? I am able to run mpirun successfully if I specify the same set of hosts (as allocated by grid) in an mpirun host file. I have also pasted the verbose output with the host file; the orted command looks very similar to the one generated for the grid submission (except that it uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh).
>
> Thanks,
> Vipul
>
>
> qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" -q all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1 -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH --merge-stderr-to-stdout --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with arguments>
>
>
> $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH --merge-stderr-to-stdout --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with arguments>
>
> [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
>     /usr/bin/ssh <template> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca ess_base_jobid "1740767232" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "6" -mca orte_node_regex "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix "^s1,s2,cray,isolated"
> [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],5]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],4]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],1]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],2]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],3]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
>
> -----Original Message-----
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Monday, June 1, 2020 4:15 PM
> To: Open MPI User's List <users@lists.open-mpi.org>
> Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> Subject: Re: [OMPI users] Running mpirun with grid
>
> On top of what Ralph said, I think that this output is unexpected:
>
>> Starting server daemon at host "cod5"Starting server daemon at host "cod6"Starting server daemon at host "has4"Starting server daemon at host "cod4"
>>
>>
>>
>> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
>>
>> Server daemon successfully started with task id "1.cod4"
>> Server daemon successfully started with task id "1.cod5"
>> Server daemon successfully started with task id "1.cod6"
>> Server daemon successfully started with task id "1.has6"
>> Server daemon successfully started with task id "1.hpb12"
>> Server daemon successfully started with task id "1.has4"
>
> I don't think that's coming from Open MPI.
>
> My guess is that something is apparently trying to parse (or run?) that output, and it's getting confused because that output is unexpected, and then you get these errors:
>
>> Unmatched ".
>> Unmatched ".
>> Unmatched ".
>
> And the Open MPI helper daemon doesn't actually start. Therefore you get this error:
>
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
> ...etc.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
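Two concrete follow-ups on the suggestions above, sketched on the assumption of a standard SGE/UGE setup; the host name "cod5" is only an example taken from the output, and the exact qrsh flags should be checked against the line printed at plm_base_verbose 5. (The "Unmatched "." messages look like csh quoting errors, which would fit the csh-style launch line in the verbose output.)

# 1. Ralph's suggestion: replay the launcher step by hand. From inside the
#    running job on its master host (qrsh -inherit only works against hosts
#    that hold slots for that job), substitute a harmless command for orted
#    and compare with a plain ssh to the same host:
$ qrsh -inherit -nostdin -V cod5 hostname
$ /usr/bin/ssh cod5 hostname

# 2. Jeff's point: the "Starting server daemon ..." lines do not appear to be
#    Open MPI output, so look for a site-specific qrsh/rsh wrapper or startup
#    method in the gridengine configuration that might be printing them:
$ qconf -sconf | egrep 'rsh|qrsh|qlogin'
$ qconf -sconf cod5 | egrep 'rsh|qrsh|qlogin'   # host-specific overrides, if any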