Thanks, Jeff & Ralph, for your responses. I tried changing the verbose level to 5 using the option Ralph suggested, but the output did not change (no additional information was printed).
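(Side note: my understanding is that an MCA setting like this can also be passed through the environment rather than on the mpirun command line, for example something along these lines before submitting the job; I have not verified whether that behaves any differently in my setup:

    # Bourne-style shell (sh/bash):
    export OMPI_MCA_plm_base_verbose=5
    # csh/tcsh:
    setenv OMPI_MCA_plm_base_verbose 5

Since I submit with "qsub -V", the variable should be carried into the job environment.)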
I also tried replacing the grid submission script with a command-line qsub job submission, but ran into the same issue. With the submission script removed, the qsub command looks like the one below; it uses the mpirun option "--N 1" so that mpirun launches only one process per host. Do you have any suggestions on how I can go about investigating the root cause of the problem I am facing?

I am able to run mpirun successfully if I specify the same set of hosts (as allocated by the grid) in an mpirun host file (a sample host file is shown below). I have also pasted the verbose output from the host-file run; the orted command it generates looks very similar to the one generated for the grid submission, except that it uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh.

Thanks,
Vipul

The qsub command (without the submission script):

qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" -q all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1 -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH --merge-stderr-to-stdout --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with arguments>

The mpirun command with a host file (this one works):

$ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH --merge-stderr-to-stdout --output-filename ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with arguments>
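In case it is useful, host.txt is just a plain list of the hosts the grid had allocated, one per line with a slot count; the hostnames below are placeholders rather than the real machine names:

    # host.txt -- one line per allocated host (placeholder names)
    hostA slots=1
    hostB slots=1
    hostC slots=1
    hostD slots=1
    hostE slots=1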
Verbose output from the host-file run:

[sox3:24416] [[26562,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca ess_base_jobid "1740767232" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "6" -mca orte_node_regex "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix "^s1,s2,cray,isolated"
[sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],5]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],4]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],1]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],2]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from [[26562,0],3]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job [26562,1]

-----Original Message-----
From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
Sent: Monday, June 1, 2020 4:15 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
Subject: Re: [OMPI users] Running mpirun with grid

On top of what Ralph said, I think that this output is unexpected:

> Starting server daemon at host "cod5"Starting server daemon at host "cod6"Starting server daemon at host "has4"Starting server daemon at host "cod4"
>
> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
>
> Server daemon successfully started with task id "1.cod4"
> Server daemon successfully started with task id "1.cod5"
> Server daemon successfully started with task id "1.cod6"
> Server daemon successfully started with task id "1.has6"
> Server daemon successfully started with task id "1.hpb12"
> Server daemon successfully started with task id "1.has4"

I don't think that's coming from Open MPI. My guess is that something is apparently trying to parse (or run?) that output, and it's getting confused because that output is unexpected, and then you get these errors:

> Unmatched ".
> Unmatched ".
> Unmatched ".

And the Open MPI helper daemon doesn't actually start. Therefore you get this error:

> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:

...etc.

--
Jeff Squyres
jsquy...@cisco.com