Thanks Jeff & Ralph for your responses.

I tried changing the verbose level to 5 using the option Ralph suggested, but 
there was no difference in the output (no additional information was printed).

I also tried replacing the grid submission script with a command-line qsub job 
submission, but ran into the same issue. With the submission script removed, 
the qsub command looks like the one below. It uses the mpirun option "--N 1" to 
ensure that mpirun launches only one process on each host.

Do you have any suggestions on how I can investigate the root cause of the 
problem I am facing? I am able to run mpirun successfully if I specify the same 
set of hosts (as allocated by the grid) in an mpirun host file. I have also 
pasted the verbose output from the host-file run below; the orted command it 
builds looks very similar to the one generated for the grid submission, except 
that it uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh.
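
For reference, host.txt is a standard Open MPI host file listing the 
grid-allocated machines, one per line. A minimal sketch is below; the host 
names and slot counts here are only illustrative, not the exact contents:

sox3 slots=1
bos3 slots=1
bos15 slots=1
bos9 slots=1
bos12 slots=1
bos7 slots=1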

Thanks,
Vipul


qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" -q 
all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1  -x 
LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
--merge-stderr-to-stdout --output-filename 
./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 
0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with 
arguments>


$ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x 
VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x 
LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
--merge-stderr-to-stdout --output-filename 
./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca orte_base_help_aggregate 
0 --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 <application with 
arguments>
                      
[sox3:24416] [[26562,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>     set path = ( 
/build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) 
set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH 
/build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv 
LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( 
$?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) 
setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
$?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
/build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
/build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca ess_base_jobid 
"1740767232" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "6" -mca 
orte_node_regex "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" 
-mca orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca 
orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca 
plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename 
"./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix "^s1,s2,cray,isolated"   
                               
[sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
[[26562,0],5]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
[26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
[[26562,0],4]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
[26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
[[26562,0],1]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
[26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
[[26562,0],2]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
[26562,1]
[sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
[[26562,0],3]
[sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
[26562,1]

-----Original Message-----
From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com] 
Sent: Monday, June 1, 2020 4:15 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
Subject: Re: [OMPI users] Running mpirun with grid

On top of what Ralph said, I think that this output is unexpected:

> Starting server daemon at host "cod5"Starting server daemon at host 
> "cod6"Starting server daemon at host "has4"Starting server daemon at host 
> "cod4"
> 
> 
> 
> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
> 
> Server daemon successfully started with task id "1.cod4"
> Server daemon successfully started with task id "1.cod5"
> Server daemon successfully started with task id "1.cod6"
> Server daemon successfully started with task id "1.has6"
> Server daemon successfully started with task id "1.hpb12"
> Server daemon successfully started with task id "1.has4"

I don't think that's coming from Open MPI.

My guess is that something is apparently trying to parse (or run?) that output, 
and it's getting confused because that output is unexpected, and then you get 
these errors:

> Unmatched ".
> Unmatched ".
> Unmatched ".

And the Open MPI helper daemon doesn't actually start.  Therefore you get this 
error:

> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
...etc.

--
Jeff Squyres
jsquy...@cisco.com
