Afraid I don't have much to offer. I suspect the problem is here:

> Unmatched ".
> Unmatched ".
> Unmatched ".

Something may be eating a quote from the cmd line, or mistakenly adding one. You 
might try upping the verbosity: --mca plm_base_verbose 5
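
For example, something like this (just a sketch based on the mpirun line in your 
submission script - keep the rest of your options and your actual application and 
arguments):

  mpirun --mca plm_base_verbose 5 --mca plm_rsh_no_tree_spawn 1 \
         --mca routed direct --mca orte_base_help_aggregate 0 \
         --bind-to none -np 7 <application with arguments>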



> On May 31, 2020, at 2:49 PM, Kulshrestha, Vipul 
> <vipul_kulshres...@mentor.com> wrote:
> 
> Hi Ralph,
> 
> Thanks for your response.
> 
> I added the option "--mca plm_rsh_no_tree_spawn 1" to mpirun command line, 
> but I get a similar error. (pasted below).
> 
> Regards,
> Vipul
> 
> Got 14 slots.
> tmpdir is /tmp/194954128.1.all.q
> pe_hostfile is /var/spool/sge/has2/active_jobs/194954128.1/pe_hostfile
> has2.org.com 2 al...@has2.org.com <NULL>
> has6.org.com 2 al...@has6.org.com <NULL>
> cod4.org.com 2 al...@cod4.org.com <NULL>
> cod6.org.com 2 al...@cod6.org.com <NULL>
> cod5.org.com 2 al...@cod5.org.com <NULL>
> hpb12.org.com 2 al...@hpb12.org.com <NULL>
> has4.org.com 2 al...@has4.org.com <NULL>
> [has2:08703] [[24953,0],0] plm:rsh: using "/grid2/sge/bin/lx-amd64/qrsh 
> -inherit -nostdin -V -verbose" for launching
> [has2:08703] [[24953,0],0] plm:rsh: final template argv:
>        /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template>  
>    set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( 
> $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) 
> setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( 
> $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 
> ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca 
> ess_base_jobid "1635319808" -mca ess_base_vpid "<template>" -mca 
> ess_base_num_procs "7" -mca orte_node_regex 
> "has[1:2,6],cod[1:4,6,5],hpb[2:12],has[1:4]@0(7)" -mca orte_hnp_uri 
> "1635319808.0;tcp://139.181.79.58:57879" --mca routed "direct" --mca 
> orte_base_help_aggregate "0" --mca plm_base_verbose "1" --mca 
> plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename 
> "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca hwloc_base_binding_policy 
> "none" -mca pmix "^s1,s2,cray,isolated"
> Starting server daemon at host "cod5"Starting server daemon at host 
> "cod6"Starting server daemon at host "has4"Starting server daemon at host "co
> d4"
> 
> 
> 
> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
> 
> Server daemon successfully started with task id "1.cod4"
> Server daemon successfully started with task id "1.cod5"
> Server daemon successfully started with task id "1.cod6"
> Server daemon successfully started with task id "1.has6"
> Server daemon successfully started with task id "1.hpb12"
> Server daemon successfully started with task id "1.has4"
> Unmatched ".
> Unmatched ".
> Unmatched ".
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no route found between
>  them. Please check network connectivity (including firewalls
>  and network routing requirements).
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ....
> ....
> ....
> 
> 
> 
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph 
> Castain via users
> Sent: Sunday, May 31, 2020 10:50 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] Running mpirun with grid
> 
> The messages about the daemons are coming from two different sources. Grid is 
> saying it was able to spawn the orted - then the orted is saying it doesn't 
> know how to communicate and fails.
> 
> I think the root of the problem lies in the plm output that shows the qrsh 
> command it will use to start the job. For some reason, mpirun is still trying 
> to "tree spawn", which (IIRC) isn't allowed on grid (all the daemons have to 
> be launched in one shot by mpirun using qrsh). Try adding "--mca 
> plm_rsh_no_tree_spawn 1" to your mpirun cmd line.
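> 
> For example, with the mpirun line from your submission script (a sketch only - 
> keep the rest of your options and your own application/args):
> 
>   mpirun --mca plm_rsh_no_tree_spawn 1 --mca routed direct \
>          --mca orte_base_help_aggregate 0 --mca plm_base_verbose 1 \
>          --bind-to none -np 7 <application with args>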
> 
> 
>>> 
>>> 
>>> On Sat, 30 May 2020 at 00:41, Kulshrestha, Vipul via users 
>>> <users@lists.open-mpi.org> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> 
>>>> I need to launch my openmpi application on grid. My application is 
>>>> designed to run N processes, where each process would have M threads. I am 
>>>> using open MPI version 4.0.1
>>>> 
>>>> 
>>>> 
>>>> % /build/openmpi/openmpi-4.0.1/rhel6/bin/ompi_info | grep grid
>>>> 
>>>>                MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component 
>>>> v4.0.1)
>>>> 
>>>> 
>>>> 
>>>> To run it without grid, I run it as (say N = 7, M = 2)
>>>> 
>>>> % mpirun -np 7 <application with arguments>
>>>> 
>>>> 
>>>> 
>>>> The above works well and runs N processes. Based on some earlier advice on 
>>>> this forum, I have set up the grid submission using a grid job submission 
>>>> script that modifies the grid slot allocation, so that mpirun launches only 
>>>> 1 copy of the application process on each host allocated by grid. I have had 
>>>> some partial success. I think grid is able to start the job and then mpirun 
>>>> also starts to run, but then it errors out with the errors pasted below. 
>>>> Strangely, after reporting that all the daemons were started, it reports 
>>>> that it was not able to start one or more daemons.
>>>> 
>>>> 
>>>> 
>>>> I have set up a grid submission script that modifies the pe_hostfile, and it 
>>>> appears that mpirun is able to take it and then use the host information to 
>>>> start launching the jobs. However, mpirun halts before it can start all the 
>>>> child processes. I enabled some debug logs but am not able to figure out a 
>>>> possible cause.
>>>> 
>>>> 
>>>> 
>>>> Could somebody look at this and advise how to resolve this issue?
>>>> 
>>>> 
>>>> 
>>>> I have pasted the detailed log as well as my job submission script below.
>>>> 
>>>> 
>>>> 
>>>> As a clarification, when I run mpirun without grid, it (mpirun and my 
>>>> application) works on the same set of hosts without any problems.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Vipul
>>>> 
>>>> 
>>>> 
>>>> Job submission script:
>>>> 
>>>> #!/bin/sh
>>>> 
>>>> #$ -N velsyn
>>>> 
>>>> #$ -pe orte2 14
>>>> 
>>>> #$ -V -cwd -j y
>>>> 
>>>> #$ -o out.txt
>>>> 
>>>> #
>>>> 
>>>> echo "Got $NSLOTS slots."
>>>> 
>>>> echo "tmpdir is $TMPDIR"
>>>> 
>>>> echo "pe_hostfile is $PE_HOSTFILE"
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> cat $PE_HOSTFILE
>>>> 
>>>> newhostfile=/testdir/tmp/pe_hostfile
>>>> 
>>>> 
>>>> 
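>>>> # Halve each host's slot count (e.g. "bos2.wv.org.com 2 ..." becomes
>>>> # "bos2.wv.org.com 1 ...") so that mpirun places one process per host.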
>>>> awk '{$2 = $2/2; print}' $PE_HOSTFILE > $newhostfile
>>>> 
>>>> 
>>>> 
>>>> export PE_HOSTFILE=$newhostfile
>>>> 
>>>> export LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib
>>>> 
>>>> 
>>>> 
>>>> mpirun --merge-stderr-to-stdout --output-filename ./output:nojobid,nocopy 
>>>> --mca routed direct --mca orte_base_help_aggregate 0 --mca 
>>>> plm_base_verbose 1 --bind-to none --report-bindings -np 7 <application 
>>>> with args>
>>>> 
>>>> 
>>>> 
>>>> The out.txt content is:
>>>> 
>>>> Got 14 slots.
>>>> 
>>>> tmpdir is /tmp/182117160.1.all.q
>>>> 
>>>> pe_hostfile is /var/spool/sge/bos2/active_jobs/182117160.1/pe_hostfile
>>>> 
>>>> bos2.wv.org.com 2 al...@bos2.wv.org.com <NULL>
>>>> art8.wv.org.com 2 al...@art8.wv.org.com <NULL>
>>>> art10.wv.org.com 2 al...@art10.wv.org.com <NULL>
>>>> hpb7.wv.org.com 2 al...@hpb7.wv.org.com <NULL>
>>>> bos15.wv.org.com 2 al...@bos15.wv.org.com <NULL>
>>>> bos1.wv.org.com 2 al...@bos1.wv.org.com <NULL>
>>>> hpb11.wv.org.com 2 al...@hpb11.wv.org.com <NULL>
>>>> [bos2:22657] [[8251,0],0] plm:rsh: using "/wv/grid2/sge/bin/lx-amd64/qrsh 
>>>> -inherit -nostdin -V -verbose" for launching
>>>> [bos2:22657] [[8251,0],0] plm:rsh: final template argv:
>>>> 
>>>> /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> 
>>>> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( 
>>>> $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) 
>>>> setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
>>>> $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH 
>>>> /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( 
>>>> $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 
>>>> ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
>>>> $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
>>>> /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; 
>>>> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca orte_report_bindings "1" 
>>>> -mca ess "env" -mca ess_base_jobid "540737536" -mca ess_base_vpid 
>>>> "<template>" -mca ess_base_num_procs "7" -mca orte_node_regex 
>>>> "bos[1:2],art[1:8],art[2:10],hpb[1:7],bos[2:15],bos[1:1],hpb[2:11]@0(7)" 
>>>> -mca orte_hnp_uri "540737536.0;tcp://147.34.116.60:50769" --mca routed 
>>>> "direct" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "1" 
>>>> -mca plm "rsh" --tree-spawn -mca orte_parent_uri 
>>>> "540737536.0;tcp://147.34.116.60:50769" -mca orte_output_filename 
>>>> "./output:nojobid,nocopy" -mca hwloc_base_binding_policy "none" -mca 
>>>> hwloc_base_report_bindings "1" -mca pmix "^s1,s2,cray,isolated"
>>>> 
>>>> Starting server daemon at host "art10"
>>>> 
>>>> Starting server daemon at host "art8"
>>>> 
>>>> Starting server daemon at host "bos1"
>>>> 
>>>> Starting server daemon at host "hpb7"
>>>> 
>>>> Starting server daemon at host "hpb11"
>>>> 
>>>> Starting server daemon at host "bos15"
>>>> 
>>>> Server daemon successfully started with task id "1.art8"
>>>> 
>>>> Server daemon successfully started with task id "1.bos1"
>>>> 
>>>> Server daemon successfully started with task id "1.art10"
>>>> 
>>>> Server daemon successfully started with task id "1.bos15"
>>>> 
>>>> Server daemon successfully started with task id "1.hpb7"
>>>> 
>>>> Server daemon successfully started with task id "1.hpb11"
>>>> 
>>>> Unmatched ".
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ORTE was unable to reliably start one or more daemons.
>>>> 
>>>> This usually is caused by:
>>>> 
>>>> 
>>>> 
>>>> * not finding the required libraries and/or binaries on
>>>> 
>>>> one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>> 
>>>> settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>> 
>>>> 
>>>> 
>>>> * lack of authority to execute on one or more specified nodes.
>>>> 
>>>> Please verify your allocation and authorities.
>>>> 
>>>> 
>>>> 
>>>> * the inability to write startup files into /tmp 
>>>> (--tmpdir/orte_tmpdir_base).
>>>> 
>>>> Please check with your sys admin to determine the correct location to use.
>>>> 
>>>> 
>>>> 
>>>> *  compilation of the orted with dynamic libraries when static are required
>>>> 
>>>> (e.g., on Cray). Please check your configure cmd line and consider using
>>>> 
>>>> one of the contrib/platform definitions for your system type.
>>>> 
>>>> 
>>>> 
>>>> * an inability to create a connection back to mpirun due to a
>>>> 
>>>> lack of common network interfaces and/or no route found between
>>>> 
>>>> them. Please check network connectivity (including firewalls
>>>> 
>>>> and network routing requirements).
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ORTE does not know how to route a message to the specified daemon located 
>>>> on the indicated node:
>>>> 
>>>> 
>>>> 
>>>> my node:   bos2
>>>> 
>>>> target node:  art10
>>>> 
>>>> 
>>>> 
>>>> This is usually an internal programming error that should be reported to 
>>>> the developers. In the meantime, a workaround may be to set the MCA param 
>>>> routed=direct on the command line or in your environment. We apologize for 
>>>> the problem.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ORTE does not know how to route a message to the specified daemon located 
>>>> on the indicated node:
>>>> 
>>>> 
>>>> 
>>>> my node:   bos2
>>>> 
>>>> target node:  hpb7
>>>> 
>>>> 
>>>> 
>>>> This is usually an internal programming error that should be reported to 
>>>> the developers. In the meantime, a workaround may be to set the MCA param 
>>>> routed=direct on the command line or in your environment. We apologize for 
>>>> the problem.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ORTE does not know how to route a message to the specified daemon located 
>>>> on the indicated node:
>>>> 
>>>> 
>>>> 
>>>> my node:   bos2
>>>> 
>>>> target node:  bos15
>>>> 
>>>> 
>>>> 
>>>> This is usually an internal programming error that should be reported to 
>>>> the developers. In the meantime, a workaround may be to set the MCA param 
>>>> routed=direct on the command line or in your environment. We apologize for 
>>>> the problem.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ORTE does not know how to route a message to the specified daemon located 
>>>> on the indicated node:
>>>> 
>>>> 
>>>> 
>>>> my node:   bos2
>>>> 
>>>> target node:  bos1
>>>> 
>>>> 
>>>> 
>>>> This is usually an internal programming error that should be reported to 
>>>> the developers. In the meantime, a workaround may be to set the MCA param 
>>>> routed=direct on the command line or in your environment. We apologize for 
>>>> the problem.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> --------------------------------------------------------------------------
>>>> 
>>>> ORTE does not know how to route a message to the specified daemon located 
>>>> on the indicated node:
>>>> 
>>>> 
>>>> 
>>>> my node:   bos2
>>>> 
>>>> target node:  hpb11
>>>> 
>>>> 
>>>> 
>>>> This is usually an internal programming error that should be reported to 
>>>> the developers. In the meantime, a workaround may be to set the MCA param 
>>>> routed=direct on the command line or in your environment. We apologize for 
>>>> the problem.
>>>> 
>>>> --------------------------------------------------------------------------
> 
> 

