Afraid I don't have much to offer. I suspect the problem is here:

> Unmatched ".
> Unmatched ".
> Unmatched ".

Something may be eating a quote, or mistakenly adding one, to the cmd line. You might try upping the verbosity: --mca plm_base_verbose 5
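For reference, "Unmatched "." is csh's complaint about an unbalanced double quote, so the failure mode is easy to reproduce locally (tcsh words the error slightly differently):

    % csh -c 'echo "hello'
    Unmatched ".

Since qrsh hands that long csh command string to the remote shell, a single quote character lost or added anywhere along the way would kill the orted launch with exactly this error.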
> On May 31, 2020, at 2:49 PM, Kulshrestha, Vipul <vipul_kulshres...@mentor.com> wrote:
>
> Hi Ralph,
>
> Thanks for your response.
>
> I added the option "--mca plm_rsh_no_tree_spawn 1" to the mpirun command line, but I get a similar error (pasted below).
>
> Regards,
> Vipul
>
> Got 14 slots.
> tmpdir is /tmp/194954128.1.all.q
> pe_hostfile is /var/spool/sge/has2/active_jobs/194954128.1/pe_hostfile
> has2.org.com 2 al...@has2.org.com <NULL>
> has6.org.com 2 al...@has6.org.com <NULL>
> cod4.org.com 2 al...@cod4.org.com <NULL>
> cod6.org.com 2 al...@cod6.org.com <NULL>
> cod5.org.com 2 al...@cod5.org.com <NULL>
> hpb12.org.com 2 al...@hpb12.org.com <NULL>
> has4.org.com 2 al...@has4.org.com <NULL>
> [has2:08703] [[24953,0],0] plm:rsh: using "/grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
> [has2:08703] [[24953,0],0] plm:rsh: final template argv:
>     /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca ess_base_jobid "1635319808" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "7" -mca orte_node_regex "has[1:2,6],cod[1:4,6,5],hpb[2:12],has[1:4]@0(7)" -mca orte_hnp_uri "1635319808.0;tcp://139.181.79.58:57879" --mca routed "direct" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "1" --mca plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca hwloc_base_binding_policy "none" -mca pmix "^s1,s2,cray,isolated"
> Starting server daemon at host "cod5"
> Starting server daemon at host "cod6"
> Starting server daemon at host "has4"
> Starting server daemon at host "cod4"
> Starting server daemon at host "hpb12"
> Starting server daemon at host "has6"
> Server daemon successfully started with task id "1.cod4"
> Server daemon successfully started with task id "1.cod5"
> Server daemon successfully started with task id "1.cod6"
> Server daemon successfully started with task id "1.has6"
> Server daemon successfully started with task id "1.hpb12"
> Server daemon successfully started with task id "1.has4"
> Unmatched ".
> Unmatched ".
> Unmatched ".
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ....
> ....
> ....
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph Castain via users
> Sent: Sunday, May 31, 2020 10:50 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] Running mpirun with grid
>
> The messages about the daemons are coming from two different sources. Grid is saying it was able to spawn the orted - then the orted is saying it doesn't know how to communicate and fails.
>
> I think the root of the problem lies in the plm output that shows the qrsh it will use to start the job. For some reason, mpirun is still trying to "tree spawn", which (IIRC) isn't allowed on grid (all the daemons have to be launched in one shot by mpirun using qrsh). Try adding "--mca plm_rsh_no_tree_spawn 1" to your mpirun cmd line.
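For reference, a minimal sketch of where that flag goes (the application name and argument list are placeholders; the other options are those already suggested in this thread):

    mpirun --mca plm_rsh_no_tree_spawn 1 --mca plm_base_verbose 5 \
           --mca routed direct -np 7 <application with args>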
>>> On Sat, 30 May 2020 at 00:41, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I need to launch my openmpi application on grid. My application is designed to run N processes, where each process would have M threads. I am using Open MPI version 4.0.1.
>>>>
>>>> % /build/openmpi/openmpi-4.0.1/rhel6/bin/ompi_info | grep grid
>>>>     MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v4.0.1)
>>>>
>>>> To run it without grid, I run it as (say N = 7, M = 2):
>>>>
>>>> % mpirun -np 7 <application with arguments>
>>>>
>>>> The above works well and runs N processes. Based on some earlier advice on this forum, I have set up the grid submission using a grid job submission script that modifies the grid slot allocation, so that mpirun launches only 1 application process copy on each host allocated by grid. I have had some partial success. I think grid is able to start the job and then mpirun also starts to run, but then it errors out with the errors below. Strangely, after giving the message for having started all the daemons, it reports that it was not able to start one or more daemons.
>>>>
>>>> I have set up a grid submission script that modifies the pe_hostfile, and it appears that mpirun is able to take it and then use the host information to start launching the jobs. However, mpirun halts before it can start all the child processes. I enabled some debug logs but am not able to figure out a possible cause.
>>>>
>>>> Could somebody look at this and guide how to resolve this issue?
>>>>
>>>> I have pasted the detailed log as well as my job submission script below.
>>>>
>>>> As a clarification, when I run mpirun without grid, it (mpirun and my application) works on the same set of hosts without any problems.
>>>>
>>>> Thanks,
>>>> Vipul
>>>>
>>>> Job submission script:
>>>>
>>>> #!/bin/sh
>>>> #$ -N velsyn
>>>> #$ -pe orte2 14
>>>> #$ -V -cwd -j y
>>>> #$ -o out.txt
>>>> #
>>>> echo "Got $NSLOTS slots."
>>>> echo "tmpdir is $TMPDIR"
>>>> echo "pe_hostfile is $PE_HOSTFILE"
>>>>
>>>> cat $PE_HOSTFILE
>>>> newhostfile=/testdir/tmp/pe_hostfile
>>>>
>>>> awk '{$2 = $2/2; print}' $PE_HOSTFILE > $newhostfile
>>>>
>>>> export PE_HOSTFILE=$newhostfile
>>>> export LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib
>>>>
>>>> mpirun --merge-stderr-to-stdout --output-filename ./output:nojobid,nocopy --mca routed direct --mca orte_base_help_aggregate 0 --mca plm_base_verbose 1 --bind-to none --report-bindings -np 7 <application with args>
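For reference, the awk line halves the slot count in the second column of each pe_hostfile entry, so mpirun starts one rank per host instead of two. Assuming the stock SGE pe_hostfile format shown in this thread, a line such as

    bos2.wv.org.com 2 al...@bos2.wv.org.com <NULL>

is rewritten to

    bos2.wv.org.com 1 al...@bos2.wv.org.com <NULL>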
>>>>
>>>> The out.txt content is:
>>>>
>>>> Got 14 slots.
>>>> tmpdir is /tmp/182117160.1.all.q
>>>> pe_hostfile is /var/spool/sge/bos2/active_jobs/182117160.1/pe_hostfile
>>>> bos2.wv.org.com 2 al...@bos2.wv.org.com <NULL>
>>>> art8.wv.org.com 2 al...@art8.wv.org.com <NULL>
>>>> art10.wv.org.com 2 al...@art10.wv.org.com <NULL>
>>>> hpb7.wv.org.com 2 al...@hpb7.wv.org.com <NULL>
>>>> bos15.wv.org.com 2 al...@bos15.wv.org.com <NULL>
>>>> bos1.wv.org.com 2 al...@bos1.wv.org.com <NULL>
>>>> hpb11.wv.org.com 2 al...@hpb11.wv.org.com <NULL>
>>>> [bos2:22657] [[8251,0],0] plm:rsh: using "/wv/grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
>>>> [bos2:22657] [[8251,0],0] plm:rsh: final template argv:
>>>>     /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ; /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca orte_report_bindings "1" -mca ess "env" -mca ess_base_jobid "540737536" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "7" -mca orte_node_regex "bos[1:2],art[1:8],art[2:10],hpb[1:7],bos[2:15],bos[1:1],hpb[2:11]@0(7)" -mca orte_hnp_uri "540737536.0;tcp://147.34.116.60:50769" --mca routed "direct" --mca orte_base_help_aggregate "0" --mca plm_base_verbose "1" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "540737536.0;tcp://147.34.116.60:50769" -mca orte_output_filename "./output:nojobid,nocopy" -mca hwloc_base_binding_policy "none" -mca hwloc_base_report_bindings "1" -mca pmix "^s1,s2,cray,isolated"
>>>> Starting server daemon at host "art10"
>>>> Starting server daemon at host "art8"
>>>> Starting server daemon at host "bos1"
>>>> Starting server daemon at host "hpb7"
>>>> Starting server daemon at host "hpb11"
>>>> Starting server daemon at host "bos15"
>>>> Server daemon successfully started with task id "1.art8"
>>>> Server daemon successfully started with task id "1.bos1"
>>>> Server daemon successfully started with task id "1.art10"
>>>> Server daemon successfully started with task id "1.bos15"
>>>> Server daemon successfully started with task id "1.hpb7"
>>>> Server daemon successfully started with task id "1.hpb11"
>>>> Unmatched ".
>>>> --------------------------------------------------------------------------
>>>> ORTE was unable to reliably start one or more daemons.
>>>> This usually is caused by:
>>>>
>>>> * not finding the required libraries and/or binaries on
>>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>
>>>> * lack of authority to execute on one or more specified nodes.
>>>>   Please verify your allocation and authorities.
>>>>
>>>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>   Please check with your sys admin to determine the correct location to use.
>>>>
>>>> * compilation of the orted with dynamic libraries when static are required
>>>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>>>   one of the contrib/platform definitions for your system type.
>>>>
>>>> * an inability to create a connection back to mpirun due to a
>>>>   lack of common network interfaces and/or no route found between
>>>>   them. Please check network connectivity (including firewalls
>>>>   and network routing requirements).
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> ORTE does not know how to route a message to the specified daemon located on the indicated node:
>>>>
>>>>   my node:     bos2
>>>>   target node: art10
>>>>
>>>> This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> ORTE does not know how to route a message to the specified daemon located on the indicated node:
>>>>
>>>>   my node:     bos2
>>>>   target node: hpb7
>>>>
>>>> This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> ORTE does not know how to route a message to the specified daemon located on the indicated node:
>>>>
>>>>   my node:     bos2
>>>>   target node: bos15
>>>>
>>>> This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> ORTE does not know how to route a message to the specified daemon located on the indicated node:
>>>>
>>>>   my node:     bos2
>>>>   target node: bos1
>>>>
>>>> This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> ORTE does not know how to route a message to the specified daemon located on the indicated node:
>>>>
>>>>   my node:     bos2
>>>>   target node: hpb11
>>>>
>>>> This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
>>>> --------------------------------------------------------------------------